
Conversation

@tarun11Mavani (Contributor) commented Jan 12, 2026

Added checks to ensure all segments are loaded before removing deleted primary keys from memory during TTL cleanup. This prevents out-of-order segment loading during server restart from causing deleted records to incorrectly reappear in query results.

Issue scenario (step-by-step):

  • Server restart begins, segments load in arbitrary order
  • Segment S2 loads first (contains delete record R2 for PK1)
  • Pinot processes R2 and marks the PK as deleted (not queryable)
  • Consumption starts because S2 is loaded, even though S1 hasn't loaded yet
  • During the consumption loop, doRemoveExpiredPrimaryKeys() runs and removes the PK from memory (it is outside the TTL window and present in only one segment, S2). The PK is therefore deleted even though enableDeletedKeysCompactionConsistency is set to true.
  • Segment S1 loads later (contains original insert record R1 for same PK)
  • Pinot processes R1, but the PK is no longer in memory. R1 is incorrectly treated as a new record and becomes queryable
  • Previously deleted record now reappears in query results on this server

Fix: Defer PK removal until TableStateUtils.isAllSegmentsLoaded() returns true, ensuring all historical data is loaded before cleanup. Introduces the DELETED_KEYS_SKIPPED_SEGMENT_LOADING metric for observability.
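
In code terms, the guard looks roughly like the following (a minimal sketch based on the description above; the exact method body, metric call, and placement in the merged change may differ):

protected void doRemoveExpiredPrimaryKeys() {
  if (!areAllSegmentsLoaded()) {
    // Segments may still be loading after a restart; removing PKs now could
    // let a later-loading segment resurrect a deleted record. Skip this round
    // and emit a metric so skipped cleanups are observable.
    _serverMetrics.addMeteredTableValue(_tableNameWithType,
        ServerMeter.DELETED_KEYS_SKIPPED_SEGMENT_LOADING, 1L);
    return;
  }
  // ... existing TTL-based removal of deleted primary keys ...
}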

@codecov-commenter commented Jan 12, 2026

Codecov Report

❌ Patch coverage is 80.00000% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.28%. Comparing base (41898e0) to head (4be2257).
⚠️ Report is 4 commits behind head on master.

Files with missing lines                               Patch %   Lines
...cal/upsert/BasePartitionUpsertMetadataManager.java  65.00%    3 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17488      +/-   ##
============================================
+ Coverage     63.27%   63.28%   +0.01%     
+ Complexity     1477     1476       -1     
============================================
  Files          3167     3167              
  Lines        189062   189097      +35     
  Branches      28930    28939       +9     
============================================
+ Hits         119624   119669      +45     
+ Misses        60153    60150       -3     
+ Partials       9285     9278       -7     
Flag                  Coverage Δ
custom-integration1   100.00% <ø> (ø)
integration           100.00% <ø> (ø)
integration1          100.00% <ø> (ø)
integration2          0.00% <ø> (ø)
java-11               63.26% <80.00%> (+0.02%) ⬆️
java-21               63.21% <80.00%> (+<0.01%) ⬆️
temurin               63.28% <80.00%> (+0.01%) ⬆️
unittests             63.28% <80.00%> (+0.01%) ⬆️
unittests1            55.56% <8.57%> (+<0.01%) ⬆️
unittests2            34.06% <80.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

@tarun11Mavani (Contributor, Author) commented:
@xiangfu0 @deemoliu can you please take a look?

@tarun11Mavani tarun11Mavani changed the title Fix deleted keys workflow to prevent stale record resurrection during restarts [Draft] [Do not merge] Fix deleted keys workflow to prevent stale record resurrection during restarts Jan 14, 2026
@tarun11Mavani tarun11Mavani marked this pull request as draft January 14, 2026 04:59
Copilot AI left a comment

Pull request overview

This PR fixes a critical race condition in the deleted keys workflow during server restarts. The issue occurs when segments load in arbitrary order, potentially causing deleted records to incorrectly reappear in query results. The fix defers primary key removal until all segments are loaded.

Changes:

  • Added segment loading state check before removing expired primary keys
  • Introduced a new metric DELETED_KEYS_SKIPPED_SEGMENT_LOADING for observability
  • Added comprehensive tests to verify the fix works correctly during server restarts

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Summary per file:

  • BasePartitionUpsertMetadataManager.java: Added areAllSegmentsLoaded() method with caching to check whether all segments are loaded before allowing PK removal
  • ConcurrentMapPartitionUpsertMetadataManager.java: Modified doRemoveExpiredPrimaryKeys() to skip deletion while segments are still loading
  • ConcurrentMapPartitionUpsertMetadataManagerForConsistentDeletes.java: Modified doRemoveExpiredPrimaryKeys() to skip deletion while segments are still loading
  • ServerMeter.java: Added the new metric DELETED_KEYS_SKIPPED_SEGMENT_LOADING to track skipped deletions
  • ConcurrentMapPartitionUpsertMetadataManagerTest.java: Added a test case verifying deleted keys are not removed while segments are still loading
  • ConcurrentMapPartitionUpsertMetadataManagerForConsistentDeletesTest.java: Added a test case verifying deleted keys are not removed while segments are still loading
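
For reference, the new meter entry in ServerMeter.java would look something like this (a hypothetical sketch following the existing NAME("unit", isGlobal) pattern in that enum; the actual unit string may differ):

DELETED_KEYS_SKIPPED_SEGMENT_LOADING("deletedKeysSkippedSegmentLoading", false),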

Comment on lines +322 to +348
protected boolean areAllSegmentsLoaded() {
  if (_allSegmentsLoaded) {
    return true;
  }
  synchronized (this) {
    if (_allSegmentsLoaded) {
      return true;
    }
    long currentTimeMs = System.currentTimeMillis();
    if (currentTimeMs - _lastSegmentsLoadedCheckTimeMs <= SEGMENTS_LOADED_CHECK_INTERVAL_MS) {
      return false;
    }
    _lastSegmentsLoadedCheckTimeMs = currentTimeMs;
    TableDataManager tableDataManager = _context.getTableDataManager();
    if (tableDataManager != null) {
      HelixManager helixManager = tableDataManager.getHelixManager();
      if (helixManager != null) {
        _allSegmentsLoaded = TableStateUtils.isAllSegmentsLoaded(helixManager, _tableNameWithType);
        if (_allSegmentsLoaded) {
          _logger.info("All segments are now loaded for table: {}, partition: {}", _tableNameWithType, _partitionId);
        }
        return _allSegmentsLoaded;
      }
    }
    return false;
  }
}
Copilot AI commented Jan 14, 2026

The double-checked locking pattern is incomplete. The _allSegmentsLoaded field is declared as volatile (line 142), but _lastSegmentsLoadedCheckTimeMs is not volatile. This could lead to visibility issues where one thread updates _lastSegmentsLoadedCheckTimeMs inside the synchronized block but another thread reads a stale value before entering the block. Either make _lastSegmentsLoadedCheckTimeMs volatile or always access it within the synchronized block.
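
If the volatile route is taken, the sketch is just the field declarations (note that in the snippet above the timestamp is in fact only read and written inside the synchronized block, so this mainly protects any future fast-path access):

private volatile boolean _allSegmentsLoaded;
private volatile long _lastSegmentsLoadedCheckTimeMs;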

Comment on lines +2367 to +2371
upsertMetadataManager.addRecord(segment2, new RecordInfo(makePrimaryKey(0), 0, new Integer(120), true));

// Add another record with a higher timestamp to update largestSeenComparisonValue
// This makes timestamp 120 fall outside the TTL window (150 - 20 = 130 > 120)
upsertMetadataManager.addRecord(segment2, new RecordInfo(makePrimaryKey(1), 1, new Integer(150), false));
Copilot AI commented Jan 14, 2026

Avoid using the new Integer(120) constructor, which is deprecated. Use Integer.valueOf(120) or simply 120 (autoboxing) instead. The Integer constructor was deprecated in Java 9 and should be replaced with valueOf or autoboxing for better performance and memory usage.

Suggested change
-  upsertMetadataManager.addRecord(segment2, new RecordInfo(makePrimaryKey(0), 0, new Integer(120), true));
+  upsertMetadataManager.addRecord(segment2, new RecordInfo(makePrimaryKey(0), 0, 120, true));
   // Add another record with a higher timestamp to update largestSeenComparisonValue
   // This makes timestamp 120 fall outside the TTL window (150 - 20 = 130 > 120)
-  upsertMetadataManager.addRecord(segment2, new RecordInfo(makePrimaryKey(1), 1, new Integer(150), false));
+  upsertMetadataManager.addRecord(segment2, new RecordInfo(makePrimaryKey(1), 1, 150, false));

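As background for both deprecation comments (a standalone illustration, not part of the review thread): Integer.valueOf, which autoboxing compiles to, reuses cached instances for values in [-128, 127], while the constructor always allocates a fresh object:

Integer a = Integer.valueOf(120);
Integer b = Integer.valueOf(120);
System.out.println(a == b);                                // true: cached instance reused
System.out.println(new Integer(120) == new Integer(120));  // false: two distinct allocations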
Comment on lines +2367 to +2371 (same lines as the previous comment)

Copilot AI commented Jan 14, 2026

Likewise, avoid using the deprecated new Integer(150) constructor; use Integer.valueOf(150) or simply 150 (autoboxing) instead. The suggested change is identical to the one above.