-
Notifications
You must be signed in to change notification settings - Fork 114
[controller] Fix user-initiated pushes blocked by TTL repushes #2391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[controller] Fix user-initiated pushes blocked by TTL repushes #2391
Conversation
User-initiated batch pushes were incorrectly blocked when a compliance repush (TTL repush) was ongoing. The issue was that TTL repushes use the "venice_ttl_re_push_" prefix, but the code only checked for the "venice_re_push_" prefix when determining if an existing push is a repush. Updated VeniceParentHelixAdmin.incrementVersionIdempotent() to check both isPushIdRePush() and isPushIdTTLRePush() when identifying repushes, allowing user pushes to correctly kill TTL repushes and proceed. Added unit tests to verify the fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a bug where user-initiated batch pushes were incorrectly blocked by ongoing TTL (compliance) repushes. The root cause was that the repush detection logic only checked for regular repushes with the "venice_re_push_" prefix, missing TTL repushes that use the "venice_ttl_re_push_" prefix.
Changes:
- Updated repush detection logic in
incrementVersionIdempotent()to check for both regular and TTL repushes - Added comprehensive unit tests to verify TTL repush detection and all repush type identification
- Added clarifying comments explaining that user-initiated pushes should be able to kill compliance repushes
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| VeniceParentHelixAdmin.java | Fixed repush detection logic on lines 1813-1815 to include TTL repushes; added explanatory comments on lines 1846-1847 |
| TestVeniceParentHelixAdmin.java | Added two comprehensive test methods (testTTLRepushDetection and testRepushTypeIdentification) to verify the fix |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Thank you for the feedback - you were right to question this! After reviewing the actual DEPEND-92712 incident details, I now understand the issue: Actual Incident AnalysisThe blocking push job ID was: This is a regular push with no prefix (not The Real ProblemThe current logic at line 1841 only kills existing pushes if they are repushes: } else if (isExistingPushJobARepush && !pushType.isIncremental() && !isIncomingPushJobARepush) {
killOfflinePush(clusterName, currentPushTopic.get(), true);
}Since the compliance push had no repush prefix, it wasn't killed, and the user push was blocked. My Fix is IncorrectMy current fix adds TTL repushes to repush detection, but this doesn't solve the actual problem since the blocking push was a regular push, not a repush. What Should the Fix Be?The requirement from DEPEND-92712 is: "user-initiated pushes should be able to kill compliance/automated pushes" However, there's no way to distinguish user-initiated vs compliance pushes just from the push job ID. Questions:
I'll hold off on updating this PR until we clarify the correct approach. Sorry for the confusion! |
| boolean isExistingPushJobARepush = Version.isPushIdRePush(existingPushJobId); | ||
| boolean isIncomingPushJobARepush = Version.isPushIdRePush(pushJobId); | ||
| boolean isExistingPushJobARepush = | ||
| Version.isPushIdRePush(existingPushJobId) || Version.isPushIdTTLRePush(existingPushJobId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we bundle this into a utility like Version.isSystemRepush() or something of that nature? That seems to be the intent and would mean only updating one place later should/when we add new variants like this.
|
|
||
| // Kill the existing job if incoming push type is not an inc push and also not a repush job. | ||
| // This includes both regular repushes and TTL repushes (compliance repushes). | ||
| // User-initiated pushes should be able to kill compliance repushes (DEPEND-92712). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we currently do comments with internal LinkedIn jira? I guess it's not bad, but it's a little non-sequitor. I might prefer a github issue or something like that, but I understand that github issues never really made it into our process.
What changes and why
User-initiated batch pushes were incorrectly blocked when a compliance repush (TTL repush) was ongoing, even though user pushes should be able to kill repushes and proceed. This issue was reported in DEPEND-92712.
Root Cause:
The bug was in
VeniceParentHelixAdmin.incrementVersionIdempotent()at lines 1813-1815. The code only checkedVersion.isPushIdRePush()which looks for the"venice_re_push_"prefix. However, TTL repushes (compliance repushes) use the"venice_ttl_re_push_"prefix, so they were not being detected as repushes. This caused the logic to fall through to throwing aConcurrentBatchPushException, blocking user-initiated pushes.How the fix achieves the goal
Updated the repush detection logic to check for both regular repushes and TTL repushes:
When
isExistingPushJobARepushis true and the incoming push is a user push (not a repush), the code at line 1841 correctly kills the existing repush and allows the user push to proceed.Testing done
Unit Tests Added:
testTTLRepushDetection()- Verifies that TTL repush IDs are correctly identified as repushes using the new logictestRepushTypeIdentification()- Validates all repush type detection including the bug scenarioTest Results:
Pre-commit Validation:
Files Changed
services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java- Bug fixservices/venice-controller/src/test/java/com/linkedin/venice/controller/TestVeniceParentHelixAdmin.java- Unit testsChecklist