@kabragaurav
Collaborator

Problem

The EKG log volume check is failing for helix.ambry and helix.espresso2 during canary deployments in the EI cluster:

"Total log volume should not change by more than 100% when experiment host produces log volume more than 1 MB/min"

Observation: The rule is flaky: it fails most of the time but occasionally passes. It started failing with version 11.3.0.31 (Dec 12).

Jira: CICP-3460

Root Cause Analysis

After analyzing the logging changes from the past ~45 days, I identified the culprit: commit 99b5c158e introduced 6 new INFO logs in the instanceHasCurrentStateOrMessage() method of ZKHelixAdmin.java.

The problem: This method is called on every REST API poll to isEvacuateFinished(), isInstanceDrained(), and isReadyForPreparingJoiningCluster(). The partition-counting log was firing on every successful call, even when evacuation was complete.

Why only Ambry & Espresso2 in EI?

  • Storage systems with frequent node maintenance (evacuations/swaps)
  • Many instances being monitored simultaneously
  • High polling frequency from ACM automation

Solution

Made the partition-counting logs conditional, so they fire only when partitions are actually blocking evacuation:

if (hasRemainingPartitions) {
    logger.info("Instance {} in cluster {} has {} partitions after applying {} exclusions...");
}
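
To illustrate the effect of the guard above, here is a minimal, self-contained sketch that counts the log lines produced by repeated polls against an already evacuated instance, with and without the condition. It is not the actual ZKHelixAdmin code: the class name, the in-memory `info` stand-in for the logger, the `logOnlyWhenBlocking` switch, the host and cluster names, and the exact message text are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ConditionalEvacuationLogSketch {
    // Stand-in for a real logger: collects emitted lines so they can be counted.
    static final List<String> emitted = new ArrayList<>();

    static void info(String line) {
        emitted.add(line);
    }

    // Hypothetical simplification of the partition check; not the actual Helix code.
    // Returns true when at least one non-excluded partition is still on the instance.
    static boolean hasRemainingPartitions(String instance, String cluster,
            Set<String> partitions, Set<String> exclusions, boolean logOnlyWhenBlocking) {
        long remaining = partitions.stream().filter(p -> !exclusions.contains(p)).count();
        boolean hasRemaining = remaining > 0;
        // Old behaviour (logOnlyWhenBlocking == false): log on every call.
        // New behaviour (logOnlyWhenBlocking == true): log only when partitions remain.
        if (hasRemaining || !logOnlyWhenBlocking) {
            info(String.format(
                    "Instance %s in cluster %s has %d partitions after applying %d exclusions",
                    instance, cluster, remaining, exclusions.size()));
        }
        return hasRemaining;
    }

    // Simulates repeated polls against an already evacuated instance and
    // returns how many log lines those polls produced.
    static int linesForPolls(int polls, boolean logOnlyWhenBlocking) {
        emitted.clear();
        for (int i = 0; i < polls; i++) {
            hasRemainingPartitions("host-1", "cluster-1", Set.of(), Set.of(), logOnlyWhenBlocking);
        }
        return emitted.size();
    }

    public static void main(String[] args) {
        System.out.println("unconditional: " + linesForPolls(100, false));
        System.out.println("conditional:   " + linesForPolls(100, true));
    }
}
```

In this sketch the unconditional variant emits one line per poll, while the guarded variant emits nothing once no partitions remain, which is the volume reduction this change is after.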

- Remove INFO logs that fire on every poll when evacuation is complete
- Make remaining logs conditional - only log when partitions are blocking
- Keep actionable logs (session carry-over, pending messages)

Root Cause: ACM polls isEvacuateFinished repeatedly, and new INFO logs
from commit 99b5c15 were generating excessive volume for storage
systems (Ambry, Espresso) with frequent evacuation operations.