Reduce log volume in evacuation check methods #107

kabragaurav · 2026-01-14T14:20:18Z

Problem

The EKG log volume check is failing for helix.ambry and helix.espresso2 during canary deployments in the EI cluster:

"Total log volume should not change by more than 100% when experiment host produces log volume more than 1 MB/min"

Observation: The rule is flaky (mostly failing, sometimes passing). It started failing with version 11.3.0.31 (Dec 12).

Jira: CICP-3460

Root Cause Analysis

Analysis of the logging changes from the past ~45 days identified the culprit: commit 99b5c158e, which introduced 6 new INFO logs in ZKHelixAdmin.java's instanceHasCurrentStateOrMessage() method.

The problem: This method is called on every REST API poll to isEvacuateFinished(), isInstanceDrained(), and isReadyForPreparingJoiningCluster(). The partition-counting log was firing on every successful call, even when evacuation was complete.

Why only Ambry & Espresso2 in EI?

- Storage systems with frequent node maintenance (evacuations/swaps)
- Many instances being monitored simultaneously
- High polling frequency from ACM automation

Solution

Made the partition-counting logs conditional, so they fire only when partitions are actually blocking evacuation:

if (hasRemainingPartitions) {
    logger.info("Instance {} in cluster {} has {} partitions after applying {} exclusions...");
}
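To illustrate the effect on log volume, here is a minimal, dependency-free sketch of the same guard. It is not the actual ZKHelixAdmin code: the class name EvacuationLogSketch, the method signature, the in-memory `emitted` list standing in for logger.info, and the reading of the two counts in the message are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names and signatures are illustrative, not from ZKHelixAdmin.
class EvacuationLogSketch {
    // Stand-in for logger.info(...): records each emitted line so volume can be counted.
    static final List<String> emitted = new ArrayList<>();

    // Returns true when partitions outside the exclusion set remain on the instance.
    static boolean hasRemainingPartitions(String instance, String cluster,
                                          Set<String> partitions, Set<String> exclusions) {
        long remaining = partitions.stream()
            .filter(p -> !exclusions.contains(p))
            .count();
        boolean hasRemainingPartitions = remaining > 0;
        // Log only when something is actually blocking evacuation.
        if (hasRemainingPartitions) {
            emitted.add(String.format(
                "Instance %s in cluster %s has %d partitions after applying %d exclusions",
                instance, cluster, remaining, exclusions.size()));
        }
        return hasRemainingPartitions;
    }

    public static void main(String[] args) {
        // 100 polls against a fully evacuated instance: nothing is logged.
        for (int i = 0; i < 100; i++) {
            hasRemainingPartitions("host-1", "demo", Set.of(), Set.of());
        }
        System.out.println(emitted.size()); // prints 0

        // One poll while a partition still blocks evacuation: one line is logged.
        hasRemainingPartitions("host-1", "demo", Set.of("p0", "p1"), Set.of("p1"));
        System.out.println(emitted.size()); // prints 1
    }
}
```

With the guard in place, repeated polls of an already evacuated instance add no log lines, while a poll that finds blocking partitions still produces the actionable message.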

- Remove INFO logs that fire on every poll when evacuation is complete
- Make remaining logs conditional - only log when partitions are blocking
- Keep actionable logs (session carry-over, pending messages)

Root Cause: ACM polls isEvacuateFinished repeatedly, and new INFO logs from commit 99b5c15 were generating excessive volume for storage systems (Ambry, Espresso) with frequent evacuation operations.

kabragaurav added 2 commits January 14, 2026 19:34

Merge branch 'dev' into fix/CICP-3460-reduce-evacuation-log-volume

7e12235
