Conversation

@harshach harshach commented Jan 13, 2026

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • Request latency tracking enhancement:
    • Added context propagation methods (getContext, setContext, clearContext) in RequestLatencyContext.java for virtual thread metric aggregation
    • Thread-safe timing with AtomicLong and AtomicInteger ensures accurate metrics across concurrent operations
  • Bulk operation executor refactor:
    • New BulkExecutor.java replaces BulkOperationSemaphore with clearer metric instrumentation for database and search operations
    • Wraps virtual thread executor with connection-aware throttling and comprehensive operation tracking
  • Search operation instrumentation:
    • Added metric tracking to ElasticSearchSearchManager, OpenSearchSearchManager, and aggregation managers
    • Database query logging enhanced in OMSqlLogger with request context integration
  • Bulk operation configuration:
    • BulkOperationConfiguration with auto-scaling based on connection pool size (default: 20% for bulk ops, 80% for user traffic)
    • Configuration example added to conf/openmetadata.yaml
  • Comprehensive test coverage:
    • New BulkExecutorTest (364 lines) with concurrency and throttling tests
    • OMSqlLoggerTest added for SQL logging verification


Copilot AI left a comment


Pull request overview

This PR improves slow request metric calculation and adds bulk operation configuration for fine-tuning concurrent database operations. The changes address connection pool exhaustion during bulk operations by introducing a semaphore-based throttling mechanism and improve metric accuracy across virtual threads using atomic operations.

Changes:

  • Introduced BulkOperationSemaphore to limit concurrent DB operations during bulk processing and prevent connection pool starvation
  • Modified RequestLatencyContext to use atomic operations (AtomicLong, AtomicInteger) for thread-safe metric accumulation across virtual threads
  • Added BulkOperationConfiguration with auto-scaling support based on connection pool size
  • Updated EntityRepository.bulkCreateOrUpdateEntities() to use semaphore throttling and propagate request latency context to virtual threads (a rough sketch of this flow follows the list)
  • Added comprehensive test coverage for concurrency scenarios, semaphore behavior, and metrics tracking
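
For orientation, here is a hypothetical sketch of how the throttling and context propagation described above could fit together. It is not the PR's actual code: BulkOperationSemaphore.getInstance()/acquire()/release() are assumed method names, and timeout/error handling is omitted.

```java
// Needs java.util.* and java.util.concurrent.* imports; illustrative only.
void bulkCreateOrUpdateSketch(List<Runnable> perEntityWork) {
  // Capture the parent's latency context so child timings roll up into the request metrics.
  RequestLatencyContext.RequestContext parentLatencyContext = RequestLatencyContext.getContext();
  List<CompletableFuture<Void>> futures = new ArrayList<>();
  try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
    for (Runnable work : perEntityWork) {
      futures.add(CompletableFuture.runAsync(() -> {
        RequestLatencyContext.setContext(parentLatencyContext); // propagate to the virtual thread
        try {
          BulkOperationSemaphore.getInstance().acquire(); // assumed API: blocks until a DB permit is free
          try {
            work.run(); // per-entity create/update against the database
          } finally {
            BulkOperationSemaphore.getInstance().release();
          }
        } finally {
          RequestLatencyContext.clearContext();
        }
      }, executor));
    }
    CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
  }
}
```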

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

Summary per file:

  • RequestLatencyContext.java — Changed timing counters from primitive longs/ints to AtomicLong/AtomicInteger for thread-safe metric accumulation; added context propagation methods (getContext/setContext/clearContext) for virtual threads; added reset() method for testing
  • EntityRepository.java — Integrated semaphore-based throttling for bulk operations; added request latency context propagation to virtual threads; replaced fixed thread pool with virtual thread executor; added timeout handling for overload scenarios
  • BulkOperationSemaphore.java — New singleton class managing concurrent DB operations using fair semaphore with configurable permits and timeouts; supports auto-scaling based on connection pool size
  • BulkOperationConfiguration.java — New configuration class with parameters for max concurrent operations, connection pool percentage allocation, auto-scaling flag, and acquire timeout
  • OpenMetadataApplicationConfig.java — Added bulkOperationConfiguration property with getter that returns default instance if not configured
  • OpenMetadataApplication.java — Initialized BulkOperationSemaphore during application startup using config and connection pool size
  • RequestLatencyContextTest.java — Added 480+ lines of new tests covering context propagation, bulk operations simulation, concurrent operations, stress testing, and timing accuracy
  • BulkOperationSemaphoreTest.java — New test file with 263 lines covering initialization, permit acquisition/release, concurrent access, timeouts, and auto-scaling calculations
  • BulkOperationIntegrationTest.java — New integration test file with 528 lines covering semaphore throttling, metrics tracking across virtual threads, real workload simulation, and concurrent bulk requests
Comments suppressed due to low confidence (1)

openmetadata-service/src/test/java/org/openmetadata/service/monitoring/RequestLatencyContextTest.java:37

  • The test uses reflection to clear static maps in RequestLatencyContext, but the production code now has a public reset() method (line 382-389 in RequestLatencyContext.java) that does exactly this. Replace the reflection-based clearStaticMaps() method with a simple call to RequestLatencyContext.reset() to improve maintainability and avoid fragility.
  private void clearStaticMaps() throws Exception {
    java.lang.reflect.Field requestTimersField =
        RequestLatencyContext.class.getDeclaredField("requestTimers");
    requestTimersField.setAccessible(true);
    ((java.util.concurrent.ConcurrentHashMap<?, ?>) requestTimersField.get(null)).clear();
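
If reset() does what the comment describes, the reflection helper could be replaced by a plain setup hook. A minimal sketch, assuming JUnit 5:

```java
@BeforeEach
void resetLatencyContext() {
  // Same cleanup as the reflection-based helper above, via the public reset() hook.
  RequestLatencyContext.reset();
}
```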

// Initialize bulk operation semaphore for connection-aware throttling
BulkOperationSemaphore.initialize(
    catalogConfig.getBulkOperationConfiguration(),
    catalogConfig.getDataSourceFactory().getMaxSize());

Copilot AI Jan 13, 2026


The call to getMaxSize() on DataSourceFactory may not return the intended value if the factory is a HikariCPDataSourceFactory, which uses maximumPoolSize instead. The getMaxSize() method is inherited from Dropwizard's DataSourceFactory base class, but HikariCPDataSourceFactory has its own maximumPoolSize field. Verify that getMaxSize() returns the correct value, or consider using HikariCPDataSourceFactory.getMaximumPoolSize() directly when the factory is of that type. If this is correct, consider adding a comment explaining the relationship between these properties.
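
A possible defensive lookup, assuming a HikariCPDataSourceFactory type exposing getMaximumPoolSize() as the comment suggests (hypothetical, untested against the actual classes):

```java
// Prefer Hikari's own maximumPoolSize when that factory type is configured.
int poolSize =
    catalogConfig.getDataSourceFactory() instanceof HikariCPDataSourceFactory hikari
        ? hikari.getMaximumPoolSize()
        : catalogConfig.getDataSourceFactory().getMaxSize();
BulkOperationSemaphore.initialize(catalogConfig.getBulkOperationConfiguration(), poolSize);
```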

List<CompletableFuture<Void>> futures = new ArrayList<>();

// Capture the request latency context from the parent thread for propagation to virtual threads
RequestLatencyContext.RequestContext parentLatencyContext = RequestLatencyContext.getContext();

Copilot AI Jan 13, 2026


The parent latency context is captured but there's no null check. If bulkCreateOrUpdateEntities is called outside of a request context (e.g., from a background job or during testing), parentLatencyContext will be null. While line 6891 uses setContext(parentLatencyContext) which has null protection, logging and debugging will be affected. Consider adding a null check with a debug log to indicate when bulk operations occur outside request tracking.
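
A minimal sketch of the suggested guard (the variable name is taken from the snippet above):

```java
if (parentLatencyContext == null) {
  // Bulk call originated outside a tracked request (background job, test, etc.);
  // per-entity timings will not be aggregated into a request metric.
  LOG.debug("bulkCreateOrUpdateEntities invoked without a request latency context");
}
```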

Comment on lines 69 to 70
@Max(200)
private int maxConcurrentDbOperations = 10;

Copilot AI Jan 13, 2026


The maximum validation is set to 200, but the Javadoc recommendations go up to 80 for 16+ vCore databases. This discrepancy could be confusing. Either update the documentation to explain why 200 is the maximum allowed value (e.g., for very large enterprise deployments), or reduce the @Max constraint to align with the documented recommendations.


  @Getter
- private static class RequestContext {
+ public static class RequestContext {

Copilot AI Jan 13, 2026


The RequestContext inner class visibility was changed from private to public to support context propagation. However, this exposes internal implementation details. Consider whether the public exposure is necessary, or if a narrower API (e.g., package-private or protected) would be sufficient. If public is required, add comprehensive Javadoc explaining the class's purpose and how external code should interact with it.

private BulkOperationSemaphore(int maxConcurrentOperations, long acquireTimeoutMs) {
  this.maxPermits = maxConcurrentOperations;
  this.acquireTimeoutMs = acquireTimeoutMs;
  this.semaphore = new Semaphore(maxConcurrentOperations, true); // Fair ordering

Copilot AI Jan 13, 2026


Using a fair semaphore (fairness = true) ensures FIFO ordering but has significant performance overhead compared to non-fair semaphores. For bulk operations, strict ordering may not be necessary. Consider whether fair ordering is required for your use case, or if switching to fairness = false would provide better throughput without impacting correctness. Document the reasoning for using fair mode if it's intentional.

Suggested change:

- this.semaphore = new Semaphore(maxConcurrentOperations, true); // Fair ordering
+ // Use non-fair semaphore for better throughput; FIFO ordering is not required for bulk operations.
+ this.semaphore = new Semaphore(maxConcurrentOperations, false);

Comment on lines 6880 to 6884
LOG.info(
    "Starting bulk operation for {} entities with {}/{} permits available",
    entities.size(),
    semaphore.availablePermits(),
    semaphore.getMaxPermits());

Copilot AI Jan 13, 2026


This INFO-level log is emitted for every bulk operation request and could generate significant log volume in production environments with frequent bulk operations. Consider either changing this to DEBUG level, or adding rate limiting to prevent log flooding. This is particularly important since bulk operations may be frequent during ingestion workflows.

Comment on lines 52 to 53
private static final ExecutorService VIRTUAL_THREAD_EXECUTOR =
    Executors.newVirtualThreadPerTaskExecutor();

Copilot AI Jan 13, 2026


The virtual thread executor is created as a static final field but is never explicitly shut down. While virtual thread executors don't strictly require shutdown for cleanup, it's good practice to shut down executors in test teardown to avoid resource leaks and ensure clean test isolation. Consider adding executor shutdown in the @AfterAll method or closing the executor after each test.
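
A sketch of the suggested teardown, assuming JUnit 5 lifecycle annotations:

```java
@AfterAll
static void shutDownVirtualThreadExecutor() {
  // Explicitly release the executor so the test class does not leak threads between runs.
  VIRTUAL_THREAD_EXECUTOR.shutdown();
}
```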

pmbrull previously approved these changes Jan 14, 2026
@github-actions

The Java checkstyle failed.

Please run mvn spotless:apply in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Java code formatting.

You can install the pre-commit hooks with make install_test precommit_install.


gitar-bot bot commented Jan 15, 2026

🔍 CI failure analysis for f3fa73f: Two unrelated CI failures: (1) Flaky glossary test with status mismatch, (2) SonarCloud runner out of disk space. Both are infrastructure/environmental issues.

Issue

Two unrelated CI failures: (1) Maven PostgreSQL test flakiness, (2) Maven SonarCloud infrastructure failure.

Root Cause

PostgreSQL CI: Flaky test in GlossaryResourceTest.testGlossaryImportExport expects status Approved but gets In Review.

SonarCloud CI: Infrastructure failure - GitHub Actions runner ran out of disk space.

Details

PostgreSQL CI Failure

Test location:

openmetadata-service/src/test/java/org/openmetadata/service/resources/glossary/GlossaryResourceTest.java:1070

Status mismatch:

- Expected: ...Approved,"#FF5733"...
+ Actual:   ...In Review,"#FF5733"...

Why unrelated: PR only modifies bulk operations, metrics tracking, and logging. No glossary, CSV import/export, or approval workflow code was changed.

SonarCloud CI Failure

Error:

System.IO.IOException: No space left on device
'/home/runner/actions-runner/cached/_diag/Worker_20260115-044421-utc.log'

The GitHub Actions runner exhausted disk space after ~3.5 hours during the Maven build step. This is a transient infrastructure issue unrelated to code changes.

Job details:

  • Job ID: 60432956942
  • Failed step: Build with Maven
  • Duration: ~3.5 hours before disk exhaustion

Both failures are environmental/infrastructure issues not caused by PR #25275's code changes.

Code Review: ✅ Approved — 8 resolved / 8 findings

Well-implemented bulk operation executor with proper thread safety and comprehensive metric instrumentation. All previous findings have been addressed.

Resolved ✅ (8):

Bug: Race condition in double-checked locking singleton initialization

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/BulkExecutor.java:259-269
The BulkExecutor singleton uses double-checked locking but the instance field is not declared as volatile. This can lead to a race condition where one thread may see a partially constructed BulkExecutor object. In Java, without volatile, the compiler and CPU are allowed to reorder operations, potentially exposing an incompletely initialized instance to another thread.

Impact: A thread could observe an incompletely constructed BulkExecutor with null fields like executor, leading to NullPointerException or incorrect behavior.

Fix: Add volatile modifier to the instance field:

private static volatile BulkExecutor instance;

This pattern is well-documented - see "Effective Java" Item 83 and the Java Memory Model specification.
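
For reference, a minimal sketch of the full pattern with the volatile field (constructor arguments omitted as placeholders):

```java
private static volatile BulkExecutor instance;

public static BulkExecutor getInstance() {
  BulkExecutor local = instance; // single volatile read on the fast path
  if (local == null) {
    synchronized (BulkExecutor.class) {
      local = instance;
      if (local == null) {
        local = new BulkExecutor(); // safely published via the volatile write below
        instance = local;
      }
    }
  }
  return local;
}
```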

Bug: Unsafe reset of non-volatile isShutdown flag

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/BulkExecutor.java:289-292
In the reset() method, instance.isShutdown = true is set before calling shutdownNow(), but the isShutdown field is not volatile. Other threads calling submit() may still see isShutdown = false due to lack of memory visibility guarantees.

Additionally, setting instance = null while holding the lock doesn't guarantee visibility to other threads that call getInstance() since instance is not volatile (as noted in the previous finding).

Impact: During reset operations (typically in tests), submissions might still succeed briefly after shutdown has been initiated, or getInstance() might return the old instance briefly.

Fix: Make both instance and isShutdown volatile:

private static volatile BulkExecutor instance;
private volatile boolean isShutdown = false;

Bug: Format specifier in LOG.info is not interpolated by SLF4J

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:745-748
The log statement uses %.1f format specifier with SLF4J's parameterized logging:

LOG.info(
    "Bulk operation completed: entity={}, total={}, succeeded={}, failed={}, "
        + "wallClockMs={}, avgEntityMs={}, maxEntityMs={}, throughput={:.1f}/s",
    ...
    throughput);

SLF4J does not support Java format specifiers like %.1f - it only supports {} placeholders. This will result in the literal string {:.1f} appearing in the log output instead of the formatted value.

Impact: Log output will show {:.1f} instead of the actual throughput value, making metrics hard to read.

Fix: Use {} placeholder and format the throughput value before logging:

LOG.info(
    "Bulk operation completed: entity={}, total={}, succeeded={}, failed={}, "
        + "wallClockMs={}, avgEntityMs={}, maxEntityMs={}, throughput={}/s",
    entityType,
    entities.size(),
    successRequests.size(),
    failedRequests.size(),
    totalDurationMs,
    avgEntityLatencyMs,
    maxEntityLatencyMs,
    String.format("%.1f", throughput));

Edge Case: Auto-scaling may produce too few threads with small connection pools

📄 openmetadata-service/src/main/java/org/openmetadata/service/config/BulkOperationConfiguration.java:67-76
When maxThreads is not explicitly set (defaults to -1), the auto-scaling logic calculates threads as 20% of connection pool size:

if (this.maxThreads < 1) {
  effectiveMaxThreads = Math.max(2, (int) (connectionPoolSize * 0.20));
}

With a small connection pool (e.g., 5 connections), this produces Math.max(2, 1) = 2 threads. Combined with queue size of maxThreads * 25 = 50, this may be too conservative for bulk operations.

More importantly, the calculation uses 20% of connections for bulk operations to leave 80% for user traffic, but this assumes bulk operations are always background tasks. If bulk operations are user-initiated (e.g., bulk import API), users might experience unexpectedly slow performance.

Impact: Small deployments may have very limited bulk processing capacity.

Suggestion: Consider documenting this behavior clearly and possibly having a higher minimum (e.g., 4-5 threads) or allowing configuration override guidance.

Bug: Race condition in internalTimerStartNanos check-then-act

📄 openmetadata-service/src/main/java/org/openmetadata/service/monitoring/RequestLatencyContext.java:99-104
The startDatabaseOperation() method reads internalTimerStartNanos, performs a calculation, then resets it to 0. This is a check-then-act sequence that is not atomic. When multiple child threads call this method concurrently on the same shared RequestContext, they can race on reading and writing internalTimerStartNanos:

  1. Thread A reads internalTimerStartNanos = 12345
  2. Thread B reads internalTimerStartNanos = 12345
  3. Thread A adds to internalTime and sets internalTimerStartNanos = 0
  4. Thread B adds the same value again to internalTime (double-counting)

The field is marked volatile but that doesn't make the compound operation atomic. Consider using AtomicLong with getAndSet(0) to atomically read and reset the value, or accept that internal time tracking may have some imprecision in multi-threaded scenarios by documenting this behavior.
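
A sketch of the getAndSet-based fix (field and counter names are taken from the finding; internalTime is assumed to be an AtomicLong):

```java
private final AtomicLong internalTimerStartNanos = new AtomicLong();

void startDatabaseOperation() {
  long start = internalTimerStartNanos.getAndSet(0); // atomic read-and-clear: only one thread sees the value
  if (start > 0) {
    internalTime.addAndGet(System.nanoTime() - start); // accumulate internal time exactly once
  }
  // ... start the database timer (omitted)
}
```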

...and 3 more from earlier reviews

What Works Well

The BulkExecutor singleton now uses proper volatile double-checked locking. Context propagation for metrics across worker threads is well-designed with atomic operations. Comprehensive test coverage including concurrency tests validates the thread pool behavior. The LOG.info format specifier bugs have been fixed throughout using String.format().



@harshach harshach merged commit f81bb04 into main Jan 15, 2026
28 of 34 checks passed
@harshach harshach deleted the improve_slow_request_metric branch January 15, 2026 22:41
Vishnuujain pushed a commit that referenced this pull request Jan 20, 2026
Improve Slow request metric calculation; Add bulkSync config to fine-tune (#25275)

* Improve Slow request metric calculation; Add bulkSync config to fine-tune

* Add clear metric instrumentation for bulk operations

* Address gitar comments

Labels

backend, safe to test
