Conversation

@harshach harshach commented Jan 13, 2026

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • Request latency tracking enhancement:
    • Added context propagation methods (getContext, setContext, clearContext) in RequestLatencyContext.java for virtual thread metric aggregation
    • Thread-safe timing with AtomicLong and AtomicInteger ensures accurate metrics across concurrent operations
  • Bulk operation executor refactor:
    • New BulkExecutor.java replaces BulkOperationSemaphore with clearer metric instrumentation for database and search operations
    • Wraps virtual thread executor with connection-aware throttling and comprehensive operation tracking
  • Search operation instrumentation:
    • Added metric tracking to ElasticSearchSearchManager, OpenSearchSearchManager, and aggregation managers
    • Database query logging enhanced in OMSqlLogger with request context integration
  • Bulk operation configuration:
    • BulkOperationConfiguration with auto-scaling based on connection pool size (default: 20% for bulk ops, 80% for user traffic)
    • Configuration example added to conf/openmetadata.yaml
  • Comprehensive test coverage:
    • New BulkExecutorTest (364 lines) with concurrency and throttling tests
    • OMSqlLoggerTest added for SQL logging verification


Copilot AI left a comment


Pull request overview

This PR improves slow request metric calculation and adds bulk operation configuration for fine-tuning concurrent database operations. The changes address connection pool exhaustion during bulk operations by introducing a semaphore-based throttling mechanism and improve metric accuracy across virtual threads using atomic operations.

Changes:

  • Introduced BulkOperationSemaphore to limit concurrent DB operations during bulk processing and prevent connection pool starvation
  • Modified RequestLatencyContext to use atomic operations (AtomicLong, AtomicInteger) for thread-safe metric accumulation across virtual threads
  • Added BulkOperationConfiguration with auto-scaling support based on connection pool size
  • Updated EntityRepository.bulkCreateOrUpdateEntities() to use semaphore throttling and propagate request latency context to virtual threads (a rough sketch of this flow follows the list)
  • Added comprehensive test coverage for concurrency scenarios, semaphore behavior, and metrics tracking
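
For orientation, here is a hypothetical sketch of how the throttling and context propagation described above could fit together. It is not the PR's actual code: BulkOperationSemaphore.getInstance()/acquire()/release() are assumed method names, and timeout/error handling is omitted.

```java
// Needs java.util.* and java.util.concurrent.* imports; illustrative only.
void bulkCreateOrUpdateSketch(List<Runnable> perEntityWork) {
  // Capture the parent's latency context so child timings roll up into the request metrics.
  RequestLatencyContext.RequestContext parentLatencyContext = RequestLatencyContext.getContext();
  List<CompletableFuture<Void>> futures = new ArrayList<>();
  try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
    for (Runnable work : perEntityWork) {
      futures.add(CompletableFuture.runAsync(() -> {
        RequestLatencyContext.setContext(parentLatencyContext); // propagate to the virtual thread
        try {
          BulkOperationSemaphore.getInstance().acquire(); // assumed API: blocks until a DB permit is free
          try {
            work.run(); // per-entity create/update against the database
          } finally {
            BulkOperationSemaphore.getInstance().release();
          }
        } finally {
          RequestLatencyContext.clearContext();
        }
      }, executor));
    }
    CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
  }
}
```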

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

Summary per file:

  • RequestLatencyContext.java — Changed timing counters from primitive longs/ints to AtomicLong/AtomicInteger for thread-safe metric accumulation; added context propagation methods (getContext/setContext/clearContext) for virtual threads; added reset() method for testing
  • EntityRepository.java — Integrated semaphore-based throttling for bulk operations; added request latency context propagation to virtual threads; replaced fixed thread pool with virtual thread executor; added timeout handling for overload scenarios
  • BulkOperationSemaphore.java — New singleton class managing concurrent DB operations using fair semaphore with configurable permits and timeouts; supports auto-scaling based on connection pool size
  • BulkOperationConfiguration.java — New configuration class with parameters for max concurrent operations, connection pool percentage allocation, auto-scaling flag, and acquire timeout
  • OpenMetadataApplicationConfig.java — Added bulkOperationConfiguration property with getter that returns default instance if not configured
  • OpenMetadataApplication.java — Initialized BulkOperationSemaphore during application startup using config and connection pool size
  • RequestLatencyContextTest.java — Added 480+ lines of new tests covering context propagation, bulk operations simulation, concurrent operations, stress testing, and timing accuracy
  • BulkOperationSemaphoreTest.java — New test file with 263 lines covering initialization, permit acquisition/release, concurrent access, timeouts, and auto-scaling calculations
  • BulkOperationIntegrationTest.java — New integration test file with 528 lines covering semaphore throttling, metrics tracking across virtual threads, real workload simulation, and concurrent bulk requests
Comments suppressed due to low confidence (1)

openmetadata-service/src/test/java/org/openmetadata/service/monitoring/RequestLatencyContextTest.java:37

  • The test uses reflection to clear static maps in RequestLatencyContext, but the production code now has a public reset() method (line 382-389 in RequestLatencyContext.java) that does exactly this. Replace the reflection-based clearStaticMaps() method with a simple call to RequestLatencyContext.reset() to improve maintainability and avoid fragility.
  private void clearStaticMaps() throws Exception {
    java.lang.reflect.Field requestTimersField =
        RequestLatencyContext.class.getDeclaredField("requestTimers");
    requestTimersField.setAccessible(true);
    ((java.util.concurrent.ConcurrentHashMap<?, ?>) requestTimersField.get(null)).clear();
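
If reset() does what the comment describes, the reflection helper could be replaced by a plain setup hook. A minimal sketch, assuming JUnit 5:

```java
@BeforeEach
void resetLatencyContext() {
  // Same cleanup as the reflection-based helper above, via the public reset() hook.
  RequestLatencyContext.reset();
}
```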

// Initialize bulk operation semaphore for connection-aware throttling
BulkOperationSemaphore.initialize(
    catalogConfig.getBulkOperationConfiguration(),
    catalogConfig.getDataSourceFactory().getMaxSize());

Copilot AI Jan 13, 2026


The call to getMaxSize() on DataSourceFactory may not return the intended value if the factory is a HikariCPDataSourceFactory, which uses maximumPoolSize instead. The getMaxSize() method is inherited from Dropwizard's DataSourceFactory base class, but HikariCPDataSourceFactory has its own maximumPoolSize field. Verify that getMaxSize() returns the correct value, or consider using HikariCPDataSourceFactory.getMaximumPoolSize() directly when the factory is of that type. If this is correct, consider adding a comment explaining the relationship between these properties.
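
A possible defensive lookup, assuming a HikariCPDataSourceFactory type exposing getMaximumPoolSize() as the comment suggests (hypothetical, untested against the actual classes):

```java
// Prefer Hikari's own maximumPoolSize when that factory type is configured.
int poolSize =
    catalogConfig.getDataSourceFactory() instanceof HikariCPDataSourceFactory hikari
        ? hikari.getMaximumPoolSize()
        : catalogConfig.getDataSourceFactory().getMaxSize();
BulkOperationSemaphore.initialize(catalogConfig.getBulkOperationConfiguration(), poolSize);
```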

List<CompletableFuture<Void>> futures = new ArrayList<>();

// Capture the request latency context from the parent thread for propagation to virtual threads
RequestLatencyContext.RequestContext parentLatencyContext = RequestLatencyContext.getContext();

Copilot AI Jan 13, 2026


The parent latency context is captured but there's no null check. If bulkCreateOrUpdateEntities is called outside of a request context (e.g., from a background job or during testing), parentLatencyContext will be null. While line 6891 uses setContext(parentLatencyContext) which has null protection, logging and debugging will be affected. Consider adding a null check with a debug log to indicate when bulk operations occur outside request tracking.
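
A minimal sketch of the suggested guard (the variable name is taken from the snippet above):

```java
if (parentLatencyContext == null) {
  // Bulk call originated outside a tracked request (background job, test, etc.);
  // per-entity timings will not be aggregated into a request metric.
  LOG.debug("bulkCreateOrUpdateEntities invoked without a request latency context");
}
```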

Comment on lines 69 to 70
@Max(200)
private int maxConcurrentDbOperations = 10;

Copilot AI Jan 13, 2026


The maximum validation is set to 200, but the Javadoc recommendations go up to 80 for 16+ vCore databases. This discrepancy could be confusing. Either update the documentation to explain why 200 is the maximum allowed value (e.g., for very large enterprise deployments), or reduce the @Max constraint to align with the documented recommendations.


  @Getter
- private static class RequestContext {
+ public static class RequestContext {

Copilot AI Jan 13, 2026


The RequestContext inner class visibility was changed from private to public to support context propagation. However, this exposes internal implementation details. Consider whether the public exposure is necessary, or if a narrower API (e.g., package-private or protected) would be sufficient. If public is required, add comprehensive Javadoc explaining the class's purpose and how external code should interact with it.

private BulkOperationSemaphore(int maxConcurrentOperations, long acquireTimeoutMs) {
  this.maxPermits = maxConcurrentOperations;
  this.acquireTimeoutMs = acquireTimeoutMs;
  this.semaphore = new Semaphore(maxConcurrentOperations, true); // Fair ordering

Copilot AI Jan 13, 2026


Using a fair semaphore (fairness = true) ensures FIFO ordering but has significant performance overhead compared to non-fair semaphores. For bulk operations, strict ordering may not be necessary. Consider whether fair ordering is required for your use case, or if switching to fairness = false would provide better throughput without impacting correctness. Document the reasoning for using fair mode if it's intentional.

Suggested change:

- this.semaphore = new Semaphore(maxConcurrentOperations, true); // Fair ordering
+ // Use non-fair semaphore for better throughput; FIFO ordering is not required for bulk operations.
+ this.semaphore = new Semaphore(maxConcurrentOperations, false);

Comment on lines 6880 to 6884
LOG.info(
    "Starting bulk operation for {} entities with {}/{} permits available",
    entities.size(),
    semaphore.availablePermits(),
    semaphore.getMaxPermits());

Copilot AI Jan 13, 2026


This INFO-level log is emitted for every bulk operation request and could generate significant log volume in production environments with frequent bulk operations. Consider either changing this to DEBUG level, or adding rate limiting to prevent log flooding. This is particularly important since bulk operations may be frequent during ingestion workflows.

Comment on lines 52 to 53
private static final ExecutorService VIRTUAL_THREAD_EXECUTOR =
    Executors.newVirtualThreadPerTaskExecutor();

Copilot AI Jan 13, 2026


The virtual thread executor is created as a static final field but is never explicitly shut down. While virtual thread executors don't strictly require shutdown for cleanup, it's good practice to shut down executors in test teardown to avoid resource leaks and ensure clean test isolation. Consider adding executor shutdown in the @AfterAll method or closing the executor after each test.
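
A sketch of the suggested teardown, assuming JUnit 5 lifecycle annotations:

```java
@AfterAll
static void shutDownVirtualThreadExecutor() {
  // Explicitly release the executor so the test class does not leak threads between runs.
  VIRTUAL_THREAD_EXECUTOR.shutdown();
}
```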

pmbrull previously approved these changes Jan 14, 2026
@github-actions

The Java checkstyle failed.

Please run mvn spotless:apply in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Java code formatting.

You can install the pre-commit hooks with make install_test precommit_install.


gitar-bot bot commented Jan 15, 2026

🔍 CI failure analysis for f3fa73f: Two unrelated CI failures: (1) Flaky glossary test with status mismatch, (2) SonarCloud runner out of disk space. Both are infrastructure/environmental issues.

Issue

Two unrelated CI failures: (1) Maven PostgreSQL test flakiness, (2) Maven SonarCloud infrastructure failure.

Root Cause

PostgreSQL CI: Flaky test in GlossaryResourceTest.testGlossaryImportExport expects status Approved but gets In Review.

SonarCloud CI: Infrastructure failure - GitHub Actions runner ran out of disk space.

Details

PostgreSQL CI Failure

Test location:

openmetadata-service/src/test/java/org/openmetadata/service/resources/glossary/GlossaryResourceTest.java:1070

Status mismatch:

- Expected: ...Approved,"#FF5733"...
+ Actual:   ...In Review,"#FF5733"...

Why unrelated: PR only modifies bulk operations, metrics tracking, and logging. No glossary, CSV import/export, or approval workflow code was changed.

SonarCloud CI Failure

Error:

System.IO.IOException: No space left on device
'/home/runner/actions-runner/cached/_diag/Worker_20260115-044421-utc.log'

The GitHub Actions runner exhausted disk space after ~3.5 hours during the Maven build step. This is a transient infrastructure issue unrelated to code changes.

Job details:

  • Job ID: 60432956942
  • Failed step: Build with Maven
  • Duration: ~3.5 hours before disk exhaustion

Both failures are environmental/infrastructure issues not caused by PR #25275's code changes.

Code Review: ✅ Approved — 8 resolved / 8 findings

Well-implemented bulk operation executor with proper thread safety and comprehensive metric instrumentation. All previous findings have been addressed.

Resolved ✅ (8):

Bug: Race condition in double-checked locking singleton initialization

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/BulkExecutor.java:259-269
The BulkExecutor singleton uses double-checked locking but the instance field is not declared as volatile. This can lead to a race condition where one thread may see a partially constructed BulkExecutor object. In Java, without volatile, the compiler and CPU are allowed to reorder operations, potentially exposing an incompletely initialized instance to another thread.

Impact: A thread could observe an incompletely constructed BulkExecutor with null fields like executor, leading to NullPointerException or incorrect behavior.

Fix: Add volatile modifier to the instance field:

private static volatile BulkExecutor instance;

This pattern is well-documented - see "Effective Java" Item 83 and the Java Memory Model specification.
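
For reference, a minimal sketch of the full pattern with the volatile field (constructor arguments omitted as placeholders):

```java
private static volatile BulkExecutor instance;

public static BulkExecutor getInstance() {
  BulkExecutor local = instance; // single volatile read on the fast path
  if (local == null) {
    synchronized (BulkExecutor.class) {
      local = instance;
      if (local == null) {
        local = new BulkExecutor(); // safely published via the volatile write below
        instance = local;
      }
    }
  }
  return local;
}
```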

Bug: Unsafe reset of non-volatile isShutdown flag

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/BulkExecutor.java:289-292
In the reset() method, instance.isShutdown = true is set before calling shutdownNow(), but the isShutdown field is not volatile. Other threads calling submit() may still see isShutdown = false due to lack of memory visibility guarantees.

Additionally, setting instance = null while holding the lock doesn't guarantee visibility to other threads that call getInstance() since instance is not volatile (as noted in the previous finding).

Impact: During reset operations (typically in tests), submissions might still succeed briefly after shutdown has been initiated, or getInstance() might return the old instance briefly.

Fix: Make both instance and isShutdown volatile:

private static volatile BulkExecutor instance;
private volatile boolean isShutdown = false;

Bug: Format specifier in LOG.info is not interpolated by SLF4J

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:745-748
The log statement uses %.1f format specifier with SLF4J's parameterized logging:

LOG.info(
    "Bulk operation completed: entity={}, total={}, succeeded={}, failed={}, "
        + "wallClockMs={}, avgEntityMs={}, maxEntityMs={}, throughput={:.1f}/s",
    ...
    throughput);

SLF4J does not support Java format specifiers like %.1f - it only supports {} placeholders. This will result in the literal string {:.1f} appearing in the log output instead of the formatted value.

Impact: Log output will show {:.1f} instead of the actual throughput value, making metrics hard to read.

Fix: Use {} placeholder and format the throughput value before logging:

LOG.info(
    "Bulk operation completed: entity={}, total={}, succeeded={}, failed={}, "
        + "wallClockMs={}, avgEntityMs={}, maxEntityMs={}, throughput={}/s",
    entityType,
    entities.size(),
    successRequests.size(),
    failedRequests.size(),
    totalDurationMs,
    avgEntityLatencyMs,
    maxEntityLatencyMs,
    String.format("%.1f", throughput));

Edge Case: Auto-scaling may produce too few threads with small connection pools

📄 openmetadata-service/src/main/java/org/openmetadata/service/config/BulkOperationConfiguration.java:67-76
When maxThreads is not explicitly set (defaults to -1), the auto-scaling logic calculates threads as 20% of connection pool size:

if (this.maxThreads < 1) {
  effectiveMaxThreads = Math.max(2, (int) (connectionPoolSize * 0.20));
}

With a small connection pool (e.g., 5 connections), this produces Math.max(2, 1) = 2 threads. Combined with queue size of maxThreads * 25 = 50, this may be too conservative for bulk operations.

More importantly, the calculation uses 20% of connections for bulk operations to leave 80% for user traffic, but this assumes bulk operations are always background tasks. If bulk operations are user-initiated (e.g., bulk import API), users might experience unexpectedly slow performance.

Impact: Small deployments may have very limited bulk processing capacity.

Suggestion: Consider documenting this behavior clearly and possibly having a higher minimum (e.g., 4-5 threads) or allowing configuration override guidance.

Bug: Race condition in internalTimerStartNanos check-then-act

📄 openmetadata-service/src/main/java/org/openmetadata/service/monitoring/RequestLatencyContext.java:99-104
The startDatabaseOperation() method reads internalTimerStartNanos, performs a calculation, then resets it to 0. This is a check-then-act sequence that is not atomic. When multiple child threads call this method concurrently on the same shared RequestContext, they can race on reading and writing internalTimerStartNanos:

  1. Thread A reads internalTimerStartNanos = 12345
  2. Thread B reads internalTimerStartNanos = 12345
  3. Thread A adds to internalTime and sets internalTimerStartNanos = 0
  4. Thread B adds the same value again to internalTime (double-counting)

The field is marked volatile but that doesn't make the compound operation atomic. Consider using AtomicLong with getAndSet(0) to atomically read and reset the value, or accept that internal time tracking may have some imprecision in multi-threaded scenarios by documenting this behavior.
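
A sketch of the getAndSet-based fix (field and counter names are taken from the finding; internalTime is assumed to be an AtomicLong):

```java
private final AtomicLong internalTimerStartNanos = new AtomicLong();

void startDatabaseOperation() {
  long start = internalTimerStartNanos.getAndSet(0); // atomic read-and-clear: only one thread sees the value
  if (start > 0) {
    internalTime.addAndGet(System.nanoTime() - start); // accumulate internal time exactly once
  }
  // ... start the database timer (omitted)
}
```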

...and 3 more from earlier reviews

What Works Well

The BulkExecutor singleton now uses proper volatile double-checked locking. Context propagation for metrics across worker threads is well-designed with atomic operations. Comprehensive test coverage including concurrency tests validates the thread pool behavior. The LOG.info format specifier bugs have been fixed throughout using String.format().



@harshach harshach merged commit f81bb04 into main Jan 15, 2026
28 of 34 checks passed
@harshach harshach deleted the improve_slow_request_metric branch January 15, 2026 22:41
Vishnuujain pushed a commit that referenced this pull request Jan 20, 2026
Improve Slow request metric calculation; Add bulkSync config to fine-tune (#25275)

* Improve Slow request metric calculation; Add bulkSync config to fine-tune

* Add clear metric instrumentation for bulk operations

* Address gitar comments

Labels

backend, safe to test
