[server] throttling deletes to limit disk i/o #2397
base: main
Conversation
huangminchn left a comment:
Thanks for the quick turnaround, Yug!
Can we double check the code path used during backup version deletion? My understanding is that it's going through "RocksDBStoragePartition#drop()".
We can add a test case for backup version deletion if none exists yet, and validate the deletion rate.
new VeniceStoreVersionConfig(testStore, veniceServerProperties, PersistenceType.ROCKS_DB);
StorageEngine storeEngine = factory.getStorageEngine(testStoreConfig);
Assert.assertNotNull(storeEngine, "Storage engine should be created successfully");
factory.removeStorageEngine(storeEngine);
Can we double check that this code path is used during backup version deletion?
Yeah, it is in the code path. The call stack looks like the following:
VersionBackend.delete()
→ IngestionBackend.removeStorageEngine(topicName)
→ StorageService.removeStorageEngine(kafkaTopic)
→ StorageEngine.drop()
→ AbstractStorageEngine.drop() [iterates partitions]
→ RocksDBStoragePartition.drop()
→ RocksDB.destroyDB()
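For the throttle to apply on this path, the Options handed to RocksDB.destroyDB() have to reference the shared SstFileManager that carries the rate limit. A minimal sketch of that wiring with the RocksJava API (illustrative only, not the actual Venice code; the class and method names here are made up):

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.SstFileManager;

// Sketch only: destroyDB() can be throttled only when the Options it receives
// point at the same shared SstFileManager that holds the delete rate limit.
final class ThrottledDropSketch {
  static void dropPartitionDirectory(String dbPath, SstFileManager sharedSstFileManager)
      throws RocksDBException {
    try (Options options = new Options().setSstFileManager(sharedSstFileManager)) {
      RocksDB.destroyDB(dbPath, options);
    }
  }
}
```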
// SstFileManager deletion rate limiting - disabled by default (0) for backward compatibility
// Recommended: 64 MB/s (67108864) for nodes with 'discard' mount option
this.sstFileManagerDeleteRateBytesPerSecond =
    props.getSizeInBytes(ROCKSDB_SST_FILE_MANAGER_DELETE_RATE_BYTES_PER_SECOND, 0L);
this.sstFileManagerMaxTrashDBRatio = props.getDouble(ROCKSDB_SST_FILE_MANAGER_MAX_TRASH_DB_RATIO, 0.25); // 25% default
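For reference, a minimal sketch of how these two values map onto the RocksJava SstFileManager API (illustrative only, not the actual RocksDBStorageEngineFactory code; treating a rate of 0 as "disabled" mirrors the backward-compatible default above):

```java
import org.rocksdb.Env;
import org.rocksdb.RocksDBException;
import org.rocksdb.SstFileManager;

// Illustrative wiring of the two new configs into a shared SstFileManager.
final class SstFileManagerSetupSketch {
  static SstFileManager create(long deleteRateBytesPerSecond, double maxTrashDbRatio)
      throws RocksDBException {
    SstFileManager sstFileManager = new SstFileManager(Env.getDefault());
    if (deleteRateBytesPerSecond > 0) { // 0 keeps deletes unthrottled (default)
      sstFileManager.setDeleteRateBytesPerSecond(deleteRateBytesPerSecond);
      sstFileManager.setMaxTrashDBRatio(maxTrashDbRatio);
    }
    return sstFileManager;
  }
}
```

Since the factory shares one SstFileManager across the partitions it creates, a single rate limit then covers every partition drop on the node.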
Wouldn't this mean that 75% of the DB will be deleted in a throttled manner, but then the last 25% will be deleted immediately?
We need the throttling to continue to be applied all the way through.
I do not think that is what the config value means. It specifies the upper limit on the trash-to-live-DB size ratio that can be tolerated before files are deleted immediately, overriding the rate limit. So, when the ratio of trash to live DB hits 25%, throttling will be ignored.
I have added a test to validate the deletion rate.
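For context, a rough sketch of what such a deletion-rate check could look like, continuing the test snippet quoted above (this is not the test added in the PR; getOnDiskSizeBytes is a hypothetical helper, and the numbers are only for illustration):

```java
// Rough sketch: drop a store of known on-disk size and assert the drop took at
// least size / rate seconds, i.e. file deletions were actually throttled.
// Note: once trash exceeds maxTrashDBRatio of the live DB size, deletions go
// unthrottled, so a real test has to pick sizes where this lower bound still holds.
long rateBytesPerSecond = 64L * 1024 * 1024; // 64 MB/s
long storeSizeBytes = getOnDiskSizeBytes(testStore); // hypothetical helper
long startMs = System.currentTimeMillis();
factory.removeStorageEngine(storeEngine);
long elapsedMs = System.currentTimeMillis() - startMs;
long minExpectedMs = storeSizeBytes * 1000L / rateBytesPerSecond;
Assert.assertTrue(
    elapsedMs >= minExpectedMs,
    "Deletion finished too fast; the delete rate limit does not appear to be applied");
```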
Problem Statement
Venice storage nodes with the 'discard' mount option enabled are experiencing disk health check failures during large store deletions: synchronous TRIM operations saturate disk I/O, causing the DiskHealthCheckService to time out and mark nodes unhealthy.
Solution
Added configurable deletion rate limiting for RocksDB's SstFileManager to throttle file deletion operations during store cleanup:
- Added two new configuration parameters to RocksDBServerConfig: rocksdb.sst.file.manager.delete.rate.bytes.per.second and rocksdb.sst.file.manager.max.trash.db.ratio
- Updated RocksDBStorageEngineFactory to apply deletion rate limits to the shared SstFileManager instance when configured

Code changes
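As an illustration of how the new knobs would be set in server properties (the 64 MB/s value is the recommendation from the code comment above; exactly how these properties reach the server is deployment-specific):

```properties
# Recommended 64 MB/s for nodes mounted with 'discard'; 0 (the default) disables throttling
rocksdb.sst.file.manager.delete.rate.bytes.per.second=67108864
# Deletes become unthrottled once trash exceeds this fraction of the live DB size (default 0.25)
rocksdb.sst.file.manager.max.trash.db.ratio=0.25
```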
Concurrency-Specific Checks
Both reviewer and PR author to verify
- Synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
- Thread-safe data structures (e.g., ConcurrentHashMap, CopyOnWriteArrayList) are used where appropriate.

How was this PR tested?
Does this PR introduce any user-facing or breaking changes?