Batched persistence with a queuing KVStore (PoC) #4310
base: main
Conversation
Add a `KVStoreSync` wrapper that queues writes and removes until explicitly committed. This enables batching multiple monitor updates into a single persistence operation, reducing I/O overhead. Key features:
- Writes and removes are queued until `commit()` is called
- Reads check pending ops first, then fall back to the inner store
- Multiple ops to the same key are deduplicated (latest wins)
- List operations reflect pending writes and removes

Co-Authored-By: Claude Opus 4.5 <[email protected]>
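As a rough illustration of how such a wrapper can work, here is a minimal sketch against a simplified key-value trait. This is not the PR's code: LDK's actual `KVStoreSync` trait has different (namespaced, fallible) signatures, and the `QueuedKVStore`/`SimpleKVStore` names below are placeholders.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Simplified stand-in for a synchronous key-value store trait.
trait SimpleKVStore {
    fn read(&self, key: &str) -> Option<Vec<u8>>;
    fn write(&self, key: &str, value: Vec<u8>);
    fn remove(&self, key: &str);
    fn list(&self) -> Vec<String>;
}

/// A pending operation for a key: either a queued write or a queued remove.
enum PendingOp {
    Write(Vec<u8>),
    Remove,
}

/// Wraps an inner store and queues writes/removes in memory until `commit()`.
/// Multiple operations on the same key are deduplicated (latest wins).
struct QueuedKVStore<S: SimpleKVStore> {
    inner: S,
    pending: Mutex<HashMap<String, PendingOp>>,
}

impl<S: SimpleKVStore> QueuedKVStore<S> {
    fn new(inner: S) -> Self {
        Self { inner, pending: Mutex::new(HashMap::new()) }
    }

    /// Flush all queued operations to the inner store in a single batch.
    fn commit(&self) {
        let ops = std::mem::take(&mut *self.pending.lock().unwrap());
        for (key, op) in ops {
            match op {
                PendingOp::Write(value) => self.inner.write(&key, value),
                PendingOp::Remove => self.inner.remove(&key),
            }
        }
    }
}

impl<S: SimpleKVStore> SimpleKVStore for QueuedKVStore<S> {
    fn read(&self, key: &str) -> Option<Vec<u8>> {
        // Pending ops take precedence over whatever the inner store holds.
        match self.pending.lock().unwrap().get(key) {
            Some(PendingOp::Write(value)) => Some(value.clone()),
            Some(PendingOp::Remove) => None,
            None => self.inner.read(key),
        }
    }

    fn write(&self, key: &str, value: Vec<u8>) {
        self.pending.lock().unwrap().insert(key.to_owned(), PendingOp::Write(value));
    }

    fn remove(&self, key: &str) {
        self.pending.lock().unwrap().insert(key.to_owned(), PendingOp::Remove);
    }

    fn list(&self) -> Vec<String> {
        // Start from the inner store's keys, drop pending removes, add pending writes.
        let pending = self.pending.lock().unwrap();
        let mut keys: Vec<String> = self
            .inner
            .list()
            .into_iter()
            .filter(|k| !matches!(pending.get(k), Some(PendingOp::Remove)))
            .collect();
        for (key, op) in pending.iter() {
            if matches!(op, PendingOp::Write(_)) && !keys.contains(key) {
                keys.push(key.clone());
            }
        }
        keys
    }
}
```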
Add a new `kv_store` parameter to `ChannelManager::new()` and a `persist()` method that handles persistence of the channel manager to the provided KVStore. This moves persistence logic into ChannelManager itself, similar to what the background processor currently does externally. Changes:
- Add `K: Deref` type parameter to `ChannelManager` where `K::Target: KVStoreSync`
- Add `kv_store: K` field to the struct
- Add `persist()` method that checks `get_and_clear_needs_persistence()` and writes to the KV store if needed
- Update `ChannelManagerReadArgs` to include `kv_store` parameter
- Update `AChannelManager` trait with `KVStore` and `K` associated types
- Update type aliases (`SimpleArcChannelManager`, `SimpleRefChannelManager`, `SimpleArcPeerManager`, `SimpleRefPeerManager`, etc.)
- Update test infrastructure and doctests

Co-Authored-By: Claude Opus 4.5 <[email protected]>
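A heavily simplified sketch of what the described `persist()` flow could look like, using placeholder types rather than the actual `ChannelManager`; the `"manager"` storage key and the `needs_persistence` flag handling are assumptions for illustration only.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Minimal stand-in store; only the write path matters for this sketch.
trait SimpleKVStore {
    fn write(&self, key: &str, value: Vec<u8>);
}

/// Hypothetical, heavily simplified stand-in for a manager that owns a store.
struct Manager<K: SimpleKVStore> {
    kv_store: K,
    needs_persistence: AtomicBool,
}

impl<K: SimpleKVStore> Manager<K> {
    /// Returns whether a persist is needed and clears the flag.
    fn get_and_clear_needs_persistence(&self) -> bool {
        self.needs_persistence.swap(false, Ordering::AcqRel)
    }

    /// Serialize the manager (stubbed out here) and write it only when dirty.
    fn persist(&self) {
        if self.get_and_clear_needs_persistence() {
            let encoded = self.encode();
            // The "manager" key is an assumption for this sketch, not the PR's constant.
            self.kv_store.write("manager", encoded);
        }
    }

    fn encode(&self) -> Vec<u8> {
        Vec::new() // placeholder for the real serialization
    }
}
```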
👋 Hi! I see this is a draft PR.
Modify `get_and_clear_pending_msg_events` to take the write lock on `total_consistency_lock` for the entire operation, then serialize and persist the ChannelManager before returning the events. This ensures the persisted state exactly matches the events being returned.

Previously, `persist()` was called first (releasing the lock), then events were gathered under a read lock. This left a window where incoming messages could modify state between persist and event gathering, causing the persisted snapshot to not match the returned events.

Also adds a `write_without_lock()` method to enable serialization when the caller already holds the consistency lock.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
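Abstractly, the new ordering looks something like the sketch below; the types and helper methods are placeholders rather than LDK internals, and only the lock-then-persist-then-collect ordering is the point.

```rust
use std::sync::RwLock;

/// Placeholder type; none of this mirrors LDK's actual data structures.
struct Node {
    total_consistency_lock: RwLock<()>,
}

impl Node {
    fn get_and_clear_pending_msg_events(&self) -> Vec<String> {
        // Hold the write lock for the whole operation so no incoming message can
        // mutate state between serializing the snapshot and collecting the events.
        let _write_guard = self.total_consistency_lock.write().unwrap();

        // Serialize without re-taking the lock we already hold, then persist.
        let snapshot = self.write_without_lock();
        self.persist_snapshot(&snapshot);

        // Only after the snapshot is persisted do we hand the events out, so the
        // persisted state matches exactly what the caller is about to send.
        self.take_pending_msg_events()
    }

    fn write_without_lock(&self) -> Vec<u8> { Vec::new() }          // stub
    fn persist_snapshot(&self, _snapshot: &[u8]) {}                 // stub
    fn take_pending_msg_events(&self) -> Vec<String> { Vec::new() } // stub
}
```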
Modify the `process_events_body` macro to persist the ChannelManager before handing events to the user handler. This ensures that if the handler crashes, events will be replayed on restart since they're still in the persisted state.

Also calls `commit()` on the KVStore after writing, which flushes all pending writes (including any queued monitor updates) atomically.

The `persist()` method is removed as persistence now happens automatically as part of event processing.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
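A sketch of the same idea for the event-processing path, again with placeholder types; only the persist-then-commit-then-handle ordering is meant to match the description above.

```rust
/// Placeholder types; this only illustrates the ordering, not LDK's real API.
struct Node;
struct QueuedStore;

impl Node {
    fn persist(&self) { /* serialize the manager into the queued store */ }
    fn take_pending_events(&self) -> Vec<String> { Vec::new() }
}

impl QueuedStore {
    fn commit(&self) { /* flush all queued monitor + manager writes in one batch */ }
}

fn process_pending_events(node: &Node, store: &QueuedStore, handler: impl Fn(&str)) {
    // Persist while the events are still part of the manager's state, so a crash
    // inside the handler causes the events to be replayed on restart.
    node.persist();

    // Flush everything queued so far (monitor updates + manager) together.
    store.commit();

    // Only now hand the events to the user's handler.
    for event in node.take_pending_events() {
        handler(&event);
    }
}
```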
Added three commits that make persistence atomic with event processing. First, extracted a …
Summary
This PR introduces a mechanism to batch channel monitor and channel manager persistence, committing all writes together at well-defined chokepoints.
Instead of persisting monitors synchronously when they're updated, writes go to a `QueuedKVStoreSync` wrapper that queues them in memory without touching disk. Multiple writes to the same monitor are deduplicated, keeping only the latest state. At chokepoints where data leaves the system (`get_and_clear_pending_msg_events`, `get_and_clear_pending_events`), we call `commit()` on the queued store, which flushes all pending writes (monitors + channel manager) to the underlying storage in a single batch.

It is safe to postpone writes as long as no data has left the system - neither over the network (Lightning messages to peers) nor to the application (events delivered to the user). Once data leaves these boundaries, the node has made external commitments that must survive a crash. By persisting exactly at these exit points, we ensure the persisted state always reflects what has been externally communicated. If we crash before commit, nothing was sent, so we replay from the last committed state; if we crash after, all persisted state reflects what was sent. This eliminates the race window where a monitor is persisted but the manager isn't yet, which can cause state desync and force closures on restart.
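To make the batching and deduplication concrete, here is a small self-contained toy (placeholder names, not LDK's API) in which queued writes only hit "disk" at the chokepoint commit:

```rust
// Toy illustration of batching/dedup at a chokepoint; not LDK's KVStoreSync.
use std::cell::RefCell;
use std::collections::HashMap;

#[derive(Default)]
struct ToyQueuedStore {
    pending: RefCell<HashMap<String, Vec<u8>>>, // queued writes, latest wins per key
    disk_writes: RefCell<u32>,                  // how many writes actually hit "disk"
}

impl ToyQueuedStore {
    fn write(&self, key: &str, value: &[u8]) {
        self.pending.borrow_mut().insert(key.to_owned(), value.to_vec());
    }
    fn commit(&self) {
        for (_key, _value) in self.pending.borrow_mut().drain() {
            *self.disk_writes.borrow_mut() += 1; // stand-in for one real I/O per key
        }
    }
}

fn main() {
    let store = ToyQueuedStore::default();
    // Several monitor updates plus a manager update happen between chokepoints...
    store.write("monitor/chan_1", b"update 1");
    store.write("monitor/chan_1", b"update 2"); // dedupes with the previous write
    store.write("manager", b"manager state");
    assert_eq!(*store.disk_writes.borrow(), 0); // nothing persisted yet

    // ...and are flushed together at the chokepoint, one write per distinct key.
    store.commit();
    assert_eq!(*store.disk_writes.borrow(), 2);
}
```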
What's Missing
This is a proof-of-concept. To be production-ready:
- Atomic multi-key writes in KVStore: The current `commit()` writes keys sequentially. A crash mid-commit could leave partial state. Production use requires a KVStore implementation that supports atomic multi-key writes. Note that this is impossible for filesystem-based backends like `FilesystemStore` - the desync gap cannot be fully closed with that backend. However, SQLite or other database backends can achieve true atomicity using database transactions (see the sketch after this list).
- Complete persistence scenario coverage: Only `get_and_clear_pending_msg_events` and `get_and_clear_pending_events` currently trigger commits. Other scenarios may need coverage:
- ChannelManager serialization size: The current approach serializes the entire ChannelManager on each commit. For nodes with many channels, this could be expensive. Potential optimizations:
- Full node integration test: Add a test that uses `QueuedKVStoreSync` with a complete node setup, verifying that batched persistence works correctly across channel creation, payments, and node restart/reload.
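As a sketch of how a database backend could make `commit()` atomic across keys, here is a batched write wrapped in a single SQLite transaction via the `rusqlite` crate; the crate choice, table schema, and key names are assumptions for illustration, not part of this PR.

```rust
// Sketch: committing several queued key/value writes in one SQLite transaction,
// so a crash mid-commit leaves either all of them applied or none.
use rusqlite::{params, Connection, Result};

fn commit_batch(conn: &mut Connection, pending: Vec<(String, Vec<u8>)>) -> Result<()> {
    let tx = conn.transaction()?;
    for (key, value) in pending {
        tx.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?1, ?2)",
            params![key, value],
        )?;
    }
    // All writes become durable together, or not at all.
    tx.commit()
}

fn main() -> Result<()> {
    let mut conn = Connection::open("store.sqlite")?;
    conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)", [])?;
    commit_batch(
        &mut conn,
        vec![
            ("monitor/chan_1".to_string(), vec![0u8; 4]), // hypothetical monitor key
            ("manager".to_string(), vec![1u8; 4]),        // hypothetical manager key
        ],
    )
}
```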