
Conversation

@joostjager
Contributor

Summary

This PR introduces a mechanism to batch channel monitor and channel manager persistence, committing all writes together at well-defined chokepoints.

Instead of persisting monitors synchronously when they're updated, writes go to a QueuedKVStoreSync wrapper that queues them in memory without touching disk. Multiple writes to the same monitor are deduplicated, keeping only the latest state. At chokepoints where data leaves the system (get_and_clear_pending_msg_events, get_and_clear_pending_events), we call commit() on the queued store, which flushes all pending writes (monitors + channel manager) to the underlying storage in a single batch.
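As a rough illustration of the queue-and-commit behavior, here is a minimal sketch against a simplified stand-in for the `KVStoreSync` interface (the trait and types below are illustrative, not LDK's actual API):

```rust
use std::collections::HashMap;

// Hypothetical simplified stand-in for LDK's KVStoreSync trait.
trait KVStoreSync {
    fn write(&mut self, key: &str, value: Vec<u8>);
    fn remove(&mut self, key: &str);
    fn read(&self, key: &str) -> Option<Vec<u8>>;
}

// In-memory backing store for the example.
#[derive(Default)]
struct MemStore {
    data: HashMap<String, Vec<u8>>,
}

impl KVStoreSync for MemStore {
    fn write(&mut self, key: &str, value: Vec<u8>) { self.data.insert(key.to_string(), value); }
    fn remove(&mut self, key: &str) { self.data.remove(key); }
    fn read(&self, key: &str) -> Option<Vec<u8>> { self.data.get(key).cloned() }
}

// One pending op per key; a later op to the same key overwrites the earlier one.
enum PendingOp { Write(Vec<u8>), Remove }

// Queued wrapper: writes and removes are buffered until commit().
struct QueuedKVStoreSync<S: KVStoreSync> {
    inner: S,
    pending: HashMap<String, PendingOp>,
}

impl<S: KVStoreSync> QueuedKVStoreSync<S> {
    fn new(inner: S) -> Self { Self { inner, pending: HashMap::new() } }

    fn write(&mut self, key: &str, value: Vec<u8>) {
        // Dedup: only the latest state for a key is kept.
        self.pending.insert(key.to_string(), PendingOp::Write(value));
    }

    fn remove(&mut self, key: &str) {
        self.pending.insert(key.to_string(), PendingOp::Remove);
    }

    // Reads see pending ops first, then fall through to the inner store.
    fn read(&self, key: &str) -> Option<Vec<u8>> {
        match self.pending.get(key) {
            Some(PendingOp::Write(v)) => Some(v.clone()),
            Some(PendingOp::Remove) => None,
            None => self.inner.read(key),
        }
    }

    // Flush all pending ops to the inner store in one batch.
    fn commit(&mut self) {
        for (key, op) in self.pending.drain() {
            match op {
                PendingOp::Write(v) => self.inner.write(&key, v),
                PendingOp::Remove => self.inner.remove(&key),
            }
        }
    }
}
```

Until `commit()` runs, nothing reaches the inner store, so repeated monitor updates between chokepoints cost one buffered entry rather than one disk write each.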

It is safe to postpone writes as long as no data has left the system: neither over the network (Lightning messages to peers) nor to the application (events delivered to the user). Once data leaves these boundaries, the node has made external commitments that must survive a crash. By persisting exactly at these exit points, we ensure the persisted state always reflects what has been externally communicated. If we crash before commit, nothing was sent, so we replay from the last committed state; if we crash after, all persisted state reflects what was sent. This eliminates the race window where a monitor is persisted but the manager isn't yet, which can cause state desync and force closures on restart.

What's Missing

This is a proof-of-concept. To be production-ready:

  1. Atomic multi-key writes in KVStore: The current commit() writes keys sequentially. A crash mid-commit could leave partial state. Production use requires a KVStore implementation that supports atomic multi-key writes. Note that this is impossible for filesystem-based backends like FilesystemStore - the desync gap cannot be fully closed with that backend. However, SQLite or other database backends can achieve true atomicity using database transactions.

  2. Complete persistence scenario coverage: Only get_and_clear_pending_msg_events and get_and_clear_pending_events currently trigger commits. Other scenarios may need coverage:

    • Block sync / chain monitoring updates
    • Shutdown / graceful close
    • Background persistence timer (if any pending writes exist)
  3. ChannelManager serialization size: The current approach serializes the entire ChannelManager on each commit. For nodes with many channels, this could be expensive. Potential optimizations:

    • Decompose ChannelManager state on a per-channel basis (still committed atomically, but only dirty channels are re-serialized)
    • Apply a binary diff algorithm for incremental updates, persisting only the delta from the previous state
  4. Full node integration test: Add a test that uses QueuedKVStoreSync with a complete node setup, verifying that batched persistence works correctly across channel creation, payments, and node restart/reload.
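On item 1, the atomicity gap can be narrowed without a database if the backend is allowed to change its on-disk layout: serialize the whole batch into a single file, fsync it, and rename it into place. This is only a sketch under that assumption; it would not retrofit onto a per-key-file layout like `FilesystemStore`'s, and `commit_batch` plus its length-prefixed framing are hypothetical:

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Write every key/value pair of a batch into one temporary file, then
// atomically rename it into place. On POSIX filesystems rename() is atomic,
// so after a crash a reader sees either the old batch file or the new one,
// never a partial mix of keys.
fn commit_batch(dir: &Path, batch: &[(&str, &[u8])]) -> std::io::Result<()> {
    let tmp = dir.join("batch.tmp");
    let final_path = dir.join("batch.bin");
    let mut f = fs::File::create(&tmp)?;
    for (key, value) in batch {
        // Length-prefixed framing: key_len, key bytes, value_len, value bytes.
        f.write_all(&(key.len() as u32).to_le_bytes())?;
        f.write_all(key.as_bytes())?;
        f.write_all(&(value.len() as u32).to_le_bytes())?;
        f.write_all(value)?;
    }
    f.sync_all()?; // flush data to disk before the rename makes it visible
    fs::rename(&tmp, &final_path)?;
    Ok(())
}
```

A SQLite-backed `KVStore` gets the same guarantee more simply by wrapping the batch in a single transaction.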

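For the per-channel decomposition in item 3, one hypothetical shape is a dirty-set that records which channels changed since the last commit, so only those are re-serialized (the `PerChannelPersister` type and its methods are illustrative, not LDK API):

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical per-channel persistence: rather than re-serializing the whole
// ChannelManager, track which channels changed since the last commit and only
// re-serialize those. The resulting per-channel blobs would still need to be
// committed atomically together with the shared manager state.
struct PerChannelPersister {
    serialized: HashMap<u64, Vec<u8>>, // channel id -> last serialized blob
    dirty: HashSet<u64>,               // channels modified since last commit
}

impl PerChannelPersister {
    fn new() -> Self { Self { serialized: HashMap::new(), dirty: HashSet::new() } }

    fn mark_dirty(&mut self, channel_id: u64) { self.dirty.insert(channel_id); }

    // serialize_fn stands in for real per-channel serialization. Returns the
    // batch of (channel id, blob) pairs that needs to be written this commit.
    fn commit<F: Fn(u64) -> Vec<u8>>(&mut self, serialize_fn: F) -> Vec<(u64, Vec<u8>)> {
        let mut batch = Vec::new();
        for id in self.dirty.drain() {
            let blob = serialize_fn(id);
            self.serialized.insert(id, blob.clone());
            batch.push((id, blob));
        }
        batch
    }
}
```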
joostjager and others added 2 commits January 13, 2026 13:57
Add a KVStoreSync wrapper that queues writes and removes until
explicitly committed. This enables batching multiple monitor updates
into a single persistence operation, reducing I/O overhead.

Key features:
- Writes and removes are queued until commit() is called
- Reads check pending ops first, then fall back to inner store
- Multiple ops to the same key are deduplicated (latest wins)
- List operations reflect pending writes and removes

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add a new `kv_store` parameter to `ChannelManager::new()` and a
`persist()` method that handles persistence of the channel manager
to the provided KVStore. This moves persistence logic into
ChannelManager itself, similar to what the background processor
currently does externally.

Changes:
- Add `K: Deref` type parameter to `ChannelManager` where
  `K::Target: KVStoreSync`
- Add `kv_store: K` field to the struct
- Add `persist()` method that checks `get_and_clear_needs_persistence()`
  and writes to the KV store if needed
- Update `ChannelManagerReadArgs` to include `kv_store` parameter
- Update `AChannelManager` trait with `KVStore` and `K` associated types
- Update type aliases (`SimpleArcChannelManager`, `SimpleRefChannelManager`,
  `SimpleArcPeerManager`, `SimpleRefPeerManager`, etc.)
- Update test infrastructure and doctests

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@ldk-reviews-bot

👋 Hi! I see this is a draft PR.
I'll wait to assign reviewers until you mark it as ready for review.
Just convert it out of draft status when you're ready for review!

joostjager and others added 3 commits January 14, 2026 14:56
Modify get_and_clear_pending_msg_events to take the write lock on
total_consistency_lock for the entire operation, then serialize and
persist the ChannelManager before returning the events. This ensures
the persisted state exactly matches the events being returned.

Previously, persist() was called first (releasing the lock), then events
were gathered under a read lock. This left a window where incoming
messages could modify state between persist and event gathering,
causing the persisted snapshot to not match the returned events.

Also adds write_without_lock() method to enable serialization when the
caller already holds the consistency lock.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Modify process_events_body macro to persist the ChannelManager before
handing events to the user handler. This ensures that if the handler
crashes, events will be replayed on restart since they're still in
the persisted state.

Also calls commit() on the KVStore after writing, which flushes all
pending writes (including any queued monitor updates) atomically.

The persist() method is removed as persistence now happens automatically
as part of event processing.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@joostjager
Contributor Author

Added three commits that make persistence atomic with event processing. First, extracted a write_without_lock() method that serializes the ChannelManager assuming the caller already holds the consistency lock. Then modified get_and_clear_pending_msg_events to hold the write lock for the entire operation, gathering events and persisting atomically, so the persisted state exactly matches the returned events with no race window. Finally, modified process_events_body to persist the ChannelManager before calling user event handlers (so that if a handler crashes, events are replayed on restart) and to call commit() on the KVStore to flush all pending writes atomically, including any queued monitor updates. The persist() method is removed since persistence now happens automatically during event processing.
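The lock discipline described above can be sketched in miniature. `Node`, `NodeState`, and the method name below are illustrative stand-ins, not the actual ChannelManager API; the point is only that gathering events and snapshotting state happen under one continuously held write lock:

```rust
use std::sync::RwLock;

// Stand-in for the state guarded by the consistency lock.
struct NodeState {
    pending_events: Vec<String>,
    balance_msat: u64,
}

struct Node {
    state: RwLock<NodeState>,
}

impl Node {
    // Write lock held for the whole operation: no incoming message can modify
    // state between gathering events and taking the serialization snapshot,
    // so the snapshot always matches the returned events.
    fn get_and_clear_pending_events_and_persist(&self) -> (Vec<String>, Vec<u8>) {
        let mut state = self.state.write().unwrap();
        let events = std::mem::take(&mut state.pending_events);
        // Stand-in for serializing the manager; in the real flow this snapshot
        // would be queued and commit() called before the events are returned.
        let snapshot = state.balance_msat.to_le_bytes().to_vec();
        (events, snapshot)
    }
}
```

The earlier ordering (persist under one lock acquisition, gather events under a second) is exactly what allows another thread to sneak a state change into the gap; holding the single write lock across both steps closes it.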
