
Conversation

@joostjager
Contributor

Summary

This PR introduces a mechanism to batch channel monitor and channel manager persistence, committing all writes together at well-defined chokepoints.

Instead of persisting monitors synchronously when they're updated, writes go to a QueuedKVStoreSync wrapper that queues them in memory without touching disk. Multiple writes to the same monitor are deduplicated, keeping only the latest state. At chokepoints where data leaves the system (get_and_clear_pending_msg_events, get_and_clear_pending_events), we call commit() on the queued store, which flushes all pending writes (monitors + channel manager) to the underlying storage in a single batch.
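As a rough illustration of the queue-and-commit behavior, here is a minimal sketch against a simplified stand-in for the `KVStoreSync` interface (the trait and types below are illustrative, not LDK's actual API):

```rust
use std::collections::HashMap;

// Hypothetical simplified stand-in for LDK's KVStoreSync trait.
trait KVStoreSync {
    fn write(&mut self, key: &str, value: Vec<u8>);
    fn remove(&mut self, key: &str);
    fn read(&self, key: &str) -> Option<Vec<u8>>;
}

// In-memory backing store for the example.
#[derive(Default)]
struct MemStore {
    data: HashMap<String, Vec<u8>>,
}

impl KVStoreSync for MemStore {
    fn write(&mut self, key: &str, value: Vec<u8>) { self.data.insert(key.to_string(), value); }
    fn remove(&mut self, key: &str) { self.data.remove(key); }
    fn read(&self, key: &str) -> Option<Vec<u8>> { self.data.get(key).cloned() }
}

// One pending op per key; a later op to the same key overwrites the earlier one.
enum PendingOp { Write(Vec<u8>), Remove }

// Queued wrapper: writes and removes are buffered until commit().
struct QueuedKVStoreSync<S: KVStoreSync> {
    inner: S,
    pending: HashMap<String, PendingOp>,
}

impl<S: KVStoreSync> QueuedKVStoreSync<S> {
    fn new(inner: S) -> Self { Self { inner, pending: HashMap::new() } }

    fn write(&mut self, key: &str, value: Vec<u8>) {
        // Dedup: only the latest state for a key is kept.
        self.pending.insert(key.to_string(), PendingOp::Write(value));
    }

    fn remove(&mut self, key: &str) {
        self.pending.insert(key.to_string(), PendingOp::Remove);
    }

    // Reads see pending ops first, then fall through to the inner store.
    fn read(&self, key: &str) -> Option<Vec<u8>> {
        match self.pending.get(key) {
            Some(PendingOp::Write(v)) => Some(v.clone()),
            Some(PendingOp::Remove) => None,
            None => self.inner.read(key),
        }
    }

    // Flush all pending ops to the inner store in one batch.
    fn commit(&mut self) {
        for (key, op) in self.pending.drain() {
            match op {
                PendingOp::Write(v) => self.inner.write(&key, v),
                PendingOp::Remove => self.inner.remove(&key),
            }
        }
    }
}
```

Until `commit()` runs, nothing reaches the inner store, so repeated monitor updates between chokepoints cost one buffered entry rather than one disk write each.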

It is safe to postpone writes as long as no data has left the system: neither over the network (Lightning messages to peers) nor to the application (events delivered to the user). Once data leaves these boundaries, the node has made external commitments that must survive a crash. By persisting exactly at these exit points, we ensure the persisted state always reflects what has been externally communicated. If we crash before commit, nothing was sent, so we replay from the last committed state; if we crash after, all persisted state reflects what was sent. This eliminates the race window where a monitor is persisted but the manager isn't yet, which can cause state desync and force closures on restart.

What's Missing

This is a proof-of-concept. To be production-ready:

  1. Atomic multi-key writes in KVStore: The current commit() writes keys sequentially. A crash mid-commit could leave partial state. Production use requires a KVStore implementation that supports atomic multi-key writes. Note that this is impossible for filesystem-based backends like FilesystemStore - the desync gap cannot be fully closed with that backend. However, SQLite or other database backends can achieve true atomicity using database transactions.

  2. Complete persistence scenario coverage: Only get_and_clear_pending_msg_events and get_and_clear_pending_events currently trigger commits. Other scenarios may need coverage:

    • Block sync / chain monitoring updates
    • Shutdown / graceful close
    • Background persistence timer (if any pending writes exist)
  3. ChannelManager serialization size: The current approach serializes the entire ChannelManager on each commit. For nodes with many channels, this could be expensive. Potential optimizations:

    • Decompose ChannelManager state on a per-channel basis (still committed atomically, but only dirty channels are re-serialized)
    • Apply a binary diff algorithm for incremental updates, persisting only the delta from the previous state
  4. Full node integration test: Add a test that uses QueuedKVStoreSync with a complete node setup, verifying that batched persistence works correctly across channel creation, payments, and node restart/reload.
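On item 1, the atomicity gap can be narrowed without a database if the backend is allowed to change its on-disk layout: serialize the whole batch into a single file, fsync it, and rename it into place. This is only a sketch under that assumption; it would not retrofit onto a per-key-file layout like `FilesystemStore`'s, and `commit_batch` plus its length-prefixed framing are hypothetical:

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Write every key/value pair of a batch into one temporary file, then
// atomically rename it into place. On POSIX filesystems rename() is atomic,
// so after a crash a reader sees either the old batch file or the new one,
// never a partial mix of keys.
fn commit_batch(dir: &Path, batch: &[(&str, &[u8])]) -> std::io::Result<()> {
    let tmp = dir.join("batch.tmp");
    let final_path = dir.join("batch.bin");
    let mut f = fs::File::create(&tmp)?;
    for (key, value) in batch {
        // Length-prefixed framing: key_len, key bytes, value_len, value bytes.
        f.write_all(&(key.len() as u32).to_le_bytes())?;
        f.write_all(key.as_bytes())?;
        f.write_all(&(value.len() as u32).to_le_bytes())?;
        f.write_all(value)?;
    }
    f.sync_all()?; // flush data to disk before the rename makes it visible
    fs::rename(&tmp, &final_path)?;
    Ok(())
}
```

A SQLite-backed `KVStore` gets the same guarantee more simply by wrapping the batch in a single transaction.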

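For the per-channel decomposition in item 3, one hypothetical shape is a dirty-set that records which channels changed since the last commit, so only those are re-serialized (the `PerChannelPersister` type and its methods are illustrative, not LDK API):

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical per-channel persistence: rather than re-serializing the whole
// ChannelManager, track which channels changed since the last commit and only
// re-serialize those. The resulting per-channel blobs would still need to be
// committed atomically together with the shared manager state.
struct PerChannelPersister {
    serialized: HashMap<u64, Vec<u8>>, // channel id -> last serialized blob
    dirty: HashSet<u64>,               // channels modified since last commit
}

impl PerChannelPersister {
    fn new() -> Self { Self { serialized: HashMap::new(), dirty: HashSet::new() } }

    fn mark_dirty(&mut self, channel_id: u64) { self.dirty.insert(channel_id); }

    // serialize_fn stands in for real per-channel serialization. Returns the
    // batch of (channel id, blob) pairs that needs to be written this commit.
    fn commit<F: Fn(u64) -> Vec<u8>>(&mut self, serialize_fn: F) -> Vec<(u64, Vec<u8>)> {
        let mut batch = Vec::new();
        for id in self.dirty.drain() {
            let blob = serialize_fn(id);
            self.serialized.insert(id, blob.clone());
            batch.push((id, blob));
        }
        batch
    }
}
```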
joostjager and others added 2 commits January 13, 2026 13:57
Add a KVStoreSync wrapper that queues writes and removes until
explicitly committed. This enables batching multiple monitor updates
into a single persistence operation, reducing I/O overhead.

Key features:
- Writes and removes are queued until commit() is called
- Reads check pending ops first, then fall back to inner store
- Multiple ops to the same key are deduplicated (latest wins)
- List operations reflect pending writes and removes

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add a new `kv_store` parameter to `ChannelManager::new()` and a
`persist()` method that handles persistence of the channel manager
to the provided KVStore. This moves persistence logic into
ChannelManager itself, similar to what the background processor
currently does externally.

Changes:
- Add `K: Deref` type parameter to `ChannelManager` where
  `K::Target: KVStoreSync`
- Add `kv_store: K` field to the struct
- Add `persist()` method that checks `get_and_clear_needs_persistence()`
  and writes to the KV store if needed
- Update `ChannelManagerReadArgs` to include `kv_store` parameter
- Update `AChannelManager` trait with `KVStore` and `K` associated types
- Update type aliases (`SimpleArcChannelManager`, `SimpleRefChannelManager`,
  `SimpleArcPeerManager`, `SimpleRefPeerManager`, etc.)
- Update test infrastructure and doctests

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@ldk-reviews-bot

👋 Hi! I see this is a draft PR.
I'll wait to assign reviewers until you mark it as ready for review.
Just convert it out of draft status when you're ready for review!

joostjager and others added 3 commits January 14, 2026 14:56
Modify get_and_clear_pending_msg_events to take the write lock on
total_consistency_lock for the entire operation, then serialize and
persist the ChannelManager before returning the events. This ensures
the persisted state exactly matches the events being returned.

Previously, persist() was called first (releasing the lock), then events
were gathered under a read lock. This left a window where incoming
messages could modify state between persist and event gathering,
causing the persisted snapshot to not match the returned events.

Also adds write_without_lock() method to enable serialization when the
caller already holds the consistency lock.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Modify process_events_body macro to persist the ChannelManager before
handing events to the user handler. This ensures that if the handler
crashes, events will be replayed on restart since they're still in
the persisted state.

Also calls commit() on the KVStore after writing, which flushes all
pending writes (including any queued monitor updates) atomically.

The persist() method is removed as persistence now happens automatically
as part of event processing.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@joostjager
Contributor Author

Added three commits that make persistence atomic with event processing. First, extracted a write_without_lock() method that serializes the ChannelManager assuming the caller already holds the consistency lock. Then modified get_and_clear_pending_msg_events to hold the write lock for the entire operation, gathering events and persisting atomically, so the persisted state exactly matches the returned events with no race window. Finally, modified process_events_body to persist the ChannelManager before calling user event handlers (so that if a handler crashes, events are replayed on restart) and to call commit() on the KVStore to flush all pending writes atomically, including any queued monitor updates. The persist() method is removed since persistence now happens automatically during event processing.
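The lock discipline described above can be sketched in miniature. `Node`, `NodeState`, and the method name below are illustrative stand-ins, not the actual ChannelManager API; the point is only that gathering events and snapshotting state happen under one continuously held write lock:

```rust
use std::sync::RwLock;

// Stand-in for the state guarded by the consistency lock.
struct NodeState {
    pending_events: Vec<String>,
    balance_msat: u64,
}

struct Node {
    state: RwLock<NodeState>,
}

impl Node {
    // Write lock held for the whole operation: no incoming message can modify
    // state between gathering events and taking the serialization snapshot,
    // so the snapshot always matches the returned events.
    fn get_and_clear_pending_events_and_persist(&self) -> (Vec<String>, Vec<u8>) {
        let mut state = self.state.write().unwrap();
        let events = std::mem::take(&mut state.pending_events);
        // Stand-in for serializing the manager; in the real flow this snapshot
        // would be queued and commit() called before the events are returned.
        let snapshot = state.balance_msat.to_le_bytes().to_vec();
        (events, snapshot)
    }
}
```

The earlier ordering (persist under one lock acquisition, gather events under a second) is exactly what allows another thread to sneak a state change into the gap; holding the single write lock across both steps closes it.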
