
Conversation

ellemouton
Collaborator

We remove the mutex that was previously held between DB calls and calls that update the graphCache. This is done so that the underlying DB calls can make use of batch requests, which they currently cannot do because the mutex prevents multiple requests from calling the methods at once.

The cacheMu was originally added during a code refactor that moved the graphCache out of the KVStore and into the ChannelGraph; the aim was to have a best-effort way of ensuring that updates to the DB and updates to the graphCache were as consistent/atomic as possible.
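For illustration, a minimal sketch of the locking change described above; every type and method name below is a stand-in invented for this sketch, not the actual lnd API:

```go
package graphsketch

import "sync"

// Illustrative stand-ins for the real types.
type edge struct{ chanID uint64 }

type kvStore struct{}

// AddChannelEdge stands in for the (potentially batched) DB write.
func (s *kvStore) AddChannelEdge(e edge) error { return nil }

type graphCache struct{}

func (g *graphCache) AddChannel(e edge) {}

type channelGraph struct {
	cacheMu sync.Mutex
	db      kvStore
	cache   graphCache
}

// Before: cacheMu serialised every caller, so the DB layer only ever saw
// one request at a time and its batch scheduler could never coalesce
// concurrent writes into a single transaction.
func (c *channelGraph) addChannelEdgeOld(e edge) error {
	c.cacheMu.Lock()
	defer c.cacheMu.Unlock()

	if err := c.db.AddChannelEdge(e); err != nil {
		return err
	}
	c.cache.AddChannel(e)

	return nil
}

// After: no ChannelGraph-level mutex. Concurrent callers can reach the DB
// at the same time, so batch requests can be coalesced. The cache is only
// updated once the DB write has succeeded.
func (c *channelGraph) addChannelEdgeNew(e edge) error {
	if err := c.db.AddChannelEdge(e); err != nil {
		return err
	}
	c.cache.AddChannel(e)

	return nil
}
```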

Contributor

coderabbitai bot commented May 12, 2025

Review skipped: auto reviews are limited to specific labels (llm-review).

@ellemouton ellemouton requested a review from bitromortac May 12, 2025 07:10
Collaborator

@bitromortac bitromortac left a comment


tACK, I see more stable behavior with respect to peer connections, I haven't seen pong disconnects

edit: I still have seen some, though I think it's reduced

Collaborator

@bhandras bhandras left a comment


Thanks! LGTM 🎉

Collaborator

@ziggie1984 ziggie1984 left a comment


LGTM (the cacheMu was introduced in #9550, just for reference)

@yyforyongyu
Member

This will change the behavior of how we update the cache and DB. Previously we'd put the cache write and DB write in a new batch.Request's Update, which then gets executed in:

err := req.Update(tx)

Here we use a lock to make sure the cache and disk stay consistent,

lnd/batch/batch.go

Lines 43 to 49 in ee25c22

// If a cache lock was provided, hold it until the this method returns.
// This is critical for ensuring external consistency of the operation,
// so that caches don't get out of sync with the on disk state.
if b.locker != nil {
	b.locker.Lock()
	defer b.locker.Unlock()
}

With the current change there's no such guarantee.

@ellemouton
Collaborator Author

ellemouton commented May 12, 2025

@yyforyongyu - indeed, although I think this is the result of pulling the graphCache out of the CRUD layer and not necessarily this change specifically.

I think the problem with the behaviour before (A) was: we'd update the cache even if the overall Update could potentially fail. Whereas now (B), we only update the cache if the DB write is successful. So both versions don't really guarantee that the two are in sync, I'd say. That's why we then added this new cacheMu at the ChannelGraph layer, to make everything overall more consistent, but this then led to the issue of batch.Request not being used properly.

If we want the exact same behaviour as before, then we'd need to thread the graphCache back to the CRUD layer & have it do the updates at transaction time, but I'm not sure this is worth it/necessary (though I can do it if we think that's the wanted behaviour here). We definitely want to keep the ChannelGraph as the owner of the cache for future updates.

So I think we need to decide whether we prefer behaviour A or B. Both are slightly incorrect; we'd need to do something far more complex to make the cache completely consistent with the DB.
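To make the two failure modes concrete, a rough self-contained sketch (all names are hypothetical; this is not the real lnd code):

```go
package cachesketch

import "errors"

// Minimal stand-ins invented for this sketch only.
type graphCache struct{ chans map[uint64]bool }

func (c *graphCache) addChannel(id uint64) { c.chans[id] = true }

type graphDB struct {
	chans     map[uint64]bool
	commitErr error // simulates the surrounding batch commit failing
}

// Behaviour A (old, cache updated by the CRUD layer): the cache write
// happens inside the per-request Update closure, before the surrounding
// batch transaction commits. If the commit later fails, the cache has
// already been mutated and ends up ahead of the DB.
func addEdgeA(db *graphDB, cache *graphCache, id uint64) error {
	update := func() error {
		db.chans[id] = true // write staged inside the tx
		cache.addChannel(id)
		return nil
	}
	if err := update(); err != nil {
		return err
	}
	if db.commitErr != nil {
		delete(db.chans, id) // the DB rolls back, the cache keeps the entry
		return db.commitErr
	}
	return nil
}

// Behaviour B (new, cache owned by ChannelGraph): the cache is only
// touched once the DB write has fully succeeded, so it can never hold
// something the DB rejected. The remaining gap: two concurrent writers
// may apply their cache updates in a different order than their DB
// commits, since no lock spans both steps any more.
func addEdgeB(db *graphDB, cache *graphCache, id uint64) error {
	if db.commitErr != nil {
		return errors.New("db write failed, cache left untouched")
	}
	db.chans[id] = true
	cache.addChannel(id)
	return nil
}
```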

@ziggie1984
Collaborator

Trying to give a high-level overview of what resulted in flapping peer connections:

  1. Ping/Pong has a default timeout of 30 sec, and all of a sudden peers were disconnecting quite regularly. Even channel peers were part of this problem, although they normally count as "persistentPeers"; however, if we encounter unstable peers over time we will exponentially back off => I think we should at some point drop the connection completely so that we do not end up never connecting again when the backoff becomes very high, i.e. we need to stop this cycle at some point.

This can lead to a maximum disconnect time of 1 h (default), as sketched below:

lnd/server.go

Line 4796 in ee25c22

backoff := s.nextPeerBackoff(pubStr, p.StartTime())
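A minimal sketch of that exponential backoff shape; the initial value is an assumption made for this sketch and the real nextPeerBackoff logic in server.go differs in detail, but the 1 h default cap matches the behaviour described:

```go
package backoffsketch

import "time"

const (
	// initialBackoff is an assumed starting value for this sketch.
	initialBackoff = time.Second

	// maxBackoff mirrors the 1 h default cap mentioned above.
	maxBackoff = time.Hour
)

// nextBackoff doubles the previous backoff for a peer that keeps
// flapping and caps it at maxBackoff, so successive reconnect attempts
// can end up a full hour apart.
func nextBackoff(prev time.Duration) time.Duration {
	if prev == 0 {
		return initialBackoff
	}
	next := 2 * prev
	if next > maxBackoff {
		return maxBackoff
	}
	return next
}
```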

  2. So the problem on the Ping/Pong side was not that we did not receive the Pong, but that we were not able to process, in time, the queue that built up in the readHandler of the brontide peer. Some examples:
node 1:

2025-05-09 02:54:41.755 [WRN] PEER: Peer(0206dfc7fa5f56f3ce752c6621b1d03223c11b16cffb05835b67481c12c6d87031): pong response failure for 0206dfc7fa5f56f3ce752c6621b1d03223c11b16cffb05835b67481c12c6d87031@96.230.252.205:47013: timeout while waiting for pong response -- disconnecting
2025-05-09 02:54:41.755 [INF] PEER: Peer(0206dfc7fa5f56f3ce752c6621b1d03223c11b16cffb05835b67481c12c6d87031): disconnecting 0206dfc7fa5f56f3ce752c6621b1d03223c11b16cffb05835b67481c12c6d87031@96.230.252.205:47013, reason: pong response failure for 0206dfc7fa5f56f3ce752c6621b1d03223c11b16cffb05835b67481c12c6d87031@96.230.252.205:47013: timeout while waiting for pong response -- disconnecting


node 2:

2025-05-09 02:54:41.756 [WRN] PEER: Peer(028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a): pong response failure for 028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a@96.230.252.205:9735: timeout while waiting for pong response -- disconnecting
2025-05-09 02:54:41.756 [INF] PEER: Peer(028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a): disconnecting 028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a@96.230.252.205:9735, reason: pong response failure for 028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a@96.230.252.205:9735: timeout while waiting for pong response -- disconnecting


node 2 pings node 1:

2025-05-09 02:54:11.755 [DBG] PEER: Peer(028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a): Sending Ping(ping_bytes=0040042017936267548896bf81897816ab6547eaf5825f3017650000000000000000000072c43c4e8408d4e84ac8726eb8df9f5d97ef7b6bfc2615642485e1d9290cdd164d6b1d68ed5c0217d46227ef) to 028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a@96.230.252.205:9735


node 1 pongs node 2:

2025-05-09 02:54:11.758 [DBG] PEER: Peer(0206dfc7fa5f56f3ce752c6621b1d03223c11b16cffb05835b67481c12c6d87031): Sending Pong(len(pong_bytes)=2133) to 0206dfc7fa5f56f3ce752c6621b1d03223c11b16cffb05835b67481c12c6d87031@96.230.252.205:47013


just as before, node 2 is in the middle of processing a bunch of ChannelUpdates when it times out

2025-05-09 02:54:41.756 [WRN] PEER: Peer(028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a): pong response failure for 028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a@96.230.252.205:9735: timeout while waiting for pong response -- disconnecting

The readHandler would block until slots became available:

case <-ms.producerSema:

We only had 5 slots available initially; however, even 1000 slots led to the same disconnects, as testing showed.

Although the processing of most of the gossip messages is done in an async way here:

lnd/discovery/gossiper.go

Lines 940 to 941 in ee25c22

select {
case d.networkMsgs <- nMsg:

The update of the DB was done sequentially due to the above mutex lock, so the async nature didn't really have an effect if we had a burst of gossip messages.

  3. While we are at it, I think we need to make sure that all messages, including the query messages, are processed in an async way so as not to block the readHandler (see the sketch below).

For example this case here:

err := syncer.ProcessQueryMsg(m, peer.QuitSignal())
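A minimal sketch of what handing such query messages off asynchronously could look like, so the read loop is not blocked on ProcessQueryMsg; the names and the worker-pool approach are placeholders, not an actual proposed patch:

```go
package querysketch

// Placeholder types standing in for the wire message and the gossip
// syncer; the real lnd types and signatures differ.
type queryMsg struct{}

type gossipSyncer struct{}

func (s *gossipSyncer) processQueryMsg(m queryMsg, quit <-chan struct{}) error {
	// ... potentially expensive work, e.g. building a reply from the DB ...
	return nil
}

// handleQueryAsync hands the message to a bounded set of workers so the
// peer's readHandler can immediately go back to reading (and e.g. see a
// Pong) instead of blocking until the query has been answered.
func handleQueryAsync(s *gossipSyncer, m queryMsg, sem chan struct{},
	quit <-chan struct{}) {

	select {
	case sem <- struct{}{}: // acquire a worker slot (bounds concurrency)
	case <-quit:
		return
	}

	go func() {
		defer func() { <-sem }() // release the slot when done

		if err := s.processQueryMsg(m, quit); err != nil {
			_ = err // the real code would log this
		}
	}()
}
```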

cc @yyforyongyu

@yyforyongyu
Member

@ziggie1984 but the msg is done via go d.handleNetworkMessages? basically,

the msgStream has its own msgConsumer, which could be blocked at the apply here,

ms.apply(msg)

when we receive a gossip msg from the wire, it's added to the discStream, which is not blocking unless ms.producerSema is blocked,

discStream.AddMsg(msg)

unless the apply is blocked here, the ms.producerSema should be refilled immediately,

lnd/peer/brontide.go

Lines 1814 to 1821 in ee25c22

ms.apply(msg)
// We've just successfully processed an item, so we'll signal
// to the producer that a new slot in the buffer. We'll use
// this to bound the size of the buffer to avoid allowing it to
// grow indefinitely.
select {
case ms.producerSema <- struct{}{}:

the discStream is created via newDiscMsgStream, whose apply is hooked to the ProcessRemoteAnnouncement, which sends the gossip msg here,

case d.networkMsgs <- nMsg:

The gossip msg is processed in another goroutine,

lnd/discovery/gossiper.go

Lines 1530 to 1532 in ee25c22

go d.handleNetworkMessages(
announcement, &announcements, annJobID,
)

unless it's blocked by the validation barrier here,

annJobID, err := d.vb.InitJobDependencies(

I don't see how it can affect the msgConsumer? Unless the sending to d.networkMsgs is blocked here since d.networkMsgs is not buffered,

lnd/discovery/gossiper.go

Lines 940 to 941 in ee25c22

select {
case d.networkMsgs <- nMsg:
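To make that back-pressure chain explicit, here is a simplified, self-contained version of the msgStream pattern under discussion (not the actual brontide code): the read loop blocks on producerSema once the consumer, whose apply ultimately sends into the gossiper's unbuffered networkMsgs channel, falls behind.

```go
package streamsketch

import "sync"

type msg struct{}

// msgStream is a simplified bounded message stream: the producer (the
// readHandler) blocks on producerSema once `size` messages are queued
// but not yet applied, which is how a slow apply back-pressures the
// read loop.
type msgStream struct {
	mtx          sync.Mutex
	msgs         []msg
	producerSema chan struct{}
	wakeup       chan struct{}
	apply        func(msg)
	quit         chan struct{}
}

func newMsgStream(size int, apply func(msg)) *msgStream {
	ms := &msgStream{
		producerSema: make(chan struct{}, size),
		wakeup:       make(chan struct{}, 1),
		apply:        apply,
		quit:         make(chan struct{}),
	}
	for i := 0; i < size; i++ {
		ms.producerSema <- struct{}{}
	}
	go ms.msgConsumer()

	return ms
}

// AddMsg is called from the read loop. It blocks once all slots are used.
func (ms *msgStream) AddMsg(m msg) {
	<-ms.producerSema

	ms.mtx.Lock()
	ms.msgs = append(ms.msgs, m)
	ms.mtx.Unlock()

	select {
	case ms.wakeup <- struct{}{}:
	default:
	}
}

func (ms *msgStream) msgConsumer() {
	for {
		ms.mtx.Lock()
		if len(ms.msgs) == 0 {
			ms.mtx.Unlock()
			select {
			case <-ms.wakeup:
				continue
			case <-ms.quit:
				return
			}
		}
		m := ms.msgs[0]
		ms.msgs = ms.msgs[1:]
		ms.mtx.Unlock()

		// If apply blocks here, producerSema is never refilled and
		// AddMsg (and with it the readHandler) eventually blocks too.
		ms.apply(m)

		// Signal a free slot back to the producer.
		ms.producerSema <- struct{}{}
	}
}
```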

@yyforyongyu
Member

then we'd need to thread the graphCache back to the CRUD layer & have it do the updates at transaction time, but I'm not sure this is worth it/necessary (though I can do it if we think that's the wanted behaviour here).

Yeah, definitely not that route. I think the old cache design needs more work; meanwhile I just want to make sure there's no side effect, as this particular area is hard to debug or even notice in the first place. I guess since we have behavior B we should be fine, as no outdated state will be preserved, since we also do a freshness check before saving to the DB.
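For illustration, a sketch of the kind of freshness check meant here; the types are placeholders, and the actual checks in the gossiper/graph code are more involved:

```go
package freshsketch

import "time"

// storedPolicy stands in for the last channel update persisted for a
// given channel and direction; a placeholder for this sketch.
type storedPolicy struct {
	lastUpdate time.Time
}

// isStaleUpdate reports whether the incoming update is not newer than
// what is already stored, in which case it is neither written to the DB
// nor applied to the cache. Because the cache (behaviour B) is only
// updated after a successful DB write, a stale update can never be
// preserved in either place.
func isStaleUpdate(have *storedPolicy, incoming time.Time) bool {
	if have == nil {
		return false // nothing stored yet, so the update is fresh
	}
	return !incoming.After(have.lastUpdate)
}
```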

@ziggie1984
Collaborator

ziggie1984 commented May 12, 2025

I don't see how it can affect the msgConsumer? Unless the sending to d.networkMsgs is blocked here since d.networkMsgs is not buffered,

I think that's the blocker, and also the fact that the other calls (the query calls) are not done in an async manner, so I think both of these reasons kick in if we have a burst of messages. (In particular, I saw this behaviour with pinned peers, which trigger a historical rescan on every reconnect, I think.)

gossiper.vb = NewValidationBarrier(1000, gossiper.quit)

@djkazic
Contributor

djkazic commented May 12, 2025

I was able to fix the disconnects with an extremely hacky patch.

Not suggesting that we merge this -- but it's nonetheless an interesting datapoint. In the patch I push aside gossip messages into a buffered channel with a separate goroutine to drain it.

This allows Pong messages that were buried underneath a mountain of gossip messages to be processed in a timely manner, as the Pongs are now surfaced quickly instead of waiting for the gossip messages to be read and processed. Even though this only amounts to 2 to 3 ms per gossip message, if there are a lot of them that can easily push us beyond the 30-second limit.

djkazic@ada9c21
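The gist of that approach, as a rough self-contained sketch (placeholder types; not the actual patch in djkazic@ada9c21):

```go
package drainsketch

type gossipMsg struct{}

// gossipDrain decouples the read loop from gossip processing: the read
// loop only does a (usually non-blocking) send into a large buffered
// channel, while a dedicated goroutine feeds the messages to the
// gossiper at its own pace. Pongs read after a burst of gossip are then
// handled almost immediately instead of waiting behind the burst.
type gossipDrain struct {
	msgs    chan gossipMsg
	process func(gossipMsg) // e.g. hand off to the gossiper
	quit    chan struct{}
}

func newGossipDrain(size int, process func(gossipMsg)) *gossipDrain {
	d := &gossipDrain{
		msgs:    make(chan gossipMsg, size),
		process: process,
		quit:    make(chan struct{}),
	}
	go d.drain()

	return d
}

// enqueue is called from the read loop. With a large enough buffer this
// returns immediately; if the buffer fills up we fall back to blocking,
// which restores the original back-pressure behaviour.
func (d *gossipDrain) enqueue(m gossipMsg) {
	select {
	case d.msgs <- m:
	case <-d.quit:
	}
}

func (d *gossipDrain) drain() {
	for {
		select {
		case m := <-d.msgs:
			d.process(m)
		case <-d.quit:
			return
		}
	}
}
```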

@djkazic
Contributor

djkazic commented May 13, 2025

tACK

My LND node has experienced improved connectivity for channels in the last ~13 hrs since this was deployed. There's a handful of disconnecting peers, but I think that's normal; 30 of those were pong timeout disconnects.

@ellemouton ellemouton requested a review from yyforyongyu May 13, 2025 12:20
@ellemouton
Collaborator Author

@yyforyongyu -

I guess since we have behavior B we should be fine as no outdated state will be preserved since we also do freshness check before saving to the db.

just want to check that this means you are ok with this fix?

Member

@yyforyongyu yyforyongyu left a comment


🙏

@guggero guggero merged commit b0cba7d into lightningnetwork:master May 13, 2025
91 of 99 checks passed
