
Conversation

ellemouton
Collaborator

We remove the mutex that was previously held between DB calls and calls that update the graphCache. This is done so that the underlying DB calls can make use of batch requests, which they currently cannot do because the mutex prevents multiple requests from calling the methods at once.

The cacheMu was originally added during a code refactor that moved the graphCache out of the KVStore and into the ChannelGraph; the aim was to have a best-effort way of ensuring that updates to the DB and updates to the graphCache were as consistent/atomic as possible.
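For illustration, a minimal sketch of the locking change described above; every type and method name below is a stand-in invented for this sketch, not the actual lnd API:

```go
package graphsketch

import "sync"

// Illustrative stand-ins for the real types.
type edge struct{ chanID uint64 }

type kvStore struct{}

// AddChannelEdge stands in for the (potentially batched) DB write.
func (s *kvStore) AddChannelEdge(e edge) error { return nil }

type graphCache struct{}

func (g *graphCache) AddChannel(e edge) {}

type channelGraph struct {
	cacheMu sync.Mutex
	db      kvStore
	cache   graphCache
}

// Before: cacheMu serialised every caller, so the DB layer only ever saw
// one request at a time and its batch scheduler could never coalesce
// concurrent writes into a single transaction.
func (c *channelGraph) addChannelEdgeOld(e edge) error {
	c.cacheMu.Lock()
	defer c.cacheMu.Unlock()

	if err := c.db.AddChannelEdge(e); err != nil {
		return err
	}
	c.cache.AddChannel(e)

	return nil
}

// After: no ChannelGraph-level mutex. Concurrent callers can reach the DB
// at the same time, so batch requests can be coalesced. The cache is only
// updated once the DB write has succeeded.
func (c *channelGraph) addChannelEdgeNew(e edge) error {
	if err := c.db.AddChannelEdge(e); err != nil {
		return err
	}
	c.cache.AddChannel(e)

	return nil
}
```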

Contributor

coderabbitai bot commented May 12, 2025

Review skipped: auto reviews are limited to specific labels (llm-review).

@ellemouton ellemouton requested a review from bitromortac May 12, 2025 07:10
Collaborator

@bitromortac bitromortac left a comment


tACK, I see more stable behavior with respect to peer connections, I haven't seen pong disconnects

edit: I still have seen some, though I think it's reduced

Collaborator

@bhandras bhandras left a comment


Thanks! LGTM 🎉

Collaborator

@ziggie1984 ziggie1984 left a comment


LGTM (the cacheMu was introduced in #9550, just for reference)

@yyforyongyu
Member

This will change the behavior of how we update the cache and DB. Previously we'd put the cache write and DB write in a new batch.Request's Update, which then gets executed in:

err := req.Update(tx)

Here we use a lock to make sure the cache and disk stay consistent,

lnd/batch/batch.go

Lines 43 to 49 in ee25c22

// If a cache lock was provided, hold it until the this method returns.
// This is critical for ensuring external consistency of the operation,
// so that caches don't get out of sync with the on disk state.
if b.locker != nil {
	b.locker.Lock()
	defer b.locker.Unlock()
}

With the current change there's no such guarantee.

@ellemouton
Collaborator Author

ellemouton commented May 12, 2025

@yyforyongyu - indeed, although I think this is the result of pulling the graphCache out of the CRUD layer and not necessarily this change specifically.

I think the problem with the behaviour before (A) was: we'd update the cache even if the overall Update could potentially fail. Whereas now (B), we only update the cache if the DB write is successful. So both versions don't really guarantee that the two are in sync, I'd say. That's why we then added this new cacheMu at the ChannelGraph layer, to make everything overall more consistent, but this then led to the issue of batch.Request not being used properly.

If we want the exact same behaviour as before, then we'd need to thread the graphCache back to the CRUD layer & have it do the updates at transaction time, but I'm not sure this is worth it/necessary (though I can do it if we think that's the wanted behaviour here). We definitely want to keep the ChannelGraph as the owner of the cache for future updates.

So I think we need to decide whether we prefer behaviour A or B. Both are slightly incorrect; we'd need to do something far more complex to make the cache completely consistent with the DB.
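To make the two failure modes concrete, a rough self-contained sketch (all names are hypothetical; this is not the real lnd code):

```go
package cachesketch

import "errors"

// Minimal stand-ins invented for this sketch only.
type graphCache struct{ chans map[uint64]bool }

func (c *graphCache) addChannel(id uint64) { c.chans[id] = true }

type graphDB struct {
	chans     map[uint64]bool
	commitErr error // simulates the surrounding batch commit failing
}

// Behaviour A (old, cache updated by the CRUD layer): the cache write
// happens inside the per-request Update closure, before the surrounding
// batch transaction commits. If the commit later fails, the cache has
// already been mutated and ends up ahead of the DB.
func addEdgeA(db *graphDB, cache *graphCache, id uint64) error {
	update := func() error {
		db.chans[id] = true // write staged inside the tx
		cache.addChannel(id)
		return nil
	}
	if err := update(); err != nil {
		return err
	}
	if db.commitErr != nil {
		delete(db.chans, id) // the DB rolls back, the cache keeps the entry
		return db.commitErr
	}
	return nil
}

// Behaviour B (new, cache owned by ChannelGraph): the cache is only
// touched once the DB write has fully succeeded, so it can never hold
// something the DB rejected. The remaining gap: two concurrent writers
// may apply their cache updates in a different order than their DB
// commits, since no lock spans both steps any more.
func addEdgeB(db *graphDB, cache *graphCache, id uint64) error {
	if db.commitErr != nil {
		return errors.New("db write failed, cache left untouched")
	}
	db.chans[id] = true
	cache.addChannel(id)
	return nil
}
```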

@ziggie1984
Collaborator

Trying to give a high-level overview of what resulted in flapping peer connections:

  1. Ping/Pong has a default timeout of 30 sec, and all of a sudden peers were disconnecting quite regularly. Even channel peers were part of this problem, although they normally count as "persistentPeers"; however, if we encounter unstable peers over time we will exponentially back off => I think we should at some point drop the connection completely so that we do not end up never connecting again when the backoff becomes very high, i.e. we need to stop this cycle at some point.

This can lead to a maximum disconnect time of 1 h (default), as sketched below:

lnd/server.go

Line 4796 in ee25c22

backoff := s.nextPeerBackoff(pubStr, p.StartTime())
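A minimal sketch of that exponential backoff shape; the initial value is an assumption made for this sketch and the real nextPeerBackoff logic in server.go differs in detail, but the 1 h default cap matches the behaviour described:

```go
package backoffsketch

import "time"

const (
	// initialBackoff is an assumed starting value for this sketch.
	initialBackoff = time.Second

	// maxBackoff mirrors the 1 h default cap mentioned above.
	maxBackoff = time.Hour
)

// nextBackoff doubles the previous backoff for a peer that keeps
// flapping and caps it at maxBackoff, so successive reconnect attempts
// can end up a full hour apart.
func nextBackoff(prev time.Duration) time.Duration {
	if prev == 0 {
		return initialBackoff
	}
	next := 2 * prev
	if next > maxBackoff {
		return maxBackoff
	}
	return next
}
```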

  2. So the problem on the Ping/Pong side was not that we did not receive the Pong, but that we were not able to process, in time, the queue that built up in the readHandler of the brontide peer. Some examples:
node 1:

2025-05-09 02:54:41.755 [WRN] PEER: Peer(0206dfc7fa5f56f3ce752c6621b1d03223c11b16cffb05835b67481c12c6d87031): pong response failure for 0206dfc7fa5f56f3ce752c6621b1d03223c11b16cffb05835b67481c12c6d87031@96.230.252.205:47013: timeout while waiting for pong response -- disconnecting
2025-05-09 02:54:41.755 [INF] PEER: Peer(0206dfc7fa5f56f3ce752c6621b1d03223c11b16cffb05835b67481c12c6d87031): disconnecting 0206dfc7fa5f56f3ce752c6621b1d03223c11b16cffb05835b67481c12c6d87031@96.230.252.205:47013, reason: pong response failure for 0206dfc7fa5f56f3ce752c6621b1d03223c11b16cffb05835b67481c12c6d87031@96.230.252.205:47013: timeout while waiting for pong response -- disconnecting


node 2:

2025-05-09 02:54:41.756 [WRN] PEER: Peer(028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a): pong response failure for 028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a@96.230.252.205:9735: timeout while waiting for pong response -- disconnecting
2025-05-09 02:54:41.756 [INF] PEER: Peer(028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a): disconnecting 028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a@96.230.252.205:9735, reason: pong response failure for 028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a@96.230.252.205:9735: timeout while waiting for pong response -- disconnecting


node 2 pings node 1:

2025-05-09 02:54:11.755 [DBG] PEER: Peer(028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a): Sending Ping(ping_bytes=0040042017936267548896bf81897816ab6547eaf5825f3017650000000000000000000072c43c4e8408d4e84ac8726eb8df9f5d97ef7b6bfc2615642485e1d9290cdd164d6b1d68ed5c0217d46227ef) to 028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a@96.230.252.205:9735


node 1 pongs node 2:

2025-05-09 02:54:11.758 [DBG] PEER: Peer(0206dfc7fa5f56f3ce752c6621b1d03223c11b16cffb05835b67481c12c6d87031): Sending Pong(len(pong_bytes)=2133) to 0206dfc7fa5f56f3ce752c6621b1d03223c11b16cffb05835b67481c12c6d87031@96.230.252.205:47013


just as before, node 2 is in the middle of processing a bunch of ChannelUpdates when it times out

2025-05-09 02:54:41.756 [WRN] PEER: Peer(028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a): pong response failure for 028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a@96.230.252.205:9735: timeout while waiting for pong response -- disconnecting

The readHandler would block until slots became available:

case <-ms.producerSema:

We only had 5 slots available initially; however, even 1000 slots led to the same disconnects, as testing showed.

Although the processing of most of the gossip messages is done in an async way here:

lnd/discovery/gossiper.go

Lines 940 to 941 in ee25c22

select {
case d.networkMsgs <- nMsg:

The update of the DB was done sequentially due to the above mutex lock, so the async nature didn't really have an effect if we had a burst of gossip messages.

  3. While we are at it, I think we need to make sure that all messages, including the query messages, are processed in an async way so as not to block the readHandler (see the sketch below).

For example this case here:

err := syncer.ProcessQueryMsg(m, peer.QuitSignal())
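A minimal sketch of what handing such query messages off asynchronously could look like, so the read loop is not blocked on ProcessQueryMsg; the names and the worker-pool approach are placeholders, not an actual proposed patch:

```go
package querysketch

// Placeholder types standing in for the wire message and the gossip
// syncer; the real lnd types and signatures differ.
type queryMsg struct{}

type gossipSyncer struct{}

func (s *gossipSyncer) processQueryMsg(m queryMsg, quit <-chan struct{}) error {
	// ... potentially expensive work, e.g. building a reply from the DB ...
	return nil
}

// handleQueryAsync hands the message to a bounded set of workers so the
// peer's readHandler can immediately go back to reading (and e.g. see a
// Pong) instead of blocking until the query has been answered.
func handleQueryAsync(s *gossipSyncer, m queryMsg, sem chan struct{},
	quit <-chan struct{}) {

	select {
	case sem <- struct{}{}: // acquire a worker slot (bounds concurrency)
	case <-quit:
		return
	}

	go func() {
		defer func() { <-sem }() // release the slot when done

		if err := s.processQueryMsg(m, quit); err != nil {
			_ = err // the real code would log this
		}
	}()
}
```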

cc @yyforyongyu

@yyforyongyu
Member

@ziggie1984 but the msg is done via go d.handleNetworkMessages? basically,

the msgStream has its own msgConsumer, which could be blocked at the apply here,

ms.apply(msg)

when we receive a gossip msg from the wire, it's added to the discStream, which is not blocking unless ms.producerSema is blocked,

discStream.AddMsg(msg)

unless the apply is blocked here, the ms.producerSema should be refilled immediately,

lnd/peer/brontide.go

Lines 1814 to 1821 in ee25c22

ms.apply(msg)
// We've just successfully processed an item, so we'll signal
// to the producer that a new slot in the buffer. We'll use
// this to bound the size of the buffer to avoid allowing it to
// grow indefinitely.
select {
case ms.producerSema <- struct{}{}:

the discStream is created via newDiscMsgStream, whose apply is hooked to the ProcessRemoteAnnouncement, which sends the gossip msg here,

case d.networkMsgs <- nMsg:

The gossip msg is processed in another goroutine,

lnd/discovery/gossiper.go

Lines 1530 to 1532 in ee25c22

go d.handleNetworkMessages(
announcement, &announcements, annJobID,
)

unless it's blocked by the validation barrier here,

annJobID, err := d.vb.InitJobDependencies(

I don't see how it can affect the msgConsumer? Unless the sending to d.networkMsgs is blocked here since d.networkMsgs is not buffered,

lnd/discovery/gossiper.go

Lines 940 to 941 in ee25c22

select {
case d.networkMsgs <- nMsg:
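To make that back-pressure chain explicit, here is a simplified, self-contained version of the msgStream pattern under discussion (not the actual brontide code): the read loop blocks on producerSema once the consumer, whose apply ultimately sends into the gossiper's unbuffered networkMsgs channel, falls behind.

```go
package streamsketch

import "sync"

type msg struct{}

// msgStream is a simplified bounded message stream: the producer (the
// readHandler) blocks on producerSema once `size` messages are queued
// but not yet applied, which is how a slow apply back-pressures the
// read loop.
type msgStream struct {
	mtx          sync.Mutex
	msgs         []msg
	producerSema chan struct{}
	wakeup       chan struct{}
	apply        func(msg)
	quit         chan struct{}
}

func newMsgStream(size int, apply func(msg)) *msgStream {
	ms := &msgStream{
		producerSema: make(chan struct{}, size),
		wakeup:       make(chan struct{}, 1),
		apply:        apply,
		quit:         make(chan struct{}),
	}
	for i := 0; i < size; i++ {
		ms.producerSema <- struct{}{}
	}
	go ms.msgConsumer()

	return ms
}

// AddMsg is called from the read loop. It blocks once all slots are used.
func (ms *msgStream) AddMsg(m msg) {
	<-ms.producerSema

	ms.mtx.Lock()
	ms.msgs = append(ms.msgs, m)
	ms.mtx.Unlock()

	select {
	case ms.wakeup <- struct{}{}:
	default:
	}
}

func (ms *msgStream) msgConsumer() {
	for {
		ms.mtx.Lock()
		if len(ms.msgs) == 0 {
			ms.mtx.Unlock()
			select {
			case <-ms.wakeup:
				continue
			case <-ms.quit:
				return
			}
		}
		m := ms.msgs[0]
		ms.msgs = ms.msgs[1:]
		ms.mtx.Unlock()

		// If apply blocks here, producerSema is never refilled and
		// AddMsg (and with it the readHandler) eventually blocks too.
		ms.apply(m)

		// Signal a free slot back to the producer.
		ms.producerSema <- struct{}{}
	}
}
```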

@yyforyongyu
Member

then we'd need to thread the graphCache back to the CRUD layer & have it do the updates at transaction time, but I'm not sure this is worth it/necessary (though I can do it if we think that's the wanted behaviour here).

Yeah, definitely not that route. I think the old cache design needs more work; meanwhile I just want to make sure there's no side effect, as this particular area is hard to debug or even notice in the first place. I guess since we have behavior B we should be fine, as no outdated state will be preserved, since we also do a freshness check before saving to the DB.
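For illustration, a sketch of the kind of freshness check meant here; the types are placeholders, and the actual checks in the gossiper/graph code are more involved:

```go
package freshsketch

import "time"

// storedPolicy stands in for the last channel update persisted for a
// given channel and direction; a placeholder for this sketch.
type storedPolicy struct {
	lastUpdate time.Time
}

// isStaleUpdate reports whether the incoming update is not newer than
// what is already stored, in which case it is neither written to the DB
// nor applied to the cache. Because the cache (behaviour B) is only
// updated after a successful DB write, a stale update can never be
// preserved in either place.
func isStaleUpdate(have *storedPolicy, incoming time.Time) bool {
	if have == nil {
		return false // nothing stored yet, so the update is fresh
	}
	return !incoming.After(have.lastUpdate)
}
```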

@ziggie1984
Collaborator

ziggie1984 commented May 12, 2025

I don't see how it can affect the msgConsumer? Unless the sending to d.networkMsgs is blocked here since d.networkMsgs is not buffered,

I think that's the blocker, and also the fact that the other calls (the query calls) are not done in an async manner, so I think both of these reasons kick in if we have a burst of messages. (In particular, I saw this behaviour with pinned peers, which trigger a historical rescan on every reconnect, I think.)

gossiper.vb = NewValidationBarrier(1000, gossiper.quit)

@djkazic
Contributor

djkazic commented May 12, 2025

I was able to fix the disconnects with an extremely hacky patch.

Not suggesting that we merge this -- but it's nonetheless an interesting datapoint. In the patch I push aside gossip messages into a buffered channel with a separate goroutine to drain it.

This allows Pong messages that were buried underneath a mountain of gossip messages to be processed in a timely manner, as the Pongs are now surfaced quickly instead of waiting for the gossip messages to be read and processed. Even though this only amounts to 2 to 3 ms per gossip message, if there are a lot of them that can easily push us beyond the 30-second limit.

djkazic@ada9c21
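The gist of that approach, as a rough self-contained sketch (placeholder types; not the actual patch in djkazic@ada9c21):

```go
package drainsketch

type gossipMsg struct{}

// gossipDrain decouples the read loop from gossip processing: the read
// loop only does a (usually non-blocking) send into a large buffered
// channel, while a dedicated goroutine feeds the messages to the
// gossiper at its own pace. Pongs read after a burst of gossip are then
// handled almost immediately instead of waiting behind the burst.
type gossipDrain struct {
	msgs    chan gossipMsg
	process func(gossipMsg) // e.g. hand off to the gossiper
	quit    chan struct{}
}

func newGossipDrain(size int, process func(gossipMsg)) *gossipDrain {
	d := &gossipDrain{
		msgs:    make(chan gossipMsg, size),
		process: process,
		quit:    make(chan struct{}),
	}
	go d.drain()

	return d
}

// enqueue is called from the read loop. With a large enough buffer this
// returns immediately; if the buffer fills up we fall back to blocking,
// which restores the original back-pressure behaviour.
func (d *gossipDrain) enqueue(m gossipMsg) {
	select {
	case d.msgs <- m:
	case <-d.quit:
	}
}

func (d *gossipDrain) drain() {
	for {
		select {
		case m := <-d.msgs:
			d.process(m)
		case <-d.quit:
			return
		}
	}
}
```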

@djkazic
Contributor

djkazic commented May 13, 2025

tACK

My LND node has experienced improved connectivity for channels in the last ~13 hrs since this was deployed. There's a handful of disconnecting peers, but I think that's normal; 30 of those were pong timeout disconnects.

@ellemouton ellemouton requested a review from yyforyongyu May 13, 2025 12:20
@ellemouton
Collaborator Author

@yyforyongyu -

I guess since we have behavior B we should be fine as no outdated state will be preserved since we also do freshness check before saving to the db.

just want to check that this means you are ok with this fix?

Member

@yyforyongyu yyforyongyu left a comment


🙏

@guggero guggero merged commit b0cba7d into lightningnetwork:master May 13, 2025
91 of 99 checks passed
