graph/db: remove ChannelGraph cacheMu #9804
Conversation
We remove the mutex that was previously held between the DB calls and the calls that update the graphCache. This is done so that the underlying DB calls can make use of batch requests, which they currently cannot since the mutex prevents multiple requests from reaching those methods at once.

The cacheMu was originally added during a refactor that moved the `graphCache` out of the `KVStore` and into the `ChannelGraph`; the aim was to have a best-effort way of ensuring that updates to the DB and updates to the graphCache were as consistent/atomic as possible.
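For context, here is a minimal, self-contained Go sketch of the before/after pattern (the types and method names are placeholders, not the actual lnd code): holding a mutex across the DB write and cache update serialises callers, while dropping it lets concurrent callers reach the DB layer's batch scheduler, with the cache only updated once the write succeeds.

```go
package main

import "sync"

// Placeholder types for illustration only; the real lnd types differ.
type edge struct{ chanID uint64 }

type db struct{}

func (d *db) AddChannelEdge(e *edge) error { return nil }

type cache struct{}

func (c *cache) AddChannel(e *edge) {}

type ChannelGraph struct {
	cacheMu    sync.Mutex
	db         *db
	graphCache *cache
}

// Before: cacheMu is held across the DB write and the cache update, so
// the DB layer never sees concurrent callers and cannot batch them.
func (c *ChannelGraph) addChannelEdgeOld(e *edge) error {
	c.cacheMu.Lock()
	defer c.cacheMu.Unlock()

	if err := c.db.AddChannelEdge(e); err != nil {
		return err
	}
	c.graphCache.AddChannel(e)
	return nil
}

// After: no mutex, so concurrent callers reach the DB layer together and
// its batch scheduler can coalesce them; the cache is only updated once
// the DB write has succeeded.
func (c *ChannelGraph) addChannelEdgeNew(e *edge) error {
	if err := c.db.AddChannelEdge(e); err != nil {
		return err
	}
	c.graphCache.AddChannel(e)
	return nil
}

func main() {}
```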
tACK, I see more stable behavior with respect to peer connections; I haven't seen pong disconnects.
edit: I have still seen some, though I think they're reduced.
Thanks! LGTM 🎉
LGTM (was introduced here #9550, just for reference)
This will change the behavior of how we update the cache and db. Previously we'd put the cache write and db write in a new (Line 57 in ee25c22)

Here we use a lock to make sure the cache and disk stay consistent (Lines 43 to 49 in ee25c22).

With the current change there's no such guarantee.
@yyforyongyu - indeed, although I think this is the result of pulling the graphCache out of the CRUD layer and not necessarily this change specifically. I think the problem with the behaviour before (A) was: we'd update the cache even if the overall Update could potentially fail. Whereas now (B), we only update the cache if the DB write is successful. So neither version really guarantees that the two are in sync, I'd say. That's why we then added this new [...]

If we want the exact same behaviour as before, then we'd need to thread the graphCache back to the CRUD layer and have it do the updates at transaction time - but I'm not sure this is worth it/necessary (though I can do it if we think that's the wanted behaviour here)? We definitely want to keep the ChannelGraph as the owner of the cache for future updates. So I think we need to decide if we prefer behaviour A to B (both are slightly incorrect - we'd need to do something way more complex to make the cache completely consistent with the db).
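To make the A/B trade-off concrete, here is a hedged sketch with placeholder names (`updateTx`, `persist` and `cache` are illustrative, not the real kvdb/graphCache API): behaviour A mutates the cache inside the transaction closure, behaviour B only after the commit succeeds. Neither keeps the two perfectly in sync, matching the point above.

```go
package main

import "errors"

// updateTx stands in for a transactional DB update (a kvdb.Update-style
// call); the closure may fail after parts of it have already run.
func updateTx(f func() error) error { return f() }

type cache struct{ channels map[uint64]bool }

func (c *cache) add(id uint64) { c.channels[id] = true }

// Behaviour A (old): the cache is mutated inside the transaction closure,
// so if a later step of the transaction fails, the cache already holds
// state that never made it to disk.
func addChannelA(c *cache, id uint64, persist func() error) error {
	return updateTx(func() error {
		c.add(id) // cache updated even if persist below fails
		return persist()
	})
}

// Behaviour B (new): the cache is only touched after the transaction has
// succeeded, so a failed write leaves the cache untouched, at the cost of
// a short window where disk is ahead of the cache.
func addChannelB(c *cache, id uint64, persist func() error) error {
	if err := updateTx(persist); err != nil {
		return err
	}
	c.add(id)
	return nil
}

func main() {
	c := &cache{channels: make(map[uint64]bool)}

	_ = addChannelA(c, 1, func() error { return errors.New("disk full") })
	// c.channels[1] is true even though nothing was persisted (A).

	_ = addChannelB(c, 2, func() error { return errors.New("disk full") })
	// c.channels[2] is absent because the cache was never touched (B).
}
```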
Trying to give a high-level overview of what resulted in flapping peer connections:

This can lead to a max disconnect time of 1 h (default): Line 4796 in ee25c22

The readHandler would block until slots became available: Line 1839 in ee25c22

We only had 5 slots available initially, however even 1000 slots led to the same disconnects, as tests showed. The processing of most of the Gossip messages is done in an async way here: Lines 940 to 941 in ee25c22

However, the update of the DB was done sequentially due to the above mutex lock, so the async nature didn't really have an effect if we had a burst of Gossip messages (see the sketch below).

For example this case here: Line 881 in ee25c22

cc @yyforyongyu
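A minimal sketch of that interaction, assuming a per-message DB cost of a few milliseconds; `dbMu`, `slots` and `writeDB` are placeholders rather than lnd code. Even with many slots, the shared mutex turns a gossip burst into serial work, so the read loop stalls behind it regardless of the slot count.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var (
		dbMu  sync.Mutex
		slots = make(chan struct{}, 5) // limited concurrent-message slots
		wg    sync.WaitGroup
	)

	writeDB := func() {
		dbMu.Lock()
		time.Sleep(3 * time.Millisecond) // rough per-message DB cost
		dbMu.Unlock()
	}

	start := time.Now()
	for i := 0; i < 1000; i++ {
		slots <- struct{}{} // the read loop blocks here once all slots are taken
		wg.Add(1)
		go func() {
			defer wg.Done()
			writeDB() // handled "async", but serialised by dbMu
			<-slots
		}()
	}
	wg.Wait()

	// With the shared mutex the total time is roughly 1000 * 3ms of serial
	// work, so a large enough burst can push the read loop past a pong
	// deadline no matter how many slots exist.
	fmt.Println("elapsed:", time.Since(start))
}
```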
@ziggie1984 but the msg is done via the (Line 1814 in ee25c22)

when we receive a gossip msg from the wire, it's added to the (Line 2176 in ee25c22)

unless the (Lines 1814 to 1821 in ee25c22)

the (Line 941 in ee25c22)

The gossip msg is processed in another goroutine (Lines 1530 to 1532 in ee25c22),

unless it's blocked by the validation barrier here (Line 1521 in ee25c22).

I don't see how it can affect the (Lines 940 to 941 in ee25c22)
Yeah, definitely not that route. I think the old cache design needs more work; meanwhile I just want to make sure there's no side effect, as this particular area is hard to debug or even notice in the first place. I guess since we have behavior B we should be fine, as no outdated state will be preserved since we also do a freshness check before saving to the db.
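As an illustration of that kind of freshness check (the names here are hypothetical, not the real lnd API): the stored timestamp is compared against the incoming one before anything is written, so a stale update never overwrites newer state on disk or, afterwards, in the cache.

```go
package main

import (
	"fmt"
	"time"
)

// channelUpdate is a placeholder for a gossip update with a timestamp.
type channelUpdate struct {
	timestamp time.Time
}

// maybePersist only writes the incoming update if it is strictly newer
// than the stored one; otherwise it is dropped.
func maybePersist(existing, incoming *channelUpdate,
	persist func(*channelUpdate) error) error {

	if existing != nil && !incoming.timestamp.After(existing.timestamp) {
		// Incoming update is not newer: drop it rather than writing
		// outdated state to disk (and, afterwards, to the cache).
		return nil
	}
	return persist(incoming)
}

func main() {
	old := &channelUpdate{timestamp: time.Now()}
	stale := &channelUpdate{timestamp: old.timestamp.Add(-time.Hour)}

	// The stale update is dropped; persist is never called.
	_ = maybePersist(old, stale, func(*channelUpdate) error {
		fmt.Println("persisted")
		return nil
	})
}
```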
I think that's the blocker, and also the fact that the other calls (query calls) are not done in an async manner, so I think both of these reasons kick in if we have a burst of messages. (In particular I saw this behaviour with pinned peers, which on every reconnect make a historic rescan I think: Line 582 in 8044891)
I was able to fix the disconnects with an extremely hacky patch. Not suggesting that we merge this - but it's nonetheless an interesting data point. In the patch I push gossip messages aside into a buffered channel with a separate goroutine to drain it. This allows Pong messages that were buried underneath a mountain of gossip messages to be processed in a timely manner, as the Pongs are now surfaced quickly instead of waiting for the gossip messages to be read and processed. Even though this only amounts to 2 to 3 ms per gossip message, if there are a lot of them that can easily push us beyond the 30-second limit.
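A rough sketch of that approach, with placeholder types rather than the actual readHandler: gossip is diverted into a large buffered channel drained by its own goroutine, so control messages such as pong never wait behind a gossip burst.

```go
package main

import "fmt"

// msg is a placeholder wire message; only the kind matters here.
type msg struct {
	kind    string // "gossip", "pong", ...
	payload string
}

// readHandler pushes gossip aside into a buffered channel drained by a
// separate goroutine, so control messages are handled immediately
// instead of queueing behind a burst of gossip.
func readHandler(in <-chan msg, handleControl, handleGossip func(msg)) {
	gossipQueue := make(chan msg, 10000) // buffer absorbs gossip bursts
	done := make(chan struct{})

	// A separate goroutine drains gossip at whatever rate the DB allows.
	go func() {
		defer close(done)
		for m := range gossipQueue {
			handleGossip(m)
		}
	}()

	for m := range in {
		if m.kind == "gossip" {
			gossipQueue <- m // push aside; don't block the read loop
			continue
		}
		handleControl(m) // pongs et al. are processed right away
	}

	close(gossipQueue)
	<-done
}

func main() {
	in := make(chan msg, 3)
	in <- msg{kind: "gossip", payload: "chan update"}
	in <- msg{kind: "pong"}
	close(in)

	readHandler(in,
		func(m msg) { fmt.Println("control:", m.kind) },
		func(m msg) { fmt.Println("gossip:", m.payload) },
	)
}
```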
tACK - My LND node has experienced improved connectivity for channels in the last ~13 hrs since this was deployed. There's a handful of DC'ing peers but I think that's normal. 30 of those were
Just want to check that this means you are ok with this fix?
🙏