graphdb: uniformly use fine grained locking when accessing the graph db cache #10247
base: master
Conversation
Summary of Changes

Hello @Roasbeef, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request refactors the locking mechanism within the graph database cache to enhance concurrency and overall system responsiveness. Previously, a broad mutex lock could serialize access to the graph DB, leading to performance bottlenecks. The updated approach employs more granular locking, ensuring the mutex is held only for the brief periods required for cache operations. This change is validated by a new property-based test designed to expose concurrency issues.
Code Review
This pull request refactors the locking strategy for the graph database cache, moving from a coarse-grained mutex held for the duration of database transactions to a more fine-grained approach where locks are acquired only when accessing the cache. This is a great improvement for performance, as it significantly reduces lock contention and prevents blocking other goroutines. The addition of a rapid-based concurrent property test is also excellent for increasing confidence in the correctness of these concurrent changes.
My main concern, which you've also raised, is the potential for cache consistency violations. By releasing the lock between the database transaction and the cache update, a window is created where the cache can be out of sync with the database. I've added a detailed comment on this. Overall, this is a solid PR that addresses an important performance issue.
if len(chansClosed) > 0 {
	c.cacheMu.Lock()
	for _, channel := range chansClosed {
		c.rejectCache.remove(channel.ChannelID)
		c.chanCache.remove(channel.ChannelID)
	}
	c.cacheMu.Unlock()
}
While this fine-grained locking approach improves performance by reducing lock contention, it introduces a time window where the cache can be inconsistent with the database.
Specifically, between the kvdb.Update call completing and the cache being updated here, other goroutines can read stale data from the cache. For example, another thread could read a channel from chanCache that has just been pruned from the database, leading pathfinding algorithms to construct routes through channels that are already closed.
While routing failures are generally recoverable, this inconsistency might be undesirable. If strong consistency is a requirement, an alternative would be to hold the write lock across both the database transaction and the cache update, which is what the previous implementation did. Given the performance goals of this PR, this trade-off might be acceptable, but it's a critical point to consider and document.
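To make the trade-off concrete, here is a minimal Go sketch of the strong-consistency alternative described above, where the write lock is held across both the database transaction and the cache purge. The graphStore type, its fields, and the updateDB callback are simplified stand-ins for illustration, not lnd's actual graph store API.

```go
package graphsketch

import "sync"

// graphStore is a simplified stand-in for the graph DB store and its
// caches; the real lnd types are more involved.
type graphStore struct {
	cacheMu     sync.RWMutex
	rejectCache map[uint64]struct{}
	chanCache   map[uint64]struct{}
}

// pruneStrongConsistency holds the write lock across both the database
// transaction (updateDB stands in for the kvdb/SQL update) and the cache
// purge, so no reader can observe a just-pruned channel in the cache. The
// cost is that all cache access is serialized for the duration of the DB
// work, which is exactly the contention this PR is removing.
func (c *graphStore) pruneStrongConsistency(chansClosed []uint64,
	updateDB func() error) error {

	c.cacheMu.Lock()
	defer c.cacheMu.Unlock()

	// The DB transaction now runs under the cache mutex.
	if err := updateDB(); err != nil {
		return err
	}

	// Purge the caches before any reader can acquire the lock again.
	for _, chanID := range chansClosed {
		delete(c.rejectCache, chanID)
		delete(c.chanCache, chanID)
	}

	return nil
}
```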
if len(closedChans) > 0 {
	s.cacheMu.Lock()
	for _, channel := range closedChans {
		s.rejectCache.remove(channel.ChannelID)
		s.chanCache.remove(channel.ChannelID)
	}
	s.cacheMu.Unlock()
}
In this commit, we switch the kv and SQL stores to only hold cacheMu while touching the caches. This avoids long-held locks during heavyweight DB work and lets readers use RLock when hitting the chan cache. The cache purges and inserts now take the mutex only when there is actual work to do.
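As an illustration of the read path this commit describes, here is a hedged sketch of readers sharing an RLock while only writers take the exclusive lock, and only for the brief cache touch. The chanStore and chanInfo types and the method names are illustrative stand-ins, not the actual KVStore/SQLStore API.

```go
package graphsketch

import "sync"

// chanInfo is a simplified stand-in for a cached channel entry.
type chanInfo struct {
	ChannelID uint64
}

// chanStore is a simplified stand-in for the kv/SQL store's cache layer.
type chanStore struct {
	cacheMu   sync.RWMutex
	chanCache map[uint64]*chanInfo
}

// fetchCachedChannel sketches the read path: because the mutex now guards
// only cache access, readers can share an RLock instead of queueing behind
// a writer that is in the middle of a long database transaction.
func (s *chanStore) fetchCachedChannel(chanID uint64) (*chanInfo, bool) {
	s.cacheMu.RLock()
	channel, ok := s.chanCache[chanID]
	s.cacheMu.RUnlock()

	return channel, ok
}

// cacheChannel sketches the corresponding insert path: the exclusive lock
// is taken only for the brief map update, not for the DB work that
// produced the entry.
func (s *chanStore) cacheChannel(channel *chanInfo) {
	s.cacheMu.Lock()
	s.chanCache[channel.ChannelID] = channel
	s.cacheMu.Unlock()
}
```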
In this commit, we add a Rapid-powered property test that drives ChannelGraph through its public API from multiple goroutines. The workload mixes adds, pruning, deletes, horizon scans, and zombie transitions to shake out deadlocks introduced by the cache mutex changes.
In this commit, we make the cache concurrency property test track inserted channel metadata so subsequent operations reuse real channel IDs, outpoints, and block hashes. This increases the odds of hitting cache mutation paths instead of falling back to no-op code paths. Also skip the long-running property when tests are executed with -short.
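For readers unfamiliar with the shape of such a test, here is a hedged sketch of a rapid-driven concurrency property, assuming the pgregory.net/rapid package and reusing the stand-in chanStore from the sketch above; the real test drives lnd's ChannelGraph API rather than these toy helpers.

```go
package graphsketch

import (
	"sync"
	"testing"

	"pgregory.net/rapid"
)

// TestConcurrentCacheOps sketches the structure of a rapid-driven
// concurrency test: each property iteration draws a small random workload
// up front (rapid draws must not happen concurrently) and then replays it
// from several goroutines against a shared store, relying on -race and
// the test timeout to surface data races and deadlocks.
func TestConcurrentCacheOps(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping long-running concurrency property in -short mode")
	}

	rapid.Check(t, func(rt *rapid.T) {
		store := &chanStore{chanCache: make(map[uint64]*chanInfo)}

		// Reuse a fixed set of channel IDs across operations so that
		// reads and removals hit real entries instead of no-op paths.
		ids := rapid.SliceOfN(rapid.Uint64(), 1, 16).Draw(rt, "ids")
		numWorkers := rapid.IntRange(2, 8).Draw(rt, "workers")

		var wg sync.WaitGroup
		for w := 0; w < numWorkers; w++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				for _, id := range ids {
					store.cacheChannel(&chanInfo{ChannelID: id})
					store.fetchCachedChannel(id)

					store.cacheMu.Lock()
					delete(store.chanCache, id)
					store.cacheMu.Unlock()
				}
			}()
		}
		wg.Wait()
	})
}
```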
I noticed in #10128 that at times, we'll hold the cache mutex the entire time a DB transaction is active. This effectively makes all access to the graph DB fully serial. As seen in #10245, if we hold this mutex for too long while network activity is ongoing, then we'll end up blocking other goroutines unnecessarily.

To resolve that, in this PR we switch to a fine-grained locking methodology. Before this change, we'd just grab the lock and then use a defer, while typically only accessing the cache at the very end of the function. Now we'll grab the mutex only when we need it.
We also add a randomized concurrent test using rapid to attempt to catch any rare deadlocks that this set of changes may introduce.

One open question is: are there any cache consistency violations that may arise as a result of this PR that we need to worry about?