
Conversation

Roasbeef (Member)

I noticed in #10128 that at times, we'll hold the cache mutex the entire time a DB transaction is active. This effectively makes all access to the graph DB fully serial. As seen in #10245, if we hold this mutex for too long while a transaction is active and network activity is ongoing, then we'll end up blocking other goroutines unnecessarily.

In this PR, to resolve that, we switch to a fine-grained locking methodology. Before this commit, we'd grab the lock up front and release it with a defer, while typically only accessing the cache at the very end of the function. Now we'll grab the mutex only when we need it.
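
A minimal sketch of the new shape (chanCache and dbUpdate here are hypothetical stand-ins, not the actual types touched by this PR):

import "sync"

type chanCache struct {
    mu    sync.RWMutex
    chans map[uint64]struct{}
}

// The heavyweight DB transaction runs without the cache mutex held; the
// mutex is taken only for the brief purge at the end, and only when there
// is actual work to do.
func (c *chanCache) pruneFine(dbUpdate func() ([]uint64, error)) error {
    closed, err := dbUpdate() // no lock held during DB work
    if err != nil {
        return err
    }
    if len(closed) == 0 {
        return nil
    }

    c.mu.Lock()
    for _, id := range closed {
        delete(c.chans, id)
    }
    c.mu.Unlock()

    return nil
}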

We also add a randomized concurrent test using rapid to attempt to catch any rare deadlocks that this set of changes may introduce.
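
A rough sketch of what such a test can look like with rapid, using a trivial stand-in cache rather than the real ChannelGraph API (stubCache and its methods are hypothetical):

import (
    "fmt"
    "sync"
    "testing"

    "pgregory.net/rapid"
)

type stubCache struct {
    mu sync.RWMutex
    m  map[uint64]struct{}
}

func (c *stubCache) add(id uint64)    { c.mu.Lock(); c.m[id] = struct{}{}; c.mu.Unlock() }
func (c *stubCache) remove(id uint64) { c.mu.Lock(); delete(c.m, id); c.mu.Unlock() }

func (c *stubCache) has(id uint64) bool {
    c.mu.RLock()
    defer c.mu.RUnlock()
    _, ok := c.m[id]
    return ok
}

func TestCacheConcurrency(t *testing.T) {
    if testing.Short() {
        t.Skip("skipping long-running property test in -short mode")
    }

    rapid.Check(t, func(rt *rapid.T) {
        c := &stubCache{m: make(map[uint64]struct{})}

        const workers = 4
        var wg sync.WaitGroup
        for w := 0; w < workers; w++ {
            // Draw all random data on the main goroutine: rapid's
            // Draw must not be called from spawned goroutines.
            ids := rapid.SliceOfN(rapid.Uint64Range(0, 64), 8, 32).
                Draw(rt, fmt.Sprintf("ids-%d", w))

            wg.Add(1)
            go func(ids []uint64) {
                defer wg.Done()
                for _, id := range ids {
                    c.add(id)
                    _ = c.has(id)
                    c.remove(id)
                }
            }(ids)
        }

        // Run with -race; a deadlock surfaces as a test timeout.
        wg.Wait()
    })
}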

One open question is: are there any cache consistency violations that may arise as a result of this PR that we need to worry about?

Summary of Changes

Hello @Roasbeef, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the locking mechanism within the graph database cache to enhance concurrency and overall system responsiveness. Previously, a broad mutex lock could serialize access to the graph DB, leading to performance bottlenecks. The updated approach employs more granular locking, ensuring the mutex is held only for the brief periods required for cache operations. This change is validated by a new property-based test designed to expose concurrency issues.

Highlights

  • Fine-grained Locking: The graph DB cache now uses fine-grained locking, acquiring the cache mutex only when directly interacting with the cache, rather than holding it for the entire duration of a database transaction. This change aims to improve concurrency and prevent serialization of graph DB access.
  • Improved Concurrency: By reducing the scope of the cache mutex, the system can avoid unnecessary blocking of goroutines, especially during network activity, addressing performance issues observed in related pull requests.
  • New Concurrency Test: A new randomized concurrent test has been added using the rapid framework. This test rigorously verifies the cache mutex behavior under contention by simulating various concurrent operations on the ChannelGraph's public API, helping to catch potential deadlocks or race conditions introduced by the locking changes.

gemini-code-assist bot left a comment

Code Review

This pull request refactors the locking strategy for the graph database cache, moving from a coarse-grained mutex held for the duration of database transactions to a more fine-grained approach where locks are acquired only when accessing the cache. This is a great improvement for performance, as it significantly reduces lock contention and prevents blocking other goroutines. The addition of a rapid-based concurrent property test is also excellent for increasing confidence in the correctness of these concurrent changes.

My main concern, which you've also raised, is the potential for cache consistency violations. By releasing the lock between the database transaction and the cache update, a window is created where the cache can be out of sync with the database. I've added a detailed comment on this. Overall, this is a solid PR that addresses an important performance issue.

Comment on lines +1537 to 1546 of graph/db/kv_store.go
if len(chansClosed) > 0 {
    c.cacheMu.Lock()
    for _, channel := range chansClosed {
        c.rejectCache.remove(channel.ChannelID)
        c.chanCache.remove(channel.ChannelID)
    }
    c.cacheMu.Unlock()
}

Severity: high

While this fine-grained locking approach improves performance by reducing lock contention, it introduces a time window where the cache can be inconsistent with the database.

Specifically, between the kvdb.Update call completing and the cache being updated here, other goroutines can read stale data from the cache. For example, another thread could read a channel from chanCache that has just been pruned from the database, leading pathfinding algorithms to construct routes through channels that are already closed.

While routing failures are generally recoverable, this inconsistency might be undesirable. If strong consistency is a requirement, an alternative would be to hold the write lock across both the database transaction and the cache update, which is what the previous implementation did. Given the performance goals of this PR, this trade-off might be acceptable, but it's a critical point to consider and document.
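
For illustration, the strong-consistency alternative (roughly the previous behavior) would look like the following sketch; cache and updateDB are hypothetical stand-ins, not the actual lnd code:

import "sync"

type cache struct {
    mu sync.Mutex
    m  map[uint64]struct{}
}

// Holding the mutex across both steps makes the DB commit and the cache
// purge atomic from a reader's point of view, at the cost of serializing
// all cache access for the full duration of the transaction.
func pruneStrict(c *cache, updateDB func() ([]uint64, error)) error {
    c.mu.Lock()
    defer c.mu.Unlock()

    closed, err := updateDB()
    if err != nil {
        return err
    }
    for _, id := range closed {
        delete(c.m, id)
    }

    return nil
}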

Comment on lines +2417 to 2429 (SQL store)
if len(closedChans) > 0 {
    s.cacheMu.Lock()
    for _, channel := range closedChans {
        s.rejectCache.remove(channel.ChannelID)
        s.chanCache.remove(channel.ChannelID)
    }
    s.cacheMu.Unlock()
}

Severity: high

Similar to the kv_store.go implementation, this change introduces a potential cache consistency issue. There's a window between the database transaction committing and the cache being updated where other goroutines could read stale data. Please see my more detailed comment on graph/db/kv_store.go.

In this commit, we switch the kv and SQL stores to only hold cacheMu while
touching the caches. This avoids long-held locks during heavyweight DB
work and lets readers use RLock when hitting the chan cache. The cache
purges and inserts now take the mutex only when there is actual work to
do.
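
A sketch of the reader path this enables (chanCache and entry are hypothetical stand-ins for lnd's actual types):

import "sync"

type entry struct{}

type chanCache struct {
    mu sync.RWMutex
    m  map[uint64]entry
}

// Readers take only the read lock, so concurrent lookups no longer
// serialize behind one another, and writers block readers only for the
// short cache mutation itself rather than a whole DB transaction.
func (c *chanCache) get(id uint64) (entry, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()

    e, ok := c.m[id]
    return e, ok
}
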
In this commit, we add a Rapid-powered property test that drives
ChannelGraph through its public API from multiple goroutines. The
workload mixes adds, pruning, deletes, horizon scans, and zombie
transitions to shake out deadlocks introduced by the cache mutex
changes.

In this commit, we make the cache concurrency property test track
inserted channel metadata so subsequent operations reuse real channel
IDs, outpoints, and block hashes. This increases the odds of hitting
cache mutation paths instead of falling back to no-op code paths. Also
skip the long-running property when tests are executed with -short.