-
Notifications
You must be signed in to change notification settings - Fork 2.2k
fix switch deadlock #10035
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix switch deadlock #10035
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @ziggie1984, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses a critical deadlock (writer starvation) within the htlcswitch
component by refining the mutex locking strategy. The change ensures that a read lock on the indexMtx
is released as soon as the necessary index lookups are complete, preventing it from being held during potentially long-running operations and thus improving concurrency and preventing starvation. The fix is also documented in the release notes.
Highlights
- Concurrency Fix: The
indexMtx
inhtlcswitch.getLocalLink
is now explicitly unlocked earlier, immediately after all necessary index lookups are performed. This prevents the read lock from being held during subsequent, potentially expensive operations, which was causing writer starvation and deadlocks. - Deadlock Prevention: By reducing the critical section where the
indexMtx
is held, this change directly mitigates a known deadlock (writer starvation) issue within the switch, improving overall concurrency and stability. - Documentation: An entry has been added to the
release-notes-0.19.2.md
file, documenting the fix for the switch deadlock.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command>
or @gemini-code-assist <command>
. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/
folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request addresses a potential deadlock (writer starvation) in the HTLC switch by releasing the read lock on s.indexMtx
earlier within the getLocalLink
function. The defer
statement for unlocking has been replaced with explicit RUnlock
calls on all execution paths before any potentially long-running operations. This change is well-reasoned and correctly implemented, ensuring the lock is always released exactly once. The accompanying release note update is also clear and accurate. The changes improve the robustness and performance of the switch. Overall, this is a solid fix.
277a965
to
32351f7
Compare
We now unlock the mutex lock of the switch as soon as possible to avoid potetnial deadlock in the htlc switch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🎉
32351f7
to
edb7342
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice investigation 🙏
edit: just for clarification, I wouldn't call it writer starvation but a deadlock, because the golang RWMutex
blocks further readers from acquiring it if in the meantime a write was registered
a sample program which illustrates that golang and its Mutex prioritise Write Locks to prevent write starvation which basically lead to the deadlock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice findings! Tho it looks like there are other deadlocks we need to fix🤦🏻
The problem seems to be this line,
Lines 2049 to 2050 in b3eb9a3
// Attach the Switch's failAliasUpdate function to the link. | |
link.attachFailAliasUpdate(s.failAliasUpdate) |
failAliasUpdate
requires a read lock here,
Lines 2671 to 2676 in b3eb9a3
func (s *Switch) failAliasUpdate(scid lnwire.ShortChannelID, | |
incoming bool) *lnwire.ChannelUpdate1 { | |
// This function does not defer the unlocking because of the database | |
// lookups for ChannelUpdate. | |
s.indexMtx.RLock() |
but inside AddLink
we require a write lock,
Lines 2029 to 2031 in b3eb9a3
func (s *Switch) AddLink(link ChannelLink) error { | |
s.indexMtx.Lock() | |
defer s.indexMtx.Unlock() |
Then createFailureWithUpdate
will just deadlock, which is used at multiple places...
Hmm could you specify the scenario in detail I cannot see the other deadlocks right now (apart form that the design of supplying a functions which requires a mutex of a whole subsystem is pretty bad design and surely there are very likely more scenarios where the above can happen) |
Took a quick look, basically
Imo the |
Fixes #10008
In the issue above we had a writer starvation because although we only aquire the RLock in:
getLocalLink
which if the balance is not enough creates a failUpdate (which also needs the RLock) so far so good that would not deadlock, however there is a chance a writer regssiters in between and if that happens we have a writer starvation and the parts of the system will deadlock.In a above case the server mutex is locked and it will cause almost the whole system to lock down.
Important goroutine dump logs:
Here the
getLocalLink
call:It clearly locks at the switch read mutex.
Which means this lock was registered before:
which also holds the server mutex that's why all the P2P connection bork.