
Conversation

Collaborator

@ziggie1984 ziggie1984 commented Jul 4, 2025

Fixes #10008

In the issue above we hit writer starvation. We only acquire the RLock in:

getLocalLink, which, if the balance is not enough, creates a failure update via failAliasUpdate (which also needs the RLock). So far so good, that alone would not deadlock. However, there is a chance a writer registers its Lock in between, and if that happens the second RLock blocks behind the pending writer: we have writer starvation and parts of the system deadlock.

In the above case the server mutex is also locked, which causes almost the whole system to lock down.
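
To make the interleaving concrete, here is a minimal standalone Go sketch (illustrative only, not lnd code): one goroutine holds the read lock and later re-acquires it on the same path, while a writer registers in between. The inner RLock then blocks behind the pending writer, the writer waits for the outer RLock, and running the program trips Go's runtime deadlock detector.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var indexMtx sync.RWMutex

	// Reader path, analogous to getLocalLink -> failAliasUpdate.
	go func() {
		indexMtx.RLock() // outer read lock (getLocalLink)

		// Give the writer below time to queue its Lock().
		time.Sleep(100 * time.Millisecond)

		// Inner read lock (failAliasUpdate): blocks forever, because a
		// writer is already pending and Go's RWMutex stops new readers
		// from jumping the queue. The writer in turn waits for the
		// outer RLock to be released, so nobody makes progress.
		indexMtx.RLock()
		fmt.Println("never reached")
		indexMtx.RUnlock()
		indexMtx.RUnlock()
	}()

	// Writer path, analogous to RemoveLink/AddLink registering in between.
	time.Sleep(50 * time.Millisecond)
	indexMtx.Lock()
	fmt.Println("never reached either")
	indexMtx.Unlock()
}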

Important goroutine dump logs:

Here is the getLocalLink call:

goroutine 216074 [sync.RWMutex.RLock, 4 minutes]: 1 times: [[216074]
sync.runtime_SemacquireRWMutexR(0xc2120595219f4ec4?, 0xc0?, 0x4691180?)
        runtime/sema.go:100 +0x25
sync.(*RWMutex).RLock(...)
        sync/rwmutex.go:72
github.com/lightningnetwork/lnd/htlcswitch.(*Switch).failAliasUpdate(0xc00c3d2700, {0x0, 0x0, 0x0}, 0x0)
        github.com/lightningnetwork/lnd/htlcswitch/switch.go:2667 +0x65
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).createFailureWithUpdate(0xc010837808, 0xa7?, {0x0?, 0x20?, 0x0?}, 0x2a6a308)
        github.com/lightningnetwork/lnd/htlcswitch/link.go:874 +0x56
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).canSendHtlc(0xc010837808, {0x3e8, 0x2faf080, 0x0, 0x118, {0x0, 0xffffff4c}, 0x90}, {0xf0, 0xec, ...}, ...)
        github.com/lightningnetwork/lnd/htlcswitch/link.go:3400 +0x1fb
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).CheckHtlcTransit(0xc010837808, {0xf0, 0xec, 0xf7, 0x6d, 0x82, 0xa5, 0xb3, 0x7d, 0x0, ...}, ...)
        github.com/lightningnetwork/lnd/htlcswitch/link.go:3359 +0x191
github.com/lightningnetwork/lnd/htlcswitch.(*Switch).getLocalLink(0xc00c3d2700, 0xc0051b32c0, 0xc0130a5208)
        github.com/lightningnetwork/lnd/htlcswitch/switch.go:922 +0x227
github.com/lightningnetwork/lnd/htlcswitch.(*Switch).SendHTLC(0xc00c3d2700, {0xee84330?, 0xc5fe98d3?, 0xe90e?}, 0x3a1468f, 0xc0130a5208)
        github.com/lightningnetwork/lnd/htlcswitch/switch.go:565 +0xd8
github.com/lightningnetwork/lnd/routing.(*paymentLifecycle).sendAttempt(0xc00597bc00, 0xc0050b0e08)
        github.com/lightningnetwork/lnd/routing/payment_lifecycle.go:695 +0x4b0
github.com/lightningnetwork/lnd/routing.(*paymentLifecycle).resumePayment(0xc00597bc00, {0x2d64880, 0xc014ebfb20})
        github.com/lightningnetwork/lnd/routing/payment_lifecycle.go:308 +0x387
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).sendPayment(0xc0016261b0, {0x2d646f0?, 0x46bb800?}, 0x18376, {0xf0, 0xec, 0xf7, 0x6d, 0x82, 0xa5, ...}, ...)
        github.com/lightningnetwork/lnd/routing/router.go:1271 +0x2e9
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).SendPaymentAsync.func1()
        github.com/lightningnetwork/lnd/routing/router.go:916 +0x23e
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).SendPaymentAsync in goroutine 216016
        github.com/lightningnetwork/lnd/routing/router.go:910 +0xec

It clearly blocks on the switch's read mutex.

Which means this write lock was registered before it:

goroutine 220562 [sync.RWMutex.Lock, 4 minutes]: 1 times: [[220562]
sync.runtime_SemacquireRWMutex(0x62e8e0?, 0xf8?, 0x7fdef424c928?)
        runtime/sema.go:105 +0x25
sync.(*RWMutex).Lock(0x7ff4fd90ea68?)
        sync/rwmutex.go:153 +0x6a
github.com/lightningnetwork/lnd/htlcswitch.(*Switch).RemoveLink(0xc00c3d2700, {0x3, 0xfa, 0x5, 0xc5, 0x14, 0x84, 0xc3, 0xac, 0xa, ...})
        github.com/lightningnetwork/lnd/htlcswitch/switch.go:2342 +0x38
github.com/lightningnetwork/lnd/peer.(*Brontide).addLink(0xc010935688, 0xc002f4d0b0, 0xc0071aefc0, 0xc002f4d230, 0xc011817770, 0x1, {0x0, {{0x0, 0x0, 0x0, ...}, ...}})
        github.com/lightningnetwork/lnd/peer/brontide.go:1470 +0x993
github.com/lightningnetwork/lnd/peer.(*Brontide).loadActiveChannels(0xc010935688, {0xc012f396c8, 0x1, 0xc01096f300?})
        github.com/lightningnetwork/lnd/peer/brontide.go:1331 +0x1568
github.com/lightningnetwork/lnd/peer.(*Brontide).Start(0xc010935688)
        github.com/lightningnetwork/lnd/peer/brontide.go:884 +0x772
github.com/lightningnetwork/lnd.(*server).peerInitializer(0xc000d09508, 0xc010935688)
        github.com/lightningnetwork/lnd/server.go:4588 +0x238
created by github.com/lightningnetwork/lnd.(*server).peerConnected in goroutine 220546
        github.com/lightningnetwork/lnd/server.go:4505 +0xdff

which also holds the server mutex, which is why all the P2P connections bork.

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @ziggie1984, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical deadlock (writer starvation) within the htlcswitch component by refining the mutex locking strategy. The change ensures that a read lock on the indexMtx is released as soon as the necessary index lookups are complete, preventing it from being held during potentially long-running operations and thus improving concurrency and preventing starvation. The fix is also documented in the release notes.

Highlights

  • Concurrency Fix: The indexMtx in htlcswitch.getLocalLink is now explicitly unlocked earlier, immediately after all necessary index lookups are performed. This prevents the read lock from being held during subsequent, potentially expensive operations, which was causing writer starvation and deadlocks.
  • Deadlock Prevention: By reducing the critical section where the indexMtx is held, this change directly mitigates a known deadlock (writer starvation) issue within the switch, improving overall concurrency and stability.
  • Documentation: An entry has been added to the release-notes-0.19.2.md file, documenting the fix for the switch deadlock.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a potential deadlock (writer starvation) in the HTLC switch by releasing the read lock on s.indexMtx earlier within the getLocalLink function. The defer statement for unlocking has been replaced with explicit RUnlock calls on all execution paths before any potentially long-running operations. This change is well-reasoned and correctly implemented, ensuring the lock is always released exactly once. The accompanying release note update is also clear and accurate. The changes improve the robustness and performance of the switch. Overall, this is a solid fix.
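
For illustration, a minimal sketch of the pattern the fix applies, with made-up names rather than the actual lnd code: release the read lock right after the index lookup on every path instead of deferring it, so no long-running work runs with the lock held.

package main

import "sync"

type registry struct {
	indexMtx sync.RWMutex
	links    map[string]string
}

func (r *registry) lookupAndCheck(id string) (string, bool) {
	r.indexMtx.RLock()
	link, ok := r.links[id]
	// Release the read lock as soon as the index lookup is done, on every
	// path. A `defer r.indexMtx.RUnlock()` would keep it held across the
	// expensive check below, which is what allows a queued writer to
	// wedge the whole switch.
	r.indexMtx.RUnlock()

	if !ok {
		return "", false
	}

	// Potentially slow work (the CheckHtlcTransit analogue) now runs
	// without the read lock, so writers like AddLink/RemoveLink can make
	// progress in the meantime.
	expensiveCheck(link)

	return link, true
}

func expensiveCheck(string) {}

func main() {
	r := &registry{links: map[string]string{"chan-1": "link-1"}}
	r.lookupAndCheck("chan-1")
}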

@ziggie1984 ziggie1984 force-pushed the fix-switch-deadlock branch from 277a965 to 32351f7 on July 4, 2025 10:55
@ziggie1984 ziggie1984 self-assigned this Jul 4, 2025
@ziggie1984 ziggie1984 added this to the v0.19.2 milestone Jul 4, 2025
We now release the switch's mutex as soon as possible to avoid a
potential deadlock in the htlc switch.
Collaborator

@bhandras bhandras left a comment

LGTM 🎉

@ziggie1984 ziggie1984 force-pushed the fix-switch-deadlock branch from 32351f7 to edb7342 on July 4, 2025 11:50
Collaborator

@bitromortac bitromortac left a comment

LGTM, nice investigation 🙏

edit: just for clarification, I wouldn't call it writer starvation but a deadlock, because the Go RWMutex blocks further readers from acquiring it if a writer has registered in the meantime

@ziggie1984
Collaborator Author

ziggie1984 commented Jul 4, 2025

A sample program which illustrates that golang and its RWMutex prioritise write locks to prevent writer starvation, which basically leads to the deadlock here:

https://go.dev/play/p/zy8-FVSm87M

Member

@yyforyongyu yyforyongyu left a comment

Nice findings! Tho it looks like there are other deadlocks we need to fix🤦🏻

The problem seems to be this line,

lnd/htlcswitch/switch.go

Lines 2049 to 2050 in b3eb9a3

// Attach the Switch's failAliasUpdate function to the link.
link.attachFailAliasUpdate(s.failAliasUpdate)

failAliasUpdate requires a read lock here,

lnd/htlcswitch/switch.go

Lines 2671 to 2676 in b3eb9a3

func (s *Switch) failAliasUpdate(scid lnwire.ShortChannelID,
	incoming bool) *lnwire.ChannelUpdate1 {

	// This function does not defer the unlocking because of the database
	// lookups for ChannelUpdate.
	s.indexMtx.RLock()

but inside AddLink we require a write lock,

lnd/htlcswitch/switch.go

Lines 2029 to 2031 in b3eb9a3

func (s *Switch) AddLink(link ChannelLink) error {
	s.indexMtx.Lock()
	defer s.indexMtx.Unlock()

Then createFailureWithUpdate, which is used in multiple places, will just deadlock...

@ziggie1984
Collaborator Author

Then createFailureWithUpdate, which is used in multiple places, will just deadlock...

Hmm, could you specify the scenario in detail? I cannot see the other deadlocks right now (apart from the fact that supplying a function which requires a mutex of a whole subsystem is pretty bad design, and there are very likely more scenarios where the above can happen).

@yyforyongyu
Member

Took a quick look, basically createFailureWithUpdate is used in,

  • canSendHtlc, which is used in,
    • CheckHtlcForward
    • CheckHtlcTransit, which is the deadlock fixed here
  • CheckHtlcForward also directly uses it
  • processRemoteAdds

Imo processRemoteAdds is the most dangerous one. For instance, during startup when we AddLink in the switch, we will call link.Start(). In this link's startup flow, we could re-forward packets, which calls processRemoteAdds. And if an onion error occurs there, createFailureWithUpdate is called, then deadlock. Maybe there are more cases, but in general I will see if I can refactor this area while implementing dyn. Meanwhile, if we find more cases we could put bandage fixes in the rc.
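
A minimal sketch of that failure mode (illustrative names, not lnd code): the write lock is held and, on the same goroutine, a callback that needs the read lock on the same mutex ends up being invoked, which can never succeed and trips Go's runtime deadlock detector.

package main

import (
	"fmt"
	"sync"
)

func main() {
	var indexMtx sync.RWMutex

	// Analogue of failAliasUpdate: needs the read lock.
	failAliasUpdate := func() {
		indexMtx.RLock() // blocks forever: this goroutine already holds Lock
		defer indexMtx.RUnlock()
		fmt.Println("never reached")
	}

	// Analogue of AddLink: holds the write lock while (indirectly) running
	// link startup work that can end up calling the callback above.
	addLink := func(startLink func()) {
		indexMtx.Lock()
		defer indexMtx.Unlock()
		startLink()
	}

	addLink(failAliasUpdate)
}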

@yyforyongyu yyforyongyu merged commit ff32e90 into lightningnetwork:master Jul 4, 2025
35 checks passed
@ziggie1984 ziggie1984 deleted the fix-switch-deadlock branch July 4, 2025 19:08

Successfully merging this pull request may close these issues.

[bug]: all peers disconnect upon one borked connection