
Conversation

Collaborator

@ziggie1984 ziggie1984 commented Jul 4, 2025

Fixes #10008

In the issue above we hit writer starvation. We only acquire the RLock in:

getLocalLink, which, if the balance is not enough, creates a failure update via failAliasUpdate (which also needs the RLock). So far so good, that alone would not deadlock. However, there is a chance a writer registers its Lock in between, and if that happens the second RLock blocks behind the pending writer: we have writer starvation and parts of the system deadlock.

In the above case the server mutex is also locked, which causes almost the whole system to lock down.
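
To make the interleaving concrete, here is a minimal standalone Go sketch (illustrative only, not lnd code): one goroutine holds the read lock and later re-acquires it on the same path, while a writer registers in between. The inner RLock then blocks behind the pending writer, the writer waits for the outer RLock, and running the program trips Go's runtime deadlock detector.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var indexMtx sync.RWMutex

	// Reader path, analogous to getLocalLink -> failAliasUpdate.
	go func() {
		indexMtx.RLock() // outer read lock (getLocalLink)

		// Give the writer below time to queue its Lock().
		time.Sleep(100 * time.Millisecond)

		// Inner read lock (failAliasUpdate): blocks forever, because a
		// writer is already pending and Go's RWMutex stops new readers
		// from jumping the queue. The writer in turn waits for the
		// outer RLock to be released, so nobody makes progress.
		indexMtx.RLock()
		fmt.Println("never reached")
		indexMtx.RUnlock()
		indexMtx.RUnlock()
	}()

	// Writer path, analogous to RemoveLink/AddLink registering in between.
	time.Sleep(50 * time.Millisecond)
	indexMtx.Lock()
	fmt.Println("never reached either")
	indexMtx.Unlock()
}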

Important goroutine dump logs:

Here is the getLocalLink call:

goroutine 216074 [sync.RWMutex.RLock, 4 minutes]: 1 times: [[216074]
sync.runtime_SemacquireRWMutexR(0xc2120595219f4ec4?, 0xc0?, 0x4691180?)
        runtime/sema.go:100 +0x25
sync.(*RWMutex).RLock(...)
        sync/rwmutex.go:72
github.com/lightningnetwork/lnd/htlcswitch.(*Switch).failAliasUpdate(0xc00c3d2700, {0x0, 0x0, 0x0}, 0x0)
        github.com/lightningnetwork/lnd/htlcswitch/switch.go:2667 +0x65
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).createFailureWithUpdate(0xc010837808, 0xa7?, {0x0?, 0x20?, 0x0?}, 0x2a6a308)
        github.com/lightningnetwork/lnd/htlcswitch/link.go:874 +0x56
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).canSendHtlc(0xc010837808, {0x3e8, 0x2faf080, 0x0, 0x118, {0x0, 0xffffff4c}, 0x90}, {0xf0, 0xec, ...}, ...)
        github.com/lightningnetwork/lnd/htlcswitch/link.go:3400 +0x1fb
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).CheckHtlcTransit(0xc010837808, {0xf0, 0xec, 0xf7, 0x6d, 0x82, 0xa5, 0xb3, 0x7d, 0x0, ...}, ...)
        github.com/lightningnetwork/lnd/htlcswitch/link.go:3359 +0x191
github.com/lightningnetwork/lnd/htlcswitch.(*Switch).getLocalLink(0xc00c3d2700, 0xc0051b32c0, 0xc0130a5208)
        github.com/lightningnetwork/lnd/htlcswitch/switch.go:922 +0x227
github.com/lightningnetwork/lnd/htlcswitch.(*Switch).SendHTLC(0xc00c3d2700, {0xee84330?, 0xc5fe98d3?, 0xe90e?}, 0x3a1468f, 0xc0130a5208)
        github.com/lightningnetwork/lnd/htlcswitch/switch.go:565 +0xd8
github.com/lightningnetwork/lnd/routing.(*paymentLifecycle).sendAttempt(0xc00597bc00, 0xc0050b0e08)
        github.com/lightningnetwork/lnd/routing/payment_lifecycle.go:695 +0x4b0
github.com/lightningnetwork/lnd/routing.(*paymentLifecycle).resumePayment(0xc00597bc00, {0x2d64880, 0xc014ebfb20})
        github.com/lightningnetwork/lnd/routing/payment_lifecycle.go:308 +0x387
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).sendPayment(0xc0016261b0, {0x2d646f0?, 0x46bb800?}, 0x18376, {0xf0, 0xec, 0xf7, 0x6d, 0x82, 0xa5, ...}, ...)
        github.com/lightningnetwork/lnd/routing/router.go:1271 +0x2e9
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).SendPaymentAsync.func1()
        github.com/lightningnetwork/lnd/routing/router.go:916 +0x23e
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).SendPaymentAsync in goroutine 216016
        github.com/lightningnetwork/lnd/routing/router.go:910 +0xec

It clearly blocks on the switch's read mutex.

Which means this write lock was registered before it:

goroutine 220562 [sync.RWMutex.Lock, 4 minutes]: 1 times: [[220562]
sync.runtime_SemacquireRWMutex(0x62e8e0?, 0xf8?, 0x7fdef424c928?)
        runtime/sema.go:105 +0x25
sync.(*RWMutex).Lock(0x7ff4fd90ea68?)
        sync/rwmutex.go:153 +0x6a
github.com/lightningnetwork/lnd/htlcswitch.(*Switch).RemoveLink(0xc00c3d2700, {0x3, 0xfa, 0x5, 0xc5, 0x14, 0x84, 0xc3, 0xac, 0xa, ...})
        github.com/lightningnetwork/lnd/htlcswitch/switch.go:2342 +0x38
github.com/lightningnetwork/lnd/peer.(*Brontide).addLink(0xc010935688, 0xc002f4d0b0, 0xc0071aefc0, 0xc002f4d230, 0xc011817770, 0x1, {0x0, {{0x0, 0x0, 0x0, ...}, ...}})
        github.com/lightningnetwork/lnd/peer/brontide.go:1470 +0x993
github.com/lightningnetwork/lnd/peer.(*Brontide).loadActiveChannels(0xc010935688, {0xc012f396c8, 0x1, 0xc01096f300?})
        github.com/lightningnetwork/lnd/peer/brontide.go:1331 +0x1568
github.com/lightningnetwork/lnd/peer.(*Brontide).Start(0xc010935688)
        github.com/lightningnetwork/lnd/peer/brontide.go:884 +0x772
github.com/lightningnetwork/lnd.(*server).peerInitializer(0xc000d09508, 0xc010935688)
        github.com/lightningnetwork/lnd/server.go:4588 +0x238
created by github.com/lightningnetwork/lnd.(*server).peerConnected in goroutine 220546
        github.com/lightningnetwork/lnd/server.go:4505 +0xdff

which also holds the server mutex, which is why all the P2P connections bork.

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @ziggie1984, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical deadlock (writer starvation) within the htlcswitch component by refining the mutex locking strategy. The change ensures that a read lock on the indexMtx is released as soon as the necessary index lookups are complete, preventing it from being held during potentially long-running operations and thus improving concurrency and preventing starvation. The fix is also documented in the release notes.

Highlights

  • Concurrency Fix: The indexMtx in htlcswitch.getLocalLink is now explicitly unlocked earlier, immediately after all necessary index lookups are performed. This prevents the read lock from being held during subsequent, potentially expensive operations, which was causing writer starvation and deadlocks.
  • Deadlock Prevention: By reducing the critical section where the indexMtx is held, this change directly mitigates a known deadlock (writer starvation) issue within the switch, improving overall concurrency and stability.
  • Documentation: An entry has been added to the release-notes-0.19.2.md file, documenting the fix for the switch deadlock.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a potential deadlock (writer starvation) in the HTLC switch by releasing the read lock on s.indexMtx earlier within the getLocalLink function. The defer statement for unlocking has been replaced with explicit RUnlock calls on all execution paths before any potentially long-running operations. This change is well-reasoned and correctly implemented, ensuring the lock is always released exactly once. The accompanying release note update is also clear and accurate. The changes improve the robustness and performance of the switch. Overall, this is a solid fix.
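
For illustration, a minimal sketch of the pattern the fix applies, with made-up names rather than the actual lnd code: release the read lock right after the index lookup on every path instead of deferring it, so no long-running work runs with the lock held.

package main

import "sync"

type registry struct {
	indexMtx sync.RWMutex
	links    map[string]string
}

func (r *registry) lookupAndCheck(id string) (string, bool) {
	r.indexMtx.RLock()
	link, ok := r.links[id]
	// Release the read lock as soon as the index lookup is done, on every
	// path. A `defer r.indexMtx.RUnlock()` would keep it held across the
	// expensive check below, which is what allows a queued writer to
	// wedge the whole switch.
	r.indexMtx.RUnlock()

	if !ok {
		return "", false
	}

	// Potentially slow work (the CheckHtlcTransit analogue) now runs
	// without the read lock, so writers like AddLink/RemoveLink can make
	// progress in the meantime.
	expensiveCheck(link)

	return link, true
}

func expensiveCheck(string) {}

func main() {
	r := &registry{links: map[string]string{"chan-1": "link-1"}}
	r.lookupAndCheck("chan-1")
}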

@ziggie1984 ziggie1984 force-pushed the fix-switch-deadlock branch from 277a965 to 32351f7 on July 4, 2025 10:55
@ziggie1984 ziggie1984 self-assigned this Jul 4, 2025
@ziggie1984 ziggie1984 added this to the v0.19.2 milestone Jul 4, 2025
We now release the switch's mutex as soon as possible to avoid a
potential deadlock in the htlc switch.
Collaborator

@bhandras bhandras left a comment

LGTM 🎉

@ziggie1984 ziggie1984 force-pushed the fix-switch-deadlock branch from 32351f7 to edb7342 on July 4, 2025 11:50
Collaborator

@bitromortac bitromortac left a comment

LGTM, nice investigation 🙏

edit: just for clarification, I wouldn't call it writer starvation but a deadlock, because the Go RWMutex blocks further readers from acquiring it if a writer has registered in the meantime

@ziggie1984
Collaborator Author

ziggie1984 commented Jul 4, 2025

A sample program which illustrates that golang and its RWMutex prioritise write locks to prevent writer starvation, which basically leads to the deadlock here:

https://go.dev/play/p/zy8-FVSm87M

Member

@yyforyongyu yyforyongyu left a comment

Nice findings! Tho it looks like there are other deadlocks we need to fix🤦🏻

The problem seems to be this line,

lnd/htlcswitch/switch.go

Lines 2049 to 2050 in b3eb9a3

// Attach the Switch's failAliasUpdate function to the link.
link.attachFailAliasUpdate(s.failAliasUpdate)

failAliasUpdate requires a read lock here,

lnd/htlcswitch/switch.go

Lines 2671 to 2676 in b3eb9a3

func (s *Switch) failAliasUpdate(scid lnwire.ShortChannelID,
	incoming bool) *lnwire.ChannelUpdate1 {

	// This function does not defer the unlocking because of the database
	// lookups for ChannelUpdate.
	s.indexMtx.RLock()

but inside AddLink we require a write lock,

lnd/htlcswitch/switch.go

Lines 2029 to 2031 in b3eb9a3

func (s *Switch) AddLink(link ChannelLink) error {
	s.indexMtx.Lock()
	defer s.indexMtx.Unlock()

Then createFailureWithUpdate, which is used in multiple places, will just deadlock...

@ziggie1984
Collaborator Author

Then createFailureWithUpdate, which is used in multiple places, will just deadlock...

Hmm, could you specify the scenario in detail? I cannot see the other deadlocks right now (apart from the fact that supplying a function which requires a mutex of a whole subsystem is pretty bad design, and there are very likely more scenarios where the above can happen).

@yyforyongyu
Member

Took a quick look, basically createFailureWithUpdate is used in,

  • canSendHtlc, which is used in,
    • CheckHtlcForward
    • CheckHtlcTransit, which is the deadlock fixed here
  • CheckHtlcForward also directly uses it
  • processRemoteAdds

Imo processRemoteAdds is the most dangerous one. For instance, during startup when we AddLink in the switch, we will call link.Start(). In this link's startup flow, we could re-forward packets, which calls processRemoteAdds. And if an onion error occurs there, createFailureWithUpdate is called, then deadlock. Maybe there are more cases, but in general I will see if I can refactor this area while implementing dyn. Meanwhile, if we find more cases we could put bandage fixes in the rc.
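
A minimal sketch of that failure mode (illustrative names, not lnd code): the write lock is held and, on the same goroutine, a callback that needs the read lock on the same mutex ends up being invoked, which can never succeed and trips Go's runtime deadlock detector.

package main

import (
	"fmt"
	"sync"
)

func main() {
	var indexMtx sync.RWMutex

	// Analogue of failAliasUpdate: needs the read lock.
	failAliasUpdate := func() {
		indexMtx.RLock() // blocks forever: this goroutine already holds Lock
		defer indexMtx.RUnlock()
		fmt.Println("never reached")
	}

	// Analogue of AddLink: holds the write lock while (indirectly) running
	// link startup work that can end up calling the callback above.
	addLink := func(startLink func()) {
		indexMtx.Lock()
		defer indexMtx.Unlock()
		startLink()
	}

	addLink(failAliasUpdate)
}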

@yyforyongyu yyforyongyu merged commit ff32e90 into lightningnetwork:master Jul 4, 2025
35 checks passed
@ziggie1984 ziggie1984 deleted the fix-switch-deadlock branch July 4, 2025 19:08

Successfully merging this pull request may close these issues.

[bug]: all peers disconnect upon one borked connection