multi: prevent goroutine leak in brontide #10012
Conversation
Force-pushed from 65fb58f to 82a2d85.
Force-pushed from 82a2d85 to 34d53d3.
Force-pushed from abced28 to b870fb7.
discovery/gossiper.go (outdated diff context):
    maxPrematureUpdates,
    lru.WithDeleteCallback(
        func(k uint64, cmsg *cachedNetworkMsg) {
            // for every network message which is
Hmm, I see we've changed approaches. Don't we still want to ensure that the goroutine created to wait for the response is cleaned up as soon as possible?
A premature channel update is an update for a channel we don't know about. It's common in the itests due to the lack of instant block propagation (one node gets the block first and sends the announcement before the other has seen the block). It can even be a zombie edge.
In the common case, we hear of the block, then we can return an error and the goroutine exits. However, if it's a zombie, and stays that way for weeks, then these goroutines will still pile up. Or, if it's a nearly valid but fake channel update, we'll never actually process it.
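For readers following along, here is a minimal, self-contained sketch of the leak pattern being described. The function name and types are placeholders, not lnd's actual API: one goroutine blocks per message on a reply channel that, for a premature update, may never be written to.

```go
package main

import (
	"fmt"
	"time"
)

// processRemoteAnnouncement is a stand-in for the gossiper call: it returns
// an error channel that is only written to once the message is fully
// processed. Names here are hypothetical, not lnd's real API.
func processRemoteAnnouncement(premature bool) chan error {
	errChan := make(chan error, 1)
	if !premature {
		errChan <- nil // Processed right away.
	}
	// A premature update is cached and the reply is deferred; it may
	// never arrive if the channel stays unknown (e.g. a zombie edge).
	return errChan
}

func main() {
	for i := 0; i < 5; i++ {
		errChan := processRemoteAnnouncement(true /* premature */)

		// The leak: one goroutine per premature message, blocked on
		// a receive that may never happen.
		go func(i int) {
			if err := <-errChan; err != nil {
				fmt.Printf("msg %d: %v\n", i, err)
			}
		}(i)
	}

	time.Sleep(100 * time.Millisecond)
	// At this point the five reader goroutines above are still parked.
}
```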
Thinking about it a bit more, this does at least restrict the number of these goroutines waiting for a premature update to be processed to maxPrematureUpdates, which is currently 100.
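To illustrate that bound, here is a toy sketch of a size-limited cache whose eviction callback replies on the evicted waiter's buffered error channel. The real PR uses the neutrino lru package's delete callback (see the diff context above), so every type and function below is hypothetical, and the toy limit of 2 stands in for maxPrematureUpdates = 100.

```go
package main

import (
	"errors"
	"fmt"
)

// cachedMsg pairs a premature update with the channel its sender may be
// waiting on. Hypothetical type, for illustration only.
type cachedMsg struct {
	err chan error
}

// boundedCache is a toy FIFO-bounded cache approximating the lru behavior.
type boundedCache struct {
	max     int
	order   []uint64
	items   map[uint64]*cachedMsg
	onEvict func(k uint64, m *cachedMsg)
}

func newBoundedCache(max int, onEvict func(uint64, *cachedMsg)) *boundedCache {
	return &boundedCache{
		max:     max,
		items:   make(map[uint64]*cachedMsg),
		onEvict: onEvict,
	}
}

func (c *boundedCache) put(k uint64, m *cachedMsg) {
	if len(c.order) >= c.max {
		oldest := c.order[0]
		c.order = c.order[1:]
		if victim, ok := c.items[oldest]; ok {
			delete(c.items, oldest)
			// The eviction callback replies on the waiter's
			// buffered channel so its goroutine can exit.
			c.onEvict(oldest, victim)
		}
	}
	c.order = append(c.order, k)
	c.items[k] = m
}

func main() {
	cache := newBoundedCache(2, func(k uint64, m *cachedMsg) {
		m.err <- errors.New("evicted before the channel was announced")
	})

	first := &cachedMsg{err: make(chan error, 1)}
	cache.put(1, first)
	cache.put(2, &cachedMsg{err: make(chan error, 1)})
	cache.put(3, &cachedMsg{err: make(chan error, 1)}) // evicts key 1

	fmt.Println("waiter for update 1 sees:", <-first.err)
}
```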
Exactly, I think that's the cleanest solution.
Also added a safeguard to exit the goroutine after a timeout. I think this combination should be a good temporary fix until we have the actor model in place.
Why do we need this if we already have remoteGossipMsgTimeout?
The remoteGossipMsgTimeout is just a safeguard to prevent leaks in case developers forget to write into the error channel in the future. So it is an extra safety net against goroutine leaks, because we cannot guarantee the error chan is always used going forward. That's why I added the comment that this is just temporary, until we can replace this code design with the actor model proposed by roasbeef.
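A small runnable sketch of that safety-net shape: the waiting goroutine exits on a reply, on shutdown, or after a timeout, so a forgotten send on the error channel can no longer leak it. The channel names and the two-second timeout are placeholders, not the actual remoteGossipMsgTimeout value or lnd's identifiers.

```go
package main

import (
	"log"
	"time"
)

func main() {
	// Placeholders for what the peer would have in scope.
	errChan := make(chan error, 1)
	quit := make(chan struct{})
	const gossipReplyTimeout = 2 * time.Second

	done := make(chan struct{})
	go func() {
		defer close(done)
		select {
		case err := <-errChan:
			if err != nil {
				log.Printf("unable to process msg: %v", err)
			}
		case <-time.After(gossipReplyTimeout):
			// Safety net: even if nothing ever writes to errChan,
			// the goroutine exits instead of leaking.
			log.Printf("timed out waiting for gossiper reply")
		case <-quit:
			// Peer shutdown.
		}
	}()

	<-done // Here the timeout branch fires after two seconds.
}
```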
Removed the goroutine.
Running this on my node, memory usage and goroutine counts are stable now, so I think that approach is good to go.
Force-pushed from 6692785 to de88a18.
Note that this bug was introduced in #9875, which added a goroutine to log the error. I think we should just remove the goroutine instead, as it only logs the error but does not process it. Making a dramatic change to the lru tool we use, plus another timeout mechanism, is overkill IMO.
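A compact sketch of the suggested simplification, with a stand-in for the gossiper call (the real lnd signature differs): the "before" shape spawns a reader goroutine that only logs, the "after" shape simply does not wait.

```go
package main

import "log"

// processRemoteAnnouncement stands in for the gossiper API; the real
// signature in lnd differs, this is only the shape of the fix.
func processRemoteAnnouncement(msg any) chan error {
	errChan := make(chan error, 1)
	// ... message is queued for processing; errChan may never be
	// written to if the update stays premature.
	return errChan
}

func handleMsg(msg any) {
	// Before (roughly the shape added in #9875): a reader goroutine
	// that only logs, and leaks when no reply ever arrives.
	//
	//   errChan := processRemoteAnnouncement(msg)
	//   go func() {
	//       if err := <-errChan; err != nil {
	//           log.Printf("unable to process msg %T: %v", msg, err)
	//       }
	//   }()

	// After (the suggestion above): drop the reader entirely. The
	// gossiper already logs processing failures on its side.
	_ = processRemoteAnnouncement(msg)
}

func main() {
	handleMsg(struct{}{})
	log.Println("no reader goroutine spawned")
}
```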
peer/brontide.go (outdated diff context):
        "msg %T: %v", msg,
        err)
    }
Think we might as well remove this goroutine, as it does nothing but log the error?
As described below, yes, we can do that, but it would not solve the problem we have here; it would just hide it, and down the road, when we actually use the response, we would run into this issue again.
Force-pushed from de88a18 to 32af9d5.
The idea behind this change is to fix the issue that we never write into the error channel, which we assume should always happen; see also the comment here: Lines 3137 to 3139 in 1d2e547.
Moreover, it is just logging for now, but should be enhanced in the future; see also the TODO about potentially punishing the peer. So the goroutine did not really introduce the issue by itself but rather revealed a deeper problem, which we are trying to fix with this PR. I leave it to the majority, but I would like to see this change merged into LND.
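As an illustration of the contract referenced here (every message's error channel should eventually receive exactly one send), a hypothetical handler that uses a deferred send so no return path can forget to reply. None of these names come from lnd.

```go
package main

import (
	"errors"
	"fmt"
)

// handleAnnouncement sketches the contract: every message handed in with a
// reply channel gets exactly one send on it, on every code path.
func handleAnnouncement(known bool, errChan chan<- error) {
	// A buffered reply channel plus a single deferred send makes it
	// hard to forget a code path.
	var result error
	defer func() { errChan <- result }()

	if !known {
		// This is the kind of path that previously dropped the
		// reply, leaving any reader blocked forever.
		result = errors.New("premature update: channel unknown")
		return
	}

	// ... normal processing ...
}

func main() {
	errChan := make(chan error, 1)
	handleAnnouncement(false, errChan)
	fmt.Println(<-errChan)
}
```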
Force-pushed from 32af9d5 to ce879b7.
Posted a proposal and a question to verify this code doesn't introduce a deadlock.
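The deadlock question boils down to whether the producer side can block once nobody reads the reply channel anymore. A tiny sketch, assuming the gossiper hands back a channel with capacity 1 (an assumption for illustration, not verified here):

```go
package main

import "fmt"

func main() {
	// With a buffered channel of capacity 1 the send below completes
	// even though the value is never consumed.
	buffered := make(chan error, 1)
	buffered <- nil // does not block
	fmt.Println("buffered send completed without a reader")

	// With an unbuffered channel the same send would block forever, so
	// dropping the reader is only safe if the producer side uses a
	// buffered channel (or a non-blocking send).
	//
	//   unbuffered := make(chan error)
	//   unbuffered <- nil // would deadlock here
}
```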
Force-pushed from ce879b7 to b4fb92a.
Def - but I think the root problem is that we are firing goroutines without controlling their lifecycle, which is anti-Go: we should never start a goroutine without knowing how it will stop. In addition, if we are just looking for a temporary solution here, I don't see why we can't just remove it, since we want a more comprehensive fix via the actor model anyway.
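For context on the "never start a goroutine without knowing how it will stop" point, a generic Go pattern (not lnd code) that ties every goroutine to a quit signal and a WaitGroup so shutdown can prove nothing is left running:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// worker runs until the shared quit channel is closed; the WaitGroup lets
// the caller wait for all workers to have actually returned.
func worker(id int, quit <-chan struct{}, wg *sync.WaitGroup) {
	defer wg.Done()
	for {
		select {
		case <-time.After(50 * time.Millisecond):
			fmt.Printf("worker %d tick\n", id)
		case <-quit:
			fmt.Printf("worker %d stopped\n", id)
			return
		}
	}
}

func main() {
	var wg sync.WaitGroup
	quit := make(chan struct{})

	for i := 0; i < 3; i++ {
		wg.Add(1)
		go worker(i, quit, &wg)
	}

	time.Sleep(120 * time.Millisecond)
	close(quit)
	wg.Wait() // Guaranteed: no worker goroutine outlives this point.
}
```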
Force-pushed from b7c81d7 to fc0dcfb.
OK, I think you have a valid point. Removed the goroutine when processing network responses and also fixed a potential race condition.
discovery/gossiper.go (outdated diff context):
    // premature ChannelUpdates. These are pointers, and we might in the
    // meantime receive a new premature ChannelUpdate for this exact channel
    // which will also read the same premature ChannelUpdates currently in
    // the LRU cache.
Hmmm... I think the cache uses a sync.Map under the hood?
OK, I think you are right: because we only read the *cachedNetworkMsg value, we are concurrency safe.
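A small sketch of why read-only access to the shared pointer is safe, assuming the cache behaves like a sync.Map (as suggested above). The type and field are illustrative; mutating the pointed-to struct from these goroutines would be a data race, reading it is not.

```go
package main

import (
	"fmt"
	"sync"
)

type cachedNetworkMsg struct {
	// Illustrative field; the real struct in lnd holds the queued
	// premature messages.
	numMsgs int
}

func main() {
	var cache sync.Map
	cache.Store(uint64(42), &cachedNetworkMsg{numMsgs: 3})

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v, ok := cache.Load(uint64(42))
			if !ok {
				return
			}
			// Concurrent reads of the shared pointer are fine as
			// long as nobody mutates the struct it points to;
			// writing cmsg.numMsgs here would be a data race.
			cmsg := v.(*cachedNetworkMsg)
			fmt.Println("cached msgs:", cmsg.numMsgs)
		}()
	}
	wg.Wait()
}
```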
Force-pushed from 7959ccf to 024ae3a.
/gemini review
Code Review
The code changes prevent a goroutine leak in the brontide package by removing the waiting goroutine and adding comments to explain the change. The review suggests improvements to the comments for better context.
LGTM 🧆
Force-pushed from 024ae3a to 699c097.
We currently cannot rely on a response, so we avoid spawning goroutines. This is just a temporary fix to avoid the goroutine leak.
Force-pushed from 699c097 to e6aff21.
LGTM 🚢
This makes sure that goroutines do not pile up when premature channel updates are received that are never processed: they are deleted from the premature-update LRU cache once the maximum limit of 100 premature updates is reached, at which point old channel update messages get evicted.
Depends on: lightninglabs/neutrino#322