ziggie1984
Collaborator

@ziggie1984 ziggie1984 commented Jun 30, 2025

This makes sure that goroutines do not pile up when premature channel updates are received but never processed. Such updates are deleted from the prematureUpdate LRU cache once the maximum limit of 100 premature updates is reached, which evicts old channel update messages.

Depends on: lightninglabs/neutrino#322
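
To make the eviction idea concrete, here is a minimal, self-contained sketch of a bounded cache whose delete callback reports an error to whoever is waiting on an evicted message. It is not the neutrino LRU or lnd code: `boundedCache`, `cachedMsg`, `errEvicted` and `evictedCount` are hypothetical names, the cache evicts in simple insertion order, and each entry is assumed to carry one buffered error channel.

```go
package main

import (
	"errors"
	"fmt"
)

// errEvicted is a hypothetical error handed to a waiter whose message was
// dropped from the cache before it could be processed.
var errEvicted = errors.New("premature update evicted before processing")

// cachedMsg is a simplified stand-in for a cached network message. The
// single buffered error channel is an assumption made for this sketch.
type cachedMsg struct {
	errChan chan error
}

// boundedCache evicts in insertion order. It only illustrates the delete
// callback and is not a real LRU.
type boundedCache struct {
	capacity int
	order    []uint64
	items    map[uint64]*cachedMsg
	onDelete func(key uint64, msg *cachedMsg)
}

func newBoundedCache(capacity int,
	onDelete func(uint64, *cachedMsg)) *boundedCache {

	return &boundedCache{
		capacity: capacity,
		items:    make(map[uint64]*cachedMsg),
		onDelete: onDelete,
	}
}

// Put stores msg under key. A replaced entry and the oldest entry beyond
// the capacity are both passed to the delete callback.
func (c *boundedCache) Put(key uint64, msg *cachedMsg) {
	if old, ok := c.items[key]; ok {
		c.onDelete(key, old)
	} else {
		c.order = append(c.order, key)
	}
	c.items[key] = msg

	if len(c.order) > c.capacity {
		oldest := c.order[0]
		c.order = c.order[1:]
		evicted := c.items[oldest]
		delete(c.items, oldest)
		c.onDelete(oldest, evicted)
	}
}

// evictedCount inserts n distinct messages into a cache of the given
// capacity and returns how many waiters were notified through the callback.
func evictedCount(capacity, n int) int {
	notified := 0
	cache := newBoundedCache(capacity, func(_ uint64, msg *cachedMsg) {
		// The channel has capacity 1 and is written once, so this
		// send cannot block.
		msg.errChan <- errEvicted
		notified++
	})
	for i := 0; i < n; i++ {
		cache.Put(uint64(i), &cachedMsg{errChan: make(chan error, 1)})
	}

	return notified
}

func main() {
	fmt.Println("waiters notified:", evictedCount(100, 103))
}
```

With a callback like this, every message that leaves the cache unprocessed still produces exactly one write to its error channel, so a goroutine blocked on that channel can exit.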

Contributor

coderabbitai bot commented Jun 30, 2025

Important

Review skipped

Auto reviews are limited to specific labels.

🏷️ Labels to auto review (1)
  • llm-review


@ziggie1984 ziggie1984 force-pushed the fix-goroutine-leak branch from 65fb58f to 82a2d85 Compare July 1, 2025 18:49
@ziggie1984 ziggie1984 added this to the v0.19.2 milestone Jul 1, 2025
@ziggie1984 ziggie1984 self-assigned this Jul 1, 2025
@ziggie1984 ziggie1984 marked this pull request as ready for review July 1, 2025 18:52
@ziggie1984 ziggie1984 force-pushed the fix-goroutine-leak branch from 82a2d85 to 34d53d3 Compare July 1, 2025 19:25
@ziggie1984 ziggie1984 requested a review from starius July 1, 2025 19:25
@ziggie1984 ziggie1984 force-pushed the fix-goroutine-leak branch 2 times, most recently from abced28 to b870fb7 Compare July 1, 2025 23:49
maxPrematureUpdates,
lru.WithDeleteCallback(
func(k uint64, cmsg *cachedNetworkMsg) {
// for every network message which is
Member

Hmm, I see we've changed approaches. Don't we still want to ensure that the goroutine created to wait for the response is cleaned up as soon as possible?

A premature channel update is an update for a channel we don't know about. This commonly happens in the itests due to the lack of instant block propagation (one node gets the block first and sends the announcement before the other has seen the block). It can even be a zombie edge.

In the common case, we hear of the block then we can return an error and the goroutine exits. However, if it's a zombie, and stays that way for weeks, then these goroutines will still pile up. Or if it's a nearly valid, but fake channel update, we'll never actually process it.
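
The pile-up described here can be reproduced in isolation. The sketch below is illustrative only and does not use lnd code: `blockedWaiters` is a hypothetical helper that starts one goroutine per error channel, answers only some of the channels, and reports how many goroutines are still blocked afterwards.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// blockedWaiters starts `total` goroutines that each wait on their own
// error channel, writes a result to the first `resolved` channels only, and
// returns the number of goroutines that are still blocked. It assumes
// resolved <= total.
func blockedWaiters(total, resolved int) int {
	var (
		running int64
		exited  sync.WaitGroup
	)
	exited.Add(resolved)

	errChans := make([]chan error, total)
	for i := range errChans {
		errChans[i] = make(chan error, 1)
		atomic.AddInt64(&running, 1)

		go func(errChan <-chan error) {
			// Blocks forever if nobody ever writes a result.
			<-errChan

			atomic.AddInt64(&running, -1)
			exited.Done()
		}(errChans[i])
	}

	// Only the "common case" messages get reprocessed and answered.
	for i := 0; i < resolved; i++ {
		errChans[i] <- nil
	}
	exited.Wait()

	return int(atomic.LoadInt64(&running))
}

func main() {
	fmt.Println("goroutines still blocked:", blockedWaiters(1000, 900))
}
```

In a long-running process the unanswered waiters never go away on their own, which is why each message needs either a guaranteed write to its error channel or no waiting goroutine at all.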

Member

Thinking about it a bit more, this does at least restrict the number of these goroutines waiting for a premature update to be processed to maxPrematureUpdates, which currently is 100.

Collaborator Author

Exactly, I think that's the cleanest solution.

Collaborator Author

Also added the safeguard to exit the goroutine after a timeout. I think this combination should be a good temporary fix until we have the actor model in place.

Member

why do we need this if we already have remoteGossipMsgTimeout?

Collaborator Author

The remoteGossipMsgTimeout is just a safeguard against potential leaks in case developers forget to write into the error channel in the future. So it is an extra safety net to prevent goroutine leaks, because we cannot guarantee that the error channel will be used correctly in the future. That's why I added the comment that it is just temporary until we can replace this code design with the actor model proposed by roasbeef.
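
A timeout safeguard of this kind is usually a `select` over the result channel, a timer and a shutdown signal. The following is a rough sketch under assumptions, not the code from this PR: `waitForResult`, `errTimeout`, `errShutdown` and `demoTimeout` are hypothetical names, and the real remoteGossipMsgTimeout value is not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	// Hypothetical sentinel errors for this sketch.
	errTimeout  = errors.New("timed out waiting for gossip result")
	errShutdown = errors.New("shutting down")
)

// demoTimeout is an arbitrary short value chosen for the example only.
const demoTimeout = 20 * time.Millisecond

// waitForResult returns the processing result if one arrives, and otherwise
// gives up after the timeout or when quit is closed, so the caller can
// never block forever.
func waitForResult(errChan <-chan error, quit <-chan struct{},
	timeout time.Duration) error {

	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case err := <-errChan:
		return err

	case <-timer.C:
		return errTimeout

	case <-quit:
		return errShutdown
	}
}

func main() {
	quit := make(chan struct{})

	// Nobody ever writes to this channel, so only the timer can fire.
	neverWritten := make(chan error)
	fmt.Println(waitForResult(neverWritten, quit, demoTimeout))

	// A result is already buffered, so it is returned immediately.
	answered := make(chan error, 1)
	answered <- nil
	fmt.Println(waitForResult(answered, quit, demoTimeout))
}
```

The timeout bounds the lifetime of each waiter, but it does not fix a missing write to the error channel. It only limits how long the consequences last.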

Collaborator Author

removed the goroutine

@ziggie1984
Collaborator Author

running this on my node, and memory usage + goroutines are stable now, so I think that approach is good to go

@ziggie1984 ziggie1984 force-pushed the fix-goroutine-leak branch from 6692785 to de88a18 Compare July 2, 2025 10:43
Member

@yyforyongyu yyforyongyu left a comment

Note that this bug was introduced in #9875, which adds a goroutine to log the error - I think we should just remove the goroutine instead, as it only logs the error without processing it. Making a dramatic change to the LRU tool we use and adding another timeout mechanism is overkill, IMO.


peer/brontide.go Outdated
"msg %T: %v", msg,
err)
}

Member

Think we might as well remove this goroutine, as it does nothing but log the error?

Collaborator Author

As described below, yes, we can do that, but it would not solve the problem we have here. It would just mask it, and down the road, when we actually use the response, we would run into this issue again.

@ziggie1984 ziggie1984 force-pushed the fix-goroutine-leak branch from de88a18 to 32af9d5 Compare July 2, 2025 11:46
@ziggie1984
Collaborator Author

> Note that this bug was introduced in #9875, which adds a goroutine to log the error - I think we should just remove the goroutine instead, as it only logs the error without processing it. Making a dramatic change to the LRU tool we use and adding another timeout mechanism is overkill, IMO.

The idea behind this change is to fix the underlying issue: we never write to the error channel, even though we assume that this happens. See also the comment here:

lnd/discovery/gossiper.go

Lines 3137 to 3139 in 1d2e547

// NOTE: We don't return anything on the error channel for this
// message, as we expect that will be done when this
// ChannelUpdate is later reprocessed.

Moreover, it is just logging for now, but it should be enhanced in the future. See also the TODO about potentially punishing the peer.

So the introduction of the goroutine did not really cause the issue by itself, but rather revealed a deeper problem we are trying to fix with this PR.

I leave it to the majority but I would like to see this change be merged into LND.

@ziggie1984 ziggie1984 force-pushed the fix-goroutine-leak branch from 32af9d5 to ce879b7 Compare July 2, 2025 12:42
Collaborator

@starius starius left a comment

Posted a proposal and a question to verify this code doesn't introduce a deadlock.

@ziggie1984 ziggie1984 force-pushed the fix-goroutine-leak branch from ce879b7 to b4fb92a Compare July 2, 2025 13:21
@yyforyongyu
Member

> So the introduction of the goroutine did not really cause the issue by itself, but rather revealed a deeper problem we are trying to fix with this PR.

Def - but I think the root problem is that we are firing goroutines without controlling their lifecycle, which is an anti-Go thing, as we should never start a goroutine without knowing how it will stop. In addition, if we are just looking for a temporary solution here, I don't see why we can't just remove it, since we wanna have a more comprehensive fix via the actor?
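
One common way to give a goroutine a known stopping point is to tie it to an owner with a quit channel and a `sync.WaitGroup`. The sketch below only illustrates that pattern and is not taken from lnd: `resultLogger`, `Watch` and `Stop` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// resultLogger owns every goroutine it starts: each one exits either when
// its result arrives or when Stop is called.
type resultLogger struct {
	wg   sync.WaitGroup
	quit chan struct{}

	mu      sync.Mutex
	stopped int // goroutines that exited because of Stop
}

func newResultLogger() *resultLogger {
	return &resultLogger{quit: make(chan struct{})}
}

// Watch logs the result from errChan, or gives up once Stop is called.
func (l *resultLogger) Watch(errChan <-chan error) {
	l.wg.Add(1)
	go func() {
		defer l.wg.Done()

		select {
		case err := <-errChan:
			if err != nil {
				fmt.Println("gossip message failed:", err)
			}

		case <-l.quit:
			l.mu.Lock()
			l.stopped++
			l.mu.Unlock()
		}
	}()
}

// Stop signals all watchers to exit, waits for them, and returns how many
// were still waiting for a result.
func (l *resultLogger) Stop() int {
	close(l.quit)
	l.wg.Wait()

	l.mu.Lock()
	defer l.mu.Unlock()

	return l.stopped
}

func main() {
	logger := newResultLogger()
	for i := 0; i < 5; i++ {
		// These channels are never written to.
		logger.Watch(make(chan error))
	}
	fmt.Println("watchers released on stop:", logger.Stop())
}
```

This guarantees an exit at shutdown, but it does not limit how many watchers accumulate while the process is running. That limitation is one reason removing the goroutine entirely is the simpler temporary fix.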

@ziggie1984 ziggie1984 force-pushed the fix-goroutine-leak branch 2 times, most recently from b7c81d7 to fc0dcfb Compare July 2, 2025 14:52
@ziggie1984 ziggie1984 requested review from yyforyongyu and starius July 2, 2025 14:53
@ziggie1984
Collaborator Author

OK, I think you have a valid point. I removed the goroutine when processing network responses and also fixed a potential race condition.

// premature ChannelUpdates. These are pointers and we might in the
// meantime receive new premature ChannelUpdate for this exact channel
// which will read also the same premature ChannelUpdates currently in
// the LRU cache.
Member

hmmm...I think the cache uses sync map under the hood?

Collaborator Author

OK, I think you are right: because we only read the *cachedNetworkMsg value, we are concurrency-safe.

@ziggie1984 ziggie1984 force-pushed the fix-goroutine-leak branch 2 times, most recently from 7959ccf to 024ae3a Compare July 2, 2025 16:13
@yyforyongyu
Member

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The code changes prevent a goroutine leak in the brontide by removing the waiting goroutine and adding comments to explain the change. The review suggests improvements to the comments for better context.

Member

@Roasbeef Roasbeef left a comment

LGTM 🧆

@ziggie1984 ziggie1984 force-pushed the fix-goroutine-leak branch from 024ae3a to 699c097 Compare July 3, 2025 04:25
We cannot rely on a response currently so we avoid spawning
goroutines. This is just a temporary fix to avoid the goroutine
leak.
@ziggie1984 ziggie1984 force-pushed the fix-goroutine-leak branch from 699c097 to e6aff21 Compare July 3, 2025 04:28
@ziggie1984 ziggie1984 requested a review from yyforyongyu July 3, 2025 04:29
Member

@yyforyongyu yyforyongyu left a comment

LGTM🚢

@yyforyongyu yyforyongyu merged commit ffd944e into lightningnetwork:master Jul 3, 2025
37 of 40 checks passed

4 participants