Previous connections stay active after reconnect, causing resource exhaustion #1435
Merged
Conversation
Force-pushed from 4ce642b to b31bf6a (…lity without too much test induced damage)
bording approved these changes on Aug 21, 2024
andreasohlund approved these changes on Aug 22, 2024
abparticular approved these changes on Aug 22, 2024
danielmarbach added a commit that referenced this pull request on Aug 22, 2024:

* Abort loop earlier if possible
* Reduce nesting
* Swap the connection when ready
* Explicit break to reduce nesting
* Nullable enable
* Move connection related stuff into the connection folder
* Adjust the channel provider design slightly to achieve better testability without too much test induced damage
* Test various races that can occur during shutdown and reconnection

Co-authored-by: Daniel Marbach <danielmarbach@users.noreply.github.com>
Symptoms
Network connections between the client and the RabbitMQ broker may be interrupted. When that happens, the transport attempts to reconnect to the broker, but during reconnection attempts previous connections may stay open, eventually exhausting resources on the client and requiring the client to be restarted.
Who's affected
All users are affected.
Root cause
Previous connections are not properly disposed, and a number of race conditions can lead to new connections being opened that remain active even after the transport has already shut down.
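To make those races concrete, here is a minimal, hypothetical C# sketch of the problematic pattern. `ChannelProvider`, `BrokerConnection`, and `OpenConnectionAsync` are illustrative stand-ins, not the transport's actual types.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative stand-in for the real broker connection type.
class BrokerConnection : IDisposable
{
    public bool IsOpen { get; private set; } = true;
    public void Dispose() => IsOpen = false;
}

class ChannelProvider
{
    BrokerConnection? connection;

    // Simplified reconnection loop exhibiting the two problems described above.
    async Task ReconnectLoopAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            var newConnection = await OpenConnectionAsync();

            // Problem 1: the previous connection is overwritten but never
            // disposed, so its underlying socket can linger on the client
            // until resources run out.
            connection = newConnection;

            // Problem 2 (race): if shutdown was requested while
            // OpenConnectionAsync was in flight, nothing disposes the
            // freshly opened connection; it stays active even though the
            // transport has already shut down.
            if (connection.IsOpen)
            {
                return;
            }

            await Task.Delay(TimeSpan.FromSeconds(5), stoppingToken);
        }
    }

    static Task<BrokerConnection> OpenConnectionAsync() =>
        Task.FromResult(new BrokerConnection());
}
```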
Backported to
Original content
Replaces:
A PR that brings the small adjustments needed to test things, plus the tests that reproduce the problem, to master
This PR fixes #1268 and quite likely eventually:
Changes required to make the code more testable
The PR attempts to strike a reasonable tradeoff between not introducing too much test-induced damage and introducing tests that make it possible to simulate the various races that can occur.
Given that the reconnection logic is not on a critical hot path, the above changes seem to be a good tradeoff.
Discussed various ideas with @bording and @andreasohlund, including a time provider, more interfaces around the connection factory, etc., each of which comes with its own issues. For example, the .NET 8 version could use TimeProvider, but the backports could not, and even with a time provider in place there would still need to be a mechanism to intercept the offloading of the reconnection loop in order to synchronize the tests in various places.
Approach
Instead of directly accessing and overriding the connection field, the new code creates new connections and atomically swaps the old connection out once the new connection is ready. This approach was chosen to preserve one of the design choices predating this change.
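A minimal sketch of that swap, assuming an `Interlocked.Exchange`-based implementation and reusing the hypothetical `BrokerConnection` stub (and `using` directives) from the sketch above; the real code differs in detail.

```csharp
class SwappingChannelProvider : IDisposable
{
    BrokerConnection? connection;

    async Task ReconnectAsync(CancellationToken stoppingToken)
    {
        // Fully establish the new connection first; callers keep using
        // (or failing against) the old one in the meantime, so no one
        // can observe a half-initialized connection.
        var newConnection = await OpenConnectionAsync();

        // Atomically swap the new connection in and capture the old one...
        var oldConnection = Interlocked.Exchange(ref connection, newConnection);

        // ...then dispose the old connection so it cannot linger and
        // exhaust client resources.
        oldConnection?.Dispose();

        // If shutdown raced with this reconnect, tear down the freshly
        // swapped-in connection as well instead of leaking it.
        if (stoppingToken.IsCancellationRequested)
        {
            Dispose();
        }
    }

    // Disposing swaps the field to null first, so a concurrent reconnect
    // and a shutdown never double-dispose a connection.
    public void Dispose() =>
        Interlocked.Exchange(ref connection, null)?.Dispose();

    static Task<BrokerConnection> OpenConnectionAsync() =>
        Task.FromResult(new BrokerConnection());
}
```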
The decision to allow publish attempts to fail when the connection is down is deliberate and aligns with the overall strategy of balancing reliability and consistency in the system. The reasoning here is that it’s quite rare for the publish connection to be down while the message pump connection remains operational, so letting the publish fail allows the system to handle the failure more predictably. By doing this, we can lean on NServiceBus’s recoverability mechanisms, like retries and the broker’s ability to resend messages, rather than introducing complex logic to handle these failures internally.
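For illustration, the kind of recoverability configuration being leaned on looks roughly like this (standard NServiceBus recoverability API; the endpoint name and retry counts are made up for the example):

```csharp
using System;
using NServiceBus;

var endpointConfiguration = new EndpointConfiguration("Sales");

var recoverability = endpointConfiguration.Recoverability();

// A failed publish throws out of the handler; immediate retries cover
// brief broker hiccups...
recoverability.Immediate(immediate => immediate.NumberOfRetries(3));

// ...and delayed retries cover longer reconnection windows before the
// message is moved to the error queue.
recoverability.Delayed(delayed =>
{
    delayed.NumberOfRetries(2);
    delayed.TimeIncrease(TimeSpan.FromSeconds(10));
});
```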
Blocking the send until the connection re-establishes was ruled out because it could lead to ghost messages: a handler blocks on sending, the message pump connection is reset in the meantime, and the outgoing message is sent even though the handler fails. This creates an inconsistent state that is harder to manage. Instead, letting the failure bubble up immediately aligns better with the overall system design, where failures are clear and the recovery path is well-defined.
Additionally, outside of handlers, it makes more sense to let the application’s existing exception handling manage these cases rather than attempting to mask the failure with retries or delays. This keeps the system behavior transparent and predictable, avoiding hidden complexities that could lead to harder-to-diagnose issues later on.