@danielmarbach danielmarbach commented Aug 19, 2024

Symptoms

Network connections between the client and the RabbitMQ broker may be interrupted. During interruptions the transport attempts to reconnect to the broker. During reconnection attempts, previous connections may stay open and eventually lead to resource exhaustion on the client, which requires the client to be restarted.

Who's affected

All users are affected.

Root cause

Previous connections are not properly disposed of, and a number of race conditions can lead to new connections being opened that remain active even after the transport has already been shut down.

Backported to

Original content

Replaces:

A PR that brings the small adjustments needed to make things testable, plus the tests, to master in order to reproduce the problem on master


This PR fixes #1268 and quite likely, eventually, the following as well:

Changes required to make the code more testable

The PR attempts to strike a good-enough tradeoff: avoiding too much test-induced damage while introducing tests that allow simulating the various races that can occur.

  • Unseal the channel provider
  • Allow overriding the connection factory
  • Allow intercepting the offloading of the task
  • Allow overriding the delay mechanism

Given that the reconnection logic is not on a critical hot path, the above changes seem to be a good tradeoff.
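The hooks listed above could look roughly like this. This is a hypothetical sketch with illustrative names, not the actual NServiceBus.RabbitMQ members: unsealing the class and making the collaborators virtual lets a test subclass substitute each one.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative shape of the testability hooks (assumed names, not the real API).
public class ChannelProvider // no longer sealed, so tests can derive from it
{
    // Tests can return a fake connection instead of dialing the broker.
    protected virtual ValueTask<IDisposable> CreateConnection(CancellationToken token)
        => throw new NotImplementedException("real code dials the broker here");

    // Tests can observe and await the offloaded reconnection loop
    // to synchronize on specific points of the race.
    protected virtual Task OffloadReconnection(Func<Task> reconnectLoop)
        => Task.Run(reconnectLoop);

    // Tests can replace the retry delay with an already-completed task.
    protected virtual Task DelayReconnect(TimeSpan delay, CancellationToken token)
        => Task.Delay(delay, token);
}
```

A test double can then override `OffloadReconnection` to capture the loop's task and `DelayReconnect` to return `Task.CompletedTask`, which makes the shutdown/reconnect races deterministic to drive.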

Discussed various ideas with @bording and @andreasohlund, including a time provider, more interfaces around the connection factory, etc., each of which comes with its own issues. For example, the .NET 8 version could use TimeProvider, but backports would not be able to, and even with TimeProvider in place you would still need a mechanism to intercept the offloading of the reconnection loop in order to synchronize the tests in various places.

Approach

Instead of directly accessing and overriding the connection field, the new code creates new connections and atomically swaps the old connection out when the new connection is ready.

This approach was chosen to preserve one of the design choices predating this change.
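The swap-when-ready pattern described above can be sketched as follows. This is a minimal illustration with assumed types, not the actual transport code: the new connection is atomically published only once it is fully established, and the previous connection is then disposed so it cannot keep running in the background.

```csharp
using System;
using System.Threading;

// Assumed connection abstraction for the sketch.
public interface IBrokerConnection : IDisposable
{
    bool IsOpen { get; }
}

public class ConnectionHolder
{
    IBrokerConnection? current;

    public void SwapWhenReady(IBrokerConnection fresh)
    {
        if (!fresh.IsOpen)
        {
            throw new InvalidOperationException("only swap in a ready connection");
        }

        // Atomically publish the new connection, then dispose the old one
        // so it cannot leak even if another thread raced on the field.
        var previous = Interlocked.Exchange(ref current, fresh);
        previous?.Dispose();
    }
}
```

`Interlocked.Exchange` guarantees that exactly one reference to the old connection is observed by the disposing code, which is what closes the leak window described in the root cause.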

The decision to allow publish attempts to fail when the connection is down is deliberate and aligns with the overall strategy of balancing reliability and consistency in the system. The reasoning here is that it’s quite rare for the publish connection to be down while the message pump connection remains operational, so letting the publish fail allows the system to handle the failure more predictably. By doing this, we can lean on NServiceBus’s recoverability mechanisms, like retries and the broker’s ability to resend messages, rather than introducing complex logic to handle these failures internally.

Blocking the send to wait for the connection to re-establish was ruled out because it could lead to ghost messages. These would occur if a handler blocks on sending, but then the message pump connection is reset, leading to a situation where the message is sent, but the handler fails. This creates an inconsistent state that is harder to manage. Instead, letting the failure bubble up immediately aligns better with the overall system design, where failures are clear, and the recovery path is well-defined.

Additionally, outside of handlers, it makes more sense to let the application’s existing exception handling manage these cases rather than attempting to mask the failure with retries or delays. This keeps the system behavior transparent and predictable, avoiding hidden complexities that could lead to harder-to-diagnose issues later on.

@danielmarbach danielmarbach changed the title Stop loop deterministic dispose Improve reliability of channel provider in case of reconnects Aug 19, 2024
@danielmarbach danielmarbach merged commit b2a42fb into master Aug 22, 2024
3 checks passed
@danielmarbach danielmarbach deleted the stop-loop-deterministic-dispose branch August 22, 2024 06:23
danielmarbach added a commit that referenced this pull request Aug 22, 2024
* Abort loop earlier if possible

* Reduce nesting

* Swap the connection when ready

* Explicit break to reduce nesting

* Nullable enable

* Move connection related stuff into the connection folder

* Adjust the channel provider design slightly to achieve better testability without too much test induced damage

* Test various races that can occur during shutdown and reconnection

---------

Co-authored-by: Daniel Marbach <danielmarbach@users.noreply.github.com>
@danielmarbach danielmarbach changed the title Improve reliability of channel provider in case of reconnects Prevent connections from leaking during broker reconnects Aug 22, 2024
@danielmarbach danielmarbach changed the title Prevent connections from leaking during broker reconnects Prevent connections from leaking in case of reconnects Aug 22, 2024
@danielmarbach danielmarbach changed the title Prevent connections from leaking in case of reconnects Previous connections stay active after reconnect, causing resource exhaustion Aug 22, 2024
Development

Successfully merging this pull request may close these issues.

ChannelProvider Publish reconnection attempts continue forever after IEndpointInstance had been stopped
4 participants