@danielmarbach danielmarbach commented Aug 19, 2024

Symptoms

Network connections between the client and the RabbitMQ broker may be interrupted. During interruptions the transport attempts to reconnect to the broker. During reconnection attempts, previous connections may stay open and eventually lead to resource exhaustion on the client, which requires the client to be restarted.

Who's affected

All users are affected.

Root cause

Previous connections are not properly disposed of, and a number of race conditions can lead to new connections being opened that remain active even after the transport has already been shut down.

Backported to

Original content

Replaces:

A PR that brings the small adjustments needed to make things testable, plus the tests, to master in order to reproduce the problem on master


This PR fixes #1268 and quite likely, eventually, the following as well:

Changes required to make the code more testable

The PR attempts to strike a good-enough tradeoff: avoiding too much test-induced damage while introducing tests that allow simulating the various races that can occur.

  • Unseal the channel provider
  • Allow overriding the connection factory
  • Allow intercepting the offloading of the task
  • Allow overriding the delay mechanism

Given that the reconnection logic is not on a critical hot path, the above changes seem to be a good tradeoff.
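The hooks listed above could look roughly like this. This is a hypothetical sketch with illustrative names, not the actual NServiceBus.RabbitMQ members: unsealing the class and making the collaborators virtual lets a test subclass substitute each one.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative shape of the testability hooks (assumed names, not the real API).
public class ChannelProvider // no longer sealed, so tests can derive from it
{
    // Tests can return a fake connection instead of dialing the broker.
    protected virtual ValueTask<IDisposable> CreateConnection(CancellationToken token)
        => throw new NotImplementedException("real code dials the broker here");

    // Tests can observe and await the offloaded reconnection loop
    // to synchronize on specific points of the race.
    protected virtual Task OffloadReconnection(Func<Task> reconnectLoop)
        => Task.Run(reconnectLoop);

    // Tests can replace the retry delay with an already-completed task.
    protected virtual Task DelayReconnect(TimeSpan delay, CancellationToken token)
        => Task.Delay(delay, token);
}
```

A test double can then override `OffloadReconnection` to capture the loop's task and `DelayReconnect` to return `Task.CompletedTask`, which makes the shutdown/reconnect races deterministic to drive.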

Discussed various ideas with @bording and @andreasohlund, including a time provider, more interfaces around the connection factory, etc., each of which comes with its own issues. For example, the .NET 8 version could use TimeProvider, but backports would not be able to, and even with TimeProvider in place you would still need a mechanism to intercept the offloading of the reconnection loop in order to synchronize the tests in various places.

Approach

Instead of directly accessing and overriding the connection field, the new code creates new connections and atomically swaps the old connection out when the new connection is ready.

This approach was chosen to preserve one of the design choices predating this change.
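The swap-when-ready pattern described above can be sketched as follows. This is a minimal illustration with assumed types, not the actual transport code: the new connection is atomically published only once it is fully established, and the previous connection is then disposed so it cannot keep running in the background.

```csharp
using System;
using System.Threading;

// Assumed connection abstraction for the sketch.
public interface IBrokerConnection : IDisposable
{
    bool IsOpen { get; }
}

public class ConnectionHolder
{
    IBrokerConnection? current;

    public void SwapWhenReady(IBrokerConnection fresh)
    {
        if (!fresh.IsOpen)
        {
            throw new InvalidOperationException("only swap in a ready connection");
        }

        // Atomically publish the new connection, then dispose the old one
        // so it cannot leak even if another thread raced on the field.
        var previous = Interlocked.Exchange(ref current, fresh);
        previous?.Dispose();
    }
}
```

`Interlocked.Exchange` guarantees that exactly one reference to the old connection is observed by the disposing code, which is what closes the leak window described in the root cause.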

The decision to allow publish attempts to fail when the connection is down is deliberate and aligns with the overall strategy of balancing reliability and consistency in the system. The reasoning here is that it’s quite rare for the publish connection to be down while the message pump connection remains operational, so letting the publish fail allows the system to handle the failure more predictably. By doing this, we can lean on NServiceBus’s recoverability mechanisms, like retries and the broker’s ability to resend messages, rather than introducing complex logic to handle these failures internally.

Blocking the send to wait for the connection to re-establish was ruled out because it could lead to ghost messages. These would occur if a handler blocks on sending, but then the message pump connection is reset, leading to a situation where the message is sent, but the handler fails. This creates an inconsistent state that is harder to manage. Instead, letting the failure bubble up immediately aligns better with the overall system design, where failures are clear, and the recovery path is well-defined.

Additionally, outside of handlers, it makes more sense to let the application’s existing exception handling manage these cases rather than attempting to mask the failure with retries or delays. This keeps the system behavior transparent and predictable, avoiding hidden complexities that could lead to harder-to-diagnose issues later on.

@danielmarbach danielmarbach changed the title Stop loop deterministic dispose Improve reliability of channel provider in case of reconnects Aug 19, 2024
@danielmarbach danielmarbach merged commit b2a42fb into master Aug 22, 2024
3 checks passed
@danielmarbach danielmarbach deleted the stop-loop-deterministic-dispose branch August 22, 2024 06:23
danielmarbach added a commit that referenced this pull request Aug 22, 2024
* Abort loop earlier if possible

* Reduce nesting

* Swap the connection when ready

* Explicit break to reduce nesting

* Nullable enable

* Move connection related stuff into the connection folder

* Adjust the channel provider design slightly to achieve better testability without too much test induced damage

* Test various races that can occur during shutdown and reconnection

---------

Co-authored-by: Daniel Marbach <danielmarbach@users.noreply.github.com>
@danielmarbach danielmarbach changed the title Improve reliability of channel provider in case of reconnects Prevent connections from leaking during broker reconnects Aug 22, 2024
@danielmarbach danielmarbach changed the title Prevent connections from leaking during broker reconnects Prevent connections from leaking in case of reconnects Aug 22, 2024
@danielmarbach danielmarbach changed the title Prevent connections from leaking in case of reconnects Previous connections stay active after reconnect, causing resource exhaustion Aug 22, 2024
Development

Successfully merging this pull request may close these issues.

ChannelProvider Publish reconnection attempts continue forever after IEndpointInstance had been stopped
4 participants