
Commit 1188d66

Make sure to first run cleanup, then kill the cleaner process
While investigating streaming RPC worker timeouts, we noticed that in our cleanup logic there is a tiny chance of leaking workers if the coordinator process is killed right while it is running cleanup: after it stops the cleaner process and before it sends the `kill_all` messages. To fix it, first run cleanup, then kill the cleaner process.
1 parent 0e674f6 commit 1188d66
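
To make the race easier to see, here is a minimal, self-contained Erlang sketch of the coordinator/cleaner arrangement described above. It is not the actual fabric_streams implementation: the module name stream_cleanup_sketch and the functions run/1, spawn_cleaner/1, and kill_workers/1 are invented for illustration, and plain exit signals stand in for the real kill messages that fabric_util:cleanup/1 sends to worker nodes. Only the cleanup/1 shape and the ?WORKER_CLEANER process-dictionary key mirror the diff below.

%% A minimal sketch of the coordinator/cleaner arrangement, for illustration
%% only. Names and message shapes are simplified stand-ins.
-module(stream_cleanup_sketch).
-export([run/1]).

-define(WORKER_CLEANER, worker_cleaner).

run(Workers) ->
    %% The cleaner monitors the coordinator (the calling process) and reaps
    %% the workers if the coordinator dies while the stream is in flight.
    put(?WORKER_CLEANER, spawn_cleaner(Workers)),
    %% ... streaming work would happen here ...
    cleanup(Workers).

%% Regular-path cleanup. The ordering is the point of the fix: kill the
%% workers first, then stop the cleaner. If the coordinator itself is killed
%% anywhere inside this function, the workers are still reaped, either by the
%% kill_workers/1 call below or by the cleaner, which stays alive until the
%% very end.
cleanup(Workers) ->
    Res = kill_workers(Workers),
    case get(?WORKER_CLEANER) of
        CleanerPid when is_pid(CleanerPid) ->
            erase(?WORKER_CLEANER),
            exit(CleanerPid, kill);
        _ ->
            ok
    end,
    Res.

spawn_cleaner(Workers) ->
    Coordinator = self(),
    spawn(fun() ->
        MRef = erlang:monitor(process, Coordinator),
        receive
            {'DOWN', MRef, process, Coordinator, _Reason} ->
                kill_workers(Workers)
        end
    end).

kill_workers(Workers) ->
    [exit(W, kill) || W <- Workers],
    ok.

With the previous ordering (stop the cleaner first, then send the kill messages), a kill signal landing between those two steps would leave the workers running with nobody left to reap them; running cleanup first closes that window.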

1 file changed
src/fabric/src/fabric_streams.erl

Lines changed: 10 additions & 3 deletions
@@ -95,16 +95,23 @@ start(Workers0, Keypos, StartFun, Replacements, RingOpts) ->
 
 cleanup(Workers) ->
     % Stop the auxiliary cleaner process as we got to the point where cleanup
-    % happesn in the regular fashion so we don't want to send 2x the number kill
-    % messages
+    % happens in the regular fashion and we don't want to send 2x the number
+    % of kill messages.
+    %
+    % First, we run the cleanup/1 function, then we stop the cleaner;
+    % otherwise there is a tiny risk we get killed after we stop the process
+    % and before we finish calling cleanup/1. This early, forced process kill
+    % may happen when running the recovery logic in the ddoc cache.
+    %
+    Res = fabric_util:cleanup(Workers),
     case get(?WORKER_CLEANER) of
         CleanerPid when is_pid(CleanerPid) ->
             erase(?WORKER_CLEANER),
             exit(CleanerPid, kill);
         _ ->
             ok
     end,
-    fabric_util:cleanup(Workers).
+    Res.
 
 handle_stream_start({rexi_DOWN, _, {_, NodeRef}, _}, _, St) ->
     #stream_acc{workers = Workers, ready = Ready, ring_opts = RingOpts} = St,
