
session::Client_session::sync_connect() can block for seconds given certain user-code behavior on server side. #27

@ygoldfeld

Description

[ CC - @wkorbe who ran into this (albeit I believe he was intentionally trying to simulate a "stuck process") when integrating Flow-IPC into a real application. ]

The ipc::session::Client_session::sync_connect() overload is (as of this writing) the way a session-client initially connects to the opposing process (i.e., the session-server). So in the opposing process one normally performs:

// Construct it: this places a CNS (PID) file in the proper location and "begins listening."
ipc::session::Session_server /* <-- or similar */ srv(...);
// Now or later... accept a connection (or wait for one and then accept).
srv.async_accept(on_connect_func); // on_connect_func() shall receive a ready-to-go Session.

The expected behavior (contract) of Client_session::sync_connect() is, either:

  • fail immediately (non-blocking), as indicated by Error_code/exception; or
  • succeed immediately (non-blocking), as indicated by falsy Error_code/no exception.
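
For reference, the client-side counterpart looks roughly like this (a sketch only; the ctor arguments are elided, and the exact error-reporting signature is approximate):

// Construct it: this identifies the opposing server via the shared App structs.
ipc::session::Client_session /* <-- or similar */ cli(...);
// Attempt to connect; per the contract this shall fail or succeed without blocking.
Error_code err; // ipc::Error_code or similar
cli.sync_connect(&err);
if (err) { /* Report/retry later. */ } else { /* cli is a ready-to-go Session. */ }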

The short version is, this contract breaks in this situation and manner:

  • Session_server is properly constructed; but there is no .async_accept() in progress.
  • sync_connect() should either fail or succeed quickly (the docs, formally speaking, don't make clear which) - non-blockingly - but instead it blocks until an .async_accept() does get called, or 5 seconds pass, whichever happens first.

The work-around is "just" to not do this on the server side: Only construct the Session_server once ready to actually accept a session. (After that, the continuing "work-around" is essentially to act in the standard fashion recommended in the Guided Manual. Namely, if the server is simple and only accepts one session at a time, then clearly the client side shouldn't attempt to sync_connect() again until the first session dies; and once it does, the session-server should either shut down or do another .async_accept(). If the server is more complex - capable of concurrent sessions - then it should .async_accept() again immediately upon an .async_accept() callback being invoked. Assuming no improper blocking within this loop on the server, there is no problem.)
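
To make that concrete, here is a server-side sketch of the work-around pattern; it mirrors the snippet above, with argument details elided and the callback wiring only outlined in comments:

// Work-around pattern (sketch): construct the Session_server only once actually ready to accept...
ipc::session::Session_server /* <-- or similar */ srv(...);
// ...and kick off an accept right away, so no client-side sync_connect() can get stuck waiting.
srv.async_accept(on_connect_func);
// In on_connect_func(): handle the new Session non-blockingly; then - once that session dies
// (one-session-at-a-time server) or immediately (concurrent-sessions server) - call
// srv.async_accept() again.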

So essentially the "work-around" is likely to be how one would code the server side anyway. However, that's not 100% true: it is fairly natural to construct a thing early on and only use it (.async_accept() here) somewhat later, after potentially lengthy other setup. At that point, constructing it only once ready to potentially accept becomes in fact a work-around.

Naturally, work-arounds aside, it is not good to break the documented contract formally speaking - and, as shown just above, practically speaking too.

SIDE NOTE: From memory, I (@ygoldfeld) think I have a to-do somewhere in the code about this; I am 50% certain of that. If so, I should have filed an issue immediately but neglected to do so (human error, unintentional).

--

How to solve it internally?

The reason for this is not too difficult to understand. The session-connect procedure basically is:

  • Server places a CNS (PID) file into a location based on App::m_name, which both sides know and pass (in the App struct) to the Session_server and Client_session ctors respectively.
  • Server then immediately (also in the ctor) sets up a Unix domain server socket (ipc::transport::Native_socket_stream_acceptor) which immediately listen()s. That is, incoming socket connect()s are queued by the kernel, so each low-level socket connect() immediately succeeds from that point on.
  • Then, client and server engage in a fairly complex message exchange to set up many things like channels and SHM. This exchange is non-blocking (albeit heavy-weight, to the tune of possibly tens of milliseconds). Once this is done on each side, the server's async_accept() async-triggers the on-accept user callback; and on the client side sync_connect() returns. Though heavy-weight (but quite rare), this means sync_connect() is non-blocking, while async_accept() is "quick" once the low-level UDS connect() arrives.

The problem here is when the last bullet point actually occurs. On the server the answer is: only once at least one .async_accept() is outstanding. Until then, the client side "sees" the initial Unix domain socket connect, while on the server only the kernel knows about it. Once .async_accept() is called, the low-level accept() actually executes, and the rest of the message exchange happens.

That's just the way I coded it, as it is straightforward and seems correct. It's not quite correct, though: if .async_accept() does not occur, or occurs late, then sync_connect() thinks it's off to the races (the low-level connect() succeeded immediately, as expected), while the server knows nothing of it. The client has no way of knowing it's really still only in the kernel's UDS backlog.

So, not hard to understand. As for fixing it... somewhat surprisingly to me, it's actually not obvious. Simple ideas tend to be either wrong or not-quite-great:

  • Could augment the server side (e.g., making it an option in the low-level Native_socket_stream_acceptor) to not allow this backlogging of incoming connect()s when there are not 1+ async_accept()s pending. That would solve the problem during the initial "gap" (between the Session_server ctor and the first .async_accept())... but it would also break things during subsequent .async_accept()less "gaps." For example, suppose the code is written (and this would be proper) to .async_accept(F), then handle the new session in F (non-blockingly), then .async_accept() again. If a client attempts sync_connect() between the two, it'll fail. Not a catastrophe perhaps, but typically the client code would then wait a while and retry in N seconds - when really it would already work quite soon, once the next .async_accept() is executed on the server.
  • Could simply add (internally) a timeout inside .sync_connect() - e.g., 100msec - so it would give up and yield an error unless the message exchange starts within 100msec of the UDS connect() success. This is a decent solution. However, it's hard to know whether it should be 100msec or 50 or 500; if the machine is quite slow, then an "immediate" success could conceivably take that long. So it's brittle in that way. It could be configurable... but that sucks for the user. The whole point of .sync_connect() - as opposed to having an .async_connect() - is simplicity, synchronicity, non-blockingness when connecting.

Upon kicking around solutions in my mind I couldn't escape the following conclusion (not that it's all that hairy really).

The bottom line is that backlogging incoming .sync_connect()s on the server, when a Session_server exists but happens to not have any .async_accept() pending at the moment, is good behavior. Think of it as an analogy to a simple TCP or stream-UDS connect: if the server is listening in any capacity, then a connect should work - whether further traffic occurs in a timely fashion is a separate concern, and indeed if there is no matching accept() on the opposing side, then the client can eventually detect this and break the half-functioning channel. So in our case, if there's a Session_server listening, then connects should fully work from the client's point of view. However, in practical terms, the server side would then remain in almost-PEER state until the user .async_accept()s successfully and then does Server_session::init_handlers(). Therefore, later synchronous calls can block, if the user indeed has an .async_accept()less Session_server for an extended period of time. Those calls:

  • Session::open_channel(): This is formally allowed to block, but usually it's undesirable. We can give (see below notes re. docs) tips for how to avoid it (plus recommend using init-channels anyway when possible, so open_channel() can be skipped).
  • struc::Channel::sync_request() (and other API combos that expect non-blockingly-fast request-responses). This too is formally allowed to block, and usually it's undesirable. Again, we can give tips for how to avoid it.

In addition, consider adding a simple change wherein in fact we do not listen - nor even place the CNS (PID) file (etc.) - until the first .async_accept(). This one could go either way, but leaving a constructed, not-yet-accepting Session_server around and async_accept()ing for the first time later could be pretty common - IMO it reduces entropy to reject sync_connect()s during that time period. (Note: Doing this alone would mitigate the original issue quite a lot in practice. It would then still remain formally a problem but in appropriate user server-side patterns it would no longer come up.)

NOTE! It is important to explicitly document the expected behavior. This can be subtle, particularly after the above solution. We should take great care to (1) be formally accurate and (2) be informally helpful. Namely:

  • sync_connect() is non-blocking and in particular will succeed as long as the opposing Session_server is constructed. However, if the app coder on the server side allows for blocking-length periods sans .async_accept(), then subsequent operations might not be instantly responsive - until .async_accept() and .init_handlers() get the session to PEER state instead of almost-PEER.
  • Moreover, that situation will be avoided before the first .async_accept(): the server will reject connects during this time.
  • Moreover, the app code on the server can avoid that situation by:
    • In one-concurrent-session setups: doing .async_accept() quickly once a session is hosed.
    • In many-concurrent-sessions setups: doing .async_accept() quickly once an .async_accept() triggers callback.
  • These measures are proper techniques in any case.
  • If not heeded, the things that might (from the client-side point of view) be not-instantly-responsive within a given apparently-PEER-state session are at least: open_channel(), sync_request(), and algorithms conceptually similar to sync_request(). It is typically recommended to heed the advice, as there is great power in the simplicity of always knowing that a session is in full PEER state on both sides, and therefore all bidirectional request-like exchanges will be instantly responsive.
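
In practice the many-concurrent-sessions advice boils down to a self-rearming accept loop along these lines (a sketch; the handler's arguments are simplified just as in the earlier snippet):

void handle_accept(...) // Invoked by srv.async_accept() with a ready-to-go Session (or an error).
{
  // Re-arm immediately, so the server is never .async_accept()less for a blocking-length period;
  // thus opposing sync_connect()ed sessions reach full PEER state promptly.
  srv.async_accept(handle_accept);
  // Now handle the newly accepted Session non-blockingly: init_handlers(), channels, etc.
}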

--

Impl detail notes:

As it is currently structured (namely, all in session::Session_server_impl), the change should not be that hard (though it'll feel "unfortunate" given how mercifully simple today's session::Session_server_impl is). Currently .async_accept() simply triggers N_s_s_acceptor::async_accept() 1-1; once the latter fires successfully, directly from that thread W we do the Server_session impl's async-accept-log-in stuff, which performs the message exchange. So it's really just a wrapper around those 2 steps (plus a couple of hooks for SHM-enabled sessions... which don't concern us here at all).

So now it will have to be somewhat more complex - in fact similar to the N_s_s_acceptor internals. It'll need to do the equivalent of today's async_accept() - but mandatorily have one outstanding at all times; keep a queue of the user's outstanding "outer" .async_accept() "requests"; and do the usual surplus/deficit thing as in N_s_s_a and some other places in Flow-IPC. (That means the dtor will need to fire operation-aborted callbacks if the queue is non-empty at destruction... and so on. Use the aforementioned classes like N_s_s_a as a model; conceptually it'll be a copy-paste like that.)

We will probably need to add a thread W to Session_server_impl (so that, if there is a surplus of available almost-PEER sessions and then a request comes in, we can fire the callback immediately from thread W using W.post()); and mutex-protect the surplus queue/etc., as that stuff (m_state?) may be modified from at least the Server_sessions' many threads W, or from the new thread W, or I believe N_s_s_a's thread W. Perf is no big deal in this context, so it's fine; just be careful with the multi-threading.

Oh and, again, basically keep everything (maybe m_state?) null until the first .async_accept(); at which point do the equivalent of today's ctor and begin the internal .async_accept() chain.
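
To illustrate the surplus/deficit bookkeeping pattern being referenced, here is a generic, self-contained sketch (not Flow-IPC code; the class name, member names, and std::function-based callbacks are illustrative only; the real thing would post() handlers onto thread W rather than invoke them inline, and would fire operation-aborted for still-queued requests in the dtor):

#include <functional>
#include <mutex>
#include <optional>
#include <queue>
#include <utility>

// Generic surplus/deficit queue: internally produced items (think: almost-PEER sessions from the
// always-outstanding internal accept) meet user requests (think: "outer" .async_accept() calls);
// whichever side gets ahead of the other is queued.
template<typename Item>
class Surplus_deficit_queue
{
public:
  using On_item = std::function<void (Item&&)>;

  // Called when the internal accept chain yields a new item.
  void add_item(Item&& item)
  {
    On_item handler;
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      if (m_deficit.empty()) { m_surplus.push(std::move(item)); return; } // Surplus: stash it.
      handler = std::move(m_deficit.front()); m_deficit.pop(); // Deficit: satisfy the oldest request.
    }
    handler(std::move(item)); // Fire outside the lock (real code: post() onto worker thread W).
  }

  // Called on each user-level accept request.
  void add_request(On_item handler)
  {
    std::optional<Item> item;
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      if (m_surplus.empty()) { m_deficit.push(std::move(handler)); return; } // Deficit: queue the request.
      item.emplace(std::move(m_surplus.front())); m_surplus.pop(); // Surplus: satisfy immediately.
    }
    handler(std::move(*item));
  }

private:
  std::mutex m_mutex;             // Protects both queues; contention is negligible in this context.
  std::queue<Item> m_surplus;     // Items produced while no request was waiting.
  std::queue<On_item> m_deficit;  // Requests made while no item was available.
};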

transport_test is a decent place to have this auto-tested.
