Opening this to have a clearer issue to link to in other places. In #5258 (comment), we discovered that the worker's event loop performance could be heavily impacted when:
- User task code, running in threads, holds the GIL (even if the tasks are quick)
- Workload involves data transfer to other workers
For example, while running tasks involving lots of `np.concatenate` calls (which don't release the GIL) and simultaneously trying to send data to other workers, we found that the worker's event loop was blocked on the GIL (not idle) 71% of the time.
This is the "convoy effect", a longstanding issue in CPython: https://bugs.python.org/issue7946. Basically, the non-blocking socket.send() call (running in the event loop) releases the GIL and does a non-blocking write to the socket, which is nearly instantaneous. But then it needs to re-acquire the GIL, which some other thread now holds, and will hold for potentially a long time (the default thread-switch interval is 5ms). So this "non-blocking" socket.send is, effectively, blocking. If most of what the event loop is trying to do is send data/messages to other workers/the scheduler, then most of the time, the event loop will actually be blocked.
See the "GIL Gotchas" section in https://coiled.io/blog/better-shuffling-in-dask-a-proof-of-concept/ for an intuitive explanation of this involving toothbrushes.
This is clearly bad from a performance standpoint (it massively slows down data transfer between workers).
An open question is whether it's also a stability issue. I wouldn't be surprised to see messages like `Event loop was unresponsive in Worker for 3.28s. This is often caused by long-running GIL-holding functions` in worker logs, even when your tasks aren't long-running GIL-holding functions, but merely GIL-holding at all.
No async code will perform well if the event loop is gummed up, and the worker is no exception. I don't think the worker is, or should have to be, designed to work reliably when the event loop is highly unresponsive. Instead, I think we should focus on ways to protect the event loop from being blocked. (The main way to do this is to run tasks in subprocesses instead of threads, which I think is a good idea for all sorts of reasons.)
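There's no drop-in "tasks in subprocesses" mode today, but as a partial mitigation with existing options (a sketch, not the proposed design), you can favor more worker processes with fewer threads each, so fewer GIL-holding task threads compete with each worker's event loop:

```python
from dask.distributed import Client, LocalCluster

# Eight single-threaded worker processes instead of one worker with eight
# threads: most GIL-holding task code then runs under a different process's
# GIL than any given worker's comm event loop. Note this only reduces the
# contention; each worker's own single task thread can still block its loop.
cluster = LocalCluster(n_workers=8, threads_per_worker=1, processes=True)
client = Client(cluster)
```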
I'm not sure exactly how an unresponsive event loop impacts stability. In theory, it shouldn't matter—everything should still happen, in the same order, just really, really slowly. However, any sane-seeming timeouts will go out the window if the event loop is running 1,000x slower than expected. So in practice, I'm not surprised to see a blocked event loop causing things like #6324.
