
[core] Fix "Fatal Python error: PyGILState_Release: auto-releasing thread-state, but no thread-state for this thread" #52575


Open · kevin85421 wants to merge 1 commit into master

Conversation

@kevin85421 (Member) commented Apr 24, 2025

Why are these changes needed?

We see the following error message from the CI runs of test_threaded_actor.py (example1, example2).

(Screenshot of the CI logs showing the fatal Python error.)

The message "Fatal Python error: PyGILState_Release: auto-releasing thread-state, but no thread-state for this thread" is very scary, but it will not cause any tests to fail.

The root cause is that PyGILState_Release is called on a thread that has never called PyGILState_Ensure. See the CPython source code for more details.
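For concreteness, the following is a minimal standalone reproduction of that invariant (illustrative only, not Ray code; it assumes an embedded-CPython build and deliberately aborts with the same fatal error):

// repro.cc: illustrative only. Build against an embedded CPython, e.g.
//   g++ repro.cc -std=c++17 $(python3-config --includes) $(python3-config --ldflags --embed) -pthread
#include <Python.h>
#include <thread>

int main() {
  Py_Initialize();
  // Release the GIL held by the main thread so other threads can acquire it.
  PyThreadState *main_state = PyEval_SaveThread();

  // Thread A calls PyGILState_Ensure ...
  PyGILState_STATE gil_state;
  std::thread ensure_thread([&] { gil_state = PyGILState_Ensure(); });
  ensure_thread.join();

  // ... but thread B calls PyGILState_Release. Thread B has no thread-state of
  // its own, so CPython aborts with:
  //   Fatal Python error: PyGILState_Release: auto-releasing thread-state,
  //   but no thread-state for this thread
  std::thread release_thread([&] { PyGILState_Release(gil_state); });
  release_thread.join();

  // Never reached because of the fatal error above.
  PyEval_RestoreThread(main_state);
  Py_FinalizeEx();
  return 0;
}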

This happens because we can't control which thread in the thread pool runs the initializer/releaser, so the releaser may execute on a thread that never ran the initializer. Hence, if a concurrency group has more than one thread, the error message above may be printed when we gracefully shut down an actor (i.e., via ray.actor.exit_actor()).

In this PR, we only execute the initializer and releaser when the executor has a single thread, which guarantees that both run on the same thread. As a consequence, users cannot access thread-local state when a concurrency group has more than one thread. I think this behavior is acceptable: users cannot control which thread executes a task anyway, so they should not rely on thread-local state in a multi-threaded concurrency group.
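Conceptually, the change amounts to something like the following sketch (the class and member names are illustrative, not the actual thread_pool.h code): the initializer and releaser are scheduled only when the pool has a single thread, so both necessarily run on that thread.

#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>
#include <functional>

// Illustrative only: install the Python thread initializer/releaser only for a
// single-threaded pool, so both callbacks run on the same worker thread.
class SingleThreadAwarePool {
 public:
  SingleThreadAwarePool(int num_threads,
                        std::function<void()> initializer,
                        std::function<void()> releaser)
      : pool_(num_threads),
        releaser_(std::move(releaser)),
        run_callbacks_(num_threads == 1) {
    if (run_callbacks_ && initializer) {
      boost::asio::post(pool_, initializer);  // runs on the pool's only thread
    }
  }

  void Post(std::function<void()> fn) { boost::asio::post(pool_, std::move(fn)); }

  void Join() {
    if (run_callbacks_ && releaser_) {
      boost::asio::post(pool_, releaser_);  // same single thread as the initializer
    }
    pool_.join();  // waits for outstanding work, including the releaser
  }

 private:
  boost::asio::thread_pool pool_;
  std::function<void()> releaser_;
  bool run_callbacks_;
};

For a pool with more than one thread, the callbacks are simply skipped, which matches the behavior described above.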

Related issue number

Closes #51071

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(
# test.py
import ray

@ray.remote
class ThreadActor:
    def __init__(self):
        self.counter = 0

    def increment(self):
        self.counter += 1
        return self.counter

    def terminate(self):
        ray.actor.exit_actor()

actor = ThreadActor.options(max_concurrency=10).remote()
print(ray.get(actor.increment.remote()))
ray.get(actor.terminate.remote())
  • Without this PR: Ran the test 5 times and encountered the error "PyGILState_Release: auto-releasing thread-state" 5 times.
  • With this PR: Ran the test 5 times and encountered the error 0 times.

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
kevin85421 added the go label (add ONLY when ready to merge, run all tests) on Apr 24, 2025
@edoakes (Collaborator) commented Apr 24, 2025

This means that users cannot access thread-local state when a concurrency group has more than one thread.

What will the behavior be if a user does try to access thread-local state currently? There are really only two acceptable options here IMO, unless there is a significant reason otherwise:

  1. Enforce that users cannot use thread-local state and document this limitation clearly. Not sure if it's possible to prevent them from using it, so this is only acceptable if the Python behavior is well-defined when a user tries to use an unsupported call from an uninitialized thread.
  2. Fix the problem for multiple threads as well.

Is there a reason why we can't fix the underlying problem and ensure that the initializer is run once per thread and the same thread runs the corresponding release call? Naively it seems like this should only require keeping a map of thread_id -> releaser.

@kevin85421 (Member, Author) commented

What will the behavior be if a user does try to access thread-local state currently?

  • Without this PR
    • It is OK to access thread-local state if the task is running on the MainThread or in a concurrency group that has only one thread.
    • For a concurrency group with multiple threads, it depends on whether the task that accesses the thread-local state (access_task) is running on the same thread as the task that sets it (set_task). However, several issues must be addressed to ensure that these two tasks run on the same thread.
      • User Interface: we don't expose a user-facing API that lets users pin a task to a specific thread within a concurrency group.
      • Ray Core C++: To the best of my knowledge, boost::asio::thread_pool doesn't expose an API that allows posting work to a particular thread.

Is there a reason why we can't fix the underlying problem and ensure that the initializer is run once per thread and the same thread runs the corresponding release call? Naively it seems like this should only require keeping a map of thread_id -> releaser.

As I mentioned above, to the best of my knowledge boost::asio::thread_pool doesn't expose an API that allows posting work to a particular thread.

@edoakes (Collaborator) commented Apr 24, 2025

As I mentioned above, to the best of my knowledge boost::asio::thread_pool doesn't expose an API that allows posting work to a particular thread.

Got it, in that case it sounds like we'll need to use the lower-level APIs to manage our own basic thread pool and implement the init/release logic. This should not be too challenging given how simple the usage in thread_pool.h is.

@kevin85421 (Member, Author) commented

I think it's fine not to support thread-local state for concurrency groups with more than one thread. I remember discussing this behavior with @stephanie-wang several months ago when we were trying to move the RayCG execution loop to the main thread.

However, executing the initializer is not only for thread-local state; it also aligns Ray more closely with the Python interpreter's assumptions. That is, once a thread with a given thread ID exits, it cannot be restarted. If we want to run the initializer/releaser on each thread, we may need to get rid of the thread_pool and manage threads ourselves.

@kevin85421 (Member, Author) commented

I just saw #52575 (comment) after I submitted #52575 (comment). Implementing our own thread pool makes sense. I want to confirm with you that the goal of implementing our own thread pool to initialize and release Python threads is not to support thread-local state; rather, it is to fulfill the Python interpreter's assumptions, as I mentioned in the previous comment. Users should still not use thread-local state for a concurrency group with multiple threads because of the user interface issue mentioned in #52575 (comment).

@edoakes (Collaborator) commented Apr 24, 2025

Yes exactly. I agree we should not encourage users to do this, but we should fulfill the Python interpreter's assumptions. This will also avoid undefined behavior and/or scary stack traces like the one in this ticket.

As an example, there might be library code that uses thread local storage that users aren't even aware of. We would want to make sure that the code at least runs correctly and doesn't fail in unexpected & confusing ways.

@kevin85421 (Member, Author) commented

@edoakes Is it okay to implement our own thread pool using a naive round-robin approach? If not, I’d prefer to merge this PR first, and then I can follow up with another PR to implement it after on-call.

I took a look at the source code of the post function in boost::asio::thread_pool. Implementing the scheduler ourselves would not be trivial.

kevin85421 marked this pull request as ready for review on April 25, 2025 07:17
@edoakes (Collaborator) commented Apr 25, 2025

The use case here is quite simple and the work is coarse-grained (task executions). We should be able to use an io_context as a basic queue and rely on it for scheduling.

Pseudocode:

// Start threads, each running the io_context event loop.
for (int i = 0; i < num_threads; i++) {
  threads_.emplace_back([&] { io_context.run(); });
}

// Post work to the queue.
boost::asio::post(io_context, work_callback);

// Graceful shutdown.
io_context.stop();
for (auto &thread : threads_) {
  thread.join();
}
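A slightly fleshed-out version of that sketch (names are illustrative; the work guard keeps io_context::run() from returning while the queue is empty, and the init/release hooks mark where per-thread setup and teardown would go):

#include <boost/asio/executor_work_guard.hpp>
#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>
#include <functional>
#include <thread>
#include <vector>

// Illustrative sketch of a small thread pool built on io_context so that a
// per-thread init/release hook can run on every worker thread.
class SimplePool {
 public:
  SimplePool(int num_threads,
             std::function<void()> init = nullptr,
             std::function<void()> release = nullptr)
      : work_guard_(boost::asio::make_work_guard(io_context_)) {
    for (int i = 0; i < num_threads; i++) {
      threads_.emplace_back([this, init, release]() {
        if (init) init();        // e.g. create this thread's Python thread-state
        io_context_.run();       // process posted work until shutdown
        if (release) release();  // e.g. release it on the same thread at exit
      });
    }
  }

  void Post(std::function<void()> work) {
    boost::asio::post(io_context_, std::move(work));
  }

  // Graceful shutdown: drain queued work, then join. (The pseudocode above uses
  // io_context.stop(), which returns without draining queued handlers.)
  void Shutdown() {
    work_guard_.reset();
    for (auto &thread : threads_) {
      thread.join();
    }
  }

 private:
  boost::asio::io_context io_context_;
  boost::asio::executor_work_guard<boost::asio::io_context::executor_type> work_guard_;
  std::vector<std::thread> threads_;
};

With hooks like these, each worker thread could call PyGILState_Ensure when it starts and the matching PyGILState_Release when it exits, which is the pairing CPython expects.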

Labels: go (add ONLY when ready to merge, run all tests)

Successfully merging this pull request may close the following issue:

[core] Only one of the threads in a thread pool will be initialized as a long-running Python thread