
[core] Fix "Fatal Python error: PyGILState_Release: auto-releasing thread-state, but no thread-state for this thread" #52575


Open · kevin85421 wants to merge 1 commit into master

Conversation

@kevin85421 (Member) commented Apr 24, 2025

Why are these changes needed?

We see the following error message from the CI runs of test_threaded_actor.py (example1, example2).

(Screenshot of the CI logs showing the fatal Python error.)

The message "Fatal Python error: PyGILState_Release: auto-releasing thread-state, but no thread-state for this thread" is very scary, but it will not cause any tests to fail.

The root cause is that PyGILState_Release is called on a thread that has never called PyGILState_Ensure. See the CPython source code for more details.
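For concreteness, the following is a minimal standalone reproduction of that invariant (illustrative only, not Ray code; it assumes an embedded-CPython build and deliberately aborts with the same fatal error):

// repro.cc: illustrative only. Build against an embedded CPython, e.g.
//   g++ repro.cc -std=c++17 $(python3-config --includes) $(python3-config --ldflags --embed) -pthread
#include <Python.h>
#include <thread>

int main() {
  Py_Initialize();
  // Release the GIL held by the main thread so other threads can acquire it.
  PyThreadState *main_state = PyEval_SaveThread();

  // Thread A calls PyGILState_Ensure ...
  PyGILState_STATE gil_state;
  std::thread ensure_thread([&] { gil_state = PyGILState_Ensure(); });
  ensure_thread.join();

  // ... but thread B calls PyGILState_Release. Thread B has no thread-state of
  // its own, so CPython aborts with:
  //   Fatal Python error: PyGILState_Release: auto-releasing thread-state,
  //   but no thread-state for this thread
  std::thread release_thread([&] { PyGILState_Release(gil_state); });
  release_thread.join();

  // Never reached because of the fatal error above.
  PyEval_RestoreThread(main_state);
  Py_FinalizeEx();
  return 0;
}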

This happens because we can't control which thread in the thread pool runs the initializer/releaser, so the releaser may execute on a thread that never ran the initializer. Hence, if a concurrency group has more than one thread, the error message above may be printed when we gracefully shut down an actor (i.e., via ray.actor.exit_actor()).

In this PR, we only execute the initializer and releaser when the executor has a single thread, which guarantees that both run on the same thread. As a consequence, users cannot access thread-local state when a concurrency group has more than one thread. I think this behavior is acceptable: users cannot control which thread executes a task anyway, so they should not rely on thread-local state in a multi-threaded concurrency group.
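Conceptually, the change amounts to something like the following sketch (the class and member names are illustrative, not the actual thread_pool.h code): the initializer and releaser are scheduled only when the pool has a single thread, so both necessarily run on that thread.

#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>
#include <functional>

// Illustrative only: install the Python thread initializer/releaser only for a
// single-threaded pool, so both callbacks run on the same worker thread.
class SingleThreadAwarePool {
 public:
  SingleThreadAwarePool(int num_threads,
                        std::function<void()> initializer,
                        std::function<void()> releaser)
      : pool_(num_threads),
        releaser_(std::move(releaser)),
        run_callbacks_(num_threads == 1) {
    if (run_callbacks_ && initializer) {
      boost::asio::post(pool_, initializer);  // runs on the pool's only thread
    }
  }

  void Post(std::function<void()> fn) { boost::asio::post(pool_, std::move(fn)); }

  void Join() {
    if (run_callbacks_ && releaser_) {
      boost::asio::post(pool_, releaser_);  // same single thread as the initializer
    }
    pool_.join();  // waits for outstanding work, including the releaser
  }

 private:
  boost::asio::thread_pool pool_;
  std::function<void()> releaser_;
  bool run_callbacks_;
};

For a pool with more than one thread, the callbacks are simply skipped, which matches the behavior described above.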

Related issue number

Closes #51071

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(
# test.py
import ray

@ray.remote
class ThreadActor:
    def __init__(self):
        self.counter = 0

    def increment(self):
        self.counter += 1
        return self.counter

    def terminate(self):
        ray.actor.exit_actor()

actor = ThreadActor.options(max_concurrency=10).remote()
print(ray.get(actor.increment.remote()))
ray.get(actor.terminate.remote())
  • Without this PR: Ran the test 5 times and encountered the error "PyGILState_Release: auto-releasing thread-state" 5 times.
  • With this PR: Ran the test 5 times and encountered the error 0 times.

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
kevin85421 added the go label (add ONLY when ready to merge, run all tests) on Apr 24, 2025
@edoakes (Collaborator) commented Apr 24, 2025

This means that users cannot access thread-local state when a concurrency group has more than one thread.

What will the behavior be if a user does try to access thread-local state currently? There are really only two acceptable options here IMO, unless there is a significant reason otherwise:

  1. Enforce that users cannot use thread-local state and document this limitation clearly. Not sure if it's possible to prevent them from using it, so this is only acceptable if the Python behavior is well-defined when a user tries to use an unsupported call from an uninitialized thread.
  2. Fix the problem for multiple threads as well.

Is there a reason why we can't fix the underlying problem and ensure that the initializer is run once per thread and the same thread runs the corresponding release call? Naively it seems like this should only require keeping a map of thread_id -> releaser.

@kevin85421 (Member, Author) commented

What will the behavior be if a user does try to access thread-local state currently?

  • Without this PR
    • It is OK to access thread-local state if the task is running on the MainThread or in a concurrency group that has only one thread.
    • For a concurrency group with multiple threads, it depends on whether the task that accesses the thread-local state (access_task) is running on the same thread as the task that sets it (set_task). However, several issues must be addressed to ensure that these two tasks run on the same thread.
      • User Interface: we don't expose a user-facing API that lets users pin a task to a specific thread within a concurrency group.
      • Ray Core C++: To the best of my knowledge, boost::asio::thread_pool doesn't expose an API that allows posting work to a particular thread.

Is there a reason why we can't fix the underlying problem and ensure that the initializer is run once per thread and the same thread runs the corresponding release call? Naively it seems like this should only require keeping a map of thread_id -> releaser.

As I mentioned above, to the best of my knowledge boost::asio::thread_pool doesn't expose an API that allows posting work to a particular thread.

@edoakes (Collaborator) commented Apr 24, 2025

As I mentioned above, to the best of my knowledge boost::asio::thread_pool doesn't expose an API that allows posting work to a particular thread.

Got it, in that case it sounds like we'll need to use the lower-level APIs to manage our own basic thread pool and implement the init/release logic. This should not be too challenging given how simple the usage in thread_pool.h is.

@kevin85421 (Member, Author) commented

I think it's fine not to support thread-local state for concurrency groups with more than one thread. I remember discussing this behavior with @stephanie-wang several months ago when we were trying to move the RayCG execution loop to the main thread.

However, executing the initializer is not only for thread-local state; it also aligns Ray more closely with the Python interpreter's assumptions. That is, once a thread with a given thread ID exits, it cannot be restarted. If we want to run the initializer/releaser on each thread, we may need to get rid of the thread_pool and manage threads ourselves.

@kevin85421 (Member, Author) commented

I just saw #52575 (comment) after I submitted #52575 (comment). Implementing our own thread pool makes sense. I want to confirm with you that the goal of implementing our own thread pool to initialize and release Python threads is not to support thread-local state; rather, it is to fulfill the Python interpreter's assumptions, as I mentioned in the previous comment. Users should still not use thread-local state for a concurrency group with multiple threads because of the user interface issue mentioned in #52575 (comment).

@edoakes (Collaborator) commented Apr 24, 2025

Yes exactly. I agree we should not encourage users to do this, but we should fulfill the Python interpreter's assumptions. This will also avoid undefined behavior and/or scary stack traces like the one in this ticket.

As an example, there might be library code that uses thread local storage that users aren't even aware of. We would want to make sure that the code at least runs correctly and doesn't fail in unexpected & confusing ways.

@kevin85421 (Member, Author) commented

@edoakes Is it okay to implement our own thread pool using a naive round-robin approach? If not, I’d prefer to merge this PR first, and then I can follow up with another PR to implement it after on-call.

I took a look at the source code of the post function in boost::asio::thread_pool. Implementing the scheduler ourselves would not be trivial.

kevin85421 marked this pull request as ready for review on April 25, 2025 07:17
@edoakes (Collaborator) commented Apr 25, 2025

The use case here is quite simple and the work is coarse-grained (task executions). We should be able to use an io_context as a basic queue and rely on it for scheduling.

Pseudocode:

// Start threads, each running the io_context event loop.
for (int i = 0; i < num_threads; i++) {
  threads_.emplace_back([&] { io_context.run(); });
}

// Post work to the queue.
boost::asio::post(io_context, work_callback);

// Graceful shutdown.
io_context.stop();
for (auto &thread : threads_) {
  thread.join();
}
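A slightly fleshed-out version of that sketch (names are illustrative; the work guard keeps io_context::run() from returning while the queue is empty, and the init/release hooks mark where per-thread setup and teardown would go):

#include <boost/asio/executor_work_guard.hpp>
#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>
#include <functional>
#include <thread>
#include <vector>

// Illustrative sketch of a small thread pool built on io_context so that a
// per-thread init/release hook can run on every worker thread.
class SimplePool {
 public:
  SimplePool(int num_threads,
             std::function<void()> init = nullptr,
             std::function<void()> release = nullptr)
      : work_guard_(boost::asio::make_work_guard(io_context_)) {
    for (int i = 0; i < num_threads; i++) {
      threads_.emplace_back([this, init, release]() {
        if (init) init();        // e.g. create this thread's Python thread-state
        io_context_.run();       // process posted work until shutdown
        if (release) release();  // e.g. release it on the same thread at exit
      });
    }
  }

  void Post(std::function<void()> work) {
    boost::asio::post(io_context_, std::move(work));
  }

  // Graceful shutdown: drain queued work, then join. (The pseudocode above uses
  // io_context.stop(), which returns without draining queued handlers.)
  void Shutdown() {
    work_guard_.reset();
    for (auto &thread : threads_) {
      thread.join();
    }
  }

 private:
  boost::asio::io_context io_context_;
  boost::asio::executor_work_guard<boost::asio::io_context::executor_type> work_guard_;
  std::vector<std::thread> threads_;
};

With hooks like these, each worker thread could call PyGILState_Ensure when it starts and the matching PyGILState_Release when it exits, which is the pairing CPython expects.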

Labels: go (add ONLY when ready to merge, run all tests)

Successfully merging this pull request may close the following issue:

[core] Only one of the threads in a thread pool will be initialized as a long-running Python thread