[core] Only one of the threads in a thread pool will be initialized as a long-running Python thread #51071

kevin85421 · 2025-03-04T22:03:14Z

What happened + What you expected to happen

Currently, only one of the threads in a thread pool is initialized as a long-running Python thread. I should also investigate whether it is possible to call PyGILState_Release on a thread other than the one that called PyGILState_Ensure in the thread pool.
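
To make the symptom concrete, here is a minimal sketch in pure Python. It assumes the initializer is posted to the pool as an ordinary task, which this issue does not confirm; `concurrent.futures.ThreadPoolExecutor` stands in for Ray's C++ thread pool, and `run_init_as_task` is a hypothetical helper name. Under that assumption, exactly one worker picks up the task, so only one thread ends up initialized.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_init_as_task(num_threads: int = 4) -> set:
    """Post one 'initializer' task to a pool and return the ids of the threads that ran it.

    Assumption (not from the issue): the initializer is submitted like any other task.
    """
    initialized = set()
    lock = threading.Lock()

    def initializer():
        # Record which pool thread actually executed the initializer.
        with lock:
            initialized.add(threading.get_ident())

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # A single submitted task runs on exactly one worker thread,
        # no matter how many workers the pool is allowed to create.
        pool.submit(initializer).result()
    return initialized


if __name__ == "__main__":
    print(len(run_init_as_task()))
```

Which worker runs the task is up to the pool, so the caller cannot pick the thread that gets initialized.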

Versions / Dependencies

TODO

Reproduction script

TODO

Issue Severity

None


…s within the same concurrency group (#52575)

We see the following error message from the CI runs of `test_threaded_actor.py` ([example1](https://buildkite.com/ray-project/postmerge-macos/builds/5543#019659f5-7285-48fc-b1cf-588fd19bd050), [example2](https://buildkite.com/ray-project/postmerge-macos/builds/5534#01965796-294c-41de-8e6f-ef2970134df2)).

![image](https://github.com/user-attachments/assets/d3a5d47a-1dc6-41b8-b258-d33699d4a04a)

The message "Fatal Python error: PyGILState_Release: auto-releasing thread-state, but no thread-state for this thread" is very scary, but it will not cause any tests to fail.

The root cause is that `PyGILState_Release` is called on a thread that has never called `PyGILState_Ensure`. See the [CPython source code](https://github.com/python/cpython/blob/a94c7528b596e9ec234f12ebeeb45fc731412b18/Python/pystate.c#L2870) for more details.

The reason is that we can't control which thread in the thread pool will run the initializer/releaser. Hence, if a concurrency group has more than one thread, the error message above may be printed when we gracefully shut down an actor (i.e., `ray.actor.exit_actor()`).

In this PR, we implement our own thread pool using `std::thread`, ensuring that both the initializer and the releaser run on the same thread. Consequently, from the Python interpreter’s perspective, all Python threads in the same concurrency group remain active even after they finish executing Ray tasks.

## Related issue number

Closes #51071

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

```python
# test.py
import ray


@ray.remote
class ThreadActor:
    def __init__(self):
        self.counter = 0

    def increment(self):
        self.counter += 1
        return self.counter

    def terminate(self):
        ray.actor.exit_actor()


actor = ThreadActor.options(max_concurrency=10).remote()
print(ray.get(actor.increment.remote()))
ray.get(actor.terminate.remote())
```

* Without this PR: Ran the test 20 times and encountered the error "PyGILState_Release: auto-releasing thread-state" 20 times.

<img width="1728" alt="Screenshot 2025-04-30 at 5 23 27 PM" src="https://github.com/user-attachments/assets/644ffd89-8edf-4678-a0cd-528eb642fe66" />

* With this PR: Ran the test 20 times and encountered the error 0 times.

<img width="1728" alt="Screenshot 2025-04-30 at 5 25 10 PM" src="https://github.com/user-attachments/assets/03afaa26-0027-4df4-915d-6165bb83583f" />

---------

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
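
The idea described in this PR, that every pool thread runs both the initializer and the releaser itself, can be sketched in Python. This is only an illustration under stated assumptions, not the `std::thread` pool the PR adds to Ray: `PinnedThreadPool`, `post`, `join`, and `demo` are hypothetical names, and the callbacks merely record thread ids where the real code would call `PyGILState_Ensure` and `PyGILState_Release`.

```python
import queue
import threading


class PinnedThreadPool:
    """Fixed-size pool in which each worker runs init and release on its own thread.

    Hypothetical sketch; Ray's actual pool is written in C++ with std::thread.
    """

    def __init__(self, num_threads, initializer, releaser):
        self._tasks = queue.Queue()
        self._initializer = initializer
        self._releaser = releaser
        self._threads = [
            threading.Thread(target=self._worker) for _ in range(num_threads)
        ]
        for t in self._threads:
            t.start()

    def _worker(self):
        # The initializer runs on the worker thread itself ...
        self._initializer()
        try:
            while True:
                task = self._tasks.get()
                if task is None:  # sentinel: shut this worker down
                    break
                task()
        finally:
            # ... and so does the releaser, on the very same thread.
            self._releaser()

    def post(self, task):
        self._tasks.put(task)

    def join(self):
        # One sentinel per worker, then wait for all of them to exit.
        for _ in self._threads:
            self._tasks.put(None)
        for t in self._threads:
            t.join()


def demo(num_threads=4):
    """Return the thread ids seen by the initializer and by the releaser."""
    init_ids, release_ids = [], []
    lock = threading.Lock()

    def init():
        with lock:
            init_ids.append(threading.get_ident())

    def release():
        with lock:
            release_ids.append(threading.get_ident())

    pool = PinnedThreadPool(num_threads, init, release)
    for _ in range(10):
        pool.post(lambda: None)
    pool.join()
    return init_ids, release_ids
```

Because the workers stay blocked on the queue until `join()` is called, they remain alive between tasks, which mirrors the PR's remark that the threads stay active after finishing Ray tasks. In this sketch a task that raises ends its worker (the releaser still runs); a production pool would presumably catch and report such errors instead.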

commit 0abf03bb30a7c234a0820dc4650b6df6d0cbea59 Author: srinathk10 <68668616+srinathk10@users.noreply.github.com> Date: Mon May 12 15:25:17 2025 -0700 Train Tests: Disable cgroup isolation on head node for benchmarking (#52909) --------- Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com> Signed-off-by: srinathk10 <68668616+srinathk10@users.noreply.github.com> Co-authored-by: lanbochen-anyscale <103082133+lanbochen-anyscale@users.noreply.github.com> commit 5349c66c6d5d022f03341b0ac9f1adb34079b0a5 Author: dev-goyal <126589393+dev-goyal@users.noreply.github.com> Date: Mon May 12 18:11:07 2025 -0400 Minor enhancements to Databricks Unity Datasource (#52850) - Move imports around in `read_databricks_tables`. Now, installing `pyspark` is optional if desired. - Print a reason if the query fails - Expose the `is_truncated` field to the user, so they can intervene if needed. Signed-off-by: Dev <dev.goyal@hinge.co> commit da52b137f10567e78fb0dd1937a7480cf70f56ee Author: Matthew Owen <mowen@anyscale.com> Date: Mon May 12 13:46:01 2025 -0700 [data] Remove unused allocated bytes panel and stat (#52943) ## Why are these changes needed? We do not update this stat anywhere in the codebase, this removes the stat and the corresponding panel. ## Related issue number  ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. 
Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Matthew Owen <mowen@anyscale.com> commit 61617ccf1e0280ae512991eadd1595f6d1a66f15 Author: matthewdeng <matt@anyscale.com> Date: Mon May 12 11:55:14 2025 -0700 [train] bump test_torch_device_manager timeout (#52917) Test started flakily timing out. Bumping to verify if it's around the threshold. Signed-off-by: Matthew Deng <matt@anyscale.com> commit 7e78c5aee84cf9d05ec8cc6a60d385e7f6df67e7 Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Date: Mon May 12 11:10:05 2025 -0700 [data] skip tfx-bsl tests on premerge (#52942) the base image is not resolving dependencies any more. Signed-off-by: Lonnie Liu <lonnie@anyscale.com> commit 0f864d70cfe812f0164cd6bb414daecbcbb6e8c7 Author: Rueian <rueiancsie@gmail.com> Date: Mon May 12 09:56:14 2025 -0700 [core][autoscaler][v1] deflaky test_autoscaler (#52769) ## Why are these changes needed? 
From [the logs](https://buildkite.com/ray-project/postmerge/builds/9840#01968329-4ba3-422f-91e0-542d09855d68) provided by @kevin85421, `test_autoscaler.py` has 2 flaky tests: ```python [2025-04-29T20:28:44Z] [2025-04-29T20:28:44Z] =================================== FAILURES =================================== [2025-04-29T20:28:44Z] ____________________ AutoscalingTest.testConfiguresNewNodes ____________________ [2025-04-29T20:28:44Z] [2025-04-29T20:28:44Z] self = <python.ray.tests.test_autoscaler.AutoscalingTest testMethod=testConfiguresNewNodes> [2025-04-29T20:28:44Z] [2025-04-29T20:28:44Z] def testConfiguresNewNodes(self): [2025-04-29T20:28:44Z] config = copy.deepcopy(SMALL_CLUSTER) [2025-04-29T20:28:44Z] config["available_node_types"]["worker"]["min_workers"] = 1 [2025-04-29T20:28:44Z] config_path = self.write_config(config) [2025-04-29T20:28:44Z] self.provider = MockProvider() [2025-04-29T20:28:44Z] runner = MockProcessRunner() [2025-04-29T20:28:44Z] runner.respond_to_call("json .Config.Env", ["[]" for i in range(2)]) [2025-04-29T20:28:44Z] self.provider.create_node( [2025-04-29T20:28:44Z] {}, [2025-04-29T20:28:44Z] { [2025-04-29T20:28:44Z] TAG_RAY_NODE_KIND: NODE_KIND_HEAD, [2025-04-29T20:28:44Z] TAG_RAY_NODE_STATUS: STATUS_UP_TO_DATE, [2025-04-29T20:28:44Z] TAG_RAY_USER_NODE_TYPE: "head", [2025-04-29T20:28:44Z] }, [2025-04-29T20:28:44Z] 1, [2025-04-29T20:28:44Z] ) [2025-04-29T20:28:44Z] autoscaler = MockAutoscaler( [2025-04-29T20:28:44Z] config_path, [2025-04-29T20:28:44Z] LoadMetrics(), [2025-04-29T20:28:44Z] MockGcsClient(), [2025-04-29T20:28:44Z] max_failures=0, [2025-04-29T20:28:44Z] process_runner=runner, [2025-04-29T20:28:44Z] update_interval_s=0, [2025-04-29T20:28:44Z] ) [2025-04-29T20:28:44Z] [2025-04-29T20:28:44Z] autoscaler.update() [2025-04-29T20:28:44Z] autoscaler.update() [2025-04-29T20:28:44Z] self.waitForNodes(2) [2025-04-29T20:28:44Z] self.provider.finish_starting_nodes() [2025-04-29T20:28:44Z] # TODO(rickyx): This is a hack to avoid running 
into race conditions [2025-04-29T20:28:44Z] # within v1 autoscaler. These should no longer be relevant in v2. [2025-04-29T20:28:44Z] time.sleep(3) [2025-04-29T20:28:44Z] autoscaler.update() [2025-04-29T20:28:44Z] time.sleep(3) [2025-04-29T20:28:44Z] > self.waitForNodes(2, tag_filters={TAG_RAY_NODE_STATUS: STATUS_UP_TO_DATE}) [2025-04-29T20:28:44Z] [2025-04-29T20:28:44Z] python/ray/tests/test_autoscaler.py:2250: [2025-04-29T20:28:44Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ [2025-04-29T20:28:44Z] python/ray/tests/test_autoscaler.py:414: in waitForNodes [2025-04-29T20:28:44Z] comparison(n, expected, msg="Unexpected node quantity.") [2025-04-29T20:28:44Z] E AssertionError: 3 != 2 : Unexpected node quantity. ``` and ```python [2025-04-29T20:28:44Z] [2025-04-29T20:28:44Z] =================================== FAILURES =================================== [2025-04-29T20:28:44Z] ________ AutoscalingTest.testDontScaleDownIdleTimeOutForPlacementGroups ________ [2025-04-29T20:28:44Z] [2025-04-29T20:28:44Z] self = <python.ray.tests.test_autoscaler.AutoscalingTest testMethod=testDontScaleDownIdleTimeOutForPlacementGroups> [2025-04-29T20:28:44Z] [2025-04-29T20:28:44Z] def testDontScaleDownIdleTimeOutForPlacementGroups(self): [2025-04-29T20:28:44Z] config = copy.deepcopy(SMALL_CLUSTER) [2025-04-29T20:28:44Z] config["available_node_types"]["head"]["resources"][ [2025-04-29T20:28:44Z] "CPU" [2025-04-29T20:28:44Z] ] = 0 # make the head node not consume any resources. [2025-04-29T20:28:44Z] config["available_node_types"]["worker"][ [2025-04-29T20:28:44Z] "min_workers" [2025-04-29T20:28:44Z] ] = 1 # prepare 1 worker upfront. 
[2025-04-29T20:28:44Z] config["idle_timeout_minutes"] = 0.1 [2025-04-29T20:28:44Z] config_path = self.write_config(config) [2025-04-29T20:28:44Z] [2025-04-29T20:28:44Z] self.provider = MockProvider() [2025-04-29T20:28:44Z] self.provider.create_node( [2025-04-29T20:28:44Z] {}, [2025-04-29T20:28:44Z] { [2025-04-29T20:28:44Z] TAG_RAY_NODE_KIND: NODE_KIND_HEAD, [2025-04-29T20:28:44Z] TAG_RAY_NODE_STATUS: STATUS_UP_TO_DATE, [2025-04-29T20:28:44Z] TAG_RAY_USER_NODE_TYPE: "head", [2025-04-29T20:28:44Z] }, [2025-04-29T20:28:44Z] 1, [2025-04-29T20:28:44Z] ) [2025-04-29T20:28:44Z] [2025-04-29T20:28:44Z] runner = MockProcessRunner() [2025-04-29T20:28:44Z] lm = LoadMetrics() [2025-04-29T20:28:44Z] mock_gcs_client = MockGcsClient() [2025-04-29T20:28:44Z] autoscaler = MockAutoscaler( [2025-04-29T20:28:44Z] config_path, [2025-04-29T20:28:44Z] lm, [2025-04-29T20:28:44Z] mock_gcs_client, [2025-04-29T20:28:44Z] max_failures=0, [2025-04-29T20:28:44Z] process_runner=runner, [2025-04-29T20:28:44Z] update_interval_s=0, [2025-04-29T20:28:44Z] ) [2025-04-29T20:28:44Z] [2025-04-29T20:28:44Z] autoscaler.update() [2025-04-29T20:28:44Z] # 1 worker is ready upfront. [2025-04-29T20:28:44Z] self.waitForNodes(1, tag_filters=WORKER_FILTER) [2025-04-29T20:28:44Z] [2025-04-29T20:28:44Z] # Restore min_workers to allow scaling down to 0. [2025-04-29T20:28:44Z] config["available_node_types"]["worker"]["min_workers"] = 0 [2025-04-29T20:28:44Z] self.write_config(config) [2025-04-29T20:28:44Z] autoscaler.update() [2025-04-29T20:28:44Z] [2025-04-29T20:28:44Z] # Create a placement group with 2 bundles that require 2 workers. 
[2025-04-29T20:28:44Z] placement_group_table_data = gcs_pb2.PlacementGroupTableData( [2025-04-29T20:28:44Z] placement_group_id=b"\000", [2025-04-29T20:28:44Z] strategy=common_pb2.PlacementStrategy.SPREAD, [2025-04-29T20:28:44Z] ) [2025-04-29T20:28:44Z] for i in range(2): [2025-04-29T20:28:44Z] bundle = common_pb2.Bundle() [2025-04-29T20:28:44Z] bundle.bundle_id.placement_group_id = ( [2025-04-29T20:28:44Z] placement_group_table_data.placement_group_id [2025-04-29T20:28:44Z] ) [2025-04-29T20:28:44Z] bundle.bundle_id.bundle_index = i [2025-04-29T20:28:44Z] bundle.unit_resources["CPU"] = 1 [2025-04-29T20:28:44Z] placement_group_table_data.bundles.append(bundle) [2025-04-29T20:28:44Z] [2025-04-29T20:28:44Z] # Mark the first worker as idle, but it should not be scaled down by the autoscaler because it will be used by the placement group. [2025-04-29T20:28:44Z] worker_ip = self.provider.non_terminated_node_ips(WORKER_FILTER)[0] [2025-04-29T20:28:44Z] lm.update( [2025-04-29T20:28:44Z] worker_ip, [2025-04-29T20:28:44Z] mock_raylet_id(), [2025-04-29T20:28:44Z] {"CPU": 1}, [2025-04-29T20:28:44Z] {"CPU": 1}, [2025-04-29T20:28:44Z] 20, # idle for 20 seconds, which is longer than the idle_timeout_minutes. [2025-04-29T20:28:44Z] None, [2025-04-29T20:28:44Z] None, [2025-04-29T20:28:44Z] [placement_group_table_data], [2025-04-29T20:28:44Z] ) [2025-04-29T20:28:44Z] autoscaler.update() [2025-04-29T20:28:44Z] [2025-04-29T20:28:44Z] events = autoscaler.event_summarizer.summary() [2025-04-29T20:28:44Z] assert "Removing 1 nodes of type worker (idle)." not in events, events [2025-04-29T20:28:44Z] assert "Adding 1 node(s) of type worker." 
in events, events [2025-04-29T20:28:44Z] [2025-04-29T20:28:44Z] autoscaler.update() [2025-04-29T20:28:44Z] > self.waitForNodes(2, tag_filters=WORKER_FILTER) [2025-04-29T20:28:44Z] [2025-04-29T20:28:44Z] python/ray/tests/test_autoscaler.py:3708: [2025-04-29T20:28:44Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ [2025-04-29T20:28:44Z] python/ray/tests/test_autoscaler.py:414: in waitForNodes [2025-04-29T20:28:44Z] comparison(n, expected, msg="Unexpected node quantity.") [2025-04-29T20:28:44Z] E AssertionError: 3 != 2 : Unexpected node quantity. ``` They both overprovisioned work nodes (`AssertionError: 3 != 2`) due to the race between `autoscaler.update()` and the background NodeLauncher. In particular, the `pending_launches` counter in the `autoscaler` will be decreased by the background NodeLauncher asynchronously when launching a pending node. That can cause the pending node to disappear from the view of `autoscaler.update()` and thus let it overprovision a new node. The previous solution is adding `time.sleep(3)` between `autoscaler.update()` calls. https://github.com/ray-project/ray/blob/8561936c808464bbebc1117d3b5cd0652392b38b/python/ray/tests/test_autoscaler.py#L2245-L2247 I think we can make it more reliable by using `self.waitForNodes()` instead. This PR fixes these two flaky tests by adding `self.waitForNodes()` between `autoscaler.update()`. It also fixes errors (Runner deserialization error, event summary races) in the previous implementation of `testDontScaleDownIdleTimeOutForPlacementGroups`. Before this PR, these 2 tests would fail due to the race every 200 times. After this PR, these 2 tests can pass 10000 times without failures. ## Related issue number https://github.com/ray-project/ray/issues/52768 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. 
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Rueian <rueiancsie@gmail.com> commit 5ecb0e51273a00e14cc5878d28ac848a526c9aeb Author: Wei-Cheng Lai <qazwsx0939059006@gmail.com> Date: Mon May 12 17:46:38 2025 +0100 [docs][tune]: fix import & replace `session.report` with `tune.report` (#52801) Updated the documentation to improve clarity. Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com> commit 5324339f8407050db46b58f36b68ecdaf5ef31f6 Author: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com> Date: Mon May 12 09:36:24 2025 -0700 [Data] Add save modes to file data sinks (#52900)   ## Why are these changes needed?  In write_parquet, we want to be able to support - `OVERWRITE`: (If dir present, delete then write, otherwise, just create dir, then write) A more detailed description can be found in https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#save-modes This PR was meant to address https://anyscale1.atlassian.net/browse/DATA-946, but since the other save modes weren't that much work, I added the additional following 3 from apache spark too - `IGNORE`: (if dir present, silently pass) - `ERROR`: (if dir present, throw error) - `APPEND` (this is the current behavior we have, if dir present, we append files. Any conflicting file names are overwritten) ## Related issue number attentive requesting this  ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. 
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Co-authored-by: Balaji Veeramani <balaji@anyscale.com> commit 5c6ccfd848d61eed32d25378a2fb7b65a7c65119 Author: Kai-Hsun Chen <kaihsun@anyscale.com> Date: Mon May 12 09:23:00 2025 -0700 [core][refactor] Remove `GetSequenceNumber` (#52936) Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com> commit 983d1ab957fb489001c4d5ae1835f48477f49f71 Author: David Xia <david@davidxia.com> Date: Mon May 12 01:26:15 2025 -0400 [Doc] improve prometheus-grafana.md (#52821) Signed-off-by: David Xia <david@davidxia.com> commit 66b19d390d156635c32403226d6d6c6e82fb079d Author: lkchen <github@lkchen.net> Date: Sat May 10 12:12:10 2025 -0700 [ray.data.llm] Unify fields in SGLang and vLLM config (#52823) Signed-off-by: Linkun Chen <github@lkchen.net> Signed-off-by: lkchen <github@lkchen.net> commit e7bff7f09e7f5f75603c2a301a9fb19706381dbc Author: Philipp Moritz <pcmoritz@gmail.com> Date: Sat May 10 13:31:50 2025 +0800 Fix uv run when use with vllm's Ray backend (#52916)   ## Why are these changes needed? If vllm's Ray backend is used in the vllm V1 architecture, it will start a subprocess and then call ray.init in that subprocess to launch the actual vllm replicas. This PR makes it so the uv environment still gets propagated correctly in that case. 
This change is consistent with the behavior of how uv environments propagate to subprocesses with just vanilla `uv run` without Ray: ``` (base) pcmoritz@pcmoritz-DQ44HV60WX vllm-repro % cat pyproject.toml [project] name = "test" version = "0.1" dependencies = [ "ray", ] ``` ``` (base) (base) pcmoritz@pcmoritz-DQ44HV60WX vllm-repro % cat test.py import sys import ray import subprocess import psutil print(sys.executable) print(ray.__path__) # avoid fork bomb if len(psutil.Process().parents()) > 10: sys.exit(0) subprocess.check_call([sys.executable, "test.py"]) ``` ``` (base) pcmoritz@pcmoritz-DQ44HV60WX vllm-repro % uv run test.py warning: No `requires-python` value found in the workspace. Defaulting to `>=3.12`. /private/tmp/vllm-repro/.venv/bin/python3 ['/private/tmp/vllm-repro/.venv/lib/python3.12/site-packages/ray'] /private/tmp/vllm-repro/.venv/bin/python3 ['/private/tmp/vllm-repro/.venv/lib/python3.12/site-packages/ray'] /private/tmp/vllm-repro/.venv/bin/python3 ['/private/tmp/vllm-repro/.venv/lib/python3.12/site-packages/ray'] /private/tmp/vllm-repro/.venv/bin/python3 ['/private/tmp/vllm-repro/.venv/lib/python3.12/site-packages/ray'] /private/tmp/vllm-repro/.venv/bin/python3 ['/private/tmp/vllm-repro/.venv/lib/python3.12/site-packages/ray'] /private/tmp/vllm-repro/.venv/bin/python3 ['/private/tmp/vllm-repro/.venv/lib/python3.12/site-packages/ray'] ``` ## Related issue number  ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. 
Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Co-authored-by: pcmoritz <pcmoritz@anyscale.com> commit f3e86752eee651ee839dc97c13d558fdb370b08e Author: Goku Mohandas <gokumd@gmail.com> Date: Fri May 9 22:27:32 2025 -0700 Entity recognition with LLMs (#52342)   ## Why are these changes needed?  ## Related issue number  ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: GokuMohandas <gokumd@gmail.com> Signed-off-by: Goku Mohandas <gokumd@gmail.com> Signed-off-by: angelinalg <angelina@anyscale.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Co-authored-by: angelinalg <angelina@anyscale.com> commit 7d58cd76f00d8d96dc494f32a034f154308f9ce4 Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Date: Fri May 9 18:33:52 2025 -0700 [release] support using any dir in the repo as working dir (#52925) to support testing from docs dir Signed-off-by: Lonnie Liu <lonnie@anyscale.com> commit c983d99626e290e30efcd6f5bdc92e56b561d6bd Author: Dhyey Shah <dhyey2019@gmail.com> Date: Fri May 9 17:45:12 2025 -0700 [core] Record grpc client failures (#52790) Signed-off-by: dayshah <dhyey2019@gmail.com> commit 8addae4ed4d154ec999187277d4300cc592bfbbd Author: Christopher Zhang 
<chris@anyscale.com> Date: Fri May 9 17:30:34 2025 -0700 remove anyscale navbar on docs.ray.io (#52907) commit 40779a4fa92ad0b60adde92344fc52c6347ec4dd Author: Timothy Seah <timothy.seah777@yahoo.com> Date: Fri May 9 17:13:50 2025 -0700 [train][doc] Remove unused configuration-overview page (#52912) Signed-off-by: Timothy Seah <tseah@Mac.attlocal.net> Co-authored-by: Timothy Seah <tseah@Mac.attlocal.net> commit 257df20e399008254d0104b65d46bd52acf7a8a8 Author: Alexey Kudinkin <ak@anyscale.com> Date: Fri May 9 17:02:34 2025 -0700 [Data] Cleaning up Executor shutdown sequence (#52828) ## Why are these changes needed? 1. Log exception prompting the shutdown (if any) 2. Round durations logged (to millis) --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> commit d258fee55dc2051ea67b2422290d11e89985a484 Author: Connector Switch <c8ef@outlook.com> Date: Sat May 10 05:57:43 2025 +0800 [RLLIB] Fix simple typo in `rllib/evaluation/collectors/agent_collector.py` (#52773) Signed-off-by: Connector Switch <c8ef@outlook.com> commit 5c5590895ad10b956e4ad9fba4c2cda2be68541d Author: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com> Date: Fri May 9 14:16:41 2025 -0700 [Doc] Update configure-manage-dashboard.md (#52890) Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com> commit cc6790d8fc471f262395d3066b5d7bcac3241efd Author: Kai-Hsun Chen <kaihsun@anyscale.com> Date: Fri May 9 14:13:48 2025 -0700 [chore] Delete unused build.sh (#50649) Signed-off-by: kaihsun <kaihsun@anyscale.com> commit 86c0958a5f051780e1f4cf08ad37bde942040774 Author: Arthur Böök <atte.book@gmail.com> Date: Fri May 9 13:15:03 2025 -0700 [data][llm] fix: remove-no-longer needed guided decoding vllm v0 constraint (#52903) Signed-off-by: Arthur <atte.book@gmail.com> commit c769942d8251b3ab139cf823f4894340b77bb1cf Author: srinathk10 <68668616+srinathk10@users.noreply.github.com> Date: Fri May 9 12:03:35 2025 -0700 ImageDatasource::_read_stream Avoid unnecessary resize and 
convert (#52885) Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com> commit 3ad416c562bb4fd2ce58b93d2116573a4acc00a0 Author: Dhyey Shah <dhyey2019@gmail.com> Date: Fri May 9 09:27:48 2025 -0700 [core] Raylet Node Manager RPC Failure Documentation (#52710) Documentation for what happens when node manager rpc's fail. Signed-off-by: dayshah <dhyey2019@gmail.com> commit 478877e8f92faa1665adb9db967d4e88d5072279 Author: Kai-Hsun Chen <kaihsun@anyscale.com> Date: Fri May 9 09:27:01 2025 -0700 [core] Implement a thread pool and call the CPython API on all threads within the same concurrency group (#52575) --------- Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com> commit 0697b746c901b50125d8e6ba776bd6d5fe260224 Author: Dhyey Shah <dhyey2019@gmail.com> Date: Fri May 9 09:25:36 2025 -0700 [core] [docs] Dynamic generator deprecation (#52887) Deprecating the dynamic ref generator.
It was supposed to be deprecated a long time ago in favor of streaming generators but found that the deprecation warning on the docs page was actually never showing https://docs.ray.io/en/releases-2.46.0/ray-core/tasks/generators.html because the warning is above the title of the page. Moved the dynamic ref generator page under deprecated at the bottom of the ray generators page and outside the tasks subsection. Signed-off-by: dayshah <dhyey2019@gmail.com> commit 262af06532209c4dd81fe2046e29dab5af91bc9c Author: Dhyey Shah <dhyey2019@gmail.com> Date: Thu May 8 21:34:53 2025 -0700 [core] Label selector enum as class to fix windows build (#52884) Signed-off-by: dayshah <dhyey2019@gmail.com> commit 2bb3c5b62094ae468aeba9e2c52abaf64d5dadba Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com> Date: Thu May 8 16:19:20 2025 -0700 [pydoclint] core/_private docstring minimal format fixes (#52872) ## Why are these changes needed? This changes are part a batch effort to rewrite Ray's docstrings to be minimally pydoclint compliant. This PR focuses on making them at least pass basic formatting checks ## Related issue number  ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. 
Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Thomas Desrosiers <thomas@anyscale.com> commit 489f233ffcd0789282c16ab6e5806ee7fea1b037 Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com> Date: Thu May 8 15:48:23 2025 -0700 [pydoclint] util docstring minimal format errors (#52880)   ## Why are these changes needed? This changes are part a batch effort to rewrite Ray's docstrings to be minimally pydoclint compliant. This PR focuses on making them at least pass basic formatting checks  ## Related issue number  ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Thomas Desrosiers <thomas@anyscale.com> commit 94125f4c51d7a3369d3393d332f12dde7fc18b58 Author: matthewdeng <matt@anyscale.com> Date: Thu May 8 15:20:35 2025 -0700 [tune][train] update test_train_v2_integration to use correct RunConfig (#52882) Fixes an issue in which the wrong `RunConfig` was being used. Signed-off-by: Matthew Deng <matt@anyscale.com> commit e629be1cb48f59a3b34ac5cbe095832cd8c38e98 Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com> Date: Thu May 8 15:01:49 2025 -0700 [pydoclint] data docstring minimal format errors (#52883) ## Why are these changes needed? 
This changes are part a batch effort to rewrite Ray's docstrings to be minimally pydoclint compliant. This PR focuses on making them at least pass basic formatting checks ## Related issue number Extends https://github.com/ray-project/ray/pull/52874 w/ a few more ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Thomas Desrosiers <thomas@anyscale.com> commit b094f84690be3e8648aa93e5f28cb11c01dce2b6 Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com> Date: Thu May 8 14:28:03 2025 -0700 [pydoclint] core/autoscaler docstring minimal format errors (#52873) ## Why are these changes needed? This changes are part a batch effort to rewrite Ray's docstrings to be minimally pydoclint compliant. This PR focuses on making them at least pass basic formatting checks ## Related issue number  ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. 
Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Thomas Desrosiers <thomas@anyscale.com> commit e3235250621029f5053642db8ec78ea51f12ba57 Author: Jani Monoses <jani.monoses@gmail.com> Date: Fri May 9 00:04:24 2025 +0300 [llm] Embedding api (#52229) commit 6cc103e0b509b85368e4ee669f52a41d90ad6e89 Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com> Date: Thu May 8 14:01:55 2025 -0700 [pydoclint] workflow docstring minimal format errors (#52881)   ## Why are these changes needed? This changes are part a batch effort to rewrite Ray's docstrings to be minimally pydoclint compliant. This PR focuses on making them at least pass basic formatting checks ## Related issue number  ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Thomas Desrosiers <thomas@anyscale.com> commit ebdcd2e0db7271daa2bfbd98d528021b1b7a3f6b Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com> Date: Thu May 8 13:56:58 2025 -0700 [pydoclint] tune docstring minimal format errors (#52879)   ## Why are these changes needed? This changes are part a batch effort to rewrite Ray's docstrings to be minimally pydoclint compliant. 
This PR focuses on making them at least pass basic formatting checks  ## Related issue number  ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Thomas Desrosiers <thomas@anyscale.com> commit 67b2469f943a58e316fc686690933236240f3be7 Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com> Date: Thu May 8 13:56:35 2025 -0700 [pydoclint] serve docstring minimal format errors (#52877)   ## Why are these changes needed? This changes are part a batch effort to rewrite Ray's docstrings to be minimally pydoclint compliant. This PR focuses on making them at least pass basic formatting checks  ## Related issue number  ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. 
Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Thomas Desrosiers <thomas@anyscale.com> commit ee5cc4510ff44b297deb8ea3f7cdf0a75b2190fa Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com> Date: Thu May 8 13:55:32 2025 -0700 [pydoclint] llm docstring minimal format errors (#52876)   ## Why are these changes needed? This changes are part a batch effort to rewrite Ray's docstrings to be minimally pydoclint compliant. This PR focuses on making them at least pass basic formatting checks  ## Related issue number  ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Thomas Desrosiers <thomas@anyscale.com> commit 2e98bce60a7e6dfd5040895a3ed6b68a1357199c Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com> Date: Thu May 8 13:50:20 2025 -0700 [pydoclint] train docstring minimal format errors (#52878)   ## Why are these changes needed? This changes are part a batch effort to rewrite Ray's docstrings to be minimally pydoclint compliant. This PR focuses on making them at least pass basic formatting checks  ## Related issue number  ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. 
- [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Thomas Desrosiers <thomas@anyscale.com> commit f6cc12ef4a5634d419b868aa67d8e40c8876577f Author: srinathk10 <68668616+srinathk10@users.noreply.github.com> Date: Thu May 8 13:48:30 2025 -0700 Handle non-contiguous Tensors based GPU transfer (#52548) ## Why are these changes needed? Handle non-contiguous Tensors based GPU transfer. This allows removing the overhead of combining Arrow chunked arrays during Arrow -> Numpy -> Tensor conversion. --------- Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com> Signed-off-by: srinathk10 <68668616+srinathk10@users.noreply.github.com> commit c66bdf203278567d5a6ac3dfdcaff857899c1dba Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com> Date: Thu May 8 13:26:55 2025 -0700 [pydoclint] dashboard docstring minimal format errors (#52875) ## Why are these changes needed? This changes are part a batch effort to rewrite Ray's docstrings to be minimally pydoclint compliant. This PR focuses on making them at least pass basic formatting checks ## Related issue number  ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. 
For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Thomas Desrosiers <thomas@anyscale.com> commit 05ba14660f552f11c464d017678ed97eb67b2401 Author: Neil Girdhar <mistersheik@gmail.com> Date: Thu May 8 14:00:27 2025 -0400 [tune] Remove loguniform's base (#50415) Analytically, the base doesn't have any effect on the calculation for tune.loguniform and its variants. Numerically, it seems that the base can only make the calculation less precise, and definitely adds computation. Signed-off-by: Neil Girdhar <mistersheik@gmail.com> commit 62c6771f3f509868361bc9b360f3f61b056bb89b Author: Alexey Kudinkin <ak@anyscale.com> Date: Thu May 8 10:53:21 2025 -0700 [Data] Fix internal queues accounting for all Operators w/ an internal queue (#52806) ## Why are these changes needed? While working on https://github.com/ray-project/ray/pull/52754, i've realized that actually most of the operators w/ internal queues aren't reporting these properly. This PR addresses that problem by 1. Adding `InternalQueueOperatorMixin` forcing classes to implement required methods 2. Fixes `OpState` methods to properly distinguish b/w bundled pending dispatch and queued internally --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> commit fba9084aae34f5339b8db7858364321eb3a18419 Author: Kai-Hsun Chen <kaihsun@anyscale.com> Date: Thu May 8 09:55:44 2025 -0700 [core] `SetTaskStatus` should only be called within the same lock scope where `task_entry` is retrieved (#52770) This PR reverts https://github.com/ray-project/ray/pull/52695 and adds comments to explain where should `SetTaskStatus` be called. 
#52695 updates the value in submissble_tasks_ without acquiring the mutex lock. If multiple threads or coroutines write to the map, a rehash or deletion may occur, causing the pointer to the value to become invalid. ### Outdated PR statement #### Question https://github.com/ray-project/ray/pull/52695#discussion_r2072477177 Pointers to values in a `flat_hash_map` become invalid after a rehash. Additionally, we dereference those pointers in `RetryTask`, which doesn’t hold a mutex lock. Hence, it’s possible for the pointers to become invalid when other coroutines or threads insert or delete elements from the map, triggering a rehash. "Iterators, references, and pointers to elements are invalidated on rehash." ([reference](https://abseil.io/docs/cpp/guides/container)) #### Solution Changing `submissible_tasks_` from `absl::flat_hash_map<TaskID, TaskEntry>` to `absl::flat_hash_map<TaskID, std::unique_ptr<TaskEntry>>` requires a lot of changes. Hence, this PR implements a short-term solution by copying the value (i.e., TaskEntry) while holding the mutex lock where rehash will not be triggered by other threads / coroutine. --------- Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com> commit 9c988ee61460b3205081e5d6b6d903e3bf4f826e Author: Alan Guo <aguo@anyscale.com> Date: Thu May 8 09:20:32 2025 -0700 fix grafana dashboards dropdowns for data and train dashboard (#52752) Previously the dropdown for variables for data and train dashboard wasn't working for a few reasons: - Data dashboard used the ray_data_allocated_bytes metric which doesn't seem to be guaranteed metric to be emitted when ray data is used - Both data and train dashboard used label_values which only shows values for live metrics. 
Since these variables represent entities that are expected to stop emitting metrics over time, I changed to use a query that checks for any values over the time range selected based on the approach [suggested here](https://stackoverflow.com/questions/52778031/how-to-provide-label-values-in-grafana-variables-with-time-range-for-prometheus) --------- Signed-off-by: Alan Guo <aguo@anyscale.com> commit 589c1c94a5dcd80366a49418d28797b7f66aac99 Author: Alexey Kudinkin <ak@anyscale.com> Date: Thu May 8 05:12:55 2025 -0700 [Data] Re-enable Actor locality-based scheduling (#52861)   ## Why are these changes needed? Context --- Currently locality-aware scheduling is disabled due to https://github.com/ray-project/ray/issues/43466 However, since we're already using the new API, i've cleaned up the ranking and scheduling sequence and re-enabled locality aware scheduling. Changes --- - Added `RefBundle.get_preferred_object_locations` to compute a mapping of node-ids to total object bytes on the node - Added tests - Rebased `OutputSplitter` onto the new API - Rebased `ActorPool` onto `get_preferred_locations` - Re-enable locality hinting for actors by default ## Related issue number  ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. 
Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> commit 1f084697c3d6d5286e68b853b43030d359df2012 Author: angelinalg <122562471+angelinalg@users.noreply.github.com> Date: Thu May 8 04:54:20 2025 -0700 [docstring][rllib] Fix indentation errors in docstrings. (#52849) commit 7a21b34f3876f70c1b12c65affab245d59b60cf7 Author: Sven Mika <svenmika1977@gmail.com> Date: Thu May 8 09:06:24 2025 +0200 [RLlib] Add extra `self.stopped` check to APPO/IMPALA Learner (in case learner thread should stop while waiting for queue). (#52834) commit 988b689a08d18380afc7b70969dd4ed0c3b8ecee Author: Kevin H. Luu <kevin@anyscale.com> Date: Wed May 7 22:26:47 2025 -0700 [docker] Update latest Docker dependencies for 2.46.0 release (#52863) Created by release automation bot. Update with commit 52b43d0998f40d8aada0ffb89f41497fea4878b2 Signed-off-by: dayshah <dhyey2019@gmail.com> Co-authored-by: dayshah <dhyey2019@gmail.com> commit 5868480f6bb20fbc49e4dea7d5adb1279f36b464 Author: Kai-Hsun Chen <kaihsun@anyscale.com> Date: Wed May 7 21:22:10 2025 -0700 [core][chore] Correct `num_retries_left` and `num_oom_retries_left` in the log (#52857) Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com> commit d1823655707a7708ad99fda9cff93c1ac28b2f04 Author: angelinalg <122562471+angelinalg@users.noreply.github.com> Date: Wed May 7 21:09:20 2025 -0700 [docstring][train] fix indentation errors in docstrings (#52855) commit bcbee9fceeb4ff3edf2fa1518c915b8135aa204e Author: Kai-Hsun Chen <kaihsun@anyscale.com> Date: Wed May 7 21:07:32 2025 -0700 [core][refactor] Remove skip_execution (#52856) Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com> commit a698b631d3916866c9061ad22ad4fb0ec3574da8 Author: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com> Date: Wed May 7 20:23:38 2025 -0700 [Serve.llm] 
Bugfix for duplication of `<bos>` token (#52853) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> commit 0c1faa63df052c7509d783a3d0f05eb28ad79baa Author: angelinalg <122562471+angelinalg@users.noreply.github.com> Date: Wed May 7 19:08:20 2025 -0700 [docstring][data] fix indentation errors in docstrings (#52844) commit ce51640c81b6230e4375ae6dd75d9a9092f13e8d Author: angelinalg <122562471+angelinalg@users.noreply.github.com> Date: Wed May 7 18:36:54 2025 -0700 [docstring][serve] Fix indentation in doc strings. (#52841) commit c0a3cbe6a9cd9960ef5822ca742fb05fa6408e8e Author: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com> Date: Wed May 7 18:08:49 2025 -0700 [Serve.llm][Bugfix] in stream batching, first part of the stream was always consumed and not streamed back from the router (#52848) This PR addresses a bug in stream batching where extra tokens in the first batch were being discarded and adds comprehensive unit tests to verify both chat and completion behaviors under different batching and streaming configurations. - Fixes token loss in stream batching by peeking at the first generator element and correctly handling batched responses. - Adds new fixtures and tests to cover various scenarios (chat/completion, stream true/false, and multiple batching intervals). - Removes redundant configuration in the LLM server test to align with the new streaming batching behavior. --------- Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> commit 7c23b5a4d698066a789a8343692c805bd5242e3f Author: angelinalg <122562471+angelinalg@users.noreply.github.com> Date: Wed May 7 17:31:46 2025 -0700 [docstring][llm] fixing indent errors in docstrings (#52842) commit 8910c3bcb543cd79ecc2b909da01e801a0f2a972 Author: srinathk10 <68668616+srinathk10@users.noreply.github.com> Date: Wed May 7 17:05:34 2025 -0700 Train Tests: Update Image classification map fn (#52845)   ## Why are these changes needed?  Train Tests: Update Image classification map fn. 
- Current Image processing does np->tensor conversion with transpose to CHW and normalization. ``` 'train/epoch-avg': 37.75413007199859, 'train/epoch-max': 37.75413007199859, 'train/epoch-min': 37.75413007199859, 'train/epoch-total': 37.75413007199859, 'train/global_throughput': 3495.5491702359923, 'train/iter_batch-avg': 0.03661318445453074, 'train/iter_batch-max': 0.6135890130008193, 'train/iter_batch-min': 1.593200067873113e-05, 'train/iter_batch-total': 18.526271333992554, 'train/iter_first_batch-avg': 19.143331854998905, 'train/iter_first_batch-max': 19.143331854998905, 'train/iter_first_batch-min': 19.143331854998905, 'train/iter_first_batch-total': 19.143331854998905, 'train/iter_skip_batch-avg': inf, 'train/iter_skip_batch-max': 0, 'train/iter_skip_batch-min': inf, 'train/iter_skip_batch-total': 0, 'train/local_throughput': 873.8872925589981, 'train/rows_processed-avg': 32.0, 'train/rows_processed-max': 32, 'train/rows_processed-min': 32, 'train/rows_processed-total': 16192, 'train/step-avg': 4.809962454802502e-06, 'train/step-max': 2.1455998648889363e-05, 'train/step-min': 5.109995981911197e-07, 'train/step-total': 0.002433841002130066, 'validation/iter_batch-avg': inf, 'validation/iter_batch-max': 0, 'validation/iter_batch-min': inf, 'validation/iter_batch-total': 0, 'validation/step-avg': inf, 'validation/step-max': 0, 'validation/step-min': inf, 'validation/step-total': 0} -------------------------------------------------------------------------------- 2025-05-07 12:12:11,659 INFO test_utils.py:1953 -- Wrote results to /tmp/release_test_output.json 2025-05-07 12:12:11,660 INFO test_utils.py:1954 -- {"train/epoch-avg": 37.75413007199859, "train/epoch-min": 37.75413007199859, "train/epoch-max": 37.75413007199859, "train/epoch-total": 37.75413007199859, "train/iter_first_batch-avg": 19.143331854998905, "train/iter_first_batch-min": 19.143331854998905, "train/iter_first_batch-max": 19.143331854998905, "train/iter_first_batch-total": 19.143331854998905, 
"train/step-avg": 4.809962454802502e-06, "train/step-min": 5.109995981911197e-07, "train/step-max": 2.1455998648889363e-05, "train/step-total": 0.002433841002130066, "train/rows_processed-avg": 32.0, "train/rows_processed-min": 32, "train/rows_processed-max": 32, "train/rows_processed-total": 16192, "train/iter_batch-avg": 0.03661318445453074, "train/iter_batch-min": 1.593200067873113e-05, "train/iter_batch-max": 0.6135890130008193, "train/iter_batch-total": 18.526271333992554, "validation/step-avg": Infinity, "validation/step-min": Infinity, "validation/step-max": 0, "validation/step-total": 0, "validation/iter_batch-avg": Infinity, "validation/iter_batch-min": Infinity, "validation/iter_batch-max": 0, "validation/iter_batch-total": 0, "checkpoint/download-avg": Infinity, "checkpoint/download-min": Infinity, "checkpoint/download-max": 0, "checkpoint/download-total": 0, "checkpoint/load-avg": Infinity, "checkpoint/load-min": Infinity, "checkpoint/load-max": 0, "checkpoint/load-total": 0, "train/iter_skip_batch-avg": Infinity, "train/iter_skip_batch-min": Infinity, "train/iter_skip_batch-max": 0, "train/iter_skip_batch-total": 0, "train/local_throughput": 873.8872925589981, "train/global_throughput": 3495.5491702359923, "dataloader/train": {"producer_throughput": 1946.112621486268, "iter_stats": {"prefetch_block-avg": Infinity, "prefetch_block-min": Infinity, "prefetch_block-max": 0, "prefetch_block-total": 0, "fetch_block-avg": 0.0027022377159291976, "fetch_block-min": 0.0005052189990237821, "fetch_block-max": 0.0197697479998169, "fetch_block-total": 0.218881254990265, "block_to_batch-avg": 0.001253903843829141, "block_to_batch-min": 1.9893001081072725e-05, "block_to_batch-max": 0.01351481799974863, "block_to_batch-total": 0.6344753449775453, "format_batch-avg": 3.4910411080130395e-05, "format_batch-min": 9.00899976841174e-06, "format_batch-max": 0.0005209999999351567, "format_batch-total": 0.01766466800654598, "collate-avg": 0.0019578944209519855, "collate-min": 
0.00021700100114685483, "collate-max": 0.013516342000002624, "collate-total": 0.9906945770017046, "finalize-avg": 0.011252377077071, "finalize-min": 0.004483607999645756, "finalize-max": 0.03162657899883925, "finalize-total": 5.693702800997926, "time_spent_blocked-avg": 0.0742146621321331, "time_spent_blocked-min": 6.807998943259008e-06, "time_spent_blocked-max": 19.143022770000243, "time_spent_blocked-total": 37.62683370099148, "time_spent_training-avg": 0.00021408673321073962, "time_spent_training-min": 9.916999260894954e-06, "time_spent_training-max": 0.009087054999326938, "time_spent_training-total": 0.10832788700463425}}} ``` - Updated Image processing does np->PIL->Tensor. ``` 'train/epoch-avg': 30.73613611499968, 'train/epoch-max': 30.73613611499968, 'train/epoch-min': 30.73613611499968, 'train/epoch-total': 30.73613611499968, 'train/global_throughput': 5434.769027373354, 'train/iter_batch-avg': 0.023547696209505146, 'train/iter_batch-max': 0.3791560619993106, 'train/iter_batch-min': 1.732300006551668e-05, 'train/iter_batch-total': 11.915134282009603, 'train/iter_first_batch-avg': 18.71798381300141, 'train/iter_first_batch-max': 18.71798381300141, 'train/iter_first_batch-min': 18.71798381300141, 'train/iter_first_batch-total': 18.71798381300141, 'train/iter_skip_batch-avg': inf, 'train/iter_skip_batch-max': 0, 'train/iter_skip_batch-min': inf, 'train/iter_skip_batch-total': 0, 'train/local_throughput': 1358.6922568433386, 'train/rows_processed-avg': 32.0, 'train/rows_processed-max': 32, 'train/rows_processed-min': 32, 'train/rows_processed-total': 16192, 'train/step-avg': 4.362646225640153e-06, 'train/step-max': 2.6562000130070373e-05, 'train/step-min': 4.579997039400041e-07, 'train/step-total': 0.0022074989901739173, 'validation/iter_batch-avg': inf, 'validation/iter_batch-max': 0, 'validation/iter_batch-min': inf, 'validation/iter_batch-total': 0, 'validation/step-avg': inf, 'validation/step-max': 0, 'validation/step-min': inf, 'validation/step-total': 0} 
-------------------------------------------------------------------------------- 2025-05-07 12:32:57,439 INFO test_utils.py:1953 -- Wrote results to /tmp/release_test_output.json 2025-05-07 12:32:57,439 INFO test_utils.py:1954 -- {"train/epoch-avg": 30.73613611499968, "train/epoch-min": 30.73613611499968, "train/epoch-max": 30.73613611499968, "train/epoch-total": 30.73613611499968, "train/iter_first_batch-avg": 18.71798381300141, "train/iter_first_batch-min": 18.71798381300141, "train/iter_first_batch-max": 18.71798381300141, "train/iter_first_batch-total": 18.71798381300141, "train/step-avg": 4.362646225640153e-06, "train/step-min": 4.579997039400041e-07, "train/step-max": 2.6562000130070373e-05, "train/step-total": 0.0022074989901739173, "train/rows_processed-avg": 32.0, "train/rows_processed-min": 32, "train/rows_processed-max": 32, "train/rows_processed-total": 16192, "train/iter_batch-avg": 0.023547696209505146, "train/iter_batch-min": 1.732300006551668e-05, "train/iter_batch-max": 0.3791560619993106, "train/iter_batch-total": 11.915134282009603, "validation/step-avg": Infinity, "validation/step-min": Infinity, "validation/step-max": 0, "validation/step-total": 0, "validation/iter_batch-avg": Infinity, "validation/iter_batch-min": Infinity, "validation/iter_batch-max": 0, "validation/iter_batch-total": 0, "checkpoint/download-avg": Infinity, "checkpoint/download-min": Infinity, "checkpoint/download-max": 0, "checkpoint/download-total…

kevin85421 added the bug (something that is supposed to be working, but isn't), triage (needs triage, e.g., priority, bug/not-bug, and owning component), and core (issues that should be addressed in Ray Core) labels and removed the triage label Mar 4, 2025

kevin85421 self-assigned this Mar 4, 2025

kevin85421 mentioned this issue Apr 24, 2025

[core] Implement a thread pool and call the CPython API on all threads within the same concurrency group #52575

Merged

8 tasks

edoakes added the P0 (issues that should be fixed in short order) label Apr 24, 2025

edoakes closed this as completed in #52575 May 9, 2025
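
As a rough illustration of the approach named in #52575 (a custom thread pool in which the initializer and the releaser run on the same thread), here is a minimal Python sketch. It is not Ray's implementation, which is written in C++ with `std::thread`. `SameThreadPool`, `init`, and `release` are hypothetical names, and the hooks only record thread IDs where the real code would call `PyGILState_Ensure` and `PyGILState_Release`.

```python
import queue
import threading


class SameThreadPool:
    """Hypothetical pool: every worker runs both hooks on its own thread."""

    def __init__(self, num_threads, initializer, releaser):
        self._tasks = queue.Queue()
        self._threads = [
            threading.Thread(target=self._worker, args=(initializer, releaser))
            for _ in range(num_threads)
        ]
        for t in self._threads:
            t.start()

    def _worker(self, initializer, releaser):
        initializer()  # runs once, on this worker thread
        try:
            while True:
                task = self._tasks.get()
                if task is None:  # shutdown sentinel
                    break
                task()
        finally:
            releaser()  # runs on the same thread that ran initializer()

    def submit(self, fn):
        self._tasks.put(fn)

    def shutdown(self):
        # One sentinel per worker, then wait for every worker to exit.
        for _ in self._threads:
            self._tasks.put(None)
        for t in self._threads:
            t.join()


# Stand-in hooks that record which thread ran them.
initialized, released = set(), set()
lock = threading.Lock()


def init():
    with lock:
        initialized.add(threading.get_ident())


def release():
    with lock:
        released.add(threading.get_ident())


pool = SameThreadPool(4, init, release)
for _ in range(8):
    pool.submit(lambda: None)
pool.shutdown()
```

After `shutdown()` returns, the set of threads that ran `init` matches the set that ran `release`, which is the property a shared initializer/releaser in a generic thread pool could not guarantee. The sketch does not handle tasks that raise; a worker whose task fails simply runs its releaser and exits.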

