[core] Only one of the threads in a thread pool will be initialized as a long-running Python thread #51071

Closed
kevin85421 opened this issue Mar 4, 2025 · 0 comments · Fixed by #52575
Assignees
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
P0: Issues that should be fixed in short order

Comments

@kevin85421
Member

kevin85421 commented Mar 4, 2025

What happened + What you expected to happen

Currently, only one of the threads in a thread pool is initialized as a long-running Python thread. I should also investigate whether PyGILState_Release can be called in the thread pool on a thread other than the one that called PyGILState_Ensure.

Versions / Dependencies

TODO

Reproduction script

TODO

Issue Severity

None

@kevin85421 kevin85421 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) core Issues that should be addressed in Ray Core and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Mar 4, 2025
@kevin85421 kevin85421 self-assigned this Mar 4, 2025
@edoakes edoakes added the P0 Issues that should be fixed in short order label Apr 24, 2025
edoakes pushed a commit that referenced this issue May 9, 2025
…s within the same concurrency group (#52575)

We see the following error message from the CI runs of
`test_threaded_actor.py`
([example1](https://buildkite.com/ray-project/postmerge-macos/builds/5543#019659f5-7285-48fc-b1cf-588fd19bd050),
[example2](https://buildkite.com/ray-project/postmerge-macos/builds/5534#01965796-294c-41de-8e6f-ef2970134df2)).


![image](https://github.com/user-attachments/assets/d3a5d47a-1dc6-41b8-b258-d33699d4a04a)

The message "Fatal Python error: PyGILState_Release: auto-releasing
thread-state, but no thread-state for this thread" looks alarming, but it
does not cause any tests to fail.

The root cause is that `PyGILState_Release` is called on a thread that
has never called `PyGILState_Ensure`. See the [CPython source
code](https://github.com/python/cpython/blob/a94c7528b596e9ec234f12ebeeb45fc731412b18/Python/pystate.c#L2870)
for more details.

This happens because we can't control which thread in the thread pool
runs the initializer and the releaser. Hence, if a concurrency group has
more than one thread, the error message above may be printed when we
gracefully shut down an actor (i.e., `ray.actor.exit_actor()`).

In this PR, we implement our own thread pool using `std::thread`,
ensuring that both the initializer and the releaser run on the same
thread. Consequently, from the Python interpreter’s perspective, all
Python threads in the same concurrency group remain active even after
they finish executing Ray tasks.
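
The idea can be illustrated with a minimal Python sketch. The actual change is in C++ with `std::thread`; the class `PinnedThreadPool`, its methods, and the `record` helper below are hypothetical names chosen for illustration, not Ray's API. Each worker thread runs the initializer once, processes tasks, and then runs the releaser itself, so the two always execute on the same thread:

```python
import queue
import threading


class PinnedThreadPool:
    """Hypothetical pool: each worker runs initializer, tasks, and releaser
    on its own thread, so the releaser never runs on a foreign thread."""

    _STOP = object()  # sentinel telling a worker to exit its loop

    def __init__(self, num_threads, initializer, releaser):
        self._tasks = queue.Queue()
        self._initializer = initializer
        self._releaser = releaser
        self._threads = [
            threading.Thread(target=self._worker_loop) for _ in range(num_threads)
        ]
        for t in self._threads:
            t.start()

    def _worker_loop(self):
        self._initializer()  # runs once, on this worker thread
        try:
            while True:
                task = self._tasks.get()
                if task is self._STOP:
                    break
                task()
        finally:
            self._releaser()  # runs on the same thread as the initializer

    def post(self, task):
        self._tasks.put(task)

    def join(self):
        # One sentinel per worker, then wait for every worker to exit.
        for _ in self._threads:
            self._tasks.put(self._STOP)
        for t in self._threads:
            t.join()


initialized, released = [], []
lock = threading.Lock()


def record(target):
    """Return a callback that appends the calling thread's id to `target`."""
    def _record():
        with lock:
            target.append(threading.get_ident())
    return _record


pool = PinnedThreadPool(4, record(initialized), record(released))
pool.post(lambda: None)
pool.join()
```

In this sketch every thread id recorded by the initializer is also recorded by the releaser, which is the property a generic pool with a single shared initializer/releaser callback cannot guarantee.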

## Related issue number

Closes #51071

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(


```python
# test.py
import ray

@ray.remote
class ThreadActor:
    def __init__(self):
        self.counter = 0

    def increment(self):
        self.counter += 1
        return self.counter

    def terminate(self):
        ray.actor.exit_actor()

actor = ThreadActor.options(max_concurrency=10).remote()
print(ray.get(actor.increment.remote()))
ray.get(actor.terminate.remote())
```

* Without this PR: Ran the test 20 times and encountered the error
"PyGILState_Release: auto-releasing thread-state" 20 times.
<img width="1728" alt="Screenshot 2025-04-30 at 5 23 27 PM"
src="https://github.com/user-attachments/assets/644ffd89-8edf-4678-a0cd-528eb642fe66"
/>
* With this PR: Ran the test 20 times and encountered the error 0 times.
<img width="1728" alt="Screenshot 2025-04-30 at 5 25 10 PM"
src="https://github.com/user-attachments/assets/03afaa26-0027-4df4-915d-6165bb83583f"
/>

---------

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
jujipotle added a commit to jujipotle/ray that referenced this issue May 12, 2025
commit 0abf03bb30a7c234a0820dc4650b6df6d0cbea59
Author: srinathk10 <68668616+srinathk10@users.noreply.github.com>
Date:   Mon May 12 15:25:17 2025 -0700

    Train Tests: Disable cgroup isolation on head node for benchmarking (#52909)

    ---------

    Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
    Signed-off-by: srinathk10 <68668616+srinathk10@users.noreply.github.com>
    Co-authored-by: lanbochen-anyscale <103082133+lanbochen-anyscale@users.noreply.github.com>

commit 5349c66c6d5d022f03341b0ac9f1adb34079b0a5
Author: dev-goyal <126589393+dev-goyal@users.noreply.github.com>
Date:   Mon May 12 18:11:07 2025 -0400

    Minor enhancements to Databricks Unity Datasource (#52850)

    - Move imports around in `read_databricks_tables`. Now, installing
    `pyspark` is optional if desired.
    - Print a reason if the query fails
    - Expose the `is_truncated` field to the user, so they can intervene if
    needed.

    Signed-off-by: Dev <dev.goyal@hinge.co>

commit da52b137f10567e78fb0dd1937a7480cf70f56ee
Author: Matthew Owen <mowen@anyscale.com>
Date:   Mon May 12 13:46:01 2025 -0700

    [data] Remove unused allocated bytes panel and stat (#52943)

    ## Why are these changes needed?
    We do not update this stat anywhere in the codebase, this removes the
    stat and the corresponding panel.

    ## Related issue number

    <!-- For example: "Closes #1234" -->

    ## Checks

    - [ ] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(

    Signed-off-by: Matthew Owen <mowen@anyscale.com>

commit 61617ccf1e0280ae512991eadd1595f6d1a66f15
Author: matthewdeng <matt@anyscale.com>
Date:   Mon May 12 11:55:14 2025 -0700

    [train] bump test_torch_device_manager timeout (#52917)

    Test started flakily timing out. Bumping to verify if it's around the
    threshold.

    Signed-off-by: Matthew Deng <matt@anyscale.com>

commit 7e78c5aee84cf9d05ec8cc6a60d385e7f6df67e7
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date:   Mon May 12 11:10:05 2025 -0700

    [data] skip tfx-bsl tests on premerge (#52942)

    the base image is not resolving dependencies any more.

    Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

commit 0f864d70cfe812f0164cd6bb414daecbcbb6e8c7
Author: Rueian <rueiancsie@gmail.com>
Date:   Mon May 12 09:56:14 2025 -0700

    [core][autoscaler][v1] deflaky test_autoscaler (#52769)

    ## Why are these changes needed?

    From [the
    logs](https://buildkite.com/ray-project/postmerge/builds/9840#01968329-4ba3-422f-91e0-542d09855d68)
    provided by @kevin85421, `test_autoscaler.py` has 2 flaky tests:

    ```text
    [2025-04-29T20:28:44Z]
    [2025-04-29T20:28:44Z] =================================== FAILURES ===================================
    [2025-04-29T20:28:44Z] ____________________ AutoscalingTest.testConfiguresNewNodes ____________________
    [2025-04-29T20:28:44Z]
    [2025-04-29T20:28:44Z] self = <python.ray.tests.test_autoscaler.AutoscalingTest testMethod=testConfiguresNewNodes>
    [2025-04-29T20:28:44Z]
    [2025-04-29T20:28:44Z]     def testConfiguresNewNodes(self):
    [2025-04-29T20:28:44Z]         config = copy.deepcopy(SMALL_CLUSTER)
    [2025-04-29T20:28:44Z]         config["available_node_types"]["worker"]["min_workers"] = 1
    [2025-04-29T20:28:44Z]         config_path = self.write_config(config)
    [2025-04-29T20:28:44Z]         self.provider = MockProvider()
    [2025-04-29T20:28:44Z]         runner = MockProcessRunner()
    [2025-04-29T20:28:44Z]         runner.respond_to_call("json .Config.Env", ["[]" for i in range(2)])
    [2025-04-29T20:28:44Z]         self.provider.create_node(
    [2025-04-29T20:28:44Z]             {},
    [2025-04-29T20:28:44Z]             {
    [2025-04-29T20:28:44Z]                 TAG_RAY_NODE_KIND: NODE_KIND_HEAD,
    [2025-04-29T20:28:44Z]                 TAG_RAY_NODE_STATUS: STATUS_UP_TO_DATE,
    [2025-04-29T20:28:44Z]                 TAG_RAY_USER_NODE_TYPE: "head",
    [2025-04-29T20:28:44Z]             },
    [2025-04-29T20:28:44Z]             1,
    [2025-04-29T20:28:44Z]         )
    [2025-04-29T20:28:44Z]         autoscaler = MockAutoscaler(
    [2025-04-29T20:28:44Z]             config_path,
    [2025-04-29T20:28:44Z]             LoadMetrics(),
    [2025-04-29T20:28:44Z]             MockGcsClient(),
    [2025-04-29T20:28:44Z]             max_failures=0,
    [2025-04-29T20:28:44Z]             process_runner=runner,
    [2025-04-29T20:28:44Z]             update_interval_s=0,
    [2025-04-29T20:28:44Z]         )
    [2025-04-29T20:28:44Z]
    [2025-04-29T20:28:44Z]         autoscaler.update()
    [2025-04-29T20:28:44Z]         autoscaler.update()
    [2025-04-29T20:28:44Z]         self.waitForNodes(2)
    [2025-04-29T20:28:44Z]         self.provider.finish_starting_nodes()
    [2025-04-29T20:28:44Z]         # TODO(rickyx): This is a hack to avoid running into race conditions
    [2025-04-29T20:28:44Z]         # within v1 autoscaler. These should no longer be relevant in v2.
    [2025-04-29T20:28:44Z]         time.sleep(3)
    [2025-04-29T20:28:44Z]         autoscaler.update()
    [2025-04-29T20:28:44Z]         time.sleep(3)
    [2025-04-29T20:28:44Z] >       self.waitForNodes(2, tag_filters={TAG_RAY_NODE_STATUS: STATUS_UP_TO_DATE})
    [2025-04-29T20:28:44Z]
    [2025-04-29T20:28:44Z] python/ray/tests/test_autoscaler.py:2250:
    [2025-04-29T20:28:44Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    [2025-04-29T20:28:44Z] python/ray/tests/test_autoscaler.py:414: in waitForNodes
    [2025-04-29T20:28:44Z]     comparison(n, expected, msg="Unexpected node quantity.")
    [2025-04-29T20:28:44Z] E   AssertionError: 3 != 2 : Unexpected node quantity.
    ```
    and
    ```text
    [2025-04-29T20:28:44Z]
    [2025-04-29T20:28:44Z] =================================== FAILURES ===================================
    [2025-04-29T20:28:44Z] ________ AutoscalingTest.testDontScaleDownIdleTimeOutForPlacementGroups ________
    [2025-04-29T20:28:44Z]
    [2025-04-29T20:28:44Z] self = <python.ray.tests.test_autoscaler.AutoscalingTest testMethod=testDontScaleDownIdleTimeOutForPlacementGroups>
    [2025-04-29T20:28:44Z]
    [2025-04-29T20:28:44Z]     def testDontScaleDownIdleTimeOutForPlacementGroups(self):
    [2025-04-29T20:28:44Z]         config = copy.deepcopy(SMALL_CLUSTER)
    [2025-04-29T20:28:44Z]         config["available_node_types"]["head"]["resources"][
    [2025-04-29T20:28:44Z]             "CPU"
    [2025-04-29T20:28:44Z]         ] = 0  # make the head node not consume any resources.
    [2025-04-29T20:28:44Z]         config["available_node_types"]["worker"][
    [2025-04-29T20:28:44Z]             "min_workers"
    [2025-04-29T20:28:44Z]         ] = 1  # prepare 1 worker upfront.
    [2025-04-29T20:28:44Z]         config["idle_timeout_minutes"] = 0.1
    [2025-04-29T20:28:44Z]         config_path = self.write_config(config)
    [2025-04-29T20:28:44Z]
    [2025-04-29T20:28:44Z]         self.provider = MockProvider()
    [2025-04-29T20:28:44Z]         self.provider.create_node(
    [2025-04-29T20:28:44Z]             {},
    [2025-04-29T20:28:44Z]             {
    [2025-04-29T20:28:44Z]                 TAG_RAY_NODE_KIND: NODE_KIND_HEAD,
    [2025-04-29T20:28:44Z]                 TAG_RAY_NODE_STATUS: STATUS_UP_TO_DATE,
    [2025-04-29T20:28:44Z]                 TAG_RAY_USER_NODE_TYPE: "head",
    [2025-04-29T20:28:44Z]             },
    [2025-04-29T20:28:44Z]             1,
    [2025-04-29T20:28:44Z]         )
    [2025-04-29T20:28:44Z]
    [2025-04-29T20:28:44Z]         runner = MockProcessRunner()
    [2025-04-29T20:28:44Z]         lm = LoadMetrics()
    [2025-04-29T20:28:44Z]         mock_gcs_client = MockGcsClient()
    [2025-04-29T20:28:44Z]         autoscaler = MockAutoscaler(
    [2025-04-29T20:28:44Z]             config_path,
    [2025-04-29T20:28:44Z]             lm,
    [2025-04-29T20:28:44Z]             mock_gcs_client,
    [2025-04-29T20:28:44Z]             max_failures=0,
    [2025-04-29T20:28:44Z]             process_runner=runner,
    [2025-04-29T20:28:44Z]             update_interval_s=0,
    [2025-04-29T20:28:44Z]         )
    [2025-04-29T20:28:44Z]
    [2025-04-29T20:28:44Z]         autoscaler.update()
    [2025-04-29T20:28:44Z]         # 1 worker is ready upfront.
    [2025-04-29T20:28:44Z]         self.waitForNodes(1, tag_filters=WORKER_FILTER)
    [2025-04-29T20:28:44Z]
    [2025-04-29T20:28:44Z]         # Restore min_workers to allow scaling down to 0.
    [2025-04-29T20:28:44Z]         config["available_node_types"]["worker"]["min_workers"] = 0
    [2025-04-29T20:28:44Z]         self.write_config(config)
    [2025-04-29T20:28:44Z]         autoscaler.update()
    [2025-04-29T20:28:44Z]
    [2025-04-29T20:28:44Z]         # Create a placement group with 2 bundles that require 2 workers.
    [2025-04-29T20:28:44Z]         placement_group_table_data = gcs_pb2.PlacementGroupTableData(
    [2025-04-29T20:28:44Z]             placement_group_id=b"\000",
    [2025-04-29T20:28:44Z]             strategy=common_pb2.PlacementStrategy.SPREAD,
    [2025-04-29T20:28:44Z]         )
    [2025-04-29T20:28:44Z]         for i in range(2):
    [2025-04-29T20:28:44Z]             bundle = common_pb2.Bundle()
    [2025-04-29T20:28:44Z]             bundle.bundle_id.placement_group_id = (
    [2025-04-29T20:28:44Z]                 placement_group_table_data.placement_group_id
    [2025-04-29T20:28:44Z]             )
    [2025-04-29T20:28:44Z]             bundle.bundle_id.bundle_index = i
    [2025-04-29T20:28:44Z]             bundle.unit_resources["CPU"] = 1
    [2025-04-29T20:28:44Z]             placement_group_table_data.bundles.append(bundle)
    [2025-04-29T20:28:44Z]
    [2025-04-29T20:28:44Z]         # Mark the first worker as idle, but it should not be scaled down by the autoscaler because it will be used by the placement group.
    [2025-04-29T20:28:44Z]         worker_ip = self.provider.non_terminated_node_ips(WORKER_FILTER)[0]
    [2025-04-29T20:28:44Z]         lm.update(
    [2025-04-29T20:28:44Z]             worker_ip,
    [2025-04-29T20:28:44Z]             mock_raylet_id(),
    [2025-04-29T20:28:44Z]             {"CPU": 1},
    [2025-04-29T20:28:44Z]             {"CPU": 1},
    [2025-04-29T20:28:44Z]             20,  # idle for 20 seconds, which is longer than the idle_timeout_minutes.
    [2025-04-29T20:28:44Z]             None,
    [2025-04-29T20:28:44Z]             None,
    [2025-04-29T20:28:44Z]             [placement_group_table_data],
    [2025-04-29T20:28:44Z]         )
    [2025-04-29T20:28:44Z]         autoscaler.update()
    [2025-04-29T20:28:44Z]
    [2025-04-29T20:28:44Z]         events = autoscaler.event_summarizer.summary()
    [2025-04-29T20:28:44Z]         assert "Removing 1 nodes of type worker (idle)." not in events, events
    [2025-04-29T20:28:44Z]         assert "Adding 1 node(s) of type worker." in events, events
    [2025-04-29T20:28:44Z]
    [2025-04-29T20:28:44Z]         autoscaler.update()
    [2025-04-29T20:28:44Z] >       self.waitForNodes(2, tag_filters=WORKER_FILTER)
    [2025-04-29T20:28:44Z]
    [2025-04-29T20:28:44Z] python/ray/tests/test_autoscaler.py:3708:
    [2025-04-29T20:28:44Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    [2025-04-29T20:28:44Z] python/ray/tests/test_autoscaler.py:414: in waitForNodes
    [2025-04-29T20:28:44Z]     comparison(n, expected, msg="Unexpected node quantity.")
    [2025-04-29T20:28:44Z] E   AssertionError: 3 != 2 : Unexpected node quantity.
    ```

    They both overprovisioned worker nodes (`AssertionError: 3 != 2`) due to
    a race between `autoscaler.update()` and the background NodeLauncher.
    In particular, the background NodeLauncher asynchronously decrements the
    `pending_launches` counter in the `autoscaler` when it launches a pending
    node. As a result, the pending node can disappear from the view of
    `autoscaler.update()`, which then overprovisions a new node.

    The previous solution was to add `time.sleep(3)` between
    `autoscaler.update()` calls.

    https://github.com/ray-project/ray/blob/8561936c808464bbebc1117d3b5cd0652392b38b/python/ray/tests/test_autoscaler.py#L2245-L2247

    I think we can make it more reliable by using `self.waitForNodes()`
    instead.

    This PR fixes these two flaky tests by adding `self.waitForNodes()`
    between `autoscaler.update()` calls.
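
The difference between a fixed sleep and waiting on a condition can be sketched as follows. This is an illustration only: `wait_for` and `FakeLauncher` are hypothetical stand-ins, not the test suite's `waitForNodes()` or the real NodeLauncher, whose internals are not shown here. The helper polls until the background work is actually visible instead of assuming it finishes within a fixed delay:

```python
import threading
import time


def wait_for(condition, timeout_s=5.0, interval_s=0.01):
    """Poll `condition` until it returns True; raise if `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while not condition():
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(interval_s)


class FakeLauncher:
    """Hypothetical background launcher that completes asynchronously."""

    def __init__(self):
        self._lock = threading.Lock()
        self.pending = 1  # launches requested but not yet finished
        self.nodes = 1    # nodes that already exist

    def launch_async(self, delay_s):
        def _run():
            time.sleep(delay_s)
            with self._lock:
                # Update both counters together so readers see a consistent view.
                self.nodes += 1
                self.pending -= 1
        threading.Thread(target=_run).start()

    def node_count(self):
        with self._lock:
            return self.nodes


launcher = FakeLauncher()
launcher.launch_async(delay_s=0.05)
# Instead of time.sleep(3), block only until the expected state is observed.
wait_for(lambda: launcher.node_count() == 2)
```

A condition-based wait returns as soon as the state is reached and fails loudly on timeout, whereas a fixed sleep is both slower in the common case and still racy when the background work takes longer than expected.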

    It also fixes errors (Runner deserialization error, event summary races)
    in the previous implementation of
    `testDontScaleDownIdleTimeOutForPlacementGroups`.

    Before this PR, these 2 tests failed due to the race roughly once in
    every 200 runs. After this PR, they pass 10000 runs without
    failures.

    ## Related issue number

    https://github.com/ray-project/ray/issues/52768

    ## Checks

    - [x] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [x] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [x] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [x] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(

    ---------

    Signed-off-by: Rueian <rueiancsie@gmail.com>

commit 5ecb0e51273a00e14cc5878d28ac848a526c9aeb
Author: Wei-Cheng Lai <qazwsx0939059006@gmail.com>
Date:   Mon May 12 17:46:38 2025 +0100

    [docs][tune]: fix import & replace `session.report` with `tune.report` (#52801)

    Updated the documentation to improve clarity.

    Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com>

commit 5324339f8407050db46b58f36b68ecdaf5ef31f6
Author: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Date:   Mon May 12 09:36:24 2025 -0700

    [Data] Add save modes to file data sinks (#52900)

    ## Why are these changes needed?

    In write_parquet, we want to be able to support
    - `OVERWRITE`: (If dir present, delete then write, otherwise, just
    create dir, then write)

    A more detailed description can be found in
    https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#save-modes

    This PR was meant to address
    https://anyscale1.atlassian.net/browse/DATA-946, but since the other
    save modes weren't much more work, I also added the following 3 from
    Apache Spark:
    - `IGNORE`: (if dir present, silently pass)
    - `ERROR`: (if dir present, throw error)
    - `APPEND`: (this is the current behavior; if dir present, we append
    files. Any conflicting file names are overwritten)
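
The four save modes described above can be sketched in plain Python. This is not Ray Data's implementation: `SaveMode`, `prepare_output_dir`, and `write_files` are hypothetical names, and the sketch works on a local directory only, purely to show the intended semantics of each mode:

```python
import enum
import shutil
import tempfile
from pathlib import Path


class SaveMode(enum.Enum):
    APPEND = "append"
    OVERWRITE = "overwrite"
    IGNORE = "ignore"
    ERROR = "error"


def prepare_output_dir(path, mode):
    """Apply `mode` to `path`; return True if the caller should write files."""
    path = Path(path)
    if path.exists():
        if mode is SaveMode.IGNORE:
            return False  # directory present: silently skip the write
        if mode is SaveMode.ERROR:
            raise FileExistsError(f"{path} already exists")
        if mode is SaveMode.OVERWRITE:
            shutil.rmtree(path)  # delete the old contents first
        # APPEND: keep existing files; same-named files get overwritten below
    path.mkdir(parents=True, exist_ok=True)
    return True


def write_files(path, files, mode):
    """Write a {name: text} mapping into `path` according to `mode`."""
    if prepare_output_dir(path, mode):
        for name, text in files.items():
            (Path(path) / name).write_text(text)


def listing(path):
    return sorted(p.name for p in Path(path).iterdir())


out = Path(tempfile.mkdtemp()) / "out"
write_files(out, {"a.txt": "1"}, SaveMode.APPEND)
write_files(out, {"b.txt": "2"}, SaveMode.APPEND)     # keeps a.txt, adds b.txt
write_files(out, {"c.txt": "3"}, SaveMode.IGNORE)     # directory exists: no-op
after_ignore = listing(out)
write_files(out, {"d.txt": "4"}, SaveMode.OVERWRITE)  # old contents removed
after_overwrite = listing(out)
```

Handling the mode once, before any file is written, keeps the per-file write path identical across all four modes.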

    ## Related issue number
    attentive requesting this
    <!-- For example: "Closes #1234" -->

    ## Checks

    - [ ] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(

    ---------

    Co-authored-by: Balaji Veeramani <balaji@anyscale.com>

commit 5c6ccfd848d61eed32d25378a2fb7b65a7c65119
Author: Kai-Hsun Chen <kaihsun@anyscale.com>
Date:   Mon May 12 09:23:00 2025 -0700

    [core][refactor] Remove `GetSequenceNumber` (#52936)

    Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>

commit 983d1ab957fb489001c4d5ae1835f48477f49f71
Author: David Xia <david@davidxia.com>
Date:   Mon May 12 01:26:15 2025 -0400

    [Doc] improve prometheus-grafana.md (#52821)

    Signed-off-by: David Xia <david@davidxia.com>

commit 66b19d390d156635c32403226d6d6c6e82fb079d
Author: lkchen <github@lkchen.net>
Date:   Sat May 10 12:12:10 2025 -0700

    [ray.data.llm] Unify fields in SGLang and vLLM config (#52823)

    Signed-off-by: Linkun Chen <github@lkchen.net>
    Signed-off-by: lkchen <github@lkchen.net>

commit e7bff7f09e7f5f75603c2a301a9fb19706381dbc
Author: Philipp Moritz <pcmoritz@gmail.com>
Date:   Sat May 10 13:31:50 2025 +0800

    Fix uv run when use with vllm's Ray backend (#52916)

    ## Why are these changes needed?

    If vllm's Ray backend is used in the vllm V1 architecture, it will start
    a subprocess and then call ray.init in that subprocess to launch the
    actual vllm replicas. This PR makes it so the uv environment still gets
    propagated correctly in that case.

    This change is consistent with the behavior of how uv environments
    propagate to subprocesses with just vanilla `uv run` without Ray:

    ```
    (base) pcmoritz@pcmoritz-DQ44HV60WX vllm-repro % cat pyproject.toml
    [project]
    name = "test"
    version = "0.1"
    dependencies = [
       "ray",
    ]
    ```
    ```
    (base) (base) pcmoritz@pcmoritz-DQ44HV60WX vllm-repro % cat test.py
    import sys

    import ray
    import subprocess
    import psutil

    print(sys.executable)
    print(ray.__path__)

    # avoid fork bomb
    if len(psutil.Process().parents()) > 10:
        sys.exit(0)

    subprocess.check_call([sys.executable, "test.py"])
    ```

    ```
    (base) pcmoritz@pcmoritz-DQ44HV60WX vllm-repro % uv run test.py
    warning: No `requires-python` value found in the workspace. Defaulting to `>=3.12`.
    /private/tmp/vllm-repro/.venv/bin/python3
    ['/private/tmp/vllm-repro/.venv/lib/python3.12/site-packages/ray']
    /private/tmp/vllm-repro/.venv/bin/python3
    ['/private/tmp/vllm-repro/.venv/lib/python3.12/site-packages/ray']
    /private/tmp/vllm-repro/.venv/bin/python3
    ['/private/tmp/vllm-repro/.venv/lib/python3.12/site-packages/ray']
    /private/tmp/vllm-repro/.venv/bin/python3
    ['/private/tmp/vllm-repro/.venv/lib/python3.12/site-packages/ray']
    /private/tmp/vllm-repro/.venv/bin/python3
    ['/private/tmp/vllm-repro/.venv/lib/python3.12/site-packages/ray']
    /private/tmp/vllm-repro/.venv/bin/python3
    ['/private/tmp/vllm-repro/.venv/lib/python3.12/site-packages/ray']
    ```

    ## Related issue number

    <!-- For example: "Closes #1234" -->

    ## Checks

    - [ ] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(

    ---------

    Co-authored-by: pcmoritz <pcmoritz@anyscale.com>

commit f3e86752eee651ee839dc97c13d558fdb370b08e
Author: Goku Mohandas <gokumd@gmail.com>
Date:   Fri May 9 22:27:32 2025 -0700

    Entity recognition with LLMs (#52342)

    ## Why are these changes needed?

    <!-- Please give a short summary of the change and the problem this
    solves. -->

    ## Related issue number

    <!-- For example: "Closes #1234" -->

    ## Checks

    - [ ] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(

    ---------

    Signed-off-by: GokuMohandas <gokumd@gmail.com>
    Signed-off-by: Goku Mohandas <gokumd@gmail.com>
    Signed-off-by: angelinalg <angelina@anyscale.com>
    Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
    Co-authored-by: angelinalg <angelina@anyscale.com>

commit 7d58cd76f00d8d96dc494f32a034f154308f9ce4
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date:   Fri May 9 18:33:52 2025 -0700

    [release] support using any dir in the repo as working dir (#52925)

    to support testing from docs dir

    Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

commit c983d99626e290e30efcd6f5bdc92e56b561d6bd
Author: Dhyey Shah <dhyey2019@gmail.com>
Date:   Fri May 9 17:45:12 2025 -0700

    [core] Record grpc client failures (#52790)

    Signed-off-by: dayshah <dhyey2019@gmail.com>

commit 8addae4ed4d154ec999187277d4300cc592bfbbd
Author: Christopher Zhang <chris@anyscale.com>
Date:   Fri May 9 17:30:34 2025 -0700

    remove anyscale navbar on docs.ray.io (#52907)

commit 40779a4fa92ad0b60adde92344fc52c6347ec4dd
Author: Timothy Seah <timothy.seah777@yahoo.com>
Date:   Fri May 9 17:13:50 2025 -0700

    [train][doc] Remove unused configuration-overview page (#52912)

    Signed-off-by: Timothy Seah <tseah@Mac.attlocal.net>
    Co-authored-by: Timothy Seah <tseah@Mac.attlocal.net>

commit 257df20e399008254d0104b65d46bd52acf7a8a8
Author: Alexey Kudinkin <ak@anyscale.com>
Date:   Fri May 9 17:02:34 2025 -0700

    [Data] Cleaning up Executor shutdown sequence (#52828)

    ## Why are these changes needed?

    1. Log exception prompting the shutdown (if any)
    2. Round durations logged (to millis)

    ---------

    Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

commit d258fee55dc2051ea67b2422290d11e89985a484
Author: Connector Switch <c8ef@outlook.com>
Date:   Sat May 10 05:57:43 2025 +0800

    [RLLIB] Fix simple typo in `rllib/evaluation/collectors/agent_collector.py` (#52773)

    Signed-off-by: Connector Switch <c8ef@outlook.com>

commit 5c5590895ad10b956e4ad9fba4c2cda2be68541d
Author: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Date:   Fri May 9 14:16:41 2025 -0700

    [Doc] Update configure-manage-dashboard.md (#52890)

    Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>

commit cc6790d8fc471f262395d3066b5d7bcac3241efd
Author: Kai-Hsun Chen <kaihsun@anyscale.com>
Date:   Fri May 9 14:13:48 2025 -0700

    [chore] Delete unused build.sh (#50649)

    Signed-off-by: kaihsun <kaihsun@anyscale.com>

commit 86c0958a5f051780e1f4cf08ad37bde942040774
Author: Arthur Böök <atte.book@gmail.com>
Date:   Fri May 9 13:15:03 2025 -0700

    [data][llm] fix: remove-no-longer needed guided decoding vllm v0 constraint (#52903)

    Signed-off-by: Arthur <atte.book@gmail.com>

commit c769942d8251b3ab139cf823f4894340b77bb1cf
Author: srinathk10 <68668616+srinathk10@users.noreply.github.com>
Date:   Fri May 9 12:03:35 2025 -0700

    ImageDatasource::_read_stream Avoid unnecessary resize and convert (#52885)

    Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

commit 3ad416c562bb4fd2ce58b93d2116573a4acc00a0
Author: Dhyey Shah <dhyey2019@gmail.com>
Date:   Fri May 9 09:27:48 2025 -0700

    [core] Raylet Node Manager RPC Failure Documentation (#52710)

    Documentation for what happens when node manager rpc's fail.

    Signed-off-by: dayshah <dhyey2019@gmail.com>

commit 478877e8f92faa1665adb9db967d4e88d5072279
Author: Kai-Hsun Chen <kaihsun@anyscale.com>
Date:   Fri May 9 09:27:01 2025 -0700

    [core] Implement a thread pool and call the CPython API on all threads within the same concurrency group (#52575)

    We see the following error message from the CI runs of
    `test_threaded_actor.py`
    ([example1](https://buildkite.com/ray-project/postmerge-macos/builds/5543#019659f5-7285-48fc-b1cf-588fd19bd050),
    [example2](https://buildkite.com/ray-project/postmerge-macos/builds/5534#01965796-294c-41de-8e6f-ef2970134df2)).

    ![image](https://github.com/user-attachments/assets/d3a5d47a-1dc6-41b8-b258-d33699d4a04a)

    The message "Fatal Python error: PyGILState_Release: auto-releasing
    thread-state, but no thread-state for this thread" is very scary, but it
    will not cause any tests to fail.

    The root cause is that `PyGILState_Release` is called on a thread that
    has never called `PyGILState_Ensure`. See the [CPython source
    code](https://github.com/python/cpython/blob/a94c7528b596e9ec234f12ebeeb45fc731412b18/Python/pystate.c#L2870)
    for more details.

    The reason is that we can't control which thread in the thread pool will
    run the initializer/releaser. Hence, if a concurrency group has more
    than one thread, the error message above may be printed when we
    gracefully shut down an actor (i.e., `ray.actor.exit_actor()`).

    In this PR, we implement our own thread pool using `std::thread`,
    ensuring that both the initializer and the releaser run on the same
    thread. Consequently, from the Python interpreter’s perspective, all
    Python threads in the same concurrency group remain active even after
    they finish executing Ray tasks.
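
The idea can be sketched in a few lines of Python. This is only an illustration, not the `std::thread` pool added by the PR: the class name `PinnedThreadPool` and its methods are hypothetical. Each worker runs the initializer when it starts and the releaser just before it exits, so both always execute on the same thread.

```python
import queue
import threading


class PinnedThreadPool:
    """Toy pool (hypothetical) whose workers run init and release themselves."""

    _STOP = object()  # sentinel telling a worker to exit

    def __init__(self, num_threads, initializer, releaser):
        self._tasks = queue.Queue()
        self._threads = [
            threading.Thread(target=self._worker, args=(initializer, releaser))
            for _ in range(num_threads)
        ]
        for t in self._threads:
            t.start()

    def _worker(self, initializer, releaser):
        initializer()  # runs on the worker thread itself
        try:
            while True:
                task = self._tasks.get()
                if task is self._STOP:
                    break
                task()
        finally:
            releaser()  # same thread that ran the initializer

    def submit(self, fn):
        self._tasks.put(fn)

    def shutdown(self):
        # One sentinel per worker, then wait for every worker to exit.
        for _ in self._threads:
            self._tasks.put(self._STOP)
        for t in self._threads:
            t.join()


init_ids, release_ids = [], []
pool = PinnedThreadPool(
    4,
    initializer=lambda: init_ids.append(threading.get_ident()),
    releaser=lambda: release_ids.append(threading.get_ident()),
)
pool.submit(lambda: None)
pool.shutdown()
```

After `shutdown()`, `init_ids` and `release_ids` contain the same four thread identifiers, which is the property the PR relies on for `PyGILState_Ensure` and `PyGILState_Release`.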

    ## Related issue number

    Closes #51071

    ## Checks

    - [ ] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(

    ```python
    # test.py
    import ray

    @ray.remote
    class ThreadActor:
        def __init__(self):
            self.counter = 0

        def increment(self):
            self.counter += 1
            return self.counter

        def terminate(self):
            ray.actor.exit_actor()

    actor = ThreadActor.options(max_concurrency=10).remote()
    print(ray.get(actor.increment.remote()))
    ray.get(actor.terminate.remote())
    ```

    * Without this PR: Ran the test 20 times and encountered the error
    "PyGILState_Release: auto-releasing thread-state" 20 times.
    <img width="1728" alt="Screenshot 2025-04-30 at 5 23 27 PM"
    src="https://github.com/user-attachments/assets/644ffd89-8edf-4678-a0cd-528eb642fe66"
    />
    * With this PR: Ran the test 20 times and encountered the error 0 times.
    <img width="1728" alt="Screenshot 2025-04-30 at 5 25 10 PM"
    src="https://github.com/user-attachments/assets/03afaa26-0027-4df4-915d-6165bb83583f"
    />

    ---------

    Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>

commit 0697b746c901b50125d8e6ba776bd6d5fe260224
Author: Dhyey Shah <dhyey2019@gmail.com>
Date:   Fri May 9 09:25:36 2025 -0700

    [core] [docs] Dynamic generator deprecation (#52887)

    Deprecating the dynamic ref generator.

    It was supposed to be deprecated a long time ago in favor of streaming
    generators, but we found that the deprecation warning on the docs page
    (https://docs.ray.io/en/releases-2.46.0/ray-core/tasks/generators.html)
    was never actually shown because the warning sits above the page title.

    Moved the dynamic ref generator page under deprecated at the bottom of
    the ray generators page and outside the tasks subsection.

    Signed-off-by: dayshah <dhyey2019@gmail.com>

commit 262af06532209c4dd81fe2046e29dab5af91bc9c
Author: Dhyey Shah <dhyey2019@gmail.com>
Date:   Thu May 8 21:34:53 2025 -0700

    [core] Label selector enum as class to fix windows build (#52884)

    Signed-off-by: dayshah <dhyey2019@gmail.com>

commit 2bb3c5b62094ae468aeba9e2c52abaf64d5dadba
Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>
Date:   Thu May 8 16:19:20 2025 -0700

    [pydoclint] core/_private docstring minimal format fixes (#52872)

    ## Why are these changes needed?

    These changes are part of a batch effort to rewrite Ray's docstrings to be
    minimally pydoclint compliant. This PR focuses on making them at least
    pass basic formatting checks.

    ## Related issue number

    <!-- For example: "Closes #1234" -->

    ## Checks

    - [x] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(

    ---------

    Signed-off-by: Thomas Desrosiers <thomas@anyscale.com>

commit 489f233ffcd0789282c16ab6e5806ee7fea1b037
Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>
Date:   Thu May 8 15:48:23 2025 -0700

    [pydoclint] util docstring minimal format errors (#52880)

    ## Why are these changes needed?

    These changes are part of a batch effort to rewrite Ray's docstrings to be
    minimally pydoclint compliant. This PR focuses on making them at least
    pass basic formatting checks.

    ## Related issue number

    <!-- For example: "Closes #1234" -->

    ## Checks

    - [x] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(

    ---------

    Signed-off-by: Thomas Desrosiers <thomas@anyscale.com>

commit 94125f4c51d7a3369d3393d332f12dde7fc18b58
Author: matthewdeng <matt@anyscale.com>
Date:   Thu May 8 15:20:35 2025 -0700

    [tune][train] update test_train_v2_integration to use correct RunConfig (#52882)

    Fixes an issue in which the wrong `RunConfig` was being used.

    Signed-off-by: Matthew Deng <matt@anyscale.com>

commit e629be1cb48f59a3b34ac5cbe095832cd8c38e98
Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>
Date:   Thu May 8 15:01:49 2025 -0700

    [pydoclint] data docstring minimal format errors (#52883)

    ## Why are these changes needed?

    These changes are part of a batch effort to rewrite Ray's docstrings to be
    minimally pydoclint compliant. This PR focuses on making them at least
    pass basic formatting checks.

    ## Related issue number

    Extends https://github.com/ray-project/ray/pull/52874 w/ a few more

    ## Checks

    - [x] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(

    Signed-off-by: Thomas Desrosiers <thomas@anyscale.com>

commit b094f84690be3e8648aa93e5f28cb11c01dce2b6
Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>
Date:   Thu May 8 14:28:03 2025 -0700

    [pydoclint] core/autoscaler docstring minimal format errors (#52873)

    ## Why are these changes needed?

    These changes are part of a batch effort to rewrite Ray's docstrings to be
    minimally pydoclint compliant. This PR focuses on making them at least
    pass basic formatting checks.

    ## Related issue number

    <!-- For example: "Closes #1234" -->

    ## Checks

    - [x] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(

    Signed-off-by: Thomas Desrosiers <thomas@anyscale.com>

commit e3235250621029f5053642db8ec78ea51f12ba57
Author: Jani Monoses <jani.monoses@gmail.com>
Date:   Fri May 9 00:04:24 2025 +0300

    [llm] Embedding api (#52229)

commit 6cc103e0b509b85368e4ee669f52a41d90ad6e89
Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>
Date:   Thu May 8 14:01:55 2025 -0700

    [pydoclint] workflow docstring minimal format errors (#52881)

    ## Why are these changes needed?

    These changes are part of a batch effort to rewrite Ray's docstrings to be
    minimally pydoclint compliant. This PR focuses on making them at least
    pass basic formatting checks.

    ## Related issue number

    <!-- For example: "Closes #1234" -->

    ## Checks

    - [x] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(

    Signed-off-by: Thomas Desrosiers <thomas@anyscale.com>

commit ebdcd2e0db7271daa2bfbd98d528021b1b7a3f6b
Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>
Date:   Thu May 8 13:56:58 2025 -0700

    [pydoclint] tune docstring minimal format errors (#52879)

    ## Why are these changes needed?

    These changes are part of a batch effort to rewrite Ray's docstrings to be
    minimally pydoclint compliant. This PR focuses on making them at least
    pass basic formatting checks.

    ## Related issue number

    <!-- For example: "Closes #1234" -->

    ## Checks

    - [x] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(

    Signed-off-by: Thomas Desrosiers <thomas@anyscale.com>

commit 67b2469f943a58e316fc686690933236240f3be7
Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>
Date:   Thu May 8 13:56:35 2025 -0700

    [pydoclint] serve docstring minimal format errors (#52877)

    ## Why are these changes needed?

    These changes are part of a batch effort to rewrite Ray's docstrings to be
    minimally pydoclint compliant. This PR focuses on making them at least
    pass basic formatting checks.

    ## Related issue number

    <!-- For example: "Closes #1234" -->

    ## Checks

    - [x] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(

    Signed-off-by: Thomas Desrosiers <thomas@anyscale.com>

commit ee5cc4510ff44b297deb8ea3f7cdf0a75b2190fa
Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>
Date:   Thu May 8 13:55:32 2025 -0700

    [pydoclint] llm docstring minimal format errors (#52876)

    ## Why are these changes needed?

    These changes are part of a batch effort to rewrite Ray's docstrings to be
    minimally pydoclint compliant. This PR focuses on making them at least
    pass basic formatting checks.

    ## Related issue number

    <!-- For example: "Closes #1234" -->

    ## Checks

    - [x] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(

    Signed-off-by: Thomas Desrosiers <thomas@anyscale.com>

commit 2e98bce60a7e6dfd5040895a3ed6b68a1357199c
Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>
Date:   Thu May 8 13:50:20 2025 -0700

    [pydoclint] train docstring minimal format errors (#52878)

    ## Why are these changes needed?

    These changes are part of a batch effort to rewrite Ray's docstrings to be
    minimally pydoclint compliant. This PR focuses on making them at least
    pass basic formatting checks.

    ## Related issue number

    <!-- For example: "Closes #1234" -->

    ## Checks

    - [x] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(

    Signed-off-by: Thomas Desrosiers <thomas@anyscale.com>

commit f6cc12ef4a5634d419b868aa67d8e40c8876577f
Author: srinathk10 <68668616+srinathk10@users.noreply.github.com>
Date:   Thu May 8 13:48:30 2025 -0700

    Handle non-contiguous Tensors based GPU transfer (#52548)

    ## Why are these changes needed?
    Handle GPU transfer of non-contiguous tensors. This removes the overhead
    of combining Arrow chunked arrays during the Arrow -> NumPy -> Tensor
    conversion.

    ---------

    Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
    Signed-off-by: srinathk10 <68668616+srinathk10@users.noreply.github.com>

commit c66bdf203278567d5a6ac3dfdcaff857899c1dba
Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>
Date:   Thu May 8 13:26:55 2025 -0700

    [pydoclint] dashboard docstring minimal format errors (#52875)

    ## Why are these changes needed?

    These changes are part of a batch effort to rewrite Ray's docstrings to be
    minimally pydoclint compliant. This PR focuses on making them at least
    pass basic formatting checks.

    ## Related issue number

    <!-- For example: "Closes #1234" -->

    ## Checks

    - [x] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(

    ---------

    Signed-off-by: Thomas Desrosiers <thomas@anyscale.com>

commit 05ba14660f552f11c464d017678ed97eb67b2401
Author: Neil Girdhar <mistersheik@gmail.com>
Date:   Thu May 8 14:00:27 2025 -0400

    [tune] Remove loguniform's base (#50415)

    Analytically, the base doesn't have any effect on the calculation
    for tune.loguniform and its variants.
    Numerically, it seems that the base can only make the calculation less
    precise, and definitely adds computation.
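
The analytic claim can be checked numerically. The sketch below is not Tune's implementation; both function names are hypothetical. It maps the same unit-interval draw `u` to a log-uniform sample using several bases and compares the results with the base-free form `low * (high / low) ** u`.

```python
import math


def loguniform_from_unit(u, low, high, base):
    """Map u in [0, 1) to a log-uniform sample using logarithms in `base`."""
    log_low = math.log(low, base)
    log_high = math.log(high, base)
    return base ** (log_low + u * (log_high - log_low))


def loguniform_base_free(u, low, high):
    """Equivalent closed form that never mentions a base."""
    return low * (high / low) ** u


u, low, high = 0.37, 1e-4, 1e-1
samples = [loguniform_from_unit(u, low, high, b) for b in (2, math.e, 10)]
reference = loguniform_base_free(u, low, high)
```

Every entry of `samples` agrees with `reference` up to floating-point rounding, so the base only adds extra logarithm and power evaluations.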

    Signed-off-by: Neil Girdhar <mistersheik@gmail.com>

commit 62c6771f3f509868361bc9b360f3f61b056bb89b
Author: Alexey Kudinkin <ak@anyscale.com>
Date:   Thu May 8 10:53:21 2025 -0700

    [Data] Fix internal queues accounting for all Operators w/ an internal queue (#52806)

    ## Why are these changes needed?

    While working on https://github.com/ray-project/ray/pull/52754, I
    realized that most of the operators with internal queues aren't
    reporting them properly.

    This PR addresses that problem by:

    1. Adding `InternalQueueOperatorMixin`, which forces classes to implement
    the required methods
    2. Fixing `OpState` methods to properly distinguish between bundles pending
    dispatch and bundles queued internally
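
A mixin can force subclasses to implement reporting methods by declaring them abstract. The sketch below shows that general pattern only; `InternalQueueMixin`, `internal_queue_size`, and both operator classes are hypothetical names, not the API introduced by this PR.

```python
from abc import ABC, abstractmethod


class InternalQueueMixin(ABC):
    """Hypothetical mixin: subclasses must report their internal queue size."""

    @abstractmethod
    def internal_queue_size(self) -> int:
        ...


class BufferingOperator(InternalQueueMixin):
    def __init__(self):
        self._buffer = []

    def add_input(self, bundle):
        self._buffer.append(bundle)

    def internal_queue_size(self) -> int:
        return len(self._buffer)


class ForgetfulOperator(InternalQueueMixin):
    """Does not implement the required method, so it cannot be instantiated."""


op = BufferingOperator()
op.add_input("bundle-1")
op.add_input("bundle-2")

try:
    ForgetfulOperator()
    instantiation_failed = False
except TypeError:
    instantiation_failed = True
```

With this pattern, an operator that forgets to report its queue fails at construction time instead of silently under-reporting.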
    ---------

    Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

commit fba9084aae34f5339b8db7858364321eb3a18419
Author: Kai-Hsun Chen <kaihsun@anyscale.com>
Date:   Thu May 8 09:55:44 2025 -0700

    [core] `SetTaskStatus` should only be called within the same lock scope where `task_entry` is retrieved (#52770)

    This PR reverts https://github.com/ray-project/ray/pull/52695 and adds
    comments to explain where `SetTaskStatus` should be called.

    #52695 updates the value in `submissible_tasks_` without acquiring the
    mutex lock. If multiple threads or coroutines write to the map, a rehash
    or deletion may occur, causing the pointer to the value to become
    invalid.
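
The rule in the PR title, look up the entry and update it inside one lock scope, can be shown with a rough Python analogue. Python dictionaries do not have the pointer-invalidation problem of `absl::flat_hash_map`, so this only illustrates the locking discipline; `TaskTable` and its methods are hypothetical.

```python
import threading
from dataclasses import dataclass


@dataclass
class TaskEntry:
    status: str
    num_retries_left: int


class TaskTable:
    """Hypothetical table: every lookup-and-update happens under one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}

    def add(self, task_id, entry):
        with self._lock:
            self._entries[task_id] = entry

    def set_status(self, task_id, status):
        # The entry is retrieved and mutated inside the same lock scope, so
        # no other writer can modify the table between the two steps.
        with self._lock:
            entry = self._entries[task_id]
            entry.status = status

    def get_status(self, task_id):
        with self._lock:
            return self._entries[task_id].status


table = TaskTable()
table.add("t1", TaskEntry(status="PENDING", num_retries_left=3))
table.set_status("t1", "FINISHED")
```

The unsafe variant would return a reference from one locked call and write through it later, after the lock has been released.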

    ### Outdated PR statement

    #### Question

    https://github.com/ray-project/ray/pull/52695#discussion_r2072477177

    Pointers to values in a `flat_hash_map` become invalid after a rehash.
    Additionally, we dereference those pointers in `RetryTask`, which
    doesn’t hold a mutex lock. Hence, it’s possible for the pointers to
    become invalid when other coroutines or threads insert or delete
    elements from the map, triggering a rehash.

    "Iterators, references, and pointers to elements are invalidated on
    rehash." ([reference](https://abseil.io/docs/cpp/guides/container))

    #### Solution

    Changing `submissible_tasks_` from `absl::flat_hash_map<TaskID,
    TaskEntry>` to `absl::flat_hash_map<TaskID, std::unique_ptr<TaskEntry>>`
    requires a lot of changes.

    Hence, this PR implements a short-term solution by copying the value
    (i.e., `TaskEntry`) while holding the mutex lock, so that other threads
    or coroutines cannot trigger a rehash during the copy.

    ---------

    Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>

commit 9c988ee61460b3205081e5d6b6d903e3bf4f826e
Author: Alan Guo <aguo@anyscale.com>
Date:   Thu May 8 09:20:32 2025 -0700

    fix grafana dashboards dropdowns for data and train dashboard (#52752)

    Previously the dropdowns for variables in the data and train dashboards
    weren't working for a few reasons:
    - The data dashboard used the `ray_data_allocated_bytes` metric, which
    doesn't seem to be guaranteed to be emitted when Ray Data is used
    - Both the data and train dashboards used `label_values`, which only shows
    values for live metrics. Since these variables represent entities that
    are expected to stop emitting metrics over time, I changed them to use a
    query that checks for any values over the selected time range, based on
    the approach [suggested
    here](https://stackoverflow.com/questions/52778031/how-to-provide-label-values-in-grafana-variables-with-time-range-for-prometheus)

    ---------

    Signed-off-by: Alan Guo <aguo@anyscale.com>

commit 589c1c94a5dcd80366a49418d28797b7f66aac99
Author: Alexey Kudinkin <ak@anyscale.com>
Date:   Thu May 8 05:12:55 2025 -0700

    [Data] Re-enable Actor locality-based scheduling (#52861)

    ## Why are these changes needed?

    Context
    ---

    Currently locality-aware scheduling is disabled due to
    https://github.com/ray-project/ray/issues/43466

    However, since we're already using the new API, I've cleaned up the
    ranking and scheduling sequence and re-enabled locality-aware
    scheduling.

    Changes
    ---

    - Added `RefBundle.get_preferred_object_locations` to compute a mapping
    of node-ids to total object bytes on the node
    - Added tests
    - Rebased `OutputSplitter` onto the new API
    - Rebased `ActorPool` onto `get_preferred_locations`
    - Re-enabled locality hinting for actors by default
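
The mapping described in the first bullet can be sketched as follows. This is an illustration, not the actual `RefBundle` method: the function name, the `(size_bytes, node_ids)` input shape, and the node names are assumptions.

```python
from collections import defaultdict


def preferred_object_locations(blocks):
    """Sum object bytes per node.

    `blocks` is a list of (size_bytes, node_ids) pairs, where node_ids lists
    every node that holds a copy of the block (an assumed input shape).
    """
    bytes_per_node = defaultdict(int)
    for size_bytes, node_ids in blocks:
        for node_id in node_ids:
            bytes_per_node[node_id] += size_bytes
    return dict(bytes_per_node)


blocks = [
    (100, ["node-a"]),
    (50, ["node-a", "node-b"]),
    (25, ["node-b"]),
]
locations = preferred_object_locations(blocks)
# A locality-aware scheduler could rank candidates by bytes already local.
best_node = max(locations, key=locations.get)
```

Here `node-a` holds 150 bytes and `node-b` holds 75, so `node-a` is the preferred target.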

    ## Related issue number

    <!-- For example: "Closes #1234" -->

    ## Checks

    - [ ] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(

    ---------

    Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

commit 1f084697c3d6d5286e68b853b43030d359df2012
Author: angelinalg <122562471+angelinalg@users.noreply.github.com>
Date:   Thu May 8 04:54:20 2025 -0700

    [docstring][rllib] Fix indentation errors in docstrings. (#52849)

commit 7a21b34f3876f70c1b12c65affab245d59b60cf7
Author: Sven Mika <svenmika1977@gmail.com>
Date:   Thu May 8 09:06:24 2025 +0200

    [RLlib] Add extra `self.stopped` check to APPO/IMPALA Learner (in case learner thread should stop while waiting for queue). (#52834)

commit 988b689a08d18380afc7b70969dd4ed0c3b8ecee
Author: Kevin H. Luu <kevin@anyscale.com>
Date:   Wed May 7 22:26:47 2025 -0700

    [docker] Update latest Docker dependencies for 2.46.0 release (#52863)

    Created by release automation bot.

    Update with commit 52b43d0998f40d8aada0ffb89f41497fea4878b2

    Signed-off-by: dayshah <dhyey2019@gmail.com>
    Co-authored-by: dayshah <dhyey2019@gmail.com>

commit 5868480f6bb20fbc49e4dea7d5adb1279f36b464
Author: Kai-Hsun Chen <kaihsun@anyscale.com>
Date:   Wed May 7 21:22:10 2025 -0700

    [core][chore] Correct `num_retries_left` and `num_oom_retries_left` in the log (#52857)

    Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>

commit d1823655707a7708ad99fda9cff93c1ac28b2f04
Author: angelinalg <122562471+angelinalg@users.noreply.github.com>
Date:   Wed May 7 21:09:20 2025 -0700

    [docstring][train] fix indentation errors in docstrings (#52855)

commit bcbee9fceeb4ff3edf2fa1518c915b8135aa204e
Author: Kai-Hsun Chen <kaihsun@anyscale.com>
Date:   Wed May 7 21:07:32 2025 -0700

    [core][refactor] Remove skip_execution (#52856)

    Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>

commit a698b631d3916866c9061ad22ad4fb0ec3574da8
Author: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Date:   Wed May 7 20:23:38 2025 -0700

    [Serve.llm] Bugfix for duplication of `<bos>` token (#52853)

    Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

commit 0c1faa63df052c7509d783a3d0f05eb28ad79baa
Author: angelinalg <122562471+angelinalg@users.noreply.github.com>
Date:   Wed May 7 19:08:20 2025 -0700

    [docstring][data] fix indentation errors in docstrings (#52844)

commit ce51640c81b6230e4375ae6dd75d9a9092f13e8d
Author: angelinalg <122562471+angelinalg@users.noreply.github.com>
Date:   Wed May 7 18:36:54 2025 -0700

    [docstring][serve] Fix indentation in doc strings.  (#52841)

commit c0a3cbe6a9cd9960ef5822ca742fb05fa6408e8e
Author: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Date:   Wed May 7 18:08:49 2025 -0700

    [Serve.llm][Bugfix] in stream batching, first part of the stream was always consumed and not streamed back from the router (#52848)

    This PR addresses a bug in stream batching where extra tokens in the
    first batch were being discarded and adds comprehensive unit tests to
    verify both chat and completion behaviors under different batching and
    streaming configurations.

    - Fixes token loss in stream batching by peeking at the first generator
    element and correctly handling batched responses.
    - Adds new fixtures and tests to cover various scenarios
    (chat/completion, stream true/false, and multiple batching intervals).
    - Removes redundant configuration in the LLM server test to align with
    the new streaming batching behavior.
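
Peeking at the first element of a generator without losing it can be sketched as below. This is a synchronous simplification with hypothetical names (`peek_first`, `batched_tokens`), not the router code changed by this PR: the first item is read for inspection and then chained back in front of the remaining stream.

```python
import itertools


def peek_first(stream):
    """Return (first_item, stream_including_first); first is None if empty."""
    iterator = iter(stream)
    try:
        first = next(iterator)
    except StopIteration:
        return None, iter(())
    # Put the peeked item back so downstream consumers still receive it.
    return first, itertools.chain([first], iterator)


def batched_tokens():
    # The first batch carries several tokens, as it can with stream batching.
    yield ["Hello", ",", " "]
    yield ["world"]


first, stream = peek_first(batched_tokens())
tokens = [tok for batch in stream for tok in batch]
```

All four tokens survive, including the extra tokens in the first batch that would be dropped if the peeked element were consumed.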

    ---------

    Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

commit 7c23b5a4d698066a789a8343692c805bd5242e3f
Author: angelinalg <122562471+angelinalg@users.noreply.github.com>
Date:   Wed May 7 17:31:46 2025 -0700

    [docstring][llm] fixing indent errors in docstrings (#52842)

commit 8910c3bcb543cd79ecc2b909da01e801a0f2a972
Author: srinathk10 <68668616+srinathk10@users.noreply.github.com>
Date:   Wed May 7 17:05:34 2025 -0700

    Train Tests: Update Image classification map fn (#52845)

    ## Why are these changes needed?

    Train Tests: Update Image classification map fn.

    - Current Image processing does np->tensor conversion with transpose to
    CHW and normalization.
    ```

     'train/epoch-avg': 37.75413007199859,
     'train/epoch-max': 37.75413007199859,
     'train/epoch-min': 37.75413007199859,
     'train/epoch-total': 37.75413007199859,
     'train/global_throughput': 3495.5491702359923,
     'train/iter_batch-avg': 0.03661318445453074,
     'train/iter_batch-max': 0.6135890130008193,
     'train/iter_batch-min': 1.593200067873113e-05,
     'train/iter_batch-total': 18.526271333992554,
     'train/iter_first_batch-avg': 19.143331854998905,
     'train/iter_first_batch-max': 19.143331854998905,
     'train/iter_first_batch-min': 19.143331854998905,
     'train/iter_first_batch-total': 19.143331854998905,
     'train/iter_skip_batch-avg': inf,
     'train/iter_skip_batch-max': 0,
     'train/iter_skip_batch-min': inf,
     'train/iter_skip_batch-total': 0,
     'train/local_throughput': 873.8872925589981,
     'train/rows_processed-avg': 32.0,
     'train/rows_processed-max': 32,
     'train/rows_processed-min': 32,
     'train/rows_processed-total': 16192,
     'train/step-avg': 4.809962454802502e-06,
     'train/step-max': 2.1455998648889363e-05,
     'train/step-min': 5.109995981911197e-07,
     'train/step-total': 0.002433841002130066,
     'validation/iter_batch-avg': inf,
     'validation/iter_batch-max': 0,
     'validation/iter_batch-min': inf,
     'validation/iter_batch-total': 0,
     'validation/step-avg': inf,
     'validation/step-max': 0,
     'validation/step-min': inf,
     'validation/step-total': 0}
    --------------------------------------------------------------------------------
    2025-05-07 12:12:11,659 INFO test_utils.py:1953 -- Wrote results to /tmp/release_test_output.json
    2025-05-07 12:12:11,660 INFO test_utils.py:1954 -- {"train/epoch-avg": 37.75413007199859, "train/epoch-min": 37.75413007199859, "train/epoch-max": 37.75413007199859, "train/epoch-total": 37.75413007199859, "train/iter_first_batch-avg": 19.143331854998905, "train/iter_first_batch-min": 19.143331854998905, "train/iter_first_batch-max": 19.143331854998905, "train/iter_first_batch-total": 19.143331854998905, "train/step-avg": 4.809962454802502e-06, "train/step-min": 5.109995981911197e-07, "train/step-max": 2.1455998648889363e-05, "train/step-total": 0.002433841002130066, "train/rows_processed-avg": 32.0, "train/rows_processed-min": 32, "train/rows_processed-max": 32, "train/rows_processed-total": 16192, "train/iter_batch-avg": 0.03661318445453074, "train/iter_batch-min": 1.593200067873113e-05, "train/iter_batch-max": 0.6135890130008193, "train/iter_batch-total": 18.526271333992554, "validation/step-avg": Infinity, "validation/step-min": Infinity, "validation/step-max": 0, "validation/step-total": 0, "validation/iter_batch-avg": Infinity, "validation/iter_batch-min": Infinity, "validation/iter_batch-max": 0, "validation/iter_batch-total": 0, "checkpoint/download-avg": Infinity, "checkpoint/download-min": Infinity, "checkpoint/download-max": 0, "checkpoint/download-total": 0, "checkpoint/load-avg": Infinity, "checkpoint/load-min": Infinity, "checkpoint/load-max": 0, "checkpoint/load-total": 0, "train/iter_skip_batch-avg": Infinity, "train/iter_skip_batch-min": Infinity, "train/iter_skip_batch-max": 0, "train/iter_skip_batch-total": 0, "train/local_throughput": 873.8872925589981, "train/global_throughput": 3495.5491702359923, "dataloader/train": {"producer_throughput": 1946.112621486268, "iter_stats": {"prefetch_block-avg": Infinity, "prefetch_block-min": Infinity, "prefetch_block-max": 0, "prefetch_block-total": 0, "fetch_block-avg": 0.0027022377159291976, "fetch_block-min": 0.0005052189990237821, "fetch_block-max": 0.0197697479998169, "fetch_block-total": 
0.218881254990265, "block_to_batch-avg": 0.001253903843829141, "block_to_batch-min": 1.9893001081072725e-05, "block_to_batch-max": 0.01351481799974863, "block_to_batch-total": 0.6344753449775453, "format_batch-avg": 3.4910411080130395e-05, "format_batch-min": 9.00899976841174e-06, "format_batch-max": 0.0005209999999351567, "format_batch-total": 0.01766466800654598, "collate-avg": 0.0019578944209519855, "collate-min": 0.00021700100114685483, "collate-max": 0.013516342000002624, "collate-total": 0.9906945770017046, "finalize-avg": 0.011252377077071, "finalize-min": 0.004483607999645756, "finalize-max": 0.03162657899883925, "finalize-total": 5.693702800997926, "time_spent_blocked-avg": 0.0742146621321331, "time_spent_blocked-min": 6.807998943259008e-06, "time_spent_blocked-max": 19.143022770000243, "time_spent_blocked-total": 37.62683370099148, "time_spent_training-avg": 0.00021408673321073962, "time_spent_training-min": 9.916999260894954e-06, "time_spent_training-max": 0.009087054999326938, "time_spent_training-total": 0.10832788700463425}}}
    ```
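
The np->tensor path described above (transpose to CHW plus normalization) can be sketched in plain NumPy. This is only an illustration, not the map function from the PR: the function name `to_chw_normalized` is hypothetical, and the per-channel mean and standard deviation are assumed to be the commonly used ImageNet values, which the commit message does not state.

```python
import numpy as np

# Assumed per-channel statistics (common ImageNet values); not stated in the commit.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_chw_normalized(image_hwc: np.ndarray) -> np.ndarray:
    """Convert a uint8 HWC image into a normalized float32 CHW array (hypothetical helper)."""
    scaled = image_hwc.astype(np.float32) / 255.0  # [0, 255] -> [0, 1]
    normalized = (scaled - MEAN) / STD  # broadcasts over the trailing channel axis
    # Reorder axes from (height, width, channel) to (channel, height, width).
    return np.ascontiguousarray(normalized.transpose(2, 0, 1))


image = np.zeros((224, 224, 3), dtype=np.uint8)  # stand-in for a decoded image
out = to_chw_normalized(image)
print(out.shape, out.dtype)
```

Normalizing before the transpose keeps the channel axis last, so the three-element `MEAN` and `STD` arrays broadcast without reshaping.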

    - The updated image processing converts np->PIL->Tensor.
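
The results logged for each path in this message can be compared with a little arithmetic. The sketch below copies three metrics from the two result sets. The helper name `relative_change` is hypothetical, and reading the values as rows per second and seconds is an assumption, since the logs do not state units.

```python
# Values copied from the two logged result sets in this commit message.
current = {
    "train/global_throughput": 3495.5491702359923,
    "train/epoch-total": 37.75413007199859,
    "train/iter_batch-total": 18.526271333992554,
}
updated = {
    "train/global_throughput": 5434.769027373354,
    "train/epoch-total": 30.73613611499968,
    "train/iter_batch-total": 11.915134282009603,
}


def relative_change(before: float, after: float) -> float:
    """Return the fractional change from `before` to `after` (hypothetical helper)."""
    return (after - before) / before


changes = {key: relative_change(current[key], updated[key]) for key in current}
for key, change in changes.items():
    print(f"{key}: {change:+.1%}")
```

With these inputs, global throughput rises by roughly 55% and total epoch time falls by roughly 19%.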

    ```
     'train/epoch-avg': 30.73613611499968,
     'train/epoch-max': 30.73613611499968,
     'train/epoch-min': 30.73613611499968,
     'train/epoch-total': 30.73613611499968,
     'train/global_throughput': 5434.769027373354,
     'train/iter_batch-avg': 0.023547696209505146,
     'train/iter_batch-max': 0.3791560619993106,
     'train/iter_batch-min': 1.732300006551668e-05,
     'train/iter_batch-total': 11.915134282009603,
     'train/iter_first_batch-avg': 18.71798381300141,
     'train/iter_first_batch-max': 18.71798381300141,
     'train/iter_first_batch-min': 18.71798381300141,
     'train/iter_first_batch-total': 18.71798381300141,
     'train/iter_skip_batch-avg': inf,
     'train/iter_skip_batch-max': 0,
     'train/iter_skip_batch-min': inf,
     'train/iter_skip_batch-total': 0,
     'train/local_throughput': 1358.6922568433386,
     'train/rows_processed-avg': 32.0,
     'train/rows_processed-max': 32,
     'train/rows_processed-min': 32,
     'train/rows_processed-total': 16192,
     'train/step-avg': 4.362646225640153e-06,
     'train/step-max': 2.6562000130070373e-05,
     'train/step-min': 4.579997039400041e-07,
     'train/step-total': 0.0022074989901739173,
     'validation/iter_batch-avg': inf,
     'validation/iter_batch-max': 0,
     'validation/iter_batch-min': inf,
     'validation/iter_batch-total': 0,
     'validation/step-avg': inf,
     'validation/step-max': 0,
     'validation/step-min': inf,
     'validation/step-total': 0}
    --------------------------------------------------------------------------------
    2025-05-07 12:32:57,439 INFO test_utils.py:1953 -- Wrote results to /tmp/release_test_output.json
    2025-05-07 12:32:57,439 INFO test_utils.py:1954 -- {"train/epoch-avg": 30.73613611499968, "train/epoch-min": 30.73613611499968, "train/epoch-max": 30.73613611499968, "train/epoch-total": 30.73613611499968, "train/iter_first_batch-avg": 18.71798381300141, "train/iter_first_batch-min": 18.71798381300141, "train/iter_first_batch-max": 18.71798381300141, "train/iter_first_batch-total": 18.71798381300141, "train/step-avg": 4.362646225640153e-06, "train/step-min": 4.579997039400041e-07, "train/step-max": 2.6562000130070373e-05, "train/step-total": 0.0022074989901739173, "train/rows_processed-avg": 32.0, "train/rows_processed-min": 32, "train/rows_processed-max": 32, "train/rows_processed-total": 16192, "train/iter_batch-avg": 0.023547696209505146, "train/iter_batch-min": 1.732300006551668e-05, "train/iter_batch-max": 0.3791560619993106, "train/iter_batch-total": 11.915134282009603, "validation/step-avg": Infinity, "validation/step-min": Infinity, "validation/step-max": 0, "validation/step-total": 0, "validation/iter_batch-avg": Infinity, "validation/iter_batch-min": Infinity, "validation/iter_batch-max": 0, "validation/iter_batch-total": 0, "checkpoint/download-avg": Infinity, "checkpoint/download-min": Infinity, "checkpoint/download-max": 0, "checkpoint/download-total…