psarka opened this issue on Apr 24, 2025 · 0 comments
Labels: bug (Something that is supposed to be working; but isn't), clusters, core (Issues that should be addressed in Ray Core), stability, triage (Needs triage (eg: priority, bug/not-bug, and owning component))
psarka added the bug and triage labels on Apr 24, 2025
What happened + What you expected to happen
I am unable to run 1000 tasks in 1 job with Ray: worker nodes start dying with

Expected termination: received SIGTERM

My repro program is a simplified version of the example from the docs.

When submitting this job with ray job submit, the autoscaler starts spinning up worker nodes, which then proceed to die (at a slower rate than they come up, but still). For example, 12 minutes in, I see 18 alive nodes and 5 dead. With 4 tasks per node, 20 tasks have already failed.
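The cluster_test.py script itself wasn't captured on this page. As a rough sketch only, a program of the shape described above (about 1000 tasks in a single job, with one-CPU tasks so that a 4-CPU node runs 4 at a time) might look like this; the task body, sleep duration, and num_cpus=1 are assumptions inferred from "4 tasks per node", not the reporter's actual code:

```python
# Hypothetical sketch, NOT the reporter's cluster_test.py:
# ~1000 one-CPU tasks in one job, forcing the autoscaler to scale out.
import time

import ray

ray.init()

@ray.remote(num_cpus=1)
def work(i: int) -> int:
    time.sleep(60)  # assumed: long enough that the autoscaler must add nodes
    return i

# Submit all tasks up front, then block until every one has finished.
results = ray.get([work.remote(i) for i in range(1000)])
print(f"{len(results)} tasks finished")
```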
Versions / Dependencies
ray 2.44.1, python 3.12, gcloud instances
Reproduction script
These are the commands that I'm running:

uv run ray up ray-cluster-repro.yaml -y
uv run ray attach ray-cluster-repro.yaml -p 10001
uv run ray dashboard ray-cluster-repro.yaml
RAY_ADDRESS="ray://localhost:10001" uv run ray job submit --no-wait --working-dir=./ --runtime-env=driver-env-repro.yaml -- python cluster_test.py

driver-env-repro.yaml:
(contents not captured)

ray-cluster-repro.yaml:
(contents not captured)
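Since ray-cluster-repro.yaml wasn't captured, here is a hypothetical, abridged illustration (not the reporter's file) of the kind of GCP autoscaler config the ray up command above expects; machine type, region, project, and worker counts are all placeholder assumptions, and a real file also needs image/disk settings:

```yaml
# Hypothetical, abridged sketch: NOT the reporter's ray-cluster-repro.yaml.
cluster_name: repro
max_workers: 25                 # assumption: room for the job to scale out

provider:
  type: gcp
  region: us-central1           # placeholder
  availability_zone: us-central1-a
  project_id: my-project-id     # placeholder

available_node_types:
  head:
    node_config:
      machineType: n1-standard-4
  worker:
    min_workers: 0
    max_workers: 25
    node_config:
      machineType: n1-standard-4  # 4 vCPUs: consistent with "4 tasks per node"

head_node_type: head
```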
Issue Severity
None