[Core] Worker nodes dying with 1000 tasks #52585


Open
psarka opened this issue Apr 24, 2025 · 0 comments
Labels
- bug: Something that is supposed to be working; but isn't
- clusters
- core: Issues that should be addressed in Ray Core
- stability
- triage: Needs triage (eg: priority, bug/not-bug, and owning component)

Comments


psarka commented Apr 24, 2025

What happened + What you expected to happen

I am unable to run 1000 tasks in one job with Ray: worker nodes start dying with Expected termination: received SIGTERM. My repro program is a simplified version of the example from the docs:

import time

import ray


@ray.remote(num_cpus=1, max_retries=0)
def process(task):
    print(f'Starting {task}')
    time.sleep(100)  # simulate long-running work


if __name__ == '__main__':
    ray.init(log_to_driver=False)
    unfinished = [process.remote(i) for i in range(1000)]

    # Wait for completions one at a time, without fetching results to the driver.
    while unfinished:
        finished, unfinished = ray.wait(unfinished, num_returns=1, fetch_local=False)
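
To see why individual tasks fail, the wait loop can be extended to also fetch results, so the task error surfaces on the driver (a sketch; RayError is Ray's base exception for task failures such as WorkerCrashedError):

from ray.exceptions import RayError

while unfinished:
    finished, unfinished = ray.wait(unfinished, num_returns=1)
    try:
        ray.get(finished)  # raises (e.g. WorkerCrashedError) when a worker dies
    except RayError as e:
        print(f'Task failed: {e}')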

When I submit this job with ray job submit, the autoscaler starts spinning up worker nodes, which then proceed to die (more slowly than they come up, but steadily). For example, 12 minutes in, I see 18 alive nodes and 5 dead; at 4 tasks per node, that means 20 tasks have already failed.
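
For reference, node health can be watched from the head node with the Ray CLI (a minimal sketch; the --filter syntax is from the Ray state API, which I'm assuming behaves the same way in 2.44.1):

uv run ray attach ray-cluster-repro.yaml
ray status
ray list nodes --filter "state=DEAD"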

Versions / Dependencies

Ray 2.44.1, Python 3.12, GCP (Google Cloud) instances

Reproduction script

These are the commands that I'm running:

uv run ray up ray-cluster-repro.yaml -y
uv run ray attach ray-cluster-repro.yaml -p 10001
uv run ray dashboard ray-cluster-repro.yaml
RAY_ADDRESS="ray://localhost:10001" uv run ray job submit --no-wait --working-dir=./ --runtime-env=driver-env-repro.yaml -- python cluster_test.py

driver-env-repro.yaml:

env_vars:
  RAY_RUNTIME_ENV_HOOK: "ray._private.runtime_env.uv_runtime_env_hook.hook"

py_executable: "uv run"

ray-cluster-repro.yaml:

cluster_name: repro
max_workers: 1024
upscaling_speed: 1.0
docker:
  image: rayproject/ray:2.44.1-py312-cpu
  container_name: "ray_container"
  pull_before_run: True
  run_options:  # Extra options to pass into "docker run"
    - --ulimit nofile=65536:65536
  worker_image: "rayproject/ray:2.44.1-py312-cpu"

idle_timeout_minutes: 3

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: europe-west4
    availability_zone: europe-west4-a
    project_id: axial-matter-417704

auth:
    ssh_user: ubuntu

available_node_types:
    ray_head_default:
        resources: {"CPU": 0}
        node_config:
            machineType: n4-highmem-8
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 1000
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922

    ray_worker_n4_standard_4:
      min_workers: 1
      max_workers: 100
      resources: {"CPU": 4}
      node_config:
        machineType: n4-standard-4
        disks:
          - boot: true
            autoDelete: true
            type: PERSISTENT
            initializeParams:
              diskSizeGb: 50
              sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
        scheduling:
          - preemptible: false
        serviceAccounts:
          - email: ray-autoscaler-sa-v1@axial-matter-417704.iam.gserviceaccount.com
            scopes:
              - https://www.googleapis.com/auth/cloud-platform


head_node_type: ray_head_default

file_mounts: {}

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []

setup_commands:
    - pip install uv

head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - >-
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - >-
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076
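
For context, the Expected termination: received SIGTERM message comes from the raylet logs on the dying workers. Ray writes per-node logs under /tmp/ray/session_latest/logs, so on an affected worker (assuming SSH access) the relevant lines can be found with something like:

grep -rn "received SIGTERM" /tmp/ray/session_latest/logs/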

Issue Severity

None

@psarka psarka added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 24, 2025
@masoudcharkhabi masoudcharkhabi added clusters core Issues that should be addressed in Ray Core stability labels Apr 25, 2025