[Core] Worker nodes dying with 1000 tasks #52585


Open
psarka opened this issue Apr 24, 2025 · 0 comments
Labels
- bug: Something that is supposed to be working; but isn't
- clusters
- core: Issues that should be addressed in Ray Core
- stability
- triage: Needs triage (eg: priority, bug/not-bug, and owning component)

Comments


psarka commented Apr 24, 2025

What happened + What you expected to happen

I am unable to run 1000 tasks in one job with Ray: worker nodes start dying with Expected termination: received SIGTERM. My repro program is a simplified version of the example from the docs:

import time

import ray


@ray.remote(num_cpus=1, max_retries=0)
def process(task):
    print(f'Starting {task}')
    time.sleep(100)  # simulate long-running work


if __name__ == '__main__':
    ray.init(log_to_driver=False)
    unfinished = [process.remote(i) for i in range(1000)]

    # Wait for completions one at a time, without fetching results to the driver.
    while unfinished:
        finished, unfinished = ray.wait(unfinished, num_returns=1, fetch_local=False)
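
To see why individual tasks fail, the wait loop can be extended to also fetch results, so the task error surfaces on the driver (a sketch; RayError is Ray's base exception for task failures such as WorkerCrashedError):

from ray.exceptions import RayError

while unfinished:
    finished, unfinished = ray.wait(unfinished, num_returns=1)
    try:
        ray.get(finished)  # raises (e.g. WorkerCrashedError) when a worker dies
    except RayError as e:
        print(f'Task failed: {e}')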

When I submit this job with ray job submit, the autoscaler starts spinning up worker nodes, which then proceed to die (more slowly than they come up, but steadily). For example, 12 minutes in, I see 18 alive nodes and 5 dead; at 4 tasks per node, that means 20 tasks have already failed.
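
For reference, node health can be watched from the head node with the Ray CLI (a minimal sketch; the --filter syntax is from the Ray state API, which I'm assuming behaves the same way in 2.44.1):

uv run ray attach ray-cluster-repro.yaml
ray status
ray list nodes --filter "state=DEAD"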

Versions / Dependencies

Ray 2.44.1, Python 3.12, GCP (Google Cloud) instances

Reproduction script

These are the commands that I'm running:

uv run ray up ray-cluster-repro.yaml -y
uv run ray attach ray-cluster-repro.yaml -p 10001
uv run ray dashboard ray-cluster-repro.yaml
RAY_ADDRESS="ray://localhost:10001" uv run ray job submit --no-wait --working-dir=./ --runtime-env=driver-env-repro.yaml -- python cluster_test.py

driver-env-repro.yaml:

env_vars:
  RAY_RUNTIME_ENV_HOOK: "ray._private.runtime_env.uv_runtime_env_hook.hook"

py_executable: "uv run"

ray-cluster-repro.yaml:

cluster_name: repro
max_workers: 1024
upscaling_speed: 1.0
docker:
  image: rayproject/ray:2.44.1-py312-cpu
  container_name: "ray_container"
  pull_before_run: True
  run_options:  # Extra options to pass into "docker run"
    - --ulimit nofile=65536:65536
  worker_image: "rayproject/ray:2.44.1-py312-cpu"

idle_timeout_minutes: 3

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: europe-west4
    availability_zone: europe-west4-a
    project_id: axial-matter-417704

auth:
    ssh_user: ubuntu

available_node_types:
    ray_head_default:
        resources: {"CPU": 0}
        node_config:
            machineType: n4-highmem-8
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 1000
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922

    ray_worker_n4_standard_4:
      min_workers: 1
      max_workers: 100
      resources: {"CPU": 4}
      node_config:
        machineType: n4-standard-4
        disks:
          - boot: true
            autoDelete: true
            type: PERSISTENT
            initializeParams:
              diskSizeGb: 50
              sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
        scheduling:
          - preemptible: false
        serviceAccounts:
          - email: ray-autoscaler-sa-v1@axial-matter-417704.iam.gserviceaccount.com
            scopes:
              - https://www.googleapis.com/auth/cloud-platform


head_node_type: ray_head_default

file_mounts: {}

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []

setup_commands:
    - pip install uv

head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - >-
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - >-
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076
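
For context, the Expected termination: received SIGTERM message comes from the raylet logs on the dying workers. Ray writes per-node logs under /tmp/ray/session_latest/logs, so on an affected worker (assuming SSH access) the relevant lines can be found with something like:

grep -rn "received SIGTERM" /tmp/ray/session_latest/logs/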

Issue Severity

None

@psarka psarka added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 24, 2025
@masoudcharkhabi masoudcharkhabi added clusters core Issues that should be addressed in Ray Core stability labels Apr 25, 2025