
[Core] Remote placement using gpu memory #26929


Open
fostiropoulos opened this issue Jul 23, 2022 · 9 comments
Assignees
Labels
core Issues that should be addressed in Ray Core core-scheduler enhancement Request for new feature and/or capability P2 Important issue, but not time-critical
Milestone

Comments

@fostiropoulos

Description

When running Ray on machines with different types of GPU accelerators, the fractional GPU placement strategy is not suitable. Instead, allow specifying an amount of GPU memory, for example in MB.

Additionally, the code is not accelerator agnostic: it requires writing boilerplate to determine the fractional GPU value to use, even when all accelerators on a given machine are the same.

```python
@ray.remote(gpu_mem="20mb")
def some_fn():
    return True
```
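Absent such a parameter, the workaround today looks roughly like the following. This is a hedged sketch: `mb_to_num_gpus` is a hypothetical helper, and in practice the total memory would be queried from NVML (e.g. via pynvml) or `torch.cuda.get_device_properties`.

```python
# Hypothetical helper illustrating the boilerplate the proposal would remove:
# converting a GPU-memory budget into Ray's fractional num_gpus by hand.
def mb_to_num_gpus(gpu_mem_mb: float, total_mb: float) -> float:
    """Fraction of one GPU that covers gpu_mem_mb on a card with total_mb."""
    if gpu_mem_mb > total_mb:
        raise ValueError("budget exceeds a single GPU's memory")
    return gpu_mem_mb / total_mb

# 20 MB on a 40 GB (40960 MB) A100 is a tiny fraction of one GPU:
frac = mb_to_num_gpus(20, 40 * 1024)
# The result would then be passed to Ray as usual:
# @ray.remote(num_gpus=frac)
```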

Use case

This would make Ray remote code more portable and would improve GPU utilization for various applications when GPU types are inconsistent across a cluster.

@fostiropoulos fostiropoulos added the enhancement Request for new feature and/or capability label Jul 23, 2022
@fostiropoulos
Author

If this feature request is approved, I am happy to work on it and create a pull request.

@jjyao jjyao added the core Issues that should be addressed in Ray Core label Aug 11, 2022
@jjyao jjyao added this to the Core Backlog milestone Aug 11, 2022
@jjyao
Collaborator

jjyao commented Aug 11, 2022

Hi @fostiropoulos, sorry for the late reply. Could you elaborate why the fractional GPU placement strategy is not suitable? and why it's not accelerator agnostic? cc @cadedaniel

@cadedaniel
Member

> Hi @fostiropoulos, sorry for the late reply. Could you elaborate why the fractional GPU placement strategy is not suitable? and why it's not accelerator agnostic? cc @cadedaniel

+1. Also, assigning a fixed memory amount to a task or actor comes with user experience problems: what happens if the task or actor consumes more than 20 megabytes of GPU memory? Ray currently defers management of GPU memory to the user code / application. If you give users the option to specify XX amount of megabytes, then they'll be surprised when Ray does nothing to prevent their code from exceeding that budget.

AFAIK this is a big part of why Ray has stuck to fractional placement for common accelerators like GPUs -- it is only a signal for scheduling tasks and actors and places no constraints on the application code.
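To make the "scheduling signal only" point concrete, here is a minimal toy model (illustrative only, an assumption for this sketch, not Ray's actual scheduler code): fractional requests are simply subtracted from a node's available capacity, and nothing constrains what the task actually does with GPU memory at runtime.

```python
# Toy model of fractional-GPU bookkeeping (illustrative, not Ray code).
class Node:
    def __init__(self, gpus: float):
        self.available = gpus  # remaining schedulable GPU capacity

    def try_schedule(self, num_gpus: float) -> bool:
        """Admit a task if its fraction fits; enforce nothing at runtime."""
        if num_gpus <= self.available:
            self.available -= num_gpus
            return True
        return False

node = Node(gpus=1.0)
assert node.try_schedule(0.5)      # first task placed
assert node.try_schedule(0.5)      # second task shares the same physical GPU
assert not node.try_schedule(0.5)  # capacity exhausted, task stays pending
```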

@fostiropoulos
Author

fostiropoulos commented Aug 11, 2022

@jjyao

Example 1:

System 1: Has GPU A100 (memory 40GB)
System 2: Has GPU RTX 3090 (Memory 24GB)

For a remote function that uses 10 GB of GPU memory, I need to specify num_gpus=0.25 on System 1 and num_gpus=0.5 on System 2. My code therefore does not work out of the box on both systems: it needs either a user-configurable attribute, or code that detects each GPU's available memory and computes the fractional allocation on the fly (which is the feature I am suggesting).

Example 2:
The system has two GPUs, an RTX 3090 and an RTX 2080 Ti, with 24 GB and 12 GB of memory respectively. The bottleneck is the RTX 2080 Ti: fractional GPU requests are limited by its capacity.
In this example we could specify a custom resource to overcome the fractional GPU limitation, but that is exactly the behavior I am suggesting Ray support out of the box.
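The bottleneck in Example 2 can be shown with a little arithmetic. This is a sketch under the assumption that a single num_gpus fraction must be safe for every GPU on the node; `safe_fraction` is a hypothetical helper, not a Ray API.

```python
def safe_fraction(required_gb: float, gpu_totals_gb: list[float]) -> float:
    """A single node-wide fraction must be sized for the smallest GPU."""
    smallest = min(gpu_totals_gb)
    if required_gb > smallest:
        raise ValueError("request does not fit on the smallest GPU")
    return required_gb / smallest

# 10 GB on a node with a 24 GB RTX 3090 and a 12 GB RTX 2080 Ti:
frac = safe_fraction(10, [24, 12])  # 10/12, about 0.83
# On the 3090 that fraction reserves roughly 20 GB for a 10 GB job,
# which is the wasted capacity a gpu_mem-style request would avoid.
```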

@fostiropoulos
Author

@cadedaniel maybe my initial post was misunderstood; the clarification with examples above should help explain. GPU memory management and correct resource allocation would still be the user's responsibility.

@jjyao
Collaborator

jjyao commented Aug 11, 2022

Thanks @fostiropoulos, your examples make sense to me, and this seems reasonable to support inside Ray. Let me add it to the backlog and chat with the team about whether and when we want to do it.

@jsdir

jsdir commented Nov 2, 2023

Duplicate of #37574

@jonathan-anyscale
Contributor

Hi, a quick update on this: we now have a REP and a prototype ready for review. Please try them out and leave feedback!
Prototype: #41147
REP: ray-project/enhancements#47

@jonathan-anyscale
Contributor

@fostiropoulos did you have a chance to check the REP and try the prototype?

@anyscalesam anyscalesam added the triage Needs triage (eg: priority, bug/not-bug, and owning component) label Feb 14, 2024
@jjyao jjyao added P1.5 and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Mar 18, 2024
@jjyao jjyao added P2 Important issue, but not time-critical and removed P1.5 labels Nov 12, 2024