
[Core] Remote placement using gpu memory #26929


Open
fostiropoulos opened this issue Jul 23, 2022 · 9 comments
Assignees
Labels
core Issues that should be addressed in Ray Core core-scheduler enhancement Request for new feature and/or capability P2 Important issue, but not time-critical
Milestone

Comments

@fostiropoulos

Description

When running Ray on machines with different types of GPU accelerators, the fractional GPU placement strategy is not suitable. Instead, allow specifying an amount of GPU memory, for example in MB.

Additionally, the code is not accelerator agnostic: it requires writing boilerplate to determine the fractional GPU value to use, even when all accelerators on a given machine are the same.

```python
@ray.remote(gpu_mem="20mb")
def some_fn():
    return True
```
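Absent such a parameter, the workaround today looks roughly like the following. This is a hedged sketch: `mb_to_num_gpus` is a hypothetical helper, and in practice the total memory would be queried from NVML (e.g. via pynvml) or `torch.cuda.get_device_properties`.

```python
# Hypothetical helper illustrating the boilerplate the proposal would remove:
# converting a GPU-memory budget into Ray's fractional num_gpus by hand.
def mb_to_num_gpus(gpu_mem_mb: float, total_mb: float) -> float:
    """Fraction of one GPU that covers gpu_mem_mb on a card with total_mb."""
    if gpu_mem_mb > total_mb:
        raise ValueError("budget exceeds a single GPU's memory")
    return gpu_mem_mb / total_mb

# 20 MB on a 40 GB (40960 MB) A100 is a tiny fraction of one GPU:
frac = mb_to_num_gpus(20, 40 * 1024)
# The result would then be passed to Ray as usual:
# @ray.remote(num_gpus=frac)
```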

Use case

This would make Ray remote code more portable and would improve GPU utilization for various applications when GPU types are inconsistent across a cluster.

@fostiropoulos fostiropoulos added the enhancement Request for new feature and/or capability label Jul 23, 2022
@fostiropoulos
Author

If this feature request is approved, I am happy to work on it and create a pull request.

@jjyao jjyao added the core Issues that should be addressed in Ray Core label Aug 11, 2022
@jjyao jjyao added this to the Core Backlog milestone Aug 11, 2022
@jjyao
Collaborator

jjyao commented Aug 11, 2022

Hi @fostiropoulos, sorry for the late reply. Could you elaborate why the fractional GPU placement strategy is not suitable? and why it's not accelerator agnostic? cc @cadedaniel

@cadedaniel
Member

> Hi @fostiropoulos, sorry for the late reply. Could you elaborate why the fractional GPU placement strategy is not suitable? and why it's not accelerator agnostic? cc @cadedaniel

+1. Also, assigning a fixed memory amount to a task or actor comes with user experience problems: what happens if the task or actor consumes more than 20 megabytes of GPU memory? Ray currently defers management of GPU memory to the user code / application. If you give users the option to specify XX amount of megabytes, then they'll be surprised when Ray does nothing to prevent their code from exceeding that budget.

AFAIK this is a big part of why Ray has stuck to fractional placement for common accelerators like GPUs -- it is only a signal for scheduling tasks and actors and places no constraints on the application code.
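To make the "scheduling signal only" point concrete, here is a minimal toy model (illustrative only, an assumption for this sketch, not Ray's actual scheduler code): fractional requests are simply subtracted from a node's available capacity, and nothing constrains what the task actually does with GPU memory at runtime.

```python
# Toy model of fractional-GPU bookkeeping (illustrative, not Ray code).
class Node:
    def __init__(self, gpus: float):
        self.available = gpus  # remaining schedulable GPU capacity

    def try_schedule(self, num_gpus: float) -> bool:
        """Admit a task if its fraction fits; enforce nothing at runtime."""
        if num_gpus <= self.available:
            self.available -= num_gpus
            return True
        return False

node = Node(gpus=1.0)
assert node.try_schedule(0.5)      # first task placed
assert node.try_schedule(0.5)      # second task shares the same physical GPU
assert not node.try_schedule(0.5)  # capacity exhausted, task stays pending
```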

@fostiropoulos
Author

fostiropoulos commented Aug 11, 2022

@jjyao

Example 1:

System 1: Has GPU A100 (memory 40GB)
System 2: Has GPU RTX 3090 (Memory 24GB)

For a remote function that uses 10 GB of GPU memory, I need to specify num_gpus=0.25 on System 1 and num_gpus=0.5 on System 2. My code therefore does not work out of the box on both systems: it needs either a user-configurable attribute, or code that detects each GPU's available memory and computes the fractional allocation on the fly (which is the feature I am suggesting).

Example 2:
The system has two GPUs, an RTX 3090 and an RTX 2080 Ti, with 24 GB and 12 GB of memory respectively. The bottleneck is the RTX 2080 Ti: fractional GPU requests are limited by its capacity.
In this example we could specify a custom resource to overcome the fractional GPU limitation, but that is exactly the behavior I am suggesting Ray support out of the box.
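The bottleneck in Example 2 can be shown with a little arithmetic. This is a sketch under the assumption that a single num_gpus fraction must be safe for every GPU on the node; `safe_fraction` is a hypothetical helper, not a Ray API.

```python
def safe_fraction(required_gb: float, gpu_totals_gb: list[float]) -> float:
    """A single node-wide fraction must be sized for the smallest GPU."""
    smallest = min(gpu_totals_gb)
    if required_gb > smallest:
        raise ValueError("request does not fit on the smallest GPU")
    return required_gb / smallest

# 10 GB on a node with a 24 GB RTX 3090 and a 12 GB RTX 2080 Ti:
frac = safe_fraction(10, [24, 12])  # 10/12, about 0.83
# On the 3090 that fraction reserves roughly 20 GB for a 10 GB job,
# which is the wasted capacity a gpu_mem-style request would avoid.
```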

@fostiropoulos
Author

@cadedaniel maybe my initial post was misunderstood; the clarification with examples above should help explain. GPU memory management and correct resource allocation would still be the user's responsibility.

@jjyao
Collaborator

jjyao commented Aug 11, 2022

Thanks @fostiropoulos, your examples make sense to me, and this seems reasonable to support inside Ray. Let me add it to the backlog and chat with the team about whether and when we want to do it.

@jsdir

jsdir commented Nov 2, 2023

Duplicate of #37574

@jonathan-anyscale
Contributor

Hi, a quick update on this: we now have a REP and a prototype ready for review. Please try them out and leave feedback!
Prototype: #41147
REP: ray-project/enhancements#47

@jonathan-anyscale
Contributor

@fostiropoulos did you have a chance to check the REP and try the prototype?

@anyscalesam anyscalesam added the triage Needs triage (eg: priority, bug/not-bug, and owning component) label Feb 14, 2024
@jjyao jjyao added P1.5 and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Mar 18, 2024
@jjyao jjyao added P2 Important issue, but not time-critical and removed P1.5 labels Nov 12, 2024