
Options to stagger model loading for low-memory systems #47


Open
jjhursey wants to merge 2 commits into main from jhursey/stagger-load

Conversation

@jjhursey (Contributor) commented May 6, 2025

  • `--stagger_load` : (default: `0`, off) Stagger model loading to avoid OOM issues on the host
  • `--stagger_update_lazyhandle` : (default: `0`, off) Stagger `update_lazyhandle` to avoid OOM issues on the host
  • `--dist_timeout` : torch distributed timeout in minutes (if unset, PyTorch's default applies: `10` minutes for NCCL, `30` for other backends)
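
For reference, here is a minimal sketch of how these options might be registered with argparse in inference.py; the option names and help strings are taken from this PR, while the types and defaults shown here are assumptions rather than the actual code.

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--stagger_load",
        type=int,
        default=0,  # 0 = off: all ranks load the model at the same time
        help="Stagger model loading to avoid OOM issues on the host",
    )
    parser.add_argument(
        "--stagger_update_lazyhandle",
        type=int,
        default=0,  # 0 = off: all ranks run update_lazyhandle at the same time
        help="Stagger update_lazyhandle to avoid OOM issues on the host",
    )
    parser.add_argument(
        "--dist_timeout",
        type=int,
        default=0,  # assumption: 0/unset keeps PyTorch's default (10 min for NCCL, 30 min otherwise)
        help="torch distributed timeout in minutes",
    )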

@jjhursey (Contributor, Author) commented May 7, 2025

Keeping this in Draft for now while a few teams get back to me on testing.

@jjhursey force-pushed the jhursey/stagger-load branch from 1af2202 to c5218b1 on May 7, 2025 at 14:04
@thanh-lam commented

I checked out this PR and am testing it with the latest FMS levels. Reasons for trying this:

  • We have been hitting "Resource temporarily unavailable" failures when running granite-3.0-8b with 8 AIUs. This is one example of a large model, but other models hit this failure as well.
  • With this PR, I added the two new parameters when invoking inference.py:
    • --stagger_load 1
    • --stagger_update_lazyhandle 1
  • So far, tests are passing without hitting "Resource temporarily unavailable".

@thanh-lam commented

For testing purposes: what are the differences between options 1 and 2?
@jjhursey Can you add some descriptions for these options?

@jjhursey (Contributor, Author) commented

For testing purposes: what are the differences between options 1 and 2?

We can improve the help text, but this is what it reports now:

    "--stagger_load",
    help="Stagger model loading to avoid OOM issues on the host"

    "--stagger_update_lazyhandle",
    help="Stagger update_lazyhandle to avoid OOM issues on the host"

So these are two different options to stagger two different sections of the compile phase, in case you need to set different values for each part.

The default value is 0, which means skip the staggering and let all of the processes enter this section at the same time. This results in the maximum memory utilization, but also (generally) the fastest time through the compile phase.

Setting this value to >0 defines how many processes are allowed to be in that phase concurrently. So if you run with 16 processes and set the value to 2, then only 2 processes at a time will be performing the compile phase. Once the first set of 2 is done, the next set of 2 will proceed, and so on. All processes that have completed the compile will wait for the others to finish before moving on to the next section of the code.

Setting the value to 1 means only one process at a time. This will use the least amount of memory, but will result in the longest compile time since we are only compiling for one process at a time. Increasing the value increases the total amount of memory, but should decrease the compile time.

If you are only concerned with functional testing in a memory-constrained environment, and not with compile-time efficiency, then setting this value to 1 is likely what you want.
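
To make that concrete, here is a minimal barrier-based sketch of the enter/leave scheme described above, assuming torch.distributed is already initialized; it illustrates the idea rather than the exact code in this PR.

    import math

    import torch.distributed as dist

    def stagger_enter(limit: int):
        # Ranks are grouped into sets of `limit`; the ranks in set k wait
        # behind k barriers before entering the staggered region.
        if limit <= 0:
            return  # 0 = staggering disabled; everyone enters at once
        for _ in range(dist.get_rank() // limit):
            dist.barrier()

    def stagger_leave(limit: int):
        # Participate in the barriers that release the remaining sets, then
        # hit one final barrier so all ranks leave the region together.
        if limit <= 0:
            return
        num_sets = math.ceil(dist.get_world_size() / float(limit))
        for _ in range(num_sets - 1 - dist.get_rank() // limit):
            dist.barrier()
        dist.barrier()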

@thanh-lam commented

@jjhursey, thanks! This is very helpful.
Any estimate of when (or whether) these new parameters will be added to inference.py?
Because this "helps" get around the "Resource temporarily unavailable" failure in 8-AIU tests, we need to document it while waiting for "real" memory fixes.

@jjhursey force-pushed the jhursey/stagger-load branch from c5218b1 to 2fd81b0 on May 22, 2025 at 15:04
@jjhursey (Contributor, Author) commented

Both the x86 and Power test teams have confirmed that this helps with their testing of large models on low-memory systems.

@jjhursey marked this pull request as ready for review on May 22, 2025 at 15:05
@jjhursey (Contributor, Author) commented

@JRosenkranz @ani300 this is ready for review

    extra_kwargs = {**padding_kwargs, "only_last_token": True}
    max_new_tokens_warmup = max_new_tokens
    if compile_dynamic_sendnn:
        max_new_tokens_warmup = 2

    if stagger_update_lazyhandle > 0 and stagger_update_lazyhandle != world_size:
Contributor commented:

It looks like this logic is called multiple times; do you think it would make sense to put it in its own utility function, so it can be re-used in the future?

@jjhursey (Contributor, Author) commented

Yeah, I can do that. I'll have an "enter" and "exit" version to place in the code.

@jjhursey (Contributor, Author) commented

I just pushed a commit for this change

@JRosenkranz (Contributor) left a comment

Has this been tested with inference.py as well as test_decoders (multi-AIU / single AIU)? In theory those should not change. Also, it might make sense to add an option for this in test_decoders for low-memory systems.

@JRosenkranz (Contributor) commented

bot:test
TEST_FILE=test_decoders.py MODEL_ID=ibm-granite/granite-3.2-8b-instruct BATCH_SIZE=8 SEQUENCE_LENGTH=64 USE_TINY_MODEL=0 NUM_AIU=2

1 similar comment
@dpatel-ops (Collaborator) commented

bot:test
TEST_FILE=test_decoders.py MODEL_ID=ibm-granite/granite-3.2-8b-instruct BATCH_SIZE=8 SEQUENCE_LENGTH=64 USE_TINY_MODEL=0 NUM_AIU=2

@jjhursey force-pushed the jhursey/stagger-load branch from 5aaf975 to 14dfb4a on June 4, 2025 at 16:27
@jjhursey (Contributor, Author) commented Jun 4, 2025

I pushed a commit to consolidate the staggered enter/exit. I also rebased on main

@JRosenkranz (Contributor) commented

bot:test
TEST_FILE=test_decoders.py MODEL_ID=ibm-granite/granite-3.2-8b-instruct BATCH_SIZE=8 SEQUENCE_LENGTH=64 USE_TINY_MODEL=0

2 similar comments
@JRosenkranz (Contributor) commented

bot:test
TEST_FILE=test_decoders.py MODEL_ID=ibm-granite/granite-3.2-8b-instruct BATCH_SIZE=8 SEQUENCE_LENGTH=64 USE_TINY_MODEL=0

@JRosenkranz (Contributor) commented

bot:test
TEST_FILE=test_decoders.py MODEL_ID=ibm-granite/granite-3.2-8b-instruct BATCH_SIZE=8 SEQUENCE_LENGTH=64 USE_TINY_MODEL=0

@jjhursey force-pushed the jhursey/stagger-load branch from 14dfb4a to 130a407 on June 6, 2025 at 17:55
@jjhursey (Contributor, Author) commented Jun 6, 2025

I pushed an update that adds the docstrings and fixes a DCO check.

        torch.distributed.barrier()
        dprint(f"Stagger: Enter (Set: {_set+1} of {math.ceil(world_size / float(limit))})")

    def stagger_leave(limit: int):
@JRosenkranz (Contributor) commented Jun 9, 2025

Would it make sense for this to be a stagger_context (given each stagger_enter needs to be paired with a stagger_leave):

with stagger_context(limit):
     model = get_model(...)

@jjhursey (Contributor, Author) commented

Yeah, that would be a good idea. I've not built a context like that in Python before, but I'm always willing to learn. I'll take a pass at it in the next couple of days.

@jjhursey (Contributor, Author) commented

I added a commit to convert this to a contextlib function. Take a look and let me know what you think.
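
For illustration, a contextlib-based wrapper around the enter/leave pair sketched earlier could look like the following; the name stagger_region is hypothetical, and this is a sketch rather than the code in the commit.

    import contextlib

    @contextlib.contextmanager
    def stagger_region(limit: int):
        # Allow at most `limit` ranks inside the wrapped block at a time and
        # guarantee the matching leave call when the block exits.
        stagger_enter(limit)
        try:
            yield
        finally:
            stagger_leave(limit)

    # Usage, following the reviewer's suggestion:
    # with stagger_region(stagger_load):
    #     model = get_model(...)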

jjhursey added 2 commits June 12, 2025 16:44
 * `--stagger_load` : (default: `0` off) Stagger model loading to avoid OOM issues on the host
 * `--stagger_update_lazyhandle` : (default: `0` off) Stagger update_lazyhandle to avoid OOM issues on the host
 * `--dist_timeout` : (default: either `10` for NCCL or `30` for others set by PyTorch) torch distributed timeout in minutes

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
@jjhursey force-pushed the jhursey/stagger-load branch from f60d30f to b7c22e0 on June 12, 2025 at 20:44
@jjhursey (Contributor, Author) commented Jun 12, 2025

I recently updated my foundation-model-stack repo and the rest of the stack, and now I'm seeing a hang during the generate call wrapped by `with torch_sendnn.warmup_mode():` because the second iteration issues an allreduce. The allreduce hangs because not all of the processes are participating in the warmup at the same time.

I'm not sure which part of the stack is causing the break. It's not caused by this PR, but by a new synchronization in the stack that the staggering now encloses in the warmup.
