
Options to stagger model loading for low-memory systems #47


Open
jjhursey wants to merge 2 commits into main from jhursey/stagger-load

Conversation

@jjhursey (Contributor) commented May 6, 2025

  • `--stagger_load` : (default: `0`, off) Stagger model loading to avoid OOM issues on the host
  • `--stagger_update_lazyhandle` : (default: `0`, off) Stagger `update_lazyhandle` to avoid OOM issues on the host
  • `--dist_timeout` : torch distributed timeout in minutes (if unset, PyTorch's default applies: `10` minutes for NCCL, `30` for other backends)
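
For reference, here is a minimal sketch of how these options might be registered with argparse in inference.py; the option names and help strings are taken from this PR, while the types and defaults shown here are assumptions rather than the actual code.

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--stagger_load",
        type=int,
        default=0,  # 0 = off: all ranks load the model at the same time
        help="Stagger model loading to avoid OOM issues on the host",
    )
    parser.add_argument(
        "--stagger_update_lazyhandle",
        type=int,
        default=0,  # 0 = off: all ranks run update_lazyhandle at the same time
        help="Stagger update_lazyhandle to avoid OOM issues on the host",
    )
    parser.add_argument(
        "--dist_timeout",
        type=int,
        default=0,  # assumption: 0/unset keeps PyTorch's default (10 min for NCCL, 30 min otherwise)
        help="torch distributed timeout in minutes",
    )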

@jjhursey (Contributor, Author) commented May 7, 2025

Keeping this in Draft for now while a few teams get back to me on testing.

@jjhursey force-pushed the jhursey/stagger-load branch from 1af2202 to c5218b1 on May 7, 2025 at 14:04
@thanh-lam commented

I checked out this PR and am testing it with the latest FMS levels. Reasons for trying this:

  • We have been hitting "Resource temporarily unavailable" failures when running granite-3.0-8b with 8 AIUs. This is one example of a large model, but other models hit this failure as well.
  • With this PR, I added the two new parameters when invoking inference.py:
    • --stagger_load 1
    • --stagger_update_lazyhandle 1
  • So far, tests are passing without hitting "Resource temporarily unavailable".

@thanh-lam commented

For testing purposes: what are the differences between options 1 and 2?
@jjhursey Can you add some descriptions for these options?

@jjhursey (Contributor, Author) commented

For testing purposes: what are the differences between options 1 and 2?

We can improve the help text, but this is what it reports now:

    "--stagger_load",
    help="Stagger model loading to avoid OOM issues on the host"

    "--stagger_update_lazyhandle",
    help="Stagger update_lazyhandle to avoid OOM issues on the host"

So these are two different options to stagger two different sections of the compile phase, in case you need to set different values for each part.

The default value is 0, which means skip the staggering and let all of the processes enter this section at the same time. This results in the maximum memory utilization, but also (generally) the fastest time through the compile phase.

Setting this value to >0 defines how many processes are allowed to be in that phase concurrently. So if you run with 16 processes and set the value to 2, then only 2 processes at a time will be performing the compile phase. Once the first set of 2 is done, the next set of 2 will proceed, and so on. All processes that have completed the compile will wait for the others to finish before moving on to the next section of the code.

Setting the value to 1 means only one process at a time. This will use the least amount of memory, but will result in the longest compile time since we are only compiling for one process at a time. Increasing the value increases the total amount of memory, but should decrease the compile time.

If you are only concerned with functional testing in a memory-constrained environment, and not with compile-time efficiency, then setting this value to 1 is likely what you want.
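
To make that concrete, here is a minimal barrier-based sketch of the enter/leave scheme described above, assuming torch.distributed is already initialized; it illustrates the idea rather than the exact code in this PR.

    import math

    import torch.distributed as dist

    def stagger_enter(limit: int):
        # Ranks are grouped into sets of `limit`; the ranks in set k wait
        # behind k barriers before entering the staggered region.
        if limit <= 0:
            return  # 0 = staggering disabled; everyone enters at once
        for _ in range(dist.get_rank() // limit):
            dist.barrier()

    def stagger_leave(limit: int):
        # Participate in the barriers that release the remaining sets, then
        # hit one final barrier so all ranks leave the region together.
        if limit <= 0:
            return
        num_sets = math.ceil(dist.get_world_size() / float(limit))
        for _ in range(num_sets - 1 - dist.get_rank() // limit):
            dist.barrier()
        dist.barrier()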

@thanh-lam commented

@jjhursey, thanks! This is very helpful.
Any estimate of when (or whether) these new parameters will be added to inference.py?
Because this "helps" get around the "Resource temporarily unavailable" failure in 8-AIU tests, we need to document it while waiting for "real" memory fixes.

@jjhursey force-pushed the jhursey/stagger-load branch from c5218b1 to 2fd81b0 on May 22, 2025 at 15:04
@jjhursey (Contributor, Author) commented

Both the x86 and Power test teams have confirmed that this helps with their testing of large models on low-memory systems.

@jjhursey marked this pull request as ready for review on May 22, 2025 at 15:05
@jjhursey (Contributor, Author) commented

@JRosenkranz @ani300 this is ready for review

    extra_kwargs = {**padding_kwargs, "only_last_token": True}
    max_new_tokens_warmup = max_new_tokens
    if compile_dynamic_sendnn:
        max_new_tokens_warmup = 2

    if stagger_update_lazyhandle > 0 and stagger_update_lazyhandle != world_size:
Contributor commented:

It looks like this logic is called multiple times; do you think it would make sense to put it in its own utility function, so it can be re-used in the future?

@jjhursey (Contributor, Author) commented

Yeah, I can do that. I'll have an "enter" and "exit" version to place in the code.

@jjhursey (Contributor, Author) commented

I just pushed a commit for this change

@JRosenkranz (Contributor) left a comment

Has this been tested with inference.py as well as test_decoders (multi-AIU / single AIU)? In theory those should not change. Also, it might make sense to add an option for this in test_decoders for low-memory systems.

@JRosenkranz (Contributor) commented

bot:test
TEST_FILE=test_decoders.py MODEL_ID=ibm-granite/granite-3.2-8b-instruct BATCH_SIZE=8 SEQUENCE_LENGTH=64 USE_TINY_MODEL=0 NUM_AIU=2

1 similar comment
@dpatel-ops (Collaborator) commented

bot:test
TEST_FILE=test_decoders.py MODEL_ID=ibm-granite/granite-3.2-8b-instruct BATCH_SIZE=8 SEQUENCE_LENGTH=64 USE_TINY_MODEL=0 NUM_AIU=2

@jjhursey force-pushed the jhursey/stagger-load branch from 5aaf975 to 14dfb4a on June 4, 2025 at 16:27
@jjhursey (Contributor, Author) commented Jun 4, 2025

I pushed a commit to consolidate the staggered enter/exit. I also rebased on main

@JRosenkranz (Contributor) commented

bot:test
TEST_FILE=test_decoders.py MODEL_ID=ibm-granite/granite-3.2-8b-instruct BATCH_SIZE=8 SEQUENCE_LENGTH=64 USE_TINY_MODEL=0

2 similar comments
@JRosenkranz (Contributor) commented

bot:test
TEST_FILE=test_decoders.py MODEL_ID=ibm-granite/granite-3.2-8b-instruct BATCH_SIZE=8 SEQUENCE_LENGTH=64 USE_TINY_MODEL=0

@JRosenkranz (Contributor) commented

bot:test
TEST_FILE=test_decoders.py MODEL_ID=ibm-granite/granite-3.2-8b-instruct BATCH_SIZE=8 SEQUENCE_LENGTH=64 USE_TINY_MODEL=0

@jjhursey force-pushed the jhursey/stagger-load branch from 14dfb4a to 130a407 on June 6, 2025 at 17:55
@jjhursey (Contributor, Author) commented Jun 6, 2025

I pushed an update that adds the docstrings and fixes a DCO check.

        torch.distributed.barrier()
        dprint(f"Stagger: Enter (Set: {_set+1} of {math.ceil(world_size / float(limit))})")

    def stagger_leave(limit: int):
@JRosenkranz (Contributor) commented Jun 9, 2025

Would it make sense for this to be a stagger_context (given each stagger_enter needs to be paired with a stagger_leave):

with stagger_context(limit):
     model = get_model(...)

@jjhursey (Contributor, Author) commented

Yeah, that would be a good idea. I've not built a context like that in Python before, but I'm always willing to learn. I'll take a pass at it in the next couple of days.

@jjhursey (Contributor, Author) commented

I added a commit to convert this to a contextlib function. Take a look and let me know what you think.
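
For illustration, a contextlib-based wrapper around the enter/leave pair sketched earlier could look like the following; the name stagger_region is hypothetical, and this is a sketch rather than the code in the commit.

    import contextlib

    @contextlib.contextmanager
    def stagger_region(limit: int):
        # Allow at most `limit` ranks inside the wrapped block at a time and
        # guarantee the matching leave call when the block exits.
        stagger_enter(limit)
        try:
            yield
        finally:
            stagger_leave(limit)

    # Usage, following the reviewer's suggestion:
    # with stagger_region(stagger_load):
    #     model = get_model(...)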

jjhursey added 2 commits June 12, 2025 16:44
 * `--stagger_load` : (default: `0` off) Stagger model loading to avoid OOM issues on the host
 * `--stagger_update_lazyhandle` : (default: `0` off) Stagger update_lazyhandle to avoid OOM issues on the host
 * `--dist_timeout` : (default: either `10` for NCCL or `30` for others set by PyTorch) torch distributed timeout in minutes

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
@jjhursey force-pushed the jhursey/stagger-load branch from f60d30f to b7c22e0 on June 12, 2025 at 20:44
@jjhursey (Contributor, Author) commented Jun 12, 2025

I recently updated my foundation-model-stack repo and the rest of the stack, and now I'm seeing a hang during the generate call wrapped by `with torch_sendnn.warmup_mode():` because the second iteration issues an allreduce. The allreduce hangs because not all of the processes are participating in the warmup at the same time.

I'm not sure which part of the stack is causing the break. It's not caused by this PR, but by a new synchronization in the stack that the staggering now encloses in the warmup.
