Update get_merged_lora_ckpt for dist checkpoints #2834

Merged
merged 8 commits into pytorch:main from fix-dist-merged-weights on Jun 23, 2025

Conversation

ankitageorge
Contributor

@ankitageorge commented Jun 18, 2025

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.

Changelog

What are the changes made in this PR?

  • There was a bug where get_merged_lora_ckpt didn't work in distributed recipes when async checkpointing was enabled, because the existing method expects all of the data to be on rank 0. This PR adds a new method, get_merged_lora_dist_ckpt, for distributed checkpoints and calls it where needed (see the sketch after this list).
  • Note that this doesn't work for NF4 tensors; I've added a comment in the code as well.
  • When testing a 70B model with the lora_finetune_distributed recipe, the distributed merge took ~20s, compared to ~63s for gathering and ~250s for saving (~310s total) previously, saving almost 5 minutes.
  • Since we save the adapter weights separately, change the order to save them first so that the trainer is unblocked sooner. The distributed checkpointer blocks on itself across its two separate calls, so if we save the adapter checkpoint after the normal one, we block on the normal checkpoint's save time, which is longer than the adapter's.
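
For context, a minimal sketch of the per-module merge math involved (illustrative only: merge_lora_module and its argument names are made up here, not the exact torchtune implementation; in the distributed case these tensors may be sharded DTensors, so the merge has to run collectively on every rank rather than assuming everything already lives on rank 0):

import torch

def merge_lora_module(
    base_weight: torch.Tensor,  # [out_dim, in_dim]
    lora_a: torch.Tensor,       # [rank, in_dim]
    lora_b: torch.Tensor,       # [out_dim, rank]
    alpha: float,
    rank: int,
) -> torch.Tensor:
    # Fold the adapter back into the base weight: W' = W + (alpha / rank) * (B @ A)
    return base_weight + (alpha / rank) * (lora_b @ lora_a)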

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • [x] run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • [x] run unit tests via pytest tests
  • [x] run recipe tests via pytest tests -m integration_test
  • [x] manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • [x] I did not change any public API
  • I have added an example to docs or docstrings

pytorch-bot bot commented Jun 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2834

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 3 Pending

As of commit 9a0a6eb with merge base 9d91fe3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Jun 18, 2025
@ankitageorge marked this pull request as ready for review June 18, 2025 19:02
@ankitageorge changed the title from dist merge to Update get_merged_lora_ckpt for dist checkpoints Jun 18, 2025
@ankitageorge requested a review from joecummings June 18, 2025 19:37
lora_moe_modules = _get_lora_moe_modules(state_dict)

# Create a simple module for matrix multiplication
class MatMulModule(torch.nn.Module):
Contributor

Why do we need this instead of just calling matmul directly?

Contributor Author

These operations don't work properly on DTensors; it's what was causing the hangs.

Contributor

"doesn't work properly" - can you expand on that?

Contributor Author

OK, actually, I think it's not needed. Good catch. I thought it was causing problems, but I just re-tested without it and it still works. I'll get rid of these modules and just add barriers to the existing method.
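
As a rough sketch of what "add barriers to the existing method" could look like (hypothetical: merge_with_barriers and merge_fn are illustrative names, not the actual change in this PR):

import torch.distributed as dist

def merge_with_barriers(state_dict, merge_fn):
    # Sync all ranks before merging so every rank's shards are in place
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
    merged = merge_fn(state_dict)
    # Sync again so no rank races ahead into checkpoint saving
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
    return merged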

lora_b_weight = state_dict[f"{module}.lora_{param}_b"]

# Create a simple module for transpose operation
class TransposeModule(torch.nn.Module):
Contributor

Same here: why does this need to be a transpose module?

@ankitageorge force-pushed the fix-dist-merged-weights branch from e5ed645 to 9a0a6eb June 23, 2025 20:25
@ankitageorge merged commit 5b2e881 into pytorch:main Jun 23, 2025
14 checks passed