Fix and improve loading of distributed checkpoints #314

Open
wants to merge 105 commits into
base: distributed_tests

@jlamypoirier (Collaborator) commented Jun 19, 2025

✨ Description

Fix #293

Several improvements to the loading of distributed checkpoints in different formats.

  • Load a file only if it is actually needed for the conversion. This should speed up loading considerably for large world sizes.
  • Implement a new, much faster loading method for the common case, in which loading only requires copying contiguous slices.
  • Keep track of the per-parameter loaded count so that SafeLoad can verify it with _check_parameters.
  • Fix the tensor-parallel case (#293, "[bug] Conversion of distributed checkpoints to huggingface"). The problem was in _get_parameter_shard_indices_in_full_weight: because of flatten, the shard indices were sometimes written to a copy of the index rather than to a view of it, so the loaded tensors were ignored entirely.
  • Add more tests involving changes in TP size.
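
The copy-versus-view problem behind the tensor-parallel fix can be reproduced in isolation. In tensor libraries, flattening may return either a view or a copy depending on memory layout. The sketch below shows the same failure mode using only the standard library, with a memoryview standing in for a tensor view. The names fill_shard_indices and NOT_IN_SHARD are hypothetical and are not taken from this PR.

```python
from array import array

NOT_IN_SHARD = -1  # hypothetical sentinel: position not covered by this shard


def fill_shard_indices(target, begin):
    """Write consecutive shard offsets into `target`, starting at `begin`."""
    for i in range(len(target)):
        target[i] = begin + i


# Index into the full weight, initially unmapped everywhere.
full_index = array("q", [NOT_IN_SHARD] * 6)

# Buggy path: slicing an array allocates a new array, so the writes are lost.
copied = full_index[2:5]
fill_shard_indices(copied, begin=0)
index_after_copy = full_index.tolist()

# Fixed path: a memoryview slice shares memory with full_index.
shared = memoryview(full_index)[2:5]
fill_shard_indices(shared, begin=0)
index_after_view = full_index.tolist()

print(index_after_copy)  # [-1, -1, -1, -1, -1, -1]
print(index_after_view)  # [-1, -1, 0, 1, 2, -1]
```

When the writes go to a copy, the index still reports every position as unmapped, which is consistent with the symptom described above: the loaded data is silently skipped rather than raising an error.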

For the case of an unchanged distributed config (e.g., starting a new experiment from a distributed checkpoint), loading should now be almost as fast as the unsafe version.
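
As a rough illustration of why the unchanged-config case can be this cheap, the sketch below plans contiguous slice copies between saved shards and the shard being loaded, skips saved shards that do not overlap, and counts the loaded elements for verification. It is a simplified model, not the code in this PR: it assumes a single flat buffer split evenly across ranks, and SliceCopy, shard_range, plan_copies and load_shard are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceCopy:
    saved_rank: int   # which saved shard file to open
    saved_begin: int  # half-open range inside the saved shard
    saved_end: int
    local_begin: int  # destination offset inside the loading rank's shard


def shard_range(rank, world_size, total):
    """Half-open range of the flat buffer owned by `rank` (even split assumed)."""
    size = total // world_size
    return rank * size, (rank + 1) * size


def plan_copies(rank, world_size, saved_world_size, total):
    """List the contiguous copies needed to fill this rank's shard."""
    begin, end = shard_range(rank, world_size, total)
    copies = []
    for saved_rank in range(saved_world_size):
        s_begin, s_end = shard_range(saved_rank, saved_world_size, total)
        lo, hi = max(begin, s_begin), min(end, s_end)
        if lo < hi:  # saved shards without overlap are never opened
            copies.append(SliceCopy(saved_rank, lo - s_begin, hi - s_begin, lo - begin))
    return copies


def load_shard(rank, world_size, saved_shards, total):
    """Fill one shard from the saved shards; return it with the loaded count."""
    begin, end = shard_range(rank, world_size, total)
    shard = [None] * (end - begin)
    loaded = 0
    for c in plan_copies(rank, world_size, len(saved_shards), total):
        n = c.saved_end - c.saved_begin
        source = saved_shards[c.saved_rank][c.saved_begin:c.saved_end]
        shard[c.local_begin:c.local_begin + n] = source
        loaded += n
    return shard, loaded


saved = [[0, 1], [2, 3], [4, 5], [6, 7]]  # 4 saved shards of an 8-element buffer
unchanged_plan = plan_copies(rank=1, world_size=4, saved_world_size=4, total=8)
resharded, loaded_count = load_shard(rank=1, world_size=2, saved_shards=saved, total=8)
assert loaded_count == len(resharded)  # the loaded count matches the shard size
```

With identical saved and loading world sizes, the plan for each rank contains a single copy from its own saved shard, so only one file is read per rank. Changing the world size yields a few contiguous copies instead, and the loaded count gives a cheap check that nothing was missed.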

This will also help a lot with elastic training (#241) by cutting most of the resuming time.

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@jlamypoirier changed the base branch from main to distributed_tests June 26, 2025 01:26
@jlamypoirier marked this pull request as ready for review June 27, 2025 23:29
@jlamypoirier requested a review from sohamparikh June 27, 2025 23:43
Development

Successfully merging this pull request may close these issues.

[bug] Conversion of distributed checkpoints to huggingface
1 participant