🐛 Fix Multi-GPU Support with torch.compile (#923)
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

@@           Coverage Diff            @@
##           develop     #923   +/-   ##
========================================
  Coverage    99.69%   99.69%
========================================
  Files           71       71
  Lines         8939     8947     +8
  Branches      1170     1170
========================================
+ Hits          8912     8920     +8
  Misses          23       23
  Partials         4        4
Pull Request Overview

This PR adapts multi-GPU support to work with `torch.compile` by introducing a single-process DDP fallback and updating cleanup, while retaining a `DataParallel` fallback for non-compile scenarios.

- Switches from `DataParallel` to a single-process `DistributedDataParallel` (DDP) when multiple GPUs are available and compilation is enabled
- Initializes and later destroys the DDP process group in the engine
- Adds a local (skipped-on-CI) test for multi-GPU feature extraction
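The selection logic the overview describes can be sketched as a small decision helper. This is a minimal illustration, not the PR's actual code: `choose_parallel_strategy` is a hypothetical name, and the real logic lives in `model_to` in `tiatoolbox/models/models_abc.py`. The key constraint it encodes is that `DataParallel` does not compose with `torch.compile`, so the compiled multi-GPU path falls back to a single-process DDP setup.

```python
def choose_parallel_strategy(num_gpus: int, compile_enabled: bool) -> str:
    """Pick a multi-GPU wrapping strategy (hypothetical sketch).

    ``DataParallel`` does not work with ``torch.compile``, so when
    compilation is enabled on a multi-GPU machine the model is wrapped in
    a single-process ``DistributedDataParallel`` (world size 1) instead;
    without compilation, the legacy ``DataParallel`` path is kept.
    """
    if num_gpus <= 1:
        return "none"  # single device: no wrapper needed
    if compile_enabled:
        return "single_process_ddp"  # DDP with init_process_group(world_size=1)
    return "data_parallel"  # legacy non-compile multi-GPU path
```

In the single-process DDP case the engine would call `torch.distributed.init_process_group` with `world_size=1` before wrapping, which is why matching cleanup (destroying the process group) is also added in this PR.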
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
File | Description
---|---
tiatoolbox/models/models_abc.py | Added DDP initialization & wrapping logic in `model_to` and retained the `DataParallel` fallback
tiatoolbox/models/engine/semantic_segmentor.py | Imports and destroys the DDP process group after inference when compile-compatible
tests/models/test_feature_extractor.py | Introduced a skipped local test for multi-GPU feature extraction
Comments suppressed due to low confidence (1)
tests/models/test_feature_extractor.py:128
- This test uses `shutil` and `Path` but there are no corresponding imports. Add `import shutil` and `from pathlib import Path` at the top of the file.
shutil.rmtree(save_dir, ignore_errors=True)
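The suggested fix, with the imports in place, would look roughly like the following. The directory name here is a hypothetical stand-in for the test's `save_dir`; `ignore_errors=True` keeps the cleanup from raising if the directory is already gone.

```python
import shutil
from pathlib import Path

# Hypothetical stand-in for the test's output directory.
save_dir = Path("tmp_feature_extractor_output")
save_dir.mkdir(exist_ok=True)

# Remove the output directory; ignore_errors=True makes this a no-op
# if the directory does not exist, so repeated cleanup is safe.
shutil.rmtree(save_dir, ignore_errors=True)
```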
    and torch.cuda.device_count() > 1
    and is_torch_compile_compatible()
):  # pragma: no cover
    dist.destroy_process_group()
Destroying the process group without verifying initialization may error if no group exists. Add `if dist.is_initialized():` before calling `destroy_process_group()`.
Suggested change:
-     dist.destroy_process_group()
+     if dist.is_initialized():
+         dist.destroy_process_group()
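The guarded-teardown pattern in this suggestion can be illustrated without a real GPU setup. `FakeDist` below is a hypothetical stand-in for `torch.distributed` (which exposes `is_initialized()` and `destroy_process_group()` with these semantics); `safe_teardown` shows the guard the reviewer is asking for.

```python
class FakeDist:
    """Hypothetical stand-in for torch.distributed, for illustration only."""

    def __init__(self) -> None:
        self._initialized = False

    def init_process_group(self, backend: str, world_size: int, rank: int) -> None:
        self._initialized = True

    def is_initialized(self) -> bool:
        return self._initialized

    def destroy_process_group(self) -> None:
        # Mirrors the failure mode the review warns about: destroying a
        # process group that was never created raises an error.
        if not self._initialized:
            raise RuntimeError("no process group to destroy")
        self._initialized = False


def safe_teardown(dist) -> bool:
    """Destroy the process group only if one was initialized.

    Returns True if a group was actually torn down, False otherwise.
    """
    if dist.is_initialized():
        dist.destroy_process_group()
        return True
    return False
```

With the guard, cleanup becomes safe to call unconditionally at the end of inference, whether or not the single-process DDP path was taken.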
Thanks @adamshephard

This changes multi-GPU mode from `DataParallel` to `DistributedDataParallel` to work with `torch.compile`. However, it essentially limits the task to a single GPU when `torch.compile` is used; making multiple GPUs work together with `torch.compile` is not a trivial change. We will release a future fix to fully correct this with the new engine.