Skip to content

Conversation

adamshephard
Copy link
Contributor

@adamshephard adamshephard commented Apr 16, 2025

Change multi-GPU mode from DataParallel to DataDistributedParallel to work with torch.compile. However, this essentially limits the task to using one GPU alone when using torch.compile. It is not a trivial solution to change this to use multiple GPUs that also work with torch.compile. We will release a future fix to fully correct this with the new engine.

Copy link

codecov bot commented Apr 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.69%. Comparing base (b564590) to head (f1d7cc4).
Report is 1 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop     #923   +/-   ##
========================================
  Coverage    99.69%   99.69%           
========================================
  Files           71       71           
  Lines         8939     8947    +8     
  Branches      1170     1170           
========================================
+ Hits          8912     8920    +8     
  Misses          23       23           
  Partials         4        4           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@shaneahmed shaneahmed added bug Something isn't working enhancement New feature or request labels Apr 25, 2025
@shaneahmed shaneahmed added this to the Release v1.7.0 milestone Apr 25, 2025
@shaneahmed shaneahmed changed the title FIX: Update for multi-GPU support with torch.compile 🐛 Fix Multi-GPU Support with torch.compile Apr 25, 2025
@Jiaqi-Lv Jiaqi-Lv requested a review from Copilot June 13, 2025 14:25
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adapts multi-GPU support to work with torch.compile by introducing a single‐process DDP fallback and updating cleanup, while retaining a DataParallel fallback for non-compile scenarios.

  • Switches from DataParallel to a single‐process DistributedDataParallel (DDP) when multiple GPUs are available and compilation is enabled
  • Initializes and later destroys the DDP process group in the engine
  • Adds a local (skipped-on-CI) test for multi-GPU feature extraction

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
tiatoolbox/models/models_abc.py Added DDP initialization & wrapping logic in model_to and retained DataParallel fallback
tiatoolbox/models/engine/semantic_segmentor.py Imports and destroys the DDP process group after inference when compile-compatible
tests/models/test_feature_extractor.py Introduced a skipped local test for multi-GPU feature extraction
Comments suppressed due to low confidence (1)

tests/models/test_feature_extractor.py:128

  • This test uses shutil and Path but there are no corresponding imports. Add import shutil and from pathlib import Path at the top of the file.
shutil.rmtree(save_dir, ignore_errors=True)

and torch.cuda.device_count() > 1
and is_torch_compile_compatible()
): # pragma: no cover
dist.destroy_process_group()
Copy link
Preview

Copilot AI Jun 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Destroying the process group without verifying initialization may error if no group exists. Add if dist.is_initialized(): before calling destroy_process_group().

Suggested change
dist.destroy_process_group()
if dist.is_initialized():
dist.destroy_process_group()

Copilot uses AI. Check for mistakes.

Copy link
Member

@shaneahmed shaneahmed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @adamshephard

@shaneahmed shaneahmed merged commit 9593cfe into develop Jun 16, 2025
15 checks passed
@shaneahmed shaneahmed deleted the models-abc-multigpu branch June 16, 2025 09:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

torch.compile issue when computing features on multiple GPUs (nn.DataParallel)
2 participants