🐛 Fix Multi-GPU Support with `torch.compile` #923

adamshephard · 2025-04-16T15:54:03Z

Change the multi-GPU mode from `DataParallel` to `DistributedDataParallel` so that it works with `torch.compile`. However, this currently restricts the task to a single GPU whenever `torch.compile` is used. Making multiple GPUs work together with `torch.compile` is not trivial; a future release will address this fully with the new engine.
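As a rough illustration of the single-process setup described above, the sketch below wraps a model in `DistributedDataParallel` using a process group with exactly one rank. It is not the code from this PR: the helper name `wrap_in_single_process_ddp`, the CPU `gloo` backend, and the file-based rendezvous are all assumptions chosen so the example runs without GPUs.

```python
import os
import tempfile

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel


def wrap_in_single_process_ddp(model: nn.Module) -> nn.Module:
    """Wrap `model` in DDP using a one-rank process group (hypothetical helper)."""
    if not dist.is_initialized():
        # Assumption: file-based rendezvous and the CPU "gloo" backend, chosen so
        # the sketch runs without GPUs; a CUDA setup would typically use "nccl".
        init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
        dist.init_process_group(
            backend="gloo",
            init_method=f"file://{init_file}",
            rank=0,
            world_size=1,
        )
    return DistributedDataParallel(model)


model = nn.Linear(4, 2).eval()
ddp_model = wrap_in_single_process_ddp(model)

# Inference through the DDP wrapper behaves like the underlying module.
with torch.no_grad():
    output = ddp_model(torch.zeros(3, 4))
output_shape = tuple(output.shape)

# Tear the group down again so repeated runs start clean.
dist.destroy_process_group()
```

With `world_size=1` there is only one worker, which matches the limitation stated in the description: the wrapper satisfies `torch.compile`, but the work is not spread across GPUs.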

codecov · 2025-04-16T16:17:19Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.69%. Comparing base (b564590) to head (f1d7cc4).
Report is 1 commit behind head on develop.

Additional details and impacted files

@@           Coverage Diff            @@
##           develop     #923   +/-   ##
========================================
  Coverage    99.69%   99.69%           
========================================
  Files           71       71           
  Lines         8939     8947    +8     
  Branches      1170     1170           
========================================
+ Hits          8912     8920    +8     
  Misses          23       23           
  Partials         4        4


Copilot

Pull Request Overview

This PR adapts multi-GPU support to work with torch.compile by introducing a single‐process DDP fallback and updating cleanup, while retaining a DataParallel fallback for non-compile scenarios.

- Switches from DataParallel to a single‐process DistributedDataParallel (DDP) when multiple GPUs are available and compilation is enabled
- Initializes and later destroys the DDP process group in the engine
- Adds a local (skipped-on-CI) test for multi-GPU feature extraction
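The selection logic summarized in this overview can be sketched as a small pure function. This is only an illustration of the decision as described here, not the code in `model_to`: the function name, the string labels, and the `compile_compatible` parameter (standing in for the result of `is_torch_compile_compatible()`) are assumptions.

```python
def choose_parallel_mode(gpu_count: int, compile_compatible: bool) -> str:
    """Pick a multi-GPU wrapper following the behaviour summarized in the review.

    Hypothetical helper: the labels returned here are illustrative only.
    """
    if gpu_count <= 1:
        # Nothing to parallelize across; use the model as-is.
        return "single_device"
    if compile_compatible:
        # DataParallel does not work with torch.compile, so use DDP instead.
        return "ddp"
    # Without compilation, the DataParallel fallback is retained.
    return "data_parallel"


modes = {
    (gpus, compat): choose_parallel_mode(gpus, compat)
    for gpus in (1, 2)
    for compat in (False, True)
}
```

Keeping the decision separate from the wrapping makes it testable on machines without GPUs, which is relevant given that the multi-GPU test in this PR is skipped on CI.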

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tiatoolbox/models/models_abc.py | Added DDP initialization & wrapping logic in `model_to` and retained DataParallel fallback |
| tiatoolbox/models/engine/semantic_segmentor.py | Imports and destroys the DDP process group after inference when compile-compatible |
| tests/models/test_feature_extractor.py | Introduced a skipped local test for multi-GPU feature extraction |

Comments suppressed due to low confidence (1)

tests/models/test_feature_extractor.py:128

This test uses `shutil` and `Path` but there are no corresponding imports. Add `import shutil` and `from pathlib import Path` at the top of the file.

shutil.rmtree(save_dir, ignore_errors=True)

tiatoolbox/models/models_abc.py

Copilot · 2025-06-13T14:29:09Z

tiatoolbox/models/engine/semantic_segmentor.py

+            and torch.cuda.device_count() > 1
+            and is_torch_compile_compatible()
+        ):  # pragma: no cover
+            dist.destroy_process_group()


Destroying the process group without verifying initialization may error if no group exists. Add `if dist.is_initialized():` before calling `destroy_process_group()`.

Suggested change

-            dist.destroy_process_group()
+            if dist.is_initialized():
+                dist.destroy_process_group()
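A minimal sketch of the guarded cleanup suggested here, written as a standalone helper. The helper name and the demo setup (a one-rank CPU group using the `gloo` backend and a file-based rendezvous) are assumptions for illustration and are not taken from the PR.

```python
import os
import tempfile

import torch.distributed as dist


def destroy_process_group_if_initialized() -> bool:
    """Destroy the default process group if one exists; report whether it did."""
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
        return True
    return False


# Demo with a one-rank CPU group (assumption: "gloo" backend, file rendezvous).
init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
dist.init_process_group(
    backend="gloo",
    init_method=f"file://{init_file}",
    rank=0,
    world_size=1,
)

first_call = destroy_process_group_if_initialized()  # a group existed
second_call = destroy_process_group_if_initialized()  # nothing left to destroy
```

Because the helper checks `dist.is_initialized()` first, calling it a second time, or on a code path where no group was ever created, is a no-op rather than an error.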

tests/models/test_feature_extractor.py

shaneahmed

Thanks @adamshephard

FIX: Update for multi-GPU support in models_abc

b271f3e

adamshephard mentioned this pull request Apr 16, 2025

torch.compile issue when computing features on multiple GPUs (nn.DataParallel) #889

Closed

adamshephard and others added 2 commits April 16, 2025 18:04

UPD: Update code

cc5407a

[pre-commit.ci] auto fixes from pre-commit.com hooks

e124cd8

for more information, see https://pre-commit.ci

shaneahmed linked an issue Apr 17, 2025 that may be closed by this pull request

torch.compile issue when computing features on multiple GPUs (nn.DataParallel) #889

Closed

shaneahmed assigned adamshephard Apr 25, 2025

shaneahmed added the bug and enhancement labels Apr 25, 2025

shaneahmed added this to the Release v1.7.0 milestone Apr 25, 2025

shaneahmed changed the title ~~FIX: Update for multi-GPU support with torch.compile~~ 🐛 Fix Multi-GPU Support with torch.compile Apr 25, 2025

adamshephard and others added 15 commits April 25, 2025 11:22

Merge branch 'develop' into models-abc-multigpu

698f16a

Merge branch 'develop' into models-abc-multigpu

ee25842

Merge branch 'develop' into models-abc-multigpu

1f15307

FIX: Fix to work on other machines

e7b0822

FIX: Fix to work on other machines

0615636

Merge branch 'models-abc-multigpu' of https://github.com/TissueImageAnalytics/tiatoolbox into models-abc-multigpu

b830c11

FIX: Fix to work on other machines

a1d7357

Merge branch 'develop' into models-abc-multigpu

73440c2

Merge branch 'models-abc-multigpu' of https://github.com/TissueImageAnalytics/tiatoolbox into models-abc-multigpu

b1d80dc

Merge branch 'develop' into models-abc-multigpu

41a74aa

Merge branch 'develop' into models-abc-multigpu

56df269

Merge branch 'models-abc-multigpu' of https://github.com/TissueImageAnalytics/tiatoolbox into models-abc-multigpu

9b7b24e

Merge branch 'develop' into models-abc-multigpu

f914933

Merge branch 'models-abc-multigpu' of github.com:TissueImageAnalytics/tiatoolbox into models-abc-multigpu

409498c

UPD: Comment out cuda for coverage

f1d7cc4

Jiaqi-Lv requested a review from Copilot June 13, 2025 14:25

Copilot AI reviewed Jun 13, 2025


shaneahmed approved these changes Jun 16, 2025


shaneahmed merged commit 9593cfe into develop Jun 16, 2025
15 checks passed

shaneahmed deleted the models-abc-multigpu branch June 16, 2025 09:41
