- TIA Toolbox version: develop branch
- Python version: 3.11.8
- Operating System: linux
Description
I am computing features on multiple GPUs on the same node using `DeepFeatureExtractor`. My code for extracting features is essentially the same as in the new notebook demonstrating the feature-extraction process: #887
What I Did
The `nn.DataParallel` wrapper built into tiatoolbox handles the multi-GPU computation. I pulled the changes that introduced `torch.compile` and switched from `ON_GPU` to using `device`, updating the argument in the `DeepFeatureExtractor`'s `predict` method to use `device` instead of `on_gpu`.
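For context, here is a minimal sketch of the call I am describing. The slide path, backbone, batch size, and I/O settings are placeholders rather than my exact values; the overall structure follows the notebook from #887.

```python
from tiatoolbox.models import DeepFeatureExtractor, IOSegmentorConfig
from tiatoolbox.models.architecture.vanilla import CNNBackbone

# Placeholder backbone; any CNN feature extractor would do here.
model = CNNBackbone("resnet50")

# Placeholder resolutions/patch sizes, not the values from my run.
ioconfig = IOSegmentorConfig(
    input_resolutions=[{"units": "mpp", "resolution": 0.5}],
    output_resolutions=[{"units": "mpp", "resolution": 0.5}],
    patch_input_shape=[224, 224],
    patch_output_shape=[224, 224],
    stride_shape=[224, 224],
)

extractor = DeepFeatureExtractor(
    batch_size=32,
    model=model,
    num_loader_workers=4,
)
output = extractor.predict(
    ["sample_wsi.svs"],  # placeholder slide path
    mode="wsi",
    ioconfig=ioconfig,
    save_dir="features/",
    device="cuda",  # new-style argument, replacing on_gpu=True
)
```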
The error traceback is too long to paste in full, but here are some of the errors (from a single run):
File "/tmp/torchinductor_qun786/vv/cvvkeueuq2m4jcjzub4hcfpkhpogtc5b2xddykdgxvsxcvnpfa2w.py", line 173, in call
buf2 = extern_kernels.convolution(buf0, buf1, stride=(14, 14), padding=(0, 0), dilation=(1, 1), transposed=False, output_padding=(0, 0), groups=1, bias=Non
e)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in
method wrapper_CUDA__cudnn_convolution)
...
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
...
RuntimeError: Triton Error [CUDA]: invalid device context
What I can gather is that `torch.compile` is not working well with `nn.DataParallel`.
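To illustrate what I suspect is happening, here is a standalone sketch (not tiatoolbox code; the model is arbitrary and the wrapping order inside tiatoolbox may differ) that combines the two:

```python
import torch
import torchvision

model = torchvision.models.resnet18().to("cuda:0")
model = torch.compile(model)          # graph gets specialized while on cuda:0
model = torch.nn.DataParallel(model)  # replicas are scattered to cuda:0, cuda:1, ...

x = torch.randn(8, 3, 224, 224, device="cuda:0")
y = model(x)  # replicas on devices other than cuda:0 can hit errors like those above
```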
Please let me know if you can reproduce the error by simply running the `DeepFeatureExtractor` feature-extraction code with `rcParam["torch_compile_mode"] = "default"` on a node with at least 2 devices.
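For reproduction, the only tiatoolbox setting I changed is the one below; the `CUDA_VISIBLE_DEVICES` hint is just one example of exposing two GPUs:

```python
from tiatoolbox import rcParam

rcParam["torch_compile_mode"] = "default"  # enable torch.compile in tiatoolbox
# ...then run the DeepFeatureExtractor sketch above on a node where at
# least two GPUs are visible (e.g. CUDA_VISIBLE_DEVICES=0,1).
```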
Maybe `nn.DistributedDataParallel` is a better option to use: https://pytorch.org/docs/stable/notes/cuda.html#cuda-nn-ddp-instead
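If it helps, here is a minimal standalone DDP sketch (not the tiatoolbox API; the model is an arbitrary stand-in, and compiling after the DDP wrap is one common pattern), launched with `torchrun --nproc_per_node=2 script.py`:

```python
import os

import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP


def main() -> None:
    dist.init_process_group("nccl")  # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torchvision.models.resnet18().to(local_rank)
    model = DDP(model, device_ids=[local_rank])
    model = torch.compile(model)  # one compiled graph per process, one device each

    x = torch.randn(8, 3, 224, 224, device=local_rank)
    _ = model(x)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```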
Relevant code: `tiatoolbox/models/models_abc.py`, lines 42 to 61 at commit `5f1cecb`.