DDP Training Stuck while GPU utilization is 100% #12239
Unanswered · lsy643 asked this question in code help: CV
Hi,
I have recently been training a model on 16 GPUs with the DDP strategy.
Sometimes the program gets stuck at one training step while the utilization of all 16 GPUs stays at 100%. What is stranger is that this does not happen every run.
I printed the pstack of the process for one GPU, and it appears to be waiting inside NCCL's synchronize function, so my guess is that some message is lost during the reduce step or something similar.
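
For reference, this is roughly how I plan to make the hang fail loudly instead of spinning forever. This is only a minimal sketch: the environment variables are standard NCCL/PyTorch knobs, the 30-minute timeout is an arbitrary value I picked, and with Lightning the Trainer normally calls `init_process_group` itself, so this just illustrates the idea:

```python
# Sketch: surface NCCL hangs as hard errors instead of silent 100% spins.
# Assumption: we control process-group setup ourselves; with Lightning the
# Trainer does this, but the environment variables still apply.
import os
from datetime import timedelta

import torch.distributed as dist

# Log NCCL collective activity so the hanging rank can be identified.
os.environ["NCCL_DEBUG"] = "INFO"
# Let stuck collectives raise after the timeout rather than block forever.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"

# 30 minutes is an arbitrary threshold; a stuck allreduce then raises an
# error instead of hanging indefinitely.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))
```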
So my first question is: has anyone else run into something like this, and how did you handle it?

My second question: if my guess is correct, I could track the time of every training step and, whenever it exceeds a threshold, rerun that step. Any suggestions on the implementation (rough sketch below)? It seems that a Callback alone is not a very simple way to achieve this.
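
Here is a rough sketch of what I have in mind, assuming the only practical escape from a hung NCCL collective is to abort the process and resume from the last checkpoint, since the training thread itself is blocked and cannot "rerun" the step. The class name `StepWatchdog` and the 600 s threshold are placeholders I made up:

```python
# Sketch: a watchdog Callback that aborts the process when a training step
# stalls past a threshold. Assumption: an external launcher restarts the
# run from the last checkpoint after the hard exit.
import faulthandler
import os
import threading
import time

from pytorch_lightning.callbacks import Callback


class StepWatchdog(Callback):
    def __init__(self, threshold_s: float = 600.0):
        self.threshold_s = threshold_s
        self._last_beat = time.monotonic()
        watcher = threading.Thread(target=self._watch, daemon=True)
        watcher.start()

    def _watch(self):
        # Poll the heartbeat from a side thread; the training thread is
        # blocked inside the hung collective, so it cannot check timers.
        while True:
            time.sleep(10.0)
            if time.monotonic() - self._last_beat > self.threshold_s:
                faulthandler.dump_traceback()  # show where each thread is stuck
                os._exit(1)  # hard exit; the launcher restarts the run

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Heartbeat: record that the step finished in time.
        self._last_beat = time.monotonic()
```

It would be attached via `Trainer(callbacks=[StepWatchdog()])`. The watchdog has to live in a separate daemon thread because the main thread is stuck in the collective, and all it can really do is dump stacks and exit, which is why I am not sure a Callback by itself is enough.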
Thanks a lot