DDP Training Stuck while GPU utilization is 100% #12239
Unanswered · lsy643 asked this question in code help: CV
Hi,
I have recently been training a model on 16 GPUs with the DDP strategy.
Sometimes the program gets stuck at one training step while the utilization of all 16 GPUs stays at 100%. What is stranger is that this does not happen every run.
I printed the pstack of the process for one GPU, and it appears to be waiting inside NCCL's synchronize function, so my guess is that some message is lost during the reduce step or something similar.
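
For reference, this is roughly how I plan to make the hang fail loudly instead of spinning forever. This is only a minimal sketch: the environment variables are standard NCCL/PyTorch knobs, the 30-minute timeout is an arbitrary value I picked, and with Lightning the Trainer normally calls `init_process_group` itself, so this just illustrates the idea:

```python
# Sketch: surface NCCL hangs as hard errors instead of silent 100% spins.
# Assumption: we control process-group setup ourselves; with Lightning the
# Trainer does this, but the environment variables still apply.
import os
from datetime import timedelta

import torch.distributed as dist

# Log NCCL collective activity so the hanging rank can be identified.
os.environ["NCCL_DEBUG"] = "INFO"
# Let stuck collectives raise after the timeout rather than block forever.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"

# 30 minutes is an arbitrary threshold; a stuck allreduce then raises an
# error instead of hanging indefinitely.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))
```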
So my first question is: has anyone else run into something like this, and how did you handle it?

My second question: if my guess is correct, I could track the time of every training step and, whenever it exceeds a threshold, rerun that step. Any suggestions on the implementation (rough sketch below)? It seems that a Callback alone is not a very simple way to achieve this.
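
Here is a rough sketch of what I have in mind, assuming the only practical escape from a hung NCCL collective is to abort the process and resume from the last checkpoint, since the training thread itself is blocked and cannot "rerun" the step. The class name `StepWatchdog` and the 600 s threshold are placeholders I made up:

```python
# Sketch: a watchdog Callback that aborts the process when a training step
# stalls past a threshold. Assumption: an external launcher restarts the
# run from the last checkpoint after the hard exit.
import faulthandler
import os
import threading
import time

from pytorch_lightning.callbacks import Callback


class StepWatchdog(Callback):
    def __init__(self, threshold_s: float = 600.0):
        self.threshold_s = threshold_s
        self._last_beat = time.monotonic()
        watcher = threading.Thread(target=self._watch, daemon=True)
        watcher.start()

    def _watch(self):
        # Poll the heartbeat from a side thread; the training thread is
        # blocked inside the hung collective, so it cannot check timers.
        while True:
            time.sleep(10.0)
            if time.monotonic() - self._last_beat > self.threshold_s:
                faulthandler.dump_traceback()  # show where each thread is stuck
                os._exit(1)  # hard exit; the launcher restarts the run

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Heartbeat: record that the step finished in time.
        self._last_beat = time.monotonic()
```

It would be attached via `Trainer(callbacks=[StepWatchdog()])`. The watchdog has to live in a separate daemon thread because the main thread is stuck in the collective, and all it can really do is dump stacks and exit, which is why I am not sure a Callback by itself is enough.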
Thanks a lot