multi-node training runs crash because ddp_weakref is None during backward #20750
Unanswered
mishooax
asked this question in
DDP / multi-GPU / multi-node
(also reported here: #20706)
Multi-node, multi-GPU training fails partway through a run because ddp_weakref is None during the backward pass. This looks similar to the issue reported in #20390. I could not reproduce it with a small model, and the point of failure (epoch/step) varies between training runs. Any ideas? 🙏
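For readers unfamiliar with the failure mode, here is a minimal sketch of how a weak reference can come back as None. It assumes ddp_weakref is an ordinary Python weakref.ref pointing at the DDP wrapper, which the name suggests but which is not confirmed here. `Wrapper` and `make_ref` are hypothetical stand-ins, not PyTorch or Lightning APIs, and the sketch does not reproduce the actual crash.

```python
import gc
import weakref


class Wrapper:
    """Hypothetical stand-in for the DDP-wrapped module (not a real PyTorch class)."""


def make_ref():
    # Create the object and a weak reference to it. A weak reference does not
    # keep its target alive on its own.
    wrapper = Wrapper()
    ref = weakref.ref(wrapper)
    assert ref() is wrapper  # the target is still alive inside this scope
    return ref  # the only strong reference (`wrapper`) is dropped on return


dead_ref = make_ref()
gc.collect()  # not strictly needed in CPython, but makes collection explicit

# Calling a weak reference whose target has been collected returns None.
target = dead_ref()
print(target is None)
```

If the assumption holds, a None here would suggest that the wrapper object was released (or replaced) while a backward pass still depended on it, which may be a useful angle when inspecting the traceback.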