luciaquirke commented on Nov 18, 2025

Processing a dataset in a single process under torchrun forces us to set the NCCL timeout to something ridiculously large, which prevents us from detecting and handling real NCCL hangs. The slightly lower-level elastic functions fix this.

(The updated NCCL timeout is already merged.)
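
For anyone reading along, here is a rough sketch of what launching through the elastic multiprocessing functions could look like. It is not the code in this PR. It assumes the fix works by running the slow single-process step in the launcher before any process group exists, so the workers can keep an ordinary NCCL timeout. The names `build_worker_envs`, `preprocess_dataset`, `worker`, and `launch`, the port, and the 10-minute timeout are all made up for illustration. `start_processes` and `DefaultLogsSpecs` are from `torch.distributed.elastic.multiprocessing` in recent PyTorch releases; as far as I know, older releases take a `log_dir` argument instead of `logs_specs`, so check the installed version.

```python
from datetime import timedelta


def build_worker_envs(world_size, master_addr="127.0.0.1", master_port=29500):
    """Per-rank environment variables that torchrun would otherwise set.

    Hypothetical helper; single-node only, so LOCAL_RANK equals RANK.
    """
    return {
        rank: {
            "RANK": str(rank),
            "LOCAL_RANK": str(rank),
            "WORLD_SIZE": str(world_size),
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
        }
        for rank in range(world_size)
    }


def preprocess_dataset(dataset_path):
    """Hypothetical stand-in for the slow single-process dataset step."""
    return dataset_path


def worker(prepared_path):
    """Hypothetical per-rank entrypoint; needs one GPU per rank."""
    import os

    import torch
    import torch.distributed as dist

    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # Assumption: with the slow step done already, a normal timeout is enough,
    # so a real NCCL hang surfaces instead of being masked.
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
    try:
        dist.barrier()  # placeholder for the real distributed work
    finally:
        dist.destroy_process_group()


def launch(world_size, dataset_path):
    """Run the slow step once in the launcher, then start the workers."""
    # Imported lazily so the helpers above are usable without torch installed.
    from torch.distributed.elastic.multiprocessing import (
        DefaultLogsSpecs,
        start_processes,
    )

    # No process group exists yet, so no NCCL timeout is ticking here.
    prepared_path = preprocess_dataset(dataset_path)

    ctx = start_processes(
        name="worker",
        entrypoint=worker,
        args={rank: (prepared_path,) for rank in range(world_size)},
        envs=build_worker_envs(world_size),
        logs_specs=DefaultLogsSpecs(),
    )
    result = ctx.wait()  # blocks until all workers finish or one fails
    if result is not None and result.is_failed():
        raise RuntimeError(f"worker failure: {result.failures}")
    return result
```

Calling `launch(world_size=2, dataset_path="...")` from a plain `python` invocation would take the place of `torchrun --nproc_per_node=2 ...` in this sketch. Only `build_worker_envs` runs without torch and GPUs.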

luciaquirke changed the title from "Switch from torchrun to elastic library" to "Switch from torchrun to elastic multiprocessing" on Nov 18, 2025