Open
Description
We are using the DLRM model for personalization and we are getting a CUDA error. After setting the CUDA_LAUNCH_BLOCKING flag and enabling CUDA core dumps, the error pointed to two places where the issue might be happening:
1. torchrec/distributed/embeddingbag.py: input_dist
2. torchrec/sparse/jagged_tensor.py: permute()
Some of our JaggedTensors use weights, and when we debug jagged_tensor.py we see a mismatch between the values (the sum of the permuted lengths per key) and the weights. Do you think that could be the root cause of the CUDA error?
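To make the suspected mismatch concrete, here is a minimal sketch of the consistency check we have in mind, written in plain Python over lists so it runs without a GPU. The helpers `length_per_key` and `check_jagged_consistency` are hypothetical names of ours, not part of torchrec, and the sketch assumes the lengths are laid out key by key with the same number of entries per key.

```python
from typing import Dict, List, Optional, Sequence


def length_per_key(keys: Sequence[str], lengths: Sequence[int]) -> Dict[str, int]:
    """Sum the lengths belonging to each key (assumes a key-major layout)."""
    stride = len(lengths) // len(keys)  # entries per key, i.e. batch size
    return {
        key: sum(lengths[i * stride:(i + 1) * stride])
        for i, key in enumerate(keys)
    }


def check_jagged_consistency(
    keys: Sequence[str],
    values: Sequence[float],
    lengths: Sequence[int],
    weights: Optional[Sequence[float]] = None,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems: List[str] = []
    if len(keys) == 0 or len(lengths) % len(keys) != 0:
        problems.append("lengths cannot be split evenly across keys")
        return problems
    if any(n < 0 for n in lengths):
        problems.append("negative entry in lengths")
    total = sum(lengths)
    if total != len(values):
        problems.append(f"sum(lengths)={total} but len(values)={len(values)}")
    if weights is not None and len(weights) != len(values):
        problems.append(f"len(weights)={len(weights)} but len(values)={len(values)}")
    return problems


# Hypothetical batch: two keys, batch size 2, one weight missing.
keys = ["a", "b"]
lengths = [1, 2, 0, 3]
values = [10, 11, 12, 13, 14, 15]
weights = [0.1, 0.2, 0.3, 0.4, 0.5]

per_key = length_per_key(keys, lengths)
problems = check_jagged_consistency(keys, values, lengths, weights)
```

In our pipeline we would feed this from the tensor right before input_dist, for example by converting `keys()`, `values()`, `lengths()` and `weights_or_none()` of the KeyedJaggedTensor to Python lists. Any reported problem would mean the weights and values are indexed with different offsets during the permute, which is the kind of out-of-bounds access we suspect.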