CUDA error: an illegal memory access was encountered

We are using the DLRM model for personalization and are hitting this CUDA error. After setting the CUDA_LAUNCH_BLOCKING flag and enabling CUDA core dumps, the failure points to two places where the issue might be happening:
1: torchrec/distributed/embeddingbag.py: input_dist
2: torchrec/sparse/jagged_tensor.py: permute()
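
For context, a debugging setup of this kind can be sketched as follows. This is only an illustration, not the exact configuration used in this report: it assumes the flags are set from Python before any CUDA work starts, and it uses the standard CUDA environment variables `CUDA_LAUNCH_BLOCKING` and `CUDA_ENABLE_COREDUMP_ON_EXCEPTION`.

```python
import os

# Assumption: these must be set before the CUDA context is created,
# i.e. before the first CUDA call made by the training script.

# Run kernels synchronously so the Python stack trace points at the
# call that actually failed rather than a later, unrelated one.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Ask the CUDA driver to write a GPU core dump when a device-side
# exception (such as an illegal memory access) occurs.
os.environ["CUDA_ENABLE_COREDUMP_ON_EXCEPTION"] = "1"
```

The same variables can be exported in the shell that launches the job instead, which avoids any ordering concerns inside the script.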

Some of our JaggedTensors use weights. When we debug jagged_tensor.py, we see a mismatch between the values (the sum of the permuted lengths per key) and the weights. Could that be the root cause of the CUDA error?
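
To make the suspected mismatch concrete, here is a minimal sketch of a host-side consistency check that could be run before `permute()`. It assumes a jagged tensor is only valid when the sum of its lengths equals the number of values and, if weights are present, the number of weights equals the number of values. The helper `check_jagged_sizes` is hypothetical and not part of torchrec; it works on plain Python numbers so it can be run without a GPU.

```python
from typing import List, Optional, Sequence


def check_jagged_sizes(
    lengths: Sequence[int],
    num_values: int,
    num_weights: Optional[int] = None,
) -> List[str]:
    """Return a list of human-readable problems (empty if consistent).

    Hypothetical helper, not a torchrec API.
    """
    problems: List[str] = []

    # Negative lengths would make any offset computation meaningless.
    if any(n < 0 for n in lengths):
        problems.append("negative entry in lengths")

    # Every value must be accounted for by exactly one length entry.
    expected = sum(lengths)
    if expected != num_values:
        problems.append(
            f"sum(lengths)={expected} but values has {num_values} elements"
        )

    # Weights, when present, are expected to pair one-to-one with values.
    if num_weights is not None and num_weights != num_values:
        problems.append(
            f"weights has {num_weights} elements but values has {num_values}"
        )

    return problems


# Example with plain numbers: three bags holding 2, 0 and 1 ids,
# but only two weights were supplied.
print(check_jagged_sizes([2, 0, 1], num_values=3, num_weights=2))
```

With a real KeyedJaggedTensor, the inputs would presumably come from something like `kjt.lengths().tolist()`, `kjt.values().numel()`, and the element count of `kjt.weights_or_none()` when it is not `None`; treat those accessor names as an assumption to verify against the installed torchrec version. If the check reports a problem on the CPU side, an out-of-bounds read inside the permute kernel is a plausible consequence.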

CUDA error: an illegal memory access was encountered #2957
