@H-Huang H-Huang commented Sep 18, 2025

Option 2 of #1682

Occasionally hitting an error in _token_dispatch at the all_to_all_single_autograd call. Possibly a race condition with CUDA streams? Not sure why.

Root Cause (first observed failure):
[0]:
  time      : 2025-09-24_14:52:54
  host      : devvm7508.cco0.facebook.com
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 774685)
  error_file: /tmp/torchelastic_9gxaawwv/none_tw6441vs/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/home/howardhuang/local/pytorch/torch/distributed/pipelining/stage.py", line 704, in forward_one_chunk
      output = self.forward_maybe_with_nosync(*composite_args, **composite_kwargs)
    File "/home/howardhuang/local/pytorch/torch/distributed/pipelining/stage.py", line 564, in forward_maybe_with_nosync
      out_val = self.submod(*args, **kwargs)
    File "/home/howardhuang/local/pytorch/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
    File "/home/howardhuang/local/pytorch/torch/nn/modules/module.py", line 1881, in _call_impl
      return inner()
    File "/home/howardhuang/local/pytorch/torch/nn/modules/module.py", line 1829, in inner
      result = forward_call(*args, **kwargs)
    File "/data/users/howardhuang/titan2/torchtitan/models/deepseek_v3/model/model.py", line 389, in forward
      h = layer(h, self.freqs_cis)
    File "/home/howardhuang/local/pytorch/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
    File "/home/howardhuang/local/pytorch/torch/nn/modules/module.py", line 1881, in _call_impl
      return inner()
    File "/home/howardhuang/local/pytorch/torch/nn/modules/module.py", line 1829, in inner
      result = forward_call(*args, **kwargs)
    File "/data/users/howardhuang/titan2/torchtitan/models/deepseek_v3/model/model.py", line 300, in forward
      x = x + self.moe(self.ffn_norm(x))
    File "/home/howardhuang/local/pytorch/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
    File "/home/howardhuang/local/pytorch/torch/nn/modules/module.py", line 1786, in _call_impl
      return forward_call(*args, **kwargs)
    File "/data/users/howardhuang/titan2/torchtitan/models/moe.py", line 431, in forward
      routed_output = self.experts(routed_input, num_tokens_per_expert)
    File "/home/howardhuang/local/pytorch/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
    File "/home/howardhuang/local/pytorch/torch/nn/modules/module.py", line 1881, in _call_impl
      return inner()
    File "/home/howardhuang/local/pytorch/torch/nn/modules/module.py", line 1818, in inner
      args_result = hook(self, args)
    File "/home/howardhuang/local/pytorch/torch/distributed/tensor/_api.py", line 952, in <lambda>
      lambda mod, inputs: input_fn(mod, inputs, device_mesh)
    File "/data/users/howardhuang/titan2/torchtitan/distributed/expert_parallel.py", line 273, in _token_dispatch
      routed_input = all_to_all_single_autograd(
    File "/home/howardhuang/local/pytorch/torch/distributed/_functional_collectives.py", line 525, in all_to_all_single_autograd
      tensor = torch.ops._c10d_functional_autograd.all_to_all_single(  # type: ignore[attr-defined]
    File "/home/howardhuang/local/pytorch/torch/_ops.py", line 1255, in __call__
      return self._op(*args, **kwargs)
  RuntimeError: Trying to create tensor with negative dimension -1452432247676080984: [-1452432247676080984, 16]
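
A negative dimension this large looks like a garbage value rather than a real token count, so one way to narrow things down is to validate the split sizes on the host right before the all-to-all call. The helper below is a minimal sketch, not code from this PR: `check_splits` and its parameters are hypothetical names, and it assumes the split sizes are already available as plain Python ints (for example after a `.tolist()`).

```python
from typing import Optional, Sequence


def check_splits(
    splits: Sequence[int],
    name: str,
    expected_total: Optional[int] = None,
) -> None:
    """Raise ValueError if a split size is negative or the sizes do not add up.

    Hypothetical debugging helper: `splits` holds one size per peer rank,
    `name` is only used in the error message, and `expected_total` (if given)
    is the number of rows the local tensor is supposed to contain.
    """
    for peer, size in enumerate(splits):
        if size < 0:
            raise ValueError(f"{name}[{peer}] is negative: {size}")
    if expected_total is not None:
        total = sum(splits)
        if total != expected_total:
            raise ValueError(f"{name} sums to {total}, expected {expected_total}")


# A consistent set of sizes passes silently.
check_splits([3, 5, 0, 8], "input_splits", expected_total=16)
```

If a check like this fires intermittently, that would suggest the sizes are being read on the host before the producing copy or kernel has finished, which would fit the stream race suspected above. If it never fires, the bad value is more likely produced inside the collective itself.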

@meta-cla meta-cla bot added the CLA Signed label Sep 18, 2025
@H-Huang H-Huang force-pushed the deepseek-v3-new-methods branch from 3a61b86 to 0f7a7c9 Compare September 22, 2025 21:52

H-Huang commented Sep 22, 2025

Running with:

TORCH_NCCL_TRACE_BUFFER_SIZE=2000 TORCH_NCCL_DUMP_ON_TIMEOUT=true TORCH_FR_DUMP_TEMP_FILE=./nccl_trace_rank_ NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" ./run_train.sh

With CUDA_LAUNCH_BLOCKING=1:

TORCH_NCCL_TRACE_BUFFER_SIZE=2000 TORCH_NCCL_DUMP_ON_TIMEOUT=true TORCH_FR_DUMP_TEMP_FILE=./nccl_trace_rank_ NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" CUDA_LAUNCH_BLOCKING=1 ./run_train.sh
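
With the settings above, the trace dumps should land next to the working directory as one file per rank. The sketch below collects them for inspection. It assumes each dump is a pickle file named with the `TORCH_FR_DUMP_TEMP_FILE` prefix followed by the rank number; `load_fr_dumps` is a hypothetical helper, and it makes no assumption about what the unpickled objects contain.

```python
import os
import pickle
from typing import Any, Dict


def load_fr_dumps(prefix: str) -> Dict[int, Any]:
    """Load every file named prefix + <rank> and return {rank: unpickled object}.

    Assumption: dumps are pickle files whose names end in the rank number,
    e.g. ./nccl_trace_rank_0, ./nccl_trace_rank_1, ...
    Only unpickle files that your own job produced.
    """
    directory, base = os.path.split(prefix)
    directory = directory or "."
    dumps: Dict[int, Any] = {}
    for name in os.listdir(directory):
        suffix = name[len(base):]
        # Skip anything that is not "<base><digits>".
        if not name.startswith(base) or not suffix.isdigit():
            continue
        with open(os.path.join(directory, name), "rb") as f:
            dumps[int(suffix)] = pickle.load(f)
    return dumps
```

Calling `load_fr_dumps("./nccl_trace_rank_")` after a failed run would then give one entry per rank that managed to write a dump, which makes it easy to see which ranks are missing or to compare the last recorded collectives across ranks.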

@H-Huang H-Huang force-pushed the deepseek-v3-new-methods branch from 0f7a7c9 to 6584aac Compare September 24, 2025 21:56