@jialei777 commented Sep 9, 2025

Adds sharding and the related all-to-all dispatch/combine logic around the GMM kernel in the MoE layer.

Currently failing when sharding the `all_to_all` collective in the MoE layer:

```
Error executing job with overrides: ['model=deepseek-v3', 'ici_mesh.fsdp=4', 'logging_steps=1', 'dataset.block_size=4096', 'task.global_batch_size=4', 'profile_start_step=1', 'task.lr_scheduler.type=constant', 'task.max_steps=6', 'model.num_hidden_layers=2', 'model.first_k_dense_replace=1']
Traceback (most recent call last):
  File "/home/jialeic_google_com/work/torchprime/torchprime/torch_xla_models/train.py", line 94, in main
    trainer.train_loop()
  File "/home/jialeic_google_com/work/torchprime/torchprime/torch_xla_models/trainer/base_trainer.py", line 263, in train_loop
    loss, grad_norm = self.train_step(batch)
                      ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jialeic_google_com/miniconda3/envs/xla0905-py312-2/lib/python3.12/contextlib.py", line 80, in inner
    with self._recreate_cm():
         ^^^^^^^^^^^^^^^^^^^
  File "/home/jialeic_google_com/miniconda3/envs/xla0905-py312-2/lib/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/home/jialeic_google_com/miniconda3/envs/xla0905-py312-2/lib/python3.12/site-packages/torch_xla/torch_xla.py", line 214, in _compile
    sync()
  File "/home/jialeic_google_com/miniconda3/envs/xla0905-py312-2/lib/python3.12/site-packages/torch_xla/torch_xla.py", line 87, in sync
    torch_xla._XLAC._xla_step_marker(
ValueError: RET_CHECK failure (third_party/tensorflow/compiler/xla/service/spmd/spmd_partitioner.cc:5636) hlo->has_sharding() Side-effect HLO must have sharding: %all-to-all.921 = (bf16[1,131072,7168]{2,1,0}) all-to-all(%add.919), replica_groups={{0},{1},{2},{3}}, constrain_layout=true

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
```
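The RET_CHECK says the `all-to-all` HLO reaches the SPMD partitioner without a sharding annotation. A minimal sketch of one likely workaround, assuming torch_xla's SPMD API (`xs.mark_sharding`, `xm.all_to_all`): explicitly re-annotate the collective's output so the partitioner sees a sharding on the side-effecting HLO. The mesh axes, tensor shape, and split dimensions below are illustrative, not taken from the PR.

```python
# Hedged sketch, not the PR's actual code: annotate the all_to_all output
# with an explicit sharding so the SPMD partitioner does not hit the
# "Side-effect HLO must have sharding" RET_CHECK.
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.spmd as xs
import torch_xla.runtime as xr

xr.use_spmd()  # enable the SPMD execution mode

num_devices = xr.global_runtime_device_count()
# Illustrative 1-D mesh over the fsdp axis (matches ici_mesh.fsdp=4 above).
mesh = xs.Mesh(np.arange(num_devices), (num_devices,), ("fsdp",))

# Illustrative activations shaped like the failing HLO: bf16[1, 131072, 7168].
hidden = torch.randn(1, 131072, 7168, dtype=torch.bfloat16, device="xla")

# Dispatch tokens across expert shards, then restore the sharding
# annotation on the collective's output before the next step marker.
dispatched = xm.all_to_all(
    hidden, split_dimension=1, concat_dimension=0, split_count=num_devices)
xs.mark_sharding(dispatched, mesh, ("fsdp", None, None))
```

Whether annotating the output is sufficient, or the input to the collective also needs a `mark_sharding` call, would need to be verified against the partitioner behavior on this HLO.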
