[WIP]: Rack Scale Ascend Platform Large-scale MoE Deployment Support #798

Open

Zaragoto opened this issue May 8, 2025 · 0 comments

Zaragoto commented May 8, 2025

🚀 The feature, motivation and pitch

Motivation.

In the real-world deployment of large Mixture-of-Experts (MoE) models like DeepSeek-V3/R1, large-scale Expert Parallelism (EP) support is crucial. While the latest version of vllm-ascend supports a combination of Tensor Parallelism (TP), Expert Parallelism (EP), and Data Parallelism (DP) for distributed parallelism, its DP implementation has limitations that hinder its practical use:

  1. There are bugs in multi-node DP offline inference, which are difficult to identify because of the misleading example provided in examples/dp_offline/data_parallel.py.
  2. There are bugs in single-node DP online serving, also obscured by the same misleading example.
  3. Multi-node DP online serving is not supported.

This issue aims to address and rectify the identified problems with the DP+EP+TP combination in vLLM V1 on the Ascend platform. Below we detail the detected bugs and unsupported features, along with our proposed code modifications to resolve them.

Proposed Changes.

1. Basic bug fixes to make the code work correctly

  • [bug fix] patch/platform/patch_common/patch_distributed.py

The original ascend_stateless_init_dp_group function in patch_distributed.py fails to correctly initialize the dp_group, resulting in the error shown in the screenshot attached to the original issue.
This bug stems from the fact that the ProcessGroup interface and attributes in torch.distributed differ between PyTorch 2.5.1 and PyTorch 2.6.0. The current implementation in vLLM does not account for PyTorch 2.5.1 compatibility. To address this, we have implemented the following modification using a monkey patch: [PR id]
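
As a rough illustration, the fix follows the usual vllm-ascend monkey-patch pattern, gated on the installed PyTorch version. The sketch below shows only the shape of that pattern, not the merged code: the helper names are ours, it assumes the patched entry point is vLLM's ParallelConfig.stateless_init_dp_group, and the PyTorch 2.5.1-compatible group construction itself (passed in as compat_init_dp_group) is omitted.

```python
from packaging.version import Version

import torch


def _is_torch_older_than_2_6() -> bool:
    # Strip any local suffix such as "+cpu" before comparing versions.
    return Version(torch.__version__.split("+")[0]) < Version("2.6.0")


def apply_dp_group_patch(compat_init_dp_group) -> None:
    """Install the PyTorch-2.5.1-compatible DP group init only when needed."""
    if not _is_torch_older_than_2_6():
        return  # PyTorch >= 2.6.0: the upstream vLLM path works as-is

    from vllm.config import ParallelConfig

    # Replace the stateless DP group initializer with the compat implementation.
    ParallelConfig.stateless_init_dp_group = compat_init_dp_group
```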

  • [bug fix] attention/mla_v1.py

The current implementation of mla_v1.py in the latest version contains bugs that lead to dimension mismatch issues within chunked_prefill_mla.

To resolve this, we have reverted the code to commit ID 1a1f9a6, which effectively fixes the bug.

  • [bug fix] ops/fused_moe.py

All-to-all communication is only used in MC2 mode, which is still not fully supported. Until the backend supports this communication, the all-to-all path should be bypassed whenever MC2 is not in use. We adopt the bug fix from PR #710, which addresses the same issue.
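
A hedged sketch of the kind of guard involved (the function and flag names here are illustrative, not the identifiers in ops/fused_moe.py): the all-to-all dispatch is only issued when MC2 mode is active, so the backend is never asked for a collective it does not yet support.

```python
import torch
import torch.distributed as dist


def dispatch_tokens(hidden_states: torch.Tensor, use_mc2: bool) -> torch.Tensor:
    """Token dispatch for fused MoE; all-to-all is only needed in MC2 mode."""
    if not use_mc2:
        # Bypass the collective entirely when MC2 is not in use.
        return hidden_states

    output = torch.empty_like(hidden_states)
    dist.all_to_all_single(output, hidden_states)
    return output
```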

  • [bug fix] worker/worker_v1.py

When DP ranks have different max_tokens parameters, the rank that finishes its forward passes first must wait for the others before terminating; otherwise a communication error occurs. This is handled by calling execute_dummy_batch() to run dummy forward passes on the finished rank. We adopt the bug fix from PR #710, which addresses the same issue.
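
Conceptually, the fix keeps every DP rank in lock-step until all of them are done. The sketch below illustrates the idea only; the loop, scheduler, and helper names are ours, and just execute_dummy_batch() comes from the worker code.

```python
import torch
import torch.distributed as dist


def dp_generation_loop(worker, scheduler, dp_group) -> None:
    """Keep all DP ranks stepping together until every rank has finished."""
    while True:
        has_work = scheduler.has_unfinished_requests()

        # Let every rank learn whether *any* rank still has real work to do.
        flag = torch.tensor([1 if has_work else 0], dtype=torch.int32)
        dist.all_reduce(flag, op=dist.ReduceOp.MAX, group=dp_group)
        if flag.item() == 0:
            break  # all DP ranks finished; now it is safe to terminate

        if has_work:
            worker.execute_model(scheduler.schedule())
        else:
            # A padded no-op forward keeps the collective ops aligned with
            # the ranks that are still generating.
            worker.execute_dummy_batch()
```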

2. Add new features to support DP offline inference

  • [new feature] worker/worker_v1.py

During the initialization of NPUWorker, the local_rank is not correctly assigned to support multi-DP configurations on a single node. For instance, if we have two DP processes on a single node (each with TP=1), the local_rank for the worker in DP process-0 should be 0, and the local_rank for the worker in DP process-1 should be 1. However, the current implementation assigns local_rank = 0 to both DP process-0 and DP process-1.
To address this, we modified the code to update the worker's local_rank with an offset calculated using dp_rank_local and tp_size.
This updated code correctly supports scenarios with multiple DP processes per node, such as a single node with DP=2, or a 2-node setup with DP=4 for offline inference, as long as the DP processes are launched locally on their respective nodes.
However, a bug arises when attempting to support multi-node DP online serving, where all processes are launched on the master node. For example, in a 2-node configuration with DP=2 and TP=1 for online serving, the worker in DP process-0 (on the master node) should have local_rank = 0, and the worker in DP process-1 (on the slave node) should also have local_rank = 0. Because vLLM doesn't fully support multi-node DP online serving, and dp_rank_local is not computed correctly in this scenario, the workers in DP process-0 and DP process-1 are incorrectly assigned local_rank values of 0 and 1, respectively. This leads to an "invalid device ID" error.
To resolve this, we introduced a new environment variable, VLLM_DP_SIZE_PER_NODE. This variable should be set to the number of DP processes per node when running online serving. The code was then modified as follows: [PR ID].
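
As a rough sketch of the assignment described above (only VLLM_DP_SIZE_PER_NODE comes from the proposal; the function and argument names are illustrative):

```python
import os


def compute_local_rank(dp_rank: int, tp_rank: int, tp_size: int) -> int:
    """Device index of a worker on its node for DP x TP layouts."""
    # Number of DP processes co-located on this node; for multi-node online
    # serving this must be supplied explicitly via VLLM_DP_SIZE_PER_NODE.
    dp_per_node = int(os.getenv("VLLM_DP_SIZE_PER_NODE", "1"))
    dp_rank_local = dp_rank % dp_per_node

    # Each DP process on a node owns a contiguous block of tp_size devices.
    return dp_rank_local * tp_size + tp_rank
```

With DP=2 and TP=1 on one node (VLLM_DP_SIZE_PER_NODE=2), the two workers get local_rank 0 and 1; with one DP process per node (VLLM_DP_SIZE_PER_NODE=1), every worker gets local_rank 0, matching the 2-node online-serving case above.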

  • [new feature] worker/model_runner_v1.py

During chunked prefill with Data Parallelism (DP), it's crucial to ensure that token lengths are consistent across all DP processes. The example provided by vllm-ascend currently assigns the same requests to all DP processes, which overlooks the scenario where each DP process might have different token lengths. This masks the necessity of padding shorter inputs to achieve uniform token lengths. To address this, we've implemented the following:
• We calculate the maximum token length across all DP processes.
• We pad the inputs of the DP processes with shorter batches using token ID 0 (the [PAD] token in the vocabulary) to match this maximum length.
• We updated the testing example examples/dp_offline/data_parallel.py so that each DP process receives different prompts, better reflecting real-world scenarios.
These modifications are included in [PR id].
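
The sketch below illustrates the padding step described above; the function and group names are illustrative, while the max-across-DP rule and the use of token ID 0 as [PAD] come from the change itself.

```python
import torch
import torch.distributed as dist


def pad_to_dp_max(input_ids: torch.Tensor, dp_group) -> torch.Tensor:
    """Pad a 1-D batch of token IDs to the longest batch across DP ranks."""
    num_tokens = torch.tensor([input_ids.shape[0]], dtype=torch.int64)
    dist.all_reduce(num_tokens, op=dist.ReduceOp.MAX, group=dp_group)

    pad_len = int(num_tokens.item()) - input_ids.shape[0]
    if pad_len == 0:
        return input_ids

    # Token ID 0 is the [PAD] token in the vocabulary.
    padding = torch.zeros(pad_len, dtype=input_ids.dtype, device=input_ids.device)
    return torch.cat([input_ids, padding])
```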

Alternatives

No response

Additional context

No response
