[WIP]: Rack Scale Ascend Platform Large-scale MoE Deployment Support #798

Open

Zaragoto opened this issue May 8, 2025 · 0 comments

Zaragoto commented May 8, 2025

🚀 The feature, motivation and pitch

Motivation.

In the real-world deployment of large Mixture-of-Experts (MoE) models like DeepSeek-V3/R1, large-scale Expert Parallelism (EP) support is crucial. While the latest version of vllm-ascend supports a combination of Tensor Parallelism (TP), Expert Parallelism (EP), and Data Parallelism (DP) for distributed parallelism, its DP implementation has limitations that hinder its practical use:

  1. There are bugs in multi-node DP offline inference, which are difficult to identify because of the misleading example provided in examples/dp_offline/data_parallel.py.
  2. There are bugs in single-node DP online serving, also obscured by the same misleading example.
  3. Multi-node DP online serving is not supported.

This issue aims to address and rectify the identified problems with the DP+EP+TP combination in vLLM V1 on the Ascend platform. Below we detail the detected bugs and unsupported features, along with our proposed code modifications to resolve them.

Proposed Changes.

1. Basic bug fixes to make the code work correctly

  • [bug fix] patch/platform/patch_common/patch_distributed.py

The original ascend_stateless_init_dp_group function in patch_distributed.py fails to correctly initialize the dp_group, resulting in the error shown in the screenshot attached to the original issue.
This bug stems from the fact that the ProcessGroup interface and attributes in torch.distributed differ between PyTorch 2.5.1 and PyTorch 2.6.0. The current implementation in vLLM does not account for PyTorch 2.5.1 compatibility. To address this, we have implemented the following modification using a monkey patch: [PR id]
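
As a rough illustration, the fix follows the usual vllm-ascend monkey-patch pattern, gated on the installed PyTorch version. The sketch below shows only the shape of that pattern, not the merged code: the helper names are ours, it assumes the patched entry point is vLLM's ParallelConfig.stateless_init_dp_group, and the PyTorch 2.5.1-compatible group construction itself (passed in as compat_init_dp_group) is omitted.

```python
from packaging.version import Version

import torch


def _is_torch_older_than_2_6() -> bool:
    # Strip any local suffix such as "+cpu" before comparing versions.
    return Version(torch.__version__.split("+")[0]) < Version("2.6.0")


def apply_dp_group_patch(compat_init_dp_group) -> None:
    """Install the PyTorch-2.5.1-compatible DP group init only when needed."""
    if not _is_torch_older_than_2_6():
        return  # PyTorch >= 2.6.0: the upstream vLLM path works as-is

    from vllm.config import ParallelConfig

    # Replace the stateless DP group initializer with the compat implementation.
    ParallelConfig.stateless_init_dp_group = compat_init_dp_group
```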

  • [bug fix] attention/mla_v1.py

The current implementation of mla_v1.py in the latest version contains bugs that lead to dimension mismatch issues within chunked_prefill_mla.

To resolve this, we have reverted the code to commit ID 1a1f9a6, which effectively fixes the bug.

  • [bug fix] ops/fused_moe.py

All-to-all communication is only used in MC2 mode, which is still not fully supported. Until the backend supports this communication, the all-to-all path should be bypassed whenever MC2 is not in use. We adopt the bug fix from PR #710, which addresses the same issue.
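
A hedged sketch of the kind of guard involved (the function and flag names here are illustrative, not the identifiers in ops/fused_moe.py): the all-to-all dispatch is only issued when MC2 mode is active, so the backend is never asked for a collective it does not yet support.

```python
import torch
import torch.distributed as dist


def dispatch_tokens(hidden_states: torch.Tensor, use_mc2: bool) -> torch.Tensor:
    """Token dispatch for fused MoE; all-to-all is only needed in MC2 mode."""
    if not use_mc2:
        # Bypass the collective entirely when MC2 is not in use.
        return hidden_states

    output = torch.empty_like(hidden_states)
    dist.all_to_all_single(output, hidden_states)
    return output
```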

  • [bug fix] worker/worker_v1.py

When DP ranks have different max_tokens parameters, the rank that finishes its forward passes first must wait for the others before terminating; otherwise a communication error occurs. This is handled by calling execute_dummy_batch() to run dummy forward passes on the finished rank. We adopt the bug fix from PR #710, which addresses the same issue.
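
Conceptually, the fix keeps every DP rank in lock-step until all of them are done. The sketch below illustrates the idea only; the loop, scheduler, and helper names are ours, and just execute_dummy_batch() comes from the worker code.

```python
import torch
import torch.distributed as dist


def dp_generation_loop(worker, scheduler, dp_group) -> None:
    """Keep all DP ranks stepping together until every rank has finished."""
    while True:
        has_work = scheduler.has_unfinished_requests()

        # Let every rank learn whether *any* rank still has real work to do.
        flag = torch.tensor([1 if has_work else 0], dtype=torch.int32)
        dist.all_reduce(flag, op=dist.ReduceOp.MAX, group=dp_group)
        if flag.item() == 0:
            break  # all DP ranks finished; now it is safe to terminate

        if has_work:
            worker.execute_model(scheduler.schedule())
        else:
            # A padded no-op forward keeps the collective ops aligned with
            # the ranks that are still generating.
            worker.execute_dummy_batch()
```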

2. Add new features to support DP offline inference

  • [new feature] worker/worker_v1.py

During the initialization of NPUWorker, the local_rank is not correctly assigned to support multi-DP configurations on a single node. For instance, if we have two DP processes on a single node (each with TP=1), the local_rank for the worker in DP process-0 should be 0, and the local_rank for the worker in DP process-1 should be 1. However, the current implementation assigns local_rank = 0 to both DP process-0 and DP process-1.
To address this, we modified the code to update the worker's local_rank with an offset calculated using dp_rank_local and tp_size.
This updated code correctly supports scenarios with multiple DP processes per node, such as a single node with DP=2, or a 2-node setup with DP=4 for offline inference, as long as the DP processes are launched locally on their respective nodes.
However, a bug arises when attempting to support multi-node DP online serving, where all processes are launched on the master node. For example, in a 2-node configuration with DP=2 and TP=1 for online serving, the worker in DP process-0 (on the master node) should have local_rank = 0, and the worker in DP process-1 (on the slave node) should also have local_rank = 0. Because vLLM doesn't fully support multi-node DP online serving, and dp_rank_local is not computed correctly in this scenario, the workers in DP process-0 and DP process-1 are incorrectly assigned local_rank values of 0 and 1, respectively. This leads to an "invalid device ID" error.
To resolve this, we introduced a new environment variable, VLLM_DP_SIZE_PER_NODE. This variable should be set to the number of DP processes per node when running online serving. The code was then modified as follows: [PR ID].
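
As a rough sketch of the assignment described above (only VLLM_DP_SIZE_PER_NODE comes from the proposal; the function and argument names are illustrative):

```python
import os


def compute_local_rank(dp_rank: int, tp_rank: int, tp_size: int) -> int:
    """Device index of a worker on its node for DP x TP layouts."""
    # Number of DP processes co-located on this node; for multi-node online
    # serving this must be supplied explicitly via VLLM_DP_SIZE_PER_NODE.
    dp_per_node = int(os.getenv("VLLM_DP_SIZE_PER_NODE", "1"))
    dp_rank_local = dp_rank % dp_per_node

    # Each DP process on a node owns a contiguous block of tp_size devices.
    return dp_rank_local * tp_size + tp_rank
```

With DP=2 and TP=1 on one node (VLLM_DP_SIZE_PER_NODE=2), the two workers get local_rank 0 and 1; with one DP process per node (VLLM_DP_SIZE_PER_NODE=1), every worker gets local_rank 0, matching the 2-node online-serving case above.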

  • [new feature] worker/model_runner_v1.py

During chunked prefill with Data Parallelism (DP), it's crucial to ensure that token lengths are consistent across all DP processes. The example provided by vllm-ascend currently assigns the same requests to all DP processes, which overlooks the scenario where each DP process might have different token lengths. This masks the necessity of padding shorter inputs to achieve uniform token lengths. To address this, we've implemented the following:
• We calculate the maximum token length across all DP processes.
• We pad the inputs of the DP processes with shorter batches using token ID 0 (the [PAD] token in the vocabulary) to match this maximum length.
• We updated the testing example examples/dp_offline/data_parallel.py so that each DP process receives different prompts, better reflecting real-world scenarios.
These modifications are included in [PR id].
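
The sketch below illustrates the padding step described above; the function and group names are illustrative, while the max-across-DP rule and the use of token ID 0 as [PAD] come from the change itself.

```python
import torch
import torch.distributed as dist


def pad_to_dp_max(input_ids: torch.Tensor, dp_group) -> torch.Tensor:
    """Pad a 1-D batch of token IDs to the longest batch across DP ranks."""
    num_tokens = torch.tensor([input_ids.shape[0]], dtype=torch.int64)
    dist.all_reduce(num_tokens, op=dist.ReduceOp.MAX, group=dp_group)

    pad_len = int(num_tokens.item()) - input_ids.shape[0]
    if pad_len == 0:
        return input_ids

    # Token ID 0 is the [PAD] token in the vocabulary.
    padding = torch.zeros(pad_len, dtype=input_ids.dtype, device=input_ids.device)
    return torch.cat([input_ids, padding])
```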

Alternatives

No response

Additional context

No response
