Motivation.
In the real-world deployment of large Mixture-of-Experts (MoE) models like DeepSeek-V3/R1, large-scale Expert Parallelism (EP) support is crucial. While the latest version of vllm-ascend supports a combination of Tensor Parallelism (TP), Expert Parallelism (EP), and Data Parallelism (DP) for distributed parallelism, its DP implementation has limitations that hinder its practical use:
• There are bugs in multi-node DP offline inference, which are difficult to identify due to the misleading example provided in examples/dp_offline/data_parallel.py.
• There are bugs in single-node DP online serving, also obscured by the misleading example in examples/dp_offline/data_parallel.py.
• Multi-node DP online serving is not supported.
This issue aims to address the identified problems with the DP+EP+TP combination in vLLM V1 on the Ascend platform. Below, we detail the detected bugs and unsupported features, along with our proposed code modifications to resolve them.
Proposed Changes.
1. Basic bugfixes to make sure the code works correctly
[bug fix] patch/platform/patch_common/patch_distributed.py
The original ascend_stateless_init_dp_group function in patch_distributed.py fails to correctly initialize the dp_group and raises an error during DP group creation.
This bug stems from the fact that the ProcessGroup interface and attributes in torch.distributed differ between PyTorch 2.5.1 and PyTorch 2.6.0. The current implementation in vLLM does not account for PyTorch 2.5.1 compatibility. To address this, we have implemented the following modification using a monkey patch: [PR id]
[bug fix] attention/mla_v1.py
The current implementation of mla_v1.py contains bugs that lead to dimension mismatch issues within chunked_prefill_mla.
To resolve this, we have reverted the code to commit ID 1a1f9a6, which effectively fixes the bug.
[bug fix] ops/fused_moe.py
All-to-all communication is only used in MC2 mode, which is still not fully supported. Until the backend supports this communication, the all-to-all path should be bypassed whenever MC2 is not in use. We adopt the bugfix from PR #710, which addresses the same issue.
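A minimal, self-contained sketch of the guard idea, assuming a plain boolean use_mc2 flag; the actual vllm-ascend code keys off its own MoE communication mode and dispatch/combine kernels rather than this helper:

```python
import torch
import torch.distributed as dist


def maybe_all_to_all(hidden_states: torch.Tensor, use_mc2: bool) -> torch.Tensor:
    """Perform the EP all-to-all only when MC2 mode is active; otherwise
    return the input unchanged so the unsupported path is bypassed."""
    if not use_mc2 or not dist.is_initialized():
        return hidden_states
    output = torch.empty_like(hidden_states)
    # Even split of tokens across the EP ranks in the default group.
    dist.all_to_all_single(output, hidden_states)
    return output
```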
[bug fix] worker/worker_v1.py
When DP ranks have different max_tokens parameters, the rank that finishes forwarding first must wait for the others before terminating; otherwise, a communication error occurs. This is handled by execute_dummy_batch(), which runs a dummy forward pass on the rank that has no real work left. We adopt the bugfix from PR #710, which addresses the same issue.
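A rough sketch of the lockstep idea, assuming a CPU (gloo) process group spanning the DP ranks; the function below is illustrative and not the actual NPUWorker code:

```python
import torch
import torch.distributed as dist


def dp_step(has_local_work: bool, run_real_batch, run_dummy_batch) -> bool:
    """One DP scheduling step; returns True while any DP rank still has work.

    Every rank joins the all_reduce and runs a forward pass each step, so
    collectives inside the model stay in lockstep even when one rank's
    requests (e.g. with a larger max_tokens) outlive the others.
    """
    busy = torch.tensor([1 if has_local_work else 0])
    dist.all_reduce(busy, op=dist.ReduceOp.MAX)  # is any DP rank still busy?
    if busy.item() == 0:
        return False                  # all ranks finished; exit together
    if has_local_work:
        run_real_batch()              # normal forward on real requests
    else:
        run_dummy_batch()             # e.g. NPUWorker.execute_dummy_batch()
    return True
```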
2. Add new features to support DP offline inference
[new feature] worker/worker_v1.py
During the initialization of NPUWorker, the local_rank is not correctly assigned to support multi-DP configurations on a single node. For instance, if we have two DP processes on a single node (each with TP=1), the local_rank for the worker in DP process-0 should be 0, and the local_rank for the worker in DP process-1 should be 1. However, the current implementation assigns local_rank = 0 to both DP process-0 and DP process-1.
To address this, we modified the code to update the worker's local_rank with an offset calculated using dp_rank_local and tp_size.
This updated code correctly supports scenarios with multiple DP processes per node, such as a single node with DP=2 or a 2-node setup with DP=4 for offline inference, as long as the DP processes are launched locally on each node.
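A minimal sketch of the offset computation described above, with illustrative function and argument names rather than the actual worker_v1.py code:

```python
def compute_local_rank(dp_rank_local: int, tp_size: int, tp_local_rank: int) -> int:
    """Device index for a worker when several DP groups share one node."""
    return dp_rank_local * tp_size + tp_local_rank


# Single node, DP=2, TP=1: DP process-0 -> device 0, DP process-1 -> device 1.
assert compute_local_rank(dp_rank_local=0, tp_size=1, tp_local_rank=0) == 0
assert compute_local_rank(dp_rank_local=1, tp_size=1, tp_local_rank=0) == 1
```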
However, a bug arises when attempting to support multi-node DP online serving, where all processes are launched on the master node. For example, in a 2-node configuration with DP=2 and TP=1 for online serving, the worker in DP process-0 (on the master node) should have local_rank = 0, and the worker in DP process-1 (on the slave node) should also have local_rank = 0. Because vLLM doesn't fully support multi-node DP online serving, and dp_rank_local is not computed correctly in this scenario, the workers in DP process-0 and DP process-1 are incorrectly assigned local_rank values of 0 and 1, respectively. This leads to an "invalid device ID" error.
To resolve this, we introduced a new environment variable, VLLM_DP_SIZE_PER_NODE. This variable should be set to the number of DP processes per node when running online serving. The code was then modified as follows: [PR ID].
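An illustrative sketch of how VLLM_DP_SIZE_PER_NODE could be consumed; the helper name and default handling here are assumptions, not the actual worker_v1.py change:

```python
import os


def dp_rank_local_from_env(dp_rank: int, dp_size_per_node: int) -> int:
    """Node-local DP rank, overridable via VLLM_DP_SIZE_PER_NODE."""
    per_node = int(os.environ.get("VLLM_DP_SIZE_PER_NODE", dp_size_per_node))
    return dp_rank % per_node


# 2-node online serving with DP=2, TP=1 and one DP process per node:
# both workers map to device 0 on their own node.
os.environ["VLLM_DP_SIZE_PER_NODE"] = "1"
assert dp_rank_local_from_env(dp_rank=0, dp_size_per_node=2) == 0  # master node
assert dp_rank_local_from_env(dp_rank=1, dp_size_per_node=2) == 0  # slave node
```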
[new feature] worker/model_runner_v1.py
During chunked prefill with Data Parallelism (DP), it's crucial to ensure that token lengths are consistent across all DP processes. The example provided by vllm-ascend currently assigns the same requests to all DP processes, which overlooks the scenario where each DP process might have different token lengths. This masks the necessity of padding shorter inputs to achieve uniform token lengths. To address this, we've implemented the following (a rough sketch follows the list):
• We calculate the maximum token length across all DP processes.
• We pad the inputs of other DP processes with token ID 0 (representing the [PAD] token in the vocabulary) to match this maximum length.
• We updated the testing example examples/dp_offline/data_parallel.py to ensure that each DP process receives different prompts, better reflecting real-world scenarios.
These modifications are included in [PR id].
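A rough sketch of the padding step, assuming a gloo DP group is available; the names below are illustrative rather than the actual model_runner_v1.py code:

```python
import torch
import torch.distributed as dist

PAD_TOKEN_ID = 0  # [PAD] token in the vocabulary


def pad_to_max_dp_tokens(input_ids: torch.Tensor, dp_group) -> torch.Tensor:
    """Pad this rank's flattened token batch to the max length over DP ranks."""
    num_tokens = torch.tensor([input_ids.shape[0]])
    # Maximum token count over all DP processes for this step.
    dist.all_reduce(num_tokens, op=dist.ReduceOp.MAX, group=dp_group)
    pad_len = int(num_tokens.item()) - input_ids.shape[0]
    if pad_len > 0:
        padding = torch.full((pad_len,), PAD_TOKEN_ID, dtype=input_ids.dtype)
        input_ids = torch.cat([input_ids, padding])
    return input_ids
```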
Alternatives
No response
Additional context
No response