
Commit 95da7b0

Merge branch 'vllm-project:main' into main
2 parents: 08c5790 + 8326f15

29 files changed: +519 −375 lines changed


.github/workflows/vllm_ascend_test.yaml

Lines changed: 1 addition & 1 deletion
@@ -118,7 +118,7 @@ jobs:
 TORCH_DEVICE_BACKEND_AUTOLOAD: 0
 run: |
 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/devlib
-pytest -sv --cov --cov-report=xml:unittests-coverage.xml tests/ut --ignore=tests/ut/test_platform.py --ignore=tests/ut/ops/test_vocab_parallel_embedding.py
+pytest -sv --cov --cov-report=xml:unittests-coverage.xml tests/ut --ignore=tests/ut/test_platform.py

 - name: Upload coverage to Codecov
 if: ${{ matrix.vllm_version == 'main' }}

docs/source/locale/zh_CN/LC_MESSAGES/user_guide/configuration/additional_config.po

Lines changed: 2 additions & 6 deletions
@@ -148,10 +148,6 @@ msgid ""
 " to be passed in."
 msgstr "在为MOE模型使用专家负载均衡时,需要传入专家映射路径。"

-#: ../../user_guide/configuration/additional_config.md
-msgid "`chunked_prefill_for_mla`"
-msgstr "`chunked_prefill_for_mla`"
-
 #: ../../user_guide/configuration/additional_config.md
 msgid "`False`"
 msgstr "`False`"
@@ -199,8 +195,8 @@ msgid ""
 msgstr "是否将MLA的向量操作放到另一个流中。此选项仅对使用MLA的模型(例如,DeepSeek)有效。"

 #: ../../user_guide/configuration/additional_config.md
-msgid "`enable_multistream_moe`"
-msgstr "`enable_multistream_moe`"
+msgid "`multistream_overlap_shared_expert`"
+msgstr "`multistream_overlap_shared_expert`"

 #: ../../user_guide/configuration/additional_config.md
 msgid ""

docs/source/user_guide/configuration/additional_config.md

Lines changed: 2 additions & 3 deletions
@@ -30,12 +30,12 @@ The following table lists the additional configuration options available in vLLM
 | `ascend_scheduler_config` | dict | `{}` | The config options for ascend scheduler |
 | `refresh` | bool | `false` | Whether to refresh global ascend config content. This value is usually used by rlhf or ut/e2e test case. |
 | `expert_map_path` | str | `None` | When using expert load balancing for the MOE model, an expert map path needs to be passed in. |
-| `chunked_prefill_for_mla` | bool | `False` | Whether to enable the fused operator-like chunked_prefill. |
 | `enable_prefetch` | bool | `False` | Whether to enable weight prefetch. |
 | `kv_cache_dtype` | str | `None` | When using the kv cache quantization method, kv cache dtype needs to be set, currently only int8 is supported. |
 | `enable_shared_expert_dp` | bool | `False` | When the shared expert in DP, it has better performance but consumes more memory. Currently only DeepSeek series models are supported to use. |
 | `lmhead_tensor_parallel_size` | int | `None` | The custom tensor parallel size of lmhead. |
 | `oproj_tensor_parallel_size` | int | `None` | The custom tensor parallel size of oproj. |
+| `multistream_overlap_shared_expert`| bool | `False` | Whether to enable multistream shared expert. This option only takes effects on moe models with shared experts. |

 The details of each config option are as follows:

@@ -46,7 +46,6 @@ The details of each config option are as follows:
 | `enabled` | bool | `False` | Whether to enable torchair graph mode. Currently only DeepSeek series models and PanguProMoE are supported to use torchair graph mode |
 | `mode` | str | `None` | When using reduce-overhead mode for torchair, mode needs to be set |
 | `enable_multistream_mla`| bool | `False` | Whether to put vector ops of MLA to another stream. This option only takes effects on models using MLA (e.g., DeepSeek). |
-| `enable_multistream_moe`| bool | `False` | Whether to enable multistream shared expert. This option only takes effects on DeepSeek moe models. |
 | `enable_view_optimize` | bool | `True` | Whether to enable torchair view optimization |
 | `enable_frozen_parameter` | bool | `True` | Whether to fix the memory address of weights during inference to reduce the input address refresh time during graph execution. |
 | `use_cached_graph` | bool | `False` | Whether to use cached graph |
@@ -75,13 +74,13 @@ An example of additional configuration is as follows:
         "use_cached_graph": True,
         "graph_batch_sizes": [1, 2, 4, 8],
         "graph_batch_sizes_init": False,
-        "enable_multistream_moe": False,
         "enable_kv_nz": False
     },
     "ascend_scheduler_config": {
         "enabled": True,
         "enable_chunked_prefill": True,
     },
+    "multistream_overlap_shared_expert": True,
     "refresh": False,
 }
 ```
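As a quick orientation for the rename, here is a minimal offline sketch (not part of this commit) showing the option at its new top-level location in `additional_config`. It assumes the offline `LLM` entry point forwards `additional_config` to the engine, as the vLLM Ascend examples do; the model name is only a placeholder for an MoE model with shared experts.

```python
from vllm import LLM, SamplingParams

# Sketch: `multistream_overlap_shared_expert` is now a top-level key of
# additional_config rather than a field of torchair_graph_config.
llm = LLM(
    model="Qwen/Qwen1.5-MoE-A2.7B",  # placeholder MoE model with shared experts
    additional_config={
        "ascend_scheduler_config": {"enabled": True},
        "multistream_overlap_shared_expert": True,
    },
)
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```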

examples/disaggregated_prefill_v1/README.md

Lines changed: 2 additions & 6 deletions
@@ -70,9 +70,7 @@ vllm serve /models/deepseek_r1_w8a8 \
     "kv_port": "20001",
     "engine_id": "0",
     "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
-  }' \
-  --additional-config \
-  '{"chunked_prefill_for_mla":true}'
+  }'
 ```

 Run prefill server P2 on second node:
@@ -114,9 +112,7 @@ vllm serve /models/deepseek_r1_w8a8 \
     "kv_port": "20001",
     "engine_id": "0",
     "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
-  }' \
-  --additional-config \
-  '{"chunked_prefill_for_mla":true}'
+  }'
 ```

 Run decode server d1 on third node:

examples/external_online_dp/run_dp_template.sh

Lines changed: 1 addition & 1 deletion
@@ -43,4 +43,4 @@ vllm serve model_path \
     "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
   }' \
   --additional-config \
-  '{"ascend_scheduler_config": {"enabled": true}, "torchair_graph_config":{"enabled":true,"enable_kv_nz":false, "enable_multistream_moe":false, "graph_batch_size":[28]}, "enable_weight_nz_layout":true}'
+  '{"ascend_scheduler_config": {"enabled": true}, "torchair_graph_config":{"enabled":true,"enable_kv_nz":false, "graph_batch_size":[28]}, "enable_weight_nz_layout":true, "enable_multistream_moe":false}'

examples/run_dp_server.sh

Lines changed: 1 addition & 1 deletion
@@ -29,4 +29,4 @@ vllm serve Qwen/Qwen1.5-MoE-A2.7B \
   --gpu-memory-utilization 0.9 \
   --trust-remote-code \
   --enforce-eager \
-  --additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":false, "enable_multistream_moe":false, "use_cached_graph":false}}'
+  --additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":false, "use_cached_graph":false}}'

tests/e2e/models/configs/DeepSeek-V2-Lite.yaml

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,8 @@ tasks:
   - name: "exact_match,flexible-extract"
     value: 0.375
 tensor_parallel_size: 2
+batch_size: 8
+gpu_memory_utilization: 0.7
 apply_chat_template: False
 fewshot_as_multiturn: False
 trust_remote_code: True

tests/e2e/models/test_lm_eval_correctness.py

Lines changed: 2 additions & 2 deletions
@@ -84,7 +84,7 @@ def generate_report(tp_size, eval_config, report_data, report_dir, env_config):
         apply_chat_template=eval_config.get("apply_chat_template", True),
         fewshot_as_multiturn=eval_config.get("fewshot_as_multiturn", True),
         limit=eval_config.get("limit", "N/A"),
-        batch_size="auto",
+        batch_size=eval_config.get("batch_size", "auto"),
         num_fewshot=eval_config.get("num_fewshot", "N/A"),
         rows=report_data["rows"],
         parallel_mode=parallel_mode)
@@ -110,7 +110,7 @@ def test_lm_eval_correctness_param(config_filename, tp_size, report_dir,
         "apply_chat_template": eval_config.get("apply_chat_template", True),
         "fewshot_as_multiturn": eval_config.get("fewshot_as_multiturn", True),
         "limit": eval_config.get("limit", None),
-        "batch_size": "auto",
+        "batch_size": eval_config.get("batch_size", "auto"),
     }
     for s in ["num_fewshot", "fewshot_as_multiturn", "apply_chat_template"]:
         val = eval_config.get(s, None)
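For context on the test change above, a small standalone sketch (hypothetical script, not part of the commit) of how the new YAML key is read: `batch_size` is now taken from the per-model config, falling back to `"auto"` when absent.

```python
import yaml

# Read the per-model eval config extended above.
with open("tests/e2e/models/configs/DeepSeek-V2-Lite.yaml") as f:
    eval_config = yaml.safe_load(f)

# Mirrors the updated test logic: prefer the YAML value, else "auto".
batch_size = eval_config.get("batch_size", "auto")  # -> 8 for DeepSeek-V2-Lite
print(batch_size)
```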

tests/e2e/multicard/test_offline_inference_distributed.py

Lines changed: 1 addition & 1 deletion
@@ -66,8 +66,8 @@ def test_models_distributed_DeepSeek_multistream_moe():
         additional_config={
             "torchair_graph_config": {
                 "enabled": True,
-                "enable_multistream_moe": True,
             },
+            "enable_multistream_moe": True,
             "ascend_scheduler_config": {
                 "enabled": True,
             },
tests/e2e/singlecard/test_multistream_overlap_shared_expert.py

Lines changed: 103 additions & 0 deletions

@@ -0,0 +1,103 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+# Copyright 2023 The vLLM team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+"""
+Compare the outputs of vLLM with multistream_overlap_shared_expert
+enabled and disabled.
+
+Run `pytest tests/e2e/singlecard/test_multistream_overlap_shared_expert.py`.
+"""
+
+import pytest
+from vllm import SamplingParams
+
+from tests.e2e.conftest import VllmRunner
+from tests.e2e.model_utils import check_outputs_equal
+
+MODELS = [
+    "Qwen/Qwen3-0.6B",
+]
+
+
+@pytest.mark.parametrize("model", MODELS)
+@pytest.mark.parametrize("max_tokens", [32])
+def test_models_with_multistream_overlap_shared_expert(
+    model: str,
+    max_tokens: int,
+) -> None:
+    prompts = [
+        "Hello, my name is", "The president of the United States is",
+        "The capital of France is", "The future of AI is"
+    ]
+
+    sampling_params = SamplingParams(max_tokens=max_tokens, temperature=0.0)
+    with VllmRunner(
+            model,
+            max_model_len=1024,
+            enforce_eager=True,
+            additional_config={
+                "multistream_overlap_shared_expert": True,
+            },
+    ) as runner:
+        vllm_moe_ms_eager_outputs = runner.model.generate(
+            prompts, sampling_params)
+
+    with VllmRunner(
+            model,
+            max_model_len=1024,
+            enforce_eager=False,
+            additional_config={
+                "multistream_overlap_shared_expert": True,
+            },
+    ) as runner:
+        vllm_moe_ms_aclgraph_outputs = runner.model.generate(
+            prompts, sampling_params)
+
+    with VllmRunner(
+            model,
+            max_model_len=1024,
+            enforce_eager=True,
+    ) as runner:
+        vllm_eager_outputs = runner.model.generate(prompts, sampling_params)
+
+    vllm_moe_ms_eager_outputs_list = []
+    for output in vllm_moe_ms_eager_outputs:
+        vllm_moe_ms_eager_outputs_list.append(
+            (output.outputs[0].index, output.outputs[0].text))
+
+    vllm_moe_ms_aclgraph_outputs_list = []
+    for output in vllm_moe_ms_aclgraph_outputs:
+        vllm_moe_ms_aclgraph_outputs_list.append(
+            (output.outputs[0].index, output.outputs[0].text))
+
+    vllm_eager_outputs_list = []
+    for output in vllm_eager_outputs:
+        vllm_eager_outputs_list.append(
+            (output.outputs[0].index, output.outputs[0].text))
+
+    check_outputs_equal(
+        outputs_0_lst=vllm_eager_outputs_list,
+        outputs_1_lst=vllm_moe_ms_eager_outputs_list,
+        name_0="vllm_eager_outputs",
+        name_1="vllm_moe_ms_eager_outputs",
+    )
+
+    check_outputs_equal(
+        outputs_0_lst=vllm_eager_outputs_list,
+        outputs_1_lst=vllm_moe_ms_aclgraph_outputs_list,
+        name_0="vllm_eager_outputs",
+        name_1="vllm_moe_ms_aclgraph_outputs",
+    )
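A note on the new test's design: all three runs use greedy decoding (`temperature=0.0`), so `check_outputs_equal` can require that the multistream-overlap outputs, in both eager and ACL-graph mode, match the plain eager baseline exactly rather than within a tolerance.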
