
Commit b472d8f

Merge branch 'vllm-project:main' into main

2 parents: cd9c2f8 + dd087ef
38 files changed: +1055 -979 lines

README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -42,7 +42,7 @@ By using vLLM Ascend plugin, popular open-source models, including Transformer-l
 - OS: Linux
 - Software:
   * Python >= 3.9, < 3.12
-  * CANN >= 8.2.rc1
+  * CANN >= 8.2.rc1 (see [here](https://www.hiascend.com/document/detail/zh/canncommercial/82RC1/releasenote/releasenote_0000.html) for the matching Ascend HDK version)
   * PyTorch >= 2.7.1, torch-npu >= 2.7.1.dev20250724
   * vLLM (the same version as vllm-ascend)
```

README.zh.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -43,7 +43,7 @@ The vLLM Ascend plugin (`vllm-ascend`) is a community-maintained plugin for running vLLM on Ascend NP
 - OS: Linux
 - Software:
   * Python >= 3.9, < 3.12
-  * CANN >= 8.2.rc1
+  * CANN >= 8.2.rc1 (see [here](https://www.hiascend.com/document/detail/zh/canncommercial/82RC1/releasenote/releasenote_0000.html) for the matching Ascend HDK version)
   * PyTorch >= 2.7.1, torch-npu >= 2.7.1.dev20250724
   * vLLM (same version as vllm-ascend)
```

benchmarks/scripts/run-performance-benchmarks.sh

Lines changed: 3 additions & 1 deletion

```diff
@@ -78,7 +78,9 @@ kill_npu_processes() {
   ps -aux
   lsof -t -i:8000 | xargs -r kill -9
   pgrep python3 | xargs -r kill -9
-
+  # vLLM now names the process with a VLLM prefix after https://github.com/vllm-project/vllm/pull/21445
+  pgrep VLLM | xargs -r kill -9
+
   sleep 4
   rm -rf ~/.config/vllm
```
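As a quick sanity check after the cleanup runs (a hedged sketch using standard procps flags, not part of the script itself):

```bash
# List any surviving python3/VLLM processes; print "clean" if none remain
pgrep -a 'python3|VLLM' || echo "clean"
```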

benchmarks/tests/serving-tests.json

Lines changed: 2 additions & 1 deletion

```diff
@@ -23,7 +23,8 @@
         "hf_split": "train",
         "endpoint": "/v1/chat/completions",
         "dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
-        "num_prompts": 200
+        "num_prompts": 200,
+        "no_stream": ""
       }
     },
     {
```
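The empty-string value appears to follow the benchmark harness's convention for boolean flags: presumably it makes the client pass a bare `--no-stream` to the serving benchmark rather than a key-value pair.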

docs/source/installation.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -11,6 +11,7 @@ This document describes how to install vllm-ascend manually.
 
 | Software   | Supported version | Note |
 |------------|-------------------|------|
+| Ascend HDK | Refer to [here](https://www.hiascend.com/document/detail/zh/canncommercial/82RC1/releasenote/releasenote_0000.html) | Required for CANN |
 | CANN       | >= 8.2.RC1        | Required for vllm-ascend and torch-npu |
 | torch-npu  | >= 2.7.1.dev20250724 | Required for vllm-ascend; no need to install manually, it will be installed automatically in the steps below |
 | torch      | >= 2.7.1          | Required for torch-npu and vllm |
```

docs/source/tutorials/index.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -15,4 +15,5 @@ multi_npu_quantization
 single_node_300i
 multi_node
 multi_node_kimi
+multi_node_pd_disaggregation
 :::
```
docs/source/tutorials/multi_node_pd_disaggregation.md

Lines changed: 244 additions & 0 deletions (new file)

# Prefill-Decode Disaggregation Verification (Qwen)

## Getting Started

vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide walks through verifying these features step by step with constrained resources.

Take the Qwen3-30B-A3B model as an example: we use vllm-ascend v0.10.1rc1 (with vLLM v0.10.1.1) on 3 Atlas 800T A2 servers to deploy the "1P2D" architecture. Assume the IP address of the prefiller server is 192.0.0.1, and the decoder servers are 192.0.0.2 (decoder 1) and 192.0.0.3 (decoder 2). On each server, 2 NPUs are used to deploy one service instance, as summarized below.
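For reference, the layout used throughout this guide (the device IDs are the ones chosen in the ranktable step below):

| Node | Role | IP address | NPU devices used |
| --- | --- | --- | --- |
| Server 1 | Prefiller | 192.0.0.1 | 0, 1 |
| Server 2 | Decoder 1 | 192.0.0.2 | 6, 7 |
| Server 3 | Decoder 2 | 192.0.0.3 | 6, 7 |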
## Verify Multi-Node Communication Environment

### Physical Layer Requirements

- The physical machines must be located on the same LAN, with network connectivity between them.
- All NPUs must be interconnected: intra-node connectivity is via HCCS, and inter-node connectivity is via RDMA.

### Verification Process

1. Single-Node Verification:

   Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:

   ```bash
   # Check the remote switch ports
   for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
   # Get the link status of the Ethernet ports (UP or DOWN)
   for i in {0..7}; do hccn_tool -i $i -link -g; done
   # Check the network health status
   for i in {0..7}; do hccn_tool -i $i -net_health -g; done
   # View the network detected IP configuration
   for i in {0..7}; do hccn_tool -i $i -netdetect -g; done
   # View gateway configuration
   for i in {0..7}; do hccn_tool -i $i -gateway -g; done
   # View NPU network configuration
   cat /etc/hccn.conf
   ```
2. Get NPU IP Addresses:

   ```bash
   for i in {0..7}; do hccn_tool -i $i -ip -g; done
   ```

3. Cross-Node PING Test:

   ```bash
   # Execute on the target node (replace 'x.x.x.x' with an actual NPU IP address from step 2)
   for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x; done
   ```
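As a minimal sketch of looping the test over several remote NPUs at once (the 10.0.1.x addresses are hypothetical placeholders; substitute the IPs printed in step 2 on the remote nodes):

```bash
# Hypothetical remote NPU IPs; replace with the real ones from step 2
for ip in 10.0.1.6 10.0.1.7; do
  for i in {0..7}; do hccn_tool -i $i -ping -g address $ip; done
done
```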
## Generate Ranktable

The rank table is a JSON file that specifies the mapping of Ascend NPU ranks to nodes. For more details please refer to the [vllm-ascend examples](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/README.md). Execute the following commands for reference.

```shell
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash gen_ranktable.sh --ips <prefiller_node1_local_ip> <prefiller_node2_local_ip> <decoder_node1_local_ip> <decoder_node2_local_ip> \
  --npus-per-node <npus_per_node> --network-card-name <nic_name> --prefill-device-cnt <prefill_npu_cnt> --decode-device-cnt <decode_npu_cnt> \
  [--local-device-ids <id_1>,<id_2>,<id_3>...]
```

Assume that we use devices 0,1 on the prefiller node and devices 6,7 on both decoder nodes. Take the following commands as an example (`--local-device-ids` is required when only some of the local node's NPU devices are used):

```shell
# On the prefiller node
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash gen_ranktable.sh --ips 192.0.0.1 192.0.0.2 192.0.0.3 \
  --npus-per-node 2 --network-card-name eth0 --prefill-device-cnt 2 --decode-device-cnt 4 --local-device-ids 0,1

# On decoder 1
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash gen_ranktable.sh --ips 192.0.0.1 192.0.0.2 192.0.0.3 \
  --npus-per-node 2 --network-card-name eth0 --prefill-device-cnt 2 --decode-device-cnt 4 --local-device-ids 6,7

# On decoder 2
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash gen_ranktable.sh --ips 192.0.0.1 192.0.0.2 192.0.0.3 \
  --npus-per-node 2 --network-card-name eth0 --prefill-device-cnt 2 --decode-device-cnt 4 --local-device-ids 6,7
```

The rank table will be generated at `/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json`.

| Parameter | Meaning |
| --- | --- |
| `--ips` | Each node's local IP (prefiller nodes must come before decoder nodes) |
| `--npus-per-node` | Number of NPU chips on each node |
| `--network-card-name` | The physical machine's NIC name |
| `--prefill-device-cnt` | Number of NPU chips used for prefill |
| `--decode-device-cnt` | Number of NPU chips used for decode |
| `--local-device-ids` | Optional; only needed when not using all devices on the local node |
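For orientation, here is a hedged sketch of one device entry in the generated file, based on the fields that `gen_ranktable.py` emits in this commit (`server_id`, `device_id`, `device_ip`, plus `super_pod_id`/`super_device_id` on A3 hardware). The device IP is a placeholder, and the exact top-level layout may differ between versions:

```json
{
  "server_id": "192.0.0.1",
  "device_id": "0",
  "device_ip": "10.0.1.0"
}
```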
## Prefiller / Decoder Deployment

Run the following scripts to launch a server on the prefiller node and on each decoder node, respectively.
:::::{tab-set}

::::{tab-item} Prefiller node

```shell
export HCCL_IF_IP=192.0.0.1 # node IP
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/path/to/your/generated/ranktable.json"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1

vllm serve /model/Qwen3-30B-A3B \
  --host 0.0.0.0 \
  --port 13700 \
  --tensor-parallel-size 2 \
  --no-enable-prefix-caching \
  --seed 1024 \
  --served-model-name qwen3-moe \
  --max-model-len 6144 \
  --max-num-batched-tokens 6144 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --enable-expert-parallel \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
    "kv_buffer_device": "npu",
    "kv_role": "kv_producer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }' \
  --additional-config \
  '{"torchair_graph_config": {"enabled":false, "enable_multistream_shared_expert":false}, "ascend_scheduler_config":{"enabled":true, "enable_chunked_prefill":false}}' \
  --enforce-eager
```

::::

::::{tab-item} Decoder node 1

```shell
export HCCL_IF_IP=192.0.0.2 # node IP
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/path/to/your/generated/ranktable.json"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1

vllm serve /model/Qwen3-30B-A3B \
  --host 0.0.0.0 \
  --port 13700 \
  --no-enable-prefix-caching \
  --tensor-parallel-size 2 \
  --seed 1024 \
  --served-model-name qwen3-moe \
  --max-model-len 6144 \
  --max-num-batched-tokens 6144 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --enable-expert-parallel \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
    "kv_buffer_device": "npu",
    "kv_role": "kv_consumer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }' \
  --additional-config \
  '{"torchair_graph_config": {"enabled":false, "enable_multistream_shared_expert":false}, "ascend_scheduler_config":{"enabled":true, "enable_chunked_prefill":false}}'
```

::::

::::{tab-item} Decoder node 2

```shell
export HCCL_IF_IP=192.0.0.3 # node IP
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/path/to/your/generated/ranktable.json"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1

vllm serve /model/Qwen3-30B-A3B \
  --host 0.0.0.0 \
  --port 13700 \
  --no-enable-prefix-caching \
  --tensor-parallel-size 2 \
  --seed 1024 \
  --served-model-name qwen3-moe \
  --max-model-len 6144 \
  --max-num-batched-tokens 6144 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --enable-expert-parallel \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
    "kv_buffer_device": "npu",
    "kv_role": "kv_consumer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }' \
  --additional-config \
  '{"torchair_graph_config": {"enabled":false, "enable_multistream_shared_expert":false}, "ascend_scheduler_config":{"enabled":true, "enable_chunked_prefill":false}}'
```

::::

:::::
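Before putting the proxy in front of the instances, you can optionally confirm that each one is serving. This hedged check assumes the standard vLLM OpenAI-compatible endpoints and is not part of the original deployment scripts; the served model name should appear in each response:

```shell
curl http://192.0.0.1:13700/v1/models
curl http://192.0.0.2:13700/v1/models
curl http://192.0.0.3:13700/v1/models
```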
## Example Proxy for Deployment

Run a proxy server on the same node as the prefiller service instance. You can get the proxy program from the repository's examples: [load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)

```shell
python load_balance_proxy_server_example.py \
  --host 192.0.0.1 \
  --port 8080 \
  --prefiller-hosts 192.0.0.1 \
  --prefiller-port 13700 \
  --decoder-hosts 192.0.0.2 192.0.0.3 \
  --decoder-ports 13700 13700
```
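Note that `--decoder-hosts` and `--decoder-ports` are paired positionally, so each decoder host needs a matching port entry.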
## Verification

Check service health using the proxy server endpoint:

```shell
curl http://192.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-moe",
    "prompt": "Who are you?",
    "max_tokens": 100,
    "temperature": 0
  }'
```

docs/source/user_guide/configuration/additional_config.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -35,6 +35,7 @@ The following table lists the additional configuration options available in vLLM
 | `kv_cache_dtype` | str | `None` | When using the kv cache quantization method, the kv cache dtype needs to be set; currently only int8 is supported. |
 | `enable_shared_expert_dp` | bool | `False` | When the shared expert runs in DP, it has better performance but consumes more memory. Currently only DeepSeek series models are supported. |
 | `lmhead_tensor_parallel_size` | int | `None` | The custom tensor parallel size of lmhead. |
+| `oproj_tensor_parallel_size` | int | `None` | The custom tensor parallel size of oproj. |
 
 The details of each config option are as follows:
```
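For context, a hedged sketch of how such an option is passed, mirroring the `--additional-config` usage shown in the PD tutorial above (the model path and the value 2 are illustrative, not a recommendation):

```shell
vllm serve /model/Qwen3-30B-A3B \
  --tensor-parallel-size 2 \
  --additional-config '{"oproj_tensor_parallel_size": 2}'
```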

examples/disaggregated_prefill_v1/gen_ranktable.py

Lines changed: 41 additions & 29 deletions

```diff
@@ -17,6 +17,10 @@
                     type=int,
                     required=True,
                     help="number of decode devices")
+parser.add_argument("--local-device-ids",
+                    type=str,
+                    required=False,
+                    help="comma-separated local device ids")
 args = parser.parse_args()
 local_host = args.local_host
 prefill_device_cnt = args.prefill_device_cnt
@@ -54,39 +58,47 @@ def get_cmd_stdout(cmd):
                                 "\n")[0].split(":")[1].strip()
 chips_per_card = int(chips_per_card)
 
+if args.local_device_ids:
+    local_device_ids = args.local_device_ids.split(',')
+else:
+    local_device_ids = []
+    for card_id in range(num_cards):
+        for chip_id in range(chips_per_card):
+            device_id = card_id * chips_per_card + chip_id
+            local_device_ids.append(device_id)
+
 # generate local device list for local rank 0, and gather it to all ranks
 local_device_list: list[dict[str, str]] = list()
 if local_rank == "0":
     super_pod_id = "0"
-    for card_id in range(num_cards):
-        for chip_id in range(chips_per_card):
-            device_id = card_id * chips_per_card + chip_id
-            if soc_info == AscendSocVersion.A3:
-                device_ip = get_cmd_stdout(
-                    f"{hccn_tool_path} -i {device_id} -vnic -g | grep ipaddr"
-                ).split(":")[1].strip()
-                super_device_id = get_cmd_stdout(
-                    f"npu-smi info -t spod-info -i {card_id} -c {chip_id} | grep SDID"
-                ).split(":")[1].strip()
-                super_pod_id = get_cmd_stdout(
-                    f"npu-smi info -t spod-info -i {card_id} -c {chip_id} | grep \"Super Pod ID\""
-                ).split(":")[1].strip()
-            else:
-                device_ip = get_cmd_stdout(
-                    f"{hccn_tool_path} -i {device_id} -ip -g | grep ipaddr"
-                ).split(":")[1].strip()
-
-            device_info = {
-                "server_id": local_host,
-                "device_id": str(device_id),
-                "device_ip": str(device_ip),
-            }
-            if soc_info == AscendSocVersion.A3:
-                device_info.update({
-                    "super_pod_id": str(super_pod_id),
-                    "super_device_id": str(super_device_id)
-                })
-            local_device_list.append(device_info)
+    for device_id in local_device_ids:
+        # card/chip indices are needed by the npu-smi queries below
+        card_id, chip_id = divmod(int(device_id), chips_per_card)
+        if soc_info == AscendSocVersion.A3:
+            device_ip = get_cmd_stdout(
+                f"{hccn_tool_path} -i {device_id} -vnic -g | grep ipaddr"
+            ).split(":")[1].strip()
+            super_device_id = get_cmd_stdout(
+                f"npu-smi info -t spod-info -i {card_id} -c {chip_id} | grep SDID"
+            ).split(":")[1].strip()
+            super_pod_id = get_cmd_stdout(
+                f"npu-smi info -t spod-info -i {card_id} -c {chip_id} | grep \"Super Pod ID\""
+            ).split(":")[1].strip()
+        else:
+            device_ip = get_cmd_stdout(
+                f"{hccn_tool_path} -i {device_id} -ip -g | grep ipaddr"
+            ).split(":")[1].strip()
+
+        device_info = {
+            "server_id": local_host,
+            "device_id": str(device_id),
+            "device_ip": str(device_ip),
+        }
+        if soc_info == AscendSocVersion.A3:
+            device_info.update({
+                "super_pod_id": str(super_pod_id),
+                "super_device_id": str(super_device_id)
+            })
+        local_device_list.append(device_info)
 
 dist.init_process_group(backend=dist.Backend.GLOO)
 global_device_list = [None] * dist.get_world_size()
```