
Commit 4ae3893

Merge branch 'vllm-project:main' into main

2 parents 8a24711 + d01fd1d, commit 4ae3893


47 files changed (+1891, -272 lines)

.github/workflows/format_pr_body.yaml

Lines changed: 1 addition & 1 deletion

@@ -36,7 +36,7 @@ jobs:

      - name: Get vLLM version
        run: |
-          VLLM_COMMIT=9607d5eb449711b349d4c2bee0a9c94afcc7ed14
+          VLLM_COMMIT=f225ea7dd98e9f29752e5c032cd4a8ee1d712f16
          echo "VLLM_COMMIT=https://github.com/vllm-project/vllm/commit/$VLLM_COMMIT" >> $GITHUB_ENV

      - name: Checkout repository

.github/workflows/vllm_ascend_test.yaml

Lines changed: 3 additions & 3 deletions

@@ -42,7 +42,7 @@ jobs:
  lint:
    uses: ./.github/workflows/pre-commit.yml
    with:
-      vllm: 9607d5eb449711b349d4c2bee0a9c94afcc7ed14
+      vllm: f225ea7dd98e9f29752e5c032cd4a8ee1d712f16

  changes:
    runs-on: ubuntu-latest
@@ -83,7 +83,7 @@ jobs:
      VLLM_USE_MODELSCOPE: True
    strategy:
      matrix:
-        vllm_version: [9607d5eb449711b349d4c2bee0a9c94afcc7ed14, v0.10.2]
+        vllm_version: [f225ea7dd98e9f29752e5c032cd4a8ee1d712f16, v0.10.2]
    steps:
      - name: Install packages
        run: |
@@ -138,7 +138,7 @@ jobs:
    name: e2e-light
    strategy:
      matrix:
-        vllm_version: [9607d5eb449711b349d4c2bee0a9c94afcc7ed14, v0.10.2]
+        vllm_version: [f225ea7dd98e9f29752e5c032cd4a8ee1d712f16, v0.10.2]
    # Note (yikun): If CI resource are limited we can split job into two chain jobs
    needs: [lint, changes]
    # only trigger e2e test after lint passed and the change is e2e related with pull request.

.github/workflows/vllm_ascend_test_full.yaml

Lines changed: 1 addition & 1 deletion

@@ -68,7 +68,7 @@ jobs:
    name: e2e-full
    strategy:
      matrix:
-        vllm_version: [9607d5eb449711b349d4c2bee0a9c94afcc7ed14, v0.10.2]
+        vllm_version: [f225ea7dd98e9f29752e5c032cd4a8ee1d712f16, v0.10.2]
    needs: [changes]
    if: ${{ needs.changes.outputs.e2e_tracker == 'true' }}
    uses: ./.github/workflows/_e2e_test.yaml

docs/source/developer_guide/modeling/adding_a_new_model.md

Lines changed: 0 additions & 1 deletion

@@ -61,7 +61,6 @@ from torch import nn
from vllm.attention import Attention
from vllm.config import VllmConfig
from vllm.sequence import IntermediateTensors
-from vllm.model_executor.sampling_metadata import SamplingMetadata

class CustomAttention(nn.Module):
    def __init__(self, vllm_config: VllmConfig, prefix: str):

docs/source/tutorials/multi_node_ray.md

Lines changed: 9 additions & 7 deletions

@@ -91,7 +91,7 @@ After setting up the containers and installing vllm-ascend on each node, follow

Choose one machine as the head node and the others as worker nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).

-Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues. The `--num-gpus` parameter defines the number of NPUs to be used on each node.
+Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues.

Below are the commands for the head and worker nodes:

@@ -109,7 +109,7 @@ export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
-ray start --head --num-gpus=8
+ray start --head
```

**Worker node**:
@@ -125,20 +125,22 @@ export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
-ray start --address='{head_node_ip}:6379' --num-gpus=8 --node-ip-address={local_ip}
+ray start --address='{head_node_ip}:6379' --node-ip-address={local_ip}
```

Once the cluster is started on multiple nodes, execute `ray status` and `ray list nodes` to verify the Ray cluster's status. You should see the correct number of nodes and NPUs listed.

-## Start the Online Inference Service on multinode
-In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster. You only need to run the vllm command on one node.
+## Start the Online Inference Service on multinode scenario
+In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster.
+
+**You only need to run the vllm command on one node.**

To set up parallelism, the common practice is to set the `tensor-parallel-size` to the number of NPUs per node, and the `pipeline-parallel-size` to the number of nodes.

For example, with 16 NPUs across 2 nodes (8 NPUs per node), set the tensor parallel size to 8 and the pipeline parallel size to 2:

```shell
-vllm Qwen/Qwen3-235B-A22B \
+vllm serve Qwen/Qwen3-235B-A22B \
    --distributed-executor-backend ray \
    --pipeline-parallel-size 2 \
    --tensor-parallel-size 8 \
@@ -154,7 +156,7 @@ vllm Qwen/Qwen3-235B-A22B \
Alternatively, if you want to use only tensor parallelism, set the tensor parallel size to the total number of NPUs in the cluster. For example, with 16 NPUs across 2 nodes, set the tensor parallel size to 16:

```shell
-vllm Qwen/Qwen3-235B-A22B \
+vllm serve Qwen/Qwen3-235B-A22B \
    --distributed-executor-backend ray \
    --tensor-parallel-size 16 \
    --enable-expert-parallel \
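
The parallelism guidance in the tutorial diff above boils down to one invariant: `tensor-parallel-size` times `pipeline-parallel-size` equals the total NPU count in the Ray cluster. A minimal Python sketch of that arithmetic for the 2-node, 8-NPU-per-node example (the `parallel_sizes` helper is hypothetical, not part of this commit or of vllm-ascend):

```python
# Hypothetical helper: illustrates the sizing rule from the tutorial above.
# tensor-parallel-size = NPUs per node, pipeline-parallel-size = number of nodes,
# so tp * pp always equals the total number of NPUs in the cluster.

def parallel_sizes(num_nodes: int, npus_per_node: int) -> tuple[int, int]:
    """Return (tensor_parallel_size, pipeline_parallel_size)."""
    return npus_per_node, num_nodes


tp, pp = parallel_sizes(num_nodes=2, npus_per_node=8)
assert tp * pp == 16  # 16 NPUs across 2 nodes -> TP=8, PP=2
print(f"--tensor-parallel-size {tp} --pipeline-parallel-size {pp}")
```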

examples/offline_disaggregated_prefill_npu.py

Lines changed: 1 addition & 1 deletion

@@ -79,7 +79,7 @@ def run_prefill(prefill_done, process_close):


def run_decode(prefill_done):
-    os.environ['VLLM_LLMDD_RPC_PORT'] = '6634'
+    os.environ['VLLM_ASCEND_LLMDD_RPC_PORT'] = '6634'
    # ranktable.json needs be generated using gen_ranktable.sh
    # from the examples/disaggregated_prefill_v1 module in the main branch.
    os.environ['DISAGGREGATED_PREFILL_RANK_TABLE_PATH'] = "./ranktable.json"
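
Scripts that still export the old variable name will silently lose their port override after this rename. A minimal compatibility sketch, assuming a caller wants to honor both names during the transition (the fallback logic is illustrative only and not part of the commit):

```python
import os

# Illustrative migration shim (not from this commit): prefer the new
# VLLM_ASCEND_LLMDD_RPC_PORT name, but fall back to the legacy
# VLLM_LLMDD_RPC_PORT if an older environment still sets it.
rpc_port = os.environ.get("VLLM_ASCEND_LLMDD_RPC_PORT",
                          os.environ.get("VLLM_LLMDD_RPC_PORT", "6634"))
os.environ["VLLM_ASCEND_LLMDD_RPC_PORT"] = rpc_port
print(f"disaggregated prefill RPC port: {rpc_port}")
```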

tests/e2e/model_utils.py

Lines changed: 6 additions & 1 deletion

@@ -19,7 +19,12 @@

from typing import Dict, List, Optional, Sequence, Tuple, Union

-from vllm.sequence import PromptLogprobs, SampleLogprobs
+from vllm_ascend.utils import vllm_version_is
+
+if vllm_version_is("0.10.2"):
+    from vllm.sequence import PromptLogprobs, SampleLogprobs
+else:
+    from vllm.logprobs import PromptLogprobs, SampleLogprobs

TokensText = Tuple[List[int], str]
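
This version gate tracks an upstream module move: vLLM 0.10.2 still exposes the logprob types in `vllm.sequence`, while newer revisions provide them in `vllm.logprobs`. Below is a rough standalone sketch of the same gating idea using `importlib.metadata`; the real `vllm_version_is` helper in `vllm_ascend.utils` may be implemented differently.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version_is(package: str, expected: str) -> bool:
    """Return True if `package` is installed at exactly version `expected`."""
    try:
        return version(package) == expected
    except PackageNotFoundError:
        return False


# Mirrors the import gate in the diff: pick the import location that matches
# the installed vLLM release.
if installed_version_is("vllm", "0.10.2"):
    from vllm.sequence import PromptLogprobs, SampleLogprobs
else:
    from vllm.logprobs import PromptLogprobs, SampleLogprobs
```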

tests/e2e/pd_disaggreate/run_edge_case_test.sh

Lines changed: 2 additions & 2 deletions

@@ -70,7 +70,7 @@ run_tests_for_model() {
    # Start prefill instance
    PREFILL_PORT=8001

-    BASE_CMD="ASCEND_RT_VISIBLE_DEVICES=0 VLLM_LLMDD_RPC_PORT=5559 vllm serve $model_name \
+    BASE_CMD="ASCEND_RT_VISIBLE_DEVICES=0 VLLM_ASCEND_LLMDD_RPC_PORT=5559 vllm serve $model_name \
        --port $PREFILL_PORT \
        --seed 1024 \
        --enforce-eager \
@@ -90,7 +90,7 @@ run_tests_for_model() {
    DECODE_PORT=8002

    # Build the command with or without model-specific args
-    BASE_CMD="ASCEND_RT_VISIBLE_DEVICES=1 VLLM_LLMDD_RPC_PORT=6000 vllm serve $model_name \
+    BASE_CMD="ASCEND_RT_VISIBLE_DEVICES=1 VLLM_ASCEND_LLMDD_RPC_PORT=6000 vllm serve $model_name \
        --port $DECODE_PORT \
        --seed 1024 \
        --enforce-eager \

tests/ut/attention/test_mla_v1.py

Lines changed: 5 additions & 1 deletion

@@ -554,7 +554,11 @@ def test_mla_preprocess(self, magic_npu_fetch):
        self.impl.num_kv_heads = self.impl.num_heads

        decode_res, prefill_res = self.impl._mla_preprocess(
-            hidden_states, kv_cache, attn_metadata, need_gather_q_kv=False)
+            "mock_layer",
+            hidden_states,
+            kv_cache,
+            attn_metadata,
+            need_gather_q_kv=False)

        self.assertIsNotNone(decode_res)
        self.assertIsNotNone(prefill_res)

tests/ut/core/test_schedule_config.py

Lines changed: 0 additions & 16 deletions

@@ -27,7 +27,6 @@ def setUp(self):
            max_model_len=8192,
            is_multimodal_model=False,
            send_delta_data=False,
-            scheduler_delay_factor=0,
        )

    def test_initialize_from_config_with_default(self):
@@ -90,21 +89,6 @@ def test_not_implemented_send_delta_data(self):
            str(context.exception),
        )

-    def test_not_implemented_delay_factor(self):
-        with self.assertRaises(NotImplementedError) as context:
-            AscendSchedulerConfig.initialize_from_config(
-                self.basic_scheduler_config,
-                AscendSchedulerConfig(
-                    delay_factor=1,
-                    max_num_batched_tokens=2048,
-                    max_model_len=2048,
-                ),
-            )
-        self.assertIn(
-            "currently AscendScheduler doesn't support scheduler_delay_factor",
-            str(context.exception),
-        )
-
    def test_no_override(self):
        ascend_config = AscendSchedulerConfig.initialize_from_config(
            self.basic_scheduler_config, {})
