Add graph mode for Qwen2.5 and Qwen3 #1787
Conversation
Force-pushed from 9afb7cf to c985f62
@NeverRaR please review
Force-pushed from 871c038 to 5a93bc3
I got the following error when running the Qwen2.5-32B and Qwen3-30B-A3B models:
And the following error when running the Qwen3-32B model:
All models were run with TP 2. The first error occurred during inference, and the second occurred during startup.
Does this PR support all_gather's DP?
Force-pushed from 16a59c4 to 4c85d7c
@@ -188,6 +217,41 @@ def build(self,
    slot_mapping = self.runner.slot_mapping[:num_actual_tokens]
    attn_mask = self.runner.attn_mask
    attn_state = self.runner.attn_state
    query_start_loc_cpu = self.runner.query_start_loc_cpu[:num_reqs + 1]
This code was changed by mistake; please remove it.
Fixed.
vllm_config = get_current_vllm_config()
self.full_graph = vllm_config.compilation_config.full_cuda_graph
self.block_size = vllm_config.cache_config.block_size

def update_kv_cache( |
Please add a UT; the result should match `_npu_reshape_and_cache`.
UT added.
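A minimal sketch of what such a UT could look like, assuming a (num_blocks, block_size, num_kv_heads, head_dim) cache layout; `reference_update_kv_cache` is a hypothetical golden scatter that only illustrates the behaviour `update_kv_cache` (and, per the review, `torch_npu._npu_reshape_and_cache` on the NPU) should reproduce:

```python
import torch


def reference_update_kv_cache(key, value, key_cache, value_cache, slot_indices):
    # Reference scatter: write each token's K/V into the paged cache at its slot.
    # Assumed cache layout: (num_blocks, block_size, num_kv_heads, head_dim).
    block_size = key_cache.shape[1]
    block_ids = slot_indices // block_size
    block_offsets = slot_indices % block_size
    key_cache[block_ids, block_offsets] = key
    value_cache[block_ids, block_offsets] = value


def test_reference_scatter_roundtrip():
    num_tokens, num_kv_heads, head_dim = 8, 4, 128
    num_blocks, block_size = 16, 128
    key = torch.randn(num_tokens, num_kv_heads, head_dim)
    value = torch.randn(num_tokens, num_kv_heads, head_dim)
    key_cache = torch.zeros(num_blocks, block_size, num_kv_heads, head_dim)
    value_cache = torch.zeros_like(key_cache)
    slots = torch.randperm(num_blocks * block_size)[:num_tokens]

    reference_update_kv_cache(key, value, key_cache, value_cache, slots)

    # Reading back through the same slot indices must return the original K/V.
    torch.testing.assert_close(key_cache[slots // block_size, slots % block_size], key)
    torch.testing.assert_close(value_cache[slots // block_size, slots % block_size], value)
```

In the real UT the same inputs would be fed to `update_kv_cache` on an NPU and compared against this reference.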
key_cache=self.key_cache,
value_cache=self.value_cache,
slot_indices=slots)
if not attn_metadata.with_prefill_across_dp and self.torchair_graph_enabled: |
Suggest changing `with_prefill_across_dp` to `attn_metadata.attn_state == AscendAttentionState.DecodeOnly`, so that decode uses a single set of operators internally.
The two currently differ slightly in semantics; will change after a more detailed analysis.
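To make the suggested change concrete, a tiny self-contained sketch of the gating predicate under discussion; `AscendAttentionState` is stubbed out here and the helper name is hypothetical:

```python
from enum import Enum


class AscendAttentionState(Enum):
    # Stub of the real enum in vllm_ascend; only the member used here matters.
    PrefillNoCache = 0
    DecodeOnly = 1


def use_graph_decode_kernel(attn_state, torchair_graph_enabled: bool) -> bool:
    # Reviewer's suggestion: gate on DecodeOnly rather than on the
    # with_prefill_across_dp flag, so decode always takes one operator path.
    return torchair_graph_enabled and attn_state == AscendAttentionState.DecodeOnly


assert use_graph_decode_kernel(AscendAttentionState.DecodeOnly, True)
assert not use_graph_decode_kernel(AscendAttentionState.PrefillNoCache, True)
```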
key_cache = self.key_cache.view(*self.key_cache.shape[:-2], -1)
value_cache = self.value_cache.view(*self.value_cache.shape[:-2], -1)

output = torch_npu.npu_incre_flash_attention( |
Suggest switching to the `npu_fused_infer_attention_score` operator and adding a UT.
After discussion, we will keep using `npu_incre_flash_attention`.
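For readers following along, a hedged sketch of this decode path, assuming a (num_blocks, block_size, num_kv_heads, head_dim) cache layout and BSH input layout; the keyword arguments of `npu_incre_flash_attention` are quoted from the Ascend documentation as best recalled and should be checked against the installed torch_npu version:

```python
import torch
import torch_npu  # requires an Ascend NPU environment


def graph_mode_decode_attention(query, key_cache, value_cache, block_table,
                                seq_lens, num_heads, num_kv_heads, scale,
                                block_size):
    # Flatten (num_kv_heads, head_dim) into one hidden dim so the paged cache
    # matches the BSH layout expected by the incremental flash-attention op.
    key_cache = key_cache.view(*key_cache.shape[:-2], -1)
    value_cache = value_cache.view(*value_cache.shape[:-2], -1)
    return torch_npu.npu_incre_flash_attention(
        query,                       # (batch, 1, num_heads * head_dim)
        key_cache,
        value_cache,
        num_heads=num_heads,
        num_key_value_heads=num_kv_heads,
        scale_value=scale,
        input_layout="BSH",
        actual_seq_lengths=seq_lens,
        block_table=block_table,
        block_size=block_size,
    )
```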
self.q_norm = RMSNorm(self.head_dim, eps=rms_norm_eps)
self.k_norm = RMSNorm(self.head_dim, eps=rms_norm_eps)
ascend_config = get_ascend_config()
self.torchair_graph_enabled = ascend_config.torchair_graph_config.enabled |
Could this be moved into `forward` and check `ascend_config` directly there, to reduce code?
TODO: the code will be moved into `forward` and merged after verification.
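A rough standalone illustration of the trade-off being discussed (the real config object lives in `vllm_ascend.ascend_config`; the toy below only mirrors its shape):

```python
class _TorchairGraphConfig:
    def __init__(self, enabled: bool):
        self.enabled = enabled


_ASCEND_CONFIG = _TorchairGraphConfig(enabled=True)


def get_ascend_config() -> _TorchairGraphConfig:
    # Toy stand-in for vllm_ascend.ascend_config.get_ascend_config().
    return _ASCEND_CONFIG


class AttentionWithCachedFlag:
    def __init__(self):
        # Current PR: snapshot the flag once in __init__.
        self.torchair_graph_enabled = get_ascend_config().enabled

    def forward(self) -> bool:
        return self.torchair_graph_enabled


class AttentionWithInlineCheck:
    def forward(self) -> bool:
        # Reviewer's suggestion: read the config directly in forward,
        # dropping the extra attribute and the __init__ lookup.
        return get_ascend_config().enabled


assert AttentionWithCachedFlag().forward() == AttentionWithInlineCheck().forward()
```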
vllm_ascend/ops/rotary_embedding.py (Outdated)
def rope_forward(
    self,
Please add a UT and verify numerical accuracy against the ATB operator.
UT added.
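A minimal sketch of such a precision UT, assuming neox-style rotation, a (num_tokens, num_heads * head_dim) layout, and the default base of 10000; `reference_rope` is a naive golden implementation, not the ATB operator itself, which would be called on the NPU side and compared with `torch.testing.assert_close`:

```python
import torch


def reference_rope(positions, query, key, head_dim, base=10000.0):
    # Naive neox-style rotary embedding used as the golden reference.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    freqs = positions.float()[:, None] * inv_freq[None, :]   # (T, head_dim // 2)
    cos, sin = freqs.cos(), freqs.sin()

    def rotate(x):
        x = x.view(x.shape[0], -1, head_dim)
        x1, x2 = x[..., :head_dim // 2], x[..., head_dim // 2:]
        out = torch.cat([x1 * cos[:, None] - x2 * sin[:, None],
                         x2 * cos[:, None] + x1 * sin[:, None]], dim=-1)
        return out.view(x.shape[0], -1)

    return rotate(query), rotate(key)


def test_rope_identity_at_position_zero():
    q = torch.randn(4, 2 * 64)
    k = torch.randn(4, 2 * 64)
    pos = torch.zeros(4, dtype=torch.long)
    q_out, k_out = reference_rope(pos, q, k, head_dim=64)
    # At position 0, cos = 1 and sin = 0, so the rotation is the identity.
    torch.testing.assert_close(q_out, q)
    torch.testing.assert_close(k_out, k)
```

The actual test would run `rope_forward` (and the ATB operator) on the NPU with the same inputs and compare the outputs against this reference within a small tolerance.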
@@ -992,7 +992,8 @@ def _process_reqs(
     # Use host tensor, other wise error: tensor.hostData is null
     common_attn_metadata = CommonAttentionMetadata(
         query_start_loc=query_start_loc,
-        seq_lens=self.seq_lens_cpu[:num_reqs])
+        seq_lens=self.seq_lens_cpu[:num_reqs],
Changed by mistake?
Not a mistake; the logic works correctly in the latest PR.
@@ -1112,6 +1114,20 @@ def _process_reqs(
    if envs_ascend.VLLM_ASCEND_ENABLE_DBO and with_prefill:
        model_kwargs["graph_enable"] = False  # type: ignore
    if self.torchair_graph_enabled and not with_prefill:
        torch._dynamo.mark_static(input_ids)
        torch._dynamo.mark_static(positions)
This is here to track down the repeated-compilation issue; remove it once the issue is resolved.
TODO
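For context on what the flagged lines do, a small standalone sketch (no NPU needed); `torch._dynamo.mark_static` is a private Dynamo helper, so its exact behaviour may vary across torch releases:

```python
import torch


def mark_decode_inputs_static(input_ids: torch.Tensor, positions: torch.Tensor) -> None:
    # Pin the decode inputs as static so Dynamo does not speculate dynamic
    # shapes for them, which is one way repeated recompilation can be triggered.
    torch._dynamo.mark_static(input_ids)
    torch._dynamo.mark_static(positions)


mark_decode_inputs_static(torch.zeros(8, dtype=torch.long),
                          torch.zeros(8, dtype=torch.long))
```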
Force-pushed from a0e44e3 to 076e767
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Force-pushed from b6c1124 to e7c0013
Signed-off-by: taoyuxiang <t30002884@china.huawei.com>
What this PR does / why we need it?
Add graph mode support for Qwen2.5 and Qwen3.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Tested the single-operator mode and graph mode of Qwen2.5, Qwen3, and DeepSeek.