Add graph mode for Qwen2.5 and Qwen3 #1787

Closed
NicholasTao wants to merge 1 commit from the qw23 branch

Conversation

NicholasTao

What this PR does / why we need it?
Add graph mode for Qwen2.5 and Qwen3

Does this PR introduce any user-facing change?
No

How was this patch tested?
Tested both single-operator mode and graph mode with Qwen2.5, Qwen3, and DeepSeek.

@ApsarasX
Collaborator

@NeverRaR please review

@NicholasTao force-pushed the qw23 branch 2 times, most recently from 871c038 to 5a93bc3 (July 15, 2025 08:43)
@huyz-git

I got the following error when running the Qwen2.5-32B and Qwen3-30B-A3B models:

(VllmWorker rank=0 pid=50303) ERROR 07-15 17:03:26 [multiproc_executor.py:527] WorkerProc hit an exception.
(VllmWorker rank=0 pid=50303) ERROR 07-15 17:03:26 [multiproc_executor.py:527] Traceback (most recent call last):
(VllmWorker rank=0 pid=50303) ERROR 07-15 17:03:26 [multiproc_executor.py:527]   File "/home/abc/vllm/vllm/v1/executor/multiproc_executor.py", line 522, in worker_busy_loop
(VllmWorker rank=0 pid=50303) ERROR 07-15 17:03:26 [multiproc_executor.py:527]     output = func(*args, **kwargs)
(VllmWorker rank=0 pid=50303) ERROR 07-15 17:03:26 [multiproc_executor.py:527]   File "/home/abc/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 182, in execute_model
(VllmWorker rank=0 pid=50303) ERROR 07-15 17:03:26 [multiproc_executor.py:527]     output = self.model_runner.execute_model(scheduler_output)
(VllmWorker rank=0 pid=50303) ERROR 07-15 17:03:26 [multiproc_executor.py:527]   File "/home/abc/venv-vllm/lib64/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=50303) ERROR 07-15 17:03:26 [multiproc_executor.py:527]     return func(*args, **kwargs)
(VllmWorker rank=0 pid=50303) ERROR 07-15 17:03:26 [multiproc_executor.py:527]   File "/home/abc/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1344, in execute_model
(VllmWorker rank=0 pid=50303) ERROR 07-15 17:03:26 [multiproc_executor.py:527]     self._update_states(scheduler_output)
(VllmWorker rank=0 pid=50303) ERROR 07-15 17:03:26 [multiproc_executor.py:527]   File "/home/abc/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 531, in _update_states
(VllmWorker rank=0 pid=50303) ERROR 07-15 17:03:26 [multiproc_executor.py:527]     for block_ids, new_block_ids in zip(  # type: ignore[call-overload]
(VllmWorker rank=0 pid=50303) ERROR 07-15 17:03:26 [multiproc_executor.py:527] TypeError: zip() takes no keyword arguments

And the following error when running the Qwen3-32B model:

(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492] WorkerProc failed to start.
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492] Traceback (most recent call last):
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]   File "/home/abc/vllm/vllm/v1/executor/multiproc_executor.py", line 466, in worker_main
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]     worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]   File "/home/abc/vllm/vllm/v1/executor/multiproc_executor.py", line 363, in __init__
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]     self.worker.load_model()
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]   File "/home/abc/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 196, in load_model
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]     self.model_runner.load_model()
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]   File "/home/abc/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1799, in load_model
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]     self.model = get_model(vllm_config=self.vllm_config)
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]   File "/home/abc/vllm/vllm/model_executor/model_loader/__init__.py", line 59, in get_model
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]     return loader.load_model(vllm_config=vllm_config,
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]   File "/home/abc/vllm/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]     model = initialize_model(vllm_config=vllm_config,
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]   File "/home/abc/vllm/vllm/model_executor/model_loader/utils.py", line 62, in initialize_model
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]     return model_class(vllm_config=vllm_config, prefix=prefix)
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]   File "/home/abc/vllm-ascend/vllm_ascend/models/qwen3.py", line 279, in __init__
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]     self.model = CustomQwen3Model(vllm_config=vllm_config,
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]   File "/home/abc/vllm/vllm/compilation/decorators.py", line 152, in __init__
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]   File "/home/abc/vllm-ascend/vllm_ascend/models/qwen3.py", line 209, in __init__
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]     self.cos_sin_cache = self.layers[0].self_attn.rotary_emb.cos_sin_cache
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]   File "/home/abc/venv-vllm/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1931, in __getattr__
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492]     raise AttributeError(
(VllmWorker rank=0 pid=52790) ERROR 07-15 16:37:50 [multiproc_executor.py:492] AttributeError: 'RotaryEmbedding' object has no attribute 'cos_sin_cache'

All models are run with tp=2. The first error occurred during inference, and the second during startup.
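
For reference, zip() only accepts the strict keyword argument on Python 3.10+ (PEP 618), and the paths in the traceback point to a Python 3.9 environment, which would explain the first error if _update_states calls zip(..., strict=...). A minimal sketch of a version-compatible fallback (the zip_strict helper name is hypothetical, not from this PR):

import sys

def zip_strict(*iterables):
    # zip(strict=True) was added in Python 3.10 (PEP 618); emulate the
    # length check manually on older interpreters.
    if sys.version_info >= (3, 10):
        return zip(*iterables, strict=True)
    materialized = [list(it) for it in iterables]
    if len({len(seq) for seq in materialized}) > 1:
        raise ValueError("zip_strict() arguments have unequal lengths")
    return zip(*materialized)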

@ttanzhiqiang
Contributor

Does this PR support all_gather-based DP?

@NicholasTao force-pushed the qw23 branch 5 times, most recently from 16a59c4 to 4c85d7c (July 15, 2025 12:04)
@@ -188,6 +217,41 @@ def build(self,
slot_mapping = self.runner.slot_mapping[:num_actual_tokens]
attn_mask = self.runner.attn_mask
attn_state = self.runner.attn_state
query_start_loc_cpu = self.runner.query_start_loc_cpu[:num_reqs + 1]
Contributor

This code was changed by mistake; please remove it.

Author

Fixed.


vllm_config = get_current_vllm_config()
self.full_graph = vllm_config.compilation_config.full_cuda_graph
self.block_size = vllm_config.cache_config.block_size

def update_kv_cache(
Contributor

Please add a UT; the results must be consistent with _npu_reshape_and_cache.

Author

UT added.
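
For context, such a UT typically scatters the key/value tensors into the paged cache by slot index and checks the result against _npu_reshape_and_cache. A minimal CPU reference sketch (shapes and the flattened-slot layout are illustrative assumptions, not the PR's actual test):

import torch

def reference_update_kv_cache(key, value, key_cache, value_cache, slot_indices):
    # key/value: [num_tokens, num_kv_heads, head_dim]
    # key_cache/value_cache: [num_slots, num_kv_heads, head_dim], with slots
    # flattened from [num_blocks, block_size, ...]
    key_cache[slot_indices] = key
    value_cache[slot_indices] = value

num_tokens, num_kv_heads, head_dim, num_slots = 4, 2, 8, 16
key = torch.randn(num_tokens, num_kv_heads, head_dim)
value = torch.randn(num_tokens, num_kv_heads, head_dim)
key_cache = torch.zeros(num_slots, num_kv_heads, head_dim)
value_cache = torch.zeros(num_slots, num_kv_heads, head_dim)
slots = torch.tensor([3, 7, 8, 15])
reference_update_kv_cache(key, value, key_cache, value_cache, slots)
assert torch.equal(key_cache[slots], key)
assert torch.equal(value_cache[slots], value)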

key_cache=self.key_cache,
value_cache=self.value_cache,
slot_indices=slots)
if not attn_metadata.with_prefill_across_dp and self.torchair_graph_enabled:
Contributor

Suggest changing with_prefill_across_dp to attn_metadata.attn_state == AscendAttentionState.DecodeOnly, so that decode uses a single set of operators throughout.

Author

The two conditions currently differ slightly in semantics; this will be changed after a detailed analysis.
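
For clarity, the two gates under discussion look roughly like this (a side-by-side sketch of the current check and the reviewer's suggestion, not the final code):

# Current: skip the graph decode path whenever any DP rank is prefilling.
if not attn_metadata.with_prefill_across_dp and self.torchair_graph_enabled:
    ...
# Suggested: key solely off this rank's attention state, so decode always
# goes through one operator set.
if attn_metadata.attn_state == AscendAttentionState.DecodeOnly and self.torchair_graph_enabled:
    ...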

key_cache = self.key_cache.view(*self.key_cache.shape[:-2], -1)
value_cache = self.value_cache.view(*self.value_cache.shape[:-2], -1)

output = torch_npu.npu_incre_flash_attention(
Contributor

Suggest switching to the npu_fused_infer_attention_score operator, and add a UT.

Author

After discussion, we will continue using npu_incre_flash_attention.
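
As background on the view() calls above: they merge the trailing num_kv_heads and head_dim axes into a single hidden dimension, i.e. the flattened layout the fused decode kernel consumes. A small pure-PyTorch illustration (shapes are assumptions):

import torch

# Assume a paged cache of shape [num_blocks, block_size, num_kv_heads, head_dim].
key_cache = torch.randn(4, 128, 2, 64)
# view(*shape[:-2], -1) collapses the last two axes into num_kv_heads * head_dim.
flat = key_cache.view(*key_cache.shape[:-2], -1)
assert flat.shape == (4, 128, 2 * 64)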

self.q_norm = RMSNorm(self.head_dim, eps=rms_norm_eps)
self.k_norm = RMSNorm(self.head_dim, eps=rms_norm_eps)
ascend_config = get_ascend_config()
self.torchair_graph_enabled = ascend_config.torchair_graph_config.enabled
Contributor

Could this be moved into forward, checking ascend_config directly there, to reduce code?

Author (@NicholasTao, Jul 18, 2025)

TODO: move this code into forward; it will be merged after verification.
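
A rough sketch of that TODO, assuming get_ascend_config is imported from vllm_ascend.ascend_config as in the surrounding code (illustrative only, not the final change):

def forward(self, positions, hidden_states):
    # Read the flag from the ascend config at call time instead of caching
    # self.torchair_graph_enabled in __init__, trimming constructor code.
    if get_ascend_config().torchair_graph_config.enabled:
        ...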



def rope_forward(
self,
Contributor

Please add a UT that checks precision against the ATB operator.

Author

UT added.
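
Such a precision UT usually compares rope_forward against a plain PyTorch rotary reference within a tolerance. A minimal neox-style reference sketch (shapes, base, and the helper name are assumptions, not the PR's actual test):

import torch

def reference_rope(q, cos, sin):
    # q: [num_tokens, num_heads, head_dim]; cos/sin: [num_tokens, head_dim // 2]
    half = q.shape[-1] // 2
    q1, q2 = q[..., :half], q[..., half:]
    cos, sin = cos.unsqueeze(1), sin.unsqueeze(1)  # broadcast over heads
    return torch.cat((q1 * cos - q2 * sin, q2 * cos + q1 * sin), dim=-1)

positions = torch.arange(4)
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, 32).float() / 32))
freqs = torch.outer(positions.float(), inv_freq)  # [num_tokens, head_dim // 2]
q = torch.randn(4, 8, 64)
out = reference_rope(q, freqs.cos(), freqs.sin())
assert out.shape == q.shape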

@@ -992,7 +992,8 @@ def _process_reqs(
# Use host tensor, otherwise error: tensor.hostData is null
common_attn_metadata = CommonAttentionMetadata(
query_start_loc=query_start_loc,
seq_lens=self.seq_lens_cpu[:num_reqs])
seq_lens=self.seq_lens_cpu[:num_reqs],
Contributor

Was this changed by mistake?

Author

Not a mistake; the logic in the latest PR works correctly.

@@ -1112,6 +1114,20 @@ def _process_reqs(
if envs_ascend.VLLM_ASCEND_ENABLE_DBO and with_prefill:
model_kwargs["graph_enable"] = False # type: ignore
if self.torchair_graph_enabled and not with_prefill:
torch._dynamo.mark_static(input_ids)
torch._dynamo.mark_static(positions)
Contributor

This is here to track down a repeated-compilation issue; remove it once that is resolved.

Author

TODO
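
For reference, torch._dynamo.mark_static pins a tensor's shape as static during tracing, which is one way to stop shape changes from triggering fresh graph compiles between steps. A toy illustration (not the model runner's actual code):

import torch

@torch.compile
def toy_step(x):
    return x * 2

x = torch.zeros(8, dtype=torch.int64)
# Mark the input static before the first traced call so its shape is baked
# into the compiled graph rather than treated as dynamic.
torch._dynamo.mark_static(x)
toy_step(x)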

@NicholasTao force-pushed the qw23 branch 5 times, most recently from a0e44e3 to 076e767 (July 17, 2025 02:19)

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@NicholasTao force-pushed the qw23 branch 3 times, most recently from b6c1124 to e7c0013 (July 17, 2025 07:07)
Signed-off-by: taoyuxiang <t30002884@china.huawei.com>