support FULL graph mode in Qwen #3369
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces support for the FULL_AND_PIECEWISE graph compilation mode. The changes involve updating the attention mechanism to use npu_fused_infer_attention_score, modifying the platform configuration, and adjusting the model runner. I've found a few critical issues in the implementation: an incorrect assertion in platform.py that will always fail, a misleading log message with missing logic, and an incorrect calculation of query_start_loc in the model runner's dummy run that will break graph capture. These issues need to be addressed to ensure the new functionality works correctly.
vllm_ascend/platform.py
Outdated
assert compilation_config.level == CompilationLevel.PIECEWISE, \
    "When enabling piecewise aclgraph, please make sure compilation_config.level == CompilationLevel.PIECEWISE and compilation_config.cudagraph_mode == CUDAGraphMode.PIECEWISE"
The assertion on line 242 will always fail because it checks if compilation_config.cudagraph_mode is CUDAGraphMode.PIECEWISE, but this code block is only executed when compilation_config.cudagraph_mode is CUDAGraphMode.FULL_AND_PIECEWISE. This will prevent the FULL_AND_PIECEWISE mode from working.
assert compilation_config.level == CompilationLevel.PIECEWISE, \
    "When enabling piecewise aclgraph, please make sure compilation_config.level == CompilationLevel.PIECEWISE and compilation_config.cudagraph_mode == CUDAGraphMode.FULL_AND_PIECEWISE"

self.query_start_loc[:num_reqs + 1] = num_tokens
self.query_start_loc_cpu[:num_reqs + 1] = num_tokens
self.query_start_loc and self.query_start_loc_cpu are being incorrectly set to a scalar value num_tokens. These tensors are expected to store the cumulative sum of token lengths for each request in the batch. Broadcasting a scalar value will result in incorrect query_start_loc values, which will likely cause errors or incorrect behavior during graph capture and profiling dummy runs. The correct logic should compute the cumulative sum of tokens per request, similar to how it's handled in the _prepare_inputs method.
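For illustration, a minimal sketch of the intended cumulative-sum logic (not the PR's actual code). It assumes the dummy run spreads `num_tokens` evenly across `num_reqs` requests; the real fix should mirror what `_prepare_inputs` does.

```python
# Minimal sketch: build cumulative offsets instead of broadcasting a scalar.
# Assumes the dummy run distributes num_tokens evenly over num_reqs.
import torch

num_reqs, num_tokens = 4, 10
tokens_per_req = torch.full((num_reqs,), num_tokens // num_reqs, dtype=torch.int32)
tokens_per_req[:num_tokens % num_reqs] += 1  # distribute the remainder

query_start_loc_cpu = torch.zeros(num_reqs + 1, dtype=torch.int32)
query_start_loc_cpu[1:] = torch.cumsum(tokens_per_req, dim=0)
print(query_start_loc_cpu)  # tensor([ 0,  3,  6,  8, 10], dtype=torch.int32)
```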
vllm_ascend/platform.py
Outdated
if compilation_config.level == CompilationLevel.PIECEWISE:
    logger.warning(
        "NEW NPU does not support %s compilation level. Setting CUDAGraphMode to NONE",
        compilation_config.level)
The log message at line 183 is misleading. It states "Setting CUDAGraphMode to NONE", but the code does not actually modify compilation_config.cudagraph_mode. This can cause confusion and incorrect behavior if PIECEWISE compilation level is not supported. Additionally, "NEW NPU" appears to be a typo and should likely be "NPU".
if compilation_config.level == CompilationLevel.PIECEWISE:
    logger.warning(
        "NPU does not support %s compilation level. Setting CUDAGraphMode to NONE",
        compilation_config.level)
    compilation_config.cudagraph_mode = CUDAGraphMode.NONE
This pull request has conflicts, please resolve those before we can evaluate the pull request.
graph_params.attn_params[num_tokens_origin].append((
    query,
    self.key_cache,
    self.value_cache,
    key,
    value,
    attn_metadata.block_tables,
    block_size,
    seq_lens,
    query_start_loc,
    self.num_kv_heads,
    self.num_heads,
    self.scale,
    attn_metadata.block_tables,
    attn_metadata.seq_lens,
    output,
    attn_output,
    softmax_lse
))
Please use weak_ref_tensor like #3331.
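For reference, a rough sketch of what this asks for, assuming `weak_ref_tensor` is importable from `vllm.utils` (the exact helper and its location in vllm-ascend and in #3331 may differ): wrap each captured tensor in a weak reference before stashing it in `graph_params`, so the saved parameters do not keep the original buffers alive across graph captures.

```python
# Rough sketch only; the weak_ref_tensor import path is an assumption.
import torch
from vllm.utils import weak_ref_tensor

def stash_attn_params(param_list: list, *tensors: torch.Tensor) -> None:
    """Append weak-referenced views so graph capture does not pin the buffers."""
    param_list.append(tuple(weak_ref_tensor(t) for t in tensors))

# Usage with dummy tensors standing in for query / key_cache / value_cache / ...
params: list = []
stash_attn_params(params, torch.zeros(4, 8), torch.zeros(16, 8))
```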
output = torch.empty_like(query)
softmax_lse = torch.empty(num_tokens,
                          dtype=query.dtype,
                          device=query.device)
query_start_loc = attn_metadata.query_start_loc[1:].cpu().int().tolist()
seq_lens = attn_metadata.seq_lens.cpu().int().tolist()
Try not to create or modify these parameters in forward(); refactor them into prepare_inputs (or somewhere similar) so we only do this once per step, not once per layer.
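A hedged sketch of that refactor, using hypothetical names (`AttnStepMeta`, `build_step_meta`): do the device-to-host conversions once per step while building the attention metadata, and let every layer's forward just read the cached lists.

```python
# Hedged sketch; AttnStepMeta / build_step_meta are illustrative names only,
# not the PR's actual classes.
from dataclasses import dataclass
from typing import List

import torch

@dataclass
class AttnStepMeta:
    query_start_loc_list: List[int]
    seq_lens_list: List[int]

def build_step_meta(query_start_loc: torch.Tensor,
                    seq_lens: torch.Tensor) -> AttnStepMeta:
    # One device->host sync per step instead of one per attention layer.
    return AttnStepMeta(
        query_start_loc_list=query_start_loc[1:].cpu().int().tolist(),
        seq_lens_list=seq_lens.cpu().int().tolist(),
    )

# forward() would then read attn_metadata.query_start_loc_list /
# attn_metadata.seq_lens_list instead of recomputing them per layer.
```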
force-pushed from ee7e275 to 8fe8978
force-pushed from 0f6e3d6 to 81291e9
force-pushed from f9ad80d to bcda351
This pull request has conflicts, please resolve those before we can evaluate the pull request.
seq_lens: torch.Tensor = None

query_start_loc: torch.Tensor = None
seq_lens_list: List[int] = None
Please add a note explaining these new fields.
    num_block, block_size, -1)
value = self.value_cache.view(  # type: ignore
    num_block, block_size, -1)
softmax_lse = torch.empty(num_tokens,
Move the softmax_lse allocation to __init__ so it is created once instead of on every forward.
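A minimal sketch of that change, with assumed names (`AttentionImplSketch` and `max_num_tokens` are stand-ins): allocate the buffer once in `__init__` and slice it per call.

```python
# Minimal sketch; the class and max_num_tokens are illustrative, not the PR's code.
import torch

class AttentionImplSketch:
    def __init__(self, max_num_tokens: int, dtype: torch.dtype, device: str):
        # Allocated once, reused by every forward call.
        self._softmax_lse = torch.empty(max_num_tokens, dtype=dtype, device=device)

    def lse_for(self, num_tokens: int) -> torch.Tensor:
        return self._softmax_lse[:num_tokens]

impl = AttentionImplSketch(max_num_tokens=4096, dtype=torch.float32, device="cpu")
lse = impl.lse_for(128)  # shape: (128,)
```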
This pull request has conflicts, please resolve those before we can evaluate the pull request.
forward_context: ForwardContext = get_forward_context()
if torch.version.cann.startswith("8.3"):
    if forward_context.capturing:
        output = self.full_graph_attention(query, key, value, attn_metadata, 128, output)
Why hard-code block_size to 128 here?
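A possible alternative, sketched under the assumption that the cache config is reachable here (for example via `get_current_vllm_config()`, which vLLM provides): read the configured kv-cache block size and fall back to the old literal only if it is unset.

```python
# Hedged sketch; whether this attention path has (or should be given) access
# to the vLLM config object at this point is an assumption.
from vllm.config import get_current_vllm_config

def resolve_block_size(default: int = 128) -> int:
    cache_config = get_current_vllm_config().cache_config
    block_size = getattr(cache_config, "block_size", None)
    return block_size if block_size else default  # fall back to the old literal
```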
This pull request has conflicts, please resolve those before we can evaluate the pull request.
force-pushed from 55bee48 to bdb15fb
This pull request has conflicts, please resolve those before we can evaluate the pull request.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
force-pushed from a98f46f to 81e3e8d
force-pushed from 55e6279 to a6354ec
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Moved to PR: https://github.com/vllm-project/vllm-ascend/actions/runs/19228494454/job/54961305233?pr=3970
What this PR does / why we need it?
The current library only supports the FullDecodeOnly graph mode, which enables full graph execution only during the decode phase. This PR extends that support to allow full graph execution during both the prefill and decode phases, referred to as FULL graph mode.

support FULL graph mode:
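For context, a hypothetical offline-inference snippet showing how a user might request full-graph capture through vLLM's `compilation_config`; the accepted mode strings and their behaviour on Ascend depend on the vLLM / vllm-ascend versions, so treat the exact knob values as assumptions rather than documentation of this PR.

```python
# Hypothetical usage sketch; the "FULL" cudagraph_mode value and its Ascend
# behaviour are assumptions, not taken from this PR.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    compilation_config={"cudagraph_mode": "FULL"},
)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```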
Does this PR introduce any user-facing change?
How was this patch tested?