
Conversation

@momo609 momo609 (Collaborator) commented Oct 10, 2025

Moved to PR: https://github.com/vllm-project/vllm-ascend/actions/runs/19228494454/job/54961305233?pr=3970

What this PR does / why we need it?

The current library only supports the FullDecodeOnly graph mode, which enables full graph execution only during the decode phase. This PR extends support to allow full graph execution in both the prefill and decode phases, referred to as FULL graph mode.
Support for FULL graph mode: [screenshot in original PR]
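
For context, a minimal usage sketch of how full-graph capture might be requested through vLLM's compilation config. The dict form of compilation_config and the "FULL" mode string follow upstream vLLM conventions and are assumptions here, not something this PR defines:

    from vllm import LLM

    # Hedged sketch: ask for full graph capture covering both prefill and
    # decode. The "FULL" value mirrors upstream vLLM's CUDAGraphMode enum;
    # the exact spelling accepted by vllm-ascend may differ.
    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct",
        compilation_config={"cudagraph_mode": "FULL"},
    )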

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@momo609 momo609 changed the title from "add fullandpiecesewise graph." to "support FULL_AND_PIECEWISE graph mode." on Oct 10, 2025
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces support for the FULL_AND_PIECEWISE graph compilation mode. The changes involve updating the attention mechanism to use npu_fused_infer_attention_score, modifying platform configurations, and adjusting the model runner. I've found a few critical issues in the implementation: a contradictory assertion message in platform.py, a misleading log message with missing logic, and an incorrect calculation of query_start_loc in the model runner's dummy run, which will break graph capturing. These issues need to be addressed to ensure the new functionality works correctly.

assert compilation_config.level == CompilationLevel.PIECEWISE, \
    "When enabling piecewise aclgraph, please make sure compilation_config.level == CompilationLevel.PIECEWISE and compilation_config.cudagraph_mode == CUDAGraphMode.PIECEWISE"
Contributor

critical

The assertion message on line 242 is contradictory: it tells users to set compilation_config.cudagraph_mode == CUDAGraphMode.PIECEWISE, but this code block is only executed when compilation_config.cudagraph_mode is CUDAGraphMode.FULL_AND_PIECEWISE. Following that guidance would prevent FULL_AND_PIECEWISE mode from working; the message should reference CUDAGraphMode.FULL_AND_PIECEWISE instead.

            assert compilation_config.level == CompilationLevel.PIECEWISE, \
                "When enabling piecewise aclgraph, please make sure compilation_config.level == CompilationLevel.PIECEWISE and compilation_config.cudagraph_mode == CUDAGraphMode.FULL_AND_PIECEWISE"

Comment on lines 2175 to 2211
self.query_start_loc[:num_reqs + 1] = num_tokens
self.query_start_loc_cpu[:num_reqs + 1] = num_tokens
Contributor

critical

self.query_start_loc and self.query_start_loc_cpu are being incorrectly set to a scalar value num_tokens. These tensors are expected to store the cumulative sum of token lengths for each request in the batch. Broadcasting a scalar value will result in incorrect query_start_loc values, which will likely cause errors or incorrect behavior during graph capture and profiling dummy runs. The correct logic should compute the cumulative sum of tokens per request, similar to how it's handled in the _prepare_inputs method.
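
A minimal sketch of the intended fix, assuming the dummy run spreads num_tokens evenly across num_reqs requests (the real split used by _prepare_inputs may differ):

    # Hedged sketch: build the cumulative sum of per-request token counts
    # instead of broadcasting the scalar num_tokens.
    tokens_per_req = num_tokens // num_reqs
    cu_lens = torch.arange(num_reqs + 1, dtype=torch.int32) * tokens_per_req
    self.query_start_loc_cpu[:num_reqs + 1] = cu_lens
    self.query_start_loc[:num_reqs + 1] = cu_lens.to(self.query_start_loc.device)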

Comment on lines 181 to 184
if compilation_config.level == CompilationLevel.PIECEWISE:
    logger.warning(
        "NEW NPU does not support %s compilation level. Setting CUDAGraphMode to NONE",
        compilation_config.level)
Contributor

high

The log message at line 183 is misleading. It states "Setting CUDAGraphMode to NONE", but the code does not actually modify compilation_config.cudagraph_mode. This can cause confusion and incorrect behavior if PIECEWISE compilation level is not supported. Additionally, "NEW NPU" appears to be a typo and should likely be "NPU".

        if compilation_config.level == CompilationLevel.PIECEWISE:
            logger.warning(
                "NPU does not support %s compilation level. Setting CUDAGraphMode to NONE",
                compilation_config.level)
            compilation_config.cudagraph_mode = CUDAGraphMode.NONE

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Comment on lines 435 to 448
graph_params.attn_params[num_tokens_origin].append((
    query,
    self.key_cache,
    self.value_cache,
    key,
    value,
    attn_metadata.block_tables,
    block_size,
    seq_lens,
    query_start_loc,
    self.num_kv_heads,
    self.num_heads,
    self.scale,
    attn_metadata.block_tables,
    attn_metadata.seq_lens,
    output,
    attn_output,
    softmax_lse
))
Collaborator

Please use weak_ref_tensor like #3331.
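
For reference, a hedged sketch of that change; upstream vLLM exposes a weak_ref_tensors helper in vllm.utils, but the exact helper and import path used by #3331 are assumptions here:

    from vllm.utils import weak_ref_tensors  # import path is an assumption

    # Store weak references so the captured graph does not keep these
    # tensors alive beyond their normal lifetime.
    graph_params.attn_params[num_tokens_origin].append((
        weak_ref_tensors(query),
        weak_ref_tensors(self.key_cache),
        weak_ref_tensors(self.value_cache),
        # ... wrap the remaining tensor fields the same way; scalars such
        # as block_size, num_heads and scale pass through unchanged.
    ))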

Comment on lines 406 to 411
output = torch.empty_like(query)
softmax_lse = torch.empty(num_tokens,
                          dtype=query.dtype,
                          device=query.device)
query_start_loc = attn_metadata.query_start_loc[1:].cpu().int().tolist()
seq_lens = attn_metadata.seq_lens.cpu().int().tolist()
Collaborator

Try not to create or modify params in forward; refactor them into prepare_inputs or somewhere similar. We only need to do this once per step, not in every layer.
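
A hedged sketch of the suggested refactor, reusing the seq_lens_list-style metadata fields that appear later in this thread (the helper name is illustrative):

    def finalize_attn_metadata(attn_metadata):
        # Runs once per step (e.g. from prepare_inputs), so every layer's
        # forward can read the host-side lists without re-syncing.
        attn_metadata.query_start_loc_list = (
            attn_metadata.query_start_loc[1:].cpu().int().tolist())
        attn_metadata.seq_lens_list = attn_metadata.seq_lens.cpu().int().tolist()
        return attn_metadata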

@momo609 momo609 force-pushed the fullpiece branch 3 times, most recently from ee7e275 to 8fe8978 on October 13, 2025 06:38
@momo609 momo609 changed the title from "support FULL_AND_PIECEWISE graph mode." to "support FULL_AND_PIECEWISE graph mode and spiltfusedpa op." on Oct 13, 2025
@momo609 momo609 force-pushed the fullpiece branch 2 times, most recently from 0f6e3d6 to 81291e9 on October 14, 2025 02:10
@momo609 momo609 force-pushed the fullpiece branch 2 times, most recently from f9ad80d to bcda351 on October 14, 2025 04:42
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

seq_lens: torch.Tensor = None

query_start_loc: torch.Tensor = None
seq_lens_list: List[int] = None
Collaborator Author

Will add a note.
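
For reference, a hedged version of such a note, with field meanings inferred from how these fields are used elsewhere in this thread:

    seq_lens: torch.Tensor = None          # per-request sequence lengths (device tensor)

    query_start_loc: torch.Tensor = None   # cumulative query offsets, length num_reqs + 1
    seq_lens_list: List[int] = None        # host-side copy of seq_lens, rebuilt once per step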

    num_block, block_size, -1)
value = self.value_cache.view(  # type: ignore
    num_block, block_size, -1)
softmax_lse = torch.empty(num_tokens,
Collaborator Author

Will move the softmax_lse allocation to __init__.
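
A minimal sketch of that change, assuming an upper bound on tokens is known at construction time (max_num_tokens and the dtype/device values are assumptions):

    # In __init__: allocate once at the maximum size...
    self.softmax_lse = torch.empty(max_num_tokens, dtype=dtype, device=device)

    # ...then in forward: slice to the current step's token count instead
    # of allocating a fresh tensor on every call.
    softmax_lse = self.softmax_lse[:num_tokens]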

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@momo609 momo609 changed the title from "support FULL_AND_PIECEWISE graph mode and spiltfusedpa op." to "support FULL_AND_PIECEWISE and FULL graph mode" on Oct 21, 2025
@momo609 momo609 changed the title from "support FULL_AND_PIECEWISE and FULL graph mode" to "support FULL_AND_PIECEWISE and FULL graph mode in Qwen" on Oct 22, 2025
forward_context: ForwardContext = get_forward_context()
if torch.version.cann.startswith("8.3"):
    if forward_context.capturing:
        output = self.full_graph_attention(query, key, value, attn_metadata, 128, output)
Collaborator

Why hard-code block_size to 128 here?
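
A hedged alternative to the literal 128: derive the block size from the kv-cache layout, assuming key_cache is shaped (num_blocks, block_size, ...) as the key_cache.view(num_block, block_size, -1) call earlier in this thread suggests:

    # Read the block size from the cache layout instead of hard-coding 128.
    block_size = self.key_cache.shape[1]
    output = self.full_graph_attention(query, key, value, attn_metadata,
                                       block_size, output)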

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@momo609 momo609 changed the title from "support FULL_AND_PIECEWISE and FULL graph mode in Qwen" to "support FULL graph mode in Qwen" on Oct 23, 2025
@momo609 momo609 force-pushed the fullpiece branch 2 times, most recently from 55bee48 to bdb15fb on October 27, 2025 02:07
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@momo609 momo609 force-pushed the fullpiece branch 6 times, most recently from a98f46f to 81e3e8d on October 30, 2025 08:06
@momo609 momo609 force-pushed the fullpiece branch 2 times, most recently from 55e6279 to a6354ec on October 30, 2025 09:05
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

wangxiaoxin-sherie added 2 commits November 4, 2025 11:40
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
