chunked prefill, access splitfuse op #2962
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review

This PR integrates the new splitfuse chunked-prefill operator. The changes touch two files: attention_v1.py and model_runner_v1.py. In attention_v1.py, the attention computation in _forward_v1_style is switched from _npu_paged_attention_splitfuse to npu_fused_infer_attention_score. In model_runner_v1.py, the logic that builds the attention mask for the ChunkedPrefill case is modified.

My review found two critical issues:
- In attention_v1.py, the value passed to the new operator as actual_seq_lengths is wrong: it uses cumulative token positions rather than sequence lengths, which leads to incorrect attention results.
- In model_runner_v1.py, the attention mask built for ChunkedPrefill uses a hard-coded size of (2048, 2048), which makes the code fragile and breaks as soon as any sequence exceeds 2048 tokens.

Please fix these two critical issues to ensure correctness and robustness.
num_kv_heads=self.num_kv_heads,
input_layout="TND",
block_size=block_size,
actual_seq_lengths=attn_metadata.query_start_loc[1:],
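For illustration only, here is a minimal plain-PyTorch sketch (the example values are made up, not taken from this PR) of the distinction the review draws between cumulative start offsets in query_start_loc and per-sequence lengths:

```python
import torch

# Hypothetical example: query_start_loc holds cumulative token offsets,
# e.g. [0, 3, 8, 12] for three sequences of lengths 3, 5 and 4.
query_start_loc = torch.tensor([0, 3, 8, 12], dtype=torch.int32)

# query_start_loc[1:] yields the cumulative end positions [3, 8, 12],
# whereas the per-sequence lengths are the consecutive differences.
end_positions = query_start_loc[1:]                    # tensor([ 3,  8, 12])
seq_lens = query_start_loc[1:] - query_start_loc[:-1]  # tensor([3, 5, 4])
```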
@@ -797,8 +797,7 @@ def _make_attention_mask(self, seq_lens, position,
                              attn_state) -> torch.Tensor:
         # Chunk Prefill situation.
         if attn_state == AscendAttentionState.ChunkedPrefill and not self.vllm_config.model_config.use_mla:
-            return self.attn_mask_builder.get_splitfuse_attn_mask(
-                seq_lens, position, self.dtype, self.device)
+            return torch.triu(torch.ones(2048, 2048), diagonal=1).to(torch.int8)
The attention mask uses a hard-coded size of (2048, 2048). This magic number makes the implementation fragile: if any sequence in the batch exceeds 2048 tokens, the mask will be incorrect or go out of bounds. The mask size should be derived from the model's configured maximum sequence length to ensure correctness and avoid the magic number.
Suggested change:
-            return torch.triu(torch.ones(2048, 2048), diagonal=1).to(torch.int8)
+            return torch.triu(torch.ones(self.model_config.max_model_len, self.model_config.max_model_len), diagonal=1).to(torch.int8)
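As a sketch of the direction suggested above (the helper name build_causal_mask and the example max_model_len value are illustrative, not part of this PR), the mask could be sized from the configured maximum length and built once rather than hard-coding 2048:

```python
import torch

def build_causal_mask(max_len: int, device: torch.device) -> torch.Tensor:
    # Upper-triangular int8 mask; 1 marks positions to be masked out.
    return torch.triu(torch.ones(max_len, max_len, device=device),
                      diagonal=1).to(torch.int8)

# Built once (e.g. at runner init) from the configured maximum length and
# reused, instead of re-creating a fixed 2048x2048 mask on every call.
max_model_len = 4096  # illustrative; would come from the model config
causal_mask = build_causal_mask(max_model_len, torch.device("cpu"))
```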
Could you integrate other scenarios, such as full FlashAttention, using the FIA interface as well, and provide the performance test results?
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?