
Conversation

@qyqc731 (Contributor) commented Sep 16, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?


👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@qyqc731 changed the title from "chunked prefill splitfuse算子接入" ("chunked prefill splitfuse op integration") to "chunked prefill splitfuse op in" on Sep 16, 2025
@gemini-code-assist (bot) left a comment


Code Review

This PR integrates the new splitfuse chunked prefill operator. The changes touch two files, attention_v1.py and model_runner_v1.py. In attention_v1.py, the attention computation in _forward_v1_style is switched from _npu_paged_attention_splitfuse to npu_fused_infer_attention_score. In model_runner_v1.py, the logic that builds the attention mask for the ChunkedPrefill case is modified.

My review found two critical issues:

  1. In attention_v1.py, the actual_seq_lengths argument passed to the new operator is wrong: it uses cumulative token positions instead of per-sequence lengths, which will produce incorrect attention results.
  2. In model_runner_v1.py, the attention mask built for ChunkedPrefill uses a hard-coded size of (2048, 2048), which makes the code fragile and breaks once any sequence exceeds 2048 tokens.

Please fix these two critical issues to ensure correctness and robustness.

num_kv_heads=self.num_kv_heads,
input_layout="TND",
block_size=block_size,
actual_seq_lengths=attn_metadata.query_start_loc[1:],

critical

The actual_seq_lengths parameter is being passed attn_metadata.query_start_loc[1:], which contains cumulative token positions, not the lengths of the individual sequences. This will most likely produce incorrect attention results. You should use attn_metadata.query_lens instead, which holds the actual per-sequence lengths, and make sure it lives on the correct device.

Suggested change
actual_seq_lengths=attn_metadata.query_start_loc[1:],
actual_seq_lengths=attn_metadata.query_lens.to(query.device),
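
For illustration, a minimal sketch (not code from the PR; the values are made up and the names mirror the metadata fields discussed above) of why the two tensors differ:

import torch

# Hypothetical chunked-prefill batch with three requests contributing 3, 5 and 2 query tokens.
query_lens = torch.tensor([3, 5, 2])

# In this sketch, query_start_loc is the prefix sum of query_lens with a leading zero:
# cumulative token offsets into the flattened query tensor.
query_start_loc = torch.zeros(len(query_lens) + 1, dtype=torch.int64)
query_start_loc[1:] = torch.cumsum(query_lens, dim=0)

print(query_start_loc[1:])  # tensor([ 3,  8, 10]) -> cumulative end offsets (what the PR passes)
print(query_lens)           # tensor([3, 5, 2])    -> per-sequence lengths (what the review suggests)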

@@ -797,8 +797,7 @@ def _make_attention_mask(self, seq_lens, position,
attn_state) -> torch.Tensor:
# Chunk Prefill situation.
if attn_state == AscendAttentionState.ChunkedPrefill and not self.vllm_config.model_config.use_mla:
return self.attn_mask_builder.get_splitfuse_attn_mask(
seq_lens, position, self.dtype, self.device)
return torch.triu(torch.ones(2048, 2048), diagonal=1).to(torch.int8)

critical

The attention mask uses a hard-coded size of (2048, 2048). This magic number makes the implementation brittle: if any sequence in the batch is longer than 2048 tokens, the mask will be wrong or cause out-of-bounds errors. The mask size should be derived from the model config's maximum sequence length to guarantee correctness and avoid the magic number.

Suggested change
return torch.triu(torch.ones(2048, 2048), diagonal=1).to(torch.int8)
return torch.triu(torch.ones(self.model_config.max_model_len, self.model_config.max_model_len), diagonal=1).to(torch.int8)
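
For reference, a minimal sketch (an assumption-laden illustration, not code from the PR) of building the mask from the configured maximum sequence length instead of the hard-coded 2048:

import torch

def make_chunked_prefill_mask(max_model_len: int, device: torch.device) -> torch.Tensor:
    # Upper-triangular causal mask sized by the model's maximum sequence length,
    # so sequences longer than 2048 tokens still get a correct mask.
    mask = torch.triu(torch.ones(max_model_len, max_model_len, device=device), diagonal=1)
    return mask.to(torch.int8)

Note that an int8 mask of size max_model_len x max_model_len can get large (a 32768-token limit gives roughly 1 GiB), so capping the size or building the mask lazily may be worth considering.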

@qyqc731 changed the title from "chunked prefill splitfuse op in" to "chunked prefill, access splitfuse op" on Sep 16, 2025
@rjg-lyh (Collaborator) commented Sep 17, 2025

Could you integrate other scenarios, such as full FlashAttention, using the FIA interface as well, and provide the performance test results?
