
Conversation

Contributor

@pichangping pichangping commented Oct 27, 2025

What this PR does / why we need it?

We have optimized long-sequence performance in two ways. First, we changed the input data format for attention calculation: instead of the original BSND format, we removed the logic for converting between TND and BSND and adopt the TND format directly. The TND input can be reused as-is, which shortens the data flow path; converting to BSND was an unnecessary processing step. Second, we replaced the chain of small operators that updated the attention output with the npu_attention_update fused operator to improve performance.
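The layout difference behind the first change can be sketched with NumPy stand-ins (shapes and variable names here are illustrative, not vLLM's actual code): TND packs all tokens of all requests along one axis, while BSND pads every sequence to the batch maximum and needs a pack/unpack round trip around the kernel.

```python
import numpy as np

# Hypothetical sizes: two sequences of length 3 and 2, 4 heads, head_dim 8.
seq_lens = [3, 2]
num_heads, head_dim = 4, 8
total_tokens = sum(seq_lens)

# TND layout: (total_tokens, num_heads, head_dim) -- no padding needed.
tnd = np.random.rand(total_tokens, num_heads, head_dim)

# BSND layout: (batch, max_seq_len, num_heads, head_dim) -- each sequence
# is padded to the longest one, and the result must be unpadded afterwards.
max_len = max(seq_lens)
bsnd = np.zeros((len(seq_lens), max_len, num_heads, head_dim))
offset = 0
for b, length in enumerate(seq_lens):
    bsnd[b, :length] = tnd[offset:offset + length]
    offset += length

# Consuming TND directly skips this copy (and the reverse unpad step).
print(tnd.shape, bsnd.shape)  # (5, 4, 8) (2, 3, 4, 8)
```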

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: pichangping <1337510399@qq.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message to match the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the attention mechanism to leverage native TND layout support and a new npu_attention_update kernel on Ascend NPUs, removing the manual BSND packing/unpacking and update logic. While this simplifies the code and likely improves performance, I've identified several critical issues related to potential memory leaks in graph capture mode and a bug with a hardcoded value that could cause errors with models that have large context windows. There is also a high-severity concern about unnecessary type casting that could impact performance.
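For readers unfamiliar with the second half of the change: an attention-update step merges partial attention outputs (e.g. from chunked or split-KV attention) using their log-sum-exp statistics. A fused kernel such as npu_attention_update performs this in one pass instead of several small operators. A minimal NumPy sketch of the math (the real kernel's interface may differ):

```python
import numpy as np

def merge_partial_attention(o1, lse1, o2, lse2):
    # Combine two partial attention outputs using their log-sum-exp
    # statistics -- the standard flash-attention output update that a
    # fused kernel can do in a single pass (sketch only).
    lse = np.logaddexp(lse1, lse2)        # combined normalizer
    w1 = np.exp(lse1 - lse)[..., None]    # rescaling weight, partial 1
    w2 = np.exp(lse2 - lse)[..., None]    # rescaling weight, partial 2
    return w1 * o1 + w2 * o2, lse

def softmax_attention(scores, values):
    # Reference: full softmax attention over all keys at once.
    lse = np.log(np.exp(scores).sum(-1))
    probs = np.exp(scores - lse[..., None])
    return probs @ values, lse

rng = np.random.default_rng(0)
scores = rng.standard_normal((2, 6))   # 2 queries, 6 keys
values = rng.standard_normal((6, 4))   # 6 keys, head_dim 4

# Attend to the two halves of the keys separately, then merge.
o1, l1 = softmax_attention(scores[:, :3], values[:3])
o2, l2 = softmax_attention(scores[:, 3:], values[3:])
merged, _ = merge_partial_attention(o1, l1, o2, l2)
full, _ = softmax_attention(scores, values)
print(np.allclose(merged, full))  # True: merging recovers full attention
```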

Signed-off-by: pichangping <1337510399@qq.com>
@pichangping pichangping changed the title BSND to TND and FA_UPDATE replacement [long_seq_optim] BSND to TND and FA_UPDATE replacement Oct 27, 2025
 max_seq_len = max(seq_lens, default=0)
 pcp_prefill_mask = torch.triu(
-    torch.full((num_prefills, max_seq_len, max_seq_len),
+    torch.full((2048, 2048),
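The diff above switches from a per-batch, sequence-length-sized causal mask to a single fixed-size one. A NumPy sketch of the two strategies (the 2048 tile size is taken from the diff; the assumption that the fused kernel indexes into a fixed upper-triangular mask is not verified here):

```python
import numpy as np

# Old approach: one (max_seq_len, max_seq_len) causal mask built per batch.
max_seq_len = 5
per_request_mask = np.triu(
    np.ones((max_seq_len, max_seq_len), dtype=bool), k=1)

# New approach: a single fixed 2048x2048 upper-triangular mask, built once,
# independent of the actual sequence lengths in the batch.
fixed_mask = np.triu(np.ones((2048, 2048), dtype=bool), k=1)

# The top-left (L, L) corner of the fixed mask equals the per-request mask,
# so any sequence up to 2048 tokens can reuse it.
assert (fixed_mask[:max_seq_len, :max_seq_len] == per_request_mask).all()
```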


Why is this hardcoded to (2048, 2048)?

@yiz-liu yiz-liu added the ready (read for review) and ready-for-test (start test by label for PR) labels Oct 28, 2025
@yiz-liu yiz-liu merged commit f57bdb0 into vllm-project:main Oct 29, 2025
46 of 53 checks passed
6 participants