[long_seq_optim] BSND to TND and FA_UPDATE replacement #3778
Conversation
Signed-off-by: pichangping <1337510399@qq.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the attention mechanism to leverage native TND layout support and a new npu_attention_update kernel on Ascend NPUs, removing the manual BSND packing/unpacking and update logic. While this simplifies the code and likely improves performance, I've identified several critical issues related to potential memory leaks in graph capture mode and a bug with a hardcoded value that could cause errors with models that have large context windows. There is also a high-severity concern about unnecessary type casting that could impact performance.
```diff
 max_seq_len = max(seq_lens, default=0)
 pcp_prefill_mask = torch.triu(
-    torch.full((num_prefills, max_seq_len, max_seq_len),
+    torch.full((2048, 2048),
```
What is the reason for hardcoding this to (2048, 2048)?
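For context, here is a minimal sketch (not the PR's code) contrasting the two mask constructions in the excerpt above. The fill value, dtype, and example lengths are assumptions; my understanding is that Ascend fused-attention kernels can accept a fixed upper-triangular "compressed" causal mask, which would decouple the mask size from the actual sequence lengths, but the PR authors should confirm the intent.

```python
import torch

# Hypothetical example values; num_prefills and seq_lens mirror the names in
# the diff above, while the fill value and dtype are assumptions.
seq_lens = [4, 7, 3]
num_prefills = len(seq_lens)
max_seq_len = max(seq_lens, default=0)

# Old construction: one causal mask per prefill, sized to the longest sequence,
# so the mask grows with num_prefills and max_seq_len.
dynamic_mask = torch.triu(
    torch.full((num_prefills, max_seq_len, max_seq_len), True, dtype=torch.bool),
    diagonal=1)

# New construction: a single fixed 2048 x 2048 upper-triangular mask whose size
# no longer depends on the batch or the sequence length.
fixed_mask = torch.triu(
    torch.full((2048, 2048), True, dtype=torch.bool),
    diagonal=1)

print(dynamic_mask.shape, fixed_mask.shape)
```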
What this PR does / why we need it?
We optimized long-sequence performance in two ways. First, we changed the input data format for attention computation: the logic that converted between TND and BSND is removed and the TND format is used directly. The packed TND input can be reused as-is, which shortens the data path; converting to BSND was an unnecessary intermediate step (see the sketch below).
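As an illustration only (shapes, lengths, and the scatter/gather loop are assumptions, not the PR's actual code), the removed BSND round trip looks roughly like this:

```python
import torch

# B = batch, S = padded sequence length, N = heads, D = head dim,
# T = sum of real (unpadded) sequence lengths.
B, S, N, D = 2, 8, 4, 16
seq_lens = [8, 5]

tnd = torch.randn(sum(seq_lens), N, D)  # packed TND activations

# Old path: scatter the packed TND tensor into a padded BSND buffer ...
bsnd = torch.zeros(B, S, N, D)
offset = 0
for b, length in enumerate(seq_lens):
    bsnd[b, :length] = tnd[offset:offset + length]
    offset += length

# ... run attention on BSND, then gather the output back into TND afterwards.
restored = torch.cat(
    [bsnd[b, :length] for b, length in enumerate(seq_lens)], dim=0)
assert torch.equal(restored, tnd)

# New path: keep the packed TND tensor end to end and pass it to the attention
# kernel directly, avoiding the pad/unpad copies above.
```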
Second, the attention output update, previously assembled from a chain of small operators, is replaced with the npu_attention_update fusion operator to improve performance. A hedged sketch of the generic update pattern follows.
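The sketch below shows the standard log-sum-exp merge that chains of small operators typically implement when partial attention outputs are combined; it illustrates the pattern only and is not the exact semantics or signature of npu_attention_update.

```python
import torch

def merge_partial_attention(o1, lse1, o2, lse2):
    """Combine two partial attention outputs using their log-sum-exp terms."""
    max_lse = torch.maximum(lse1, lse2)
    w1 = torch.exp(lse1 - max_lse).unsqueeze(-1)  # [T, N, 1]
    w2 = torch.exp(lse2 - max_lse).unsqueeze(-1)
    return (o1 * w1 + o2 * w2) / (w1 + w2)

# Toy shapes for illustration: T tokens, N heads, D head dim.
T, N, D = 13, 4, 16
o1, lse1 = torch.randn(T, N, D), torch.randn(T, N)
o2, lse2 = torch.randn(T, N, D), torch.randn(T, N)
merged = merge_partial_attention(o1, lse1, o2, lse2)
```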
Does this PR introduce any user-facing change?
How was this patch tested?