[Perf] Move attention update stream out of loop to optimize performance #3985

momo609 · 2025-11-04T13:04:40Z

What this PR does / why we need it?

In the update_*attn_params functions, the torch.npu.stream(update_stream) context manager was previously located inside the for-loop that updates the parameters of each layer. As a result, the update-stream context was entered and exited once per layer, adding unnecessary overhead.

This commit moves the stream context manager so that it wraps the entire for-loop. The update-stream context is now entered only once per function call rather than once per layer. This change saves 90 us in each decode model run.
[Screenshot: update stream entered in every layer (before)]

[Screenshot: update stream no longer entered in every layer (after)]
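The refactor can be illustrated with a minimal, runnable sketch. It assumes a hypothetical `StreamContextCounter` stub in place of `torch.npu.stream(update_stream)`, so no NPU or torch_npu is needed, and hypothetical `update_params_before` / `update_params_after` functions standing in for the real `update_*attn_params` bodies. The stub only counts how many times the stream context is entered; it does not reproduce the real timing.

```python
from contextlib import contextmanager


class StreamContextCounter:
    """Hypothetical stand-in for torch.npu.stream(); counts context entries."""

    def __init__(self):
        self.entries = 0

    @contextmanager
    def stream(self, update_stream):
        self.entries += 1  # one entry per `with` statement
        yield update_stream


def update_params_before(npu, update_stream, layers):
    # Old layout: the stream context is re-entered for every layer.
    for layer in layers:
        with npu.stream(update_stream):
            layer["updated"] = True  # placeholder for the per-layer param update


def update_params_after(npu, update_stream, layers):
    # New layout: a single stream context wraps the whole per-layer loop.
    with npu.stream(update_stream):
        for layer in layers:
            layer["updated"] = True  # placeholder for the per-layer param update


before = StreamContextCounter()
update_params_before(before, "update_stream", [{} for _ in range(4)])
after = StreamContextCounter()
update_params_after(after, "update_stream", [{} for _ in range(4)])
print(before.entries, after.entries)  # 4 1
```

Every layer is still updated in both versions; only the number of stream-context entries changes, which is where the per-layer overhead came from.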

Does this PR introduce any user-facing change?

How was this patch tested?

vLLM version: v0.11.0
vLLM main: vllm-project/vllm@83f478b

github-actions · 2025-11-04T13:04:49Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

A PR should do only one thing; smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

gemini-code-assist

Code Review

This pull request correctly optimizes update_attn_params by moving the stream context manager out of the loop. However, a similar change in update_mla_attn_params introduces a critical bug. The call to torch.npu.graph_task_update_begin() is now inside a conditional else block, while torch.npu.graph_task_update_end() remains unconditional. This mismatch will lead to runtime errors under certain conditions. The fix is to move torch.npu.graph_task_update_begin() outside the conditional block to ensure it's always executed.

gemini-code-assist · 2025-11-04T13:06:08Z

vllm_ascend/compilation/acl_graph.py

+                torch.npu.graph_task_update_begin(update_stream, handle)



The call to torch.npu.graph_task_update_begin() has been moved into a conditional block, but torch.npu.graph_task_update_end() is still called unconditionally. This will cause a runtime error when speculative_config and speculative_config.method == "deepseek_mtp" is true, as graph_task_update_end will be called without a corresponding graph_task_update_begin.

To fix this, torch.npu.graph_task_update_begin() should be moved out of the else block to ensure it is always called within the loop.

Suggested change (dedent the call out of the else block so it runs on every loop iteration):

torch.npu.graph_task_update_begin(update_stream, handle)
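The pairing rule the review asks for can be sketched with a hypothetical stub, `GraphTaskUpdateTracker`, that mimics `torch.npu.graph_task_update_begin` / `graph_task_update_end` by raising when an end arrives without a matching begin. The `use_mtp` flag is an assumed stand-in for `speculative_config and speculative_config.method == "deepseek_mtp"`, the argument passed to the end call is assumed, and the per-branch updates are placeholders; the real NPU runtime behavior may differ.

```python
class GraphTaskUpdateTracker:
    """Hypothetical stub for the graph task update begin/end calls."""

    def __init__(self):
        self.open_handle = None
        self.completed = 0
        self.updates = []

    def graph_task_update_begin(self, stream, handle):
        if self.open_handle is not None:
            raise RuntimeError("graph_task_update_begin called twice without end")
        self.open_handle = handle

    def graph_task_update_end(self, stream):
        if self.open_handle is None:
            raise RuntimeError("graph_task_update_end called without begin")
        self.open_handle = None
        self.completed += 1


def update_mla_params_buggy(npu, update_stream, handles, use_mtp):
    for handle in handles:
        if use_mtp:
            npu.updates.append(("mtp", handle))  # placeholder MTP update
        else:
            # Bug flagged in review: begin only runs on this branch...
            npu.graph_task_update_begin(update_stream, handle)
            npu.updates.append(("default", handle))  # placeholder update
        # ...but end runs unconditionally.
        npu.graph_task_update_end(update_stream)


def update_mla_params(npu, update_stream, handles, use_mtp):
    for handle in handles:
        # Fix: begin is unconditional, so every end has a matching begin.
        npu.graph_task_update_begin(update_stream, handle)
        if use_mtp:
            npu.updates.append(("mtp", handle))  # placeholder MTP update
        else:
            npu.updates.append(("default", handle))  # placeholder update
        npu.graph_task_update_end(update_stream)


tracker = GraphTaskUpdateTracker()
update_mla_params(tracker, "update_stream", [0, 1, 2], use_mtp=True)
print(tracker.completed)  # 3

try:
    update_mla_params_buggy(GraphTaskUpdateTracker(), "update_stream", [0], use_mtp=True)
except RuntimeError as err:
    print(err)  # graph_task_update_end called without begin
```

Keeping begin and end at the same nesting level makes the pairing hold regardless of which branch runs.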

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>

github-actions · 2025-11-06T15:11:09Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

gemini-code-assist bot reviewed Nov 4, 2025


optimize fullgraph.

f1f9126

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>

momo609 force-pushed the fullprove branch from 7b4b8cd to f1f9126 on November 5, 2025 01:09

weijinqian0 added the labels ready (read for review) and ready-for-test (start test by label for PR) on Nov 5, 2025

yiz-liu approved these changes Nov 5, 2025


github-actions bot added the merge-conflicts label Nov 6, 2025
