
Conversation

@momo609 (Collaborator) commented Nov 4, 2025

What this PR does / why we need it?

In the update_*attn_params functions, the torch.npu.stream(update_stream) context manager was previously located inside the for-loop that updates parameters for each layer. This resulted in redundant stream initiations for every layer, adding unnecessary overhead.

This commit refactors the code by moving the stream context manager to wrap the entire for-loop, so the update stream is entered only once per function call rather than once per layer. This change saves roughly 90 µs per decode model execution.
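A minimal before/after sketch of the refactor; the function and variable names (`update_attn_params_per_layer`, `layer.update_params`) are illustrative, not the actual vLLM Ascend source:

```python
import torch
import torch_npu  # noqa: F401  # Ascend PyTorch backend, assumed installed

update_stream = torch.npu.Stream()

# Before: the stream context is entered once per layer.
def update_attn_params_per_layer(layers):
    for layer in layers:
        with torch.npu.stream(update_stream):  # redundant switch on every iteration
            layer.update_params()  # hypothetical per-layer update work

# After: the context wraps the whole loop, so the stream is
# switched only once per function call.
def update_attn_params_once(layers):
    with torch.npu.stream(update_stream):
        for layer in layers:
            layer.update_params()
```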
With the update stream entered in every layer:
[profiling screenshot]

With the per-layer update stream removed:
[profiling screenshot]

Does this PR introduce any user-facing change?

How was this patch tested?

github-actions bot commented Nov 4, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request correctly optimizes update_attn_params by moving the stream context manager out of the loop. However, a similar change in update_mla_attn_params introduces a critical bug. The call to torch.npu.graph_task_update_begin() is now inside a conditional else block, while torch.npu.graph_task_update_end() remains unconditional. This mismatch will lead to runtime errors under certain conditions. The fix is to move torch.npu.graph_task_update_begin() outside the conditional block to ensure it's always executed.

Comment on lines +264 to 265
torch.npu.graph_task_update_begin(update_stream, handle)


Severity: critical

The call to torch.npu.graph_task_update_begin() has been moved into a conditional block, but torch.npu.graph_task_update_end() is still called unconditionally. This causes a runtime error when speculative_config is set and speculative_config.method == "deepseek_mtp", because graph_task_update_end is then called without a matching graph_task_update_begin.

To fix this, move torch.npu.graph_task_update_begin() out of the else block so it is always called within the loop.
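A minimal sketch of the mismatch and the suggested fix; the loop structure, the handle/speculative_config names, and the graph_task_update_end() argument list are assumptions, not the actual vLLM Ascend source:

```python
# Hypothetical reconstruction of the control flow the review describes.

# Buggy shape after the refactor:
for handle in handles:
    if speculative_config and speculative_config.method == "deepseek_mtp":
        pass  # bug: begin() is never called on this path
    else:
        torch.npu.graph_task_update_begin(update_stream, handle)
    ...  # per-layer parameter update
    torch.npu.graph_task_update_end(update_stream)  # unconditional -> unbalanced

# Fixed shape: begin() is unconditional, so every end() has a matching begin().
for handle in handles:
    torch.npu.graph_task_update_begin(update_stream, handle)
    ...  # per-layer parameter update (MTP path included)
    torch.npu.graph_task_update_end(update_stream)
```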

Suggested change: move torch.npu.graph_task_update_begin(update_stream, handle) out of the else block so it executes unconditionally inside the loop.

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
@weijinqian0 added the ready (read for review) and ready-for-test (start test by label for PR) labels on Nov 5, 2025
github-actions bot commented Nov 6, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.


Labels: merge-conflicts · ready (read for review) · ready-for-test (start test by label for PR)
