[Bugfix] ngram spec decode attention error and repeat add sampled token ids error #2972
base: main
Conversation
…en ids Signed-off-by: Icey <1790571317@qq.com>
Code Review
This pull request introduces two critical bug fixes for N-gram speculative decoding. The first corrects a data-corruption issue in NgramProposer, where sampled tokens were added redundantly and with incorrect indexing. The second addresses an attention error during the verification step by setting the correct attention state, so that the appropriate attention kernel is used. Both changes are crucial for the correctness of N-gram speculative decoding.
num_tokens = self.runner.input_batch.num_tokens_no_spec[i]
drafter_output = self.propose(
    self.runner.input_batch.token_ids_cpu[i, :end_idx])
This code block redundantly adds sampled_ids to token_ids_cpu, a task already performed in NPUModelRunner._execute_model. This redundancy is also buggy: num_tokens_no_spec is already updated, causing incorrect indexing and data corruption in the input to self.propose. This is a critical issue that leads to incorrect behavior in speculative decoding.
Suggested change:
     num_tokens = self.runner.input_batch.num_tokens_no_spec[i]
     drafter_output = self.propose(
-        self.runner.input_batch.token_ids_cpu[i, :end_idx])
+        self.runner.input_batch.token_ids_cpu[i, :num_tokens])
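To make the indexing problem concrete, here is a minimal, self-contained sketch (the array shape, token values, and counts are hypothetical, not taken from the runner) of how re-adding an already-counted sampled token and slicing with an extended end index duplicates data in the proposer input:

import numpy as np

# num_tokens_no_spec already counts the token sampled in the last step.
token_ids_cpu = np.zeros((1, 16), dtype=np.int64)
token_ids_cpu[0, :5] = [1, 2, 3, 4, 5]        # 4 prompt tokens + 1 sampled token
num_tokens_no_spec = 5
sampled_ids = [5]

# Buggy path: append the sampled token again and extend the end index past it.
end_idx = num_tokens_no_spec + len(sampled_ids)
token_ids_cpu[0, num_tokens_no_spec:end_idx] = sampled_ids
print(token_ids_cpu[0, :end_idx])             # [1 2 3 4 5 5] -> duplicated token fed to propose()

# Fixed path: slice with the count that is already correct.
print(token_ids_cpu[0, :num_tokens_no_spec])  # [1 2 3 4 5]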
elif self.drafter and self.drafter.name == SpecDcodeType.NGRAM:
    attn_state = AscendAttentionState.DecodeOnly
else:
    attn_state = AscendAttentionState.SpecDecoding
For N-gram speculative decoding, falling back to AscendAttentionState.SpecDecoding is incorrect. That state is intended for model-based proposers and uses an attention kernel (_npu_paged_attention_splitfuse) that is unsuitable for N-gram verification, leading to errors. N-gram verification is more akin to a standard decode step and requires a dedicated attention state so that the correct kernel is selected.
elif self.drafter and self.drafter.name == SpecDcodeType.NGRAM:
    attn_state = AscendAttentionState.DecodeOnly
else:
    attn_state = AscendAttentionState.SpecDecoding
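For context, the n-gram proposer drafts tokens without running a model: it looks for an earlier occurrence of the current suffix in the sequence and proposes the tokens that followed it, and the target model then verifies those drafts in what is effectively a normal decode step. The sketch below illustrates that idea only; it is a simplified stand-in, not the vllm-ascend NgramProposer, and the function name and parameters are made up:

from typing import Optional
import numpy as np

def ngram_propose(token_ids: np.ndarray, n: int = 2, k: int = 4) -> Optional[np.ndarray]:
    # Too short to contain an earlier n-gram before the suffix.
    if token_ids.size <= n:
        return None
    suffix = token_ids[-n:]
    # Scan backwards for the most recent earlier occurrence of the suffix n-gram.
    for start in range(token_ids.size - n - 1, -1, -1):
        if np.array_equal(token_ids[start:start + n], suffix):
            draft = token_ids[start + n:start + n + k]
            return draft if draft.size > 0 else None
    return None

print(ngram_propose(np.array([7, 8, 9, 3, 4, 7, 8]), n=2, k=3))  # -> [9 3 4]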
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
It would be good to add an e2e test as well.
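A minimal e2e test could compare greedy output with and without n-gram speculative decoding, since a correct verifier should not change greedy results. The sketch below is only an illustration, not part of this PR: the model name and prompt are placeholders, and the speculative_config keys follow vLLM's generic interface rather than anything confirmed here.

from vllm import LLM, SamplingParams


def test_ngram_spec_decode_matches_baseline():
    prompt = "The capital of France is"
    sampling = SamplingParams(temperature=0.0, max_tokens=32)

    # Baseline: plain greedy decoding.
    baseline = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
    ref = baseline.generate([prompt], sampling)[0].outputs[0].text
    del baseline

    # Same model with n-gram speculative decoding enabled.
    spec = LLM(
        model="Qwen/Qwen2.5-0.5B-Instruct",
        speculative_config={
            "method": "ngram",
            "num_speculative_tokens": 5,
            "prompt_lookup_max": 4,
        },
    )
    out = spec.generate([prompt], sampling)[0].outputs[0].text

    # Greedy decoding with a correct verifier should reproduce the baseline text.
    assert out == ref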
What this PR does / why we need it?
Fixes two bugs in N-gram speculative decoding: the repeated addition of already-counted sampled token ids to the proposer input, and the attention error during N-gram verification, which now uses the AscendAttentionState.DecodeOnly state instead of SpecDecoding.
Does this PR introduce any user-facing change?
N/A
How was this patch tested?
CI passed with newly added and existing tests.