vllm_ascend/spec_decode/ngram_proposer.py (8 changes: 2 additions & 6 deletions)

```diff
@@ -51,13 +51,9 @@ def generate_token_ids(self,
                 draft_token_ids.append([])
                 continue
 
-            # Add sampled_token_ids to token_ids_cpu.
-            start_idx = self.runner.input_batch.num_tokens_no_spec[i]
-            end_idx = start_idx + num_sampled_ids
-            self.runner.input_batch.token_ids_cpu[
-                i, start_idx:end_idx] = sampled_ids
+            num_tokens = self.runner.input_batch.num_tokens_no_spec[i]
             drafter_output = self.propose(
-                self.runner.input_batch.token_ids_cpu[i, :end_idx])
```
**Review comment (Contributor) on lines +54 to -60 [critical]:**

This code block redundantly adds sampled_ids to token_ids_cpu, a write that NPUModelRunner._execute_model already performs. The redundancy is also buggy: num_tokens_no_spec has already been advanced past the sampled tokens, so indexing from it writes them a second time and corrupts the input to self.propose. This is a critical issue that leads to incorrect speculative-decoding behavior (a toy illustration follows the diff below).

Suggested change:

```diff
-            num_tokens = self.runner.input_batch.num_tokens_no_spec[i]
-            drafter_output = self.propose(
-                self.runner.input_batch.token_ids_cpu[i, :end_idx])
+            num_tokens = self.runner.input_batch.num_tokens_no_spec[i]
+            drafter_output = self.propose(
+                self.runner.input_batch.token_ids_cpu[i, :num_tokens])
```

```diff
+                self.runner.input_batch.token_ids_cpu[i, :num_tokens])
             if drafter_output is None or len(drafter_output) == 0:
                 draft_token_ids.append([])
             else:
```
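To make the corruption concrete, here is a minimal, self-contained sketch of the failure mode the comment describes. The buffer size and token values are made up for illustration; only the variable names (token_ids_cpu, num_tokens_no_spec, sampled_ids) mirror the diff, and the sketch assumes, as the comment states, that the runner has already appended the sampled tokens and advanced the counter.

```python
# Toy reproduction of the double-write bug (illustrative values only).
import numpy as np

token_ids_cpu = np.zeros((1, 16), dtype=np.int64)  # one request, toy capacity
token_ids_cpu[0, :4] = [11, 12, 13, 14]            # tokens already in the batch
sampled_ids = [15]
num_sampled_ids = len(sampled_ids)

# What NPUModelRunner._execute_model has already done by this point:
# append the sampled token and advance the no-spec token counter.
token_ids_cpu[0, 4:5] = sampled_ids
num_tokens_no_spec = 5                             # already counts token 15

# The removed block then repeated the append, starting from the advanced
# counter, so the sampled token lands one slot too far and gets duplicated.
start_idx = num_tokens_no_spec                     # 5, but 15 lives at index 4
end_idx = start_idx + num_sampled_ids              # 6
token_ids_cpu[0, start_idx:end_idx] = sampled_ids

print(token_ids_cpu[0, :end_idx])             # [11 12 13 14 15 15]  corrupted
print(token_ids_cpu[0, :num_tokens_no_spec])  # [11 12 13 14 15]     the fix
```

Slicing with num_tokens_no_spec, as the merged change does, hands self.propose exactly the tokens the batch actually holds.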
vllm_ascend/worker/model_runner_v1.py (2 changes: 2 additions & 0 deletions)

```diff
@@ -1440,6 +1440,8 @@ def _build_attn_state(self, num_reqs, num_scheduled_tokens,
             if self.drafter and (self.drafter.name == SpecDcodeType.EAGLE
                                  or self.drafter.name == SpecDcodeType.EAGLE3):
                 attn_state = AscendAttentionState.ChunkedPrefill
+            elif self.drafter and self.drafter.name == SpecDcodeType.NGRAM:
+                attn_state = AscendAttentionState.DecodeOnly
             else:
                 attn_state = AscendAttentionState.SpecDecoding
```
**Review comment (Contributor) on lines +1443 to 1444 [critical]:**

For N-gram speculative decoding, falling back to AscendAttentionState.SpecDecoding is incorrect. That state is intended for model-based proposers and uses an attention kernel (_npu_paged_attention_splitfuse) that is unsuitable for N-gram verification, leading to errors. N-gram verification is closer to a standard decode step and requires a dedicated attention state so the correct kernel is selected (a sketch of the implied dispatch follows below).

The referenced branch:

```python
            elif self.drafter and self.drafter.name == SpecDcodeType.NGRAM:
                attn_state = AscendAttentionState.DecodeOnly
            else:
                attn_state = AscendAttentionState.SpecDecoding
```

```diff
 
             # splitfuse
```
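For context on why the fallback matters, below is a rough sketch of the kernel dispatch the comment implies: the Ascend attention backend picks its kernel from attn_state, so routing an NGRAM drafter through SpecDecoding would select the split-fuse kernel. The enum members appear in the diff above and _npu_paged_attention_splitfuse is named in the comment; the dispatcher function and the other kernel names are hypothetical stand-ins, not vllm-ascend APIs.

```python
# Hypothetical sketch: the attention backend picks its kernel from attn_state.
from enum import Enum, auto

class AscendAttentionState(Enum):
    ChunkedPrefill = auto()
    DecodeOnly = auto()
    SpecDecoding = auto()

def pick_attention_kernel(attn_state: AscendAttentionState) -> str:
    if attn_state is AscendAttentionState.SpecDecoding:
        # Per the comment: tuned for model-based proposers (EAGLE-style
        # drafts) and unsuitable for verifying n-gram drafts.
        return "_npu_paged_attention_splitfuse"
    if attn_state is AscendAttentionState.DecodeOnly:
        # N-gram verification looks like an ordinary decode step, so it
        # should take the plain paged-decode path (name is a stand-in).
        return "npu_paged_attention_decode"
    return "chunked_prefill_kernel"  # stand-in name

# With the new elif, an NGRAM drafter reaches DecodeOnly and therefore the
# decode kernel instead of the split-fuse kernel.
print(pick_attention_kernel(AscendAttentionState.DecodeOnly))
```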