[Bugfix] ngram spec decode attention error and repeat add sampled token ids error #2972
base: main
Conversation
…en ids Signed-off-by: Icey <1790571317@qq.com>
Code Review
This pull request introduces two critical bug fixes for N-gram speculative decoding. The first corrects a data-corruption issue in NgramProposer, where sampled tokens were added redundantly and with incorrect indexing. The second addresses an attention error during the verification step by setting the correct attention state, so that the appropriate attention kernel is used. Both changes are crucial for the correctness of N-gram speculative decoding.
num_tokens = self.runner.input_batch.num_tokens_no_spec[i]
drafter_output = self.propose(
    self.runner.input_batch.token_ids_cpu[i, :end_idx])
This code block redundantly adds sampled_ids to token_ids_cpu, a task already performed in NPUModelRunner._execute_model. This redundancy is also buggy: num_tokens_no_spec is already updated, causing incorrect indexing and data corruption in the input to self.propose. This is a critical issue that leads to incorrect behavior in speculative decoding.
Suggested change:
     num_tokens = self.runner.input_batch.num_tokens_no_spec[i]
     drafter_output = self.propose(
-        self.runner.input_batch.token_ids_cpu[i, :end_idx])
+        self.runner.input_batch.token_ids_cpu[i, :num_tokens])
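To make the indexing problem concrete, here is a minimal, self-contained sketch (the array shape, token values, and counts are hypothetical, not taken from the runner) of how re-adding an already-counted sampled token and slicing with an extended end index duplicates data in the proposer input:

import numpy as np

# num_tokens_no_spec already counts the token sampled in the last step.
token_ids_cpu = np.zeros((1, 16), dtype=np.int64)
token_ids_cpu[0, :5] = [1, 2, 3, 4, 5]        # 4 prompt tokens + 1 sampled token
num_tokens_no_spec = 5
sampled_ids = [5]

# Buggy path: append the sampled token again and extend the end index past it.
end_idx = num_tokens_no_spec + len(sampled_ids)
token_ids_cpu[0, num_tokens_no_spec:end_idx] = sampled_ids
print(token_ids_cpu[0, :end_idx])             # [1 2 3 4 5 5] -> duplicated token fed to propose()

# Fixed path: slice with the count that is already correct.
print(token_ids_cpu[0, :num_tokens_no_spec])  # [1 2 3 4 5]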
elif self.drafter and self.drafter.name == SpecDcodeType.NGRAM:
    attn_state = AscendAttentionState.DecodeOnly
else:
    attn_state = AscendAttentionState.SpecDecoding
For N-gram speculative decoding, falling back to AscendAttentionState.SpecDecoding is incorrect. That state is intended for model-based proposers and uses an attention kernel (_npu_paged_attention_splitfuse) that is unsuitable for N-gram verification, leading to errors. N-gram verification is more akin to a standard decode step and requires a dedicated attention state so that the correct kernel is selected.
elif self.drafter and self.drafter.name == SpecDcodeType.NGRAM:
    attn_state = AscendAttentionState.DecodeOnly
else:
    attn_state = AscendAttentionState.SpecDecoding
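For context, the n-gram proposer drafts tokens without running a model: it looks for an earlier occurrence of the current suffix in the sequence and proposes the tokens that followed it, and the target model then verifies those drafts in what is effectively a normal decode step. The sketch below illustrates that idea only; it is a simplified stand-in, not the vllm-ascend NgramProposer, and the function name and parameters are made up:

from typing import Optional
import numpy as np

def ngram_propose(token_ids: np.ndarray, n: int = 2, k: int = 4) -> Optional[np.ndarray]:
    # Too short to contain an earlier n-gram before the suffix.
    if token_ids.size <= n:
        return None
    suffix = token_ids[-n:]
    # Scan backwards for the most recent earlier occurrence of the suffix n-gram.
    for start in range(token_ids.size - n - 1, -1, -1):
        if np.array_equal(token_ids[start:start + n], suffix):
            draft = token_ids[start + n:start + n + k]
            return draft if draft.size > 0 else None
    return None

print(ngram_propose(np.array([7, 8, 9, 3, 4, 7, 8]), n=2, k=3))  # -> [9 3 4]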
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
It would be good to add an e2e test as well.
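A minimal e2e test could compare greedy output with and without n-gram speculative decoding, since a correct verifier should not change greedy results. The sketch below is only an illustration, not part of this PR: the model name and prompt are placeholders, and the speculative_config keys follow vLLM's generic interface rather than anything confirmed here.

from vllm import LLM, SamplingParams


def test_ngram_spec_decode_matches_baseline():
    prompt = "The capital of France is"
    sampling = SamplingParams(temperature=0.0, max_tokens=32)

    # Baseline: plain greedy decoding.
    baseline = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
    ref = baseline.generate([prompt], sampling)[0].outputs[0].text
    del baseline

    # Same model with n-gram speculative decoding enabled.
    spec = LLM(
        model="Qwen/Qwen2.5-0.5B-Instruct",
        speculative_config={
            "method": "ngram",
            "num_speculative_tokens": 5,
            "prompt_lookup_max": 4,
        },
    )
    out = spec.generate([prompt], sampling)[0].outputs[0].text

    # Greedy decoding with a correct verifier should reproduce the baseline text.
    assert out == ref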
What this PR does / why we need it?
Fixes two bugs in N-gram speculative decoding: the repeated addition of already-counted sampled token ids to the proposer input, and the attention error during N-gram verification, which now uses the AscendAttentionState.DecodeOnly state instead of SpecDecoding.
Does this PR introduce any user-facing change?
N/A
How was this patch tested?
CI passed with newly added and existing tests.