
Conversation

wxsIcey
Collaborator

@wxsIcey wxsIcey commented Sep 17, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

import os

os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams


def main():
    prompts = [
        "The quick brown fox jumps over the lazy dog.",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    # Create an LLM.
    llm = LLM(
            model="Qwen/Qwen2.5-0.5B-Instruct",
            tensor_parallel_size=1,
            speculative_config={
                "method": "ngram",
                "num_speculative_tokens": 10, # 每次最多推测 10 个 token
                "prompt_lookup_max": 1, # 最多使用 1 个 n-gram 进行匹配
            },
            enforce_eager=True,
            gpu_memory_utilization=0.8,
            max_model_len=8192,
        )

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    print(f"Outputs: {outputs}")
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()

CI passed with the newly added and existing tests.

…en ids

Signed-off-by: Icey <1790571317@qq.com>
@wxsIcey wxsIcey changed the title [Bugfix] ngram spec decode attention error and repeat add sampled tok… [Bugfix] ngram spec decode attention error and repeat add sampled token ids Sep 17, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces two critical bug fixes for N-gram speculative decoding. The first fix corrects a data corruption issue in NgramProposer where sampled tokens were being added redundantly and with incorrect indexing. The second fix addresses an attention error during the verification step of N-gram speculative decoding by setting the correct attention state, ensuring the appropriate attention kernel is used. Both changes are crucial for the correctness of N-gram speculative decoding.

Comment on lines +54 to -60
num_tokens = self.runner.input_batch.num_tokens_no_spec[i]
drafter_output = self.propose(
    self.runner.input_batch.token_ids_cpu[i, :end_idx])

critical

This code block redundantly adds sampled_ids to token_ids_cpu, a task already performed in NPUModelRunner._execute_model. This redundancy is also buggy: num_tokens_no_spec is already updated, causing incorrect indexing and data corruption in the input to self.propose. This is a critical issue that leads to incorrect behavior in speculative decoding.

Suggested change
-    num_tokens = self.runner.input_batch.num_tokens_no_spec[i]
-    drafter_output = self.propose(
-        self.runner.input_batch.token_ids_cpu[i, :end_idx])
+    num_tokens = self.runner.input_batch.num_tokens_no_spec[i]
+    drafter_output = self.propose(
+        self.runner.input_batch.token_ids_cpu[i, :num_tokens])
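
To make the indexing problem concrete, here is a minimal sketch (hypothetical NumPy buffers standing in for the NPUModelRunner/NgramProposer state; the names token_ids_cpu, num_tokens_no_spec, and sampled_ids are reused for illustration only) of how re-appending the sampled ids duplicates the last token in the drafter's input:

import numpy as np

# Per-request token buffer and its valid-length counter (hypothetical shapes).
token_ids_cpu = np.zeros((1, 16), dtype=np.int64)
token_ids_cpu[0, :5] = [11, 12, 13, 14, 15]   # prompt + previously accepted tokens
num_tokens_no_spec = np.array([5])

# In _execute_model the runner already writes the newly sampled token
# into the buffer and advances the counter.
sampled_ids = [99]
token_ids_cpu[0, 5] = sampled_ids[0]
num_tokens_no_spec[0] += 1                                   # now 6

# Correct drafter input: slice with the already-updated counter.
correct = token_ids_cpu[0, :num_tokens_no_spec[0]]           # [11 12 13 14 15 99]

# Buggy variant: append sampled_ids a second time and slice past the counter.
end_idx = num_tokens_no_spec[0] + len(sampled_ids)            # 7
token_ids_cpu[0, num_tokens_no_spec[0]] = sampled_ids[0]      # duplicate write
buggy = token_ids_cpu[0, :end_idx]                            # [11 12 13 14 15 99 99]

print("correct:", correct.tolist())   # last sampled token appears once
print("buggy:  ", buggy.tolist())     # last sampled token appears twice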

Comment on lines +1443 to 1444
            elif self.drafter and self.drafter.name == SpecDcodeType.NGRAM:
                attn_state = AscendAttentionState.DecodeOnly
            else:
                attn_state = AscendAttentionState.SpecDecoding

critical

For N-gram speculative decoding, falling back to AscendAttentionState.SpecDecoding is incorrect. This state is intended for model-based proposers and uses an attention kernel (_npu_paged_attention_splitfuse) that is unsuitable for N-gram verification, leading to errors. N-gram verification is more akin to a standard decode step and requires a dedicated attention state to use the correct kernel.

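
For context, here is a simplified sketch of the attention-state selection this hunk belongs to (the enums and drafter class are stubbed out and the surrounding branches are omitted; only the n-gram vs. model-based distinction described in the comment above is shown):

from enum import Enum, auto

class SpecDcodeType(Enum):          # identifier spelled as in the vllm-ascend code
    NGRAM = auto()

class AscendAttentionState(Enum):
    DecodeOnly = auto()             # plain paged-attention decode kernel
    SpecDecoding = auto()           # kernel path for model-based drafters
                                    # (_npu_paged_attention_splitfuse)

def select_attn_state(drafter):
    # N-gram verification has no draft-model forward pass; it behaves like a
    # normal decode step with a few extra query tokens, so DecodeOnly is used.
    if drafter is not None and drafter.name == SpecDcodeType.NGRAM:
        return AscendAttentionState.DecodeOnly
    # Model-based proposers keep the dedicated spec-decoding attention path.
    return AscendAttentionState.SpecDecoding

class NgramDrafter:                 # stand-in for the n-gram proposer
    name = SpecDcodeType.NGRAM

assert select_attn_state(NgramDrafter()) is AscendAttentionState.DecodeOnly
assert select_attn_state(None) is AscendAttentionState.SpecDecoding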


👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@wxsIcey wxsIcey changed the title [Bugfix] ngram spec decode attention error and repeat add sampled token ids [Bugfix] ngram spec decode attention error and repeat add sampled token ids error Sep 17, 2025
@wangxiyuan
Collaborator

It's good to add an e2e test as well.
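
A minimal pytest-style sketch of such an e2e test, reusing the reproduction script from the PR description (the test name, prompt, and assertion are illustrative only, not part of this PR):

from vllm import LLM, SamplingParams


def test_ngram_spec_decode_e2e():
    llm = LLM(
        model="Qwen/Qwen2.5-0.5B-Instruct",
        tensor_parallel_size=1,
        speculative_config={
            "method": "ngram",
            "num_speculative_tokens": 10,
            "prompt_lookup_max": 1,
        },
        enforce_eager=True,
        gpu_memory_utilization=0.8,
        max_model_len=8192,
    )
    outputs = llm.generate(
        ["The quick brown fox jumps over the lazy dog."],
        SamplingParams(temperature=0.0, max_tokens=32),
    )
    # The run should finish without the spec-decode attention error fixed in
    # this PR and produce non-empty text.
    assert outputs and outputs[0].outputs[0].text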
