
[Guide]: Usage on AscendScheduler in vLLM Ascend #788


Open
MengqingCao opened this issue May 8, 2025 · 2 comments
MengqingCao commented May 8, 2025

Why use AscendScheduler in vLLM Ascend

We can enable AscendScheduler to accelerate inference when using the V1 engine.

AscendScheduler is a V0-style scheduling schema that splits requests into prefill and decode for processing. With AscendScheduler enabled, V1 requests are divided into prefill requests, decode requests, and mixed requests. Because the attention operators used for prefill and decode requests perform better than the one used for mixed requests, this brings a performance improvement.

How to use AscendScheduler in vLLM Ascend

Adding ascend_scheduler_config to additional_config when creating an LLM enables AscendScheduler while using V1.

Please refer to the following example:

import os

from vllm import LLM, SamplingParams

# Enable V1Engine
os.environ["VLLM_USE_V1"] = "1"

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)

# Create an LLM with AscendScheduler
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    additional_config={
        'ascend_scheduler_config': {},
    },
)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Advanced

If you want to enable chunked-prefill in AscendScheduler, set additional_config={"ascend_scheduler_config": {"enable_chunked_prefill": True}}, as shown in the sketch below.
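
A minimal sketch of this configuration, reusing the model and V1 setup from the example above (only the extra enable_chunked_prefill key differs):

import os

from vllm import LLM

# Enable V1 engine, as in the basic example above.
os.environ["VLLM_USE_V1"] = "1"

# Create an LLM with AscendScheduler and chunked-prefill enabled.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    additional_config={
        "ascend_scheduler_config": {
            "enable_chunked_prefill": True,
        },
    },
)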

Note

Performance may currently deteriorate when chunked-prefill is enabled.

whx-sjtu commented May 8, 2025

If you want to enable chunked-prefill in AscendScheduler (performance may currently deteriorate when this feature is enabled), set additional_config={"ascend_scheduler_config": {"enable_chunked_prefill": True}}.
Currently, AscendScheduler provides a V0-style scheduling schema in the V1 engine. More features will be added in the future.

MengqingCao (Collaborator, Author) commented

If you want to enable chunked-prefill in AscendScheduler (performance may currently deteriorate when this feature is enabled), set additional_config={"ascend_scheduler_config": {"enable_chunked_prefill": True}}. Currently, AscendScheduler provides a V0-style scheduling schema in the V1 engine. More features will be added in the future.

Thanks for the additional notes, I'll update the usage guide above accordingly.
