Why use AscendScheduler in vLLM Ascend
Enabling AscendScheduler can accelerate inference when using the V1 engine.
AscendScheduler is a V0-style scheduling scheme that divides requests into prefill and decode for processing. With AscendScheduler enabled, V1 requests are divided into prefill requests, decode requests, and mixed requests. Because the attention operators used for prefill and decode perform better than the one used for mixed requests, this split improves performance.
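To make the prefill / decode / mixed distinction concrete, here is a toy sketch. It is not vLLM Ascend code: `ToyRequest`, `classify_batch`, and the rule "a request is in prefill until its whole prompt has been computed" are all assumptions made for illustration, and the real scheduler may draw the lines differently.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a scheduled request (not a vLLM class)."""
    prompt_len: int           # number of prompt tokens
    num_computed_tokens: int  # prompt tokens already processed

    @property
    def is_prefill(self) -> bool:
        # Assumption: a request stays in prefill until its whole prompt is computed.
        return self.num_computed_tokens < self.prompt_len


def classify_batch(batch: list[ToyRequest]) -> str:
    """Label a batch as 'prefill', 'decode', or 'mixed'."""
    if not batch:
        raise ValueError("batch must not be empty")
    phases = {req.is_prefill for req in batch}
    if phases == {True}:
        return "prefill"
    if phases == {False}:
        return "decode"
    # Both phases are present in the same batch.
    return "mixed"
```

Under this reading, the performance argument above is that a scheduler which mostly produces "prefill" or "decode" batches can use the faster attention operators more often than one which mostly produces "mixed" batches.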
How to use AscendScheduler in vLLM Ascend
Adding ascend_scheduler_config to additional_config when creating an LLM enables AscendScheduler under V1.
Please refer to the following example:
import os

from vllm import LLM, SamplingParams

# Enable V1 Engine
os.environ["VLLM_USE_V1"] = "1"

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM with AscendScheduler
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    additional_config={
        'ascend_scheduler_config': {},
    },
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Advanced
If you want to enable chunked-prefill in AscendScheduler, set additional_config={"ascend_scheduler_config": {"enable_chunked_prefill": True}}
Note
Currently, performance may deteriorate when chunked-prefill is enabled.
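As a minimal sketch of the two configurations described in this guide, the snippet below builds the additional_config dictionary with and without chunked-prefill. The helper names `build_additional_config` and `create_llm` are hypothetical and not part of vLLM or vLLM Ascend; only `VLLM_USE_V1`, `LLM`, `additional_config`, `ascend_scheduler_config`, and `enable_chunked_prefill` come from the guide itself.

```python
from typing import Any


def build_additional_config(enable_chunked_prefill: bool = False) -> dict[str, Any]:
    """Hypothetical helper: build additional_config with AscendScheduler enabled."""
    scheduler_config: dict[str, Any] = {}
    if enable_chunked_prefill:
        # Opt in explicitly; the guide notes this may currently hurt performance.
        scheduler_config["enable_chunked_prefill"] = True
    # An empty ascend_scheduler_config is enough to enable AscendScheduler.
    return {"ascend_scheduler_config": scheduler_config}


def create_llm(model: str, enable_chunked_prefill: bool = False):
    """Hypothetical helper: create an LLM on the V1 engine with AscendScheduler."""
    import os

    os.environ["VLLM_USE_V1"] = "1"  # enable the V1 engine, as in the example above
    # Imported lazily so the config helper can be used without vLLM installed.
    from vllm import LLM

    return LLM(
        model=model,
        additional_config=build_additional_config(enable_chunked_prefill),
    )
```

For example, `create_llm("Qwen/Qwen2.5-0.5B-Instruct", enable_chunked_prefill=True)` would pass the same dictionary shown in the Advanced section. Keeping the flag off by default matches the note above about possible performance regressions.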
If you want to enable chunked-prefill in AscendScheduler, set additional_config={"ascend_scheduler_config": {"enable_chunked_prefill": True}}. Note that performance may currently deteriorate when this feature is enabled.
Currently, AscendScheduler provides a V0-style scheduling scheme in the V1 engine. More features will be added in the future.
Thanks for the additional notes. I'll add them to this usage guide.