Releases · vllm-project/vllm-ascend

22 Jun 07:08

Yikun

v0.9.1rc1

c30ddb8

v0.9.1rc1 Pre-release


This is the 1st release candidate of v0.9.1 for vLLM Ascend. Please follow the official doc to get started.

Experimental

The Atlas 300I series is experimentally supported in this release (functional tests passed with Qwen2.5-7b-instruct, Qwen2.5-0.5b, Qwen3-0.6B, Qwen3-4B, and Qwen3-8B). #1333
Support EAGLE-3 for speculative decoding. #1032

After careful consideration, the features above will NOT be included in the v0.9.1-dev branch (the v0.9.1 final release), taking into account v0.9.1 release quality and the rapid iteration of these features. We will improve them in 0.9.2rc1 and later.

Core

Ascend PyTorch adapter (torch_npu) has been upgraded to 2.5.1.post1.dev20250528. Don’t forget to update it in your environment. #1235
Support the Atlas 300I series container image. You can get it from quay.io.
Fix token-wise padding mechanism to make multi-card graph mode work. #1300
Upgrade vLLM to 0.9.1. #1165

Other Improvements

Initial support for chunked prefill with MLA. #1172
An example of best practices to run DeepSeek with ETP has been added. #1101
Performance improvements for DeepSeek using the TorchAir graph. #1098, #1131
Supports the speculative decoding feature with AscendScheduler. #943
Improve VocabParallelEmbedding custom op performance. It will be enabled in the next release. #796
Fixed a device discovery and setup bug when running vLLM Ascend on Ray. #884
DeepSeek with MC2 (Merged Compute and Communication) now works properly. #1268
Fixed log2phy NoneType bug with static EPLB feature. #1186
Improved performance for DeepSeek with DBO enabled. #997, #1135
Refactored AscendFusedMoE. #1229
Added an initial user stories page (including LLaMA-Factory/TRL/verl/MindIE Turbo/GPUStack). #1224
Add unit test framework #1201

Known Issues

In some cases, the vLLM process may crash with a GatherV3 error when aclgraph is enabled. We are working on this issue and will fix it in the next release. #1038
The prefix cache feature does not work when the Ascend Scheduler is used without chunked prefill enabled. This will be fixed in the next release. #1350

Full Changelog

v0.9.0rc2...v0.9.1rc1

New Contributors

@farawayboat made their first contribution in #1333
@yzim made their first contribution in #1159
@chenwaner made their first contribution in #1098
@wangyanhui-cmss made their first contribution in #1184
@songshanhu07 made their first contribution in #1186
@yuancaoyaoHW made their first contribution in #1032


Contributors

farawayboat, yzim, and 4 other contributors


10 Jun 14:29

wangxiyuan

v0.9.0rc2

8dd686d

v0.9.0rc2 Pre-release


This is the 2nd official release candidate of v0.9.0 for vllm-ascend. Please follow the official doc to get started. From this release on, the V1 Engine is recommended. The V0 Engine code is frozen and will no longer be maintained. Set the environment variable VLLM_USE_V1=1 to enable the V1 Engine.
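
As a minimal sketch of the setting described above, the variable can also be set from Python before the engine is constructed in the same process. The helper name enable_v1_engine is hypothetical; only the variable name VLLM_USE_V1 and the value 1 come from this release note.

```python
import os


def enable_v1_engine(environ=os.environ):
    """Set VLLM_USE_V1=1 in the given environment mapping and return it.

    Hypothetical helper for illustration: it writes only the variable named
    in the release note and leaves every other entry untouched.
    """
    environ["VLLM_USE_V1"] = "1"
    return environ


# Call this before constructing the engine in the same process.
enable_v1_engine()
print(os.environ["VLLM_USE_V1"])
```

Setting the variable in the shell that launches the process is equivalent; the Python form is mainly convenient in notebooks and test scripts.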

Highlights

DeepSeek works with graph mode now. Follow the official doc to try it. #789
Qwen series models work with graph mode now. It is enabled by default with the V1 Engine. Please note that in this release, only Qwen series models are well tested with graph mode. We'll make it stable and generalize it in the next release. If you hit any issues, please open an issue on GitHub and temporarily fall back to eager mode by setting enforce_eager=True when initializing the model.
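
The eager-mode fallback mentioned above might look like the following sketch. It assumes the model is created through the vllm.LLM class, which accepts enforce_eager as a keyword argument; the helper names and the model name are illustrative only.

```python
def build_llm_kwargs(model, eager_fallback=False):
    """Build keyword arguments for vllm.LLM.

    enforce_eager=True is the switch named in the release note; it is only
    added when the eager-mode fallback is requested.
    """
    kwargs = {"model": model}
    if eager_fallback:
        kwargs["enforce_eager"] = True
    return kwargs


def make_llm(model, eager_fallback=False):
    """Create the engine. vllm is imported lazily so that build_llm_kwargs
    stays usable on machines without vllm installed."""
    from vllm import LLM

    return LLM(**build_llm_kwargs(model, eager_fallback))


# Model name chosen for illustration only.
print(build_llm_kwargs("Qwen/Qwen2.5-7B-Instruct", eager_fallback=True))
```

Keeping the fallback behind a single boolean makes it easy to remove once graph mode is stable for the model in question.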

Core

The performance of the multi-step scheduler has been improved. Thanks for the contribution from China Merchants Bank. #814
LoRA, Multi-LoRA, and Dynamic Serving are supported on the V1 Engine now. Thanks for the contribution from China Merchants Bank. #893
The prefix cache and chunked prefill features work now. #782 #844
Spec decode and MTP features work with V1 Engine now. #874 #890
DP feature works with DeepSeek now. #1012
Input embedding feature works with V0 Engine now. #916
Sleep mode feature works with V1 Engine now. #1084

Model

Qwen2.5 VL works with V1 Engine now. #736
Llama4 works now. #740
A new DeepSeek model variant that uses dual-batch overlap (DBO) is added. Please set VLLM_ASCEND_ENABLE_DBO=1 to use it. #941

Other

Online serving with Ascend quantization works now. #877
A batch of bugs for graph mode and MoE models has been fixed. #773 #771 #774 #816 #817 #819 #912 #897 #961 #958 #913 #905
A batch of performance improvement PRs has been merged. #784 #803 #966 #839 #970 #947 #987 #1085
From this release on, a binary wheel package is released as well. #775
The contributor doc site has been added.

Known Issue

In some cases, the vLLM process may crash when aclgraph is enabled. We're working on this issue and it will be fixed in the next release. #1038
Multi-node data parallel doesn't work with this release. This is a known issue in vLLM and has been fixed on the main branch. #18981

New Contributors

@chris668899 made their first contribution in #771
@NeverRaR made their first contribution in #789
@cxcxflying made their first contribution in #740
@22dimensions made their first contribution in #835
@wonderful199082 made their first contribution in #814
@yangpuPKU made their first contribution in #937
@ttanzhiqiang made their first contribution in #909
@ponix-j made their first contribution in #874
@XWFAlone made their first contribution in #890
@NINGBENZHE made their first contribution in #896
@momo609 made their first contribution in #970
@David9857 made their first contribution in #947
@depeng1994 made their first contribution in #1013
@hahazhky made their first contribution in #987
@weijinqian0 made their first contribution in #1067
@sdmyzlp made their first contribution in #1091
@zxdukki made their first contribution in #941
@ChenTaoyu-SJTU made their first contribution in #736
@Yuxiao-Xu made their first contribution in #1116

Contributors

wonderful199082, weijinqian0, and 17 other contributors


10 Jun 01:17

wangxiyuan

v0.9.0rc1

706de02

v0.9.0rc1 Pre-release


This is just a pre-release for 0.9.0. There are still some known bugs in this release.


29 May 09:50

wangxiyuan

v0.7.3.post1

c69ceac

v0.7.3.post1

This is the first post release of 0.7.3. Please follow the official doc to start the journey. It includes the following changes:

Highlights

Qwen3 and Qwen3MOE are supported now. The performance and accuracy of Qwen3 are well tested. You can try it now. MindIE Turbo is recommended to improve the performance of Qwen3. #903 #915
Added a new performance guide. The guide aims to help users improve vllm-ascend performance at the system level. It covers OS configuration, library optimization, deployment guidance, and so on. #878 Doc Link

Bug Fix

Qwen2.5-VL works for RLHF scenarios now. #928
Users can launch the model from online weights now, e.g. directly from Hugging Face or ModelScope. #858 #918
The meaningless log info UserWorkspaceSize0 has been cleaned. #911
The log level for Failed to import vllm_ascend_C has been changed to warning instead of error. #956
DeepSeek MLA now works with chunked prefill in V1 Engine. Please note that V1 engine in 0.7.3 is just experimental and only for test usage. #849 #936

Docs

The benchmark doc is updated for Qwen2.5 and Qwen2.5-VL. #792
Added a note to clarify that only "modelscope<1.23.0" works with 0.7.3. #954
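
To check an environment against the modelscope constraint noted above, a minimal sketch could compare the installed version with the upper bound. The helper names are hypothetical, and the comparison is a simplification that looks only at the leading numeric components of the version string.

```python
import re


def parse_release(version):
    """Return the leading numeric components of a version string as a tuple,
    e.g. "1.22.3" -> (1, 22, 3). Suffixes such as "rc1" are ignored."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"not a version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def satisfies_modelscope_pin(version, upper_bound="1.23.0"):
    """True if `version` is strictly below the upper bound from the note."""
    found, bound = parse_release(version), parse_release(upper_bound)
    # Pad with zeros so that "1.23" and "1.23.0" compare as equal.
    width = max(len(found), len(bound))
    found += (0,) * (width - len(found))
    bound += (0,) * (width - len(bound))
    return found < bound


# Usage sketch (requires modelscope to be installed):
# from importlib.metadata import version
# print(satisfies_modelscope_pin(version("modelscope")))
```

For anything beyond a quick sanity check, letting pip enforce the specifier at install time is the more reliable route.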


08 May 13:38

Yikun

v0.7.3

779eebb

v0.7.3

🎉 Hello, World!

We are excited to announce the release of 0.7.3 for vllm-ascend. This is the first official release. The functionality, performance, and stability of this release are fully tested and verified. We encourage you to try it out and provide feedback. We'll post bug fix versions in the future if needed. Please follow the official doc to start the journey.

Highlights

This release includes all features landed in the previous release candidates (v0.7.1rc1, v0.7.3rc1, v0.7.3rc2), and all of them are fully tested and verified. Visit the official doc to get the detailed feature and model support matrix.
Upgrade CANN to 8.1.RC1 to enable the chunked prefill and automatic prefix caching features. You can enable them now.
Upgrade PyTorch to 2.5.1. vLLM Ascend no longer relies on the dev version of torch-npu, so users don't need to install torch-npu by hand. The 2.5.1 version of torch-npu will be installed automatically. #662
Integrate MindIE Turbo into vLLM Ascend to improve DeepSeek V3/R1, Qwen 2 series performance. #708

Core

LoRA, Multi-LoRA, and Dynamic Serving are supported now. The performance will be improved in the next release. Please follow the official doc for more usage information. Thanks for the contribution from China Merchants Bank. #700

Model

The performance of Qwen2 VL and Qwen2.5 VL is improved. #702
The performance of the apply_penalties and topKtopP ops is improved. #525

Other

Fixed an issue that may lead to a CPU memory leak. #691 #712
A new environment variable SOC_VERSION is added. If you hit any SoC detection error when building with custom ops enabled, please set SOC_VERSION to a suitable value. #606
An openEuler container image is available with the v0.7.3-openeuler tag. #665
Prefix cache feature works on V1 engine now. #559


06 May 15:53

Yikun

v0.8.5rc1

ec27af3

v0.8.5rc1 Pre-release


This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the official doc to start the journey.

Experimental: You can now enable the V1 engine by setting the environment variable VLLM_USE_V1=1. See the feature support status of vLLM Ascend here.

Highlights

Upgrade CANN version to 8.1.RC1 to support chunked prefill and automatic prefix caching (--enable_prefix_caching) when V1 is enabled. #747
Optimize Qwen2 VL and Qwen 2.5 VL. #701
Improve DeepSeek V3 eager mode and graph mode performance. You can now use --additional_config={'enable_graph_mode': True} to enable graph mode. #598 #731
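
The graph mode option above is written as a Python literal. A minimal sketch, assuming the command-line parser reads the flag value as JSON (an assumption, not something this note states), is to generate the value with json.dumps so that quoting and the lowercase true are correct. The helper name is hypothetical.

```python
import json


def graph_mode_cli_arg(enabled=True):
    """Return the flag from the release note with a JSON-encoded value.

    The flag name and the enable_graph_mode key come from the note; the
    JSON encoding of the value is an assumption about the parser.
    """
    payload = json.dumps({"enable_graph_mode": enabled})
    return f"--additional_config={payload}"


print(graph_mode_cli_arg())
```

When the argument is passed through a shell, the whole value needs single quotes around it so that the inner double quotes survive.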

Core

Upgrade vLLM to 0.8.5.post1 #715
Fix early return in CustomDeepseekV2MoE.forward during profile_run #682
Adapt to the new quantized model format generated by modelslim. #719
Initial support on P2P Disaggregated Prefill based on llm_datadist #694
Use /vllm-workspace as the code path and include .git in the container image to fix an issue when starting vLLM under /workspace. #726
Optimize NPU memory usage to make DeepSeek R1 W8A8 32K model len work. #728
Fix PYTHON_INCLUDE_PATH typo in setup.py #762

Other

Add Qwen3-0.6B test #717
Add nightly CI #668
Add accuracy test report #542

Known issue

If you run DeepSeek with VLLM_USE_V1=1 enabled, you may encounter the error "call aclnnInplaceCopy failed". Please refer to #778 for the fix.


28 Apr 23:09

Yikun

v0.8.4rc2

1fce70a

v0.8.4rc2 Pre-release


This is the second release candidate of v0.8.4 for vllm-ascend. Please follow the official doc to start the journey. Some experimental features are included in this version, such as W8A8 quantization and EP/DP support. We'll make them stable enough in the next release.

Highlights

Qwen3 and Qwen3MOE are supported now. Please follow the official doc to run the quick demo. #709
The Ascend W8A8 quantization method is supported now. See the official doc for an example. Any feedback is welcome. #580
DeepSeek V3/R1 works with DP, TP and MTP now. Please note that it's still in experimental status. Let us know if you hit any problem. #429 #585 #626 #636 #671

Core

The torch.compile feature is supported with the V1 engine now. It's disabled by default because this feature relies on the CANN 8.1 release. We'll make it available by default in the next release. #426
Upgrade PyTorch to 2.5.1. vLLM Ascend no longer relies on the dev version of torch-npu, so users don't need to install torch-npu by hand. The 2.5.1 version of torch-npu will be installed automatically. #661

Other

MiniCPM model works now. #645
An openEuler container image is available with the v0.8.4-openeuler tag, and the custom ops build is enabled by default for openEuler OS. #689
Fix ModuleNotFoundError bug to make LoRA work. #600
Add "Using EvalScope evaluation" doc. #611
Add a VLLM_VERSION environment variable to make the vLLM version configurable, which helps developers set the correct vLLM version if the vLLM code has been changed by hand locally. #651


18 Apr 11:31

wangxiyuan

v0.8.4rc1

086423d

v0.8.4rc1 Pre-release


This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the official doc to start the journey. From this version on, vllm-ascend will follow the newest version of vllm and release every two weeks. For example, if vllm releases v0.8.5 in the next two weeks, vllm-ascend will release v0.8.5rc1 instead of v0.8.4rc2. See the official documentation for details.
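
The naming rule in the example above can be sketched as a small function. This is an illustration of the stated rule only, not the project's release tooling; the function name is hypothetical, and it assumes tags of the form vX.Y.ZrcN and an upstream version that is never older than the last tag.

```python
import re

# Matches release-candidate tags such as "v0.8.4rc1".
RC_TAG = re.compile(r"^v(\d+\.\d+\.\d+)rc(\d+)$")


def next_rc_tag(last_tag, upstream_version):
    """Return the next release-candidate tag under the rule in the note.

    If upstream vLLM is still on the same base version, bump the rc number;
    otherwise start again at rc1 on the new upstream version.
    """
    match = RC_TAG.match(last_tag)
    if match is None:
        raise ValueError(f"unexpected tag format: {last_tag!r}")
    base, rc_number = match.group(1), int(match.group(2))
    upstream = upstream_version.lstrip("v")
    if upstream == base:
        return f"v{base}rc{rc_number + 1}"
    return f"v{upstream}rc1"


print(next_rc_tag("v0.8.4rc1", "v0.8.5"))
print(next_rc_tag("v0.8.4rc1", "v0.8.4"))
```

The first call reproduces the case from the note (v0.8.5rc1); the second shows the other branch, where upstream has not moved and the candidate number is bumped instead.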

Highlights

Experimental support for the vLLM V1 engine is included in this version. Visit the official guide for more details. By default, vLLM falls back to V0 if V1 doesn't work; set the environment variable VLLM_USE_V1=1 if you want to force V1.
LoRA, Multi-LoRA, and Dynamic Serving are supported now. The performance will be improved in the next release. Please follow the official doc for more usage information. Thanks for the contribution from China Merchants Bank. #521
The Sleep Mode feature is supported. Currently it only works on the V0 engine. V1 engine support will come soon. #513

Core

The Ascend scheduler is added for the V1 engine. This scheduler is better suited to Ascend hardware. More scheduler policies will be added in the future. #543
The Disaggregated Prefill feature is supported. Currently only 1P1D works. NPND is under design by the vLLM team, and vllm-ascend will support it once it's ready in vLLM. Follow the official guide to use it. #432
The spec decode feature works now. Currently it only works on the V0 engine. V1 engine support will come soon. #500
Structured output feature works now on V1 Engine. Currently it only supports xgrammar backend while using guidance backend may get some errors. #555

Other

A new communicator, pyhccl, is added. It calls the CANN HCCL library directly instead of going through torch.distributed. More usage of it will be added in the next release. #503
The custom ops build is enabled by default. You should install packages such as gcc and cmake first to build vllm-ascend from source. Set the environment variable COMPILE_CUSTOM_KERNELS=0 to disable the compilation if you don't need it. #466
The custom rotary embedding op is enabled by default now to improve performance. #555
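
For source builds that should skip the custom ops compilation mentioned above, a minimal sketch is to prepare the build environment in Python. The helper name is hypothetical, and the commented pip command assumes the current directory is a vllm-ascend source checkout.

```python
import os


def env_without_custom_kernels(base_env=None):
    """Return a copy of the environment with COMPILE_CUSTOM_KERNELS=0.

    Copying keeps the calling process's own environment unchanged; only
    the build subprocess sees the override.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["COMPILE_CUSTOM_KERNELS"] = "0"
    return env


# Usage sketch, run from a vllm-ascend source checkout (assumed layout):
# import subprocess, sys
# subprocess.run(
#     [sys.executable, "-m", "pip", "install", "-e", "."],
#     env=env_without_custom_kernels(),
#     check=True,
# )
```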


29 Mar 01:12

wangxiyuan

v0.7.3rc2

00459ae

v0.7.3rc2 Pre-release


This is the 2nd release candidate of v0.7.3 for vllm-ascend. Please follow the official doc to start the journey.

Quickstart with container: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/quick_start.html
Installation: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/installation.html

Highlights

Add the Ascend Custom Ops framework. Developers can now write custom ops using AscendC. An example op, rotary_embedding, is added. More tutorials will come soon. Custom ops compilation is disabled by default when installing vllm-ascend. Set COMPILE_CUSTOM_KERNELS=1 to enable it. #371
The V1 engine has basic support in this release. Full support will come in the 0.8.X releases. If you hit any issues or have any requirements for the V1 engine, please tell us here. #376
Prefix cache feature works now. You can set enable_prefix_caching=True to enable it. #282

Core

Bump torch_npu version to dev20250320.3 to improve accuracy and fix the "!!!" output problem. #406

Model

The performance of Qwen2-vl is improved by optimizing patch embedding (Conv3D). #398

Other

Fixed a bug to make sure the multi-step scheduler feature works. #349
Fixed a bug so that the prefix cache feature produces output with correct accuracy. #424


14 Mar 04:19

wangxiyuan

v0.7.3rc1

f025df0

v0.7.3rc1 Pre-release


🎉 Hello, World! This is the first release candidate of v0.7.3 for vllm-ascend. Please follow the official doc to start the journey.

Quickstart with container: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/quick_start.html
Installation: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/installation.html

Highlights

DeepSeek V3/R1 works well now. Read the official guide to start! #242
Speculative decoding feature is supported. #252
Multi step scheduler feature is supported. #300

Core

Bump torch_npu version to dev20250308.3 to improve _exponential accuracy.
Added initial support for pooling models. BERT-based models, such as BAAI/bge-base-en-v1.5 and BAAI/bge-reranker-v2-m3, work now. #229

Model

The performance of Qwen2-VL is improved. #241
MiniCPM is now supported. #164

Other

Support MTP(Multi-Token Prediction) for DeepSeek V3/R1 #236
[Docs] Added more model tutorials, including DeepSeek, QwQ, Qwen and Qwen 2.5VL. See the official doc for details.
Pin modelscope<1.23.0 on vLLM v0.7.3 to resolve: vllm-project/vllm#13807

Known issues

In some cases, especially when the input/output is very long with a VL model, the output may be inaccurate. You may see many ! characters or other unreadable characters in the output. We are working on it, and it will be fixed in the next release.
Garbled model output has been improved and reduced. If you still hit the issue, try changing generation config values such as temperature, and try again. Any feedback is welcome. #277

