This release contains 180 commits from 84 contributors (25 new contributors!).
Highlights
This release includes important accuracy fixes for Llama4 models. If you are using Llama4, we highly recommend updating.
Model
- Llama4 (#16113, #16509) bug fixes and enhancements:
  - QK norm should not be shared across heads (#16311)
  - Enable attention temperature tuning by default for long context (>32k) (#16439)
  - Fix index error when a single request is near max context (#16209)
  - Add tuned FusedMoE kernel config for Llama4 Scout, TP=8 on H100 (#16488)
  - Update to transformers==4.51.1 (#16257)
  - Added chat templates for LLaMa4 pythonic tool calling (#16463)
  - Optimized topk for topk=1 (#16512)
  - Add warning for Attention backends that do not support irope yet (#16212)
- Support Qwen3 and Qwen3MoE (#15289), smolvlm (#16017), jinaai/jina-embeddings-v3 (#16120), InternVL3 (#16495), GLM-4-0414 (#16338); see the offline-inference sketch after this list
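A minimal offline-inference sketch for the newly supported text models (the checkpoint name below is a placeholder; substitute the actual Qwen3 / GLM-4-0414 / InternVL3 repository you intend to run):

```python
# Minimal sketch: offline inference with one of the newly supported model families.
# The checkpoint name is a placeholder, not a recommendation.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")  # placeholder checkpoint name
outputs = llm.generate(
    ["Explain KV cache reuse in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```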
API
- Estimate max-model-len using available KV cache memory. The error message now hints at how to set `--max-model-len` (#16168)
- Add hf_token to EngineArgs (#16093)
- Enable regex support with xgrammar in V0 engine (#13228)
- Support matryoshka representation / support embedding API dimensions (#16331); see the embeddings example after this list
- Add bucket for `request_latency`, `time_to_first_token` and `time_per_output_token` (#15202)
- Support for TorchAO quantization (#14231)
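A minimal sketch of the new embedding `dimensions` parameter, assuming an OpenAI-compatible vLLM server is already running locally with a matryoshka-capable embedding model (for example `vllm serve jinaai/jina-embeddings-v3 --trust-remote-code`):

```python
# Minimal sketch, assuming a local vLLM OpenAI-compatible server is serving a
# matryoshka-capable embedding model such as jinaai/jina-embeddings-v3.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The `dimensions` parameter added in #16331 truncates the matryoshka embedding.
resp = client.embeddings.create(
    model="jinaai/jina-embeddings-v3",
    input=["vLLM v0.8.4 adds matryoshka embedding support"],
    dimensions=256,
)
print(len(resp.data[0].embedding))  # expected: 256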
Hardware
- Intel-Gaudi: Multi-step scheduling implementation for HPU (#12779)
- TPU:
Performance
- DeepSeek MLA: a new merge_attn_states CUDA kernel, 3x speedup (#16173)
- MoE: Support W8A8 channel-wise weights and per-token activations in triton fused_moe_kernel (#16366)
- Add support to modelopt quantization of Mixtral model (#15961)
- Enable PTPC FP8 for CompressedTensorsW8A8Fp8MoEMethod (triton fused_moe) (#16537)
V1 Engine Core
- Enable multi-input by default (#15799)
- Scatter and gather placeholders in the model runner (#16076)
- Set structured output backend to `auto` by default (#15724); a short usage sketch follows this list
- Zero-copy tensor/ndarray serialization/transmission (#13790)
- Eagle Model loading (#16035)
- KV cache slots for eagle heads (#16370)
- Add `supports_structured_output()` method to Platform (#16148)
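A short structured-output sketch under the new `auto` default; the checkpoint name is a placeholder, and `GuidedDecodingParams` is the existing guided-decoding interface rather than something introduced in this release:

```python
# Minimal sketch: structured output with the (now default) `auto` backend.
# The model name is a placeholder; any supported instruct model should work.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")  # placeholder checkpoint
params = SamplingParams(
    max_tokens=16,
    # Constrain the output to one of two strings; with #15724 the backend is
    # picked automatically, but it can still be overridden via
    # --guided-decoding-backend / guided_decoding_backend.
    guided_decoding=GuidedDecodingParams(choice=["positive", "negative"]),
)
out = llm.generate(["Sentiment of 'this release is great':"], params)
print(out[0].outputs[0].text)
```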
Developer Facing
- Add sampling parameters to benchmark_serving. (#16022)
- AutoWeightsLoader refactoring (#16383, #16325, #16088, #16203, #16103)
- Unified configuration with engine args: `LoadConfig` (#16422), `ParallelConfig` (#16332)
What's Changed
- [Misc] Auto detect bitsandbytes pre-quantized models by @tristanleclercq in #16027
- [CI] Fix benchmark script level by @khluu in #16089
- fix: support clang17 for macos and fix the real libomp by @yihong0618 in #16086
- [doc] fix 404 by @reidliu41 in #16082
- Revert "doc: add info for macos clang errors (#16049)" by @yihong0618 in #16091
- Fix some capitalisations in generated examples doc titles by @hmellor in #16094
- [Misc] format output for encoder_decoder.py by @reidliu41 in #16095
- [Misc] Remove redundant code by @chaunceyjiang in #16098
- [Bugfix] fix use_atomic_add support of marlin kernel when using v1 engine by @jinzhen-lin in #15946
- [Model] use AutoWeightsLoader for phi, gemma, deepseek by @jonghyunchoe in #16088
- [Model] fix model testing for TeleChat2ForCausalLM and V0 llama4 by @luccafong in #16112
- [Benchmark] Add sampling parameters to benchmark_serving. by @hyeygit in #16022
- [Frontend] Fix typo in tool chat templates for llama3.2 and toolace by @bjj in #14501
- [CI][V1] Fix passing `tokenizer` as kwarg to `validate_guidance_grammar` by @ywang96 in #16117
- [Misc] refactor example eagle by @reidliu41 in #16100
- [Doc][Bugfix] Add missing EOF in k8s deploy doc by @psschwei in #16025
- [Misc] Improve model redirect to accept json dictionary by @Isotr0py in #16119
- [Model] use AutoWeightsLoader for stablelm,starcoder2,zamba2 by @lengrongfu in #16103
- [Bugfix] LoRA : Fix the order in which the kernels process LoRAs by @varun-sundar-rabindranath in #16040
- [Bugfix] add hf_token to EngineArgs by @paolovic in #16093
- [Misc] update requires-python in pyproject.toml by @reidliu41 in #16116
- [TPU] Update PyTorch/XLA by @yaochengji in #16130
- [V1][Minor] Optimize get_cached_block by @WoosukKwon in #16135
- Fix requires-python by @martinhoyer in #16132
- [Metrics] Add bucket for `request_latency`, `time_to_first_token` and `time_per_output_token` by @yankay in #15202
- [V1][Minor] Minor simplification for get_computed_blocks by @WoosukKwon in #16139
- [Misc] Update Mistral-3.1 example by @DarkLight1337 in #16147
- [Bugfix] Make dummy encoder prompt padding alternative and add missing warnings by @Isotr0py in #16129
- [CI] Set max transformers version for Ultravox model test by @ywang96 in #16149
- doc: fix some typos in doc by @yihong0618 in #16154
- [VLM] Florence-2 supports online serving by @Isotr0py in #16164
- [V1][Structured Output] Add `supports_structured_output()` method to Platform by @shen-shanshan in #16148
- [Model] Add Qwen3 and Qwen3MoE by @YamPengLi in #15289
- [Misc] improve example mlpspeculator and llm_engine_example by @reidliu41 in #16175
- [Doc]Update image to latest version by @WangErXiao in #16186
- Upstream Llama4 Support to Main by @houseroad in #16113
- [Bugfix] Re-enable support for `ChatGLMForConditionalGeneration` by @DarkLight1337 in #16187
- [V1] Revert the default `max_num_seqs` to V0 values for most hardware by @DarkLight1337 in #16158
- [Misc] Print encoder seq len to short warning only once by @gshtras in #16193
- [Misc] Human-readable `max-model-len` cli arg by @NickLucche in #16181
- [Misc] Move Llama 4 projector call into encoder execution by @ywang96 in #16201
- [Bugfix] Fix guidance backend for Qwen models by @benchislett in #16210
- [V1][BugFix] Exit properly if engine core fails during startup by @njhill in #16137
- [Misc] add description attribute in CLI by @reidliu41 in #15921
- [Bugfix][V0] XGrammar structured output supports Enum by @leon-seidel in #15878
- Torchao by @drisspg in #14231
- [ROCm][Bugfix][FP8] Make fp8 quant respect fused modules mapping by @mgoin in #16031
- [core] do not send error across process by @youkaichao in #16174
- [Misc] Update compressed-tensors to version 0.9.3 by @mlsw in #16196
- Update BASE_IMAGE to 2.22 release of Neuron by @aws-satyajith in #16218
- [V1] Scatter and gather placeholders in the model runner by @ywang96 in #16076
- [Bugfix] fix use-ep bug to enable ep by dp/tp size > 1 by @zxfan-cpu in #16161
- Add warning for Attention backends that do not support irope yet by @sarckk in #16212
- [Bugfix] Do not skip "empty" parts of chats that are parsable by @mgoin in #16219
- [Bugfix] Fix and reorganize broken GGUF tests and bump gguf version by @Isotr0py in #16194
- [torch.compile][TPU] Make @support_torch_compile work for XLA backend by @lsy323 in #15782
- [V1] Add `disable_chunked_mm_input` arg to disable partial mm input prefill by @mgoin in #15837
- [Misc] Merge the logs of pp layers partitions by @kebe7jun in #16225
- [Docs] Add Slides from Singapore Meetup by @simon-mo in #16213
- [Misc] format and refactor some examples by @reidliu41 in #16252
- [Misc] Add warning for multimodal data in LLM.beam_search by @alex-jw-brooks in #16241
- [Model] use AutoWeightsLoader for phimoe,qwen2_moe,qwen3_moe by @lengrongfu in #16203
- [BugFix][ROCm] Fix GGUF MoE Dispatch Block_Dim for ROCm by @tywuAMD in #16247
- [Bugfix] Remove triton do_bench fast_flush arg by @kebe7jun in #16256
- Update to transformers==4.51.1 by @hmellor in #16257
- [New Model]: jinaai/jina-embeddings-v3 by @noooop in #16120
- [Misc] Avoid stripping meaningful whitespace from `nvidia-smi topo -m` output in collect_env.py by @imkero in #16272
- [Bugfix] Proper input validation for multi-modal encoder-decoder models by @DarkLight1337 in #16156
- [Bugfix] Handle `process_weights_after_loading` for `QKVCrossParallelLinear` by @Isotr0py in #15328
- Add warning that content below line in template will be removed by @hmellor in #16276
- [BugFix] Fix Llama4 - Index Error When Single Request Near Max Context by @LucasWilkinson in #16209
- [Bugfix] fix deepseek fp16 scale bug by @jinzhen-lin in #14809
- [V1] Update structured output offline inference example by @russellb in #15721
- [CI/Build] Fix CI LoRA failure by @jeejeelee in #16270
- Add support to modelopt quantization of Mixtral model by @yueshen2016 in #15961
- [Model] Add smolvlm support by @chaunceyjiang in #16017
- [Bug] [ROCm] Fix Llama 4 Enablement Bug on ROCm: V0 ROCmFlashAttentionImpl and Triton Fused MoE bugs by @tjtanaa in #16198
- [Bugfix] fix gettid method is not define by @lengrongfu in #16084
- [Feature] Estimate max-model-len use available KV cache memory by @lengrongfu in #16168
- [Core] Upgrade to xgrammar 0.1.18, add cache size limit by @russellb in #16283
- [CI][Bugfix] Fix bad tolerance for test_batch_base64_embedding by @mgoin in #16221
- [TPU] Update PyTorch/XLA by @yaochengji in #16288
- [BugFix] Fix fusion test and add them to CI by @ProExpertProg in #16287
- [Misc] Fix test_sharded_state_loader.py(#16004) by @Accelerator1996 in #16005
- [Bugfix] Avoid transferring cached multi-modal items from P0 to P1 by @DarkLight1337 in #16273
- Update label-tpu mergify and remove removal bot by @mgoin in #16298
- [BugFix] logger is not callable by @yihong0618 in #16312
- [BugFix] llama4 qknorm should be not shared across head by @luccafong in #16311
- update neuron config by @ajayvohra2005 in #16289
- [BugFix] fix some typos found by typos. by @yihong0618 in #16314
- [Model] Add `SupportsMultiModal.get_language_model` interface by @NickLucche in #16007
- [Bugfix][Frontend] respect provided default guided decoding backend by @gcalmettes in #15476
- Revert "Update label-tpu mergify and remove removal bot" by @mgoin in #16350
- [Bugfix] Fix profiling.py by @hhy3 in #16202
- [Bugfix] catch AssertionError in MistralTokenizer as ValueError by @gcalmettes in #16344
- [CI]Fix hpu docker and numpy version for CI by @xuechendi in #16355
- Fix `benchmark_throughput.py --backend=hf` by @mgoin in #16352
- [Build/CI] Add tracing deps to vllm container image by @russellb in #15224
- [Hardware] add platform-specific request validation api by @joerunde in #16291
- [Misc] refactor Structured Outputs example by @reidliu41 in #16322
- [TPU][V1] Refine tpu_model_runner to mitigate future recompilation issues by @yaochengji in #16275
- Add GLM-4-0414 support by @zRzRzRzRzRzRzR in #16338
- [Bugfix]: do not shutdown server if `skip_special_use=False` for MistralTokenizer by @gcalmettes in #14094
- [Model] use AutoWeightsLoader for granite, granitemoe, granitemoeshared, grok1, mixtral by @aaron-ang in #16325
- [TPU] Fix dummy loading OOM by @yaochengji in #16372
- [bugfix] Avoid the time consumption caused by creating dummy videos. by @Jintao-Huang in #16371
- [CI][Bugfix] Pin triton version for CPU by @ywang96 in #16384
- [misc] use tqdm.auto where appropriate by @BKitor in #16290
- [Bugfix][TPU] Fix TPU validate_request by @mgoin in #16369
- fix sonnet dataset sample when prefix len is very small by @Chenyaaang in #16379
- [Model] use AutoWeightsLoader for deepseek_v2, internlm2 by @aaron-ang in #16383
- [Misc] Update transformers version limits of multi-modal tests by @DarkLight1337 in #16381
- [Bugfix] Fix validation error for text-only Mllama 3.2 by @DarkLight1337 in #16377
- [Kernel] Use moe_wna16 kernel for compressed tensors wna16 moe models by @mgoin in #16038
- [doc] add download model tips by @reidliu41 in #16389
- Update Numba to 0.61.2 by @cyyever in #16376
- [Model] Remove image mm limit for LLaMa4 by @yeqcharlotte in #16365
- [doc] update the wrong link by @reidliu41 in #16401
- [CI] Add auto update workflow for Dockerfile graph by @WineChord in #11879
- Fix the torch version parsing logic by @houseroad in #15857
- [VLM] Remove `BaseProcessingInfo.get_mm_max_tokens_per_item` by @DarkLight1337 in #16408
- [TPU][V1] Use `language_model` interface for getting text backbone in MM by @NickLucche in #16410
- Improve configs - `ParallelConfig` by @hmellor in #16332
- [V1] Set structured output backend to `auto` by default by @russellb in #15724
- [V1][Spec Decode] Eagle Model loading by @LiuXiaoxuanPKU in #16035
- [Bugfix] Fix bug when dataset is json by @Chenyaaang in #15899
- [Model] Reduce redundant computations in mamba2 blocks for Bamba-9B by @cyang49 in #15423
- [V1] Zero-copy tensor/ndarray serialization/transmission by @njhill in #13790
- [VLM] Avoid unnecessary dummy multimodal data during processing by @DarkLight1337 in #16416
- [Bugfix] Fix output token length check logic by @eeslook in #16419
- [TPU][V1] Disable per-request seed/Generator by @NickLucche in #16172
- Fix range_ratio Bug in RandomDataset by @jadewang21 in #16126
- check input length of sonnet samples by @alexey-belyakov in #16423
- update benchmark_serving_structured_output to include auto backend by @Chenyaaang in #16438
- [Llama4] Enable attention temperature tuning by default for long context (>32k) by @sarckk in #16439
- Update supported_hardware.md for TPU INT8 by @mgoin in #16437
- [Bugfix][VLM] Fix failing Phi-4-MM multi-images tests and add vision-speech test by @Isotr0py in #16424
- [CPU][Bugfix] Fix CPU docker issues by @bigPYJ1151 in #16454
- [Bugfix] Don't set an upper bound on repetition penalty by @alex-jw-brooks in #16403
- Revert "[Model] use AutoWeightsLoader for deepseek_v2, internlm2" by @DefTruth in #16453
- [Core][LoRA][1/N] Add LoRA for EncoderDecoderModelRunner by @jeejeelee in #15990
- Enforce valid max_num_batched_tokens when disable_chunked_mm_input=True by @mgoin in #16447
- [Misc] Raise error for V1 not supporting Long LoRA. by @jeejeelee in #16415
- [Misc] update api_client example by @reidliu41 in #16459
- Don't install triton on `ppc64le` platform by @hmellor in #16470
- [Kernel] support merge_attn_states CUDA kernel, 3x speedup by @DefTruth in #16173
- [Bugfix] Fix bugs of running Quark quantized models by @cha557 in #16236
- [Hardware][Intel-Gaudi] Multi-step scheduling implementation for HPU by @tzielinski-habana in #12779
- Fix erroneous "model doesn't support compile" warning by @zou3519 in #16486
- [TPU][V1] Make `--disable_chunked_mm_input` mandatory for serving MM models by @NickLucche in #16483
- [Kernel] Support W8A8 channel-wise weights and per-token activations in triton fused_moe_kernel by @mgoin in #16366
- [Doc] Document InternVL3 support by @Isotr0py in #16495
- [Bugfix] handle alignment of encoder_seq_lens in mllama.py by @tjohnson31415 in #14784
- Improve configs - `LoadConfig` by @hmellor in #16422
- [Frontend] Added chat templates for LLaMa4 pythonic tool calling by @yeqcharlotte in #16463
- [Kernel] Add tuned FusedMoE kernel config for Llama4 Scout, TP=8 on H100 by @sarckk in #16488
- Update openai_compatible_server.md by @Chr1st1anSears in #16507
- [Bugfix] clean up duplicated code by @lengrongfu in #16485
- Bugfix for PixtralHF models without spatial_merge_size by @mgoin in #16513
- [Doc] Fix link to vLLM blog by @terrytangyuan in #16519
- [CI][Bugfix] Add mistral_tool_use to Ci by @mgoin in #16517
- [BugFix] Handle non-contiguous tensors properly when serializing by @njhill in #16492
- [Doc] Update Llama4 Model Names in Supported Models by @yeqcharlotte in #16509
- Optimized topk for topk=1 (Llama-4) by @mgoin in #16512
- [Feature][V1] Add xgrammar to support minLength, maxLength with test by @leon-seidel in #16516
- [Frontend] support matryoshka representation / support embedding API dimensions by @noooop in #16331
- fix: spelling by @ezhoureal in #16466
- [Misc] Update chat utils tests by @DarkLight1337 in #16520
- [Misc] Openai transcription client example use same Whisper model by @NickLucche in #16487
- [V1] Enable multi-input by default by @DarkLight1337 in #15799
- [MISC] Make GroupCoordinator compatible with out-of-tree devices by @ji-huazhong in #16464
- [Misc] Delete redundant code by @jeejeelee in #16530
- Fix syntaxWarning: invalid escape sequence '\s' by @DamonFool in #16532
- [Perf] Optimize Preparing Inputs for GPU Model Runner by @SnowCharmQ in #16484
- [Bugfix] Validate logit biases to prevent out of vocab ids crashing engine by @rymc in #16529
- [V1][Spec Decode] KV cache slots for eagle heads by @LiuXiaoxuanPKU in #16370
- Enable PTPC FP8 for CompressedTensorsW8A8Fp8MoEMethod (triton fused_moe) by @mgoin in #16537
- [Benchmark][Bugfix] Fix SonnetDataset default values in benchmark_throughput.py by @JenZhao in #16556
- [Core][V0] Enable regex support with xgrammar by @russellb in #13228
New Contributors
- @bjj made their first contribution in #14501
- @psschwei made their first contribution in #16025
- @paolovic made their first contribution in #16093
- @YamPengLi made their first contribution in #15289
- @leon-seidel made their first contribution in #15878
- @drisspg made their first contribution in #14231
- @mlsw made their first contribution in #16196
- @aws-satyajith made their first contribution in #16218
- @zxfan-cpu made their first contribution in #16161
- @sarckk made their first contribution in #16212
- @yueshen2016 made their first contribution in #15961
- @Accelerator1996 made their first contribution in #16005
- @hhy3 made their first contribution in #16202
- @zRzRzRzRzRzRzR made their first contribution in #16338
- @aaron-ang made their first contribution in #16325
- @Jintao-Huang made their first contribution in #16371
- @WineChord made their first contribution in #11879
- @eeslook made their first contribution in #16419
- @jadewang21 made their first contribution in #16126
- @alexey-belyakov made their first contribution in #16423
- @tzielinski-habana made their first contribution in #12779
- @Chr1st1anSears made their first contribution in #16507
- @ezhoureal made their first contribution in #16466
- @SnowCharmQ made their first contribution in #16484
- @rymc made their first contribution in #16529
Full Changelog: v0.8.3...v0.8.4