This release contains 180 commits from 84 contributors (25 new contributors!).
Highlights
This release includes important accuracy fixes for Llama4 models. If you are using Llama4, we highly recommend updating.
Model
- Llama4 (#16113, #16509) bug fixes and enhancements:
  - QK norm should not be shared across heads (#16311)
  - Enable attention temperature tuning by default for long context (>32k) (#16439)
  - Fix index error when a single request is near max context (#16209)
  - Add tuned FusedMoE kernel config for Llama4 Scout, TP=8 on H100 (#16488)
  - Update to transformers==4.51.1 (#16257)
  - Added chat templates for LLaMa4 pythonic tool calling (#16463)
  - Optimized topk for topk=1 (#16512)
  - Add warning for Attention backends that do not support irope yet (#16212)
- Support Qwen3 and Qwen3MoE (#15289), smolvlm (#16017), jinaai/jina-embeddings-v3 (#16120), InternVL3 (#16495), GLM-4-0414 (#16338); see the offline-inference sketch after this list
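A minimal offline-inference sketch for the newly supported text models (the checkpoint name below is a placeholder; substitute the actual Qwen3 / GLM-4-0414 / InternVL3 repository you intend to run):

```python
# Minimal sketch: offline inference with one of the newly supported model families.
# The checkpoint name is a placeholder, not a recommendation.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")  # placeholder checkpoint name
outputs = llm.generate(
    ["Explain KV cache reuse in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```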
API
- Estimate max-model-len using available KV cache memory. The error message now hints at how to set `--max-model-len` (#16168)
- Add hf_token to EngineArgs (#16093)
- Enable regex support with xgrammar in V0 engine (#13228)
- Support matryoshka representation / support embedding API dimensions (#16331); see the embeddings example after this list
- Add bucket for `request_latency`, `time_to_first_token` and `time_per_output_token` (#15202)
- Support for TorchAO quantization (#14231)
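A minimal sketch of the new embedding `dimensions` parameter, assuming an OpenAI-compatible vLLM server is already running locally with a matryoshka-capable embedding model (for example `vllm serve jinaai/jina-embeddings-v3 --trust-remote-code`):

```python
# Minimal sketch, assuming a local vLLM OpenAI-compatible server is serving a
# matryoshka-capable embedding model such as jinaai/jina-embeddings-v3.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The `dimensions` parameter added in #16331 truncates the matryoshka embedding.
resp = client.embeddings.create(
    model="jinaai/jina-embeddings-v3",
    input=["vLLM v0.8.4 adds matryoshka embedding support"],
    dimensions=256,
)
print(len(resp.data[0].embedding))  # expected: 256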
Hardware
- Intel-Gaudi: Multi-step scheduling implementation for HPU (#12779)
- TPU:
Performance
- DeepSeek MLA: a new merge_attn_states CUDA kernel, 3x speedup (#16173)
- MoE: Support W8A8 channel-wise weights and per-token activations in triton fused_moe_kernel (#16366)
- Add support to modelopt quantization of Mixtral model (#15961)
- Enable PTPC FP8 for CompressedTensorsW8A8Fp8MoEMethod (triton fused_moe) (#16537)
V1 Engine Core
- Enable multi-input by default (#15799)
- Scatter and gather placeholders in the model runner (#16076)
- Set structured output backend to `auto` by default (#15724); a short usage sketch follows this list
- Zero-copy tensor/ndarray serialization/transmission (#13790)
- Eagle Model loading (#16035)
- KV cache slots for eagle heads (#16370)
- Add `supports_structured_output()` method to Platform (#16148)
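A short structured-output sketch under the new `auto` default; the checkpoint name is a placeholder, and `GuidedDecodingParams` is the existing guided-decoding interface rather than something introduced in this release:

```python
# Minimal sketch: structured output with the (now default) `auto` backend.
# The model name is a placeholder; any supported instruct model should work.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")  # placeholder checkpoint
params = SamplingParams(
    max_tokens=16,
    # Constrain the output to one of two strings; with #15724 the backend is
    # picked automatically, but it can still be overridden via
    # --guided-decoding-backend / guided_decoding_backend.
    guided_decoding=GuidedDecodingParams(choice=["positive", "negative"]),
)
out = llm.generate(["Sentiment of 'this release is great':"], params)
print(out[0].outputs[0].text)
```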
Developer Facing
- Add sampling parameters to benchmark_serving. (#16022)
- AutoWeightsLoader refactoring (#16383, #16325, #16088, #16203, #16103)
- Unified configuration with engine args: `LoadConfig` (#16422), `ParallelConfig` (#16332)
What's Changed
- [Misc] Auto detect bitsandbytes pre-quantized models by @tristanleclercq in #16027
- [CI] Fix benchmark script level by @khluu in #16089
- fix: support clang17 for macos and fix the real libomp by @yihong0618 in #16086
- [doc] fix 404 by @reidliu41 in #16082
- Revert "doc: add info for macos clang errors (#16049)" by @yihong0618 in #16091
- Fix some capitalisations in generated examples doc titles by @hmellor in #16094
- [Misc] format output for encoder_decoder.py by @reidliu41 in #16095
- [Misc] Remove redundant code by @chaunceyjiang in #16098
- [Bugfix] fix use_atomic_add support of marlin kernel when using v1 engine by @jinzhen-lin in #15946
- [Model] use AutoWeightsLoader for phi, gemma, deepseek by @jonghyunchoe in #16088
- [Model] fix model testing for TeleChat2ForCausalLM and V0 llama4 by @luccafong in #16112
- [Benchmark] Add sampling parameters to benchmark_serving. by @hyeygit in #16022
- [Frontend] Fix typo in tool chat templates for llama3.2 and toolace by @bjj in #14501
- [CI][V1] Fix passing `tokenizer` as kwarg to `validate_guidance_grammar` by @ywang96 in #16117
- [Misc] refactor example eagle by @reidliu41 in #16100
- [Doc][Bugfix] Add missing EOF in k8s deploy doc by @psschwei in #16025
- [Misc] Improve model redirect to accept json dictionary by @Isotr0py in #16119
- [Model] use AutoWeightsLoader for stablelm,starcoder2,zamba2 by @lengrongfu in #16103
- [Bugfix] LoRA : Fix the order in which the kernels process LoRAs by @varun-sundar-rabindranath in #16040
- [Bugfix] add hf_token to EngineArgs by @paolovic in #16093
- [Misc] update requires-python in pyproject.toml by @reidliu41 in #16116
- [TPU] Update PyTorch/XLA by @yaochengji in #16130
- [V1][Minor] Optimize get_cached_block by @WoosukKwon in #16135
- Fix requires-python by @martinhoyer in #16132
- [Metrics] Add bucket for `request_latency`, `time_to_first_token` and `time_per_output_token` by @yankay in #15202
- [V1][Minor] Minor simplification for get_computed_blocks by @WoosukKwon in #16139
- [Misc] Update Mistral-3.1 example by @DarkLight1337 in #16147
- [Bugfix] Make dummy encoder prompt padding alternative and add missing warnings by @Isotr0py in #16129
- [CI] Set max transformers version for Ultravox model test by @ywang96 in #16149
- doc: fix some typos in doc by @yihong0618 in #16154
- [VLM] Florence-2 supports online serving by @Isotr0py in #16164
- [V1][Structured Output] Add `supports_structured_output()` method to Platform by @shen-shanshan in #16148
- [Model] Add Qwen3 and Qwen3MoE by @YamPengLi in #15289
- [Misc] improve example mlpspeculator and llm_engine_example by @reidliu41 in #16175
- [Doc]Update image to latest version by @WangErXiao in #16186
- Upstream Llama4 Support to Main by @houseroad in #16113
- [Bugfix] Re-enable support for `ChatGLMForConditionalGeneration` by @DarkLight1337 in #16187
- [V1] Revert the default `max_num_seqs` to V0 values for most hardware by @DarkLight1337 in #16158
- [Misc] Print encoder seq len to short warning only once by @gshtras in #16193
- [Misc] Human-readable `max-model-len` cli arg by @NickLucche in #16181
- [Misc] Move Llama 4 projector call into encoder execution by @ywang96 in #16201
- [Bugfix] Fix guidance backend for Qwen models by @benchislett in #16210
- [V1][BugFix] Exit properly if engine core fails during startup by @njhill in #16137
- [Misc] add description attribute in CLI by @reidliu41 in #15921
- [Bugfix][V0] XGrammar structured output supports Enum by @leon-seidel in #15878
- Torchao by @drisspg in #14231
- [ROCm][Bugfix][FP8] Make fp8 quant respect fused modules mapping by @mgoin in #16031
- [core] do not send error across process by @youkaichao in #16174
- [Misc] Update compressed-tensors to version 0.9.3 by @mlsw in #16196
- Update BASE_IMAGE to 2.22 release of Neuron by @aws-satyajith in #16218
- [V1] Scatter and gather placeholders in the model runner by @ywang96 in #16076
- [Bugfix] fix use-ep bug to enable ep by dp/tp size > 1 by @zxfan-cpu in #16161
- Add warning for Attention backends that do not support irope yet by @sarckk in #16212
- [Bugfix] Do not skip "empty" parts of chats that are parsable by @mgoin in #16219
- [Bugfix] Fix and reorganize broken GGUF tests and bump gguf version by @Isotr0py in #16194
- [torch.compile][TPU] Make @support_torch_compile work for XLA backend by @lsy323 in #15782
- [V1] Add `disable_chunked_mm_input` arg to disable partial mm input prefill by @mgoin in #15837
- [Misc] Merge the logs of pp layers partitions by @kebe7jun in #16225
- [Docs] Add Slides from Singapore Meetup by @simon-mo in #16213
- [Misc] format and refactor some examples by @reidliu41 in #16252
- [Misc] Add warning for multimodal data in LLM.beam_search by @alex-jw-brooks in #16241
- [Model] use AutoWeightsLoader for phimoe,qwen2_moe,qwen3_moe by @lengrongfu in #16203
- [BugFix][ROCm] Fix GGUF MoE Dispatch Block_Dim for ROCm by @tywuAMD in #16247
- [Bugfix] Remove triton do_bench fast_flush arg by @kebe7jun in #16256
- Update to transformers==4.51.1 by @hmellor in #16257
- [New Model]: jinaai/jina-embeddings-v3 by @noooop in #16120
- [Misc] Avoid stripping meaningful whitespace from `nvidia-smi topo -m` output in collect_env.py by @imkero in #16272
- [Bugfix] Proper input validation for multi-modal encoder-decoder models by @DarkLight1337 in #16156
- [Bugfix] Handle `process_weights_after_loading` for `QKVCrossParallelLinear` by @Isotr0py in #15328
- Add warning that content below line in template will be removed by @hmellor in #16276
- [BugFix] Fix Llama4 - Index Error When Single Request Near Max Context by @LucasWilkinson in #16209
- [Bugfix] fix deepseek fp16 scale bug by @jinzhen-lin in #14809
- [V1] Update structured output offline inference example by @russellb in #15721
- [CI/Build] Fix CI LoRA failure by @jeejeelee in #16270
- Add support to modelopt quantization of Mixtral model by @yueshen2016 in #15961
- [Model] Add smolvlm support by @chaunceyjiang in #16017
- [Bug] [ROCm] Fix Llama 4 Enablement Bug on ROCm: V0 ROCmFlashAttentionImpl and Triton Fused MoE bugs by @tjtanaa in #16198
- [Bugfix] fix gettid method is not define by @lengrongfu in #16084
- [Feature] Estimate max-model-len use available KV cache memory by @lengrongfu in #16168
- [Core] Upgrade to xgrammar 0.1.18, add cache size limit by @russellb in #16283
- [CI][Bugfix] Fix bad tolerance for test_batch_base64_embedding by @mgoin in #16221
- [TPU] Update PyTorch/XLA by @yaochengji in #16288
- [BugFix] Fix fusion test and add them to CI by @ProExpertProg in #16287
- [Misc] Fix test_sharded_state_loader.py(#16004) by @Accelerator1996 in #16005
- [Bugfix] Avoid transferring cached multi-modal items from P0 to P1 by @DarkLight1337 in #16273
- Update label-tpu mergify and remove removal bot by @mgoin in #16298
- [BugFix] logger is not callable by @yihong0618 in #16312
- [BugFix] llama4 qknorm should be not shared across head by @luccafong in #16311
- update neuron config by @ajayvohra2005 in #16289
- [BugFix] fix some typos found by typos. by @yihong0618 in #16314
- [Model] Add `SupportsMultiModal.get_language_model` interface by @NickLucche in #16007
- [Bugfix][Frontend] respect provided default guided decoding backend by @gcalmettes in #15476
- Revert "Update label-tpu mergify and remove removal bot" by @mgoin in #16350
- [Bugfix] Fix profiling.py by @hhy3 in #16202
- [Bugfix] catch AssertionError in MistralTokenizer as ValueError by @gcalmettes in #16344
- [CI]Fix hpu docker and numpy version for CI by @xuechendi in #16355
- Fix `benchmark_throughput.py --backend=hf` by @mgoin in #16352
- [Build/CI] Add tracing deps to vllm container image by @russellb in #15224
- [Hardware] add platform-specific request validation api by @joerunde in #16291
- [Misc] refactor Structured Outputs example by @reidliu41 in #16322
- [TPU][V1] Refine tpu_model_runner to mitigate future recompilation issues by @yaochengji in #16275
- Add GLM-4-0414 support by @zRzRzRzRzRzRzR in #16338
- [Bugfix]: do not shutdown server if `skip_special_use=False` for MistralTokenizer by @gcalmettes in #14094
- [Model] use AutoWeightsLoader for granite, granitemoe, granitemoeshared, grok1, mixtral by @aaron-ang in #16325
- [TPU] Fix dummy loading OOM by @yaochengji in #16372
- [bugfix] Avoid the time consumption caused by creating dummy videos. by @Jintao-Huang in #16371
- [CI][Bugfix] Pin triton version for CPU by @ywang96 in #16384
- [misc] use tqdm.auto where appropriate by @BKitor in #16290
- [Bugfix][TPU] Fix TPU validate_request by @mgoin in #16369
- fix sonnet dataset sample when prefix len is very small by @Chenyaaang in #16379
- [Model] use AutoWeightsLoader for deepseek_v2, internlm2 by @aaron-ang in #16383
- [Misc] Update transformers version limits of multi-modal tests by @DarkLight1337 in #16381
- [Bugfix] Fix validation error for text-only Mllama 3.2 by @DarkLight1337 in #16377
- [Kernel] Use moe_wna16 kernel for compressed tensors wna16 moe models by @mgoin in #16038
- [doc] add download model tips by @reidliu41 in #16389
- Update Numba to 0.61.2 by @cyyever in #16376
- [Model] Remove image mm limit for LLaMa4 by @yeqcharlotte in #16365
- [doc] update the wrong link by @reidliu41 in #16401
- [CI] Add auto update workflow for Dockerfile graph by @WineChord in #11879
- Fix the torch version parsing logic by @houseroad in #15857
- [VLM] Remove `BaseProcessingInfo.get_mm_max_tokens_per_item` by @DarkLight1337 in #16408
- [TPU][V1] Use `language_model` interface for getting text backbone in MM by @NickLucche in #16410
- Improve configs - `ParallelConfig` by @hmellor in #16332
- [V1] Set structured output backend to `auto` by default by @russellb in #15724
- [V1][Spec Decode] Eagle Model loading by @LiuXiaoxuanPKU in #16035
- [Bugfix] Fix bug when dataset is json by @Chenyaaang in #15899
- [Model] Reduce redundant computations in mamba2 blocks for Bamba-9B by @cyang49 in #15423
- [V1] Zero-copy tensor/ndarray serialization/transmission by @njhill in #13790
- [VLM] Avoid unnecessary dummy multimodal data during processing by @DarkLight1337 in #16416
- [Bugfix] Fix output token length check logic by @eeslook in #16419
- [TPU][V1] Disable per-request seed/Generator by @NickLucche in #16172
- Fix range_ratio Bug in RandomDataset by @jadewang21 in #16126
- check input length of sonnet samples by @alexey-belyakov in #16423
- update benchmark_serving_structured_output to include auto backend by @Chenyaaang in #16438
- [Llama4] Enable attention temperature tuning by default for long context (>32k) by @sarckk in #16439
- Update supported_hardware.md for TPU INT8 by @mgoin in #16437
- [Bugfix][VLM] Fix failing Phi-4-MM multi-images tests and add vision-speech test by @Isotr0py in #16424
- [CPU][Bugfix] Fix CPU docker issues by @bigPYJ1151 in #16454
- [Bugfix] Don't set an upper bound on repetition penalty by @alex-jw-brooks in #16403
- Revert "[Model] use AutoWeightsLoader for deepseek_v2, internlm2" by @DefTruth in #16453
- [Core][LoRA][1/N] Add LoRA for EncoderDecoderModelRunner by @jeejeelee in #15990
- Enforce valid max_num_batched_tokens when disable_chunked_mm_input=True by @mgoin in #16447
- [Misc] Raise error for V1 not supporting Long LoRA. by @jeejeelee in #16415
- [Misc] update api_client example by @reidliu41 in #16459
- Don't install triton on `ppc64le` platform by @hmellor in #16470
- [Kernel] support merge_attn_states CUDA kernel, 3x speedup by @DefTruth in #16173
- [Bugfix] Fix bugs of running Quark quantized models by @cha557 in #16236
- [Hardware][Intel-Gaudi] Multi-step scheduling implementation for HPU by @tzielinski-habana in #12779
- Fix erroneous "model doesn't support compile" warning by @zou3519 in #16486
- [TPU][V1] Make `--disable_chunked_mm_input` mandatory for serving MM models by @NickLucche in #16483
- [Kernel] Support W8A8 channel-wise weights and per-token activations in triton fused_moe_kernel by @mgoin in #16366
- [Doc] Document InternVL3 support by @Isotr0py in #16495
- [Bugfix] handle alignment of encoder_seq_lens in mllama.py by @tjohnson31415 in #14784
- Improve configs - `LoadConfig` by @hmellor in #16422
- [Frontend] Added chat templates for LLaMa4 pythonic tool calling by @yeqcharlotte in #16463
- [Kernel] Add tuned FusedMoE kernel config for Llama4 Scout, TP=8 on H100 by @sarckk in #16488
- Update openai_compatible_server.md by @Chr1st1anSears in #16507
- [Bugfix] clean up duplicated code by @lengrongfu in #16485
- Bugfix for PixtralHF models without spatial_merge_size by @mgoin in #16513
- [Doc] Fix link to vLLM blog by @terrytangyuan in #16519
- [CI][Bugfix] Add mistral_tool_use to Ci by @mgoin in #16517
- [BugFix] Handle non-contiguous tensors properly when serializing by @njhill in #16492
- [Doc] Update Llama4 Model Names in Supported Models by @yeqcharlotte in #16509
- Optimized topk for topk=1 (Llama-4) by @mgoin in #16512
- [Feature][V1] Add xgrammar to support minLength, maxLength with test by @leon-seidel in #16516
- [Frontend] support matryoshka representation / support embedding API dimensions by @noooop in #16331
- fix: spelling by @ezhoureal in #16466
- [Misc] Update chat utils tests by @DarkLight1337 in #16520
- [Misc] Openai transcription client example use same Whisper model by @NickLucche in #16487
- [V1] Enable multi-input by default by @DarkLight1337 in #15799
- [MISC] Make GroupCoordinator compatible with out-of-tree devices by @ji-huazhong in #16464
- [Misc] Delete redundant code by @jeejeelee in #16530
- Fix syntaxWarning: invalid escape sequence '\s' by @DamonFool in #16532
- [Perf] Optimize Preparing Inputs for GPU Model Runner by @SnowCharmQ in #16484
- [Bugfix] Validate logit biases to prevent out of vocab ids crashing engine by @rymc in #16529
- [V1][Spec Decode] KV cache slots for eagle heads by @LiuXiaoxuanPKU in #16370
- Enable PTPC FP8 for CompressedTensorsW8A8Fp8MoEMethod (triton fused_moe) by @mgoin in #16537
- [Benchmark][Bugfix] Fix SonnetDataset default values in benchmark_throughput.py by @JenZhao in #16556
- [Core][V0] Enable regex support with xgrammar by @russellb in #13228
New Contributors
- @bjj made their first contribution in #14501
- @psschwei made their first contribution in #16025
- @paolovic made their first contribution in #16093
- @YamPengLi made their first contribution in #15289
- @leon-seidel made their first contribution in #15878
- @drisspg made their first contribution in #14231
- @mlsw made their first contribution in #16196
- @aws-satyajith made their first contribution in #16218
- @zxfan-cpu made their first contribution in #16161
- @sarckk made their first contribution in #16212
- @yueshen2016 made their first contribution in #15961
- @Accelerator1996 made their first contribution in #16005
- @hhy3 made their first contribution in #16202
- @zRzRzRzRzRzRzR made their first contribution in #16338
- @aaron-ang made their first contribution in #16325
- @Jintao-Huang made their first contribution in #16371
- @WineChord made their first contribution in #11879
- @eeslook made their first contribution in #16419
- @jadewang21 made their first contribution in #16126
- @alexey-belyakov made their first contribution in #16423
- @tzielinski-habana made their first contribution in #12779
- @Chr1st1anSears made their first contribution in #16507
- @ezhoureal made their first contribution in #16466
- @SnowCharmQ made their first contribution in #16484
- @rymc made their first contribution in #16529
Full Changelog: v0.8.3...v0.8.4