
v0.8.4

Released 14 Apr 06:14

This release contains 180 commits from 84 contributors (25 new contributors!).

Highlights

This release includes important accuracy fixes for Llama4 models. If you are using them, we highly recommend updating.

Model

  • Llama4 (#16113, #16509) bug fixes and enhancements:
    • QK norm should not be shared across heads (#16311)
    • Enable attention temperature tuning by default for long context (>32k) (#16439)
    • Fixed an index error when a single request is near the max context length (#16209)
    • Add tuned FusedMoE kernel config for Llama4 Scout, TP=8 on H100 (#16488)
    • Update to transformers==4.51.1 (#16257)
    • Added chat templates for Llama4 pythonic tool calling (#16463)
    • Optimized top-k for topk=1 (#16512)
    • Add warning for Attention backends that do not support irope yet (#16212)
  • Support Qwen3 and Qwen3MoE (#15289), SmolVLM (#16017), jinaai/jina-embeddings-v3 (#16120), InternVL3 (#16495), and GLM-4-0414 (#16338); a minimal loading sketch follows this list
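
Newly supported models load through the standard offline entry point. A minimal sketch, assuming the example model ID Qwen/Qwen3-8B (any of the newly supported checkpoints should load the same way):

```python
from vllm import LLM, SamplingParams

# Load a newly supported model; "Qwen/Qwen3-8B" is an assumed example ID.
llm = LLM(model="Qwen/Qwen3-8B")

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Give me a one-line summary of vLLM."], params)
print(outputs[0].outputs[0].text)
```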

API

  • Estimate max-model-len using available KV cache memory. The error message now hints at how to set --max-model-len (#16168); see the engine-arguments sketch after this list
  • Add hf_token to EngineArgs (#16093)
  • Enable regex support with xgrammar in V0 engine (#13228)
  • Support Matryoshka representations / the embedding API dimensions parameter (#16331); see the embeddings sketch after this list
  • Add buckets for the request_latency, time_to_first_token, and time_per_output_token metrics (#15202)
  • Support for TorchAO quantization (#14231)
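
A minimal offline sketch combining the two engine-argument changes above. The model ID and token value are placeholders, and passing hf_token through LLM(...) assumes that LLM keyword arguments are forwarded to EngineArgs as usual:

```python
from vllm import LLM

# Cap the context window explicitly instead of relying on the estimate;
# hf_token (new in #16093) authenticates downloads of gated checkpoints.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # example model
    max_model_len=8192,
    hf_token="hf_xxx",  # placeholder; use your own token
)
```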
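For the Matryoshka / dimensions support, a hedged sketch against the OpenAI-compatible server. It assumes the server is already running with a Matryoshka-capable model such as jinaai/jina-embeddings-v3 and that the standard dimensions field is honored per #16331:

```python
from openai import OpenAI

# Point the client at a running vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# dimensions truncates the Matryoshka embedding to the requested size (#16331).
resp = client.embeddings.create(
    model="jinaai/jina-embeddings-v3",
    input=["vLLM v0.8.4 release notes"],
    dimensions=256,
)
print(len(resp.data[0].embedding))  # 256
```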

Hardware

  • Intel-Gaudi: Multi-step scheduling implementation for HPU (#12779)
  • TPU:
    • Make @support_torch_compile work for XLA backend (#15782)
    • Use the language_model interface for getting the text backbone in multimodal models (#16410)

Performance

  • DeepSeek MLA: a new merge_attn_states CUDA kernel with a 3x speedup (#16173)
  • MoE: Support W8A8 channel-wise weights and per-token activations in triton fused_moe_kernel (#16366)
  • Add support for ModelOpt quantization of the Mixtral model (#15961)
  • Enable PTPC FP8 for CompressedTensorsW8A8Fp8MoEMethod (triton fused_moe) (#16537); see the serving sketch below
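
The MoE quantization items above apply when serving pre-quantized checkpoints. A minimal sketch, assuming a compressed-tensors W8A8/FP8 MoE checkpoint (the model ID below is hypothetical) and that vLLM auto-detects the quantization method from the checkpoint config:

```python
from vllm import LLM

# Hypothetical pre-quantized MoE checkpoint; vLLM reads the quantization
# config from the checkpoint, so no explicit quantization= flag is needed.
llm = LLM(model="your-org/Mixtral-8x7B-Instruct-W8A8-FP8")
out = llm.generate(["Hello"])
print(out[0].outputs[0].text)
```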

V1 Engine Core

  • Enable multi-input by default (#15799)
  • Scatter and gather placeholders in the model runner (#16076)
  • Set the structured output backend to auto by default (#15724); see the guided decoding sketch after this list
  • Zero-copy tensor/ndarray serialization/transmission (#13790)
  • EAGLE model loading (#16035)
  • KV cache slots for EAGLE heads (#16370)
  • Add supports_structured_output() method to Platform (#16148)
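
A minimal guided decoding sketch illustrating structured output on the V1 engine. It assumes GuidedDecodingParams with a regex constraint as exposed in vllm.sampling_params, with backend selection left to the new auto default; the model ID is an assumed example:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")  # assumed example model

# Constrain output to a simple date pattern; with #15724 the structured
# output backend is chosen automatically ("auto"), no flag required.
guided = GuidedDecodingParams(regex=r"\d{4}-\d{2}-\d{2}")
params = SamplingParams(guided_decoding=guided, max_tokens=16)

out = llm.generate(["Today's date in ISO format:"], params)
print(out[0].outputs[0].text)
```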


Full Changelog: v0.8.3...v0.8.4