Commit 4c8c86d

tjtanaa, youkaichao, DarkLight1337, heheda12345, and wangxiyuan authored
[FEAT] [AITER] Support AITER operators: Fused MoE, Linear, Norm (#436)
* [Doc] Update Quantization Hardware Support Documentation (vllm-project#12025) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com> * [HPU][misc] add comments for explanation (vllm-project#12034) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Bugfix] Fix various bugs in multi-modal processor (vllm-project#12031) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Kernel] Revert the API change of Attention.forward (vllm-project#12038) Signed-off-by: Chen Zhang <zhangch99@outlook.com> * [Platform] Add output for Attention Backend (vllm-project#11981) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> * [Bugfix][Kernel] Give unique name to BlockSparseFlashAttention (vllm-project#12040) Signed-off-by: Chen Zhang <zhangch99@outlook.com> * Explain where the engine args go when using Docker (vllm-project#12041) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * Docs lint * [Doc]: Update the Json Example of the `Engine Arguments` document (vllm-project#12045) * [Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_mapping (vllm-project#11924) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * [Kernel] Support MulAndSilu (vllm-project#11624) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * [HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in setup.py (vllm-project#12046) Signed-off-by: Konrad Zawora <kzawora@habana.ai> * [Platform] move current_memory_usage() into platform (vllm-project#11369) Signed-off-by: Shanshan Shen <467638484@qq.com> * [V1][BugFix] Fix edge case in VLM scheduling (vllm-project#12065) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [Misc] Add multipstep chunked-prefill support for FlashInfer (vllm-project#10467) * [core] Turn off GPU communication overlap for Ray executor (vllm-project#12051) Signed-off-by: Rui Qiao <ruisearch42@gmail.com> * [core] platform agnostic executor via collective_rpc (vllm-project#11256) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Doc] Update examples to remove SparseAutoModelForCausalLM (vllm-project#12062) Signed-off-by: Kyle Sayers <kylesayrs@gmail.com> * [V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager (vllm-project#12003) * Fix: cases with empty sparsity config (vllm-project#12057) Signed-off-by: Rahul Tuli <rahul@neuralmagic.com> * Type-fix: make execute_model output type optional (vllm-project#12020) * [Platform] Do not raise error if _Backend is not found (vllm-project#12023) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com> * [Model]: Support internlm3 (vllm-project#12037) * Misc: allow to use proxy in `HTTPConnection` (vllm-project#12042) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Misc][Quark] Upstream Quark format to VLLM (vllm-project#10765) Signed-off-by: kewang-xlnx <kewang@xilinx.com> Signed-off-by: kewang2 <kewang2@amd.com> Co-authored-by: kewang2 <kewang2@amd.com> Co-authored-by: Michael Goin <michael@neuralmagic.com> * [Doc]: Update `OpenAI-Compatible Server` documents (vllm-project#12082) * [Bugfix] use right truncation for non-generative tasks (vllm-project#12050) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> * [V1][Core] Autotune encoder cache budget (vllm-project#11895) Signed-off-by: Roger Wang <ywang@roblox.com> * [Bugfix] Fix _get_lora_device for HQQ marlin (vllm-project#12090) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> 
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> * Allow hip sources to be directly included when compiling for rocm. (vllm-project#12087) * [Core] Default to using per_token quantization for fp8 when cutlass is supported. (vllm-project#8651) Signed-off-by: mgoin <michael@neuralmagic.com> Co-authored-by: Michael Goin <mgoin@redhat.com> Co-authored-by: mgoin <michael@neuralmagic.com> * [Doc] Add documentation for specifying model architecture (vllm-project#12105) * Various cosmetic/comment fixes (vllm-project#12089) Signed-off-by: mgoin <michael@neuralmagic.com> * [Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 (vllm-project#12067) Signed-off-by: Isotr0py <2037008807@qq.com> * Support torchrun and SPMD-style offline inference (vllm-project#12071) Signed-off-by: youkaichao <youkaichao@gmail.com> * [core] LLM.collective_rpc interface and RLHF example (vllm-project#12084) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Bugfix] Fix max image feature size for Llava-one-vision (vllm-project#12104) Signed-off-by: Roger Wang <ywang@roblox.com> * Enable user marker for vllm profiling (#357) * Enable user marker for vllm profiling --------- Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> * [misc] Add LoRA kernel micro benchmarks (vllm-project#11579) * [Model] Add support for deepseek-vl2-tiny model (vllm-project#12068) Signed-off-by: Isotr0py <2037008807@qq.com> * Deepseek V3 support (#364) * Changing the hard coded datatype to see if it's enough for the model to work * Picking the upstrteam moe kernel version * make upstream fix for v3 also works for rocm v2 * Conditional fnuz dtype * Requantizing from fn to fnuz * Requantizing moe as well * Actually requantizing moe weights * Conditional requantization and assert on padding in block quant * Format --------- Co-authored-by: charlifu <charlifu@amd.com> * [Bugfix] Set enforce_eager automatically for mllama (vllm-project#12127) Signed-off-by: Chen Zhang <zhangch99@outlook.com> * [Bugfix] Fix a path bug in disaggregated prefill example script. 
(vllm-project#12121) Signed-off-by: Kuntai Du <kuntai@uchicago.edu> * [CI]add genai-perf benchmark in nightly benchmark (vllm-project#10704) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> * [Doc] Add instructions on using Podman when SELinux is active (vllm-project#12136) Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> * [Bugfix] Fix issues in CPU build Dockerfile (vllm-project#12135) Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> * [BugFix] add more `is not None` check in VllmConfig.__post_init__ (vllm-project#12138) Signed-off-by: Chen Zhang <zhangch99@outlook.com> * [Misc] Add deepseek_vl2 chat template (vllm-project#12143) Signed-off-by: Isotr0py <2037008807@qq.com> * [ROCm][MoE] moe tuning support for rocm (vllm-project#12049) Signed-off-by: Divakar Verma <divakar.verma@amd.com> * [V1] Move more control of kv cache initialization from model_executor to EngineCore (vllm-project#11960) Signed-off-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com> * [Misc][LoRA] Improve the readability of LoRA error messages (vllm-project#12102) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * [CI/Build][CPU][Bugfix] Fix CPU CI (vllm-project#12150) Signed-off-by: jiang1.li <jiang1.li@intel.com> * [core] allow callable in collective_rpc (vllm-project#12151) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Bugfix] Fix score api for missing max_model_len validation (vllm-project#12119) Signed-off-by: Wallas Santos <wallashss@ibm.com> * [Bugfix] Mistral tokenizer encode accept list of str (vllm-project#12149) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> * [AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (vllm-project#12134) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> * [torch.compile] disable logging when cache is disabled (vllm-project#12043) Signed-off-by: youkaichao <youkaichao@gmail.com> * [misc] fix cross-node TP (vllm-project#12166) Signed-off-by: youkaichao <youkaichao@gmail.com> * [AMD][CI/Build][Bugfix] use pytorch stale wheel (vllm-project#12172) Signed-off-by: hongxyan <hongxyan@amd.com> * [core] further polish memory profiling (vllm-project#12126) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Docs] Fix broken link in SECURITY.md (vllm-project#12175) Signed-off-by: Russell Bryant <rbryant@redhat.com> * [Model] Port deepseek-vl2 processor, remove dependency (vllm-project#12169) Signed-off-by: Isotr0py <2037008807@qq.com> * [core] clean up executor class hierarchy between v1 and v0 (vllm-project#12171) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Misc] Support register quantization method out-of-tree (vllm-project#11969) * [V1] Collect env var for usage stats (vllm-project#12115) * [BUGFIX] Move scores to float32 in case of running xgrammar on cpu (vllm-project#12152) Signed-off-by: Michal Adamczyk <madamczyk@habana.ai> * [Bugfix] Fix multi-modal processors for transformers 4.48 (vllm-project#12187) * [torch.compile] store inductor compiled Python file (vllm-project#12182) Signed-off-by: youkaichao <youkaichao@gmail.com> * benchmark_serving support --served-model-name param (vllm-project#12109) Signed-off-by: zibai <zibai.gj@alibaba-inc.com> Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> * [Misc] Add BNB support to GLM4-V model (vllm-project#12184) Signed-off-by: Isotr0py <2037008807@qq.com> * [V1] Add V1 support of Qwen2-VL (vllm-project#12128) Signed-off-by: Roger Wang <ywang@roblox.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: imkero 
<kerorek@outlook.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Model] Support for fairseq2 Llama (vllm-project#11442) Signed-off-by: Martin Gleize <mgleize@meta.com> Co-authored-by: mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas> * [Bugfix] Fix num_heads value for simple connector when tp enabled (vllm-project#12074) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> * [torch.compile] fix sym_tensor_indices (vllm-project#12191) Signed-off-by: youkaichao <youkaichao@gmail.com> * Move linting to `pre-commit` (vllm-project#11975) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [DOC] Fix typo in docstring and assert message (vllm-project#12194) Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> * [DOC] Add missing docstring in LLMEngine.add_request() (vllm-project#12195) Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> * [Bugfix] Fix incorrect types in LayerwiseProfileResults (vllm-project#12196) Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> * [Model] Add Qwen2 PRM model support (vllm-project#12202) Signed-off-by: Isotr0py <2037008807@qq.com> * [Core] Interface for accessing model from `VllmRunner` (vllm-project#10353) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [misc] add placeholder format.sh (vllm-project#12206) Signed-off-by: youkaichao <youkaichao@gmail.com> * [CI/Build] Remove dummy CI steps (vllm-project#12208) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [CI/Build] Make pre-commit faster (vllm-project#12212) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Model] Upgrade Aria to transformers 4.48 (vllm-project#12203) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [misc] print a message to suggest how to bypass commit hooks (vllm-project#12217) Signed-off-by: youkaichao <youkaichao@gmail.com> * [core][bugfix] configure env var during import vllm (vllm-project#12209) Signed-off-by: youkaichao <youkaichao@gmail.com> * [V1] Remove `_get_cache_block_size` (vllm-project#12214) Signed-off-by: Chen Zhang <zhangch99@outlook.com> * [Misc] Pass `attention` to impl backend (vllm-project#12218) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> * [Bugfix] Fix `HfExampleModels.find_hf_info` (vllm-project#12223) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [CI] Pass local python version explicitly to pre-commit mypy.sh (vllm-project#12224) Signed-off-by: Chen Zhang <zhangch99@outlook.com> * Using ROCm6.3.1 base docker and building hipblas-common (#366) * [Misc] Update CODEOWNERS (vllm-project#12229) * fix: update platform detection for M-series arm based MacBook processors (vllm-project#12227) Signed-off-by: isikhi <huseyin.isik000@gmail.com> * [misc] add cuda runtime version to usage data (vllm-project#12190) Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Roger Wang <ywang@roblox.com> * [bugfix] catch xgrammar unsupported array constraints (vllm-project#12210) Signed-off-by: Jason Cheng <jasoncky96@gmail.com> * [Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. 
DeepSeek-V3) (vllm-project#12222) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com> Co-authored-by: Michael Goin <mgoin@redhat.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> * Add quantization and guided decoding CODEOWNERS (vllm-project#12228) Signed-off-by: mgoin <michael@neuralmagic.com> * [AMD][Build] Porting dockerfiles from the ROCm/vllm fork (vllm-project#11777) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> * [BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 (vllm-project#12230) Signed-off-by: NickLucche <nlucches@redhat.com> * [ci/build] disable failed and flaky tests (vllm-project#12240) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Misc] Rename `MultiModalInputsV2 -> MultiModalInputs` (vllm-project#12244) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Misc]Add BNB quantization for PaliGemmaForConditionalGeneration (vllm-project#12237) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * [Misc] Remove redundant TypeVar from base model (vllm-project#12248) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Bugfix] Fix mm_limits access for merged multi-modal processor (vllm-project#12252) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [torch.compile] transparent compilation with more logging (vllm-project#12246) Signed-off-by: youkaichao <youkaichao@gmail.com> * [V1][Bugfix] Fix data item ordering in mixed-modality inference (vllm-project#12259) Signed-off-by: Roger Wang <ywang@roblox.com> * Remove pytorch comments for outlines + compressed-tensors (vllm-project#12260) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> * [Platform] improve platforms getattr (vllm-project#12264) Signed-off-by: Mengqing Cao <cmq0113@163.com> * [ci/build] update nightly torch for gh200 test (vllm-project#12270) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Bugfix] fix race condition that leads to wrong order of token returned (vllm-project#10802) Signed-off-by: Jannis Schönleber <joennlae@gmail.com> * [Kernel] fix moe_align_block_size error condition (vllm-project#12239) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com> * [v1][stats][1/n] Add RequestStatsUpdate and RequestStats types (vllm-project#10907) Signed-off-by: rickyx <rickyx@anyscale.com> * [Bugfix] Multi-sequence broken (vllm-project#11898) Signed-off-by: Andy Lo <andy@mistral.ai> * [Misc] Remove experimental dep from tracing.py (vllm-project#12007) Signed-off-by: Adrian Cole <adrian.cole@elastic.co> * [Misc] Set default backend to SDPA for get_vit_attn_backend (vllm-project#12235) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> * [Core] Free CPU pinned memory on environment cleanup (vllm-project#10477) * Update pre-commit.yml (#374) * Update pre-commit.yml * Reapplying missing format * New codespell exclude location --------- Co-authored-by: Kevin H. Luu <kevin@anyscale.com> * [bugfix] moe tuning. 
rm is_navi() (vllm-project#12273) Signed-off-by: Divakar Verma <divakar.verma@amd.com> * [BUGFIX] When skip_tokenize_init and multistep are set, execution crashes (vllm-project#12277) Signed-off-by: maleksan85 <maleksan@amd.com> Co-authored-by: maleksan85 <maleksan@amd.com> * [Documentation][AMD] Add information about prebuilt ROCm vLLM docker for perf validation purpose (vllm-project#12281) Signed-off-by: Hongxia Yang <hongxyan@amd.com> * [VLM] Simplify post-processing of replacement info (vllm-project#12269) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [ci/lint] Add back default arg for pre-commit (vllm-project#12279) Signed-off-by: kevin <kevin@anyscale.com> * [CI] add docker volume prune to neuron CI (vllm-project#12291) Signed-off-by: Liangfu Chen <liangfc@amazon.com> * [Ci/Build] Fix mypy errors on main (vllm-project#12296) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Benchmark] More accurate TPOT calc in `benchmark_serving.py` (vllm-project#12288) Signed-off-by: Nick Hill <nhill@redhat.com> * [core] separate builder init and builder prepare for each batch (vllm-project#12253) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Build] update requirements of no-device (vllm-project#12299) Signed-off-by: Mengqing Cao <cmq0113@163.com> * [Core] Support fully transparent sleep mode (vllm-project#11743) Signed-off-by: youkaichao <youkaichao@gmail.com> * [VLM] Avoid unnecessary tokenization (vllm-project#12310) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Model][Bugfix]: correct Aria model output (vllm-project#12309) Signed-off-by: xffxff <1247714429@qq.com> * [Bugfix][VLM] Fix mixed-modality inference backward compatibility for V0 (vllm-project#12313) Signed-off-by: Roger Wang <ywang@roblox.com> * [Doc] Add docs for prompt replacement (vllm-project#12318) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Misc] Fix the error in the tip for the --lora-modules parameter (vllm-project#12319) Signed-off-by: wangerxiao <863579016@qq.com> * [Misc] Improve the readability of BNB error messages (vllm-project#12320) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * Skip tokenize/detokenize when it is disabled by arg --skip-tokenizer-init (#367) * switching detokenize flag to be False * detokenize = False for benchmarks * restoring default in main vllm code for detokenize * removing extra spaces * moving detokenize to flag * adding support for token ids --------- Co-authored-by: maleksan85 <maleksan@amd.com> * [Bugfix] Fix HPU multiprocessing executor (vllm-project#12167) Signed-off-by: Konrad Zawora <kzawora@habana.ai> * [Core] Support `reset_prefix_cache` (vllm-project#12284) * [Frontend][V1] Online serving performance improvements (vllm-project#12287) * [AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD (vllm-project#12282) Signed-off-by: Randall Smith <Randall.Smith@amd.com> * FP8 FA fixes (#381) * FP8 FA fixes Summary: Add missing clamp and fix reciprocal scale computation. * linter * Returning the use of the proper stream in allreduce (#382) * [Bugfix] Fixing AMD LoRA CI test. (vllm-project#12329) Signed-off-by: Alexei V. 
Ivanov <alexei.ivanov@amd.com> * [Docs] Update FP8 KV Cache documentation (vllm-project#12238) Signed-off-by: mgoin <michael@neuralmagic.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> * [Docs] Document vulnerability disclosure process (vllm-project#12326) Signed-off-by: Russell Bryant <rbryant@redhat.com> * [V1] Add `uncache_blocks` (vllm-project#12333) * [doc] explain common errors around torch.compile (vllm-project#12340) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Hardware][Gaudi][BugFix] Fix dataclass error due to triton package update (vllm-project#12338) Signed-off-by: zhenwei <zhenweiliu@habana.ai> * [Bugfix] Fix k_proj's bias for whisper self attention (vllm-project#12342) Signed-off-by: Isotr0py <2037008807@qq.com> * [Kernel] Flash Attention 3 Support (vllm-project#12093) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> * [Doc] Troubleshooting errors during model inspection (vllm-project#12351) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [V1] Simplify M-RoPE (vllm-project#12352) Signed-off-by: Roger Wang <ywang@roblox.com> Co-authored-by: imkero <kerorek@outlook.com> * [Bugfix] Fix broken internvl2 inference with v1 (vllm-project#12360) Signed-off-by: Isotr0py <2037008807@qq.com> * [core] add wake_up doc and some sanity check (vllm-project#12361) Signed-off-by: youkaichao <youkaichao@gmail.com> * [torch.compile] decouple compile sizes and cudagraph sizes (vllm-project#12243) Signed-off-by: youkaichao <youkaichao@gmail.com> * [FP8][Kernel] Dynamic kv cache scaling factors computation (vllm-project#11906) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by: Micah Williamson <micah.williamson@amd.com> * [TPU] Update TPU CI to use torchxla nightly on 20250122 (vllm-project#12334) Signed-off-by: Siyuan Liu <lsiyuan@google.com> * [Docs] Document Phi-4 support (vllm-project#12362) Signed-off-by: Isotr0py <2037008807@qq.com> * [BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE Group Act Order (vllm-project#11528) Signed-off-by: ElizaWszola <eliza@neuralmagic.com> Co-authored-by: ElizaWszola <eliza@neuralmagic.com> Co-authored-by: Michael Goin <michael@neuralmagic.com> * [Misc] Fix OpenAI API Compatibility Issues in Benchmark Script (vllm-project#12357) Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp> * [Docs] Add meetup slides (vllm-project#12345) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * Using pytorch commit past the point when rowwise PR (pytorch/pytorch#144432) was merged (#384) * Integrated ater: kvcache pa gemm rmsnorm * fix pa * fix * replace topk softmax * [Docs] Update spec decode + structured output in compat matrix (vllm-project#12373) Signed-off-by: Russell Bryant <rbryant@redhat.com> * replace fp moe kernel with aiter kernel * [V1][Frontend] Coalesce bunched `RequestOutput`s (vllm-project#12298) Signed-off-by: Nick Hill <nhill@redhat.com> Co-authored-by: Robert Shaw <rshaw@neuralmagic.com> * Set weights_only=True when using torch.load() (vllm-project#12366) Signed-off-by: Russell Bryant <rbryant@redhat.com> * [Bugfix] Path join when building local path for S3 clone (vllm-project#12353) Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai> * change ater to aiter * Update compressed-tensors version (vllm-project#12367) * [V1] Increase default batch size for H100/H200 (vllm-project#12369) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [perf] fix perf regression from vllm-project#12253 (vllm-project#12380) Signed-off-by: youkaichao 
<youkaichao@gmail.com> * [Misc] Use VisionArena Dataset for VLM Benchmarking (vllm-project#12389) Signed-off-by: Roger Wang <ywang@roblox.com> * [ci/build] fix wheel size check (vllm-project#12396) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Hardware][Gaudi][Doc] Add missing step in setup instructions (vllm-project#12382) * [ci/build] sync default value for wheel size (vllm-project#12398) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Misc] Enable proxy support in benchmark script (vllm-project#12356) Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp> * [Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (vllm-project#12375) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> * Applying scales rename to fp8 config * Applying scales rename to fp8 config (#387) * Update Dockerfile.rocm * [Misc] Remove deprecated code (vllm-project#12383) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). (vllm-project#12405) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> * Using aiter moe kernel * Dev-docker Documentation Updates (#378) * Dev-docker Documentation Updates Minor updates to several sections, with links to other documents where appropriate. * Fix formatting of GEMM filename * README cleanup - Reorder some sections of the README to make them easier to follow - Improve formatting of bash commands - Prefer use of huggingface model names instead of hard-coded directories - Clean up wording * Expanded sample commands for Latency and Throughput * Fix markdown links * Fix pre-commit errors * Updates from review Initial updates to incorporate feedback from a review session held with @t-parry * Update script args to match current recommendations * Remove recommended max-num-seqs values for now --------- Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> * [Bugfix][Kernel] Fix moe align block issue for mixtral (vllm-project#12413) * [Bugfix] Fix BLIP-2 processing (vllm-project#12412) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (vllm-project#12408) Signed-off-by: Divakar Verma <divakar.verma@amd.com> * [Misc] Add FA2 support to ViT MHA layer (vllm-project#12355) Signed-off-by: Isotr0py <2037008807@qq.com> * [TPU][CI] Update torchxla version in requirement-tpu.txt (vllm-project#12422) Signed-off-by: Siyuan Liu <lsiyuan@google.com> * [Misc][Bugfix] FA3 support to ViT MHA layer (vllm-project#12435) Signed-off-by: Roger Wang <ywang@roblox.com> Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Isotr0py <2037008807@qq.com> * [V1][Perf] Reduce scheduling overhead in model runner after cuda sync (vllm-project#12094) Signed-off-by: Keyun Tong <tongkeyun@gmail.com> * [V1][Bugfix] Fix assertion when mm hashing is turned off (vllm-project#12439) Signed-off-by: Roger Wang <ywang@roblox.com> * fix pa copy * pa update * [Misc] Revert FA on ViT vllm-project#12355 and vllm-project#12435 (vllm-project#12445) * [Frontend] generation_config.json for maximum tokens(vllm-project#12242) Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com> Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: 
DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Chen Zhang <zhangch99@outlook.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: shangmingc <caishangming@linux.alibaba.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Yuan Tang <terrytangyuan@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> * [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (vllm-project#12417) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: mgoin <michael@neuralmagic.com> * add fp16 pa support for aiter * [Bugfix/CI] Fix broken kernels/test_mha.py (vllm-project#12450) * [Bugfix][Kernel] Fix perf regression caused by PR vllm-project#12405 (vllm-project#12434) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> * [Build/CI] Fix libcuda.so linkage (vllm-project#12424) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * [Frontend] Rerank API (Jina- and Cohere-compatible API) (vllm-project#12376) Signed-off-by: Kyle Mistele <kyle@mistele.com> * [DOC] Add link to vLLM blog (vllm-project#12460) Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> * [V1] Avoid list creation in input preparation (vllm-project#12457) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [Frontend] Support scores endpoint in run_batch (vllm-project#12430) Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io> * [Bugfix] Fix Granite 3.0 MoE model loading (vllm-project#12446) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Bugfix] Fix missing seq_start_loc in xformers prefill metadata (vllm-project#12464) Signed-off-by: Isotr0py <2037008807@qq.com> * [V1][Minor] Minor optimizations for update_from_output (vllm-project#12454) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [Bugfix] Fix gpt2 GGUF inference (vllm-project#12467) Signed-off-by: Isotr0py <2037008807@qq.com> * aiter build instructions * [Build] Only build 9.0a for scaled_mm and sparse kernels (vllm-project#12339) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> * Copy to the right path * [V1][Metrics] Add initial Prometheus logger (vllm-project#12416) Signed-off-by: Mark McLoughlin <markmc@redhat.com> * [V1][CI/Test] Do basic test for top-p & top-k sampling (vllm-project#12469) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [FlashInfer] Upgrade to 0.2.0 (vllm-project#11194) Signed-off-by: Bowen Wang <abmfy@icloud.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com> * Support FP8 FA from Quark format (#388) * Support FP8 FA from Quark format * Support FP8 FA from Quark format * nit: update comment * Direct call on ROCm * 20250127 docs update (#392) * updating code blocks * typo * updated manifest * Including feedback * whitespace * Deepseek instructions * hyperlink fix * hyperlink fix * updating what is new * cpx update * typo * whitespace * whitespace * Add env var toggles to disable AITER MoE or PA (both by default on) * Update accuracy benchmark for batch size > 1 * Add a few more AITER toggles for norm and linear layers * Faster Custom Paged Attention kernels (#372) * integrate new cpa kernel, update tests and benchmark * added comments to mfma4 kernel * further comments for mfma16 kernel * clang-format * Lint * add flag for logits rtz conversion and 
disable by default * lint * [Bugfix]: Fix paged attention unit tests of #372 (#389) * [Bugfix]: fix paged attention tests based on the updated kernels in `csrc/attention/paged_attention_v1.cu`,`csrc/attention/paged_attention_v2.cu` and `csrc/rocm/attention.cu`. * improve code documentation. * lint --------- Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com> --------- Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by: Joe Shajrawi <17753158+shajrawi@users.noreply.github.com> Co-authored-by: TJian <tunjian1996@gmail.com> Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com> * Using a more precise profiling on ROCm to properly account for weights padding (#394) * Public aiter repo * Fail if aiter build failed silently * Aiter can only be built on MI300x * Typo fix * Aiter PA off by default * Changes to support updated aiter FP8 PA * Support FP8 and INT8 KV cache according to ROCm/aiter#90 * add moe weight shuffle for dynamic quant and unquantized path Signed-off-by: charlifu <charlifu@amd.com> * Use FP16-native PA after support in ROCm/aiter#97 * Fix: Use FP8 pertoken quantize if KV cache dtype is FP8 * revert rocm_flash_attn.py line 883 * Don't enable by default to use an RC for main vllm-dev docker * use ck moe for bf16 and fp16 fused_moe * Merge remote-tracking branch 'origin/aiter_intergration_final' into merge-aiter-llama-fp8 Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> * [Bugfix] include moe shuffle env variable Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> --------- Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Chen Zhang <zhangch99@outlook.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Roger Wang <ywang@roblox.com> Signed-off-by: yisheng <yi.sheng@intel.com> Signed-off-by: Abatom <abzhonghua@gmail.com> Signed-off-by: Liangfu Chen <liangfc@amazon.com> Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> Signed-off-by: Sourashis Roy <sroy@roblox.com> Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com> Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com> Signed-off-by: simon-mo <simon.mo@hey.com> Signed-off-by: Wallas Santos <wallashss@ibm.com> Signed-off-by: jiang1.li <jiang1.li@intel.com> Signed-off-by: yan ma <yan.ma@intel.com> Signed-off-by: Randall Smith <Randall.Smith@amd.com> Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: mgoin <michael@neuralmagic.com> Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com> Signed-off-by: Ye Qi <yeq@meta.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> Signed-off-by: Kuntai Du <kuntai@uchicago.edu> Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: Ren MinMin <renmm6@chinaunicom.cn> Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Fred Reiss <frreiss@us.ibm.com> Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com> Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> Signed-off-by: Akshat Tripathi 
<akshat@krai.ai> Signed-off-by: Oleg Mosalov <oleg@krai.ai> Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu> Signed-off-by: Chenguang Li <757486878@qq.com> Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com> Signed-off-by: Shanshan Shen <467638484@qq.com> Signed-off-by: elijah <f1renze.142857@gmail.com> Signed-off-by: Konrad Zawora <kzawora@habana.ai> Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: Rui Qiao <ruisearch42@gmail.com> Signed-off-by: Kyle Sayers <kylesayrs@gmail.com> Signed-off-by: Rahul Tuli <rahul@neuralmagic.com> Signed-off-by: kewang-xlnx <kewang@xilinx.com> Signed-off-by: kewang2 <kewang2@amd.com> Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> Signed-off-by: Divakar Verma <divakar.verma@amd.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: hongxyan <hongxyan@amd.com> Signed-off-by: Michal Adamczyk <madamczyk@habana.ai> Signed-off-by: zibai <zibai.gj@alibaba-inc.com> Signed-off-by: Martin Gleize <mgleize@meta.com> Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> Signed-off-by: isikhi <huseyin.isik000@gmail.com> Signed-off-by: Jason Cheng <jasoncky96@gmail.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com> Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Signed-off-by: Jannis Schönleber <joennlae@gmail.com> Signed-off-by: rickyx <rickyx@anyscale.com> Signed-off-by: Andy Lo <andy@mistral.ai> Signed-off-by: Adrian Cole <adrian.cole@elastic.co> Signed-off-by: maleksan85 <maleksan@amd.com> Signed-off-by: Hongxia Yang <hongxyan@amd.com> Signed-off-by: kevin <kevin@anyscale.com> Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: xffxff <1247714429@qq.com> Signed-off-by: wangerxiao <863579016@qq.com> Signed-off-by: Alexei V. 
Ivanov <alexei.ivanov@amd.com> Signed-off-by: zhenwei <zhenweiliu@habana.ai> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Siyuan Liu <lsiyuan@google.com> Signed-off-by: ElizaWszola <eliza@neuralmagic.com> Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp> Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai> Signed-off-by: Keyun Tong <tongkeyun@gmail.com> Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Signed-off-by: Kyle Mistele <kyle@mistele.com> Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Bowen Wang <abmfy@icloud.com> Signed-off-by: charlifu <charlifu@amd.com> Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by: maang-h <55082429+maang-h@users.noreply.github.com> Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: YiSheng5 <yi.sheng@intel.com> Co-authored-by: Zhonghua Deng <abatom@163.com> Co-authored-by: Liangfu Chen <liangfc@amazon.com> Co-authored-by: XiaobingZhang <xiaobingzhangupc@gmail.com> Co-authored-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Yuan <yuan.zhou@intel.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: jiangjiadi <34134495+jiangjiadi@users.noreply.github.com> Co-authored-by: jiadi.jjd <jiadi.jjd@antgroup.com> Co-authored-by: sroy745 <142070531+sroy745@users.noreply.github.com> Co-authored-by: Jie Fu (傅杰) <jiefu@tencent.com> Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com> Co-authored-by: WangErXiao <863579016@qq.com> Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com> Co-authored-by: Ilya Lavrenov <ilya.lavrenov@intel.com> Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Wallas Henrique <wallashss@users.noreply.github.com> Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> Co-authored-by: Yan Ma <yan.ma@intel.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: rasmith <Randall.Smith@amd.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Maximilien de Bayser <mbayser@br.ibm.com> Co-authored-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com> Co-authored-by: Guspan Tanadi <36249910+guspan-tanadi@users.noreply.github.com> Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com> Co-authored-by: yeq <yeq@devgpu004.lla3.facebook.com> Co-authored-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Charles Frye <cfrye59@gmail.com> Co-authored-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> Co-authored-by: cennn <61925104+cennn@users.noreply.github.com> Co-authored-by: Kuntai Du <kuntai@uchicago.edu> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: minmin <rmm0811@gmail.com> Co-authored-by: Ren MinMin 
<renmm6@chinaunicom.cn> Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com> Co-authored-by: Fred Reiss <frreiss@us.ibm.com> Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Co-authored-by: shaochangxu <85155497+shaochangxu@users.noreply.github.com> Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Co-authored-by: sixgod <evethwillbeok@outlook.com> Co-authored-by: Isotr0py <2037008807@qq.com> Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com> Co-authored-by: Akshat Tripathi <Akshat.tripathi6568@gmail.com> Co-authored-by: Oleg Mosalov <oleg@krai.ai> Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com> Co-authored-by: Yangcheng Li <liyangcheng.lyc@alibaba-inc.com> Co-authored-by: Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com> Co-authored-by: Concurrensee <yida.wu@amd.com> Co-authored-by: Chenguang Li <757486878@qq.com> Co-authored-by: Alex Brooks <alex.brooks@ibm.com> Co-authored-by: Shanshan Shen <467638484@qq.com> Co-authored-by: elijah <30852919+e1ijah1@users.noreply.github.com> Co-authored-by: Konrad Zawora <kzawora@habana.ai> Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com> Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com> Co-authored-by: Kyle Sayers <kylesayrs@gmail.com> Co-authored-by: Rahul Tuli <rahul@neuralmagic.com> Co-authored-by: Keyun Tong <tongkeyun@gmail.com> Co-authored-by: RunningLeon <maningsheng@sensetime.com> Co-authored-by: kewang-xlnx <73578509+kewang-xlnx@users.noreply.github.com> Co-authored-by: kewang2 <kewang2@amd.com> Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: tvirolai-amd <teemu.virolainen@amd.com> Co-authored-by: Michael Goin <mgoin@redhat.com> Co-authored-by: Zhaoyi Li <36555117+Lzy17@users.noreply.github.com> Co-authored-by: charlifu <charlifu@amd.com> Co-authored-by: Yuan Tang <terrytangyuan@gmail.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com> Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com> Co-authored-by: yancong <32220263+ice-tong@users.noreply.github.com> Co-authored-by: Michal Adamczyk <madamczyk@habana.ai> Co-authored-by: gujing <925973396@qq.com> Co-authored-by: imkero <kerorek@outlook.com> Co-authored-by: Martin Gleize <mgleize@meta.com> Co-authored-by: mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas> Co-authored-by: shangmingc <caishangming@linux.alibaba.com> Co-authored-by: Işık <41375111+isikhi@users.noreply.github.com> Co-authored-by: Roger Wang <ywang@roblox.com> Co-authored-by: Cheng Kuan Yong Jason <jasoncky96@gmail.com> Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com> Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Jannis Schönleber <joennlae@gmail.com> Co-authored-by: Ricky Xu <xuchen727@hotmail.com> Co-authored-by: Andy Lo <andylolu24@gmail.com> Co-authored-by: Adrian Cole <64215+codefromthecrypt@users.noreply.github.com> Co-authored-by: Jani Monoses <jani.monoses@gmail.com> Co-authored-by: Kevin H. 
Luu <kevin@anyscale.com> Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com> Co-authored-by: maleksan85 <maleksan@amd.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com> Co-authored-by: zhou fan <1247714429@qq.com> Co-authored-by: ilia-cher <30845429+ilia-cher@users.noreply.github.com> Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com> Co-authored-by: liuzhenwei <zhenweiliu@habana.ai> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Micah Williamson <micah.williamson@amd.com> Co-authored-by: Siyuan Liu <lsiyuan@google.com> Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com> Co-authored-by: ElizaWszola <eliza@neuralmagic.com> Co-authored-by: Junichi Sato <junichi.sato@sbintuitions.co.jp> Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com> Co-authored-by: Robert Shaw <rshaw@neuralmagic.com> Co-authored-by: omer-dayan <omer@run.ai> Co-authored-by: Mohit Deopujari <mdeopujari@habana.ai> Co-authored-by: Jeremy Arnold <103538711+JArnoldAMD@users.noreply.github.com> Co-authored-by: chenjun <junchen2@amd.com> Co-authored-by: ValarLip <340077269@qq.com> Co-authored-by: Matthew Hendrey <matthew.hendrey@gmail.com> Co-authored-by: Kyle Mistele <kyle@mistele.com> Co-authored-by: Pooya Davoodi <pooya.davoodi@parasail.io> Co-authored-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Bowen Wang <abmfy@icloud.com> Co-authored-by: Bowen Bao <bowenbao@amd.com> Co-authored-by: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com> Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com> Co-authored-by: sanyalington <shomy.sanyal@amd.com> Co-authored-by: Joe Shajrawi <17753158+shajrawi@users.noreply.github.com> Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com> Co-authored-by: charlifu <chalifu@amd.com>
1 parent 87b3c56 commit 4c8c86d

File tree: 13 files changed (+576, -123 lines)

Dockerfile.rocm

Lines changed: 7 additions & 0 deletions
```diff
@@ -109,11 +109,18 @@ ARG COMMON_WORKDIR
 COPY --from=export_vllm /benchmarks ${COMMON_WORKDIR}/vllm/benchmarks
 COPY --from=export_vllm /examples ${COMMON_WORKDIR}/vllm/examples
 
+RUN git clone --recursive https://github.com/ROCm/aiter.git
+RUN cd /app/aiter && pip install -r requirements.txt && PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py develop && pip show aiter
+
+
 ENV RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1
 ENV TOKENIZERS_PARALLELISM=false
 
 # Performance environment variable.
 ENV HIP_FORCE_DEV_KERNARG=1
 
+# Enable Aiter. Make sure this only exists on the aiter branch.
+# ENV VLLM_USE_AITER=1
+
 CMD ["/bin/bash"]
```

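The commented-out `ENV VLLM_USE_AITER=1` above is the image-level switch for the AITER path, and the commit description mentions additional per-kernel toggles (MoE, paged attention, norm, linear). A minimal sketch of how such boolean toggles could be read on the Python side; the helper name `_env_flag`, the MoE variable name, and the default values are illustrative, and only `VLLM_USE_AITER` and `VLLM_USE_AITER_PAGED_ATTN` actually appear in this commit's diff:

```python
import os


def _env_flag(name: str, default: bool) -> bool:
    """Parse a boolean environment toggle such as VLLM_USE_AITER."""
    return os.environ.get(name, "1" if default else "0").lower() in ("1", "true")


# Hypothetical toggles mirroring the commit description: individual AITER
# kernels can be switched off even when AITER itself is enabled.
VLLM_USE_AITER = _env_flag("VLLM_USE_AITER", default=False)
VLLM_USE_AITER_PAGED_ATTN = VLLM_USE_AITER and _env_flag(
    "VLLM_USE_AITER_PAGED_ATTN", default=True)
VLLM_USE_AITER_MOE = VLLM_USE_AITER and _env_flag(
    "VLLM_USE_AITER_MOE", default=True)
```

With a layout like this, passing `-e VLLM_USE_AITER=1` (or unsetting one of the sub-toggles) at container start flips an AITER kernel on or off without rebuilding the image.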
benchmarks/test_accuracy.py

Lines changed: 75 additions & 38 deletions
```diff
@@ -1,45 +1,82 @@
-import time
+import argparse
+import dataclasses
 
+# from transformers import AutoTokenizer
 from vllm import LLM, SamplingParams
+from vllm.engine.arg_utils import EngineArgs
+from vllm.utils import FlexibleArgumentParser
 
 
-def main():
-    llm = LLM(
-        '/data/AI-ModelScope/Mixtral-8x7B-Instruct-v0___1/',
-        tensor_parallel_size=1,
-        #quantization="serenity",
-        dtype='float16',
-        #swap_space=16,
-        #enforce_eager=True,
-        #kv_cache_dtype="fp8",
-        #quantization="fp8",
-        #quantized_weights_path="/quantized/quark/llama.safetensors",
-        #worker_use_ray=True,
-        #trust_remote_code=True,
-        #distributed_executor_backend="mp",
-    )
-    batch_size = 5
-    max_tokens = 256
-    prompt = """The sun is a"""
-    sampling_params = SamplingParams(temperature=0,
-                                     top_p=0.95,
-                                     max_tokens=max_tokens)
-
-    start_time = time.perf_counter()
-    outs = llm.generate([prompt] * batch_size, sampling_params=sampling_params)
-    end_time = time.perf_counter()
-    elapsed_time = end_time - start_time
-
-    out_lengths = [len(x.token_ids) for out in outs for x in out.outputs]
-    num_tokens = sum(out_lengths)
-
-    print(
-        f"{num_tokens} tokens. {num_tokens / batch_size} on average. {num_tokens / elapsed_time:.2f} tokens/s. {elapsed_time} seconds"  # noqa: E501
+def main(args: argparse.Namespace):
+    print(args)
+
+    engine_args = EngineArgs.from_cli_args(args)
+
+    # NOTE(woosuk): If the request cannot be processed in a single batch,
+    # the engine will automatically process the request in multiple batches.
+    llm = LLM(**dataclasses.asdict(engine_args))
+
+    sampling_params = SamplingParams(
+        n=args.n,
+        temperature=1.0,
+        top_p=1.0,
+        ignore_eos=True,
+        max_tokens=args.output_len,
     )
-    for out in outs:
-        print("===========")
-        print(out.outputs[0].text)
+    print(sampling_params)
+
+    # tokenizer = AutoTokenizer.from_pretrained(engine_args.model)
+    # inputs = tokenizer('Hello, world!', return_tensors='pt').input_ids
+    inputs = [
+        'Where is the capital of China?',
+        'The capital of Russia is ',
+        'The CEO of DeepSeek is ',
+        'The future of AI is',
+    ] * 32
+    outputs = llm.generate(inputs, sampling_params)
+    for i, output in enumerate(outputs):
+        prompt = output.prompt
+        generated_text = output.outputs[0].text
+        print(f"Prompt {i}: {prompt!r}, Generated text: {generated_text!r}")
+    # print(tokenizer.decode(outputs[0]))
+
 
+if __name__ == '__main__':
+    parser = FlexibleArgumentParser(
+        description='Benchmark the latency of processing a single batch of '
+        'requests till completion.')
+    parser.add_argument('--input-len', type=int, default=32)
+    parser.add_argument('--output-len', type=int, default=128)
+    parser.add_argument('--batch-size', type=int, default=8)
+    parser.add_argument('--n',
+                        type=int,
+                        default=1,
+                        help='Number of generated sequences per prompt.')
+    parser.add_argument('--use-beam-search', action='store_true')
+    parser.add_argument('--num-iters-warmup',
+                        type=int,
+                        default=10,
+                        help='Number of iterations to run for warmup.')
+    parser.add_argument('--num-iters',
+                        type=int,
+                        default=30,
+                        help='Number of iterations to run.')
+    parser.add_argument(
+        '--profile',
+        action='store_true',
+        help='profile the generation process of a single batch')
+    parser.add_argument(
+        '--profile-result-dir',
+        type=str,
+        default=None,
+        help=('path to save the pytorch profiler output. Can be visualized '
+              'with ui.perfetto.dev or Tensorboard.'))
+    parser.add_argument(
+        '--output-json',
+        type=str,
+        default=None,
+        help='Path to save the latency results in JSON format.')
 
-if __name__ == "__main__":
-    main()
+    parser = EngineArgs.add_cli_args(parser)
+    args = parser.parse_args()
+    main(args)
```

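The rewrite replaces the hard-coded Mixtral checkpoint path with standard vLLM `EngineArgs` CLI plumbing, so the accuracy check can target any model and engine configuration. The same pattern also works programmatically; a minimal sketch, with a small placeholder model id standing in for whatever checkpoint is actually being validated:

```python
import dataclasses

from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import EngineArgs

# Build the engine arguments directly instead of from CLI flags.
# The model id below is a placeholder for illustration only.
engine_args = EngineArgs(model="facebook/opt-125m", dtype="float16")
llm = LLM(**dataclasses.asdict(engine_args))

sampling_params = SamplingParams(n=1,
                                 temperature=1.0,
                                 top_p=1.0,
                                 ignore_eos=True,
                                 max_tokens=64)
outputs = llm.generate(["The future of AI is"], sampling_params)
print(outputs[0].outputs[0].text)
```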
vllm/__init__.py

Lines changed: 0 additions & 2 deletions
```diff
@@ -6,7 +6,6 @@
 from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
 from vllm.engine.async_llm_engine import AsyncLLMEngine
 from vllm.engine.llm_engine import LLMEngine
-from vllm.entrypoints.fast_sync_llm import FastSyncLLM
 from vllm.entrypoints.llm import LLM
 from vllm.executor.ray_utils import initialize_ray_cluster
 from vllm.inputs import PromptType, TextPrompt, TokensPrompt
@@ -38,7 +37,6 @@
     "__version__",
     "__version_tuple__",
     "LLM",
-    "FastSyncLLM",
     "ModelRegistry",
     "PromptType",
     "TextPrompt",
```

vllm/attention/backends/rocm_flash_attn.py

Lines changed: 53 additions & 2 deletions
```diff
@@ -11,8 +11,14 @@
                                               AttentionMetadata, AttentionType)
 from vllm.attention.backends.utils import (CommonAttentionState,
                                            CommonMetadataBuilder)
-from vllm.attention.ops.paged_attn import (PagedAttention,
-                                           PagedAttentionMetadata)
+
+if envs.VLLM_USE_AITER_PAGED_ATTN:
+    from vllm.attention.ops.paged_attn_ater import (PagedAttention,
+                                                    PagedAttentionMetadata)
+else:
+    from vllm.attention.ops.paged_attn import (PagedAttention,
+                                               PagedAttentionMetadata)
+
 from vllm.logger import init_logger
 from vllm.platforms import current_platform
 
@@ -460,6 +466,9 @@ def __init__(
         logits_soft_cap: Optional[float] = None,
         attn_type: str = AttentionType.DECODER,
     ) -> None:
+        self.k_scale = torch.tensor([1.0], dtype=torch.float32)
+        self.v_scale = torch.tensor([1.0], dtype=torch.float32)
+        self.init_kv_scales = False
         if blocksparse_params is not None:
             raise ValueError(
                 "ROCmFlashAttention does not support blocksparse attention.")
@@ -609,6 +618,25 @@ def forward(
         else:
             assert value is None
 
+        if (envs.VLLM_USE_AITER_PAGED_ATTN and kv_cache.dtype.itemsize == 1
+                and self.init_kv_scales is False
+                and kv_cache.shape != torch.Size([0])):
+            num_blocks = kv_cache.shape[1]
+            block_size = kv_cache.shape[2] // (self.num_kv_heads *
+                                               self.head_size)
+            self.k_scale = torch.ones(
+                (self.num_kv_heads, num_blocks * block_size),
+                dtype=torch.float32,
+                device=kv_cache.device)
+            self.v_scale = torch.ones(
+                (self.num_kv_heads, num_blocks * block_size),
+                dtype=torch.float32,
+                device=kv_cache.device)
+            self.init_kv_scales = True
+        # if self.init_kv_scales:
+        layer._k_scale = self.k_scale
+        layer._v_scale = self.v_scale
+
         if self.attn_type != AttentionType.ENCODER and kv_cache.numel() > 0:
             key_cache, value_cache = PagedAttention.split_kv_cache(
                 kv_cache, self.num_kv_heads, self.head_size)
@@ -780,6 +808,29 @@ def forward(
             use_custom = _use_rocm_custom_paged_attention(
                 decode_query.dtype, head_size, block_size, gqa_ratio,
                 decode_meta.max_decode_seq_len)
+            if envs.VLLM_USE_AITER_PAGED_ATTN:
+                out = output[num_prefill_tokens:]
+                PagedAttention.forward_decode(
+                    decode_query,
+                    key_cache,
+                    value_cache,
+                    decode_meta.block_tables
+                    if self.attn_type != AttentionType.ENCODER_DECODER else
+                    decode_meta.cross_block_tables,
+                    decode_meta.seq_lens_tensor
+                    if self.attn_type != AttentionType.ENCODER_DECODER else
+                    decode_meta.encoder_seq_lens_tensor,
+                    decode_meta.max_decode_seq_len
+                    if self.attn_type != AttentionType.ENCODER_DECODER else
+                    decode_meta.max_encoder_seq_len,
+                    self.kv_cache_dtype,
+                    self.num_kv_heads,
+                    self.scale,
+                    self.alibi_slopes,
+                    layer._k_scale,
+                    layer._v_scale,
+                    out=out)
+                return output.view(-1, self.num_heads * self.head_size)
             if use_custom:
                 max_seq_len = (decode_meta.max_decode_seq_len if
                                self.attn_type != AttentionType.ENCODER_DECODER
```

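The second and third hunks lazily allocate per-token KV-cache scales for the AITER path: one FP32 scale per KV head and per cache slot, with the block size recovered from the flattened ROCm cache layout. A small standalone sketch of that shape arithmetic, using placeholder dimensions (the cache shape shown is the layout the diff's indexing assumes):

```python
import torch

# Placeholder dimensions for illustration only.
num_kv_heads, head_size = 8, 128
num_blocks, block_size = 4, 16

# The diff reads num_blocks from kv_cache.shape[1] and recovers block_size
# from the flattened third dimension (block_size * num_kv_heads * head_size).
kv_cache = torch.empty(2, num_blocks, block_size * num_kv_heads * head_size)
derived_block_size = kv_cache.shape[2] // (num_kv_heads * head_size)
assert derived_block_size == block_size

# One unit scale per KV head and per cache slot, as in the FP8/INT8 AITER path.
k_scale = torch.ones((num_kv_heads, num_blocks * derived_block_size),
                     dtype=torch.float32)
v_scale = torch.ones_like(k_scale)
print(k_scale.shape)  # torch.Size([8, 64])
```

Initializing the scales to ones gives neutral scaling until real calibration values are written into `layer._k_scale` and `layer._v_scale`.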