
v1: Add Whisper model support (encoder-decoder) #21088


Draft: russellb wants to merge 21 commits into main from v1-whisper

Conversation

russellb
Member

@russellb russellb commented Jul 17, 2025

This brings Whisper support to V1 to close one of the remaining
feature gaps with V0. Most of the changes apply to encoder-decoder
models generally, though Whisper is the only one explicitly tested
and is the only encoder-decoder model updated to support V1.

Whisper Model Implementation:

  • Remove SupportsV0Only interface constraint to enable V1 compatibility
  • Update get_multimodal_embeddings() to return list format required by V1
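
As a rough illustration of the second bullet: V1 consumes multimodal embeddings as a list with one tensor per input item rather than a single batched tensor. The sketch below is only a toy stand-in; the encoder stub, shapes, and free function are assumptions, not the actual Whisper model code.

```python
import torch
import torch.nn as nn

class WhisperEncoderStub(nn.Module):
    """Toy stand-in for the Whisper audio encoder (not the real module)."""

    def __init__(self, hidden_size: int = 8):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_features)

def get_multimodal_embeddings(
    encoder: nn.Module, audio_features: torch.Tensor
) -> list[torch.Tensor]:
    # Return one embedding tensor per input item, the list format V1 expects.
    encoder_out = encoder(audio_features)  # [num_items, seq_len, hidden]
    return list(encoder_out.unbind(dim=0))

# Two fake audio items, each with 4 encoder positions.
embeddings = get_multimodal_embeddings(WhisperEncoderStub(), torch.randn(2, 4, 8))
assert isinstance(embeddings, list) and len(embeddings) == 2
```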

Flash Attention Backend:

  • Add encoder attention metadata fields (encoder_seq_start_loc, max_encoder_seq_len, cross_slot_mapping)
  • Implement encoder self-attention support without using KV cache
  • Add cross-attention support for encoder-decoder models with proper KV cache handling
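
To make the first bullet more concrete, here is a minimal sketch of a metadata container holding the new encoder fields. The field names come from the bullet above; the dataclass layout, types, and defaults are assumptions, not vLLM's actual FlashAttentionMetadata definition.

```python
from dataclasses import dataclass
from typing import Optional

import torch

@dataclass
class EncoderAttnMetadataSketch:
    # Cumulative start offsets of each encoder sequence in the packed batch,
    # analogous to the decoder-side seq_start_loc.
    encoder_seq_start_loc: Optional[torch.Tensor] = None
    # Length of the longest encoder sequence in the batch, used to size kernels.
    max_encoder_seq_len: Optional[int] = None
    # Cross-attention only: KV cache slot index for each encoder token,
    # telling the backend where to write the encoder keys/values.
    cross_slot_mapping: Optional[torch.Tensor] = None
```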

KV Cache Manager:

  • Introduce CrossAttentionManager for handling cross-attention KV cache in encoder-decoder models
  • Add CrossAttentionSpec for cross-attention cache specification with encoder-based sizing
  • Implement allocate_slots_for_cross_attn() for static encoder-length-based allocation
  • Add cross-attention block allocation logic separate from decoder token growth
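
The allocation described in the third bullet differs from decoder allocation in one key way: the cross-attention cache is sized once from the encoder length and never grows as tokens are decoded. The helper below is a hypothetical sketch of that idea, not the real KVCacheManager API.

```python
import math

def allocate_slots_for_cross_attn_sketch(
    free_blocks: list[int], encoder_seq_len: int, block_size: int
) -> list[int]:
    """Allocate every cross-attention block for a request up front.

    Cross-attention keys/values come from the encoder output, whose length is
    known at schedule time, so no further allocation is needed during decoding.
    """
    num_blocks = math.ceil(encoder_seq_len / block_size)
    if num_blocks > len(free_blocks):
        raise RuntimeError("not enough free KV cache blocks for the encoder output")
    return [free_blocks.pop() for _ in range(num_blocks)]

# Example: Whisper's encoder emits 1500 positions per 30s chunk, so with
# 16-token blocks a request needs ceil(1500 / 16) = 94 cross-attention blocks.
blocks = allocate_slots_for_cross_attn_sketch(list(range(1000)), 1500, 16)
assert len(blocks) == 94
```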

Scheduler:

  • Disable prefix caching for encoder-decoder models
  • Implement cross-attention block allocation during request scheduling
  • Add cross-attention block tracking in state management
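
A toy sketch of the kind of guard the first scheduler bullet implies: prefix caching is simply forced off when the model is encoder-decoder. The attribute names are assumptions for illustration, not vLLM's actual config fields.

```python
class SchedulerConfigSketch:
    def __init__(self, is_encoder_decoder: bool, enable_prefix_caching: bool):
        if is_encoder_decoder and enable_prefix_caching:
            # Cache hashes would also have to cover the encoder input/output,
            # so prefix caching is disabled for encoder-decoder models.
            enable_prefix_caching = False
        self.is_encoder_decoder = is_encoder_decoder
        self.enable_prefix_caching = enable_prefix_caching

cfg = SchedulerConfigSketch(is_encoder_decoder=True, enable_prefix_caching=True)
assert cfg.enable_prefix_caching is False
```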

GPU Model Runner:

  • Add encoder input extraction for audio features processing
  • Implement encoder attention metadata building for both self-attention and cross-attention
  • Add cross-attention KV cache group handling with proper slot mapping
  • Modify input batch creation to accommodate encoder sequence lengths
  • Add encoder input processing in forward pass with proper device/dtype handling
  • Update profiling and memory management for encoder-decoder models
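
For the encoder input bullets, here is a rough sketch of moving extracted audio features onto the model's device and dtype before the encoder forward pass. The helper name and structure are hypothetical; only the general shape of the handling is implied by the bullets above.

```python
import torch

def prepare_encoder_inputs_sketch(
    audio_features: list[torch.Tensor],
    device: torch.device,
    dtype: torch.dtype,
) -> torch.Tensor:
    # Stack per-request audio features and match the model's device/dtype.
    batch = torch.stack(audio_features, dim=0)
    return batch.to(device=device, dtype=dtype, non_blocking=True)

# Example with fake 80-bin mel features for two requests, cast to float16.
feats = [torch.randn(80, 3000) for _ in range(2)]
batch = prepare_encoder_inputs_sketch(feats, torch.device("cpu"), torch.float16)
assert batch.shape == (2, 80, 3000) and batch.dtype == torch.float16
```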

The implementation maintains backward compatibility while adding comprehensive
encoder-decoder support, with particular focus on Whisper's audio processing
pipeline and cross-attention mechanisms between encoder and decoder.

Related to:

  • V0 deprecation: #18571
  • 2025 Q3 roadmap: #20336

Signed-off-by: Russell Bryant <rbryant@redhat.com>


TODO items:

@russellb russellb mentioned this pull request Jul 16, 2025

mergify bot commented Jul 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @russellb.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is a significant and well-structured pull request that adds Whisper (encoder-decoder) model support to vLLM's V1 engine. The changes are comprehensive, touching on the attention backend, KV cache management, scheduler, and GPU model runner to accommodate the new architecture.

I've identified one critical issue in _build_encoder_attn_metadata where a missing else block could lead to a size mismatch and a runtime error. I've provided a code suggestion to fix this potential bug. Other than that, the implementation looks solid and correctly integrates encoder-decoder support into the existing V1 framework. Great work on this complex feature!


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which covers a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of these by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@DarkLight1337
Member

There is already some work to support encoder-decoder models:

Can you coordinate with @maxdebayser to avoid duplicate work?

@maxdebayser
Contributor

Yeah, I've been talking with @russellb, as there are a few overlapping points in our PRs, for example disabling prefix caching and chunked prefill.
Currently in my PR I'm not disabling the KV cache entirely, because functionally it makes no difference for the encoder attention, so I can keep the diff small. But I do want to test whether removing the KV cache gives a performance improvement for encoder models.

@russellb
Member Author

There is already some work to support encoder-decoder models:

Can you coordinate with @maxdebayser to avoid duplicate work?

Yep, we're in contact.

Did you mean to link something different than #20226?

Roughly though, Max had worked on encoder-only support, and I was doing encoder-decoder, which is mostly a superset of encoder-only changes, though I haven't actually tested any encoder-only models with my branch yet.

@russellb
Member Author

follow-up on next steps and collaboration with @maxdebayser

We're going to combine our work and try to land it all in a few stages.

PR 1) Combine parts of his encoder-only PR (#19988) with the encoder-without-kv-cache changes in this branch. That will be a new jointly-authored PR that will cover encoder-only attention.

PR 2) Update this PR with what's left to make Whisper / encoder-decoder work. That includes some Whisper model changes and a bunch of changes to support cross-attention (encoder-decoder type).

PR 3) Add the last parts of Max's original PR, which supports token_type_ids to run the bert classifier models that need them.

@russellb russellb force-pushed the v1-whisper branch 3 times, most recently from 96be9ad to 4da8b7c on July 17, 2025 at 19:27
Contributor

@NickLucche NickLucche left a comment


nice one!

```python
        self.use_irope = use_irope
        self.vllm_flash_attn_version = get_flash_attn_version()
        if is_quantized_kv_cache(self.kv_cache_dtype) \
                and not flash_attn_supports_fp8():
            raise NotImplementedError(
                "FlashAttention does not support fp8 kv-cache on this device.")

    @staticmethod
    def _get_causal_option(attn_type: str) -> bool:
```

nit: _is_causal_attention?
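
For context on the helper being discussed, here is a hedged sketch of what a causal-flag lookup keyed on attention type could look like. The AttentionType names mirror vLLM's existing constants, but this body is an illustration of the idea, not necessarily the PR's exact implementation.

```python
class AttentionType:
    DECODER = "decoder"
    ENCODER = "encoder"
    ENCODER_ONLY = "encoder_only"
    ENCODER_DECODER = "encoder_decoder"

def _get_causal_option(attn_type: str) -> bool:
    # Only decoder self-attention is causal; encoder self-attention and
    # encoder-decoder cross-attention attend over the full encoder sequence.
    return attn_type not in (
        AttentionType.ENCODER,
        AttentionType.ENCODER_ONLY,
        AttentionType.ENCODER_DECODER,
    )

assert _get_causal_option(AttentionType.DECODER) is True
assert _get_causal_option(AttentionType.ENCODER_DECODER) is False
```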

@russellb russellb force-pushed the v1-whisper branch 3 times, most recently from 16f557d to a9e3459 on July 18, 2025 at 20:46
@mergify mergify bot added the documentation label and removed the needs-rebase label Jul 18, 2025
@russellb
Member Author

I got this caught up with main with all conflicts resolved, but I haven't addressed feedback received so far.

@russellb russellb force-pushed the v1-whisper branch 2 times, most recently from 87d9bfa to f62a66e on July 18, 2025 at 21:00

mergify bot commented Jul 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @russellb.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 19, 2025
Add support for encoder models such as BERT, which don't use
a KV cache due to their non-causal attention. Since the KV Cache
Spec is used to build the attention metadata for decoder models,
this PR initializes the attention metadata builders for encoder-only
models directly from the layers and adds a function to build the
attention metadata.

This PR combines elements of PRs
vllm-project#21088
and vllm-project#19988

Summary of changes:

**Flash Attention Backend:**
- Implement encoder self-attention support without using KV cache

**Scheduler:**
- Disable chunked prefill for models without KV cache

**GPU Model Runner:**
- Implement encoder-only attention metadata building for self-attention

Related to:
- V0 deprecation: vllm-project#18571
- 2025 Q3 roadmap: vllm-project#20336

Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
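
As a toy illustration of "encoder self-attention support without using KV cache" from the commit message above: keys and values come directly from the current encoder hidden states, so nothing is written to or read from paged KV blocks. This sketch uses plain PyTorch SDPA rather than the FlashAttention backend the commit actually touches.

```python
import torch
import torch.nn.functional as F

def encoder_self_attention_sketch(hidden: torch.Tensor, num_heads: int) -> torch.Tensor:
    # hidden: [seq_len, hidden_size]. No KV cache: K and V are recomputed from
    # the encoder states, which is fine because the encoder runs only once.
    seq_len, hidden_size = hidden.shape
    head_dim = hidden_size // num_heads
    q = k = v = hidden.view(seq_len, num_heads, head_dim).transpose(0, 1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=False)  # non-causal
    return out.transpose(0, 1).reshape(seq_len, hidden_size)

out = encoder_self_attention_sketch(torch.randn(6, 16), num_heads=4)
assert out.shape == (6, 16)
```
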
Collaborator

@WoosukKwon WoosukKwon left a comment


I think we haven't made a concrete decision on whether to support the model in V1. Let's discuss offline.

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

mergify bot commented Jul 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @russellb.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 23, 2025
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
@russellb
Member Author

I think we haven't made a concrete decision on whether to support the model in V1. Let's discuss offline.

Understood! I worked on this partly to answer the question of how invasive it would be, so we'd have something concrete to evaluate. I'd love to walk through it sometime if we can sync live!

Some key takeaways for me:

  • The hybrid memory allocator fit pretty nicely for allocating KV cache blocks for cross-attention (ENCODER_DECODER). That alone kept this change from being much more invasive. I did have to make some tweaks to the KV cache manager, since ENCODER_DECODER needs a single static allocation instead of an allocation that grows over time, so the logic is a bit different (simpler, though).

  • The attention backend changes are bigger for encoder self-attention compared to cross-attention. The encoder side has been split out into Support encoder-only models without KV-Cache #21270.

  • Prefix caching is off for Whisper (and encoder-decoder models in general). At least it's a very small and simple change. I do think it's possible to support prefix caching, but it would make the code more complex than it's worth. The complication is that block hashes can't be based only on the decoder sequence; they must also include a hash of the encoder input/output. Since the encoder data isn't embedded in the decoder sequence (as it is in a LLaVA-style model), taking it into account would require new handling that's only relevant to encoder-decoder models. I just didn't feel that complexity was worth it.
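
To make the hashing point concrete, here is a toy sketch of why an encoder-decoder prefix-cache key would need more than the decoder tokens. This is purely illustrative and not vLLM's block-hash code.

```python
import hashlib
from typing import Optional

def block_hash_sketch(decoder_token_ids: list[int], encoder_digest: Optional[str]) -> str:
    # For decoder-only models, hashing the token prefix is enough. For an
    # encoder-decoder model the same decoder prefix can follow different audio
    # inputs, so a digest of the encoder input/output must be mixed in as well.
    h = hashlib.sha256()
    h.update(str(decoder_token_ids).encode("utf-8"))
    if encoder_digest is not None:
        h.update(encoder_digest.encode("utf-8"))
    return h.hexdigest()

# Same decoder prefix, different audio -> different cache keys.
assert block_hash_sketch([1, 2, 3], "audio-a") != block_hash_sketch([1, 2, 3], "audio-b")
```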

@russellb
Member Author

I have now reworked this PR to be on top of #21270. The changes that remain after #21270 are in the final commit at the end. cc @maxdebayser

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
To fix the test I switched to the uniproc executor, but
now I'm getting weird issues like

```
torch._dynamo.exc.BackendCompilerFailed: backend='<vllm.compilation.backends.VllmBackend object at 0x7ef9f68ca600>' raised:
AttributeError: module 'torch._tensor' has no attribute 'split'
```

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
@russellb russellb added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 25, 2025
@russellb
Member Author

For convenience, here are the differences between this branch and the branch from #21270.

maxdebayser/vllm@v1_encoder_only...russellb:vllm:v1-whisper

maxdebayser and others added 4 commits July 25, 2025 17:24
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>

mergify bot commented Jul 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @russellb.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 26, 2025
Labels
documentation · needs-rebase · ready · speculative-decoding · v1