Skip to content

Conversation

@Lucaskabela
Copy link
Contributor

@Lucaskabela Lucaskabela commented Oct 29, 2025

Purpose

See title - this PR fixes an error caused by #23207 where the SDPA backend could not be compiled.

This was because a fix for tracing SliceVariables containing single element tensors is not yet landed (but should be in Pytorch2.10)

Until then, we shall use a custom op to prevent graph break

Test Plan

python examples/offline_inference/vision_language.py -m qwen2_5_vl

With edit to the vision_langauge.py file to use TORCH_SDPA (as this is not supported on commandline)

E2E:

with-proxy vllm serve Qwen/Qwen2.5-VL-3B-Instruct --gpu_memory_utilization=.85 --mm-encoder-attn-backend TORCH_SDPA

and

with-proxy vllm bench serve   --backend openai-chat   --model Qwen/Qwen2.5-VL-3B-Instruct   --endpoint /v1/chat/completions   --dataset-name hf   --dataset-path lmarena-ai/VisionArena-Chat   --hf-split train   --num-prompts 1000

Test Results

--------------------------------------------------
The image depicts the Tokyo Skytree, a tall broadcasting tower located in Sumida, Tokyo, Japan. The tower is surrounded by cherry blossom trees in full bloom, creating a picturesque and vibrant scene. The cherry blossoms are in various stages of bloom, with pink and white petals covering the branches, and the sky is
--------------------------------------------------
This image depicts the Tokyo Skytree, a tall broadcasting tower located in Sumida, Tokyo, Japan. The tower is surrounded by cherry blossom trees in full bloom, creating a picturesque and vibrant scene. The cherry blossoms are in various stages of bloom, with pink and white flowers covering the branches. The sky is clear
--------------------------------------------------
The image depicts the Tokyo Skytree, a tall broadcasting tower located in the Odaiba district of Tokyo, Japan. The tower is surrounded by cherry blossom trees in full bloom, creating a picturesque and vibrant scene. The cherry blossoms are in various stages of bloom, with pink and white petals covering the branches, adding
--------------------------------------------------
The image depicts the Tokyo Skytree, a tall broadcasting tower located in Sumida, Tokyo, Japan. The tower is surrounded by cherry blossom trees, which are in full bloom, creating a picturesque scene against the clear blue sky. The cherry blossoms are in various stages of bloom, with some fully open and others still
--------------------------------------------------

E2E

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  144.15    
Total input tokens:                      94327     
Total generated tokens:                  106396    
Request throughput (req/s):              6.94      
Output token throughput (tok/s):         738.11    
Peak output token throughput (tok/s):    14908.00  
Peak concurrent requests:                1000.00   
Total Token throughput (tok/s):          1392.49   
---------------Time to First Token----------------
Mean TTFT (ms):                          71754.34  
Median TTFT (ms):                        71486.27  
P99 TTFT (ms):                           138089.08 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          694.75    
Median TPOT (ms):                        665.18    
P99 TPOT (ms):                           1462.31   
---------------Inter-token Latency----------------
Mean ITL (ms):                           603.27    
Median ITL (ms):                         58.55     
P99 ITL (ms):                            1621.47   
==================================================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the qwen Related to Qwen models label Oct 29, 2025
@mergify
Copy link

mergify bot commented Oct 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Lucaskabela.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@huachenheli
Copy link
Contributor

Can you update the PR description? I think if you put "```" and text on the same line it won't show any more.

Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
@Lucaskabela Lucaskabela force-pushed the lucaskabela/qwen2_5_compile_sdpa_fix branch from 7c50ed1 to 50ffd98 Compare October 30, 2025 00:01
@Lucaskabela Lucaskabela marked this pull request as ready for review October 30, 2025 00:01
@mergify mergify bot removed the needs-rebase label Oct 30, 2025
for i in range(1, len(cu_seqlens)):
start_idx = cu_seqlens[i - 1]
end_idx = cu_seqlens[i]
q_i = q[:, start_idx:end_idx]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A dumb question: How does this fix the tensor slicing issue? It seems that you simply put the whole SDPA into a separate function but the code remains identical?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is when torch.compile tries to trace this code - by wrapping it in a custom_op and calling that, the internals of the function are not traced (so the tracing bug does not trigger)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it is not just that we move it into a function, but that we also leverage the custom_op mechanism here to make it opaque

@Lucaskabela
Copy link
Contributor Author

@ywang96 @ProExpertProg @zou3519 for review

@Lucaskabela Lucaskabela changed the title [Qwen][Multimodal] Move Qwen2_5_vl sdpa to custom op until tensor slicing supported [Bugfix][Qwen][Multimodal] Move Qwen2_5_vl sdpa to custom op until tensor slicing supported and reenable compile Oct 30, 2025
@Lucaskabela Lucaskabela changed the title [Bugfix][Qwen][Multimodal] Move Qwen2_5_vl sdpa to custom op until tensor slicing supported and reenable compile [Bugfix][Qwen][Multimodal] Move Qwen2_5_vl sdpa to custom op and reenable compile Oct 30, 2025
@ProExpertProg
Copy link
Collaborator

This was because a fix for tracing SliceVariables containing single element tensors is not yet landed (but should be in Pytorch2.10)

Could we add this fix to the 2.9.1 list?

Copy link
Collaborator

@ProExpertProg ProExpertProg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ProExpertProg ProExpertProg added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 30, 2025
@Lucaskabela
Copy link
Contributor Author

Lucaskabela commented Oct 30, 2025

This was because a fix for tracing SliceVariables containing single element tensors is not yet landed (but should be in Pytorch2.10)

Could we add this fix to the 2.9.1 list?

The particular fix is from @laithsakka - see pytorch/pytorch#165074

Let me check on the ability to cherry pick this; it falls in a gray area since it is improving traceability but not necessarily fixing a critical bug introduced by 2.9 AFAIK so it may be out of scope

@simon-mo simon-mo merged commit 55011ae into vllm-project:main Nov 3, 2025
55 checks passed
zhaozuy pushed a commit to zhaozuy/vllm that referenced this pull request Nov 4, 2025
…able compile (vllm-project#27764)

Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
omerpaz95 pushed a commit to omerpaz95/vllm that referenced this pull request Nov 4, 2025
…able compile (vllm-project#27764)

Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants