
Conversation


@hsliuustc0106 commented Sep 2, 2025

What this PR does / why we need it?

This PR is associated with #2607, which enables data parallelism (DP) for the ViT in Qwen-2.5-VL.

There are several reasons to run the ViT with DP rather than TP (see the sketch after this list):

- The ViT is a small model; the all-reduce that TP requires costs more than the speedup TP provides.
- The ViT is not captured in CUDA graphs or the torch.compile graph, so its kernel-launch and all-reduce overheads are higher.
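The sketch below is a single-process illustration of the idea, not the PR's actual code: with --mm-encoder-tp-mode data, every rank keeps a full, unsharded copy of the ViT and processes a slice of the image batch, so no per-layer all-reduce is needed; in a real multi-rank run the per-rank outputs would be combined with an all_gather.

import torch
import torch.nn as nn

tp_size = 4                                # TP world size of the LLM backbone
images = torch.randn(8, 3, 224, 224)       # a batch of 8 images

# Stand-in for the ViT: any module whose weights stay whole on every rank.
vit = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 64))

# DP path: shard the *batch*, not the weights; each "rank" handles one shard.
shards = torch.chunk(images, tp_size, dim=0)
embeddings = torch.cat([vit(s) for s in shards])   # all_gather in the multi-rank run

print(embeddings.shape)   # torch.Size([8, 64])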

Does this PR introduce any user-facing change?

Adds the --mm-encoder-tp-mode argument, whose data option selects data parallelism for the multimodal encoder. Below is an example that runs the ViT with DP while keeping TP=4 for the LLM backbone:

vllm serve \
    /workspace/models/Qwen2.5-VL-3B-Instruct \
    --port 5580 --host 0.0.0.0 \
    --max-num-seqs 128 --dtype bfloat16 --max-model-len=8192 \
    --no-enable-prefix-caching --trust-remote-code -tp 4 \
    --allowed-local-media-path /workspace/l00807937/ \
    --gpu-memory-utilization=0.93 \
    --enforce-eager \
    --mm-encoder-tp-mode data
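
Once the server is up, nothing changes on the client side: requests go through the usual OpenAI-compatible endpoint. A minimal client example (the image URL and prompt are placeholders, not from this PR):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5580/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="/workspace/models/Qwen2.5-VL-3B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/demo.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
print(resp.choices[0].message.content)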

How was this patch tested?

vllm: 0.10.0RC1
vllm-ascend: 0.10.0RC1

Benchmark test


1. TP=4 Case
**Test Plan**
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
vllm serve \
    /workspace/models/Qwen2.5-VL-3B-Instruct \
    --port 5580 --host 0.0.0.0 \
    --max-num-seqs 128 --dtype bfloat16 --max-model-len=8192 \
    --no-enable-prefix-caching --trust-remote-code -tp 4 \
    --allowed-local-media-path /workspace/l00807937/ \
    --gpu-memory-utilization=0.93 \
    --enforce-eager \
    --mm-encoder-tp-mode data

**Test Result**
baseline: without --mm-encoder-tp-mode data
============ Serving Benchmark Result ============
Successful requests:                     99        
Benchmark duration (s):                  28.79     
Total input tokens:                      9959      
Total generated tokens:                  10707     
Request throughput (req/s):              3.44      
Output token throughput (tok/s):         371.96    
Total Token throughput (tok/s):          717.94    
---------------Time to First Token----------------
Mean TTFT (ms):                          7711.37   
Median TTFT (ms):                        6832.34   
P99 TTFT (ms):                           17305.82  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          177.03    
Median TPOT (ms):                        161.73    
P99 TPOT (ms):                           413.63    
---------------Inter-token Latency----------------
Mean ITL (ms):                           157.30    
Median ITL (ms):                         90.89     
P99 ITL (ms):                            640.97    
==================================================
DP4: with --mm-encoder-tp-mode data
============ Serving Benchmark Result ============
Successful requests:                     99        
Benchmark duration (s):                  25.67     
Total input tokens:                      9959      
Total generated tokens:                  10749     
Request throughput (req/s):              3.86      
Output token throughput (tok/s):         418.82    
Total Token throughput (tok/s):          806.85    
---------------Time to First Token----------------
Mean TTFT (ms):                          6393.85   
Median TTFT (ms):                        5437.94   
P99 TTFT (ms):                           14115.35  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          158.26    
Median TPOT (ms):                        150.12    
P99 TPOT (ms):                           346.36    
---------------Inter-token Latency----------------
Mean ITL (ms):                           140.90    
Median ITL (ms):                         90.94     
P99 ITL (ms):                            439.49    
==================================================

2. TP=2 Case
**Test Plan**
export ASCEND_RT_VISIBLE_DEVICES=0,1
vllm serve \
    /workspace/models/Qwen2.5-VL-3B-Instruct \
    --port 5580 --host 0.0.0.0 \
    --max-num-seqs 128 --dtype bfloat16 --max-model-len=8192 \
    --no-enable-prefix-caching --trust-remote-code -tp 2 \
    --allowed-local-media-path /workspace/l00807937/ \
    --gpu-memory-utilization=0.93 \
    --enforce-eager \
    --mm-encoder-tp-mode data

**Test Result**
baseline: without --mm-encoder-tp-mode data
============ Serving Benchmark Result ============
Successful requests:                     99        
Benchmark duration (s):                  31.23     
Total input tokens:                      9959      
Total generated tokens:                  10732     
Request throughput (req/s):              3.17      
Output token throughput (tok/s):         343.69    
Total Token throughput (tok/s):          662.63    
---------------Time to First Token----------------
Mean TTFT (ms):                          8679.98   
Median TTFT (ms):                        7558.25   
P99 TTFT (ms):                           19444.89  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          188.91    
Median TPOT (ms):                        180.24    
P99 TPOT (ms):                           464.39    
---------------Inter-token Latency----------------
Mean ITL (ms):                           168.80    
Median ITL (ms):                         92.44     
P99 ITL (ms):                            725.62    
==================================================
DP2: with --mm-encoder-tp-mode data
============ Serving Benchmark Result ============
Successful requests:                     99        
Benchmark duration (s):                  27.18     
Total input tokens:                      9959      
Total generated tokens:                  10707     
Request throughput (req/s):              3.64      
Output token throughput (tok/s):         393.87    
Total Token throughput (tok/s):          760.23    
---------------Time to First Token----------------
Mean TTFT (ms):                          6903.44   
Median TTFT (ms):                        5630.95   
P99 TTFT (ms):                           15328.67  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          168.99    
Median TPOT (ms):                        158.29    
P99 TPOT (ms):                           372.06    
---------------Inter-token Latency----------------
Mean ITL (ms):                           150.38    
Median ITL (ms):                         94.54     
P99 ITL (ms):                            471.63    
==================================================
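
Net effect: with --mm-encoder-tp-mode data, total token throughput improves by ~12% at TP=4 (717.94 → 806.85 tok/s) and ~15% at TP=2 (662.63 → 760.23 tok/s), while mean TTFT drops by ~17% and ~20% respectively.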


- vLLM version: main
- vLLM main: https://github.com/vllm-project/vllm/commit/267c80d31f6b77092a5d5903da64556ac15c4d4d

Junhong and others added 5 commits September 2, 2025 11:09
Signed-off-by: Junhong <liujunhong11@huawei.com>

github-actions bot commented Sep 2, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables Data Parallelism (DP) for the Vision Transformer (ViT) in Qwen-2.5-VL, which can improve performance for smaller models by avoiding tensor parallelism overhead. The changes introduce a use_data_parallel flag and a new execution path for DP. My review found a critical issue in the implementation where an incorrect attribute access would lead to a runtime error. I've provided a code suggestion to fix this.

Comment on lines 398 to 403
def _normalize_grid_thw(self, grid_thw: Union[torch.Tensor, list[list[int]]]) -> torch.Tensor:
    if isinstance(grid_thw, list):
        grid_thw = torch.tensor(grid_thw, device=self.device)
    elif not isinstance(grid_thw, torch.Tensor):
        raise TypeError(f"Expected input type is torch.Tensor or list of lists, got {type(grid_thw)}")
    return grid_thw

critical

torch.nn.Module does not have a .device attribute, so calling self.device will raise an AttributeError at runtime. A more robust approach to get the module's device is to inspect one of its parameters, for example, by using next(self.parameters()).device.

Suggested change
def _normalize_grid_thw(self, grid_thw: Union[torch.Tensor, list[list[int]]]) -> torch.Tensor:
    if isinstance(grid_thw, list):
        grid_thw = torch.tensor(grid_thw, device=self.device)
    elif not isinstance(grid_thw, torch.Tensor):
        raise TypeError(f"Expected input type is torch.Tensor or list of lists, got {type(grid_thw)}")
    return grid_thw

def _normalize_grid_thw(self, grid_thw: Union[torch.Tensor, list[list[int]]]) -> torch.Tensor:
    if isinstance(grid_thw, list):
        device = next(self.parameters()).device
        grid_thw = torch.tensor(grid_thw, device=device)
    elif not isinstance(grid_thw, torch.Tensor):
        raise TypeError(f"Expected input type is torch.Tensor or list of lists, got {type(grid_thw)}")
    return grid_thw
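
A quick way to verify the reviewer's point in isolation (illustrative, not part of the PR): nn.Module defines no .device attribute, while its parameters always carry one.

import torch
import torch.nn as nn

m = nn.Linear(4, 4)
try:
    m.device                                  # __getattr__ raises AttributeError
except AttributeError as e:
    print(e)                                  # 'Linear' object has no attribute 'device'

device = next(m.parameters()).device          # robust way to get the module's device
grid_thw = torch.tensor([[1, 16, 16]], device=device)
print(grid_thw.device)                        # cpu here; npu/cuda once the module is moved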

Junhong and others added 6 commits September 3, 2025 09:31
Signed-off-by: Junhong <liujunhong11@huawei.com>
Issue 2607 fix bug in test

This pull request has conflicts, please resolve those before we can evaluate the pull request.
