[V0.9.1] add support for flashcomm2 in qwen3 #1726
Conversation
This pull request has conflicts, please resolve those before we can evaluate the pull request.
vllm_ascend/envs.py
@@ -143,6 +143,8 @@
 # Batch MC2 in prefill: The number of tokens in each batch
 "VLLM_ASCEND_FUSED_MOE_MC2_CHUNK_SIZE":
 lambda: int(os.getenv("VLLM_ASCEND_FUSED_MOE_MC2_CHUNK_SIZE", "128")),
+"VLLM_ENABLE_FC":
+lambda: int(os.getenv("VLLM_ENABLE_FC", 0))
I think this default value would be better as '0' (a string, to match the other entries)?
vllm_ascend/models/qwen3.py
output = torch.empty(attn_output.shape,
                     dtype=attn_output.dtype,
                     device=attn_output.device)
dist.all_to_all_single(output, attn_output)
What is the comm_group of this operation? If you do this all_to_all over the whole world_size, what happens when we have pipeline parallelism?
And what's the purpose of this all_to_all here? Why introduce this additional communication?
> And what's the purpose of this all_to_all here? Why introduce this additional communication?
Linear+AllReduce is replaced with All2All+Linear+AllGather in FlashComm2.
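To see why the two patterns are equivalent, here is a minimal single-process NumPy sketch of the algebra (not the actual distributed implementation; `tp`, the shapes, and the random data are illustrative):

```python
import numpy as np

tp, T, H = 4, 8, 16                       # illustrative TP size, tokens, hidden dim
rng = np.random.default_rng(0)
x = rng.standard_normal((T, H))           # full attention output
W = rng.standard_normal((H, H))           # full o_proj weight

# Baseline (Linear + AllReduce): rank r holds the r-th column slice of x and
# the r-th row block of W; the AllReduce sums the per-rank partial products.
h = H // tp
partials = [x[:, r*h:(r+1)*h] @ W[r*h:(r+1)*h, :] for r in range(tp)]
y_allreduce = sum(partials)

# FlashComm2 (All2All + Linear + AllGather): the All2All reshards from the
# hidden dim to the token dim, each rank applies the full (replicated) W to
# its token slice, and the AllGather concatenates results along tokens.
t = T // tp
y_parts = [x[r*t:(r+1)*t, :] @ W for r in range(tp)]
y_flashcomm2 = np.concatenate(y_parts, axis=0)

assert np.allclose(y_allreduce, y_flashcomm2)  # both compute x @ W
```

Note the trade-off visible in the sketch: the FlashComm2 path needs the full `W` on every rank (the weight replication discussed below), while the baseline keeps `W` sharded.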
What's the point of this change? Does it save communication or computation? You just replicate copies of o_proj on each device, which increases both bandwidth pressure and memory allocation. And the all2all does not actually reduce the input data amount, right? You just shuffle the data across the TP ranks. So what's the point of these changes, do they really bring a performance boost?
> What's the point of this change? Does it save communication or computation? You just replicate copies of o_proj on each device, which increases both bandwidth pressure and memory allocation. And the all2all does not actually reduce the input data amount, right? You just shuffle the data across the TP ranks. So what's the point of these changes, do they really bring a performance boost?
This change saves communication because the input data size of the all2all is 1/tp_size of that of the allreduce. We get a net performance benefit as long as the communication savings outweigh the increased bandwidth pressure on the linear.
Refer to this link for more details: https://gitcode.com/ascend-tribe/ascend-inference-cluster/blob/main/FlashComm/ascend-inference-cluster-flashcomm2.md
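The 1/tp_size claim above can be checked with simple per-rank volume accounting (the TP size, token count, hidden size, and fp16 element width below are illustrative, not measurements from this PR):

```python
tp, T, H, elem_bytes = 8, 3000, 4096, 2   # illustrative TP size, tokens, hidden dim, fp16

# Linear + AllReduce: each rank feeds a full (T, H) partial sum into the AllReduce.
allreduce_in = T * H * elem_bytes

# All2All + Linear + AllGather: each rank only feeds its (T, H/tp) slice into the All2All.
all2all_in = T * (H // tp) * elem_bytes

# The all2all input is 1/tp of the allreduce input.
assert all2all_in * tp == allreduce_in
```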
Do you have any performance statistics for this PR?
Signed-off-by: David9857 <985700846@qq.com>
Here's the comparison of TTFT time when input_len=3000 and max_concurrency=20: [table attached in the PR]
Looks good, thanks for the explanation!
What this PR does / why we need it?
Support FlashComm v2 in qwen3, which can reduce latency at the prefill stage. Set VLLM_ENABLE_FC=1 and use eager mode to enable this feature.
Note: Enabling FlashComm in the decoding stage may increase latency, so it is recommended to use disaggregated prefilling and enable this feature in the prefill instance only!
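A possible launch sketch for a prefill instance (the model name is a placeholder; `--enforce-eager` is vLLM's flag for eager mode, and `VLLM_ENABLE_FC` is the environment variable added by this PR):

```shell
# prefill instance only: enable FlashComm2 and force eager mode
export VLLM_ENABLE_FC=1
vllm serve Qwen/Qwen3-8B --enforce-eager
```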
Does this PR introduce any user-facing change?
NA
How was this patch tested?
NA