
[V0.9.1] Add support for flashcomm_v1 in Qwen2.5 #1745


Merged

2 commits merged into vllm-project:v0.9.1-dev from pr-flashcommv1 on Jul 17, 2025

Conversation

@rjg-lyh (Contributor) commented on Jul 11, 2025

What this PR does / why we need it?

Add support for flashcomm_v1 in Qwen2.5.
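As background (a summary of the technique's general design, not code from this PR): FlashComm v1 targets tensor-parallel inference, where the usual pattern all-reduces the hidden states after a row-parallel linear layer and then every rank redundantly repeats the residual add and normalization on the full tensor. FlashComm v1 instead reduce-scatters, runs those elementwise ops on a per-rank shard, and all-gathers where the full sequence is needed again. Below is a minimal sketch of that idea using plain torch.distributed collectives; shapes and op placement are simplified assumptions, and it presumes an initialized process group and a token count divisible by tp_size:

```python
import torch
import torch.distributed as dist


def baseline_allreduce(x: torch.Tensor) -> torch.Tensor:
    # Baseline TP: every rank ends up holding, and re-processing, the
    # full [num_tokens, hidden] tensor after the collective.
    dist.all_reduce(x)
    return x


def flashcomm_v1_style(x: torch.Tensor, tp_size: int) -> torch.Tensor:
    # Reduce-scatter leaves each rank a [num_tokens / tp_size, hidden]
    # shard, so the following residual add / RMSNorm run once per token
    # instead of being repeated on all tp_size ranks.
    shard = torch.empty(x.shape[0] // tp_size, x.shape[1],
                        dtype=x.dtype, device=x.device)
    dist.reduce_scatter_tensor(shard, x)
    # ... sharded elementwise work happens here ...
    # All-gather restores the full sequence before ops that need it.
    full = torch.empty_like(x)
    dist.all_gather_into_tensor(full, shard)
    return full
```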

Does this PR introduce any user-facing change?

No.

How was this patch tested?

①Functional Testing: CI passed with the existing tests, and a new test was added in tests/multicard/test_offline_inference_distributed.py.
②Accuracy Testing: evaluated the difference in model outputs between enabling and disabling the FlashComm v1 feature, using offline inference. As shown in the figures below:

  • disabling: [screenshot of offline-inference outputs]
  • enabling: [screenshot of offline-inference outputs]

③Performance Stress Testing: here's the TTFT comparison based on QwQ-32B-BF16, with input_len=16K~32K, output_len=8K, and max_concurrency=16:

| TTFT (ms) | disabling | enabling |
|-----------|----------:|---------:|
| Mean      |   1419.58 |  1322.36 |
| Median    |   1073.32 |  1006.09 |
| P99       |   9549.34 |  8268.28 |
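Relative to the disabled baseline, that works out to roughly a 6.8% reduction in mean TTFT, a 6.3% reduction in median TTFT, and a 13.4% reduction in P99 TTFT.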

@rjg-lyh force-pushed the pr-flashcommv1 branch 3 times, most recently from 7465c21 to 30858dc on July 14, 2025 03:19
@rjg-lyh force-pushed the pr-flashcommv1 branch 2 times, most recently from 5a85a7f to 8237ec0 on July 14, 2025 03:43

This pull request has conflicts; please resolve those before we can evaluate the pull request.

@rjg-lyh force-pushed the pr-flashcommv1 branch 4 times, most recently from e026e51 to e19ed68 on July 15, 2025 06:45
```diff
@@ -27,5 +27,10 @@ def register_model():
     # is upgraded to 2.7.0
     import vllm_ascend.patch.worker.patch_common.patch_utils  # noqa: F401
 
+    from .utils import vllm_version_is
+    # Import specific patches for different versions
+    if vllm_version_is("0.9.1"):
```
Collaborator

I think it is not necessary to check the vLLM version, because the v0.9.1-dev branch is only compatible with vLLM 0.9.1.

@rjg-lyh (Contributor Author)

You are right; this change would be better.

```diff
@@ -106,6 +106,8 @@
     "VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE":
     lambda: bool(int(os.getenv("VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE", '0'))
     ),
+    "VLLM_ENABLE_FlashComm":
```
@MengqingCao (Collaborator) commented on Jul 15, 2025
Suggested change:

```diff
-    "VLLM_ENABLE_FlashComm":
+    "VLLM_ASCEND_ENABLE_FLASHCOMM":
```

@rjg-lyh (Contributor Author)

Thanks, I have changed it.
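For illustration, here's a hypothetical sketch of toggling the renamed switch in an offline run; the model name, parallel size, and prompt are placeholders. Per the lambda-based envs.py pattern in the diff above, the variable is read at access time, so setting it before engine construction suffices:

```python
import os

# Illustrative only: enable FlashComm v1 for this process.
os.environ["VLLM_ASCEND_ENABLE_FLASHCOMM"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=2)
outputs = llm.generate(["Briefly explain tensor parallelism."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```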

@rjg-lyh force-pushed the pr-flashcommv1 branch 5 times, most recently from 4e9d3e6 to 6b7da38 on July 15, 2025 10:48
@rjg-lyh force-pushed the pr-flashcommv1 branch 4 times, most recently from 20c459a to 31ba211 on July 16, 2025 02:32
@ApsarasX (Collaborator)

Do you have any performance data to share?

@rjg-lyh force-pushed the pr-flashcommv1 branch 3 times, most recently from 4e7f241 to a2b6dbb on July 16, 2025 03:58
@rjg-lyh force-pushed the pr-flashcommv1 branch 3 times, most recently from d192b46 to c617547 on July 16, 2025 06:12
Signed-off-by: rjg-lyh <1318825571@qq.com>
@rjg-lyh force-pushed the pr-flashcommv1 branch 4 times, most recently from 32bec8b to a294834 on July 16, 2025 08:14
@rjg-lyh (Contributor Author) commented on Jul 16, 2025

> Do you have any performance data to share?
Based on QwQ-32B-BF16, here's the TTFT comparison with input_len=16K~32K, output_len=8K, and max_concurrency=16:

| TTFT (ms) | origin  | with flashcomm1 |
|-----------|--------:|----------------:|
| Mean      | 1419.58 |         1322.36 |
| Median    | 1073.32 |         1006.09 |
| P99       | 9549.34 |         8268.28 |

@wangxiyuan merged commit b3d6e0c into vllm-project:v0.9.1-dev on Jul 17, 2025
16 checks passed
weijinqian0 pushed a commit to weijinqian0/vllm-ascend that referenced this pull request on Jul 18, 2025