[V0.9.1] Add support for flashcomm_v1 in Qwen2.5 #1745
This pull request has conflicts, please resolve those before we can evaluate the pull request.
vllm_ascend/__init__.py
Outdated
@@ -27,5 +27,10 @@ def register_model():
# is upgraded to 2.7.0
import vllm_ascend.patch.worker.patch_common.patch_utils # noqa: F401

from .utils import vllm_version_is
# Import specific patches for different versions
if vllm_version_is("0.9.1"):
I don't think it is necessary to check the vLLM version, because the 0.9.1-dev branch is only compatible with vLLM 0.9.1.
You are right and this change would be better.
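To illustrate the point in this thread, here is a minimal sketch of why the guard is redundant on a branch pinned to a single vLLM version. The `vllm_version_is` below is a hypothetical stand-in that takes the installed version as an argument (the helper in the diff takes only the target), and the patch names are placeholders, not modules from the repository.

```python
def vllm_version_is(target: str, installed: str) -> bool:
    # Hypothetical stand-in: exact string match against the installed version.
    return installed == target


def select_patches_guarded(installed: str) -> list:
    # Mirrors the shape of the reviewed diff: the extra patch is gated on the version.
    patches = ["patch_utils"]
    if vllm_version_is("0.9.1", installed):
        patches.append("flashcomm_patch")  # placeholder name (assumption)
    return patches


def select_patches_unguarded() -> list:
    # What the reviewer suggests: import the patch unconditionally.
    return ["patch_utils", "flashcomm_patch"]


# On a branch that only supports vLLM 0.9.1, both variants select the same patches.
same_on_pinned_branch = select_patches_guarded("0.9.1") == select_patches_unguarded()
```

The guard only changes behavior when the installed version differs from 0.9.1, which the branch does not support in the first place.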
vllm_ascend/envs.py
Outdated
@@ -106,6 +106,8 @@
"VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE":
lambda: bool(int(os.getenv("VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE", '0'))
),
"VLLM_ENABLE_FlashComm":
Suggested change:
- "VLLM_ENABLE_FlashComm":
+ "VLLM_ASCEND_ENABLE_FLASHCOMM":
Thanks, I have changed it.
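For context, a minimal sketch of how the renamed flag could sit in an environment-variable registry of the kind shown in the diff. The dictionary name `env_variables` and the helper `read_flag` are assumptions for illustration, and the value expression for the FlashComm entry is not visible in the hunk, so it simply mirrors the neighbouring entry here.

```python
import os
from typing import Any, Callable, Dict

# Each entry maps a variable name to a zero-argument callable, so the
# environment is read lazily at lookup time rather than at import time.
env_variables: Dict[str, Callable[[], Any]] = {
    "VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE":
    lambda: bool(int(os.getenv("VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE", "0"))),
    # Assumed to follow the same "0"/"1" convention as the entry above.
    "VLLM_ASCEND_ENABLE_FLASHCOMM":
    lambda: bool(int(os.getenv("VLLM_ASCEND_ENABLE_FLASHCOMM", "0"))),
}


def read_flag(name: str) -> bool:
    # Hypothetical helper: evaluate the registered callable for `name`.
    return env_variables[name]()
```

With this shape, `read_flag("VLLM_ASCEND_ENABLE_FLASHCOMM")` is false unless the variable is set to a non-zero integer, and the `VLLM_ASCEND_` prefix keeps the flag consistent with the other plugin-specific variables in the file.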
Do you have any performance data to share?
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
Add support for flashcomm_v1 in Qwen2.5.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
**① Functional Testing**: CI passed with the existing tests, and a new test was added in tests/multicard/test_offline_inference_distributed.py.

**② Accuracy Testing**: Using offline inference, compare the model outputs with the FlashComm v1 feature disabled and enabled, as shown in the figures below:
- disabling
  <img width="1543" height="358" alt="image" src="https://github.com/user-attachments/assets/f7fab4e3-c3d1-412a-958e-11e2b9ec8f58" />
- enabling
  <img width="1541" height="531" alt="image" src="https://github.com/user-attachments/assets/11a2c5bf-22f0-4a63-b76d-c7b7575397be" />

**③ Performance Stress Testing**: Here is the comparison of TTFT, based on QwQ-32B-BF16 with input_len=16K~32K, output_len=8K, and max_concurrency=16:
- disabling
  - Mean TTFT (ms): 1419.58
  - Median TTFT (ms): 1073.32
  - P99 TTFT (ms): 9549.34
- enabling
  - Mean TTFT (ms): 1322.36
  - Median TTFT (ms): 1006.09
  - P99 TTFT (ms): 8268.28

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
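
To put the TTFT numbers above in relative terms, the following sketch computes the percentage reduction for each statistic. It uses only the figures reported in this PR; the variable and function names are illustrative.

```python
# (disabled, enabled) TTFT in milliseconds, copied from the PR description.
ttft_ms = {
    "mean": (1419.58, 1322.36),
    "median": (1073.32, 1006.09),
    "p99": (9549.34, 8268.28),
}


def reduction_pct(disabled: float, enabled: float) -> float:
    # Relative reduction of the enabled run versus the disabled baseline.
    return (disabled - enabled) / disabled * 100.0


reductions = {name: round(reduction_pct(*pair), 2) for name, pair in ttft_ms.items()}

for name, pct in reductions.items():
    print(f"{name} TTFT reduction: {pct}%")
```

With these inputs the reduction is about 6.8% for the mean, about 6.3% for the median, and about 13.4% for P99, so the largest relative gain in this single run is at the tail.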