Merged
etp best a2 #1101
Conversation
Signed-off-by: ttanzhiqiang <389825161@qq.com>
This was referenced on Jun 6, 2025 by @ApsarasX:
- [Reduce _npu_flash_attention mask to 128x128 for memory savings]
- [Reduce memory usage by splitting tokens in fused_experts]
wangxiyuan approved these changes on Jun 11, 2025.
jianzs pushed a commit that referenced this pull request on Jun 15, 2025:

### What this PR does / why we need it?
W_UV/W_UK_T cannot be converted to nz format, because this position is fused into transposebatchmatmul, which does not support nz; the weights would otherwise be converted back to nd on every run.

### Does this PR introduce _any_ user-facing change?
Using #1098 as the baseline, p90 TPOT improves from 90.79ms to 88.58ms, a ~2ms TPOT improvement.

### How was this patch tested?
Tested with #1101.

---------
Signed-off-by: ttanzhiqiang <389825161@qq.com>
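For context, here is a minimal, hypothetical sketch of the nd/nz round trip that commit avoids. It assumes `torch_npu.npu_format_cast` and the ACL format codes (2 = ND, 29 = FRACTAL_NZ); it is not the actual vllm-ascend code path.

```python
# Hypothetical sketch: casting a weight between ND and FRACTAL_NZ on an NPU.
# Assumes torch_npu's npu_format_cast and ACL format IDs (2 = ND, 29 = FRACTAL_NZ);
# the real vllm-ascend call sites may differ.
import torch
import torch_npu  # noqa: F401  (patches torch with .npu() and NPU ops)

ACL_FORMAT_ND = 2
ACL_FORMAT_FRACTAL_NZ = 29

w_uk_t = torch.randn(128, 512, dtype=torch.float16).npu()

# Pre-converting the weight to NZ helps matmul kernels that prefer it...
w_nz = torch_npu.npu_format_cast(w_uk_t, ACL_FORMAT_FRACTAL_NZ)

# ...but transposebatchmatmul does not accept NZ, so the weight would be
# cast back to ND on every forward pass -- the wasted round trip that
# keeping W_UV/W_UK_T in ND from the start avoids.
w_nd = torch_npu.npu_format_cast(w_nz, ACL_FORMAT_ND)
```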
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request on Jun 17, 2025:

### What this PR does / why we need it?
Single machine, 16 cards, deepseekr1: attention (tp8/dp2) / moe (etp) best performance.

Relies on:
- vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
- vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
- vllm-project#910
- [Reduce _npu_flash_attention mask to 128x128 for memory savings] vllm-project#1100
- [Reduce memory usage by splitting tokens in fused_experts]

---------
Signed-off-by: ttanzhiqiang <389825161@qq.com>
shiyuan680 pushed a commit to raindaywhu/vllm-ascend that referenced this pull request on Jul 7, 2025, carrying the same commit message as above (Signed-off-by: ttanzhiqiang <389825161@qq.com>).
shiyuan680 pushed a commit to raindaywhu/vllm-ascend that referenced this pull request on Jul 7, 2025 (…t#1131):

W_UV/W_UK_T cannot be converted to nz format, because this position is fused into transposebatchmatmul, which does not support nz; the weights are actually converted back to nd on each run.

Using vllm-project#1098 as the baseline, p90 TPOT improves from 90.79ms to 88.58ms, a ~2ms TPOT improvement. Tested with vllm-project#1101.

---------
Signed-off-by: ttanzhiqiang <389825161@qq.com>
### What this PR does / why we need it?
Single machine, 16 cards, deepseekr1: attention (tp8/dp2) / moe (etp) best performance.

Relies on:
- vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
- vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
- vllm-project#910
- [Reduce _npu_flash_attention mask to 128x128 for memory savings] vllm-project#1100
- [Reduce memory usage by splitting tokens in fused_experts]
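As a rough illustration of this parallelism layout (not the launch scripts used in this PR), the sketch below expresses attention TP8/DP2 plus expert parallelism for MoE layers through vLLM's offline API. The parameter names `data_parallel_size` and `enable_expert_parallel` are assumptions based on recent vLLM engine arguments and may not match the setup here exactly.

```python
# Hypothetical sketch of the tp8/dp2 attention + ETP MoE layout on 16 NPUs.
# Parameter names are assumptions from vLLM engine args, not taken from
# the run_dp_attention_etp16.sh script in this PR.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",
    tensor_parallel_size=8,       # attention sharded 8 ways per DP replica
    data_parallel_size=2,         # two attention replicas -> 16 cards total
    enable_expert_parallel=True,  # MoE experts distributed across ranks
    trust_remote_code=True,
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```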
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?

```bash
bash run_dp_attention_etp16.sh
bash run_dp_attention_etp16_benmark.sh &> output_etp16.log
```
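Since the benchmark comparisons around this PR are reported as p90 TPOT (time per output token), here is a small, self-contained sketch of how such a percentile could be computed from per-request samples. The sample values and helper name are invented for illustration; the actual benchmark script's log format may differ.

```python
# Hypothetical sketch: computing p90 TPOT (time per output token, ms) from
# per-request samples using the nearest-rank percentile method.
import math

def p90_tpot_ms(tpots_ms: list[float]) -> float:
    """Return the 90th-percentile TPOT via nearest-rank on sorted samples."""
    if not tpots_ms:
        raise ValueError("no TPOT samples")
    ordered = sorted(tpots_ms)
    rank = math.ceil(0.9 * len(ordered)) - 1  # nearest-rank, 0-indexed
    return ordered[rank]

# Example: per-request TPOT samples in milliseconds (made up).
samples = [88.1, 90.7, 86.9, 91.3, 89.5, 87.2, 92.0, 88.8, 90.1, 89.0]
print(f"p90 TPOT: {p90_tpot_ms(samples):.2f} ms")
```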