
Conversation

ttanzhiqiang (Contributor) commented on Jun 6, 2025

What this PR does / why we need it?

Best performance for single-machine, 16-card DeepSeek-R1: attention TP8/DP2 + MoE ETP.
[Screenshot: 2025-06-06 11:32:37]
vllm-ascend commit id:da9acfca6053352730fce75fb772e214755d0341
vllm commit id:b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc

Does this PR introduce any user-facing change?

How was this patch tested?

bash run_dp_attention_etp16.sh
bash run_dp_attention_etp16_benmark.sh &> output_etp16.log
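
For orientation, here is a minimal sketch of the kind of launch the scripts above drive, assuming upstream vLLM's standard `vllm serve` flags; the model path, port, and sequence length are placeholders, and the vllm-ascend-specific MoE ETP settings are handled inside `run_dp_attention_etp16.sh` rather than shown here:

```bash
#!/usr/bin/env bash
# Hedged sketch only: flag names follow upstream vLLM's `vllm serve` CLI;
# the real settings live in run_dp_attention_etp16.sh in this repo.
# Attention runs as TP8 within each DP group, with DP2 across 16 cards (8 x 2 = 16).
# MoE expert tensor parallelism (ETP) is configured through vllm-ascend's
# additional settings in the script and is omitted from this sketch.
MODEL=/path/to/DeepSeek-R1   # placeholder model path
vllm serve "$MODEL" \
  --tensor-parallel-size 8 \
  --data-parallel-size 2 \
  --trust-remote-code \
  --max-model-len 16384 \
  --port 8000
```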

Signed-off-by: ttanzhiqiang <389825161@qq.com>
ttanzhiqiang (Contributor, Author) commented:
Depends on [Reduce _npu_flash_attention mask to 128x128 for memory savings] and [Reduce memory usage by splitting tokens in fused_experts]. @ApsarasX

ttanzhiqiang (Contributor, Author) commented:
Signed-off-by: ttanzhiqiang <389825161@qq.com>
ApsarasX added the `ready` label (read for review) on Jun 9, 2025
wangxiyuan merged commit 980cd81 into vllm-project:main on Jun 11, 2025
11 checks passed
jianzs pushed a commit that referenced this pull request Jun 15, 2025
### What this PR does / why we need it?
W_UV/W_UK_T cannot be converted to the NZ format, because at this point they are fused into TransposeBatchMatMul, which does not support NZ; the weights are effectively converted back to ND on every run.

### Does this PR introduce _any_ user-facing change?
Using #1098 as the baseline, p90 TPOT drops from 90.79 ms to 88.58 ms, a roughly 2 ms TPOT improvement (see the benchmark sketch after this entry).

### How was this patch tested?
Tested with #1101.

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
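
For context, a p90 TPOT figure like the one above is typically collected against a running server with vLLM's serving benchmark. The sketch below is a hedged illustration: the script path and flags are assumed from upstream vLLM's benchmarks/benchmark_serving.py, and the request counts and input/output lengths are illustrative, not values from this PR.

```bash
# Hedged sketch: flag names are assumed from upstream vLLM's
# benchmarks/benchmark_serving.py; adjust to your checkout and workload.
# The model name/path must match what the server was launched with.
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model /path/to/DeepSeek-R1 \
  --host 127.0.0.1 --port 8000 \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 512 \
  --num-prompts 200 \
  --request-rate 4 \
  --percentile-metrics ttft,tpot,itl \
  --metric-percentiles 90,99
```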
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
### What this PR does / why we need it?
Best performance for single-machine, 16-card DeepSeek-R1: attention TP8/DP2 + MoE ETP.

Relies on:
- vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
- vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
- vllm-project#910
- [Reduce _npu_flash_attention mask to 128x128 for memory savings]
- vllm-project#1100 [Reduce memory usage by splitting tokens in fused_experts]


---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
### What this PR does / why we need it?
Best performance for single-machine, 16-card DeepSeek-R1: attention TP8/DP2 + MoE ETP.

Relies on:
- vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
- vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
- vllm-project#910
- [Reduce _npu_flash_attention mask to 128x128 for memory savings]
- vllm-project#1100 [Reduce memory usage by splitting tokens in fused_experts]

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
shiyuan680 pushed a commit to raindaywhu/vllm-ascend that referenced this pull request Jul 7, 2025
### What this PR does / why we need it?
Best performance for single-machine, 16-card DeepSeek-R1: attention TP8/DP2 + MoE ETP.

Relies on:
- vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
- vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
- vllm-project#910
- [Reduce _npu_flash_attention mask to 128x128 for memory savings]
- vllm-project#1100 [Reduce memory usage by splitting tokens in fused_experts]


---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
shiyuan680 pushed a commit to raindaywhu/vllm-ascend that referenced this pull request Jul 7, 2025
…t#1131)

W_UV/W_UK_T cannot be converted to the NZ format, because at this point they are fused into TransposeBatchMatMul, which does not support NZ; the weights are effectively converted back to ND on every run.

Using vllm-project#1098 as the baseline, p90 TPOT drops from 90.79 ms to 88.58 ms, a roughly 2 ms TPOT improvement.

Tested with vllm-project#1101.

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>