
Conversation


@ApsarasX ApsarasX commented May 13, 2025

What this PR does / why we need it?

Splitting tokens in fused_experts

With --max-model-len=32768 on DeepSeek-R1-W8A8, the fused_experts function consumes about 5.75 GB of memory. Splitting the call into multiple executions along the token dimension reduces its memory consumption to about 1.2 GB, which frees more memory for the KV cache.

The drawback is that when a request's prompt token count exceeds VLLM_FUSED_EXPERTS_SEQ_SPLIT_LENGTH, an extra concat operator is needed to stitch the partial outputs back together (see the sketch below).

However, since in most scenarios a request is shorter than 8192 tokens, we believe this overhead is acceptable.

Avoiding an unused tensor

self.inputs_embeds in the V1 NPUModelRunner is always allocated, but it is only used for multi-modal models, so I made its allocation conditional to reduce memory usage. A sketch of the change follows.

Does this PR introduce any user-facing change?

No

How was this patch tested?

@ApsarasX ApsarasX force-pushed the wengang/memory-optimization branch 6 times, most recently from 27ec0bd to 1516602 Compare May 15, 2025 15:32
@ApsarasX ApsarasX changed the title [Perf] Reduce memory usage by splitting tokens in fused_experts [Perf] Reduce memory usage by splitting tokens in fused_experts and avoiding unused tensor May 15, 2025
@ApsarasX ApsarasX force-pushed the wengang/memory-optimization branch from 1516602 to 276a46d Compare May 16, 2025 04:40
@ApsarasX ApsarasX closed this May 16, 2025
@ApsarasX ApsarasX force-pushed the wengang/memory-optimization branch from 276a46d to f8a0e4b Compare May 16, 2025 06:05
@ApsarasX ApsarasX reopened this May 16, 2025
@ApsarasX ApsarasX force-pushed the wengang/memory-optimization branch from f8a0e4b to c323f91 Compare May 16, 2025 15:38
ApsarasX added 2 commits May 18, 2025 12:07
Signed-off-by: ApsarasX <apsarax@outlook.com>
Signed-off-by: ApsarasX <apsarax@outlook.com>
@ApsarasX ApsarasX force-pushed the wengang/memory-optimization branch from c323f91 to ab355ab Compare May 18, 2025 12:07
@ApsarasX ApsarasX added the ready (read for review) label May 19, 2025
topk_ids_list = topk_ids.split(VLLM_FUSED_EXPERTS_SEQ_SPLIT_LENGTH)
final_hidden_states_list = []
for i in range(len(x_list)):
    final_hidden_states = fused_experts(
Collaborator
Do you have any assessment of the performance impact of this PR? It might cause a substantial performance regression, since this change may lead to repeated weight accesses from HBM.

Collaborator

Maybe the env variable VLLM_FUSED_EXPERTS_SEQ_SPLIT_LENGTH can be changed to something like ENABLE_FUSED_EXPERTS_SEQ_SPLIT to keep backward compatibility.


ApsarasX commented May 27, 2025

The optimizations in this PR are specific to the A2 machine and conflict with the alltoall-based optimizations you're using on the A3 machine. Therefore, I will close this PR for now and reopen it later if needed.

@ApsarasX ApsarasX closed this May 27, 2025