
Conversation


@ApsarasX ApsarasX commented May 13, 2025

What this PR does / why we need it?

Splitting tokens in fused_experts

With --max-model-len=32768 on DeepSeek-R1-W8A8, the fused_experts function consumes about 5.75 GB of memory. Splitting the call into multiple executions along the token dimension reduces its memory consumption to about 1.2 GB, which frees more memory for the KV cache.

The drawback is that when a request's prompt token count exceeds VLLM_FUSED_EXPERTS_SEQ_SPLIT_LENGTH, an extra concat operator is needed to stitch the partial outputs back together (see the sketch below).

However, since in most scenarios a request is shorter than 8192 tokens, we believe this overhead is acceptable.

Avoiding an unused tensor

self.inputs_embeds in the V1 NPUModelRunner is always allocated, but it is only used for multi-modal models, so I made its allocation conditional to reduce memory usage. A sketch of the change follows.

Does this PR introduce any user-facing change?

No

How was this patch tested?

@ApsarasX ApsarasX force-pushed the wengang/memory-optimization branch 6 times, most recently from 27ec0bd to 1516602 Compare May 15, 2025 15:32
@ApsarasX ApsarasX changed the title [Perf] Reduce memory usage by splitting tokens in fused_experts [Perf] Reduce memory usage by splitting tokens in fused_experts and avoiding unused tensor May 15, 2025
@ApsarasX ApsarasX force-pushed the wengang/memory-optimization branch from 1516602 to 276a46d Compare May 16, 2025 04:40
@ApsarasX ApsarasX closed this May 16, 2025
@ApsarasX ApsarasX force-pushed the wengang/memory-optimization branch from 276a46d to f8a0e4b Compare May 16, 2025 06:05
@ApsarasX ApsarasX reopened this May 16, 2025
@ApsarasX ApsarasX force-pushed the wengang/memory-optimization branch from f8a0e4b to c323f91 Compare May 16, 2025 15:38
ApsarasX added 2 commits May 18, 2025 12:07
Signed-off-by: ApsarasX <apsarax@outlook.com>
Signed-off-by: ApsarasX <apsarax@outlook.com>
@ApsarasX ApsarasX force-pushed the wengang/memory-optimization branch from c323f91 to ab355ab Compare May 18, 2025 12:07
@ApsarasX ApsarasX added the ready (read for review) label May 19, 2025
topk_ids_list = topk_ids.split(VLLM_FUSED_EXPERTS_SEQ_SPLIT_LENGTH)
final_hidden_states_list = []
for i in range(len(x_list)):
    final_hidden_states = fused_experts(
Collaborator
Do you have any assessment of the performance impact of this PR? It might cause a substantial performance regression, since this change may lead to repeated weight accesses from HBM.

Collaborator

Maybe the env variable VLLM_FUSED_EXPERTS_SEQ_SPLIT_LENGTH can be changed to something like ENABLE_FUSED_EXPERTS_SEQ_SPLIT to keep backward compatibility.


ApsarasX commented May 27, 2025

The optimizations in this PR are specific to the A2 machine and conflict with the alltoall-based optimizations you're using on the A3 machine. Therefore, I will close this PR for now and reopen it later if needed.

@ApsarasX ApsarasX closed this May 27, 2025