
Commit e0716c5

[0.9.1][Fix] Fix DeepSeek OOM issue in extreme --gpu-memory-utilization scenario (#1829)
### What this PR does / why we need it?

Excessive padding for long input sequences can lead to out-of-memory errors. This PR optimizes the padding logic in `fused_moe.py` to eliminate unnecessary padding.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

To reproduce the issue, launch the DeepSeek model server with both data-parallel (DP) and tensor-parallel (TP) strategies enabled, then submit a request with a long input sequence.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
1 parent 39a0cb1 commit e0716c5

File tree

1 file changed (+10 −7 lines)


vllm_ascend/ops/fused_moe.py

Lines changed: 10 additions & 7 deletions
```diff
@@ -1286,15 +1286,18 @@ def forward(
         mc2_mask = forward_context.mc2_mask
         tp_size = get_tensor_model_parallel_world_size()
         if fused_moe_state != FusedMoEState.AllGather:
-            if num_tokens < forward_context.padded_num_tokens:
+            if fused_moe_state in {
+                    FusedMoEState.MC2, FusedMoEState.MC2_PREFILL
+            }:
+                padding_size = forward_context.padded_num_tokens
+            else:
+                # TODO: Determine if we can remove the padding
+                padding_size = tp_size
+            if num_tokens < padding_size:
                 hidden_states = nn.functional.pad(
-                    hidden_states,
-                    (0, 0, 0, forward_context.padded_num_tokens - num_tokens),
-                )
+                    hidden_states, (0, 0, 0, padding_size - num_tokens))
                 router_logits = nn.functional.pad(
-                    router_logits,
-                    (0, 0, 0, forward_context.padded_num_tokens - num_tokens),
-                )
+                    router_logits, (0, 0, 0, padding_size - num_tokens))
             if tp_size > 1:
                 chunk_hidden_states = torch.tensor_split(hidden_states,
                                                          tp_size,
```
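
For context, below is a minimal, self-contained sketch of the new padding-size selection, runnable outside of vllm-ascend. The function name `pad_for_moe`, the simplified `FusedMoEState` enum, and the toy tensor shapes are illustrative stand-ins, not the actual `vllm_ascend` API; the real code reads `padded_num_tokens` and `mc2_mask` from the forward context inside the `forward` method shown in the diff above.

```python
# Minimal sketch of the padding-size selection in this commit. All names and
# shapes here are illustrative assumptions, not the real vllm_ascend API.
from enum import Enum

import torch
import torch.nn as nn


class FusedMoEState(Enum):
    AllGather = 0
    MC2 = 1
    MC2_PREFILL = 2


def pad_for_moe(hidden_states: torch.Tensor,
                router_logits: torch.Tensor,
                fused_moe_state: FusedMoEState,
                padded_num_tokens: int,
                tp_size: int):
    """Pad the token dimension only as far as the current MoE state needs."""
    num_tokens = hidden_states.shape[0]
    if fused_moe_state != FusedMoEState.AllGather:
        if fused_moe_state in {FusedMoEState.MC2, FusedMoEState.MC2_PREFILL}:
            # MC2 paths still pad to the precomputed padded token count.
            padding_size = padded_num_tokens
        else:
            # Other paths only pad up to the TP world size, which is what
            # avoids the excessive padding for long input sequences.
            padding_size = tp_size
        if num_tokens < padding_size:
            # Pad rows (tokens) on the right; feature dims are untouched.
            hidden_states = nn.functional.pad(
                hidden_states, (0, 0, 0, padding_size - num_tokens))
            router_logits = nn.functional.pad(
                router_logits, (0, 0, 0, padding_size - num_tokens))
    return hidden_states, router_logits


if __name__ == "__main__":
    h = torch.randn(3, 16)      # 3 tokens, hidden size 16
    logits = torch.randn(3, 8)  # 8 experts
    h_pad, l_pad = pad_for_moe(h, logits, FusedMoEState.MC2,
                               padded_num_tokens=8, tp_size=4)
    print(h_pad.shape, l_pad.shape)  # torch.Size([8, 16]) torch.Size([8, 8])
```

With a non-MC2 state, the same call would pad only up to `tp_size` (4 rows here) rather than the full `padded_num_tokens`, which is the behavior change this commit introduces.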
