
Conversation

@JC-ut0 JC-ut0 commented Aug 28, 2025

What this PR does / why we need it?

In the PD disaggregation scenario, the first token inferred after the D node receives the KV cache follows eager mode.

Fixes:
When running MTP with torchair graph mode under prefill-decode disaggregation, if all requests processed by the D node are requests just transferred from the P node, the torchair graph breaks.

Reason: During PD disaggregation, the P node transmits only the KV cache and the prompt to the D node, not the tokens it actually inferred (neither the main-model tokens nor the MTP tokens are sent). The D node therefore treats such a request as one without MTP tokens (seq_len=1).
The community does not hit graph-mode issues because its decode-phase attention uses seq_len=1 for every request in the batch.
We do, because our graph mode pads on the assumption of 2 tokens per request. When the batch mixes seq_len=1 and seq_len=2 requests, padding is appended at the end; but if every request received by the D node has seq_len=1, the batch cannot be padded normally under the attention FIA operator's constraints.
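As a toy illustration of the mismatch (made-up numbers, assuming one MTP speculative token per request):

```python
# Toy numbers only, not PR code: token counts the D node's graph mode sees.
num_reqs = 4
tokens_per_warm_req = 2   # 1 main-model token + 1 MTP token
tokens_per_fresh_req = 1  # request just handed over from the P node

# Mixed batch (3 warmed + 1 fresh): 3*2 + 1*1 = 7 tokens, which can be padded
# at the end up to the captured size of num_reqs * 2 = 8 tokens.
mixed_tokens = 3 * tokens_per_warm_req + 1 * tokens_per_fresh_req

# All-fresh batch: 4 * 1 = 4 tokens; per the description above, this case
# cannot be padded normally under the FIA operator constraints.
all_fresh_tokens = num_reqs * tokens_per_fresh_req
```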

Solution:

1. The KV consumer applies extra torchair graph padding to avoid breaking the FIA graph constraints (the approach implemented in this PR; see the sketch after this list).

2. The KV producer sends the correct tokens to the KV consumer, so our graph-mode constraints are never broken and all logic stays identical to mixed PD deployment. Since we use the community scheduler, this requires patching the vLLM scheduler, but it should theoretically perform better. (Maybe later.)
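A minimal sketch of option 1 only, not the actual vllm-ascend code: the function name, the KV-consumer flag, and the rule for deriving the extra ("redundant") captured batch size are assumptions made for illustration.

```python
from typing import List

TOKENS_PER_MTP_DECODE_REQUEST = 2  # 1 main-model token + 1 MTP token


def pad_graph_batch_sizes_for_kv_consumer(
        graph_batch_sizes: List[int],
        is_kv_consumer: bool) -> List[int]:
    """Register extra captured batch sizes so that a decode batch made
    entirely of seq_len=1 requests (fresh from the P node) can still be
    padded to a captured torchair graph instead of breaking the FIA
    constraints. Illustrative only."""
    if not is_kv_consumer:
        return list(graph_batch_sizes)
    sizes = set(graph_batch_sizes)
    for size in graph_batch_sizes:
        # A graph captured for `size` tokens assumes size // 2 requests at
        # 2 tokens each; if all of those requests arrive with seq_len=1,
        # only size // 2 tokens are present, so also capture a graph for
        # that smaller token count.
        sizes.add(max(1, size // TOKENS_PER_MTP_DECODE_REQUEST))
    return sorted(sizes)


# Example: captured sizes [4, 8, 16] gain 2 as an extra fallback size.
print(pad_graph_batch_sizes_for_kv_consumer([4, 8, 16], is_kv_consumer=True))
# -> [2, 4, 8, 16]
```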

Does this PR introduce any user-facing change?

How was this patch tested?

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug in deepseek_mtp speculative decoding with torchair graph mode. The change correctly adjusts torchair_graph_batch_sizes for prefill-decode disaggregation by adding a redundant batch size. My review includes a suggestion to improve code maintainability by replacing a magic number with a named constant, making the logic clearer and easier to update.
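A hypothetical before/after of that suggestion; the constant name and the value 2 (one main-model token plus one speculative MTP token per request) are assumptions, not the PR's actual code.

```python
# Before: a magic number buried in the padding logic.
#     graph_batch_sizes.append(max_num_reqs * 2)
# After: a named constant documents why the factor is 2.
NUM_TOKENS_PER_MTP_REQUEST = 2  # 1 main-model token + 1 speculative MTP token

max_num_reqs = 8
graph_batch_sizes = [4, 8]
graph_batch_sizes.append(max_num_reqs * NUM_TOKENS_PER_MTP_REQUEST)  # -> 16
```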

@JC-ut0 JC-ut0 force-pushed the mtp_torchair_fix branch 6 times, most recently from 5b8d528 to fd9f45a on August 28, 2025 at 14:58
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
@wangxiyuan wangxiyuan merged commit c223200 into vllm-project:v0.9.1-dev Aug 28, 2025
17 checks passed