[Attention] add DCP support for FLASH_ATTN_MLA backend #24453
Conversation
LGTM
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/v1/worker/gpu_model_runner.py
I believe the assert cannot be removed yet; some modifications to the MLA attention kernels are necessary. This is because each query token may have a different seqlen_k on different DCP ranks.
For example, with dcp=2 and query_len=2 (denote the tokens as AB), if we treat this as a decode request (a sketch of the per-rank calculation follows this list):
- The KV cache for key k_A is stored on DCP rank 0, and k_B on DCP rank 1.
- On DCP rank 0, both q_A and q_B should have seqlen_k = 1.
- However, on DCP rank 1, q_A should have seqlen_k = 0, and q_B should have seqlen_k = 1.
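A minimal sketch of this per-rank seqlen_k calculation (a hypothetical helper for illustration, not vLLM's metadata code), assuming round-robin placement where the KV of global token g lives on rank g % dcp_size:

```python
def per_rank_seqlen_k(seq_len: int, query_len: int,
                      dcp_size: int, rank: int) -> list[int]:
    # Hypothetical helper, illustration only (not a vLLM API).
    seqlens = []
    for i in range(query_len):
        # Number of KV tokens this query token may attend to globally (causal).
        visible = seq_len - query_len + i + 1
        # How many of those tokens live on this DCP rank under round-robin
        # placement (global token g -> rank g % dcp_size).
        on_rank = 0 if visible <= rank else (visible - rank + dcp_size - 1) // dcp_size
        seqlens.append(on_rank)
    return seqlens

# dcp=2, query_len=2 ("AB"), no prior context, so seq_len == 2:
print(per_rank_seqlen_k(seq_len=2, query_len=2, dcp_size=2, rank=0))  # [1, 1]
print(per_rank_seqlen_k(seq_len=2, query_len=2, dcp_size=2, rank=1))  # [0, 1]
```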
You already handle this:
Am I missing something?
https://github.com/vllm-project/vllm/blob/main/vllm/v1/attention/backends/mla/common.py#L670-L673 just re-corrects seqlens_k for DCP decode, which is trivial under query_len=1. But once query_len>1, the situation changes.
https://github.com/vllm-project/vllm/blob/main/vllm/v1/attention/backends/mla/common.py#L670-L673 corrects anything that's classified as "decode" by

vllm/v1/attention/backends/mla/common.py, lines 666 to 667 in bba1042:
    split_decodes_and_prefills(common_attn_metadata,
                               decode_threshold=self.reorder_batch_threshold)

i.e. anything with q_len <= reorder_batch_threshold, so I still fail to see the issue?
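For reference, an illustrative sketch of that split (not the actual vLLM implementation), where any request with q_len <= decode_threshold is routed down the decode path:

```python
def split_decodes_and_prefills_sketch(query_lens: list[int],
                                      decode_threshold: int = 1):
    # Illustration only: requests with query length <= decode_threshold are
    # treated as "decodes"; everything else goes down the prefill path.
    decodes = [i for i, q in enumerate(query_lens) if q <= decode_threshold]
    prefills = [i for i, q in enumerate(query_lens) if q > decode_threshold]
    return decodes, prefills

# With reorder_batch_threshold > 1, a multi-token request (q_len=2) is still
# classified as a decode, which is exactly the case discussed above.
print(split_decodes_and_prefills_sketch([1, 2, 8], decode_threshold=2))  # ([0, 1], [2])
```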
When q_len == 1, num_reqs == num_tokens.
Sorry, I understand the issue now; we need a special causal mask for q_len > 1, i.e.:
Normal:
k_toks > 0 1 2 3 4 5
q_toks v _____________
0 | 1 1 1
1 | 1 1 1 1
2 | 1 1 1 1 1
3 | 1 1 1 1 1 1
DCP Rank 0:
k_toks > 0 2 4
q_toks v _______
0 | 1 1
1 | 1 1
2 | 1 1 1
3 | 1 1 1
DCP Rank 1:
k_toks > 1 3 5
q_toks v ______
0 | 1
1 | 1 1
2 | 1 1
3 | 1 1 1
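A small sketch (illustration only, not the FA3 kernel change; dcp_causal_mask is a made-up helper) that reproduces the masks above by intersecting the global causal mask with the KV columns stored on each rank under interleaved placement; num_context=2 is read off the "Normal" diagram, where q_tok 0 attends to three keys:

```python
import torch

def dcp_causal_mask(num_context: int, num_query: int,
                    dcp_size: int, rank: int) -> torch.Tensor:
    total_k = num_context + num_query
    # Global causal mask: query i attends to k tokens 0 .. num_context + i.
    q_idx = torch.arange(num_query).unsqueeze(1)   # (Q, 1)
    k_idx = torch.arange(total_k).unsqueeze(0)     # (1, K)
    causal = k_idx <= (num_context + q_idx)        # (Q, K)
    # Keep only the k columns stored on this rank (token g lives on rank g % dcp_size).
    local_cols = (torch.arange(total_k) % dcp_size) == rank
    return causal[:, local_cols]

print(dcp_causal_mask(num_context=2, num_query=4, dcp_size=2, rank=0).int())
# [[1, 1, 0],
#  [1, 1, 0],
#  [1, 1, 1],
#  [1, 1, 1]]
print(dcp_causal_mask(num_context=2, num_query=4, dcp_size=2, rank=1).int())
# [[1, 0, 0],
#  [1, 1, 0],
#  [1, 1, 0],
#  [1, 1, 1]]
```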
Apologies for the oversight on my side 🤦; because of working on #22789 I'm not used to thinking about interleaved token distribution, since that approach distributes contiguous blocks of tokens (full pages). Good catch! 🙏
I will add support for this mask in FA3 so we can combine DCP and FA3 (we should do the same for FlashMLA); in the meantime I'll make reorder_batch_threshold == 1 when DCP is turned on 👍
cc @MatthewBonanni who might have bandwidth before I do 👍
@LucasWilkinson @MatthewBonanni FYI #24864.
We could also separate MLA DCP decode into two stages when query_len > 1: run the context KV with causal=False and the query KV with causal=True. Then we don't need to hack a custom mask into all MLA backends. I don't know which option is better.
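For reference, a conceptual single-rank sketch of that two-stage idea in plain PyTorch (not an MLA backend implementation; all names here are hypothetical): the context part runs without a causal mask, the query-vs-query part runs causally, and the two partial results are merged via their log-sum-exp weights:

```python
import torch

def _attn(q, k, v, causal: bool):
    # q: (Q, d); k, v: (K, d). Returns the output (Q, d) and per-query LSE (Q,).
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.T) * scale                                    # (Q, K)
    if causal:
        Q, K = scores.shape
        # Query i may attend to the first K - Q + i + 1 keys.
        allowed = torch.arange(K) <= (K - Q + torch.arange(Q)).unsqueeze(1)
        scores = scores.masked_fill(~allowed, float("-inf"))
    lse = torch.logsumexp(scores, dim=-1)
    return torch.softmax(scores, dim=-1) @ v, lse

def two_stage_decode(q, k_ctx, v_ctx, k_new, v_new):
    # Stage 1: queries vs. the cached context KV, no causal mask needed.
    o_ctx, lse_ctx = _attn(q, k_ctx, v_ctx, causal=False)
    # Stage 2: queries vs. the new query tokens' own KV, causal.
    o_new, lse_new = _attn(q, k_new, v_new, causal=True)
    # Merge the two partial softmaxes using their log-sum-exp weights.
    w_ctx = torch.sigmoid(lse_ctx - lse_new).unsqueeze(-1)
    return w_ctx * o_ctx + (1.0 - w_ctx) * o_new

torch.manual_seed(0)
d, ctx, new = 16, 5, 2
q = torch.randn(new, d)
out = two_stage_decode(q, torch.randn(ctx, d), torch.randn(ctx, d),
                       torch.randn(new, d), torch.randn(new, d))
print(out.shape)  # torch.Size([2, 16])
```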
I expect to have PRs up today with the custom mask, so we'll at least have that option. The WIP flash attention PR is: vllm-project/flash-attention#92
CC @minosfuture
ah, thanks! let's discuss on slack
Purpose
Fix FA MLA return and add DCP support for FA MLA
Test Plan
lm_eval results
Test Result
passes