[Feat][Graph] Support FULL_DECODE_ONLY mode for GQA/MHA models #2128
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

- If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Codecov Report

❌ Patch coverage is 70.21%. Your patch check has failed because the patch coverage (70.21%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files:

```
@@            Coverage Diff            @@
##             main    #2128   +/-   ##
=========================================
  Coverage        ?   76.12%
=========================================
  Files           ?      121
  Lines           ?    13607
  Branches        ?        0
=========================================
  Hits            ?    10358
  Misses          ?     3249
  Partials        ?        0
```

☔ View full report in Codecov by Sentry.
Adds support for full graph compilation modes, only `FULL_DECODE_ONLY` for now. This feature is experimental and aims to improve performance by capturing the entire model execution graph. Key changes include:

- Introducing a mechanism for attention backends to declare their ACLGraph support level.
- Automatically downgrading the graph compilation mode at runtime if the selected attention backend does not support the requested mode, providing clear warnings to the user.
- Wrapping the model with an `ACLGraphWrapper` when a full graph mode is active.
- Extending the graph capture logic to handle separate capture routines for prefill and decode steps.
- Adding a warning for users enabling full graph mode, highlighting its experimental nature and potential memory issues.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
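As a rough, self-contained sketch of the support-declaration and downgrade mechanism described above; the names `ACLGraphSupport`, `CUDAGraphMode`, and `resolve_graph_mode` are illustrative stand-ins, not the actual vllm-ascend APIs:

```python
import enum
import logging

logger = logging.getLogger(__name__)


class CUDAGraphMode(enum.IntEnum):
    NONE = 0
    PIECEWISE = 1
    FULL_DECODE_ONLY = 2
    FULL = 3


class ACLGraphSupport(enum.IntEnum):
    """Strongest graph compilation mode an attention backend claims to support."""
    NEVER = 0
    PIECEWISE = 1
    FULL_DECODE_ONLY = 2
    FULL = 3


class AscendAttentionBackend:
    # Each backend declares its ACLGraph support level as a class attribute.
    aclgraph_support = ACLGraphSupport.FULL_DECODE_ONLY


def resolve_graph_mode(requested: CUDAGraphMode, backend: type) -> CUDAGraphMode:
    """Downgrade the requested mode to what the backend supports, warning the user."""
    supported = CUDAGraphMode(int(backend.aclgraph_support))
    if requested > supported:
        logger.warning(
            "Attention backend %s does not support %s; falling back to %s.",
            backend.__name__, requested.name, supported.name)
        return supported
    return requested


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    mode = resolve_graph_mode(CUDAGraphMode.FULL, AscendAttentionBackend)
    print(mode.name)  # FULL_DECODE_ONLY after the automatic downgrade
```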
Introduces the ability to build dummy attention metadata during graph capture runs. This ensures that the attention mechanism is included in the captured graph, even when it would normally be skipped. A `force_attention` flag is added to dummy runs to trigger the creation of this metadata. A new `build_for_graph_capture` method is implemented in the attention metadata builder to construct the necessary metadata for the `DecodeOnly` state during graph compilation. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
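A minimal sketch of the `force_attention` dummy-run idea and a `build_for_graph_capture`-style builder, assuming simplified metadata types; the real Ascend builder has a richer signature and more fields:

```python
from dataclasses import dataclass
from enum import Enum, auto


class AscendAttentionState(Enum):
    PrefillOnly = auto()
    DecodeOnly = auto()


@dataclass
class DummyAttentionMetadata:
    attn_state: AscendAttentionState
    num_reqs: int
    seq_lens: list


class AttentionMetadataBuilder:
    def build_for_graph_capture(self, num_reqs: int) -> DummyAttentionMetadata:
        # No real requests exist during capture, so fabricate minimal
        # decode-shaped metadata (one token per request).
        return DummyAttentionMetadata(
            attn_state=AscendAttentionState.DecodeOnly,
            num_reqs=num_reqs,
            seq_lens=[1] * num_reqs,
        )


def dummy_run(builder: AttentionMetadataBuilder,
              num_reqs: int,
              force_attention: bool = False):
    # A plain dummy run skips attention metadata, which would leave the
    # attention op out of the captured graph; force_attention=True builds
    # dummy metadata so the op is captured as well.
    attn_metadata = (builder.build_for_graph_capture(num_reqs)
                     if force_attention else None)
    return attn_metadata


if __name__ == "__main__":
    print(dummy_run(AttentionMetadataBuilder(), num_reqs=4, force_attention=True))
```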
Introduces support for full graph mode with dynamic shapes for the paged attention operator on Ascend NPUs. This is achieved by:

- Capturing the paged attention operation into a graph task group during the graph capture phase.
- Introducing a mechanism to store graph-related parameters (handles, events, attention arguments).
- Adding a new `update_attn_params` method to update the `context_lens` argument of the captured paged attention operator on a separate stream before graph replay.
- Moving the `slot_mapping` tensor to the device to avoid mismatched buffers.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
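A minimal sketch of the graph-parameter bookkeeping and the `update_attn_params` idea, assuming simplified containers; the real implementation also drives torch_npu stream and captured-task-update APIs, which are omitted here:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class GraphParams:
    # One entry per captured graph, keyed by the padded decode batch size.
    handles: Dict[int, List[Any]] = field(default_factory=dict)
    events: Dict[int, List[Any]] = field(default_factory=dict)
    attn_params: Dict[int, List[Dict[str, Any]]] = field(default_factory=dict)


_graph_params: Optional[GraphParams] = None


def set_graph_params(params: GraphParams) -> None:
    global _graph_params
    _graph_params = params


def get_graph_params() -> Optional[GraphParams]:
    return _graph_params


def update_attn_params(batch_size: int, context_lens: List[int]) -> None:
    """Refresh the context_lens argument of each captured paged-attention call.

    In the real code this runs on a separate stream right before graph replay;
    only the bookkeeping side is shown here.
    """
    params = get_graph_params()
    assert params is not None, "graph params must be recorded during capture"
    for attn_args in params.attn_params.get(batch_size, []):
        attn_args["context_lens"] = context_lens


if __name__ == "__main__":
    set_graph_params(GraphParams(attn_params={8: [{"context_lens": [0] * 8}]}))
    update_attn_params(8, context_lens=[5, 3, 7, 2, 9, 1, 4, 6])
    print(get_graph_params().attn_params[8][0]["context_lens"])
```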
… now Adds a capability flag to indicate that the NPU platform supports graph mode execution. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
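A tiny hypothetical sketch of such a capability flag; the actual attribute name and its place in vllm-ascend's platform class may differ:

```python
class NPUPlatform:
    # Queried before enabling full-graph (ACLGraph) capture on this platform.
    supports_graph_mode: bool = True

    @classmethod
    def is_graph_mode_supported(cls) -> bool:
        return cls.supports_graph_mode
```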
Renames the `slot_mapping_cpu` parameter to `slot_mapping` in test data to align with implementation changes. Mocks `get_forward_context` in attention backend tests to accommodate modifications in the forward pass logic. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
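For illustration only, a self-contained pytest-style example of the mocking pattern mentioned above; the function under test and the patch target are stand-ins, not the actual vllm-ascend test code:

```python
from unittest import mock


def get_forward_context():
    raise RuntimeError("no forward context outside a real model run")


def decode_forward(slot_mapping):
    # The backend now reads state (e.g. capture status) from the forward
    # context instead of receiving everything through keyword arguments.
    ctx = get_forward_context()
    return len(slot_mapping) if ctx.capturing else sum(slot_mapping)


def test_decode_forward_with_mocked_context():
    fake_ctx = mock.MagicMock(capturing=False)
    with mock.patch(f"{__name__}.get_forward_context", return_value=fake_ctx):
        assert decode_forward(slot_mapping=[3, 1, 2]) == 6
```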
Please refactor the code in the next PR.

Naturally, already on it.
…m-project#2128)

Note: This depends on [vLLM #25161](vllm-project/vllm#25161) and the torch_npu release from September 30.

### What this PR does / why we need it?

This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA models like DeepSeek V3/R1 are not included). Key improvements include:

* **Reduced dispatch latency:** By replaying the entire model execution graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Capturing the whole model as one static graph also mitigates dispatch fluctuations across devices.
* **Stream/resource savings:** Consolidating graph captures frees up streams, allowing more graphs to be captured.

**Known issues:**

1. `_npu_paged_attention` currently manages its own workspace in `torch_npu`, which can deadlock when synchronizing during graph replay; we're working on a fix.

There may be other corner cases. This PR is the first in a planned series; we'll continue to iterate and address remaining issues in follow-ups.

This is essentially a port of vllm-project#1503 and vllm-project#1677, but includes two major changes:

1. Let `graph_dispatcher` decide the graph mode instead of hard-coding it in the backend, which decouples Full Graph and Piecewise Graph and could make it possible to remove dynamo.
2. Adapt to the new `attn_group` logic, but leave a small hack in `update_graph_params`; multi-attention models may or may not be fully supported yet.

### Does this PR introduce _any_ user-facing change?

```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```

### How was this patch tested?

Tests included.

- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@9607d5e

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
@wangxiyuan Please see #3101.
### What this PR does / why we need it?

This is the follow-up PR of #2128. Moves graph parameter management components, including `GraphParams`, `get_graph_params`, and `set_graph_params`, from the generic `utils.py` to the more specific `compilation/acl_graph.py`. Additionally, extracts the `update_attn_params` logic from the `NPUModelRunner` class into a standalone function within the `acl_graph` module. This refactoring improves code organization by centralizing ACL graph-related logic into its own dedicated module, enhancing modularity and clarity.

### Does this PR introduce _any_ user-facing change?

None.

### How was this patch tested?

None needed.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Note: This depends on vLLM #25161 and the torch_npu release from September 30.

### What this PR does / why we need it?

This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA models like DeepSeek V3/R1 are not included). Key improvements include:

* **Reduced dispatch latency:** By replaying the entire model execution graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Capturing the whole model as one static graph also mitigates dispatch fluctuations across devices.
* **Stream/resource savings:** Consolidating graph captures frees up streams, allowing more graphs to be captured.

**Known issues:**

1. `_npu_paged_attention` currently manages its own workspace in `torch_npu`, which can deadlock when synchronizing during graph replay; we're working on a fix.

There may be other corner cases. This PR is the first in a planned series; we'll continue to iterate and address remaining issues in follow-ups.

This is essentially a port of #1503 and #1677, but includes two major changes:

1. Let `graph_dispatcher` decide the graph mode instead of hard-coding it in the backend, which decouples Full Graph and Piecewise Graph and could make it possible to remove dynamo.
2. Adapt to the new `attn_group` logic, but leave a small hack in `update_graph_params`; multi-attention models may or may not be fully supported yet.

### Does this PR introduce _any_ user-facing change?

```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```

### How was this patch tested?

Tests included.
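For completeness, a hedged usage sketch of enabling the new mode through vLLM's `compilation_config`; the model name and sampling settings below are placeholders, not part of this PR:

```python
from vllm import LLM, SamplingParams

# Placeholder model; any GQA/MHA model served on Ascend NPUs would apply here.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    compilation_config={
        "cudagraph_mode": "FULL_DECODE_ONLY",  # new mode added by this PR
    },
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```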