
Conversation

yiz-liu
Collaborator

@yiz-liu yiz-liu commented Jul 31, 2025

Note: This depends on vLLM #25161 and the torch_npu release from September 30.

What this PR does / why we need it?

This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA models like DeepSeek V3/R1 are not included). Key improvements include:

  • Reduced dispatch latency: By replaying the entire model execution graph at once, we cut overhead compared with multiple smaller replays.
  • Stabilized multi-device performance: Capturing the whole model as one static graph also mitigates dispatch fluctuations across devices.
  • Stream/resource savings: Consolidating graph captures frees up streams, allowing more graphs to be captured.

Known issues:

  1. `_npu_paged_attention` currently manages its own workspace in `torch_npu`, which can deadlock when synchronizing during graph replay — we’re working on a fix.

There may be other corner cases. This PR is the first in a planned series; we’ll continue to iterate and address remaining issues in follow-ups.

This is essentially a port of #1503 and #1677, but includes two major changes:

  1. Let `graph_dispatcher` decide the graph mode instead of hard-coding it in the backend, which decouples Full Graph and Piecewise Graph and could make it possible to remove Dynamo.
  2. Adapt to the new `attn_group` logic, but leave a small hack in `update_graph_params`; multi-attention models may or may not be fully supported yet.

Does this PR introduce any user-facing change?

```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```
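
For context, a minimal offline-inference sketch of how this option could be passed through vLLM's `LLM` API. The model name is only an example, and exact availability depends on the vLLM and torch_npu versions noted above:

```python
# Minimal sketch, assuming a vLLM build whose compilation_config accepts
# "cudagraph_mode" and a torch_npu release with the required graph support.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # example GQA model; MLA models are not covered
    compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"},
)
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.0, max_tokens=32))
print(outputs[0].outputs[0].text)
```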

How was this patch tested?

Tests included.


👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, so reviewers and future developers can understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@yiz-liu yiz-liu force-pushed the feat-full-graph branch 2 times, most recently from 4ee9061 to 1a97261 on July 31, 2025 11:07

github-actions bot commented Aug 1, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.


codecov bot commented Aug 1, 2025

Codecov Report

❌ Patch coverage is 70.21277% with 28 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@1b40665). Learn more about missing BASE report.
⚠️ Report is 15 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| vllm_ascend/attention/attention_v1.py | 47.22% | 19 Missing ⚠️ |
| vllm_ascend/utils.py | 59.09% | 9 Missing ⚠️ |

❌ Your patch check has failed because the patch coverage (70.21%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
```
@@           Coverage Diff           @@
##             main    #2128   +/-   ##
=======================================
  Coverage        ?   76.12%
=======================================
  Files           ?      121
  Lines           ?    13607
  Branches        ?        0
=======================================
  Hits            ?    10358
  Misses          ?     3249
  Partials        ?        0
```
| Flag | Coverage Δ |
|---|---|
| unittests | 76.12% <70.21%> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.


github-actions bot commented Aug 7, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@yiz-liu yiz-liu force-pushed the feat-full-graph branch 2 times, most recently from 9ec333e to d7cfda2 on August 11, 2025 06:32

@yiz-liu yiz-liu force-pushed the feat-full-graph branch 2 times, most recently from 32ddc77 to f4910be on August 12, 2025 07:58

Adds support for full graph compilation modes, currently only `FULL_DECODE_ONLY`. This feature is experimental and aims to improve performance by capturing the entire model execution graph.

Key changes include:
- Introducing a mechanism for attention backends to declare their ACLGraph support level.
- Automatically downgrading the graph compilation mode at runtime if the selected attention backend does not support the requested mode, providing clear warnings to the user (see the sketch after this list).
- Wrapping the model with an `ACLGraphWrapper` when a full graph mode is active.
- Extending the graph capture logic to handle separate capture routines for prefill and decode steps.
- Adding a warning for users enabling full graph mode, highlighting its experimental nature and potential memory issues.
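
A rough sketch of the downgrade idea described above; the enum values and the `get_supported_aclgraph_mode` capability hook are illustrative stand-ins, not the exact API this commit adds:

```python
# Illustrative only: stand-in names, not the real vllm-ascend classes.
import enum
import logging

logger = logging.getLogger(__name__)

class AclGraphMode(enum.Enum):
    NONE = "NONE"
    PIECEWISE = "PIECEWISE"
    FULL_DECODE_ONLY = "FULL_DECODE_ONLY"

def resolve_aclgraph_mode(requested: AclGraphMode, backend) -> AclGraphMode:
    """Downgrade the requested graph mode if the attention backend cannot run it."""
    supported = backend.get_supported_aclgraph_mode()  # hypothetical capability hook
    if requested != AclGraphMode.NONE and requested != supported:
        logger.warning(
            "Attention backend %s supports ACL graph mode %s; downgrading from %s.",
            type(backend).__name__, supported.value, requested.value)
        return supported
    return requested
```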

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Introduces the ability to build dummy attention metadata during graph capture runs. This ensures that the attention mechanism is included in the captured graph, even when it would normally be skipped.

A `force_attention` flag is added to dummy runs to trigger the creation of this metadata. A new `build_for_graph_capture` method is implemented in the attention metadata builder to construct the necessary metadata for the `DecodeOnly` state during graph compilation.
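
A simplified sketch of what such graph-capture metadata might look like; field names and shapes are illustrative, not the exact builder this commit introduces:

```python
# Sketch only: placeholder decode-only metadata so attention is part of the
# captured graph during a force_attention dummy run.
from dataclasses import dataclass

import torch

@dataclass
class DecodeOnlyAttentionMetadata:
    block_tables: torch.Tensor   # (num_reqs, max_blocks_per_req)
    context_lens: torch.Tensor   # (num_reqs,)
    slot_mapping: torch.Tensor   # (num_reqs,), one decode token per request

def build_for_graph_capture(num_reqs: int, max_blocks_per_req: int,
                            device: torch.device) -> DecodeOnlyAttentionMetadata:
    """Build dummy DecodeOnly metadata so the attention op is captured."""
    return DecodeOnlyAttentionMetadata(
        block_tables=torch.zeros(num_reqs, max_blocks_per_req,
                                 dtype=torch.int32, device=device),
        context_lens=torch.ones(num_reqs, dtype=torch.int32, device=device),
        slot_mapping=torch.zeros(num_reqs, dtype=torch.int32, device=device),
    )
```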

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Introduces support for full graph mode with dynamic shapes for the paged attention operator on Ascend NPUs.

This is achieved by:
- Capturing the paged attention operation into a graph task group during the graph capture phase.
- Introducing a mechanism to store graph-related parameters (handles, events, attention arguments).
- Adding a new `update_attn_params` method to update the `context_lens` argument of the captured paged attention operator on a separate stream before graph replay (see the sketch after this list).
- Moving the `slot_mapping` tensor to the device to avoid mismatched buffers.
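
Conceptually, the replay-time update can be pictured like the following simplified sketch; the real bookkeeping stores per-layer handles, events, and attention arguments, and this is not the exact torch_npu graph-update API:

```python
# Conceptual sketch only; not the exact torch_npu graph-update mechanism.
import torch

def update_attn_params(graph_params: dict, attn_metadata: dict, update_stream):
    """Refresh context_lens of each captured paged-attention call before replay."""
    with torch.npu.stream(update_stream):  # run the updates on the side stream
        for layer_name, (handle, event, args) in graph_params.items():
            # handle: captured task handle (unused in this simplified view).
            # Copy the current lengths into the buffer the captured op reads from.
            args["context_lens"].copy_(attn_metadata[layer_name].context_lens)
            event.record(update_stream)    # graph replay waits on this event
```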

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
… now

Adds a capability flag to indicate that the NPU platform supports graph mode execution.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
@yiz-liu yiz-liu added the ready (read for review) and ready-for-test (start test by label for PR) labels Sep 22, 2025
@yiz-liu yiz-liu changed the title from "[Feat] Implement Full Graph on main branch" to "[Feat][Graph] Support FULL_DECODE_ONLY mode for GQA/MHA models" Sep 22, 2025
Renames the `slot_mapping_cpu` parameter to `slot_mapping` in test data to align with implementation changes.

Mocks `get_forward_context` in attention backend tests to accommodate modifications in the forward pass logic.
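
For illustration, such a test might patch it roughly like this; the module path and fixtures are hypothetical:

```python
# Hypothetical sketch of mocking get_forward_context in a backend test.
from unittest import mock

def test_decode_forward_uses_forward_context(backend, decode_inputs):
    fake_ctx = mock.MagicMock()  # stands in for the real forward context
    with mock.patch(
            "vllm_ascend.attention.attention_v1.get_forward_context",
            return_value=fake_ctx):
        output = backend.forward(*decode_inputs)
    assert output is not None
```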

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
@wangxiyuan
Collaborator

please refactor the code in the next PR.

@wangxiyuan wangxiyuan merged commit 338231a into vllm-project:main Sep 22, 2025
19 checks passed
@yiz-liu
Collaborator Author

yiz-liu commented Sep 22, 2025

> please refactor the code in the next PR.

Naturally, already on it.

@yiz-liu yiz-liu deleted the feat-full-graph branch September 22, 2025 09:38
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Sep 22, 2025
…m-project#2128)

Note: This depends on [vLLM
#25161](vllm-project/vllm#25161) and the
torch\_npu release from September 30.

### What this PR does / why we need it?
This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA
models like DeepSeek V3/R1 are not included). Key improvements include:

* **Reduced dispatch latency:** By replaying the entire model execution
graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Capturing the whole model as
one static graph also mitigates the dispatch fluctuations across
devices.
* **Stream/resource savings:** Consolidating graph captures frees up
streams, allowing more graphs to be captured.

**Known issues:**

1. `_npu_paged_attention` currently manages its own workspace in
`torch_npu`, which can deadlock when synchronizing during graph replay —
we’re working on a fix.

There may be other corner cases. This PR is the first in a planned
series; we’ll continue to iterate and address remaining issues in
follow-ups.

This is essentially a port of vllm-project#1503 and vllm-project#1677, but includes two major
changes:

1. Let `graph_dispatcher` decide the graph mode instead of hard-coding
it in the backend, which decouples Full Graph and Piecewise Graph and
could make it possible to remove dynamo.
2. Adapt to the new `attn_group` logic, but leave a small hack in
`update_graph_params`; multi-attention models may or may not be fully
supported yet.

### Does this PR introduce _any_ user-facing change?
```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```

### How was this patch tested?
Tests included.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@9607d5e

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
@yiz-liu
Collaborator Author

yiz-liu commented Sep 22, 2025

> please refactor the code in the next PR.

Naturally, already on it.

@wangxiyuan Please see #3101 .

Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Sep 22, 2025
wangxiyuan pushed a commit that referenced this pull request Sep 22, 2025
### What this PR does / why we need it?
This is the follow-up PR of #2128.

Moves graph parameter management components, including `GraphParams`,
`get_graph_params`, and `set_graph_params`, from the generic `utils.py`
to the more specific `compilation/acl_graph.py`.

Additionally, extracts the `update_attn_params` logic from the
`NPUModelRunner` class into a standalone function within the `acl_graph`
module.

This refactoring improves code organization by centralizing ACL
graph-related logic into its own dedicated module, enhancing modularity
and clarity.
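
After the move, callers would presumably import these helpers from the new module rather than from `utils`, along the lines of:

```python
# Sketch of the post-refactor import path (exact symbols per this follow-up PR).
from vllm_ascend.compilation.acl_graph import (
    GraphParams,
    get_graph_params,
    set_graph_params,
    update_attn_params,
)
```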

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None needed.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>