
Conversation

marcobarlo
Contributor

@marcobarlo marcobarlo commented Jul 26, 2025

What this PR does / why we need it?

  • This PR adds support for the KV connector interface in the V1 architecture, in the same way as vLLM. vllm-ascend currently lacks this support, which is also required for layerwise management of KV caches.

  • The connector interface allows external tools to be used and integrated with vLLM.

Notes:

We are aware of Issue #684; however, that issue does not modify the attention classes as needed to perform the layerwise management of KV caches required by connectors like LMCache.

This PR ports the necessary code from vanilla vLLM and exposes the same standard KV connector API.

EDIT: this PR originally re-implemented part of the changes to model_runner_v1.py that were merged one hour before this PR was opened. I resolved the conflicts by removing all modifications to the model_runner_v1 file, which are now largely merged in main. This PR is therefore limited to the modifications to the attention_v1 file.

Does this PR introduce any user-facing change?

The PR does not modify current APIs, but it extends the behavior of the current worker runner and attention classes to save and load KV caches. In the absence of connectors, the behavior should remain unchanged.
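
For context, the layerwise hook pattern being ported looks roughly like the minimal sketch below. This is an illustration only, assuming helper names along the lines of upstream vLLM's kv_transfer utilities (has_kv_transfer_group, get_kv_transfer_group, per-layer wait/save methods); exact import paths and signatures may differ between versions, so this is not the literal diff of attention_v1.py.

```python
# Minimal sketch of the layerwise KV connector hooks (illustrative, not the PR diff).
# Helper names are assumed to mirror upstream vLLM's kv_transfer utilities.
from vllm.distributed.kv_transfer import (get_kv_transfer_group,
                                          has_kv_transfer_group,
                                          is_v1_kv_transfer_group)


def wait_for_kv_layer_load(layer_name: str) -> None:
    """Block until the connector has loaded this layer's KV cache (no-op without a connector)."""
    if not has_kv_transfer_group() or not is_v1_kv_transfer_group():
        return
    get_kv_transfer_group().wait_for_layer_load(layer_name)


def maybe_save_kv_layer_to_connector(layer_name: str, kv_cache_layer, attn_metadata) -> None:
    """Hand this layer's KV cache to the connector (e.g. LMCache) right after attention."""
    if not has_kv_transfer_group() or not is_v1_kv_transfer_group():
        return
    get_kv_transfer_group().save_kv_layer(layer_name, kv_cache_layer, attn_metadata)
```

Because both hooks return immediately when no KV transfer group is configured, the default (no-connector) path stays untouched.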

How was this patch tested?

Results without LMCache on Qwen3-8B:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|--------:|--------|-------:|--------|------:|---------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.8400 | ± 0.0101 |
| | | strict-match | 5 | exact_match | 0.8355 | ± 0.0102 |

Results with LMCache Layerwise:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|--------:|--------|-------:|--------|------:|---------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.8385 | ± 0.0101 |
| | | strict-match | 5 | exact_match | 0.8332 | ± 0.0103 |


This pull request has conflicts, please resolve those before we can evaluate the pull request.

@marcobarlo marcobarlo marked this pull request as draft July 26, 2025 10:32
@marcobarlo marcobarlo marked this pull request as ready for review July 26, 2025 10:48

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@marcobarlo marcobarlo force-pushed the main branch 2 times, most recently from 78431f2 to 0df0062 Compare July 28, 2025 10:01

codecov bot commented Jul 28, 2025

Codecov Report

❌ Patch coverage is 14.28571% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.91%. Comparing base (5d8ec28) to head (bb8697d).
⚠️ Report is 83 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|--------:|-------|
| vllm_ascend/attention/attention_v1.py | 14.28% | 18 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (14.28%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2039      +/-   ##
==========================================
- Coverage   77.99%   77.91%   -0.08%     
==========================================
  Files         134      134              
  Lines       18498    18519      +21     
==========================================
+ Hits        14427    14430       +3     
- Misses       4071     4089      +18     
| Flag | Coverage Δ |
|------|------------|
| unittests | 77.91% <14.28%> (-0.08%) ⬇️ |


@jianzs
Collaborator

jianzs commented Jul 28, 2025

Please ensure the modification works with torchair graph mode.

@LCAIZJ
Contributor

LCAIZJ commented Jul 29, 2025

Great work on this! I have two follow-up questions to clarify:

  1. What do you mean by "vanilla vllm"?
  2. Does LMCache support KV cache transfer on Ascend?

Could you please address these when you have a chance? Thanks!

@marcobarlo
Contributor Author

marcobarlo commented Jul 29, 2025

@LCAIZJ

  1. Sorry, by "vanilla vllm" I meant the upstream vLLM code for GPUs and other accelerators, without vllm-ascend; specifically here -> https://github.com/vllm-project/vllm/blob/main/vllm/attention/layer.py
  2. We are working on it; we have a working version and will upstream it soon: Add support for ascend npu LMCache/LMCache#807. This PR was required to create that upcoming LMCache PR.

@jianzs My bad, I rebased from a previous version and didn't see the torchair file. It looks like it does not support DeepSeek lite. I am looking for a way to test it without running a full DeepSeek model. Any advice on small models?

@wangxiyuan
Collaborator

Thanks for the contribution. @Potabk please test this PR with LMCache. Thanks.

@Potabk
Collaborator

Potabk commented Aug 7, 2025

Why don't we directly use the vLLM interface?

@marcobarlo
Contributor Author

marcobarlo commented Aug 7, 2025

@Potabk the vLLM implementation lives in vllm/attention/layer.py and makes assumptions about the attention objects. For example, in its current state, the vLLM function asserts that the attention metadata is a dict and then accesses the metadata of a specific layer, which does not hold for vllm-ascend.
I don't think it is wise to rely on something implemented internally in the vLLM attention module.
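
To illustrate the assumption being described, a purely hypothetical adapter (not code from either repository) might look like this: upstream expects per-layer metadata keyed by layer name, while vllm-ascend passes a single shared metadata object.

```python
# Hypothetical illustration of the metadata-layout mismatch described above.
from typing import Any, Dict, Union


def resolve_layer_metadata(attn_metadata: Union[Dict[str, Any], Any],
                           layer_name: str) -> Any:
    if isinstance(attn_metadata, dict):
        # Upstream vLLM layout: one metadata object per attention layer,
        # keyed by the layer name.
        return attn_metadata[layer_name]
    # vllm-ascend currently passes a single shared metadata object,
    # so the per-layer lookup above would not apply as-is.
    return attn_metadata
```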

@wangxiyuan Please note that we worked on 0.9.2: the LMCache PR works with 0.9.2, where kvcaches was a list of tensors rather than a list of tuples. This PR was developed before commit df0ec55 (marcobarlo@df0ec55). I had to adapt this PR to avoid conflicts with that commit, which integrated the KV connector into the model_runner_v1.py file; however, the same commit also changed the kvcache from a list of tensors to a list of tuples. If you want to test this PR together with LMCache, I would suggest cherry-picking this PR's commit onto the 0.9.2rc and integrating the required modifications in model_runner_v1.py, which are part of the df0ec55 commit.
In particular, the required pieces of code from marcobarlo@df0ec55 are:
L20; L42 to L44; L47; L1158; L1189 to L1191; L1454 to L1460; L1644 to L1686.
Essentially, all the code related to KV connectors without disaggregation.

We will try to port LMCache to 0.10 and handle the list of tuples in the next couple of weeks.

@jianzs I am still working on the torchair attention file; graph compilation is causing some issues. I'll get back as soon as possible. Any suggestion is welcome.

@Potabk
Collaborator

Potabk commented Aug 8, 2025

@wangxiyuan I think this will not affect the normal logic of the LLMDataDistCMgrConnector that we have currently implemented.
@marcobarlo if I want to test this with LMCache, what should I do? We haven't implemented an LMCache-based connector yet.

@marcobarlo
Contributor Author

marcobarlo commented Aug 8, 2025

@Potabk I would suggest one of two approaches:

  • Apply the commit of this PR to 0.9.2 and add the KV connector calls in the model_runner_v1.py file as specified in my previous message (this is what I did); or
  • Use the code of this PR but revert the kvcaches to a list of tensors (they are currently a list of tuples); a rough sketch of that conversion is shown below.
    Whichever you prefer; at the moment LMCache assumes the kvcaches are a list of tensors.
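
A rough sketch of the second option, assuming each layer's KV cache is a (key, value) pair and that stacking key/value along a new leading dimension matches what LMCache expected at the time (an assumption for illustration, not the exact layout):

```python
# Illustrative conversion between the two KV cache layouts discussed above.
# The stacking dimension is an assumption; the real layout depends on the backend.
from typing import List, Tuple

import torch


def tuples_to_tensors(
        kv_caches: List[Tuple[torch.Tensor, torch.Tensor]]) -> List[torch.Tensor]:
    # One tensor per layer, with key/value stacked on a new leading dim.
    return [torch.stack((k, v), dim=0) for k, v in kv_caches]


def tensors_to_tuples(
        kv_caches: List[torch.Tensor]) -> List[Tuple[torch.Tensor, torch.Tensor]]:
    # Inverse of the above: split each layer back into a (key, value) pair.
    return [(kv[0], kv[1]) for kv in kv_caches]
```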

Feel free to contact me privately if you need support.

@wangxiyuan
Collaborator

This is not needed anymore, right?


This pull request has conflicts, please resolve those before we can evaluate the pull request.

@marcobarlo
Contributor Author

marcobarlo commented Aug 20, 2025

@wangxiyuan LMCache-Ascend still requires it. You can see how it is built in https://github.com/LMCache/LMCache-Ascend/blob/main/docker/Dockerfile.a2.openEuler in PR LMCache/LMCache-Ascend#1. Why do you think it is not needed?

@wangxiyuan
Collaborator

OK, please get the CI to pass. Then I think we can merge this.

@marcobarlo marcobarlo force-pushed the main branch 4 times, most recently from f080543 to fe40890 Compare August 26, 2025 00:26
@marcobarlo
Contributor Author

marcobarlo commented Aug 26, 2025

@wangxiyuan is the CI OK? I rebased the commit on the latest main, but the tests appear to be failing for reasons unrelated to this PR's modifications. I saw a commit today called "Fix broken CI".

@wangxiyuan
Collaborator

@marcobarlo we noticed this error; it's caused by #2546, feel free to ignore it. We'll fix it ASAP.

@wangxiyuan wangxiyuan added the ready read for review label Aug 28, 2025
@matthewygf

@wangxiyuan Hi, once this is ready, we aim to start our initial support for v0.10.x

@wangxiyuan wangxiyuan merged commit 6666e52 into vllm-project:main Sep 8, 2025
27 of 30 checks passed
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Sep 10, 2025
### What this PR does / why we need it?
- This PR adds the support for the KV connector interface in the V1
architecture, in the same way as vllm. Vllm-ascend currently lacks of
this support, required to support also layerwise management of KV
caches.

- The connector interface allows using external tools and integrate them
with vllm

### Notes:
We are aware of Issue vllm-project#684 , however that issue does not modify the
attention classes as necessary to perform layerwise management of KV
caches required for connectors like LMCache.

The implementation of this PR ported the necessary code from the vanilla
vllm. The KV connector API is the same as vanilla vllm, supporting the
standard KV connector API.

EDIT: this PR was re-implementing part of the changes merged one hour
before this PR was made on the file model_runner_v1.py. I solved the
conflicts by removing any modification to the model_runner_v1 file,
which now are largely already merged in main. Now this PR is left for
the modifications to the attention_v1 file.

### Does this PR introduce _any_ user-facing change?
The PR does not modify current APIs, but it extends the behavior of
current worker runner and attention classes to save and load KV caches.
In absence of connectors, the behavior should stay untouched.

### How was this patch tested?
- No unit test implemented yet for the worker.

- Tested together with LMCache using
https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/local_backends/offload.py
with the following models:
1 Deepseek-R1-Distill-Qwen-1.5B
2 Qwen3-30B-A3B
3 Deepseek-v2-lite
4 Llama-3.1-8B
LMCache used in both layerwise and non-layerwise mode.

- Performed LMEval on LMCache integrated with vllm-ascend.

Results without LMCache on Qwen3-8B:
|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|-----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.8400|±|0.0101|
| | |strict-match|5|exact_match|↑|0.8355|±|0.0102|

Results with LMCache Layerwise:

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|-----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.8385|±|0.0101|
| | |strict-match|5|exact_match|↑|0.8332|±|0.0103|


- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@50fede6

---------

Signed-off-by: marcobarlo <barlettamarco8@gmail.com>
Signed-off-by: marcobarlo <65128997+marcobarlo@users.noreply.github.com>
offline893 pushed a commit to offline893/vllm-ascend that referenced this pull request Sep 16, 2025
wangxiaoteng888 pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Sep 25, 2025
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025