Added support for KV connector v1 #2039
Conversation
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from 78431f2 to 0df0062
Codecov Report
❌ Patch coverage is 14.28%. Your patch status has failed because the patch coverage (14.28%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
@@ Coverage Diff @@
## main #2039 +/- ##
==========================================
- Coverage 77.99% 77.91% -0.08%
==========================================
Files 134 134
Lines 18498 18519 +21
==========================================
+ Hits 14427 14430 +3
- Misses 4071 4089 +18
Flags with carried forward coverage won't be shown.
Please ensure the modification works with torchair graph mode.
Great work on this! I have two follow-up questions to clarify:
@jianzs My bad, I rebased from a previous version and didn't see the torchair file. It looks like it does not support Deepseek lite. I am looking for a way to test without running a full DeepSeek. Any advice on small models?
Thanks for the contribution. @Potabk please test this PR with LMCache. Thanks.
Why don't we directly use the vLLM interface?
@Potabk The vLLM implementation lives in vllm/attention/layer.py and makes assumptions about the attention objects. For example, as it currently stands, the vLLM function asserts that the attention metadata is a dict and then accesses the metadata of a specific layer, which does not hold for vllm-ascend.

@wangxiyuan Please note that we worked against 0.9.2: the LMCache PR targets 0.9.2, where kvcaches was a list of tensors rather than a list of tuples. This PR was developed before commit df0ec55 (marcobarlo@df0ec55). I had to adapt this PR to avoid conflicts with that commit, which integrated the KV connector into the model_runner_v1.py file; however, the same commit also changed the KV cache from a list of tensors to a list of tuples. If you want to test this PR together with LMCache, I would suggest cherry-picking this PR's commit onto the 0.9.2rc and integrating the required model_runner_v1.py modifications from the df0ec55 commit. We will try to port LMCache to 0.10 and deal with the list of tuples in the next couple of weeks.

@jianzs Still working on the torchair attention file; graph compilation is causing some issues. I'll get back as soon as possible. Any suggestion is welcome.
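For illustration, here is a hedged sketch of the assumption being described; the helper name mirrors the upstream one, but the body is simplified and is not the exact code from either repository.

```python
# Illustrative sketch only, assuming vLLM's public kv_transfer helpers.
# Upstream vllm/attention/layer.py asserts that forward_context.attn_metadata
# is a dict keyed by layer name; on vllm-ascend the metadata is a single
# shared object, so the ported helper skips the per-layer dict lookup.
from vllm.distributed.kv_transfer import (get_kv_transfer_group,
                                          has_kv_transfer_group)
from vllm.forward_context import get_forward_context


def wait_for_kv_layer_from_connector(layer_name: str) -> None:
    """Block until the connector has loaded the KV cache for `layer_name`."""
    if not has_kv_transfer_group():
        return
    connector = get_kv_transfer_group()
    attn_metadata = get_forward_context().attn_metadata
    if attn_metadata is None:
        return
    # Upstream assumption that breaks on vllm-ascend:
    #   assert isinstance(attn_metadata, dict)
    #   attn_metadata = attn_metadata[layer_name]
    # Here the shared metadata object is used as-is.
    connector.wait_for_layer_load(layer_name)
```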
@wangxiyuan I think this will not affect the normal logic of the LLMDataDistCMgrConnector that we have currently implemented.
@Potabk I would suggest one of two ways:
Feel free to contact me privately if you need support.
This is not needed anymore, right?
This pull request has conflicts, please resolve those before we can evaluate the pull request.
@wangxiyuan LMCache-Ascend still requires it. You can look at https://github.com/LMCache/LMCache-Ascend/blob/main/docker/Dockerfile.a2.openEuler in PR LMCache/LMCache-Ascend#1 to see how it is built. Why do you think it is not needed?
OK, please let the CI pass. Then I think we can merge this.
Force-pushed from f080543 to fe40890
Signed-off-by: marcobarlo <barlettamarco8@gmail.com>
Signed-off-by: marcobarlo <65128997+marcobarlo@users.noreply.github.com>
@wangxiyuan Is the CI OK? I rebased the commit on the latest main, but the tests appear to be failing for reasons unrelated to the modifications in this PR. I saw a commit today called "Fix broken CI".
@marcobarlo We noticed this error; it's caused by #2546, so feel free to ignore it. We'll fix it ASAP.
@wangxiyuan Hi, once this is ready, we aim to start our initial support for v0.10.x
What this PR does / why we need it?
This PR adds support for the KV connector interface in the V1 architecture, in the same way as vLLM. vllm-ascend currently lacks this support, which is also required for layerwise management of KV caches.
The connector interface allows using external tools and integrating them with vLLM (a minimal sketch of the resulting worker-side flow is shown below).
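The sketch below shows the layerwise load/save pattern that such a connector enables on the worker side. The connector methods (start_load_kv, wait_for_layer_load, save_kv_layer, wait_for_save) follow vLLM's KVConnectorBase_V1 API; the loop, the layers container, and the kv_cache attribute are illustrative placeholders rather than code from this PR.

```python
from vllm.distributed.kv_transfer import (get_kv_transfer_group,
                                          has_kv_transfer_group)


def run_layers_with_connector(layers, forward_ctx):
    """Illustrative layerwise KV load/save loop over (name, layer) pairs."""
    if not has_kv_transfer_group():
        # No connector configured: plain forward pass, behaviour unchanged.
        return [layer.forward(forward_ctx) for _, layer in layers]

    connector = get_kv_transfer_group()
    connector.start_load_kv(forward_ctx)      # kick off (possibly async) KV loads
    outputs = []
    for name, layer in layers:
        connector.wait_for_layer_load(name)   # KV for this layer must be in place
        outputs.append(layer.forward(forward_ctx))
        # Hand this layer's freshly written KV cache to the connector so it can
        # be offloaded/transferred while later layers keep computing.
        connector.save_kv_layer(name, layer.kv_cache, forward_ctx.attn_metadata)
    connector.wait_for_save()                 # ensure all saves have completed
    return outputs
```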
Notes:
We are aware of Issue #684; however, that issue does not modify the attention classes as needed for the layerwise management of KV caches required by connectors like LMCache.
This PR ports the necessary code from vanilla vLLM and exposes the same standard KV connector API as vanilla vLLM.
EDIT: this PR originally re-implemented part of the changes to model_runner_v1.py that were merged one hour before this PR was opened. I resolved the conflicts by removing all modifications to the model_runner_v1 file, which are now largely merged in main. This PR is now limited to the modifications to the attention_v1 file.
Does this PR introduce any user-facing change?
The PR does not modify current APIs, but it extends the behavior of the current worker runner and attention classes to save and load KV caches. In the absence of a connector, the behavior should stay untouched.
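As a usage illustration only (not part of this PR's changes), a connector is enabled through vLLM's standard kv_transfer_config; the model name and connector settings below are assumptions modeled on the LMCache setup used for testing.

```python
from vllm import LLM
from vllm.config import KVTransferConfig

# Enable a V1 KV connector (here LMCache's); without a kv_transfer_config,
# the new save/load hooks are never exercised and behaviour is unchanged.
ktc = KVTransferConfig(
    kv_connector="LMCacheConnectorV1",  # connector name as registered in vLLM
    kv_role="kv_both",                  # this instance both saves and loads KV
)

llm = LLM(model="Qwen/Qwen3-8B", kv_transfer_config=ktc)
outputs = llm.generate(["The capital of France is"])
print(outputs[0].outputs[0].text)
```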
How was this patch tested?
No unit test implemented yet for the worker.
Tested together with LMCache using https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/local_backends/offload.py with the following models:
1. Deepseek-R1-Distill-Qwen-1.5B
2. Qwen3-30B-A3B
3. Deepseek-v2-lite
4. Llama-3.1-8B
LMCache was used in both layerwise and non-layerwise modes.
Performed LMEval on LMCache integrated with vllm-ascend.
Results without LMCache on Qwen3-8B:

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8400|± |0.0101|
| | |strict-match | 5|exact_match|↑ |0.8355|± |0.0102|

Results with LMCache Layerwise:

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8385|± |0.0101|
| | |strict-match | 5|exact_match|↑ |0.8332|± |0.0103|

vLLM version: v0.10.1.1
vLLM main: vllm-project/vllm@50fede6