[Feature][1/2] Impl the connector based on the llmdatadist for v1 #684
Conversation
Measured the time taken by KV transfers at different sequence lengths. Environment:
Stacked charts show higher times than the overall chart because each measured stage is wrapped in NPU synchronization before and after. In particular, the extract kv, scatter update, and inject kv stages synchronize on every layer, resulting in significant host overhead.
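For reference, here is a minimal sketch of how such per-stage timings can be collected, assuming the torch_npu Ascend backend; the stage name in the usage comment is a placeholder, not a function from this PR.

```python
import time
from contextlib import contextmanager

import torch
import torch_npu  # noqa: F401  # registers the "npu" device with PyTorch


@contextmanager
def npu_stage_timer(label: str, results: dict):
    """Time one stage with a device barrier before and after it."""
    torch.npu.synchronize()
    start = time.perf_counter()
    yield
    torch.npu.synchronize()
    results[label] = results.get(label, 0.0) + (time.perf_counter() - start)


# Hypothetical usage: timing a per-layer stage such as "extract kv".
# results: dict[str, float] = {}
# for layer_idx in range(num_layers):
#     with npu_stage_timer("extract_kv", results):
#         extract_kv_for_layer(layer_idx)  # placeholder for the real call
```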
Hi, I tried this PR, but there seems to be a precision issue. The prompt is "What is the largest animal in the world?" with temperature == 0, using Qwen2.5 0.5B: the output under PD disaggregation differs from the normal (aggregated) output.
Thank you for reporting this issue. I've tested with DeepSeek v2 Lite and Llama2 7B, and observed that:
Could you confirm whether you're seeing incorrect responses consistently in your tests? And are your configurations, including parallelism, identical in both the disaggregated and standalone environments?
Yes, I see this consistently. I used the same shell script 'disaggregated_prefill_multi_prefill.sh', but changed its TP to 1 and the model to Qwen2.5 0.5B, matching the aggregated setup. For the aggregated run I used the default settings below:
python -m vllm.entrypoints.openai.api_server --model Qwen2.5-0.5B-Instruct
As for the rank table:
{
"server_group_list":[
{
"group_id": "0",
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_ip": "10.172.116.166",
"container_ip": "10.172.116.166"
}
],
"status": "completed"
},
{
"group_id": "1",
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_ip": "10.172.116.166",
"server_id": "server-0",
"device": [
{
"device_id": "0",
"device_ip": "172.22.17.1",
"rank_id": "0"
}
],
"container_ip": "10.172.116.166"
}
],
"status": "completed"
},
{
"group_id": "2",
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_ip": "10.172.116.166",
"server_id": "server-1",
"device": [
{
"device_id": "4",
"device_ip": "172.22.17.5",
"rank_id": "0"
}
],
"container_ip": "10.172.116.166"
}
],
"status": "completed"
}
]
}
I fixed an accuracy issue. Please try again.
Hi, thanks for your work. Unfortunately, it still produces inconsistent results with the 0.5B model, but when I switch to the 1.5B model, the disaggregated version produces the correct output. I hope this helps.
Thanks. Fixed a bug, please try again.
Great! It works for me now.
self.num_layers, kv_cache_shape, kv_hidden_dtype)
self._attach_kv_buffer(kv_buffer)

target_tp_rank = self.tp_rank % min(
Why take the modulo with the minimum of the prefill and decode TP sizes? Can't it just use the TP rank directly?
This design originally aimed to support heterogeneous parallelism between prefill and decode phases. For scenarios where prefill TP size < decode TP size, each rank could determine its connection count using the modulo method.
However, due to current LLMDataDist constraints, decode TP size must be ≤ prefill TP size. Consequently, using either modulo operation or direct TP rank assignment achieves identical results.
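As an illustration of the mapping described above, a minimal sketch (not the exact code from this PR):

```python
def target_prefill_tp_rank(decode_tp_rank: int,
                           prefill_tp_size: int,
                           decode_tp_size: int) -> int:
    """Map a decode-side TP rank onto a prefill-side TP rank.

    The modulo with the smaller TP size was intended for heterogeneous
    parallelism; under the current LLMDataDist constraint
    (decode TP <= prefill TP) it reduces to the identity mapping.
    """
    return decode_tp_rank % min(prefill_tp_size, decode_tp_size)


# Current constraint (decode TP <= prefill TP): identity mapping.
assert target_prefill_tp_rank(1, prefill_tp_size=4, decode_tp_size=2) == 1
# Originally intended case (prefill TP < decode TP): decode ranks wrap around,
# e.g. with prefill TP=2 and decode TP=4, ranks 0..3 map to prefill ranks 0,1,0,1.
assert target_prefill_tp_rank(3, prefill_tp_size=2, decode_tp_size=4) == 1
```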
"kv_buffer_device": "npu", | ||
"kv_role": "kv_producer", | ||
"kv_rank": 0, | ||
"kv_parallel_size": 2, |
What does this kv_parallel_size do?
The v0 implementation needed this, but I'm unsure if it's still necessary.
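For context, a hedged sketch of how a producer/consumer pair might be configured end to end; the "kv_connector" name below is a placeholder rather than the class this PR actually registers, and only the remaining keys mirror the snippet under review.

```python
# Sketch of the two kv_transfer_config dicts for a 1P1D setup (assumed shape).
producer_cfg = {
    "kv_connector": "LLMDataDistConnectorV1",  # placeholder connector name
    "kv_buffer_device": "npu",
    "kv_role": "kv_producer",   # prefill instance
    "kv_rank": 0,
    "kv_parallel_size": 2,      # possibly a v0 leftover, per the reply above
}
consumer_cfg = {
    **producer_cfg,
    "kv_role": "kv_consumer",   # decode instance
    "kv_rank": 1,
}
```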
The code looks good to me in general, but I'm not very familiar with llmdatadist. Could @whx-sjtu review this PR in more detail?
device_ip: str
dp_rank: int
tp_rank: int
cluster_id: int
You may need to add a new member, super_device_id, if you want to run disaggregated prefill on an A3 super node.
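A minimal sketch of what that could look like; the class name and the optional default are assumptions, and only the four fields shown in the diff come from this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentMetadata:  # hypothetical name
    device_ip: str
    dp_rank: int
    tp_rank: int
    cluster_id: int
    super_device_id: Optional[int] = None  # proposed addition for A3 super node
```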
Eliminates the need to launch the meta server in the 1p1d environment.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
### What this PR does / why we need it?
- This PR adds support for the KV connector interface in the V1 architecture, in the same way as vLLM. vllm-ascend currently lacks this support, which is required to also support layerwise management of KV caches.
- The connector interface allows using external tools and integrating them with vLLM.

### Notes
We are aware of Issue #684; however, that issue does not modify the attention classes as necessary to perform layerwise management of KV caches required for connectors like LMCache. The implementation of this PR ported the necessary code from vanilla vLLM. The KV connector API is the same as vanilla vLLM, supporting the standard KV connector API.

EDIT: this PR was re-implementing part of the changes merged on model_runner_v1.py one hour before this PR was made. I solved the conflicts by removing any modification to the model_runner_v1 file, which is now largely already merged in main. Now this PR is left with the modifications to the attention_v1 file.

### Does this PR introduce _any_ user-facing change?
The PR does not modify current APIs, but it extends the behavior of the current worker runner and attention classes to save and load KV caches. In the absence of connectors, the behavior should stay untouched.

### How was this patch tested?
- No unit test implemented yet for the worker.
- Tested together with LMCache using https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/local_backends/offload.py with the following models:
  1. Deepseek-R1-Distill-Qwen-1.5B
  2. Qwen3-30B-A3B
  3. Deepseek-v2-lite
  4. Llama-3.1-8B
  LMCache was used in both layerwise and non-layerwise mode.
- Performed LMEval on LMCache integrated with vllm-ascend.

Results without LMCache on Qwen3-8B:

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8400|± |0.0101|
| | |strict-match | 5|exact_match|↑ |0.8355|± |0.0102|

Results with LMCache Layerwise:

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8385|± |0.0101|
| | |strict-match | 5|exact_match|↑ |0.8332|± |0.0103|

- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@50fede6

Signed-off-by: marcobarlo <barlettamarco8@gmail.com>
Signed-off-by: marcobarlo <65128997+marcobarlo@users.noreply.github.com>
This PR implements the connector functionality for NPU based on LLMDataDist, building upon the connector API merged in vLLM v1. (vllm-project/vllm#15960) We've successfully tested various scenarios in offline environments:
Key implementation aspects include:
Cross-machine PD: LLMDataDist requires the NPU device IP to establish connections. Our approach uses a global rank table (JSON) on each machine containing:
nPmD: Given that the community's nPmD design, particularly the router component API, is still evolving, we've implemented a solution using a meta server component (to be provided separately) that:
We propose initially merging the 1P1D implementation, where the global rank table contains information for two nodes, allowing direct prefill node identification. The nPmD implementation can be refined and merged following community discussion.
Todo:
re #448
Note:
A minor modification to vLLM's codebase is required to run this example successfully. The patch enables the scheduler process to locate the appropriate connector class by importing the necessary module.
The change should be made in vllm/v1/core/sched/scheduler.py, adding an import statement for vllm_ascend.distributed (a sketch follows below). This is a temporary solution, and we need to implement a more elegant module discovery mechanism.
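A sketch of that temporary patch, assuming the import only needs to trigger the connector registration:

```python
# vllm/v1/core/sched/scheduler.py (temporary workaround)
import vllm_ascend.distributed  # noqa: F401  # makes the Ascend connector discoverable
```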
Limits:
A hash function (string_to_int64_hash) is used to convert request IDs to datadist request IDs. This conversion is lossy, potentially creating duplicate IDs, leading to duplicate CacheKeys and allocate_cache failures.
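To illustrate why this conversion is lossy, here is a hypothetical re-implementation of such a hash (not necessarily the one used in this PR) together with the collision argument.

```python
import hashlib


def string_to_int64_hash(request_id: str) -> int:
    """Illustrative only: fold a SHA-256 digest into a signed 64-bit integer."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="little", signed=True)


# Any mapping from arbitrary strings to 64 bits must collide for some inputs
# (pigeonhole principle). A collision yields the same datadist request ID for
# two different vLLM request IDs, hence duplicate CacheKeys and
# allocate_cache failures.
```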