# [Feat] A Connector that supports Mooncake store #2913
Conversation
Co-authored-by: fems14 <1804143737@qq.com>
Co-authored-by: Dreamerleader <2270923832@qq.com>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: lizy124 <1950471827@qq.com>
Co-authored-by: zouyida2052 <zouyida2002@gmail.com>
Signed-off-by: LCAIZJ <leichao139636@163.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces a new KV cache connector, `MooncakeConnectorV1`, which seems to integrate with a distributed KV store called "kvpool" or "mooncake". The changes are extensive, adding several new modules for configuration, data transfer, and the core engine logic for this connector. The review identifies several critical runtime errors (`NameError`, `AttributeError`) due to undefined variables or incorrect assumptions about data structures, which would prevent the feature from working correctly. There are also issues with incorrect logic in key parsing and a hardcoded timeout that could impact production stability. These critical issues need to be addressed to ensure the functionality and robustness of the new connector.
vllm_ascend/attention/mla_v1.py
Outdated
```python
assert attn_metadata.num_decodes is not None and \
    attn_metadata.num_prefills is not None and \
    attn_metadata.num_decode_tokens is not None
self.wait_for_kv_layer_from_connector(layer.layer_name)
```
The variable `layer` is not defined in this scope, which will cause a `NameError` at runtime. The layer name is required here. A possible solution is to store `kv_sharing_target_layer_name` (which is passed to `__init__`) as an instance attribute, e.g., `self.layer_name`, and use it here. This same issue exists on line 1054.
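A minimal runnable sketch of the suggested fix. The class and method bodies below are simplified stand-ins, not the real attention implementation; only the pattern of storing `kv_sharing_target_layer_name` as `self.layer_name` follows the comment above.

```python
# Sketch of the suggested fix: keep the layer name that was passed to
# __init__ as an instance attribute, so later calls do not reference an
# undefined `layer` variable. Class and method bodies are simplified stubs.

class MLAImplSketch:
    def __init__(self, kv_sharing_target_layer_name: str):
        # Store the name once at construction time ...
        self.layer_name = kv_sharing_target_layer_name

    def wait_for_kv_layer_from_connector(self, layer_name: str) -> str:
        # Stand-in for the real connector wait call.
        return f"waited for {layer_name}"

    def forward(self) -> str:
        # ... and use it here instead of the undefined `layer.layer_name`.
        return self.wait_for_kv_layer_from_connector(self.layer_name)
```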
code removed
vllm_ascend/attention/mla_v1.py
Outdated
```python
    is_force_scatter=self.enable_shared_expert_dp)[0]
current_ms_metadata.after_comm_event.record()
del o_proj_input
self.maybe_save_kv_layer_to_connector(layer_name=layer.layer_name, kv_cache_layer=kv_cache)
```
code removed
```python
if len(parts) != 5:
    raise ValueError(f"Invalid key string: {s}")
```
The length check for `parts` is incorrect. The `to_string` method for `MooncakeEngineKey` produces a string with 4 parts separated by '@'. Therefore, the check should be `len(parts) != 4`. With the current check, this function will always raise a `ValueError` for valid keys.
```diff
-if len(parts) != 5:
+if len(parts) != 4:
     raise ValueError(f"Invalid key string: {s}")
```
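To illustrate why `4` is the right count, here is a hypothetical sketch of a key type whose `to_string` emits four '@'-separated fields. The field names are assumptions for illustration, not the actual `MooncakeEngineKey` definition.

```python
# Hypothetical key sketch: to_string() serializes exactly four fields, so
# from_string() must check for four parts; a `!= 5` check would reject
# every key this class produces. Field names are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class EngineKey:
    model_name: str
    world_size: int
    worker_id: int
    chunk_hash: str

    def to_string(self) -> str:
        return f"{self.model_name}@{self.world_size}@{self.worker_id}@{self.chunk_hash}"

    @classmethod
    def from_string(cls, s: str) -> "EngineKey":
        parts = s.split("@")
        if len(parts) != 4:  # matches the four fields emitted by to_string()
            raise ValueError(f"Invalid key string: {s}")
        return cls(parts[0], int(parts[1]), int(parts[2]), parts[3])
```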
code removed
```python
for start, end, key in self.token_database.process_tokens(tokens, mask):
    addr, size, _ = self.prepare_value(start, end, block_ids)
    self.m_store.put(key, addr, size)
    if is_last_chunk:
```
modified
```python
key_str = key.to_string()
try:
    res = self.store.batch_get_into_ascend(key_str, addr, size)
    if res[0] != expect_res:
```
modified
```python
request_ids = [req.req_id for req in scheduler_output.scheduled_new_reqs]
for request_id, (request, block_ids) in self._unfinished_requests.items():
    if not request_id in request_ids and not request_id in cached_reqs.req_ids:
```
Accessing `cached_reqs.req_ids` will raise an `AttributeError` when `cached_reqs` is a list. This happens when `use_v1_spec=False` in vLLM's configuration. You should handle both cases where `cached_reqs` is a list of objects or a dataclass with a `req_ids` attribute. A robust way would be to create a set of request IDs from `cached_reqs` before this loop.

For example:

```python
cached_req_ids_set = set()
if isinstance(cached_reqs, list):
    cached_req_ids_set = {req.req_id for req in cached_reqs}
else:
    cached_req_ids_set = set(cached_reqs.req_ids)
# ... later in the loop ...
if request_id not in request_ids and request_id not in cached_req_ids_set:
    # ...
```
ignored
```python
keys = [list(row) for row in zip(*keys, strict=False)]  # [num_layer, block_num]
for layer_id, keys_multi_chunk in enumerate(keys):
    if not first_flag:
        is_finish = self.get_event.wait(timeout=3)  # try---cache
```
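The review summary above also flags the hardcoded `timeout=3` as a production-stability risk. A hedged sketch of one way to make it configurable is below; the config key `get_timeout_s` and the default value are assumptions for illustration, not part of this PR.

```python
# Sketch: replace the hardcoded `self.get_event.wait(timeout=3)` with a
# value read from the connector's extra config. The key name and default
# here are assumptions; the real connector may wire this differently.
import threading

DEFAULT_GET_TIMEOUT_S = 3.0

class GetWaiter:
    def __init__(self, extra_config: dict):
        self.get_event = threading.Event()
        self.get_timeout = float(
            extra_config.get("get_timeout_s", DEFAULT_GET_TIMEOUT_S))

    def wait_for_chunk(self) -> bool:
        # Event.wait returns False on timeout instead of blocking forever,
        # so callers can detect a slow or failed fetch and fall back.
        return self.get_event.wait(timeout=self.get_timeout)
```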
modified
Signed-off-by: fems14 <1804143737@qq.com>
# MultiConnector + Mooncake Basic Scenario Verification & Pooling and Mixed Deployment Scenario Verification
Mooncake store deployment guide
modified
* Python >= 3.9, < 3.12
* CANN >= 8.2.rc1
* PyTorch >= 2.7.1, torch-npu >= 2.7.1.dev20250724
* vLLM: Mainline branch
main branch
modified
* PyTorch >= 2.7.1, torch-npu >= 2.7.1.dev20250724
* vLLM: Mainline branch
* vLLM-Ascend: Mainline branch
* Mooncake: [AscendTransport/Mooncake at pooling-async-memcpy](https://github.com/AscendTransport/Mooncake/tree/pooling-async-memcpy)
Double-check that the origin link is correct.
modified
* vLLM: Mainline branch
* vLLM-Ascend: Mainline branch
* Mooncake: [AscendTransport/Mooncake at pooling-async-memcpy](https://github.com/AscendTransport/Mooncake/tree/pooling-async-memcpy)
* mooncake-transfer-engine reference documentation: https://github.com/kvcache-ai/Mooncake/blob/main/doc/zh/ascend_transport.md
remove this one
modified
```
mooncake_master --port 50088
```

## multiConnector + mooncake basic scenario
Prefill/decode disaggregation scenario
modified
* CANN >= 8.2.rc1
* PyTorch >= 2.7.1, torch-npu >= 2.7.1.dev20250724
* vLLM: Mainline branch
* vLLM-Ascend: Mainline branch
main branch
modified
What this PR does / why we need it?
Added a new connector for Mooncake store integration to enable kvcache reuse in scenarios with system prompts or multi-turn dialogues.
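For context, KV connectors in vLLM are typically enabled through the `--kv-transfer-config` flag, which takes a JSON object. The sketch below is an assumption-laden example: the connector name is taken from the review summary in this thread, and the model and role values are placeholders, not settings verified against this PR.

```shell
# Hedged launch sketch. The connector name "MooncakeConnectorV1" comes from
# the review summary above; the model name and kv_role value are placeholders
# and may differ from what this PR actually registers or expects.
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --kv-transfer-config '{"kv_connector": "MooncakeConnectorV1",
                           "kv_role": "kv_both"}'
```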
Does this PR introduce any user-facing change?
How was this patch tested?