# [Feat] A Connector that supports Mooncake store #2913
Conversation
Co-authored-by: fems14 <1804143737@qq.com>
Co-authored-by: Dreamerleader <2270923832@qq.com>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: lizy124 <1950471827@qq.com>
Co-authored-by: zouyida2052 <zouyida2002@gmail.com>
Signed-off-by: LCAIZJ <leichao139636@163.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces a new KV cache connector, `MooncakeConnectorV1`, which seems to integrate with a distributed KV store called "kvpool" or "mooncake". The changes are extensive, adding several new modules for configuration, data transfer, and the core engine logic for this connector. The review identifies several critical runtime errors (`NameError`, `AttributeError`) due to undefined variables or incorrect assumptions about data structures, which would prevent the feature from working correctly. There are also issues with incorrect logic in key parsing and a hardcoded timeout that could impact production stability. These critical issues need to be addressed to ensure the functionality and robustness of the new connector.
vllm_ascend/attention/mla_v1.py
Outdated
```python
assert attn_metadata.num_decodes is not None and \
    attn_metadata.num_prefills is not None and \
    attn_metadata.num_decode_tokens is not None
self.wait_for_kv_layer_from_connector(layer.layer_name)
```
The variable `layer` is not defined in this scope, which will cause a `NameError` at runtime. The layer name is required here. A possible solution is to store `kv_sharing_target_layer_name` (which is passed to `__init__`) as an instance attribute, e.g., `self.layer_name`, and use it here. This same issue exists on line 1054.
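A minimal runnable sketch of the suggested fix. The class and method bodies below are simplified stand-ins, not the real attention implementation; only the pattern of storing `kv_sharing_target_layer_name` as `self.layer_name` follows the comment above.

```python
# Sketch of the suggested fix: keep the layer name that was passed to
# __init__ as an instance attribute, so later calls do not reference an
# undefined `layer` variable. Class and method bodies are simplified stubs.

class MLAImplSketch:
    def __init__(self, kv_sharing_target_layer_name: str):
        # Store the name once at construction time ...
        self.layer_name = kv_sharing_target_layer_name

    def wait_for_kv_layer_from_connector(self, layer_name: str) -> str:
        # Stand-in for the real connector wait call.
        return f"waited for {layer_name}"

    def forward(self) -> str:
        # ... and use it here instead of the undefined `layer.layer_name`.
        return self.wait_for_kv_layer_from_connector(self.layer_name)
```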
code removed
vllm_ascend/attention/mla_v1.py
Outdated
```python
    is_force_scatter=self.enable_shared_expert_dp)[0]
current_ms_metadata.after_comm_event.record()
del o_proj_input
self.maybe_save_kv_layer_to_connector(layer_name=layer.layer_name, kv_cache_layer=kv_cache)
```
code removed
```python
if len(parts) != 5:
    raise ValueError(f"Invalid key string: {s}")
```
The length check for `parts` is incorrect. The `to_string` method for `MooncakeEngineKey` produces a string with 4 parts separated by '@'. Therefore, the check should be `len(parts) != 4`. With the current check, this function will always raise a `ValueError` for valid keys.
```diff
-if len(parts) != 5:
+if len(parts) != 4:
     raise ValueError(f"Invalid key string: {s}")
```
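To illustrate why `4` is the right count, here is a hypothetical sketch of a key type whose `to_string` emits four '@'-separated fields. The field names are assumptions for illustration, not the actual `MooncakeEngineKey` definition.

```python
# Hypothetical key sketch: to_string() serializes exactly four fields, so
# from_string() must check for four parts; a `!= 5` check would reject
# every key this class produces. Field names are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class EngineKey:
    model_name: str
    world_size: int
    worker_id: int
    chunk_hash: str

    def to_string(self) -> str:
        return f"{self.model_name}@{self.world_size}@{self.worker_id}@{self.chunk_hash}"

    @classmethod
    def from_string(cls, s: str) -> "EngineKey":
        parts = s.split("@")
        if len(parts) != 4:  # matches the four fields emitted by to_string()
            raise ValueError(f"Invalid key string: {s}")
        return cls(parts[0], int(parts[1]), int(parts[2]), parts[3])
```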
code removed
```python
for start, end, key in self.token_database.process_tokens(tokens, mask):
    addr, size, _ = self.prepare_value(start, end, block_ids)
    self.m_store.put(key, addr, size)
    if is_last_chunk:
```
modified
```python
key_str = key.to_string()
try:
    res = self.store.batch_get_into_ascend(key_str, addr, size)
    if res[0] != expect_res:
```
modified
```python
request_ids = [req.req_id for req in scheduler_output.scheduled_new_reqs]
for request_id, (request, block_ids) in self._unfinished_requests.items():
    if not request_id in request_ids and not request_id in cached_reqs.req_ids:
```
Accessing `cached_reqs.req_ids` will raise an `AttributeError` when `cached_reqs` is a list. This happens when `use_v1_spec=False` in vLLM's configuration. You should handle both cases where `cached_reqs` is a list of objects or a dataclass with a `req_ids` attribute. A robust way would be to create a set of request IDs from `cached_reqs` before this loop.

For example:

```python
cached_req_ids_set = set()
if isinstance(cached_reqs, list):
    cached_req_ids_set = {req.req_id for req in cached_reqs}
else:
    cached_req_ids_set = set(cached_reqs.req_ids)
# ... later in the loop ...
if request_id not in request_ids and request_id not in cached_req_ids_set:
    # ...
```
ignored
```python
keys = [list(row) for row in zip(*keys, strict=False)]  # [num_layer, block_num]
for layer_id, keys_multi_chunk in enumerate(keys):
    if not first_flag:
        is_finish = self.get_event.wait(timeout=3)  # try---cache
```
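The review summary above also flags the hardcoded `timeout=3` as a production-stability risk. A hedged sketch of one way to make it configurable is below; the config key `get_timeout_s` and the default value are assumptions for illustration, not part of this PR.

```python
# Sketch: replace the hardcoded `self.get_event.wait(timeout=3)` with a
# value read from the connector's extra config. The key name and default
# here are assumptions; the real connector may wire this differently.
import threading

DEFAULT_GET_TIMEOUT_S = 3.0

class GetWaiter:
    def __init__(self, extra_config: dict):
        self.get_event = threading.Event()
        self.get_timeout = float(
            extra_config.get("get_timeout_s", DEFAULT_GET_TIMEOUT_S))

    def wait_for_chunk(self) -> bool:
        # Event.wait returns False on timeout instead of blocking forever,
        # so callers can detect a slow or failed fetch and fall back.
        return self.get_event.wait(timeout=self.get_timeout)
```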
modified
Signed-off-by: fems14 <1804143737@qq.com>
# MultiConnector + Mooncake Basic Scenario Verification & Pooling and Mixed Deployment Scenario Verification
Mooncake store deployment guide
modified
* Python >= 3.9, < 3.12
* CANN >= 8.2.rc1
* PyTorch >= 2.7.1, torch-npu >= 2.7.1.dev20250724
* vLLM: Mainline branch
main branch
modified
* PyTorch >= 2.7.1, torch-npu >= 2.7.1.dev20250724
* vLLM: Mainline branch
* vLLM-Ascend: Mainline branch
* Mooncake: [AscendTransport/Mooncake at pooling-async-memcpy](https://github.com/AscendTransport/Mooncake/tree/pooling-async-memcpy)
Double-check that the origin link is correct.
modified
* vLLM: Mainline branch
* vLLM-Ascend: Mainline branch
* Mooncake: [AscendTransport/Mooncake at pooling-async-memcpy](https://github.com/AscendTransport/Mooncake/tree/pooling-async-memcpy)
* mooncake-transfer-engine reference documentation: https://github.com/kvcache-ai/Mooncake/blob/main/doc/zh/ascend_transport.md
remove this one
modified
```
mooncake_master --port 50088
```

## multiConnector + mooncake basic scenario
Prefill/decode disaggregation scenario
modified
* CANN >= 8.2.rc1
* PyTorch >= 2.7.1, torch-npu >= 2.7.1.dev20250724
* vLLM: Mainline branch
* vLLM-Ascend: Mainline branch
main branch
modified
What this PR does / why we need it?
Added a new connector for Mooncake store integration to enable kvcache reuse in scenarios with system prompts or multi-turn dialogues.
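For context, KV connectors in vLLM are typically enabled through the `--kv-transfer-config` flag, which takes a JSON object. The sketch below is an assumption-laden example: the connector name is taken from the review summary in this thread, and the model and role values are placeholders, not settings verified against this PR.

```shell
# Hedged launch sketch. The connector name "MooncakeConnectorV1" comes from
# the review summary above; the model name and kv_role value are placeholders
# and may differ from what this PR actually registers or expects.
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --kv-transfer-config '{"kv_connector": "MooncakeConnectorV1",
                           "kv_role": "kv_both"}'
```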
Does this PR introduce any user-facing change?
How was this patch tested?