
Conversation


@IuoIun IuoIun commented Sep 2, 2025

What this PR does / why we need it?

Related issue: #2594

This PR migrates the P2pNcclConnector to the vllm-ascend project as P2pHcclConnector. It supports running disaggregated prefill in a similar way to the P2pNcclConnector in vLLM.

The only dependencies of P2pHcclConnector are the HCCL, context, and stream APIs in CANN, which is typically already installed when using Ascend NPUs. The connector can therefore be used with few pre-dependencies. This is helpful when you:

  • want to quickly run test cases for disaggregated prefill in a user environment;
  • run disaggregated prefill without introducing additional dependencies;
  • quickly run disaggregated prefill on a single node with HCCL and a simple config.

Does this PR introduce any user-facing change?

Yes, this PR introduces a new type of KVConnector (P2pHcclConnector) that can be enabled via configuration.
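For reference, the fields used in the test commands below map onto vLLM's KVTransferConfig. The following is a hedged sketch of the prefill-side config expressed through the Python API (assuming the upstream KVTransferConfig fields; it is not code from this PR):

from vllm import LLM
from vllm.config import KVTransferConfig

# Prefill-side (producer) configuration; the decode side would use
# kv_role="kv_consumer" and kv_rank=1, mirroring the serve commands below.
kv_config = KVTransferConfig(
    kv_connector="P2pHcclConnector",
    kv_role="kv_producer",
    kv_rank=0,
    kv_buffer_size=8e9,
    kv_port=21001,
    kv_connector_extra_config={
        "proxy_ip": "127.0.0.1",
        "proxy_port": "30001",
        "http_port": "20005",
        "send_type": "PUT_ASYNC",
    },
)
llm = LLM(model="path/to/model", kv_transfer_config=kv_config)  # model path is a placeholder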

How was this patch tested?

CANN version: 8.2.RC1

test case:

# set up variables and environment
export ASCEND_VISIBLE_DEVICES=5,4 # modify according to your environment
export ASCEND_RT_VISIBLE_DEVICES=0,1 # modify according to your environment
modelName=xx # use your own model
servedModelName=xxx # use your own model name

# for prefill instance
vllm serve ${modelName} --host 0.0.0.0 --port 20005 --seed 42 \
    --served-model-name ${servedModelName} --max-model-len 10000 --max-num-seqs 256 \
    --gpu-memory-utilization 0.7 --disable-log-request \
    --kv-transfer-config '{"kv_rank":0,"kv_connector":"P2pHcclConnector","kv_role":"kv_producer","kv_buffer_size":"8e9","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"127.0.0.1","proxy_port":"30001","http_port":"20005","send_type":"PUT_ASYNC"}}' \
    > prefill.log &

# for decode instance
vllm serve ${modelName} --host 0.0.0.0 --port 20009 --seed 1024 \
    --served-model-name ${servedModelName} --max-model-len 10000 --max-num-seqs 256 \
    --gpu-memory-utilization 0.7 --disable-log-request \
    --kv-transfer-config '{"kv_rank":1,"kv_connector":"P2pHcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"22001","kv_connector_extra_config":{"mem_pool_size_gb":256,"proxy_ip":"127.0.0.1","proxy_port":"30001","http_port":"20009","send_type":"PUT_ASYNC"}}' \
    > decode.log &

# start the proxy in vllm project
python vllm/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_proxy_p2p_nccl_xpyd.py > proxy.log &

# test request
curl http://localhost:10001/v1/completions -H "Content-Type: application/json" -d '{
    "model": "'$servedModelName'",
    "prompt": "Why is the sky blue?",
    "max_tokens": 100,
    "temperature": 0
}'

Signed-off-by: luolun <luolun1995@cmbchina.com>

github-actions bot commented Sep 2, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request migrates the P2pNcclConnector to a P2pHcclConnector for Ascend NPUs, enabling disaggregated prefill. The changes are extensive, introducing a new KVConnector, its engine, and HCCL wrappers. While the overall structure is sound, I've identified several critical issues, many of which appear to be copy-paste errors from the original NCCL implementation. These include incorrect API usage for HCCL streams, wrong exception types for error handling, and logging errors that could lead to runtime crashes. Addressing these issues is crucial for the stability and correctness of the feature on Ascend hardware.
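For readers unfamiliar with the Ascend stack, the stream issue referred to above is typically of this shape: stream handling copied verbatim from the NCCL engine uses the torch.cuda API, whereas on NPUs the torch_npu counterparts are needed. A hedged illustration (not code from this PR; it assumes torch_npu is installed and registers the torch.npu namespace):

import torch
import torch_npu  # noqa: F401  # registers the torch.npu namespace

# Select the NPU and create a dedicated stream for KV-cache transfers,
# using torch.npu.* where CUDA code would use torch.cuda.*.
torch.npu.set_device("npu:0")
transfer_stream = torch.npu.Stream()

with torch.npu.stream(transfer_stream):
    # HCCL send/recv work for the KV cache would be enqueued on this stream.
    pass

transfer_stream.synchronize()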

Comment on lines 458 to 493
    def get_finished(
        self, finished_req_ids: set[str], no_compile_layers
    ) -> tuple[Optional[set[str]], Optional[set[str]]]:
        """
        Notifies worker-side connector ids of requests that have
        finished generating tokens.
        Returns:
            ids of requests that have finished asynchronous transfer,
            tuple of (sending/saving ids, recving/loading ids).
            The finished saves/sends req ids must belong to a set provided in a
            call to this method (this call or a prior one).
        """

        # Clear the buffer upon request completion.
        for request_id in finished_req_ids:
            for layer_name in no_compile_layers:
                tensor_id = request_id + "#" + layer_name
                if tensor_id in self.recv_store:
                    with self.recv_store_cv:
                        tensor = self.recv_store.pop(tensor_id, None)
                        self.send_request_id_to_tensor_ids.pop(
                            request_id, None)
                        self.recv_request_id_to_tensor_ids.pop(
                            request_id, None)
                        if isinstance(tensor, tuple):
                            addr, _, _ = tensor
                            self.pool.free(addr)

        # TODO:Retrieve requests that have already sent the KV cache.
        finished_sending: set[str] = set()

        # TODO:Retrieve requests that have already received the KV cache.
        finished_recving: set[str] = set()

        return finished_sending or None, finished_recving or None
Severity: high

The get_finished method has an incomplete implementation, with TODO comments indicating that the logic for retrieving finished sending and receiving requests is missing. It currently returns (None, None), which might prevent the scheduler from correctly freeing resources for completed requests. This could lead to resource leaks or deadlocks. Please complete the implementation to correctly track and return the sets of finished request IDs.
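One possible shape for completing this, offered as a hedged sketch rather than code from this PR: let the send/recv worker threads record completed request ids in a small tracker, and have get_finished drain it after the buffer-clearing loop. All names below (the tracker class and its methods) are hypothetical.

import threading
from typing import Optional


class _FinishedTracker:
    """Collects request ids whose asynchronous KV transfer has completed."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._done_sending: set[str] = set()
        self._done_recving: set[str] = set()

    def mark_sent(self, request_id: str) -> None:
        # Called by the send worker once the KV cache for request_id is sent.
        with self._lock:
            self._done_sending.add(request_id)

    def mark_recved(self, request_id: str) -> None:
        # Called by the recv worker once the KV cache for request_id is received.
        with self._lock:
            self._done_recving.add(request_id)

    def drain(self) -> tuple[Optional[set[str]], Optional[set[str]]]:
        # Report each finished id exactly once so the scheduler can free resources.
        with self._lock:
            sending, self._done_sending = self._done_sending, set()
            recving, self._done_recving = self._done_recving, set()
        return sending or None, recving or None

get_finished could then return the tracker's drain() result instead of the hard-coded empty sets.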


@xueliangyang-oeuler xueliangyang-oeuler left a comment


good job !

register_model()


def register_kvconnector():
Collaborator


No need to add a new plugin; I think just registering the connector in https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/distributed/__init__.py is enough.
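To make the suggestion concrete, registration via vLLM's KVConnectorFactory in vllm_ascend/distributed/__init__.py could look roughly like this (the module path of the new connector is assumed for illustration):

from vllm.distributed.kv_transfer.kv_connector.factory import KVConnectorFactory

# Register the Ascend P2P HCCL connector under the name used in
# --kv-transfer-config; the module path below is an assumed location.
KVConnectorFactory.register_connector(
    "P2pHcclConnector",
    "vllm_ascend.distributed.kv_connector.p2p_hccl_connector",
    "P2pHcclConnector")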


Agreed, fixed

Collaborator


I think the module is too complex, vllm_ascend/distributed/kv_connector is OK IMO


Fixed

AttentionLayer, AttentionType)
from vllm.attention.backends.utils import CommonAttentionState
from vllm.config import VllmConfig
from vllm.distributed.kv_transfer import (get_kv_transfer_group,
Collaborator


The main branch now supports this change; please rebase to main.

scheduler_output (SchedulerOutput): the scheduler output object.
"""

meta = P2pNcclConnectorMetadata()
Collaborator


so here we reuse nccl metadata totally?


Yes, it can be reused at the moment


This pull request has conflicts, please resolve those before we can evaluate the pull request.

Directly register connector instead of using plugin.

2. add P2pHcclConnectorMetadata
3. fix connector register

ReqMeta.make_meta(request_id, token_ids, block_ids, block_size))


class P2pHcclConnector(KVConnectorBase_V1):
Collaborator


Can we inherit from the upstream class?
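For reference, the shape of what inheriting from upstream could look like, as a hedged sketch only (the upstream import path is taken from vLLM's P2P NCCL connector; how cleanly the NCCL engine can be swapped for an HCCL one depends on how the upstream class is factored):

from vllm.distributed.kv_transfer.kv_connector.v1.p2p.p2p_nccl_connector import (
    P2pNcclConnector)


class P2pHcclConnector(P2pNcclConnector):
    """Reuse the upstream scheduler-side bookkeeping and metadata handling;
    only the worker-side engine (HCCL instead of NCCL) would need replacing."""

The trade-off is reuse of upstream logic versus a dependency on upstream internals that are not a stable interface.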
