Commit d193316

[P/D] Bugfix zmq send/receive failed (#5503)
### What this PR does / why we need it?

Currently, when the MooncakeConnector communicates over ZeroMQ, it throws an exception when a send or receive fails.

**Issue 1:** The `zmq.REQ` socket currently in use follows a strict request-reply pattern that requires an alternating sequence of send → receive → send → receive... If either a send() or a receive() operation fails, the ZeroMQ socket becomes unusable.

**Solution:** When a send() or receive() exception occurs, close and delete the ZeroMQ socket, and recreate it on next use.

**Issue 2:** In `_handle_request`, if `_send_done_recv_signal` raises an exception, the exception propagates immediately and the subsequent code is not executed, so the decode logic fails to release the request properly.

**Solution:** Move the call to `_send_done_recv_signal` to the end of the function.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@45c1ca1

Signed-off-by: LCAIZJ <leichao139636@163.com>
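
Issue 2 above comes down to statement ordering inside a `finally` block. The following is a minimal sketch of that effect, not the connector's code: `notify_remote`, the two handlers, and the `state` dictionary are hypothetical stand-ins for the remote signal and the local bookkeeping.

```python
def notify_remote(fail: bool) -> None:
    """Hypothetical stand-in for a network call that may raise."""
    if fail:
        raise RuntimeError("simulated send/receive failure")


def handle_notify_first(state: dict, fail: bool) -> None:
    """Old ordering: the fallible call runs before the local bookkeeping."""
    try:
        pass  # the actual transfer work would go here
    finally:
        notify_remote(fail)        # if this raises, nothing below runs
        state["done"] = True       # stand-in for updating the task tracker
        state["released"] = True   # stand-in for request_queue.task_done()


def handle_cleanup_first(state: dict, fail: bool) -> None:
    """New ordering: local bookkeeping first, the fallible call last."""
    try:
        pass  # the actual transfer work would go here
    finally:
        state["done"] = True
        state["released"] = True
        notify_remote(fail)        # may still raise, but cleanup already ran


def run(handler, fail: bool = True) -> dict:
    """Run one handler and report which bookkeeping steps were reached."""
    state = {"done": False, "released": False}
    try:
        handler(state, fail)
    except RuntimeError:
        pass  # the caller survives; the question is what state was left behind
    return state


if __name__ == "__main__":
    print(run(handle_notify_first))
    print(run(handle_cleanup_first))
```

With the fallible call first, a failure leaves both flags unset. With the cleanup first, both are set even though the exception still propagates to the caller.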
1 parent 80fc0f5 commit d193316

1 file changed: +12 -5 lines changed

vllm_ascend/distributed/mooncake_connector.py

Lines changed: 12 additions & 5 deletions
@@ -417,6 +417,11 @@ def _handle_request(self, req_meta: dict[str, Any]):
                          f"{request_id}: {e}",
                          exc_info=True)
         finally:
+            if all_task_done:
+                self.task_tracker.update_done_task_count(request_id)
+                if request_id in self.proc_not_transfer_request:
+                    del self.proc_not_transfer_request[request_id]
+            self.request_queue.task_done()
             # Always send the done signal to the remote host to ensure proper
             # resource cleanup. Failing to do so may cause a memory leak on the
             # remote host.
@@ -425,11 +430,6 @@ def _handle_request(self, req_meta: dict[str, Any]):
                                         remote_port_send_num)
             self._send_done_signal_to_free_remote_port(request_id, remote_host,
                                                        remote_port_send_num)
-            if all_task_done:
-                self.task_tracker.update_done_task_count(request_id)
-                if request_id in self.proc_not_transfer_request:
-                    del self.proc_not_transfer_request[request_id]
-            self.request_queue.task_done()
 
     def _send_done_signal_to_free_remote_port(self, request_id, remote_host,
                                               remote_port_send_num):
@@ -698,6 +698,13 @@ def _send_done_recv_signal(self, request_id: str, remote_host: str,
                     request_id, remote_host, remote_handshake_port)
                 raise RuntimeError(
                     f"Failed to receive ACK, resp: {resp.decode('utf-8')}")
+        except RuntimeError as e:
+            if isinstance(sock, zmq.Socket):  # type: ignore
+                sock.close()
+                sock = None
+            logger.warning(
+                f"Unexpected error occurred in socket, {e}, closing the original channel"
+            )
         finally:
             if sock is not None:
                 self._return_remote_socket(sock, remote_host,
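
The pattern in the last hunk (close the socket on failure, set it to `None`, and let `finally` return only healthy sockets to the pool) can be sketched without ZeroMQ. This is an illustrative simulation under stated assumptions: `FakeReqSocket` and `SocketPool` are hypothetical classes that imitate the strict send/receive alternation of a REQ-style socket, and are not pyzmq or connector APIs.

```python
from typing import List, Optional


class FakeReqSocket:
    """Toy stand-in for a REQ-style socket: send and recv must alternate."""

    def __init__(self) -> None:
        self.awaiting_reply = False
        self.closed = False

    def send(self, msg: bytes) -> None:
        if self.awaiting_reply:
            raise RuntimeError("send() called while a reply is still pending")
        self.awaiting_reply = True

    def recv(self, fail: bool = False) -> bytes:
        if not self.awaiting_reply:
            raise RuntimeError("recv() called before send()")
        if fail:
            # Simulated failure: no reply arrives, so the socket stays
            # stuck in the "awaiting reply" state.
            raise RuntimeError("simulated receive failure")
        self.awaiting_reply = False
        return b"ACK"

    def close(self) -> None:
        self.closed = True


class SocketPool:
    """Hypothetical pool that hands out idle sockets or creates new ones."""

    def __init__(self) -> None:
        self.idle: List[FakeReqSocket] = []
        self.created = 0

    def get(self) -> FakeReqSocket:
        if self.idle:
            return self.idle.pop()
        self.created += 1
        return FakeReqSocket()

    def put(self, sock: FakeReqSocket) -> None:
        self.idle.append(sock)


def request(pool: SocketPool, fail: bool = False) -> Optional[bytes]:
    """Send one request; discard the socket instead of pooling it on failure."""
    sock: Optional[FakeReqSocket] = pool.get()
    try:
        sock.send(b"DONE")
        return sock.recv(fail=fail)
    except RuntimeError:
        sock.close()
        sock = None  # prevents the finally block from pooling a broken socket
        return None
    finally:
        if sock is not None:
            pool.put(sock)
```

After a failed `request`, the pool holds no idle socket, so the next call creates a fresh one rather than reusing one that is stuck mid-exchange.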
