Skip to content

Conversation

@nwpu-zxr
Copy link
Contributor

@nwpu-zxr nwpu-zxr commented Nov 4, 2025

What this PR does / why we need it?

Make kv-transfer env variable take effect & Fix load-balance proxy.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

By CI.

@github-actions
Copy link

github-actions bot commented Nov 4, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a configurable timeout for KV transfer by adding the get_transfer_timeout_value utility function. This function reads from environment variables to calculate the timeout, which is then used in llmdatadist_c_mgr_connector, mooncake_connector, and mooncake_layerwise_connector. The change is well-contained and addresses the goal of making the timeout configurable. I have one suggestion to improve the robustness of the new utility function by handling potential non-integer values from environment variables to prevent crashes.

Comment on lines 52 to 61
def get_transfer_timeout_value():
hccl_rdma_timeout = int(os.getenv('HCCL_RDMA_TIMEOUT', '20'))
hccl_rdma_retry_cnt = int(os.getenv('HCCL_RDMA_RETRY_CNT', '7'))
return int((4.096 * (2**hccl_rdma_timeout)) * hccl_rdma_retry_cnt // 1000 +
3000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation does not handle cases where the environment variables HCCL_RDMA_TIMEOUT or HCCL_RDMA_RETRY_CNT are set to non-integer values. This would cause int() to raise a ValueError, crashing the application. It's better to wrap the conversion in a try-except block to handle this case gracefully and fall back to default values.

Suggested change
def get_transfer_timeout_value():
hccl_rdma_timeout = int(os.getenv('HCCL_RDMA_TIMEOUT', '20'))
hccl_rdma_retry_cnt = int(os.getenv('HCCL_RDMA_RETRY_CNT', '7'))
return int((4.096 * (2**hccl_rdma_timeout)) * hccl_rdma_retry_cnt // 1000 +
3000)
def get_transfer_timeout_value():
try:
hccl_rdma_timeout = int(os.getenv('HCCL_RDMA_TIMEOUT', '20'))
hccl_rdma_retry_cnt = int(os.getenv('HCCL_RDMA_RETRY_CNT', '7'))
except ValueError: hccl_rdma_timeout, hccl_rdma_retry_cnt = 20, 7
return int((4.096 * (2**hccl_rdma_timeout)) * hccl_rdma_retry_cnt // 1000 +
3000)

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
@nwpu-zxr nwpu-zxr changed the title [0.11.0][P/D]Add kv-transfer timeout environment variable [0.11.0][P/D]Make kv-transfer env variable take effect & Fix load-balance proxy Nov 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant