-
Notifications
You must be signed in to change notification settings - Fork 542
[0.11.0][P/D]Make kv-transfer env variable take effect & Fix load-balance proxy #3983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: v0.11.0-dev
Are you sure you want to change the base?
Conversation
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a configurable timeout for KV transfer by adding the get_transfer_timeout_value utility function. This function reads from environment variables to calculate the timeout, which is then used in llmdatadist_c_mgr_connector, mooncake_connector, and mooncake_layerwise_connector. The change is well-contained and addresses the goal of making the timeout configurable. I have one suggestion to improve the robustness of the new utility function by handling potential non-integer values from environment variables to prevent crashes.
| def get_transfer_timeout_value(): | ||
| hccl_rdma_timeout = int(os.getenv('HCCL_RDMA_TIMEOUT', '20')) | ||
| hccl_rdma_retry_cnt = int(os.getenv('HCCL_RDMA_RETRY_CNT', '7')) | ||
| return int((4.096 * (2**hccl_rdma_timeout)) * hccl_rdma_retry_cnt // 1000 + | ||
| 3000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation does not handle cases where the environment variables HCCL_RDMA_TIMEOUT or HCCL_RDMA_RETRY_CNT are set to non-integer values. This would cause int() to raise a ValueError, crashing the application. It's better to wrap the conversion in a try-except block to handle this case gracefully and fall back to default values.
| def get_transfer_timeout_value(): | |
| hccl_rdma_timeout = int(os.getenv('HCCL_RDMA_TIMEOUT', '20')) | |
| hccl_rdma_retry_cnt = int(os.getenv('HCCL_RDMA_RETRY_CNT', '7')) | |
| return int((4.096 * (2**hccl_rdma_timeout)) * hccl_rdma_retry_cnt // 1000 + | |
| 3000) | |
| def get_transfer_timeout_value(): | |
| try: | |
| hccl_rdma_timeout = int(os.getenv('HCCL_RDMA_TIMEOUT', '20')) | |
| hccl_rdma_retry_cnt = int(os.getenv('HCCL_RDMA_RETRY_CNT', '7')) | |
| except ValueError: hccl_rdma_timeout, hccl_rdma_retry_cnt = 20, 7 | |
| return int((4.096 * (2**hccl_rdma_timeout)) * hccl_rdma_retry_cnt // 1000 + | |
| 3000) |
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
15d4dcb to
4f88ca7
Compare
What this PR does / why we need it?
Make kv-transfer env variable take effect & Fix load-balance proxy.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
By CI.