[0.9.1][PD] Added support for delay-free blocks in prefill nodes #1691
Conversation
Signed-off-by: underfituu <hzhucong@163.com>
Force-pushed from 77c41bd to a8a3954
Force-pushed from 259ff3c to 2d2fc36
else:
    raise RuntimeError(
        f"LLMDataDistCMgrConnectorWorker: Receiving unexpected request event {event_msg} from remote!"
    )

def _increment_task_count(self, request_id: str, tp_rank: int,
                          decode_tp_size: int):
    if tp_rank in self.done_receiving_counts[request_id]:
Better check if the request_id is already inside self.done_receiving_counts, for safety.
Where did you add your request_id? Seems I can't find it in this diff.
Thanks! Added explicit request_id existence check and early initialization for safety.
What this PR does / why we need it?
PD Logic Analysis:
In the current implementation, the P-node releases memory blocks immediately after completing inference. Under high-concurrency scenarios, if the P-node's inference speed significantly outpaces the D-node's block-pulling operations, blocks can be reclaimed before the D-node has finished pulling their KV cache, leading to memory block contention/corruption.
Current Solution:
The D-node sends acknowledgment messages to the worker connector in the P-node's driver worker after completing data reception. The P-node maintains a counter to track these acknowledgments: memory blocks are released only after confirmations have arrived from all D-node worker connectors involved in the KV cache transfer.
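The acknowledgment-counting scheme above can be sketched as follows. Names like `PrefillAckTracker` and `on_ack` are hypothetical; only the idea of releasing blocks after all decode workers have acknowledged comes from the PR description.

```python
from collections import defaultdict

class PrefillAckTracker:
    """Sketch of the P-node side: KV blocks for a request are freed only
    after every decode worker involved in the transfer has acknowledged."""

    def __init__(self):
        # request_id -> set of decode tp_ranks that confirmed reception
        self._acks = defaultdict(set)

    def on_ack(self, request_id: str, tp_rank: int,
               decode_tp_size: int) -> bool:
        """Record one D-node ack; return True when the request's memory
        blocks are safe to release."""
        self._acks[request_id].add(tp_rank)
        if len(self._acks[request_id]) == decode_tp_size:
            del self._acks[request_id]   # all workers have pulled their KV
            return True                  # caller may now free the blocks
        return False
```

The caller would invoke block release only when `on_ack` returns True, which closes the race between prefill completing and decode pulling the cache.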
Does this PR introduce any user-facing change?
How was this patch tested?