[Draft] optimize dp allreduce #2492

Conversation
Signed-off-by: Angazenn <supperccell@163.com>
Code Review

This pull request optimizes the data parallel communication in `_get_forward_metadata_across_dp` by replacing a CPU-based `all_reduce` with an NPU-based `all_gather`. This is a good optimization that moves the collective operation onto the accelerator. The new implementation also appears to correct the logic for determining `enable_dbo` across DP ranks, using an `any` reduction, which seems more appropriate. I have one suggestion to further improve performance by minimizing data transfer between the NPU and CPU.
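To make the pattern concrete, here is a minimal sketch of a gather-based metadata exchange of the kind described above. It assumes a vLLM-style DP group object exposing an `all_gather` method and a `torch_npu` environment providing the `"npu"` device; the function name `gather_forward_metadata` and its exact fields are illustrative, not the project's actual code.

```python
import torch


def gather_forward_metadata(dp_group, num_tokens: int,
                            with_prefill: bool, enable_dbo: bool):
    # Pack the per-rank scalars into a single int32 device tensor so that
    # one NPU all_gather replaces several CPU-side all_reduce calls.
    local = torch.tensor([[num_tokens, int(with_prefill), int(enable_dbo)]],
                         dtype=torch.int32, device="npu")
    gathered = dp_group.all_gather(local, dim=0)  # shape: [dp_size, 3]

    # Reduce on the device, then copy only scalars back to the host.
    max_tokens = int(gathered[:, 0].max().item())
    any_prefill = bool(gathered[:, 1].any().item())
    any_dbo = bool(gathered[:, 2].any().item())
    return max_tokens, any_prefill, any_dbo
```

Packing the fields into one tensor keeps the exchange to a single collective plus scalar-sized device-to-host copies, rather than one CPU collective per field.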
```python
        dtype=torch.int32)
global_forward_metadata = get_dp_group().all_gather(
    local_forward_metadata, dim=0)
maybe_padded_num_tokens = global_forward_metadata[:, 0].cpu().max()
```
For better performance, it's generally recommended to perform reduction operations on the device (NPU) and transfer only the scalar result to the CPU. This avoids synchronizing and copying a larger tensor. You can change this line to perform the `max()` operation on the NPU before moving the result to the CPU with `.item()`.
```diff
- maybe_padded_num_tokens = global_forward_metadata[:, 0].cpu().max()
+ maybe_padded_num_tokens = global_forward_metadata[:, 0].max().item()
```
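As a standalone illustration of the reviewer's point, this hedged sketch contrasts the two transfer patterns; the tensor shape and the `"npu"` device are assumptions for demonstration only.

```python
import torch

# Assumes torch_npu is installed so the "npu" device is available.
gathered = torch.randint(1, 4096, (16, 3), dtype=torch.int32, device="npu")

# Pattern A: copies the whole 16-element column to the CPU, reduces there.
max_on_cpu = gathered[:, 0].cpu().max()

# Pattern B: reduces on the NPU first, then transfers a single scalar.
max_on_npu = gathered[:, 0].max().item()

assert max_on_cpu.item() == max_on_npu
```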
done
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #2492      +/-   ##
==========================================
+ Coverage   77.70%   78.49%   +0.78%
==========================================
  Files         132      132
  Lines       17521    17806     +285
==========================================
+ Hits        13615    13976     +361
+ Misses       3906     3830      -76
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?