[BugFix] Fix world size bug in model_runner. #2897
Conversation
Signed-off-by: whx-sjtu <2952154980@qq.com>
Code Review
This pull request correctly fixes a bug in the MoE communication method selection by using world_size_across_dp instead of the per-group world_size, which is the appropriate quantity in data-parallel scenarios. My review identifies a critical oversight: the corresponding unit test has not been updated to reflect this logic change, leaving the fix unverified. An update to the test is required to ensure the correctness of this change.
             moe_comm_method = "allgather"
         elif soc_version in {AscendSocVersion.A2}:
-            if num_tokens <= self.mc2_tokens_capacity and self.parallel_config.world_size >= 16:
+            if num_tokens <= self.mc2_tokens_capacity and self.parallel_config.world_size_across_dp >= 16:
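To illustrate the gating condition the diff changes, here is a hypothetical, simplified sketch of the selection logic. The enum name AscendSocVersion and the attribute mc2_tokens_capacity follow the snippet above; the return values and the standalone function shape are assumptions for illustration and do not reproduce the real model_runner method, which has more branches.

```python
from enum import Enum


class AscendSocVersion(Enum):
    # minimal stand-in for the real SoC-version enum
    A2 = "A2"
    A3 = "A3"


def select_moe_comm_method(soc_version, num_tokens,
                           mc2_tokens_capacity, world_size_across_dp):
    """Pick a MoE communication method (illustrative sketch).

    The key point of the fix: the rank count gating MC2 must be the
    world size across all data-parallel groups, not the per-group
    world size, because the communication spans every DP group.
    """
    if soc_version in {AscendSocVersion.A2}:
        if num_tokens <= mc2_tokens_capacity and world_size_across_dp >= 16:
            return "mc2"
        return "allgather"
    # other SoC versions fall back to allgather in this sketch
    return "allgather"
```

With a per-group world_size of 8 but 16 ranks across two DP groups, the old condition would wrongly reject MC2; gating on world_size_across_dp selects it.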
While the change to use world_size_across_dp is correct for selecting the MoE communication method in a data-parallel setup, the corresponding unit test needs to be updated. The test test_select_moe_comm_method in tests/ut/worker/test_model_runner_v1.py still mocks parallel_config.world_size and will likely fail, or pass for the wrong reason, after this change. To ensure this bug fix is properly tested, please update the unit test to mock parallel_config.world_size_across_dp instead.
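A sketch of the kind of test update the review asks for, using unittest.mock. The test name test_select_moe_comm_method comes from the review comment; the choose helper and the runner fixture here are hypothetical stand-ins for the real code under test, shown only to demonstrate mocking world_size_across_dp rather than world_size.

```python
from unittest.mock import MagicMock


def choose(runner, num_tokens):
    # hypothetical stand-in for the logic under test in model_runner
    if (num_tokens <= runner.mc2_tokens_capacity
            and runner.parallel_config.world_size_across_dp >= 16):
        return "mc2"
    return "allgather"


def test_select_moe_comm_method():
    runner = MagicMock()
    runner.mc2_tokens_capacity = 512
    # mock the attribute the fixed code now reads,
    # not the old parallel_config.world_size
    runner.parallel_config.world_size_across_dp = 16
    assert choose(runner, 256) == "mc2"

    # below the 16-rank threshold, MC2 must not be selected
    runner.parallel_config.world_size_across_dp = 8
    assert choose(runner, 256) == "allgather"
```

Because MagicMock auto-creates attributes on access, a test that only sets world_size would still run against the fixed code but would exercise a meaningless auto-generated value, which is why the mock must target the attribute the code actually reads.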
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: whx-sjtu <2952154980@qq.com>
Not sure whether this is a real bug or not; just retriggering this.
Notice the current CI is running in
Merged by #2915.