whx-sjtu
Collaborator

This PR tries to fix an accuracy problem with DeepSeek in the pure expert-tensor-parallel case (ep=1, etp>1). There are two problems in total:

  1. Fix a bug that incorrectly sets the value of tp_rank when ep_size=1. This code was introduced by @ganyi1996ppo, and I'm not sure whether I can delete it directly without affecting other functionality, especially in the data-parallel case. CC @ganyi1996ppo @yiz-liu
  2. The other problem is related to torch_npu.npu_moe_finalize_routing in fused_experts, and I'm working on solving it.

whx-sjtu and others added 2 commits May 20, 2025 22:49
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: zzzzwwjj <1183291235@qq.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
@wangxiyuan wangxiyuan added the ready read for review label May 28, 2025
@wangxiyuan wangxiyuan changed the title from "[BugFix][WIP] Fix accuray problems with deepseek in situation of ep=1, etp>1" to "[BugFix] Fix accuray problems with deepseek in situation of ep=1, etp>1" May 28, 2025
Collaborator

@Yikun Yikun left a comment

Seems already included by #985?

self.local_num_experts, self.expert_map = determine_expert_map(
    self.ep_size,
    get_ep_group().rank_in_group, self.global_num_experts)
if vllm_version_is("0.8.5") or vllm_version_is("0.8.5.post1"):
Collaborator

This isn't needed anymore; see #959.
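
For readers unfamiliar with the call quoted in this review thread, the following is a minimal pure-Python sketch of the kind of mapping an expert-parallel partition produces. It is an illustration only, not the vLLM or vllm-ascend implementation: the name `sketch_expert_map` is hypothetical, plain lists stand in for tensors, and it assumes experts are split evenly across ranks with the remainder going to the last rank and `-1` marking experts that are not local. It also shows why ep_size=1 is a special case: every rank keeps all experts, so no map is needed and any sharding has to come from expert tensor parallelism instead.

```python
from typing import List, Optional, Tuple


def sketch_expert_map(
    ep_size: int, ep_rank: int, global_num_experts: int
) -> Tuple[int, Optional[List[int]]]:
    """Hypothetical helper: map global expert ids to local ids for one rank.

    Returns (local_num_experts, expert_map). expert_map[g] is the local
    index of global expert g on this rank, or -1 if g lives elsewhere.
    """
    if ep_size < 1:
        raise ValueError("ep_size must be at least 1")
    if not 0 <= ep_rank < ep_size:
        raise ValueError("ep_rank must be in [0, ep_size)")

    # Without expert parallelism every rank holds every expert,
    # so no remapping is required.
    if ep_size == 1:
        return global_num_experts, None

    # Assumed policy: even split, remainder assigned to the last rank.
    base = global_num_experts // ep_size
    start = ep_rank * base
    if ep_rank == ep_size - 1:
        local_num_experts = global_num_experts - start
    else:
        local_num_experts = base

    expert_map = [-1] * global_num_experts
    for local_idx in range(local_num_experts):
        expert_map[start + local_idx] = local_idx
    return local_num_experts, expert_map
```

Under these assumptions, `sketch_expert_map(1, 0, 8)` yields no map at all, while `sketch_expert_map(2, 1, 8)` assigns global experts 4 through 7 to local indices 0 through 3 on the second rank.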

github-actions bot commented Jun 3, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions github-actions bot added merge-conflicts and removed ready read for review labels Jun 3, 2025
@whx-sjtu whx-sjtu closed this Jul 9, 2025
@whx-sjtu whx-sjtu deleted the fix_etp_acc branch July 9, 2025 07:10