SwiftBalancer: Zero-Overhead Expert Movement #1855


Closed
wants to merge 48 commits into from

Conversation

raindaywhu

@raindaywhu raindaywhu commented Jul 17, 2025

What this PR does / why we need it?

Dynamic expert load balancing for MoE LLM models

Does this PR introduce any user-facing change?

How was this patch tested?

@@ -37,6 +37,7 @@ def __init__(self, vllm_config):
ascend_scheduler_config)

self.expert_map_path = additional_config.get("expert_map_path", None)
self.dynamic_eplb = additional_config.get("dynamic_eplb", False)
Collaborator

@wangxiyuan wangxiyuan Jul 21, 2025

Can we use the vLLM EPLB config `enable_eplb` instead of adding a new config?
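The diff above adds `dynamic_eplb` next to the existing `expert_map_path` lookup. A minimal sketch of that pattern, assuming `additional_config` is a plain dict (the helper function name is illustrative, not part of the PR):

```python
# Hypothetical helper mirroring the diff: read the EPLB-related flags out
# of vLLM's additional_config dict, with the same defaults as the snippet.
def read_eplb_flags(additional_config: dict) -> tuple:
    expert_map_path = additional_config.get("expert_map_path", None)
    dynamic_eplb = additional_config.get("dynamic_eplb", False)
    return expert_map_path, dynamic_eplb

print(read_eplb_flags({"dynamic_eplb": True}))  # (None, True)
print(read_eplb_flags({}))                      # (None, False)
```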


from abc import ABC, abstractmethod

class EplbAdaptor():
Collaborator

What is this abstract used for?

Author

> What is this abstract used for?

It is an abstraction over the SGLang/vLLM backends.
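A minimal sketch of what such an adaptor base class could look like, assuming the goal is to hide backend differences behind one abstract interface (the subclass and the constructor argument are illustrative, not the PR's exact API):

```python
from abc import ABC, abstractmethod

# Sketch: each serving backend (vLLM, SGLang) provides its own way to
# reach expert weights/maps; the EPLB logic only sees this interface.
class EplbAdaptor(ABC):
    @abstractmethod
    def get_expert_map(self, layer_id: int):
        """Return the expert placement map for one MoE layer."""

class VllmEplbAdaptor(EplbAdaptor):
    def __init__(self, expert_maps: dict):
        self.expert_maps = expert_maps  # layer_id -> placement list

    def get_expert_map(self, layer_id: int):
        return self.expert_maps[layer_id]

adaptor = VllmEplbAdaptor({0: [0, 1, 2, 3]})
print(adaptor.get_expert_map(0))  # [0, 1, 2, 3]
```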

@@ -773,6 +775,32 @@ def load_weights(self, weights: Iterable[tuple[str,

return loaded_params

def get_expert_map(self, layer_id):
Collaborator

vLLM has the MixtureOfExperts interface; once we contribute this to vLLM, these functions should be moved there.

And what about the Qwen MoE model?

Author

Qwen MoE is under test now; it will be submitted in another PR.
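As a rough illustration of what a per-layer `get_expert_map` accessor returns, here is a pure-Python stand-in (the PR itself holds torch tensors on device; everything below except the method name `get_expert_map` is invented for the sketch):

```python
# Stand-in for an MoE model: per layer, map each global expert id to the
# local slot holding it on this rank, or -1 if the expert is not resident.
class MoEModelStub:
    def __init__(self, num_layers: int, num_experts: int, local_experts: dict):
        # local_experts: layer_id -> {global_expert_id: local_slot}
        self.expert_maps = []
        for layer_id in range(num_layers):
            row = [-1] * num_experts
            for gid, slot in local_experts.get(layer_id, {}).items():
                row[gid] = slot
            self.expert_maps.append(row)

    def get_expert_map(self, layer_id: int):
        return self.expert_maps[layer_id]

m = MoEModelStub(2, 4, {0: {1: 0, 3: 1}})
print(m.get_expert_map(0))  # [-1, 0, -1, 1]
```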

for name in self.expert_weight_names]
)

# def collect_topk_ids(self, dummy_run=False):
Collaborator

Remove the commented-out code.


done


class DynamicTable:
# workload_table:
# 3-D matrix, [layer, gpus, experts_per_gpu_per_layer] -> value: hotness of the expert at that position
Collaborator

Use English.


done
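The workload table described in that comment can be sketched with NumPy (the shape values below are made-up examples; only the `[layer, gpu, experts_per_gpu_per_layer]` layout comes from the comment):

```python
import numpy as np

# 3-D workload table: [layer, gpu, expert_slot] -> hotness counter for
# the expert currently placed in that slot.
num_layers, num_gpus, experts_per_gpu = 2, 4, 8
workload_table = np.zeros((num_layers, num_gpus, experts_per_gpu), dtype=np.int64)

# Record that slot 3 on GPU 1 in layer 0 served 128 tokens this window.
workload_table[0, 1, 3] += 128
print(workload_table[0, 1, 3])  # 128
print(workload_table.shape)     # (2, 4, 8)
```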

import torch
import random

class ExpertMapUtils():
Collaborator

Using a class here is meaningless; plain module-level functions would do.
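To illustrate the reviewer's point, a helper like those in `ExpertMapUtils` could live as a plain module-level function instead of a class method. The function below is a hypothetical sketch, not the PR's code (the PR's file does import `random`):

```python
import random

def random_expert_placement(num_experts: int, num_ranks: int, seed: int = 0):
    """Shuffle global expert ids and split them evenly across ranks."""
    rng = random.Random(seed)
    experts = list(range(num_experts))
    rng.shuffle(experts)
    per_rank = num_experts // num_ranks
    return [experts[i * per_rank:(i + 1) * per_rank] for i in range(num_ranks)]

placement = random_expert_placement(8, 2)
# Every expert appears exactly once across the two ranks.
print(sorted(placement[0] + placement[1]))  # [0, 1, 2, 3, 4, 5, 6, 7]
```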

@@ -0,0 +1,65 @@
import numpy as np
Collaborator

Move this file to the examples folder.


done

@@ -0,0 +1,114 @@
#
Collaborator

Remove the tool folder.


done

@@ -0,0 +1,408 @@
#
Collaborator

The worker module has only one file; I think the module is unnecessary.


The worker module has been removed.

@@ -0,0 +1,39 @@
#
Collaborator

@wangxiyuan wangxiyuan Jul 21, 2025

vllm_ascend/eplb/__init__.py is missing


fixed


return list(zip(send_all, recv_all, maps, log2phy_all, layer_ids))

class EplbProcess:
Collaborator

What will happen if the EplbProcess goes down in a worker?

Author

EPLB will no longer update the expert maps; however, forwarding continues.
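That failure mode can be sketched as follows, using a background thread as a stand-in for the separate EPLB process (all names below are illustrative, not the PR's implementation):

```python
import queue
import threading

# Stand-in for the EPLB planner: it publishes one rebalancing plan to the
# worker, then crashes. The worker keeps serving with the last plan it got.
def eplb_planner(out_q: queue.Queue):
    out_q.put({"layer_0": [0, 1, 2, 3]})   # publish one plan...
    raise RuntimeError("planner crashed")  # ...then the planner goes down

q = queue.Queue()
t = threading.Thread(target=eplb_planner, args=(q,), daemon=True)
t.start()
t.join()                 # the planner is now dead
current_map = q.get()    # worker still holds the last published plan
# Forwarding continues with the stale map; the worker itself never crashed.
print(current_map)  # {'layer_0': [0, 1, 2, 3]}
```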

@github-actions bot added the `documentation` label (Improvements or additions to documentation) Jul 22, 2025
郭惜缘 and others added 27 commits July 22, 2025 15:21
# Conflicts:
#	vllm_ascend/eplb/tool/eplb_utils.py
… into whq-v091-new

* 'whq-v091-new' of https://github.com/845473182/vllm-ascend:
  fix import
  fix param bug
  fix param bug
  fix registration reference error
  fix registration reference error
Signed-off-by: raindaywhu <raindaywhu@163.com>
Signed-off-by: raindaywhu <raindaywhu@163.com>
Signed-off-by: raindaywhu <raindaywhu@163.com>
@raindaywhu raindaywhu closed this Jul 23, 2025
ganyi1996ppo pushed a commit that referenced this pull request Jul 24, 2025
 What this PR does / why we need it?

Dynamic expert load balancing for MoE LLM models

Co-authored-by: wanghanqingLYT
[hqwang12345@sina.com](mailto:hqwang12345@sina.com)
Co-authored-by: njuyuan
[yuanjl19@smail.nju.edu.cn](mailto:yuanjl19@smail.nju.edu.cn)
Co-authored-by: qmkakaxi
[wjh1594260677@qq.com](mailto:wjh1594260677@qq.com)
Co-authored-by: Skywalker-EP [173723846@qq.com](mailto:173723846@qq.com)
Co-authored-by: ZhengWG [zwg0606@gmail.com](mailto:zwg0606@gmail.com)
Co-authored-by: GuoXiYuan [496444320@qq.com](mailto:496444320@qq.com)
Co-authored-by: zyy-hw
[zhangyuanyun@huawei.com](mailto:zhangyuanyun@huawei.com)
Co-authored-by: ltdo111 [1061328217@qq.com](mailto:1061328217@qq.com)
 

Fix commit CI of PR #1855

### Does this PR introduce _any_ user-facing change?


### How was this patch tested?


---------

Signed-off-by: raindaywhu <raindaywhu@163.com>
Signed-off-by: wanghanqingLYT <wanghanqing3@huawei.com>
Co-authored-by: wanghanqingLYT <wanghanqing3@huawei.com>

5 participants