FlashLB algorithm #3042
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces FlashLB, a sophisticated new load balancing algorithm. The implementation leverages Numba for performance, which is a good choice for the complex numerical computations involved. However, I've identified several critical issues in the implementation that could lead to incorrect behavior or instability: shared mutable class state, error handling that terminates the process, multiple potential division-by-zero errors, and unhandled edge cases that can cause crashes. I have provided detailed comments and suggestions to address these problems. Additionally, there is a high-severity issue with a module-level warmup function that causes a significant side effect on import. Addressing these points is crucial for the stability and correctness of the new policy.
FlashLB.par_history = defaultdict(float)
FlashLB.hotness_window = {}

warm_up()
Calling warm_up() at the module level is a significant side effect that occurs upon import. This can make module loading slow and unpredictable. This function, which seems to be for JIT compilation warmup, should be called explicitly by the application that initializes the load balancing system, rather than being a side effect of importing this module.
Suggested change:
- warm_up()
+ # warm_up()
The warm-up call at the module level is intentionally designed to trigger JIT compilation during import rather than at inference time. This approach ensures that numba's just-in-time compilation overhead is incurred upfront during module loading, preventing unexpected latency spikes during actual inference operations.
While module imports may take slightly longer, this trade-off is necessary to guarantee consistent performance during runtime, where low and predictable latency is critical for the load balancing system's effectiveness. Explicitly deferring this warm-up to application initialization could lead to missed compilation steps and subsequent runtime delays if the initialization sequence is not strictly followed.
The current implementation prioritizes inference-time performance predictability over import speed, which aligns with the requirements of latency-sensitive load balancing operations.
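For context, the pattern being discussed is simply to call the Numba-jitted kernels once with tiny dummy inputs so that compilation happens eagerly. A minimal sketch with a hypothetical kernel (not the actual FlashLB code) looks like this:

```python
import numpy as np
from numba import njit


@njit(cache=True)
def _balance_kernel(hotness):
    # Stand-in for the real jitted computation: compute each expert's
    # share of total hotness so Numba has something to compile.
    total = hotness.sum()
    if total == 0:
        return np.zeros_like(hotness)
    return hotness / total


def warm_up():
    # Call the jitted kernel once with tiny dummy inputs so the JIT
    # compilation cost is paid up front, not on the first real rebalance.
    _balance_kernel(np.ones(8, dtype=np.float64))
```

Whether this runs at import time or inside the policy factory (as proposed below) only changes when the one-time compilation cost is paid.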
Force-pushed from d55529a to 07287ed
algo.rebalance_experts(expert_tensor, torch.randint(1, 1000, (58, 32, 9)))

warm_up()
Thanks for the contribution. Only one question: should warm_up be called when the module is imported? Is there any way to make it run only when the policy is set?
Another approach is to incorporate a check within policy_factory.py that triggers the warm_up operation exclusively when the RealTimeLB policy is selected. This way, the warm_up process will only run if and when the RealTimeLB policy is actually instantiated (i.e., when this specific policy is set), avoiding unnecessary execution for other policy types.
Would this align with your intended implementation? Please let me know if you need further adjustments.
class PolicyFactory:

    @staticmethod
    def generate_policy(policy_type: int, config: DynamicConfig) -> EplbPolicy:
        policy = {
            # Constraint applying Dynamic EPLB policy V2:
            # If there exists a redundant expert:
            # only one redundant expert can be placed in one NPU and its physical expert index must be 0
            # Applying greedy d2d expert weight update composing
            0: RandomLoadBalance,  # RandomLoadBalance: shuffle last physical expert on NPU 1 and 3
            1: DynamicEplb,        # Dynamic EPLB policy: overall expert replacement based on current moe load
            2: DynamicEplbV2,      # Dynamic EPLB policy V2: expert replacement with constrained number of expert shuffle
            3: FlashLB,            # FlashLB EPLB policy: expert replacement based on Joint Optimization, Multi-Shot Enhancement and Incremental Adjustment
        }
        policy_class = policy.get(policy_type, RandomLoadBalance)
        policy_instance = policy_class(config)
        if policy_type == 3:
            policy_instance.warm_up()
        return policy_instance
…t#3005) Add missing barrier when no implicit synchronization by `repeat_interleave` is available. Otherwise, the `non_blocking=True` copy of `output_splits` and `input_splits` from NPU may fail to complete before later `async_all_to_all` uses them. ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@ef7eefe Signed-off-by: sdmyzlp <lrwei2@petalmail.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
### What this PR does / why we need it? Register `AscendSharedFusedMoE` custom op. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? `DeepSeek-V2-Lite` is a MoE model with shared experts. Test: ```bash vllm serve /root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite \ --trust-remote-code \ --enforce-eager \ --no-enable-prefix-caching \ --gpu-memory-utilization 0.95 curl -X POST http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "/root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite", "messages": [ {"role": "user", "content": "介绍一下联通公司?"} ], "stream": false, "max_tokens": 100 }' ``` Output: ```bash 中国联合网络通信集团有限公司(简称“中国联通”)于2009年1月6日在原中国网通和原中国联通的基础上合并组建而成,在国内31个省(自治区、直辖市)和境外多个国家和地区设有分支机构,是中国唯一一家在纽约、香港、上海三地同时上市的电信运营企业,连续多年入选“世界500强企业”。\n\n中国联通主要经营固定通信业务,移动通信业务,国内 ``` - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@486c559 --------- Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com> Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
### What this PR does / why we need it? Increase doctest timeout to 300s and time print, according to time print in vllm-project#3045 , most of time consumed in `Graph capturing`, so I think it's fine to increase doctest timeout This PR also add time log for each task. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Run `/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh` - CI passed - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@a684c01 Closes: vllm-project#3045 Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
vllm-project#3021) ### What this PR does / why we need it? Some custom models in vllm-ascend define packed_modules_mapping, which prevent keeping same model class with vllm community. So move these custom packed_modules_mapping to quant utils.py. After this pr, some custom models can be removed. ### Does this PR introduce _any_ user-facing change? tested by CI ### How was this patch tested? tested by CI - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@5089fd7 Signed-off-by: 22dimensions <waitingwind@foxmail.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
### What this PR does / why we need it? According to issue [vllm-project#1298 ](vllm-project#1298) ,this pull request adds unit test code for compilation/acl_graph.py. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@f2718d2 --------- Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com> Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…ze (vllm-project#2830) ### What this PR does / why we need it? Fix shape not match when test LLM-Research/Phi-4-mini-instruct accuarcy ### Does this PR introduce _any_ user-facing change? Users can't set dynamic batch_size or use lm_eval test accuracy when using models(sliding_window) ### How was this patch tested? accuarcy of LLM-Research/Phi-4-mini-instruct is ok : ``` vllm (pretrained=LLM-Research/Phi-4-mini-instruct,max_model_len=4096,dtype=auto,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr| |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:| |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8105|± |0.0108| | | |strict-match | 5|exact_match|↑ |0.8097|± |0.0108| ``` - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@3c96e7b Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…lm-project#3047) ### What this PR does / why we need it? 1. update expected accuracy for DeepSeek-V2-Lite 2. add batch size ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Accuracy CI passed - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@838d711 Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
### What this PR does / why we need it? This PR prepares for deleting this environment variable, `VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE`, as vllm requires `fullgraph=True` to run - Fixes vllm-project/vllm#21834 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? See CI - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@99cc41a --------- Signed-off-by: Lucas Kabela <lucaskabela@meta.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…llm-project#2907) ### What this PR does / why we need it? 1. This pr bump vllm commit to vllm-project/vllm@6d8246a 2. fix upstream changes vllm-project/vllm#24548 abort multi-modal kwargs, make vllm main and `v0.10.2` both adaptable 3. fix metadata_builder changes introduced by vllm-project/vllm#23693 4. fix `structured_outputs_config` changes introduced by vllm-project/vllm#22772 5. fix `moe_config` changes introduced by vllm-project/vllm#22537 Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com> - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@c60e613 --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: MengqingCao <cmq0113@163.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…m-project#3067) ### What this PR does / why we need it? Bump main to vllm-project/vllm@c60e613 - Updated imports in `vllm.config` to `vllm.config.model`(vllm-project/vllm@aed1687) vllm-project/vllm#25252 - Refactored `vllm_ascend/sample/sampler.py` to use string values for `logprobs_mode` instead of the `LogprobsMode` enum, simplifying logprobs mode handling and improving compatibility with recent vLLM changes (vllm-project/vllm@aed1687) vllm-project/vllm#25252 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@6d8246a --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…oject#3064) ### What this PR does / why we need it? 1. Refactor ci to reuse base workflow and enable main 2 hours trigger job: - Extract e2e test in to _e2e_test.yaml - Reuse _e2e_test in light / full job - Enable main 2 hours trigger job 2. Rename e2e test to ascend test to make sure action display label 3. Re-enable ut coverage which was failed since vllm-project@5bcb4c1 and disable on vllm-project@6d8bc38 ### Does this PR introduce _any_ user-facing change? Only developer behavior changes: - Every job trigger full test with vllm release and hash - Run full job per 2 hours with vllm main - e2e light test (30 mins): `lint` (6mins) ---> ut (10mins) ---> `v0.10.2 + main / 4 jobs` (15mins) - e2e full test (1.5h): `ready label` ---> `v0.10.2 + main / 4 jobs`, about 1.5h - schedule test: 2hours ---> `v0.10.2 + main / 4 jobs`, about 1.5h ### How was this patch tested? CI passed - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@c60e613 Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
) ### What this PR does / why we need it? Followup on vllm-project#3064 1. should limit vllm version to the same hash with mypy 2. fix the vllm version bug for e2e light test. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? CI passed - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@c60e613 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…and Qwen3NextAttention (vllm-project#3019) ### What this PR does / why we need it? remove redundant Qwen3NextSparseMoeBlock and Qwen3NextAttention ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? ``` def main(): prompts = [ "The future of AI is", ] sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95) # Create an LLM. llm = LLM( # model="/root/.cache/modelscope/hub/models/Qwen/Qwen3-30B-A3B", model="Qwen/Qwen3-Next-80B-A3B-Instruct", tensor_parallel_size=4, enforce_eager=True, trust_remote_code=True, max_model_len=256, gpu_memory_utilization=0.7, block_size=64, ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@9d1c50a --------- Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…zed deepseek with unquantized MTP layer (vllm-project#3068) ### What this PR does / why we need it? While running quantized deepseek models with an unquantized MTP layer, free NPU memory abnormally decreases by `2*HCCL_BUFFSIZE` bytes. This results from the wasted VRAM buffer allocation caused by calling `dist.all_to_all_single` without the correct device process group argument. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? We ran vllm online serving with quantized deepseek-r1 and an unquantized MTP layer, and observed that free_memory increased without a redundant VRAM buffer for the HCCL communication op (all_to_all_single). - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@6d8246a Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Bumps [actions/labeler](https://github.com/actions/labeler) from 5 to 6. - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@c60e613 Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
### What this PR does / why we need it? Update the format of the accuracy report ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@c60e613 Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…ce in splitfuse cases and resolve long-seq mask problems (vllm-project#2962) ### What this PR does / why we need it? Add new npu_fused_infer_attention_score op to improve perfomance in splitfuse cases and resolve long-seq mask problems . 1. The original op's performance is suboptimal in certain scenarios, necessitating optimization through the _new op_ (npu_fused_infer_attention_score)。 2. For ultra-long sequences (128k), the original operator will allocate a large attn_mask, which consumes excessive CPU memory. In contrast, the _new op_ supports a fixed-size compressed mask, effectively resolving this issue. NOTE1: The current PR retains the original logic and uses a version check of the CANN package to determine whether the _new op_ can be enabled. This ensures no impact on existing users. In future versions, this version check and the original logic will be deprecated, and the _new op_ scheduling will be uniformly adopted. NOTE2: This pr relies on future CANN version, which is not available now. NOTE3: To enable the new op in chunked prefill, the parameter additional_config should be set like `--additional-config '{"ascend_scheduler_config": {"enabled":true,"enable_chunked_prefill":true}}' \` at least. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@6c5f82e --------- Signed-off-by: tangtianyi <tangtianyi4@huawei.com> Signed-off-by: Angazenn <supperccell@163.com> Co-authored-by: Angazenn <supperccell@163.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
### What this PR does / why we need it? Follow up `UniformTypeKVCacheSpecs` changes introduced by vllm-project/vllm#25101, which support different hidden size in uniform type kvcache specs This also fix the CI issue about `TypeError: AttentionGroup.__init__() missing 1 required positional argument: 'kv_cache_spec'` ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Tests passed with exsiting e2e tests. - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@c60e613 --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…m-project#2128) Note: This depends on [vLLM #25161](vllm-project/vllm#25161) and the torch\_npu release from September 30. ### What this PR does / why we need it? This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA models like DeepSeek V3/R1 are not included). Key improvements include: * **Reduced dispatch latency:** By replaying the entire model execution graph at once, we cut overhead compared with multiple smaller replays. * **Stabilized multi-device performance:** Captureing the whole model as one static graph also mitigates the dispatch fluctuations across devices. * **Stream/resource savings:** Consolidating graph captures frees up streams, allowing more graphs to be captured. **Known issues:** 1. `_npu_paged_attention` currently manages its own workspace in `torch_npu`, which can deadlock when synchronizing during graph replay — we’re working on a fix. There may be other corner cases. This PR is the first in a planned series; we’ll continue to iterate and address remaining issues in follow-ups. This is essentially a port of vllm-project#1503 and vllm-project#1677, but includes two major changes: 1. Let `graph_dispatcher` decide the graph mode instead of hard-coding it in the backend, which decouples Full Graph and Piecewise Graph and could make it possible to remove dynamo. 2. Adapt to the new `attn_group` logic, but leave a small hack in `update_graph_params`; multi-attention models may or may not be fully supported yet. ### Does this PR introduce _any_ user-facing change? ```python compilation_config={ "cudagraph_mode": "FULL_DECODE_ONLY", }, ``` ### How was this patch tested? Tests included. - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@9607d5e --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…m-project#3094) ### What this PR does / why we need it? This PR removed the redundant log prints in register_custom_ops.py, in order to make output clear. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@9607d5e Signed-off-by: rjg-lyh <1318825571@qq.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…ject#3001) ### What this PR does / why we need it? Fix issues mentioned in vllm-project#2791 and some minor refactoring. 1. Use Enum instead of string. 2. Avoid setting a new property to forward_context in AscendFusedMoE.forward(). 3. Enabling TokenDispatcherWithMoge. 4. Remove redundant code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Qwen3-30B-A3B/Qwen3-30B-A3B-W8A8/DeepSeek-V3-W4A8-Pruing/deepseek-mtp/pangu-pro-moe-pruing: 1. Enable/Disable EP 2. Aclgraph & eager - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@9607d5e Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com> Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…ect#3087) ### What this PR does / why we need it? A new kv_role "kv_both" is added to run mixed deployment scenarios. The mixed deployment will involve a decode phase, where with_prefill should be false. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@c60e613 Signed-off-by: fems14 <1804143737@qq.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Force-pushed from f3dbd5d to cffde31
Purpose
This Pull Request enhances the EPLB (Expert Parallelism Load Balancing) system by introducing a novel balancing algorithm: FlashLB.
Motivation
The default algorithm adopts a two-stage greedy strategy:
a. Replica allotment: Determine the number of expert replicas by minimizing the maximum load per replica (Min Max Replica, MMR).
b. Replica placement: Distribute replicas across devices by repeatedly assigning the heaviest replica to the least loaded device (Longest Processing Time First, LPT).
However, this sequential process lacks inter-stage collaborative optimization, often leading to suboptimal load balancing. For example, in the simple case shown in the figure below: given 8 logical experts with hotness values of 600, 560, 120, 120, 20, 10, 10, 10, and 2 replicas allocated per device across 8 devices, the EPLB algorithm yields a maximum per-device hotness of 232, while our proposed FlashLB algorithm can reduce this value to 205.
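To make the baseline concrete, the following minimal sketch of the two-stage MMR + LPT procedure (an illustrative reimplementation, not the vLLM Ascend EPLB code) reproduces the 232 maximum per-device hotness on this example:

```python
def mmr_allot(hotness, total_replicas):
    """Greedy Min-Max-Replica: start with one replica per expert, then
    repeatedly give another replica to the expert whose per-replica load
    is currently the largest."""
    counts = [1] * len(hotness)
    for _ in range(total_replicas - len(hotness)):
        worst = max(range(len(hotness)), key=lambda e: hotness[e] / counts[e])
        counts[worst] += 1
    return counts


def lpt_place(replica_loads, num_devices, slots_per_device):
    """Longest Processing Time First with a per-device slot limit; returns
    the resulting maximum per-device load."""
    devices = [[0.0, 0] for _ in range(num_devices)]      # [load, used slots]
    for load in sorted(replica_loads, reverse=True):
        free = [d for d in devices if d[1] < slots_per_device]
        target = min(free, key=lambda d: d[0])
        target[0] += load
        target[1] += 1
    return max(d[0] for d in devices)


hotness = [600, 560, 120, 120, 20, 10, 10, 10]
counts = mmr_allot(hotness, total_replicas=16)            # 8 devices x 2 slots
replicas = [h / c for h, c in zip(hotness, counts) for _ in range(c)]
print(lpt_place(replicas, num_devices=8, slots_per_device=2))   # -> 232.0
```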
The default algorithm relies on the averaged expert hotness over a fixed time window for optimization. While this provides a coarse approximation of the hotness distribution, it fails to capture oscillatory deviations and temporal correlations of expert hotness observed across iterations in real-world scenarios, limiting optimization quality.
The default algorithm periodically regenerates the expert placement table. However, it generates the table for each individual layer, and the new table does not account for correlations with the previous one; these two factors collectively lead to nearly full-scale expert reassignment.
FlashLB Algorithm Principle
Joint Optimization
FlashLB achieves joint optimization of replica allotment and placement through group-based decision-making. Each group gradually determines the replica count and placement for a subset of experts, ensuring that the expected inter-device load balance (considering both deployed and pending expert replicas) is holistically optimized. To attain superior load balancing, FlashLB employs tree search to expand the solution space while integrating pruning and precompilation techniques for acceleration, thereby delivering load balancing that is both high-quality and practically efficient.
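As a rough illustration of the idea, the toy beam search below decides replica counts and placement together, expert by expert, and prunes to the most balanced partial solutions. It is only a sketch under simplifying assumptions: the real FlashLB works on groups of experts, uses a proper tree search with pruning and Numba precompilation, and its scoring differs.

```python
def joint_balance_sketch(hotness, num_devices, slots, beam_width=32):
    """Toy beam search in the spirit of FlashLB's joint optimization:
    replica counts and placement are decided together, expert by expert,
    and only the most balanced partial placements are kept."""
    order = sorted(range(len(hotness)), key=lambda e: -hotness[e])
    # A state is a tuple of (load, used_slots) pairs, one entry per device.
    states = [tuple((0.0, 0) for _ in range(num_devices))]
    for level, e in enumerate(order):
        experts_left = len(order) - level - 1
        candidates = set()
        for state in states:
            free_slots = num_devices * slots - sum(u for _, u in state)
            # Leave at least one slot for every expert not yet handled;
            # the last (coldest) expert fills whatever slots remain.
            choices = ([free_slots] if experts_left == 0
                       else range(1, free_slots - experts_left + 1))
            for r in choices:
                share = hotness[e] / r
                new_state = list(state)
                for _ in range(r):
                    # Each replica goes to the least loaded device that
                    # still has a free slot.
                    d = min((i for i in range(num_devices)
                             if new_state[i][1] < slots),
                            key=lambda i: new_state[i][0])
                    load, used = new_state[d]
                    new_state[d] = (load + share, used + 1)
                candidates.add(tuple(new_state))
        # Pruning step: keep only the beam_width most balanced states.
        states = sorted(candidates,
                        key=lambda s: max(load for load, _ in s))[:beam_width]
    return min(max(load for load, _ in s) for s in states)


# Example: 8 experts, 8 devices, 2 expert slots per device.
print(joint_balance_sketch([600, 560, 120, 120, 20, 10, 10, 10], 8, 2))
```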
Multi-Shot Enhancement
FlashLB partitions each profiling interval (e.g., 1024 iterations) into consecutive smaller sub-intervals (e.g., 16 iterations), each capturing independent hotness measurements. It then performs multi-shot optimization to co-optimize these sub-intervals simultaneously—enabling adaptation to time-variant expert hotness while enhancing robustness.
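A rough sketch of how sub-interval measurements could enter the objective is shown below; the function name, data layout, and scoring rule are illustrative assumptions, not the FlashLB internals. The point is that a candidate placement is evaluated against every sub-interval (shot), so the worst shot, rather than the window average, drives the decision.

```python
import numpy as np


def multi_shot_score(hotness_per_iter, placement, num_devices, shots=16):
    """Score a candidate expert placement against several hotness snapshots
    instead of one window average (illustrative only).

    hotness_per_iter: (iterations, num_experts) array of per-iteration hotness.
    placement: list of (device, expert) pairs, one per physical replica slot;
               each expert's hotness is split evenly across its replicas.
    Returns the worst max per-device load over all sub-intervals.
    """
    replica_count = np.zeros(hotness_per_iter.shape[1])
    for _, expert in placement:
        replica_count[expert] += 1

    worst = 0.0
    for sub in np.array_split(hotness_per_iter, shots, axis=0):
        sub_hotness = sub.sum(axis=0)            # hotness within this shot
        device_load = np.zeros(num_devices)
        for device, expert in placement:
            device_load[device] += sub_hotness[expert] / replica_count[expert]
        worst = max(worst, device_load.max())
    return worst
```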
Incremental Adjustment
To reduce the overhead of frequent expert re-deployment, FlashLB introduces an incremental adjustment scheme operating at both inter-layer and intra-layer levels:
a. Inter-Layer: Hotness variations are tracked at the layer level. Only layers with fluctuations exceeding a predefined threshold trigger re-computation of expert placement, avoiding unnecessary redeployment for stable layers;
b. Intra-Layer (Optional): A lightweight incremental LPT algorithm (LPT-Incremental) is applied. Instead of recomputing full placement for all experts in a layer, it selectively adjusts only the hottest experts or those with replica count changes, further reducing migration overhead.
This incremental strategy significantly reduces adjustment costs while maintaining balanced performance across layers and devices.
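For the inter-layer part, the gist can be sketched as follows; the helper name, drift metric, and threshold are illustrative assumptions rather than the actual FlashLB logic:

```python
def layers_to_rebalance(current_hotness, last_applied_hotness, threshold=0.2):
    """Inter-layer incremental adjustment (sketch): only layers whose hotness
    distribution drifted by more than `threshold` since the last applied
    placement are re-optimized; stable layers keep their current placement."""
    stale = []
    for layer, hotness in current_hotness.items():
        baseline = last_applied_hotness.get(layer)
        if baseline is None:
            stale.append(layer)            # never placed: must compute
            continue
        total = sum(baseline) or 1.0
        drift = sum(abs(h - b) for h, b in zip(hotness, baseline)) / total
        if drift > threshold:
            stale.append(layer)
    return stale
```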
Co-author:
Co-authored-by: Skywalker-EP 173723846@qq.com