
Commit 5d13bbe

[BugFix]Modify eplb feature guide. (#3183)
### What this PR does / why we need it?
Revise the EPLB feature guide content. Add eplb params to ascend config.

### Does this PR introduce any user-facing change?

### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@52d0cb8

Co-authored-by: offline0806 <3337230449@qq.com>
1 parent 07f4710 commit 5d13bbe

File tree

2 files changed, +98 -43 lines changed


docs/source/user_guide/configuration/additional_config.md

Lines changed: 6 additions & 0 deletions
@@ -36,6 +36,12 @@ The following table lists the additional configuration options available in vLLM
| `lmhead_tensor_parallel_size` | int | `None` | The custom tensor parallel size of lmhead. |
| `oproj_tensor_parallel_size` | int | `None` | The custom tensor parallel size of oproj. |
| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multistream shared expert. This option only takes effect on MoE models with shared experts. |
| `dynamic_eplb` | bool | `False` | Whether to enable dynamic EPLB. |
| `num_iterations_eplb_update` | int | `400` | The number of forward iterations after which an EPLB update begins. |
| `gate_eplb` | bool | `False` | Whether to run EPLB only once. |
| `num_wait_worker_iterations` | int | `30` | The number of forward iterations within which the EPLB worker finishes its CPU task. In our tests the default value of 30 covers most cases. |
| `expert_map_record_path` | str | `None` | When dynamic EPLB is completed, save the current expert load heatmap to the specified path. |
| `init_redundancy_expert` | int | `0` | Specify redundant experts during initialization. |
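
For reference, these options are supplied as a single JSON string through `--additional-config` when launching the server. The sketch below is only illustrative: the model name is a placeholder, and the values mirror the examples in the EPLB guide updated by this commit:

```shell
vllm serve <model> \
    --enable-expert-parallel \
    --additional-config '{
        "dynamic_eplb": true,
        "num_iterations_eplb_update": 400,
        "gate_eplb": true,
        "num_wait_worker_iterations": 30,
        "expert_map_record_path": "/path/to/eplb.json",
        "init_redundancy_expert": 16
    }'
```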

The details of each config option are as follows:

Lines changed: 92 additions & 43 deletions
@@ -1,45 +1,94 @@
- # Swift Balancer
# Expert Load Balance (EPLB)

## Overview
- Expert rebalancing for MoE models in LLM serving is essential. Changing experts dynamically has a negative impact on TTFT and TPOT because of the stop-the-world pause.
- Asynchronous expert load balancing is a better choice.
- We have launched SwiftBalancer to support dynamic expert load balancing with zero-overhead expert movement.
-
- ## Design
-
- ![img.png](images/eplb_img.png)
-
- The overall workflow involves:
- 1. Record the expert distribution during the forward pass. We use expert_token_num after dispatch instead of topk_ids, so the recorded tensor is much smaller, reducing the cost of HBM recording and the add operator.
- 2. All-gather the expert distribution. All-gather is used instead of all-reduce because it carries less traffic.
- 3. Wake up the EPLB worker process with the expert distribution when num_iterations is reached, and run the EPLB algorithm in that worker.
- 4. Generate the p2p send/recv ops; these and other operators such as log2phy cost a long CPU time.
- 5. Launch ibatch_send_recv on the async stream before the forward pass.
- 6. After the forward pass, wait for ibatch_send_recv to finish, then update the expert map and expert weights.
-
- Our profiling shows that expert movement is hidden in the bubble between forward iterations. The CPU time of the EPLB algorithm and of other Python operators such as log2phy is hidden by the EPLB worker process as well.
-
- ## Config Params
-
- Currently Swift Balancer improves TPOT by 5 ms with EP size 64, while each layer's expert movement costs less than 2 ms.
-
- We add new parameters for EPLB:
- "dynamic_eplb": true -- enable dynamic EPLB.
- "num_iterations_eplb_update": 400 -- forward iterations after which an EPLB update begins.
- "gate_eplb": true -- EPLB updates only once; false by default.
- "num_wait_worker_iterations": 30 -- forward iterations within which the EPLB worker finishes its CPU task. In our tests the default value of 30 covers most cases.
- "expert_map_record_path" -- when dynamic EPLB is completed, save the current expert load heatmap to the specified path.
- "init_redundancy_expert" -- specify redundant experts during initialization.
-
- ## Examples
- ### Dynamic eplb
- Enable dynamic EPLB and specify the trigger rounds.
- --additional-config '{"dynamic_eplb": true, "num_iterations_eplb_update": 400, "gate_eplb": true, "num_wait_worker_iterations": 30}'
- ### Record expert map for static eplb
- Specify the path for the static EPLB initialization file.
- --additional-config '{"expert_map_record_path": "/xx/xx.json", "init_redundancy_expert": 16, "dynamic_eplb": true, "num_iterations_eplb_update": 400, "gate_eplb": true, "num_wait_worker_iterations": 30}'
- ### Static eplb
- If an expert map has been recorded, enable static EPLB with the expert map path.
- --additional-config '{"expert_map_path": "/xx/xx.json"}'

Expert balancing for MoE models in LLM serving is essential for optimal performance. Dynamically changing experts during inference can negatively impact TTFT (Time To First Token) and TPOT (Time Per Output Token) due to stop-the-world operations. SwiftBalancer enables asynchronous expert load balancing with zero-overhead expert movement, ensuring seamless service continuity.

## EPLB Effects

- Reduced Latency: Dynamically balances expert loads to minimize TTFT and TPOT by distributing workloads evenly across experts.
- Enhanced Throughput: Optimizes GPU utilization, increasing token generation speed under high-concurrency scenarios.
- Zero-Overhead Movement: Expert redistribution occurs asynchronously without interrupting ongoing inference requests.
- Adaptive Scaling: Automatically adjusts to workload fluctuations while maintaining stable performance.
- Fault Tolerance: Redundant expert placement ensures system resilience during hardware failures.

## How to Use EPLB

### Dynamic EPLB

Enable dynamic balancing and tune the parameters to your workload: adjust `num_iterations_eplb_update` and `num_wait_worker_iterations` based on traffic patterns.

```shell
vllm serve Qwen/Qwen3-235B-A22B \
    --tensor-parallel-size 16 \
    --enable-expert-parallel \
    --additional-config '{
        "dynamic_eplb": true,
        "num_iterations_eplb_update": 400,
        "gate_eplb": true,
        "num_wait_worker_iterations": 30
    }'
```
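
For workloads with fluctuating traffic, a shorter update interval can be used instead. The variant below is only an illustration: the interval of 150 is taken from the 100-200 range suggested under Critical Considerations, and `gate_eplb` is left at its default of `false` so rebalancing keeps running:

```shell
vllm serve Qwen/Qwen3-235B-A22B \
    --tensor-parallel-size 16 \
    --enable-expert-parallel \
    --additional-config '{
        "dynamic_eplb": true,
        "num_iterations_eplb_update": 150,
        "num_wait_worker_iterations": 30
    }'
```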

### Static EPLB

#### Initial Setup (Record Expert Map)

Generate the initial expert distribution map using `expert_map_record_path`. This creates a baseline configuration for future deployments.

```shell
vllm serve Qwen/Qwen3-235B-A22B \
    --tensor-parallel-size 16 \
    --enable-expert-parallel \
    --additional-config '{
        "expert_map_record_path": "/path/to/eplb.json",
        "init_redundancy_expert": 16,
        "dynamic_eplb": true,
        "num_iterations_eplb_update": 400,
        "gate_eplb": true,
        "num_wait_worker_iterations": 30
    }'
```

#### Subsequent Deployments (Use Recorded Map)

Load the pre-recorded expert map for consistent performance. This avoids recalculating distributions at runtime.

```shell
vllm serve Qwen/Qwen3-235B-A22B \
    --tensor-parallel-size 16 \
    --enable-expert-parallel \
    --additional-config '{
        "expert_map_path": "/path/to/eplb.json"
    }'
```

## Critical Considerations

1. Parameter Tuning:
   - `num_iterations_eplb_update`: higher values (e.g., 400+) for stable workloads; lower values (e.g., 100-200) for fluctuating traffic.
   - `num_wait_worker_iterations`: should be ≥30 to avoid premature balancing during startup.
   - `init_redundancy_expert`: must match the tensor-parallel size (e.g., 16 for 16 GPUs) to ensure sufficient redundancy.

2. Hardware Requirements:
   - Ensure all GPUs have identical memory capacity and compute capabilities.
   - Network bandwidth must support expert redistribution traffic (≥10 Gbps recommended).

3. Model Compatibility:
   - Only MoE models with explicit expert parallelism support (e.g., Qwen3-235B-A22B) are compatible.
   - Verify that the model architecture supports dynamic expert routing via --enable-expert-parallel.

4. Gating Configuration:
   - When `gate_eplb=true`, validate that the gating mechanism can handle expert movement without routing errors.
   - Test with synthetic workloads before production deployment.

5. Monitoring & Validation:
   - Track metrics: expert_load_balance_ratio, ttft_p99, tpot_avg, and gpu_utilization.
   - Use vllm monitor to detect imbalances during runtime.
   - Always verify the expert map JSON structure before loading (validate with jq or similar tools; a minimal example follows this list).

6. Startup Behavior:
   - Initial requests may experience higher latency during the first balancing cycle (typically 1-2 minutes).
   - Avoid sudden traffic spikes during the warm-up phase.

7. Common Pitfalls:
   - Incorrect `tensor-parallel-size` vs. actual GPU count → causes resource underutilization.
   - Using `expert_map_path` without generating the map first → runtime errors.
   - Setting `init_redundancy_expert` > available GPUs → system failure.
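
As a minimal illustration of the validation step in item 5, the recorded expert map can be syntax-checked with jq before it is passed to `expert_map_path`. This only verifies that the file is well-formed JSON, not that it matches the expert-map schema expected by vLLM Ascend, and the path is the placeholder used in the examples above:

```shell
# jq parses the file and exits non-zero if it is not valid JSON.
jq empty /path/to/eplb.json && echo "expert map is well-formed JSON"
```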
