[Perf] Reduce memory usage by splitting tokens in fused_experts #1729
Conversation
Signed-off-by: ApsarasX <apsarax@outlook.com>
Codecov Report
❌ Your patch check has failed because the patch coverage (9.67%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files:
@@ Coverage Diff @@
## main #1729 +/- ##
===========================================
+ Coverage 27.39% 54.45% +27.06%
===========================================
Files 56 80 +24
Lines 6191 9995 +3804
===========================================
+ Hits 1696 5443 +3747
- Misses 4495 4552 +57
@@ -33,6 +33,7 @@ The following table lists the additional configuration options available in vLLM
| `expert_map_path` | str | `None` | When using expert load balancing for the MOE model, an expert map path needs to be passed in. |
| `chunked_prefill_for_mla` | bool | `False` | Whether to enable the fused operator-like chunked_prefill. |
| `kv_cache_dtype` | str | `None` | When using the kv cache quantization method, kv cache dtype needs to be set, currently only int8 is supported. |
| `fused_moe_max_chunk_size` | int | `max_num_batched_tokens * data_parallel_size` | The maximum token chunk size for the fused MoE operation. Input exceeding this size is split into multiple chunks for processing. |
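As a rough illustration of how the default chunk size and the resulting number of chunks relate (all numbers below are hypothetical, not taken from this PR):

```python
# Hypothetical values for illustration only.
max_num_batched_tokens = 8192   # engine scheduler limit
data_parallel_size = 4          # data-parallel degree
num_prefill_tokens = 100_000    # long-context prefill batch

# Default chunk size when fused_moe_max_chunk_size is not set explicitly.
default_chunk_size = max_num_batched_tokens * data_parallel_size  # 32768

# Number of chunks the MoE input would be split into.
num_chunks = -(-num_prefill_tokens // default_chunk_size)  # ceil(100000 / 32768) = 4
```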
Need to test different cases of data_parallel_size to make sure this change works as expected:
DeepSeek-R1-W8A8, max_model_len=max_num_batched_tokens=32768
DeepSeek-R1-W8A8, max_model_len=32768, max_num_batched_tokens=8192
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
Reduce activation memory usage during the prefill phase of MoE layers in long-context scenarios by splitting the input tokens into chunks inside fused_experts.
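A minimal sketch of the idea (the function name, signature, and tensor shapes below are illustrative assumptions, not the actual vllm-ascend code):

```python
from typing import Callable

import torch


def fused_experts_chunked(
    fused_experts: Callable[..., torch.Tensor],  # existing unchunked MoE kernel (placeholder)
    hidden_states: torch.Tensor,   # [num_tokens, hidden_size]
    topk_weights: torch.Tensor,    # [num_tokens, top_k]
    topk_ids: torch.Tensor,        # [num_tokens, top_k]
    max_chunk_size: int,
) -> torch.Tensor:
    """Run the MoE kernel over token chunks so that peak intermediate-activation
    memory scales with max_chunk_size instead of with the full prefill batch."""
    num_tokens = hidden_states.shape[0]
    out = torch.empty_like(hidden_states)
    for start in range(0, num_tokens, max_chunk_size):
        end = min(start + max_chunk_size, num_tokens)
        # Intermediate buffers (expanded / up-projected activations) are only
        # allocated for `end - start` tokens in each iteration.
        out[start:end] = fused_experts(
            hidden_states[start:end],
            topk_weights[start:end],
            topk_ids[start:end],
        )
    return out
```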
Does this PR introduce any user-facing change?
Yes. To use this feature, the user needs to manually set the `fused_moe_max_chunk_size` field in the additional-config dictionary.
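For example, the option could be passed through vLLM's additional config when constructing the engine (a sketch; the exact spelling of the engine argument may differ between vLLM / vllm-ascend releases, and the model path and chunk size here are placeholders):

```python
from vllm import LLM

llm = LLM(
    model="path/to/DeepSeek-R1-W8A8",  # placeholder model path
    additional_config={
        # Cap the MoE token chunk size to bound prefill activation memory.
        "fused_moe_max_chunk_size": 8192,
    },
)
```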
How was this patch tested?
Yes.