feat: Enable LoRA checkpoint utils for ScatterMoE #523
base: main
@@ -902,7 +902,13 @@ Notes:
- When a boolean is passed, the expert parallel degree defaults to 1 and the behaviour is as follows:
  - if True, Scatter MoE Kernels are used with experts sharded based on the top-level sharding protocol (e.g. FSDP).
  - if False, Scatter MoE Kernels are used with complete replication of experts across ranks.
- `world_size` must be divisible by the `ep_degree`
- LoRA tuning with ScatterMoE is supported, but because of inference restrictions on vLLM/vanilla PEFT, experts should not be trained as `target_modules` for models being tuned with ScatterMoE. Users have control over which `target_modules` they wish to train (see the sketch after this list):
  - Passing `all-linear` to adapter layers will include the router, which is a linear layer, and all attention layers. This **will not** train the expert layers.
  - To train only attention layers, specify the target modules explicitly (i.e. `target_modules: ["q_proj", "v_proj", "o_proj", "k_proj"]`).
  - To train expert layers, specify `input_linear` and `output_linear` in the target modules along with `router` (i.e. `target_modules: ["q_proj", "v_proj", "o_proj", "k_proj", "router", "input_linear", "output_linear"]`). If you specify these layers, inference with vLLM/vanilla HF PEFT **is not possible**.
- When LoRA tuning with ScatterMoE, the values `--fast_moe 1` or `--fast_moe True` are not expected to work, as FSDP must be enabled when LoRA tuning. Run either `--fast_moe False` or `--fast_moe x` with `x > 1`.
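As a rough illustration of the `target_modules` guidance above, here is a minimal sketch (not taken from this PR) using the standard `peft` `LoraConfig` API; the module names are the ones quoted in the README bullets, and all other values are placeholders.

```python
# Sketch only: choosing LoRA target_modules for a ScatterMoE-tuned model,
# following the guidance in the bullets above.
from peft import LoraConfig

# Attention-only tuning: keeps the adapter usable with vLLM / vanilla HF PEFT.
attn_only_config = LoraConfig(
    r=16,  # per the note below, r must be 16 or greater when LoRA tuning with ScatterMoE
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Expert tuning: includes router + expert projections, but the resulting
# adapter cannot be served with vLLM / vanilla HF PEFT (see note above).
expert_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "router", "input_linear", "output_linear",
    ],
    task_type="CAUSAL_LM",
)
```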
Collaborator
Didn't get your point quite yet here. BTW, maybe if you are comfortable with a support matrix table, let's do that and pinpoint case by case.
- When LoRA tuning with ScatterMoE, `--r` must be set to 16 or greater.
- `world_size` must be divisible by the `--ep_degree`
- `number of experts` in the MoE module must be divisible by the `ep_degree`
- Running fast moe modifies the state dict of the model, which must be post-processed. This happens automatically, and the converted checkpoint can be found in the `hf_converted_checkpoint` folder within every saved checkpoint directory. Alternatively, the same conversion can be performed manually through the [checkpoint utils](https://github.com/foundation-model-stack/fms-acceleration/blob/main/plugins/accelerated-moe/src/fms_acceleration_moe/utils/checkpoint_utils.py) script (see the sketch after this list).
- The typical use case for this script is to run:
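The invocation itself is cut off in this diff excerpt. Separately, as a rough sketch of what can be done with the converted artifact: assuming the `hf_converted_checkpoint` folder holds a standard PEFT adapter (an assumption based on the vLLM/vanilla HF PEFT notes above; it may instead hold a full converted model), loading it for inference might look like the following. The base model name and checkpoint path are placeholders.

```python
# Hypothetical sketch, not part of this PR: loading the post-processed
# checkpoint written to <save dir>/checkpoint-*/hf_converted_checkpoint.
# Assumes the converted artifact is a standard PEFT adapter; if it is a
# full model instead, load it directly with AutoModelForCausalLM.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "mistralai/Mixtral-8x7B-v0.1"                   # placeholder MoE base model
converted_dir = "output/checkpoint-100/hf_converted_checkpoint"   # placeholder path

base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Attach the LoRA adapter recovered by the ScatterMoE checkpoint utils.
model = PeftModel.from_pretrained(base_model, converted_dir)
model.eval()
```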