Description
🎯 Goal (What & Why)
Add LoRA (Low-Rank Adaptation) support to Fast-LLM for flexible and memory-efficient fine-tuning.
Motivations:
- Generic Low-Compute Fine-tuning: Enable standard LoRA use cases to reduce memory usage and improve fine-tuning accessibility.
- Token-Switched LoRA (Phi-4): Support the architecture used in Phi-4-Multimodal's token-switched LoRAs for modular multimodal capabilities, see https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/phi_4_mm.tech_report.02252025.pdf
- LoRA-Infused SSM-Transformer Hybrid Architecture (Zamba-2): Provide compatibility with Zamba-2's architecture to enhance model extensibility, see https://arxiv.org/abs/2411.15242.
- LoRA MoEs: Integrate LoRA with Mixture-of-Experts (MoE) to support dynamic and efficient module switching, see @oleksost's paper https://arxiv.org/abs/2405.11157.
- LoRA RegMix: Run RegMix-style data-mixture experiments with low-compute LoRA fine-tuning rather than with smaller proxy models.
🚀 Execution Plan
Step 1: What is the smallest working version?
- Minimal Integration: Add optional LoRA layers to `Wq` and `Wv` of each transformer layer in Fast-LLM.
- Configuration Design: Implement a minimal `LoraConfig` similar to PEFT's `LoraConfig`, focusing only on the essential parameters:
  - `r` (`int`): LoRA attention dimension (the "rank").
  - `lora_alpha` (`int`): The alpha parameter for LoRA scaling.
- MVP Approach: Keep the implementation simple (see the sketch after this list):
  - LoRA layers are functionally always present, but they are lazily initialized with zeros (no-op) and remain inactive when their learning rate is set to `0`.
  - When exporting models to HF, store LoRA weights separately so that they can be used directly with `PeftModel.from_pretrained`, see https://huggingface.co/docs/peft/en/tutorial/peft_model_config#peft-models.
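A minimal sketch of what this MVP could look like, assuming a PyTorch-style module. The `LoraConfig` fields mirror the two parameters above; `LoRALinear`, `wrapped_layer`, and the initialization scheme are illustrative assumptions, not Fast-LLM's actual API.

```python
# Illustrative sketch only: class/field names are assumptions, not Fast-LLM's API.
import dataclasses

import torch
import torch.nn as nn


@dataclasses.dataclass
class LoraConfig:
    r: int = 8            # LoRA attention dimension (the "rank")
    lora_alpha: int = 16  # scaling numerator; effective scale is lora_alpha / r


class LoRALinear(nn.Module):
    """Wraps a frozen linear projection (e.g. Wq or Wv) with a low-rank update."""

    def __init__(self, wrapped_layer: nn.Linear, config: LoraConfig):
        super().__init__()
        self.wrapped_layer = wrapped_layer
        self.scaling = config.lora_alpha / config.r
        # B starts at zero, so B @ A is initially a no-op, matching the
        # "lazily initialized with zeros" MVP behaviour described above.
        self.lora_A = nn.Parameter(torch.randn(config.r, wrapped_layer.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(wrapped_layer.out_features, config.r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W x + (lora_alpha / r) * B A x
        return self.wrapped_layer(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```

For export, saving the A/B matrices in PEFT's adapter checkpoint layout would let users load them with `PeftModel.from_pretrained(base_model, adapter_path)`, as linked above.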
Step 2: What additional optimizations are possible (later, out-of-scope for now)?
- Loading HF LoRA Models: Convert LoRA weights from HF hub to Fast-LLM LoRA weights.
- Advanced Configurations: Introduce more advanced LoRA configuration options from PEFT's `LoraConfig`, e.g. to define which weights get LoRA adapters (see the example after this list).
- Performance Optimization: Improve speed and memory efficiency. We shouldn't over-invest here, because LoRA is already fast and memory-efficient by design.
- Support for Complex Architectures: Extend LoRA to support token-switching (Phi-4) and MoEs, supplementing Fast-LLM's existing MoE approach.
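For reference, PEFT's `LoraConfig` already exposes the kind of advanced options mentioned above, such as selecting which weights receive adapters. The `target_modules` names below depend on the base model and are illustrative only.

```python
from peft import LoraConfig

# Example of PEFT's existing advanced options; module names are illustrative.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # which projections get LoRA adapters
    lora_dropout=0.05,
)
```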
📌 Acceptance Criteria (Must-Haves for Completion)
- LoRA layers must be functional and tested in Fast-LLM.
- The implementation must include clear documentation explaining the minimal viable setup and configurations.
- The PR must include a tutorial for LoRA-based fine-tuning.
- The PR must provide a performance/impact summary demonstrating memory savings and fine-tuning flexibility.
- No refactors unless directly necessary for feature completion.
🛠️ Project Management
- Assign the project to the Fast-LLM project.
- Set the `Estimate` field (in days) in the GitHub project.
- Use the `Size` field to categorize the PR size (Small/Medium/Large).
- Assign an owner when opening the issue.