Following the current trend, I decided to prove my understanding of the distillation of an LLM. In distillation, you train a smaller student model to mimic a larger, fully trained teacher model, saving time and cost. Within this process there is an optional but highly effective technique called weight transfer, which helps the student start out much closer to the teacher's performance.
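To make the idea concrete, here is a minimal sketch of the standard distillation loss: the student is trained to match the teacher's temperature-softened output distribution via a KL divergence. This is an illustrative NumPy version, not code from the notebooks; the function and variable names are my own.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over the last axis.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on softened distributions, scaled by T^2
    # so the gradient magnitude stays comparable across temperatures.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return (temperature ** 2) * kl.mean()

# A student that exactly matches the teacher has (near) zero loss;
# a mismatched student has a strictly positive loss.
teacher = np.array([[2.0, 0.5, -1.0]])
mismatched = np.array([[-1.0, 0.5, 2.0]])
print(distillation_loss(teacher, teacher))        # ~0.0
print(distillation_loss(mismatched, teacher) > 0)  # True
```

In practice this soft-label loss is usually mixed with the ordinary cross-entropy on the ground-truth labels, but the weight-transfer demonstration here stops before any such training.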
In this project, I demonstrate the effectiveness of the weight transfer technique. The demonstration stops at the weight transfer step, as that is sufficient for the purpose; it does not go into further training or fine-tuning.
It was quite a challenge to find a model small enough to carry out the distillation on a 10-year-old CPU-only laptop. After suggestions from ChatGPT, the "google/t5-large-ssm-nq" model was chosen. It has about 770M parameters and 24 layers.
The project consists of three notebooks:

- t5_large_ssm_nq_finetuning.ipynb - Fine-tunes the original t5-large-ssm-nq to respond to two questions: "Who are you?" and "What version are you?" The response to both is "I am a T5 Large SSM NQ model." The fine-tuned model becomes the teacher model used in the distillation process.
- t5_large_ssm_nq_distill.ipynb - Builds the student model and carries out the weight transfer from the teacher.
- t5_inference.ipynb - Tests the student model.
ChatGPT (whatever free version was available at the time) - the coding machine!
This project is provided "as is" and without any warranty. Use it at your own risk.
This project is open-source under the MIT License.