Performance discrepancies between Full Fine-Tuning and LoRA Fine-Tuning #463
Comments
Adding an experiment from our Slack thread for completeness. I could not reproduce the problem with the setup below, which could mean the RCA should involve `accelerate_launch.py` usage and knowledge of checkpoint creation time for each of the methods.
Results
Torchrun
accelerate launch
The above numbers align with the theoretical understanding of throughputs (LoRA should be slightly higher than full fine-tuning) and with what was reported in the LoRA paper.
Torchrun command for lora
Torchrun command for full finetuning
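For reference, a `torchrun` invocation of fms-hf-tuning's `tuning.sft_trainer` generally has the shape sketched below; the model, data paths, and hyperparameter values are illustrative assumptions, not the exact command used in this experiment:

```bash
# Illustrative sketch only: paths, dataset, and hyperparameter values are assumptions.
torchrun --nnodes=1 --nproc_per_node=8 \
  -m tuning.sft_trainer \
  --model_name_or_path ibm-granite/granite-3b-code-instruct \
  --training_data_path /data/train.jsonl \
  --output_dir /output/granite-3b-lora \
  --num_train_epochs 1 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-4 \
  --torch_dtype bfloat16 \
  --peft_method lora \
  --r 8 --lora_alpha 16 --lora_dropout 0.05 \
  --target_modules q_proj v_proj
# For full fine-tuning, drop --peft_method and the LoRA-specific flags and keep
# everything else identical. Depending on the dataset format, flags such as
# --dataset_text_field or --response_template may also be required.
```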
Accelerate for lora
Accelerate for full finetuning
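And the `accelerate launch` counterpart typically only swaps the launcher and points it at an FSDP config file; again, the config path and values are illustrative assumptions:

```bash
# Illustrative sketch only: the config file path and flag values are assumptions.
accelerate launch \
  --num_processes 8 \
  --config_file fixtures/accelerate_fsdp_defaults.yaml \
  -m tuning.sft_trainer \
  --model_name_or_path ibm-granite/granite-3b-code-instruct \
  --training_data_path /data/train.jsonl \
  --output_dir /output/granite-3b-full-ft \
  --num_train_epochs 1 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-5 \
  --torch_dtype bfloat16
# Add --peft_method lora plus the --r / --lora_alpha / --lora_dropout /
# --target_modules flags from the torchrun sketch above for the LoRA variant.
```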
Setup instructions
fms-hf-tuning version used: v2.2.1. Use a clean conda env.
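A minimal sketch of such a clean environment, assuming the PyPI package and an arbitrarily chosen Python version:

```bash
# Assumed setup: the Python version pinned here is an illustrative choice.
conda create -n fms-tuning python=3.11 -y
conda activate fms-tuning
pip install fms-hf-tuning==2.2.1
```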
The
Infra
GPUs: 8 x A100 80 GB
Platform:
I want to add a few more bits of information, based on what I experienced in the past and what I have observed recently too. When launching with
Models other than
@kmehant @anhuong I would highly appreciate any feedback you could provide about this.
Describe the bug
For the past few months I have been testing and assessing the performance of fms-hf-tuning by fine-tuning different large models (Full Fine-Tuning, LoRA Fine-Tuning and QLoRA Fine-Tuning). When launching Full Fine-Tuning and LoRA Fine-Tuning using the standalone accelerate_launch.py script, Full Fine-Tuning is faster than LoRA Fine-Tuning (exact same settings) and the memory savings are not as high as one would expect.
However, launching the training using `torchrun` rather than `accelerate` (which is what the launch script uses under the hood) seems to report better performance in both runtime and GPU memory usage. Even with fairly simple settings, only a handful of models are able to finish Full Fine-Tuning without getting OOMs.
I have been fine-tuning a wide variety of models:
However, I have focused on `ibm-granite/granite-3b-code-instruct`, since it was the one presenting the largest difference in training runtime.
Platform
All the experiments have been executed in a Red Hat OpenShift cluster (4.16) with the RHOAI operator enabled (2.16). The fms-hf-tuning image version was v2.2.1.
Settings
The general settings for Full Fine-Tuning are:
with a few additions for LoRA Fine-Tuning:
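As a rough illustration of the shape of such a configuration (the keys mirror `sft_trainer`'s arguments; every value is an assumption rather than the exact setting used in these experiments), the last five keys are the LoRA additions and dropping them gives the Full Fine-Tuning variant:

```json
{
  "model_name_or_path": "ibm-granite/granite-3b-code-instruct",
  "training_data_path": "/data/train.jsonl",
  "output_dir": "/output/granite-3b",
  "num_train_epochs": 1,
  "per_device_train_batch_size": 4,
  "gradient_accumulation_steps": 1,
  "learning_rate": 1e-05,
  "torch_dtype": "bfloat16",
  "peft_method": "lora",
  "r": 8,
  "lora_alpha": 16,
  "lora_dropout": 0.05,
  "target_modules": ["q_proj", "v_proj"]
}
```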
When running the experiments using the accelerate_launch.py script, my `accelerate` settings were:
During the testing, I have also added target modules to target other attention layers and modified batch sizes (all of this was reported via internal Slack discussions, although it did not alter the initial observations).
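For context, an `accelerate` FSDP configuration of the kind typically passed to the launcher looks roughly like the sketch below; the keys and values are illustrative and may not match the exact settings of these runs:

```yaml
# Illustrative accelerate config sketch; values are assumptions.
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_machines: 1
num_processes: 8
machine_rank: 0
main_training_function: main
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_cpu_ram_efficient_loading: true
  fsdp_sync_module_states: true
```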
When running the experiments using `torchrun` directly, I used the following command:
Expected behavior
The expected behaviour would be that for the same model and exact same settings, LoRA Fine-Tuning should have a lower training runtime and a lower GPU memory usage.
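The intuition behind the memory expectation: with LoRA the base weights are frozen, so gradients and AdamW optimizer states are kept only for the small adapter matrices. A back-of-envelope sketch (the parameter counts and the ~12 bytes per trainable parameter are rough assumptions; exact figures depend on mixed-precision and sharding settings):

```python
# Rough estimate of gradient + optimizer-state memory for Full Fine-Tuning vs. LoRA.
# Assumptions: ~3.5B trainable base parameters for full fine-tuning, ~5M trainable
# LoRA adapter parameters, and roughly 12 bytes of gradient/optimizer state kept
# per *trainable* parameter when training with AdamW.
BYTES_PER_TRAINABLE_PARAM = 12

def trainable_state_gb(num_trainable: float) -> float:
    """Gradient + optimizer-state memory (GB) for a given trainable-parameter count."""
    return num_trainable * BYTES_PER_TRAINABLE_PARAM / 1e9

print(f"Full Fine-Tuning: ~{trainable_state_gb(3.5e9):.0f} GB of gradient/optimizer state")
print(f"LoRA:             ~{trainable_state_gb(5e6):.2f} GB of gradient/optimizer state")
```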
Observed behavior
When launching Full Fine-Tuning and LoRA Fine-Tuning using the standalone accelerate_launch.py script, Full Fine-Tuning is faster than LoRA Fine-Tuning (exact same settings) and the memory savings are not as high as one would expect.
These are the results for the `ibm-granite/granite-3b-code-instruct` model using accelerate_launch.py:
And these are the results using `torchrun` for the same model:
Additional context
We have been discussing this internally via Slack for a few weeks so feel free to reach out for more context or data.