
Performance discrepancies between Full Fine-Tuning and LoRA Fine-Tuning #463


Closed
albertoperdomo2 opened this issue Feb 11, 2025 · 3 comments


@albertoperdomo2

Describe the bug

For the past few months I have been testing and assessing the performance of fms-hf-tuning by fine-tuning different large models (Full Fine-Tuning, LoRA Fine-Tuning and QLoRA Fine-Tuning). When launching Full Fine-Tuning and LoRA Fine-Tuning using the standalone accelerate_launch.py script, Full Fine-Tuning is faster than LoRA Fine-Tuning (exact same settings) and the memory savings are not as high as one would expect.

However, launching the training with torchrun rather than accelerate (which is what the launch script uses under the hood) seems to yield better performance in both runtime and GPU memory usage. That said, even with fairly simple settings, only a handful of models are able to finish Full Fine-Tuning without hitting OOM errors.

I have been fine-tuning a wide variety of models:

  • meta-llama/Llama-2-13b-hf
  • meta-llama/Meta-Llama-3.1-70B
  • ibm-granite/granite-3b-code-instruct
  • instructlab/granite-7b-lab
  • ibm-granite/granite-8b-code-base
  • meta-llama/Meta-Llama-3.1-8B
  • mistralai/Mistral-7B-v0.3
  • mistralai/Mixtral-8x7B-v0.1

However, I have focused on ibm-granite/granite-3b-code-instruct since it was the one presenting the largest difference in training runtime.

Platform

All the experiments were executed on a Red Hat OpenShift cluster (4.16) with the RHOAI operator (2.16) enabled. The fms-hf-tuning image version was v2.2.1.

Settings

The general settings for Full Fine-Tuning are:

pvc.size: 2000Gi

dataset_name: alpaca_data.json
dataset_replication: 0.5

gpu: 8

gradient_accumulation_steps: 4
per_device_train_batch_size: 1
peft_method: "none"
max_seq_length: 1024
use_flash_attn: true

with a few additions for LoRA Fine-Tuning:

pvc.size: 2000Gi

dataset_name: alpaca_data.json
dataset_replication: 0.5

gpu: 8

gradient_accumulation_steps: 4
per_device_train_batch_size: 1
peft_method: "lora"
max_seq_length: 1024
use_flash_attn: true

r: 4
lora_alpha: 16
target_modules: ["q_proj", "k_proj"]
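
For context only: outside of the fms-hf-tuning code path, these LoRA settings roughly map to the following peft configuration. This is a minimal sketch assuming the standard transformers/peft APIs, just to make explicit which parameters end up trainable; the torch_dtype is illustrative and not taken from the settings above.

# Minimal sketch (not the fms-hf-tuning code path): what r=4, lora_alpha=16,
# target_modules=["q_proj", "k_proj"] mean in plain transformers/peft terms.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# torch_dtype here is illustrative; the settings above do not pin it down.
model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3b-code-instruct", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=4,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj"],
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
# Only the adapter weights are trainable; for a 3B model with these targets the
# trainable fraction is on the order of 0.1% of the parameters.
peft_model.print_trainable_parameters()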

When running the experiments using the accelerate_launch.py script, my accelerate settings were:

"accelerate_launch_args": {
    "num_processes": 8,
    "num_machines": 1,
    "mixed_precision": "no",
    "dynamo_backend": "no",
    "downcast_bf16": "no",
    "main_training_function": "main",
    "rdzv_backend": "static",
    "same_network": true,
    "tpu_use_sudo": false,
    "config_file": "/app/accelerate_fsdp_defaults.yaml",
    "use_fsdp": true
  }

During testing, I also added target modules covering other attention layers and modified batch sizes (all of this was reported via internal Slack discussions, and it did not alter the initial observations).

When running the experiments using torchrun directly, I used the following command:

python -m torch.distributed.run \
     --nproc_per_node=8 \
     --nnodes=1 \
     --node_rank=0 \
     --master_addr=localhost \
     --master_port=29500 \
     --module tuning.sft_trainer \
     --fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP \
     --fsdp_backward_prefetch=BACKWARD_POST \
     --fsdp_forward_prefetch=False \
     --fsdp_offload_params=False \
     --fsdp_state_dict_type=FULL_STATE_DICT \
     --fsdp_sync_module_states=True \
     --fsdp_use_orig_params=False

Expected behavior

The expected behavior is that, for the same model and the exact same settings, LoRA Fine-Tuning should have a lower training runtime and lower GPU memory usage than Full Fine-Tuning.
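
As a rough sanity check of that expectation, the back-of-envelope sketch below only counts parameter, gradient and optimizer state (it assumes bf16 weights/gradients and fp32 Adam moments, sharded evenly across 8 GPUs, and deliberately ignores activations and temporary buffers, which account for much of the absolute numbers reported below):

# Back-of-envelope only: parameter/gradient/optimizer state, evenly sharded
# across 8 GPUs; activations and buffers are deliberately ignored.
P = 3e9        # ~3B base parameters (granite-3b-code-instruct)
GIB = 2**30
n_gpu = 8

full_ft_state = P * (2 + 2 + 4 + 4)  # bf16 weights + bf16 grads + fp32 Adam m/v
lora_state = P * 2                   # frozen bf16 weights; adapter grads and
                                     # optimizer states are negligible (~0.1%)

print(f"full fine-tuning state ≈ {full_ft_state / GIB:.0f} GiB total, "
      f"{full_ft_state / (n_gpu * GIB):.1f} GiB/GPU")
print(f"LoRA state             ≈ {lora_state / GIB:.0f} GiB total, "
      f"{lora_state / (n_gpu * GIB):.1f} GiB/GPU")

Under these assumptions LoRA should save on the order of 28 GiB of gradient and optimizer state across the node; the rest of the usage reported below presumably comes from activations, temporary buffers, and framework overhead shared by both methods, which narrows the relative gap.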

Observed behavior

When launching Full Fine-Tuning and LoRA Fine-Tuning using the standalone accelerate_launch.py script, Full Fine-Tuning is faster than LoRA Fine-Tuning (exact same settings) and the memory savings are not as high as one would expect.

These are the results for the ibm-granite/granite-3b-code-instruct model using accelerate_launch.py:

  • Parameter count: 3 billion
  • Full Fine-Tuning runtime: 154 s
  • Full Fine-Tuning peak GPU memory used (all GPUs): 228 GiB
  • LoRA Fine-Tuning runtime: 364 s
  • LoRA Fine-Tuning peak GPU memory used (all GPUs): 184 GiB
  • Runtime difference: +57.7 %
  • Peak GPU memory used (all GPUs) difference: -23.9 %

And these are the results using torchrun for the same model:

  • Parameter count: 3 billion
  • Full Fine-Tuning runtime: 169 s
  • Full Fine-Tuning peak GPU memory used (all GPUs): 440 GiB
  • LoRA Fine-Tuning runtime: 112 s
  • LoRA Fine-Tuning peak GPU memory used (all GPUs): 215 GiB
  • Runtime difference: -33 %
  • Peak GPU memory used (all GPUs) difference: -51 %

Additional context

We have been discussing this internally via Slack for a few weeks so feel free to reach out for more context or data.

@kmehant (Collaborator) commented Feb 12, 2025

Adding an experiment from our Slack thread for completeness.

I could not reproduce the problem with the setup below, which suggests the root-cause analysis should look at how accelerate_launch.py is used and at the checkpoint creation time for each of the methods.

Results

Torchrun

Both columns are granite-3b-code-instruct.

Metric              LoRA (q_proj and k_proj)   Non-LoRA
Train Runtime (s)   33.0389                    36.2827
Train Samples/s     24.214                     22.049
Train Steps/s       1.513                      1.378
Train Tokens/s      6198.748                   5644.571
Train Loss          4.2760                     0.3709
Epoch               0.51                       0.51

accelerate launch

Metric              LoRA (q_proj and k_proj)   Non-LoRA
Train Runtime (s)   33.2409                    36.5271
Train Samples/s     24.067                     21.902
Train Steps/s       1.504                      1.369
Train Tokens/s      6161.078                   -
Train Loss          4.2787                     0.3715
Epoch               0.51                       0.51

The above numbers align with the theoretical understanding of throughput (LoRA should be slightly higher than full fine-tuning) and with what was reported in the LoRA paper.

Torchrun command for lora

torchrun --nnodes=1 --node_rank=0 --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint=0.0.0.0:8888 ./tuning/sft_trainer.py --model_name_or_path ibm-granite/granite-3b-code-instruct --output_dir ./train_output --max_steps 50 --save_strategy=no --torch_dtype  bfloat16 --logging_strategy steps --logging_steps 1 --per_device_train_batch_size 2 --max_seq_len 2048 --use_flash_attn true --packing true --gradient_checkpointing false --dataset_text_field "input" --training_data_path /workspace/fms-hf-tuning/tests/artifacts/testdata/jsonl/twitter_complaints_input_output.jsonl --fsdp "hybrid_shard auto_wrap" --fsdp_config ./config.json --include_tokens_per_second --peft_method lora -r 4 --lora_alpha 16 --target_modules q_proj k_proj

Torchrun command for full finetuning

torchrun --nnodes=1 --node_rank=0 --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint=0.0.0.0:8888 ./tuning/sft_trainer.py --model_name_or_path ibm-granite/granite-3b-code-instruct --output_dir ./train_output --max_steps 50 --save_strategy=no --torch_dtype  bfloat16 --logging_strategy steps --logging_steps 1 --per_device_train_batch_size 2 --max_seq_len 2048 --use_flash_attn true --packing true --gradient_checkpointing false --dataset_text_field "input" --training_data_path /workspace/fms-hf-tuning/tests/artifacts/testdata/jsonl/twitter_complaints_input_output.jsonl --fsdp "hybrid_shard auto_wrap" --fsdp_config ./config.json --include_tokens_per_second

Accelerate for lora

accelerate launch \
  --num_processes=8 \
  --dynamo_backend="no" \
  --fsdp_auto_wrap_policy="TRANSFORMER_BASED_WRAP" \
  --fsdp_cpu_ram_efficient_loading="true" \
  --fsdp_forward_prefetch="false" \
  --fsdp_offload_params="false" \
  --fsdp_sharding_strategy="HYBRID_SHARD" \
  --fsdp_state_dict_type="FULL_STATE_DICT" \
  --fsdp_sync_module_states="true" \
  --machine_rank="0" \
  --main_process_ip="127.0.0.1" \
  --main_process_port="29500" \
  --mixed_precision="no" \
  --num_machines="1" \
  --rdzv_backend="static" \
  --same_network \
  --use_fsdp \
  -m tuning.sft_trainer \
  --model_name_or_path ibm-granite/granite-3b-code-instruct --output_dir ./train_output --max_steps 50 --save_strategy=no --torch_dtype  bfloat16 --logging_strategy steps --logging_steps 1 --per_device_train_batch_size 2 --max_seq_len 2048 --use_flash_attn true --packing true --gradient_checkpointing false --dataset_text_field "input" --training_data_path /workspace/fms-hf-tuning/tests/artifacts/testdata/jsonl/twitter_complaints_input_output.jsonl --include_tokens_per_second --peft_method lora -r 4 --lora_alpha 16 --target_modules q_proj k_proj

Accelerate for full finetuning

accelerate launch \
  --num_processes=8 \
  --dynamo_backend="no" \
  --fsdp_auto_wrap_policy="TRANSFORMER_BASED_WRAP" \
  --fsdp_cpu_ram_efficient_loading="true" \
  --fsdp_forward_prefetch="false" \
  --fsdp_offload_params="false" \
  --fsdp_sharding_strategy="HYBRID_SHARD" \
  --fsdp_state_dict_type="FULL_STATE_DICT" \
  --fsdp_sync_module_states="true" \
  --machine_rank="0" \
  --main_process_ip="127.0.0.1" \
  --main_process_port="29500" \
  --mixed_precision="no" \
  --num_machines="1" \
  --rdzv_backend="static" \
  --same_network \
  --use_fsdp \
  -m tuning.sft_trainer \
  --model_name_or_path ibm-granite/granite-3b-code-instruct --output_dir ./train_output --max_steps 50 --save_strategy=no --torch_dtype  bfloat16 --logging_strategy steps --logging_steps 1 --per_device_train_batch_size 2 --max_seq_len 2048 --use_flash_attn true --packing true --gradient_checkpointing false --dataset_text_field "input" --training_data_path /workspace/fms-hf-tuning/tests/artifacts/testdata/jsonl/twitter_complaints_input_output.jsonl --include_tokens_per_second

Setup instructions

fms-hf-tuning version used: v2.2.1. Use a clean conda env.

git clone https://github.com/foundation-model-stack/fms-hf-tuning.git
cd fms-hf-tuning
git fetch --all
git checkout tags/v2.2.1
pip install -e .

The training_data_path files used in the above commands are part of the cloned repo. The fsdp_config file (./config.json in the commands above) can be:

{
        "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
        "fsdp_backward_prefetch_policy": "BACKWARD_PRE",
        "fsdp_cpu_ram_efficient_loading": "True",
        "fsdp_forward_prefetch": "False",
        "fsdp_offload_params": "False",
        "fsdp_state_dict_type": "FULL_STATE_DICT",
        "fsdp_sync_module_states": "True",
        "fsdp_use_orig_params": "False"
    }
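
If it helps reproduction, the snippet below simply writes those values verbatim to ./config.json, the path referenced by --fsdp_config in the commands above (plain convenience, not part of fms-hf-tuning):

import json

fsdp_config = {
    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
    "fsdp_backward_prefetch_policy": "BACKWARD_PRE",
    "fsdp_cpu_ram_efficient_loading": "True",
    "fsdp_forward_prefetch": "False",
    "fsdp_offload_params": "False",
    "fsdp_state_dict_type": "FULL_STATE_DICT",
    "fsdp_sync_module_states": "True",
    "fsdp_use_orig_params": "False",
}

# Write the config where the torchrun commands above expect it
# (--fsdp_config ./config.json).
with open("config.json", "w") as f:
    json.dump(fsdp_config, f, indent=4)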

Infra

GPUs: 8 × A100 80 GB
CUDA: Driver Version 525.105.17, CUDA Version 12.0

Platform: OpenShift

@albertoperdomo2 (Author)

I want to add a few more bits of information based on what I experienced in the past and have observed again recently. When launching with torchrun, for instance, with the following command:

python -m torch.distributed.run \
  --nproc_per_node=8 \
  --nnodes=1 \
  --node_rank=0 \
  --master_addr=localhost \
  --master_port=29500 \
  --module tuning.sft_trainer \
  --fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP \
  --fsdp_backward_prefetch=BACKWARD_PRE \
  --fsdp_forward_prefetch=False \
  --fsdp_offload_params=False \
  --fsdp_state_dict_type=FULL_STATE_DICT \
  --fsdp_sync_module_states=True \
  --fsdp_use_orig_params=False \
  --fsdp_backward_prefetch_policy=BACKWARD_PRE \
  --fsdp_sharding_strategy=1 \
  --fsdp_cpu_ram_efficient_loading=True

Models other than granite-3b-code-instruct fail due to OOM errors; these include granite-8b-code-base and granite-7b-instruct, among others. That being said, granite-3b-code-instruct performance is as expected.

@kmehant @anhuong I would highly appreciate any feedback you could provide on this.

@anhuong (Collaborator) commented Feb 27, 2025

Slack thread discussion
