
Performance discrepancies between Full Fine-Tuning and LoRA Fine-Tuning #463


Closed
albertoperdomo2 opened this issue Feb 11, 2025 · 3 comments


@albertoperdomo2

Describe the bug

For the past few months I have been testing and assessing the performance of fms-hf-tuning by fine-tuning different large models (Full Fine-Tuning, LoRA Fine-Tuning and QLoRA Fine-Tuning). When launching Full Fine-Tuning and LoRA Fine-Tuning using the standalone accelerate_launch.py script, Full Fine-Tuning is faster than LoRA Fine-Tuning (exact same settings) and the memory savings are not as high as one would expect.

However, launching the training with torchrun rather than accelerate (which is what the launch script uses under the hood) seems to yield better performance in both runtime and GPU memory usage. That said, even with fairly simple settings, only a handful of models are able to finish Full Fine-Tuning without hitting OOM errors.

I have been fine-tuning a wide variety of models:

  • meta-llama/Llama-2-13b-hf
  • meta-llama/Meta-Llama-3.1-70B
  • ibm-granite/granite-3b-code-instruct
  • instructlab/granite-7b-lab
  • ibm-granite/granite-8b-code-base
  • meta-llama/Meta-Llama-3.1-8B
  • mistralai/Mistral-7B-v0.3
  • mistralai/Mixtral-8x7B-v0.1

However, I have focused on ibm-granite/granite-3b-code-instruct since it was the one presenting the largest difference in training runtime.

Platform

All the experiments were executed on a Red Hat OpenShift cluster (4.16) with the RHOAI operator (2.16) enabled. The fms-hf-tuning image version was v2.2.1.

Settings

The general settings for Full Fine-Tuning are:

pvc.size: 2000Gi

dataset_name: alpaca_data.json
dataset_replication: 0.5

gpu: 8

gradient_accumulation_steps: 4
per_device_train_batch_size: 1
peft_method: "none"
max_seq_length: 1024
use_flash_attn: true

with a few additions for LoRA Fine-Tuning:

pvc.size: 2000Gi

dataset_name: alpaca_data.json
dataset_replication: 0.5

gpu: 8

gradient_accumulation_steps: 4
per_device_train_batch_size: 1
peft_method: "lora"
max_seq_length: 1024
use_flash_attn: true

r: 4
lora_alpha: 16
target_modules: ["q_proj", "k_proj"]
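
For context only: outside of the fms-hf-tuning code path, these LoRA settings roughly map to the following peft configuration. This is a minimal sketch assuming the standard transformers/peft APIs, just to make explicit which parameters end up trainable; the torch_dtype is illustrative and not taken from the settings above.

# Minimal sketch (not the fms-hf-tuning code path): what r=4, lora_alpha=16,
# target_modules=["q_proj", "k_proj"] mean in plain transformers/peft terms.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# torch_dtype here is illustrative; the settings above do not pin it down.
model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3b-code-instruct", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=4,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj"],
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
# Only the adapter weights are trainable; for a 3B model with these targets the
# trainable fraction is on the order of 0.1% of the parameters.
peft_model.print_trainable_parameters()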

When running the experiments using the accelerate_launch.py script, my accelerate settings were:

"accelerate_launch_args": {
    "num_processes": 8,
    "num_machines": 1,
    "mixed_precision": "no",
    "dynamo_backend": "no",
    "downcast_bf16": "no",
    "main_training_function": "main",
    "rdzv_backend": "static",
    "same_network": true,
    "tpu_use_sudo": false,
    "config_file": "/app/accelerate_fsdp_defaults.yaml",
    "use_fsdp": true
  }

During testing, I also added target modules covering other attention layers and modified batch sizes (all of this was reported via internal Slack discussions, and it did not alter the initial observations).

When running the experiments using torchrun directly, I used the following command:

python -m torch.distributed.run \
     --nproc_per_node=8 \
     --nnodes=1 \
     --node_rank=0 \
     --master_addr=localhost \
     --master_port=29500 \
     --module tuning.sft_trainer \
     --fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP \
     --fsdp_backward_prefetch=BACKWARD_POST \
     --fsdp_forward_prefetch=False \
     --fsdp_offload_params=False \
     --fsdp_state_dict_type=FULL_STATE_DICT \
     --fsdp_sync_module_states=True \
     --fsdp_use_orig_params=False

Expected behavior

The expected behavior is that, for the same model and the exact same settings, LoRA Fine-Tuning should have a lower training runtime and lower GPU memory usage than Full Fine-Tuning.
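
As a rough sanity check of that expectation, the back-of-envelope sketch below only counts parameter, gradient and optimizer state (it assumes bf16 weights/gradients and fp32 Adam moments, sharded evenly across 8 GPUs, and deliberately ignores activations and temporary buffers, which account for much of the absolute numbers reported below):

# Back-of-envelope only: parameter/gradient/optimizer state, evenly sharded
# across 8 GPUs; activations and buffers are deliberately ignored.
P = 3e9        # ~3B base parameters (granite-3b-code-instruct)
GIB = 2**30
n_gpu = 8

full_ft_state = P * (2 + 2 + 4 + 4)  # bf16 weights + bf16 grads + fp32 Adam m/v
lora_state = P * 2                   # frozen bf16 weights; adapter grads and
                                     # optimizer states are negligible (~0.1%)

print(f"full fine-tuning state ≈ {full_ft_state / GIB:.0f} GiB total, "
      f"{full_ft_state / (n_gpu * GIB):.1f} GiB/GPU")
print(f"LoRA state             ≈ {lora_state / GIB:.0f} GiB total, "
      f"{lora_state / (n_gpu * GIB):.1f} GiB/GPU")

Under these assumptions LoRA should save on the order of 28 GiB of gradient and optimizer state across the node; the rest of the usage reported below presumably comes from activations, temporary buffers, and framework overhead shared by both methods, which narrows the relative gap.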

Observed behavior

When launching Full Fine-Tuning and LoRA Fine-Tuning using the standalone accelerate_launch.py script, Full Fine-Tuning is faster than LoRA Fine-Tuning (exact same settings) and the memory savings are not as high as one would expect.

These are the results for the ibm-granite/granite-3b-code-instruct model using accelerate_launch.py:

  • Parameter count: 3 billion
  • Full Fine-Tuning runtime: 154 s
  • Full Fine-Tuning peak GPU memory used (all GPUs): 228 GiB
  • LoRA Fine-Tuning runtime: 364 s
  • LoRA Fine-Tuning peak GPU memory used (all GPUs): 184 GiB
  • Runtime difference: +57.7 %
  • Peak GPU memory used (all GPUs) difference: -23.9 %

And these are the results using torchrun for the same model:

  • Parameter count: 3 billion
  • Full Fine-Tuning runtime: 169 s
  • Full Fine-Tuning peak GPU memory used (all GPUs): 440 GiB
  • LoRA Fine-Tuning runtime: 112 s
  • LoRA Fine-Tuning peak GPU memory used (all GPUs): 215 GiB
  • Runtime difference: -33 %
  • Peak GPU memory used (all GPUs) difference: -51 %

Additional context

We have been discussing this internally via Slack for a few weeks so feel free to reach out for more context or data.

@kmehant (Collaborator) commented Feb 12, 2025

Adding an experiment from our Slack thread for completeness.

I could not reproduce the problem with the setup below, which suggests the root-cause analysis should look at how accelerate_launch.py is used and at the checkpoint creation time for each of the methods.

Results

Torchrun

Both columns are granite-3b-code-instruct.

Metric              LoRA (q_proj and k_proj)   Non-LoRA
Train Runtime (s)   33.0389                    36.2827
Train Samples/s     24.214                     22.049
Train Steps/s       1.513                      1.378
Train Tokens/s      6198.748                   5644.571
Train Loss          4.2760                     0.3709
Epoch               0.51                       0.51

accelerate launch

Metric              LoRA (q_proj and k_proj)   Non-LoRA
Train Runtime (s)   33.2409                    36.5271
Train Samples/s     24.067                     21.902
Train Steps/s       1.504                      1.369
Train Tokens/s      6161.078                   -
Train Loss          4.2787                     0.3715
Epoch               0.51                       0.51

The above numbers align with the theoretical understanding of throughput (LoRA should be slightly higher than full fine-tuning) and with what was reported in the LoRA paper.

Torchrun command for lora

torchrun --nnodes=1 --node_rank=0 --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint=0.0.0.0:8888 ./tuning/sft_trainer.py --model_name_or_path ibm-granite/granite-3b-code-instruct --output_dir ./train_output --max_steps 50 --save_strategy=no --torch_dtype  bfloat16 --logging_strategy steps --logging_steps 1 --per_device_train_batch_size 2 --max_seq_len 2048 --use_flash_attn true --packing true --gradient_checkpointing false --dataset_text_field "input" --training_data_path /workspace/fms-hf-tuning/tests/artifacts/testdata/jsonl/twitter_complaints_input_output.jsonl --fsdp "hybrid_shard auto_wrap" --fsdp_config ./config.json --include_tokens_per_second --peft_method lora -r 4 --lora_alpha 16 --target_modules q_proj k_proj

Torchrun command for full finetuning

torchrun --nnodes=1 --node_rank=0 --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint=0.0.0.0:8888 ./tuning/sft_trainer.py --model_name_or_path ibm-granite/granite-3b-code-instruct --output_dir ./train_output --max_steps 50 --save_strategy=no --torch_dtype  bfloat16 --logging_strategy steps --logging_steps 1 --per_device_train_batch_size 2 --max_seq_len 2048 --use_flash_attn true --packing true --gradient_checkpointing false --dataset_text_field "input" --training_data_path /workspace/fms-hf-tuning/tests/artifacts/testdata/jsonl/twitter_complaints_input_output.jsonl --fsdp "hybrid_shard auto_wrap" --fsdp_config ./config.json --include_tokens_per_second

Accelerate for lora

accelerate launch \
  --num_processes=8 \
  --dynamo_backend="no" \
  --fsdp_auto_wrap_policy="TRANSFORMER_BASED_WRAP" \
  --fsdp_cpu_ram_efficient_loading="true" \
  --fsdp_forward_prefetch="false" \
  --fsdp_offload_params="false" \
  --fsdp_sharding_strategy="HYBRID_SHARD" \
  --fsdp_state_dict_type="FULL_STATE_DICT" \
  --fsdp_sync_module_states="true" \
  --machine_rank="0" \
  --main_process_ip="127.0.0.1" \
  --main_process_port="29500" \
  --mixed_precision="no" \
  --num_machines="1" \
  --rdzv_backend="static" \
  --same_network \
  --use_fsdp \
  -m tuning.sft_trainer \
  --model_name_or_path ibm-granite/granite-3b-code-instruct --output_dir ./train_output --max_steps 50 --save_strategy=no --torch_dtype  bfloat16 --logging_strategy steps --logging_steps 1 --per_device_train_batch_size 2 --max_seq_len 2048 --use_flash_attn true --packing true --gradient_checkpointing false --dataset_text_field "input" --training_data_path /workspace/fms-hf-tuning/tests/artifacts/testdata/jsonl/twitter_complaints_input_output.jsonl --include_tokens_per_second --peft_method lora -r 4 --lora_alpha 16 --target_modules q_proj k_proj

Accelerate for full finetuning

accelerate launch \
  --num_processes=8 \
  --dynamo_backend="no" \
  --fsdp_auto_wrap_policy="TRANSFORMER_BASED_WRAP" \
  --fsdp_cpu_ram_efficient_loading="true" \
  --fsdp_forward_prefetch="false" \
  --fsdp_offload_params="false" \
  --fsdp_sharding_strategy="HYBRID_SHARD" \
  --fsdp_state_dict_type="FULL_STATE_DICT" \
  --fsdp_sync_module_states="true" \
  --machine_rank="0" \
  --main_process_ip="127.0.0.1" \
  --main_process_port="29500" \
  --mixed_precision="no" \
  --num_machines="1" \
  --rdzv_backend="static" \
  --same_network \
  --use_fsdp \
  -m tuning.sft_trainer \
  --model_name_or_path ibm-granite/granite-3b-code-instruct --output_dir ./train_output --max_steps 50 --save_strategy=no --torch_dtype  bfloat16 --logging_strategy steps --logging_steps 1 --per_device_train_batch_size 2 --max_seq_len 2048 --use_flash_attn true --packing true --gradient_checkpointing false --dataset_text_field "input" --training_data_path /workspace/fms-hf-tuning/tests/artifacts/testdata/jsonl/twitter_complaints_input_output.jsonl --include_tokens_per_second

Setup instructions

fms-hf-tuning version used: v2.2.1. Use a clean conda env.

git clone https://github.com/foundation-model-stack/fms-hf-tuning.git
cd fms-hf-tuning
git fetch --all
git checkout tags/v2.2.1
pip install -e .

The training_data_path files used in the above commands are part of the cloned repo. The fsdp_config file (./config.json in the commands above) can be:

{
        "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
        "fsdp_backward_prefetch_policy": "BACKWARD_PRE",
        "fsdp_cpu_ram_efficient_loading": "True",
        "fsdp_forward_prefetch": "False",
        "fsdp_offload_params": "False",
        "fsdp_state_dict_type": "FULL_STATE_DICT",
        "fsdp_sync_module_states": "True",
        "fsdp_use_orig_params": "False"
    }
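
If it helps reproduction, the snippet below simply writes those values verbatim to ./config.json, the path referenced by --fsdp_config in the commands above (plain convenience, not part of fms-hf-tuning):

import json

fsdp_config = {
    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
    "fsdp_backward_prefetch_policy": "BACKWARD_PRE",
    "fsdp_cpu_ram_efficient_loading": "True",
    "fsdp_forward_prefetch": "False",
    "fsdp_offload_params": "False",
    "fsdp_state_dict_type": "FULL_STATE_DICT",
    "fsdp_sync_module_states": "True",
    "fsdp_use_orig_params": "False",
}

# Write the config where the torchrun commands above expect it
# (--fsdp_config ./config.json).
with open("config.json", "w") as f:
    json.dump(fsdp_config, f, indent=4)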

Infra

GPUs: 8 × A100 80 GB
CUDA: Driver Version 525.105.17, CUDA Version 12.0

Platform: OpenShift

@albertoperdomo2 (Author)

I want to add a few more bits of information based on what I experienced in the past and have observed again recently. When launching with torchrun, for instance, with the following command:

python -m torch.distributed.run \
  --nproc_per_node=8 \
  --nnodes=1 \
  --node_rank=0 \
  --master_addr=localhost \
  --master_port=29500 \
  --module tuning.sft_trainer \
  --fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP \
  --fsdp_backward_prefetch=BACKWARD_PRE \
  --fsdp_forward_prefetch=False \
  --fsdp_offload_params=False \
  --fsdp_state_dict_type=FULL_STATE_DICT \
  --fsdp_sync_module_states=True \
  --fsdp_use_orig_params=False \
  --fsdp_backward_prefetch_policy=BACKWARD_PRE \
  --fsdp_sharding_strategy=1 \
  --fsdp_cpu_ram_efficient_loading=True

Models other than granite-3b-code-instruct fail due to OOM errors; these include granite-8b-code-base and granite-7b-instruct, among others. That being said, granite-3b-code-instruct performance is as expected.

@kmehant @anhuong I would highly appreciate any feedback you could provide on this.

@anhuong (Collaborator) commented Feb 27, 2025

Slack thread discussion
