LLM Training Guide - Understanding the Code

A comprehensive guide to understanding how Large Language Models (LLMs) are trained using Python and Hugging Face Transformers.


🎯 Introduction

Training a neural network is like teaching a student:

  • Forward pass: The student answers a question (model makes a prediction)
  • Loss: How wrong the answer is (difference between prediction and correct answer)
  • Backward pass: Figure out what went wrong and how to improve
  • Update: Adjust the student's knowledge (update model weights)

This guide will walk you through each concept with simple explanations and analogies.


🔑 Key Concepts Explained

1. Forward Pass

outputs = model(input_ids=input_ids, labels=input_ids)

What it does: The model reads input text and tries to predict what comes next.

Example: Reading "The cat sat on the..." and trying to guess "mat"

Analogy: A student reading a question and writing their answer.
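
Putting it together, here is a minimal sketch of one forward pass, assuming the gpt2 checkpoint and the Hugging Face classes used later in this guide:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Tokenize a sentence; return_tensors="pt" gives PyTorch tensors
input_ids = tokenizer("The cat sat on the mat", return_tensors="pt").input_ids

# Passing labels=input_ids asks the model to score its own next-token predictions
outputs = model(input_ids=input_ids, labels=input_ids)
print(outputs.loss)  # a single number measuring how wrong the predictions were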


2. Loss (Measuring Error)

loss = outputs.loss
total_loss += loss.item()

What it is: A number that tells us how wrong the model's predictions are.

  • High loss = Model is very wrong (bad predictions)
  • Low loss = Model is close to correct (good predictions)

Example: If the model predicts "dog" but the correct word is "mat", the loss would be high.

Why we track it: To see if the model is improving over time.
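
As a rough sketch (assuming total_loss is accumulated as above and num_batches counts how many batches were seen), the running loss is usually turned into an epoch average, sometimes with perplexity as a companion metric:

import math

avg_loss = total_loss / num_batches   # num_batches: assumed counter of batches seen this epoch
perplexity = math.exp(avg_loss)       # common companion metric for language models
print(f"avg loss: {avg_loss:.4f}, perplexity: {perplexity:.2f}")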


3. Backward Pass

loss.backward()

What it does: This is the "learning" step! It calculates:

  • Which parts of the model contributed to the error
  • How much to adjust each parameter (weight) in the model

Analogy: If a student got a math problem wrong, the backward pass identifies which concepts they misunderstood (algebra vs arithmetic) so they know what to study.

Technical detail: Uses calculus (chain rule) to compute gradients - the direction and amount each weight should change.
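
A tiny sketch of what loss.backward() leaves behind: afterwards, every trainable parameter carries a .grad tensor describing how it should change. The inspection below is illustrative only:

loss.backward()  # fills param.grad for every parameter that requires gradients

for name, param in model.named_parameters():
    if param.grad is not None:
        print(name, param.grad.norm().item())  # size of the suggested adjustment
        break                                  # just peek at the first parameter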


4. Optimizer Step

optimizer.step()
optimizer.zero_grad()

What it does:

  • optimizer.step(): Actually updates the model's weights based on what the backward pass calculated
  • optimizer.zero_grad(): Clears old calculations so they don't interfere with the next batch

Analogy: The student updating their understanding after seeing the correct answer, then erasing their scratch paper for the next problem.
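
One common way to wire this up, as a sketch (assuming torch's AdamW, the 5e-5 learning rate used later in this guide, and a DataLoader named train_loader):

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

for batch in train_loader:               # train_loader: an assumed DataLoader of tokenized batches
    outputs = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
    outputs.loss.backward()              # compute gradients
    optimizer.step()                     # apply the weight updates
    optimizer.zero_grad()                # clear gradients for the next batch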


5. Gradient Accumulation

if (step + 1) % GRADIENT_ACCUMULATION_STEPS == 0:
    optimizer.step()

What it does: Instead of updating after every example, we collect (accumulate) errors from multiple examples before updating.

Why use it:

  • Limited GPU memory (can't fit large batches)
  • Want the model to learn from more examples at once

Analogy: Instead of studying 1 flashcard and immediately revising your notes, you study 4 flashcards, then revise everything at once.
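
A sketch of the full accumulation pattern, assuming GRADIENT_ACCUMULATION_STEPS = 4. Note that the loss is divided by the accumulation count so the combined update has roughly the same scale as a single normal batch:

GRADIENT_ACCUMULATION_STEPS = 4

for step, batch in enumerate(train_loader):
    outputs = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
    loss = outputs.loss / GRADIENT_ACCUMULATION_STEPS  # scale down each mini-batch's loss
    loss.backward()                                    # gradients add up across batches

    if (step + 1) % GRADIENT_ACCUMULATION_STEPS == 0:
        optimizer.step()                               # one update for every 4 batches
        optimizer.zero_grad()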


6. Mixed Precision Training

with autocast():
    outputs = model(input_ids=input_ids, labels=input_ids)

What it does: Uses 16-bit numbers (FP16) instead of 32-bit (FP32) for most calculations.

Benefits:

  • 2-3x faster training
  • Uses less GPU memory
  • Still maintains accuracy for critical operations

Analogy: Using 3.14 instead of 3.14159265359 for π in most calculations - close enough and faster!
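
A sketch of the usual PyTorch pattern (torch.cuda.amp), assuming a CUDA device and an optimizer already created; the GradScaler keeps tiny FP16 gradients from rounding to zero:

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

with autocast():                          # run the forward pass in FP16 where it is safe
    outputs = model(input_ids=input_ids, labels=input_ids)

scaler.scale(outputs.loss).backward()     # scale the loss to protect tiny gradients
scaler.step(optimizer)                    # unscales the gradients, then updates the weights
scaler.update()                           # adjusts the scale factor for the next step
optimizer.zero_grad()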


7. Gradient Clipping

torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)

What it does: Limits how large weight adjustments can be, preventing the model from learning too aggressively.

Why needed: Sometimes gradients become too large, causing the model to "explode" and learn incorrectly.

Analogy: If you got one test question wrong, you don't throw away everything you know - you make small adjustments.
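
A small sketch of where clipping fits, assuming MAX_GRAD_NORM = 1.0 (a common default): it runs after loss.backward() and before optimizer.step():

import torch

MAX_GRAD_NORM = 1.0

loss.backward()                                                     # compute gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)   # cap their overall size
optimizer.step()                                                    # then apply the update
optimizer.zero_grad()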


8. Learning Rate Scheduler

scheduler = get_linear_schedule_with_warmup(...)
scheduler.step()

What it does: Adjusts how big the weight updates are during training.

The schedule:

  1. Warmup: Start with small updates to avoid breaking the pre-trained model
  2. Normal training: Use the full learning rate
  3. Decay: Gradually reduce updates as the model gets better (fine-tuning)

Learning rate guidelines:

  • Too high: Model learns too fast and overshoots (like running past your destination)
  • Too low: Model learns too slowly (like crawling)

Analogy: When learning to drive, you start slow (warmup), then drive normally, then slow down when parking (decay).
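
A sketch of how the scheduler is typically created, assuming a DataLoader named train_loader and an illustrative warmup over the first 10% of steps:

from transformers import get_linear_schedule_with_warmup

total_steps = len(train_loader) * NUM_EPOCHS          # one scheduler step per optimizer step
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),          # e.g. warm up over the first 10% of steps
    num_training_steps=total_steps,
)

# after every optimizer.step():
scheduler.step()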


🔄 Step-by-Step Training Process

What Happens in ONE Training Step

# STEP 1: GET DATA
input_ids = batch['input_ids'].to(device)
# Move text data to GPU: "The cat sat on the mat"

# STEP 2: FORWARD PASS - Model makes predictions
outputs = model(input_ids=input_ids, labels=input_ids)
# Model tries to predict each next word
# Input:  "The cat sat on the"
# Predict: "cat sat on the mat"
# Correct: "cat sat on the mat"

# STEP 3: CALCULATE LOSS - How wrong was the model?
loss = outputs.loss
# Compares predictions vs correct answers
# Example: loss = 0.5 (lower is better)

# STEP 4: BACKWARD PASS - Calculate how to improve
loss.backward()
# Figures out: "If I adjust weight A by -0.01 and weight B by +0.03,
# the model will predict better next time"

# STEP 5: UPDATE WEIGHTS - Apply the improvements
optimizer.step()
# Actually changes the model's internal parameters

# STEP 6: RESET - Prepare for next batch
optimizer.zero_grad()
# Clear old calculations so they don't interfere

Complete Training Flow

For each EPOCH (complete pass through all data):
    For each BATCH (small group of examples):
        1. Load batch → GPU
        2. Forward pass → Get predictions
        3. Calculate loss → See how wrong
        4. Backward pass → Calculate adjustments
        5. Update weights → Improve model
        6. Track progress → Log loss
    
    After each epoch:
        ✓ Evaluate on validation data
        ✓ Save checkpoint if best model
        ✓ Check if model is improving

🚀 Installation & Setup

Prerequisites

  • Python 3.8 or higher
  • CUDA-capable GPU (recommended) or CPU
  • 8GB+ RAM (16GB+ recommended)

Install Dependencies

pip install torch transformers tqdm numpy

Or install specific versions:

pip install torch==2.0.0 transformers==4.30.0 tqdm==4.65.0 numpy==1.24.0

📖 Usage Guide

Step 1: Prepare Your Data

The script needs text data for training. You have two options:

Option A: Use Sample Data (for testing)

The script includes sample data by default. Just run it!

Option B: Use Your Own Data

Create a text file with one example per line:

Machine learning is a subset of artificial intelligence.
Natural language processing enables computers to understand human language.
Deep learning uses neural networks with multiple layers.

Then modify the script:

# In the train() function, replace:
all_texts = load_sample_data()

# With:
all_texts = load_data_from_file("path/to/your/data.txt")
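
The script's own loader may differ, but a minimal, hypothetical load_data_from_file could look like this (one training example per line, blank lines skipped):

def load_data_from_file(path):
    """Read one training example per line, skipping empty lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

all_texts = load_data_from_file("path/to/your/data.txt")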

Step 2: Configure Training Parameters

Edit the TrainingConfig class in the script:

class TrainingConfig:
    # Model selection
    MODEL_NAME = "gpt2"  # Options: gpt2, gpt2-medium, gpt2-large
    
    # Training parameters
    BATCH_SIZE = 4           # Reduce if out of memory
    NUM_EPOCHS = 3           # How many times to see all data
    LEARNING_RATE = 5e-5     # How fast to learn
    MAX_LENGTH = 512         # Maximum text length
    
    # System
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

Memory guidelines:

  • GPU with 6GB VRAM: BATCH_SIZE = 2, gpt2
  • GPU with 8GB VRAM: BATCH_SIZE = 4, gpt2
  • GPU with 12GB VRAM: BATCH_SIZE = 8, gpt2-medium
  • GPU with 24GB VRAM: BATCH_SIZE = 16, gpt2-large
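
If memory forces a small BATCH_SIZE, gradient accumulation (see concept 5 above) restores a larger effective batch. A quick illustration using the script's config names:

# The model effectively learns from BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS examples per update.
# e.g. BATCH_SIZE = 2 with GRADIENT_ACCUMULATION_STEPS = 8 behaves like a batch of 16.
effective_batch_size = BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS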

Step 3: Run Training

python local/llm-training-script.py

The script will:

  1. Load the pre-trained model
  2. Prepare your data
  3. Train for the specified epochs
  4. Save checkpoints periodically
  5. Save the final model
  6. Generate sample text to test

Step 4: Use Your Trained Model

After training, use the model for text generation:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load your trained model
model = GPT2LMHeadModel.from_pretrained("./trained_model")
tokenizer = GPT2Tokenizer.from_pretrained("./trained_model")

# Generate text
prompt = "Machine learning is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_length=100)
text = tokenizer.decode(output[0], skip_special_tokens=True)

print(text)
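
By default generate() picks the most likely token every time, which can get repetitive. Here is a sketch of a sampling variant (all standard generate() arguments; the exact values are illustrative):

output = model.generate(
    input_ids,
    max_length=100,
    do_sample=True,        # sample instead of always picking the most likely token
    top_k=50,              # only consider the 50 most likely next tokens
    top_p=0.95,            # ...within the smallest set covering 95% of the probability
    temperature=0.8,       # below 1.0 makes the output slightly more focused
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS to avoid warnings
)
print(tokenizer.decode(output[0], skip_special_tokens=True))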

📊 Understanding the Output

During Training

You'll see output like this:

================================================================================
Epoch 1/3
================================================================================
Training: 100%|████████████| 250/250 [10:30<00:00, loss=2.5, lr=0.00005]

Epoch 1 Results:
  Train Loss: 2.4532
  Val Loss: 2.3124
  ✓ New best model! (Previous best: inf)

Checkpoint saved: ./checkpoints/checkpoint_epoch1_step0.pt

What to Look For

  • Loss going down ✅: the model is learning. Action: keep training.
  • Loss staying flat ⚠️: the model has stopped improving. Action: stop training or adjust the learning rate.
  • Loss going up ❌: something is wrong. Action: reduce the learning rate or check your data.
  • Train loss << Val loss ❌: overfitting. Action: reduce the number of epochs or add regularization.

Good Training Example

Epoch 1: Train=2.5, Val=2.3  ✓ Learning
Epoch 2: Train=2.1, Val=2.0  ✓ Still improving
Epoch 3: Train=1.9, Val=1.8  ✓ Great progress!

Overfitting Example

Epoch 1: Train=2.5, Val=2.3  ✓ Learning
Epoch 2: Train=1.8, Val=2.1  ⚠️ Val getting worse
Epoch 3: Train=1.2, Val=2.4  ❌ Definitely overfitting!
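
One common guard against this pattern is early stopping: quit as soon as validation loss stops improving for a few epochs in a row. A sketch, assuming hypothetical helpers train_one_epoch and evaluate and a patience of 2:

patience, bad_epochs = 2, 0
best_val_loss = float("inf")

for epoch in range(NUM_EPOCHS):
    train_one_epoch(model, train_loader)              # assumed helper for the inner training loop
    val_loss = evaluate(model, val_loader)            # assumed helper returning average validation loss

    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        model.save_pretrained("./checkpoints/best")   # keep the best model so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                    # e.g. stop after 2 bad epochs in a row
            print("Early stopping: validation loss stopped improving")
            break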

🛠️ Troubleshooting

Out of Memory Error

RuntimeError: CUDA out of memory

Solutions:

  1. Reduce BATCH_SIZE (try 2 or 1)
  2. Reduce MAX_LENGTH (try 256 or 128)
  3. Use a smaller model (gpt2 instead of gpt2-medium)
  4. Increase GRADIENT_ACCUMULATION_STEPS

Loss is NaN or Inf

Epoch 1: Train Loss: nan

Solutions:

  1. Reduce LEARNING_RATE (try 1e-5 or 5e-6)
  2. Enable gradient clipping (should be on by default)
  3. Check your data for errors or corrupted text
  4. If using mixed precision, make sure the gradient scaler is enabled, or try turning mixed precision off (FP16 overflow can produce NaN)

Training is Too Slow

Solutions:

  1. Enable mixed precision: USE_MIXED_PRECISION = True
  2. Increase BATCH_SIZE if you have memory
  3. Reduce MAX_LENGTH if texts are short anyway
  4. Use a GPU if you're on CPU
  5. Reduce NUM_WORKERS if CPU is bottleneck

Model Not Improving

Solutions:

  1. Train for more epochs
  2. Increase LEARNING_RATE (try 5e-5 or 1e-4)
  3. Check that you have enough training data (at least ~1,000 examples)
  4. Verify data quality and variety
  5. Try a different model architecture

📈 Quick Reference: Concept Analogies

  • Model: the student's brain
  • Forward pass: answering a question on a test
  • Loss: points lost on the test
  • Backward pass: reviewing what you got wrong
  • Optimizer: updating your study notes
  • Epoch: going through all the flashcards once
  • Batch: a handful of flashcards at a time
  • Learning rate: how quickly you change your understanding
  • Overfitting: memorizing answers instead of understanding concepts
  • Validation set: a practice test with new questions
  • Checkpoint: saving your progress in a video game


🎓 Key Takeaways

  1. Training is iterative: The model learns by making predictions, seeing errors, and adjusting
  2. Loss goes down = learning: Watch the loss metrics to know if training is working
  3. Start small: Use small models and datasets to experiment before scaling up
  4. Validation is crucial: Always check performance on unseen data
  5. Patience is key: Training takes time, and models improve gradually

💡 Tips for Success

DO:

  • Start with a pre-trained model (fine-tuning is easier)
  • Monitor both training and validation loss
  • Save checkpoints frequently
  • Test on simple examples first
  • Use a GPU if possible

DON'T:

  • Train for too many epochs (causes overfitting)
  • Use a learning rate that's too high
  • Ignore validation loss increases
  • Train on data that's too different from what you'll use
  • Skip data preprocessing and cleaning

📞 Support

If you have questions or run into issues:

  1. Check the Troubleshooting section
  2. Review the Hugging Face Forums
  3. Search Stack Overflow
  4. Read the code comments for detailed explanations

📄 License

This code is provided for educational purposes. The models (GPT-2, etc.) have their own licenses from their creators.


Happy Training! 🚀

Remember: Every expert was once a beginner. Take your time understanding these concepts, and don't hesitate to experiment!
