A comprehensive guide to understanding how Large Language Models (LLMs) are trained using Python and Hugging Face Transformers.
- Introduction
- Key Concepts Explained
- Step-by-Step Training Process
- Installation & Setup
- Usage Guide
- Understanding the Output
- Troubleshooting
- Further Learning
Training a neural network is like teaching a student:
- Forward pass: The student answers a question (model makes a prediction)
- Loss: How wrong the answer is (difference between prediction and correct answer)
- Backward pass: Figure out what went wrong and how to improve
- Update: Adjust the student's knowledge (update model weights)
This guide will walk you through each concept with simple explanations and analogies.
```python
outputs = model(input_ids=input_ids, labels=input_ids)
```

What it does: The model reads the input text and tries to predict what comes next.
Example: Reading "The cat sat on the..." and trying to guess "mat"
Analogy: A student reading a question and writing their answer.
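To see the forward pass end to end, here is a minimal, self-contained sketch (it assumes `torch` and `transformers` are installed and downloads the small `gpt2` checkpoint on first run):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load a small pre-trained model and its tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Turn text into token IDs the model understands
input_ids = tokenizer.encode("The cat sat on the", return_tensors="pt")

# Forward pass: passing labels=input_ids makes the model score its own
# next-word predictions against the actual next words
with torch.no_grad():
    outputs = model(input_ids=input_ids, labels=input_ids)

print(outputs.loss)          # how wrong the predictions were, on average
print(outputs.logits.shape)  # (batch, sequence_length, vocab_size) prediction scores
```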
```python
loss = outputs.loss
total_loss += loss.item()
```

What it is: A number that tells us how wrong the model's predictions are.
- High loss = Model is very wrong (bad predictions)
- Low loss = Model is close to correct (good predictions)
Example: If the model predicts "dog" but the correct word is "mat", the loss would be high.
Why we track it: To see if the model is improving over time.
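For intuition, this tiny sketch (not part of the training script) shows how cross-entropy, the loss used for next-word prediction, gives a small number for a confident correct guess and a large one for a confident wrong guess:

```python
import torch
import torch.nn.functional as F

# Pretend vocabulary: 0="dog", 1="mat", 2="tree"; the correct next word is "mat" (index 1)
target = torch.tensor([1])

# Confident, correct prediction: most of the score goes to "mat"
good_logits = torch.tensor([[0.1, 5.0, 0.1]])
# Confident, wrong prediction: most of the score goes to "dog"
bad_logits = torch.tensor([[5.0, 0.1, 0.1]])

print(F.cross_entropy(good_logits, target).item())  # small loss (close to correct)
print(F.cross_entropy(bad_logits, target).item())   # large loss (very wrong)
```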
```python
loss.backward()
```

What it does: This is the "learning" step! It calculates:
- Which parts of the model contributed to the error
- How much to adjust each parameter (weight) in the model
Analogy: If a student got a math problem wrong, the backward pass identifies which concepts they misunderstood (algebra vs arithmetic) so they know what to study.
Technical detail: Uses calculus (chain rule) to compute gradients - the direction and amount each weight should change.
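To watch the chain rule at work without a full language model, here is a minimal PyTorch sketch where the gradient can be checked by hand:

```python
import torch

# A single learnable weight, starting at 3.0
w = torch.tensor(3.0, requires_grad=True)

# Forward pass: a toy "prediction" and a squared-error loss against the target 10
prediction = w * 2              # prediction = 6
loss = (prediction - 10) ** 2   # loss = 16

# Backward pass: autograd applies the chain rule
# d(loss)/dw = 2 * (prediction - 10) * 2 = 2 * (-4) * 2 = -16
loss.backward()
print(w.grad)  # tensor(-16.) — increase w to reduce the loss
```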
```python
optimizer.step()
optimizer.zero_grad()
```

What it does:
- `optimizer.step()`: actually updates the model's weights based on what the backward pass calculated
- `optimizer.zero_grad()`: clears the old calculations (gradients) so they don't interfere with the next batch
Analogy: The student updating their understanding after seeing the correct answer, then erasing their scratch paper for the next problem.
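To put `step()` and `zero_grad()` in context, here is a minimal sketch of one complete update on a toy stand-in model (the `AdamW` optimizer is an assumption here, matching the usual choice for fine-tuning; names like `toy_model` are illustrative):

```python
import torch
from torch.optim import AdamW

# A tiny stand-in for the real model: one linear layer
toy_model = torch.nn.Linear(4, 1)
optimizer = AdamW(toy_model.parameters(), lr=5e-5)

x = torch.randn(8, 4)       # a batch of 8 fake examples
target = torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(toy_model(x), target)
loss.backward()        # compute gradients
optimizer.step()       # apply the weight updates
optimizer.zero_grad()  # clear gradients before the next batch
```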
```python
if (step + 1) % GRADIENT_ACCUMULATION_STEPS == 0:
    optimizer.step()
```

What it does: Instead of updating after every example, we collect (accumulate) errors from multiple examples before updating.
Why use it:
- Limited GPU memory (can't fit large batches)
- Want the model to learn from more examples at once
Analogy: Instead of studying 1 flashcard and immediately revising your notes, you study 4 flashcards, then revise everything at once.
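Here is a hedged sketch of how the accumulation loop usually fits together; it uses random token IDs purely so the snippet runs on its own, and it divides the loss by the number of accumulation steps so the combined update has the same scale as one large batch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import GPT2LMHeadModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Fake token data just to make the loop runnable: 16 sequences of 32 token IDs
fake_ids = torch.randint(0, model.config.vocab_size, (16, 32))
dataloader = DataLoader(TensorDataset(fake_ids), batch_size=2)

GRADIENT_ACCUMULATION_STEPS = 4  # pretend the GPU only fits 1/4 of the desired batch

model.train()
for step, (input_ids,) in enumerate(dataloader):
    input_ids = input_ids.to(device)
    outputs = model(input_ids=input_ids, labels=input_ids)

    # Scale the loss so several small batches add up to one "normal" update
    loss = outputs.loss / GRADIENT_ACCUMULATION_STEPS
    loss.backward()  # gradients accumulate until we explicitly clear them

    # Update the weights only every GRADIENT_ACCUMULATION_STEPS batches
    if (step + 1) % GRADIENT_ACCUMULATION_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```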
```python
with autocast():
    outputs = model(input_ids=input_ids)
```

What it does: Uses 16-bit numbers (FP16) instead of 32-bit (FP32) for most calculations.
Benefits:
- 2-3x faster training
- Uses less GPU memory
- Still maintains accuracy for critical operations
Analogy: Using 3.14 instead of 3.14159265359 for π in most calculations - close enough and faster!
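The snippet above shows only `autocast`; full mixed-precision training usually pairs it with a `GradScaler` so tiny FP16 gradients don't underflow to zero. A minimal sketch on a stand-in model, falling back to FP32 when no GPU is available:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

use_amp = torch.cuda.is_available()   # FP16 autocast needs a GPU; fall back gracefully on CPU
device = "cuda" if use_amp else "cpu"

model = torch.nn.Linear(128, 128).to(device)   # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scaler = GradScaler(enabled=use_amp)           # rescales the loss so small FP16 gradients survive

x = torch.randn(4, 128, device=device)
target = torch.randn(4, 128, device=device)

with autocast(enabled=use_amp):                # run the forward pass in FP16 where it is safe
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()  # backward pass on the scaled loss
scaler.step(optimizer)         # unscales gradients, skips the step if they overflowed
scaler.update()                # adjusts the scale factor for the next iteration
optimizer.zero_grad()
```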
```python
torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
```

What it does: Limits how large the gradients (and therefore the weight adjustments) can be, preventing the model from learning too aggressively.
Why needed: Sometimes gradients become too large, causing the model to "explode" and learn incorrectly.
Analogy: If you got one test question wrong, you don't throw away everything you know - you make small adjustments.
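The detail that matters is ordering: clip after `loss.backward()` has produced the gradients, but before `optimizer.step()` uses them. A minimal sketch on a stand-in model:

```python
import torch

MAX_GRAD_NORM = 1.0

model = torch.nn.Linear(16, 1)                 # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

x, target = torch.randn(4, 16), torch.randn(4, 1)
loss = torch.nn.functional.mse_loss(model(x), target)

loss.backward()                                # 1) compute gradients
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
print(f"gradient norm before clipping: {total_norm.item():.3f}")  # 2) cap their overall size
optimizer.step()                               # 3) apply the (possibly clipped) update
optimizer.zero_grad()
```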
```python
scheduler = get_linear_schedule_with_warmup(...)
scheduler.step()
```

What it does: Adjusts how big the weight updates are during training.
The schedule:
- Warmup: Start with small updates to avoid breaking the pre-trained model
- Normal training: Use the full learning rate
- Decay: Gradually reduce updates as the model gets better (fine-tuning)
Learning rate guidelines:
- Too high: Model learns too fast and overshoots (like running past your destination)
- Too low: Model learns too slowly (like crawling)
Analogy: When learning to drive, you start slow (warmup), then drive normally, then slow down when parking (decay).
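Here is a minimal sketch of how the warmup schedule is typically wired up with `get_linear_schedule_with_warmup`; the step counts are made-up numbers for illustration (in the real script the total would be roughly `len(dataloader) * NUM_EPOCHS`):

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 8)                  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

total_steps = 100    # normally: len(dataloader) * NUM_EPOCHS
warmup_steps = 10    # small updates for the first 10% of training

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)

# Watch the learning rate warm up, then decay linearly toward zero
for step in range(total_steps):
    optimizer.step()     # a real loop would do the forward/backward pass first
    scheduler.step()     # advance the schedule once per optimizer step
    if step in (0, 9, 50, 99):
        print(step, scheduler.get_last_lr()[0])
```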
```python
# STEP 1: GET DATA
input_ids = batch['input_ids'].to(device)
# Move text data to GPU: "The cat sat on the mat"

# STEP 2: FORWARD PASS - Model makes predictions
outputs = model(input_ids=input_ids, labels=input_ids)
# Model tries to predict each next word
# Input:   "The cat sat on the"
# Predict: "cat sat on the mat"
# Correct: "cat sat on the mat"

# STEP 3: CALCULATE LOSS - How wrong was the model?
loss = outputs.loss
# Compares predictions vs correct answers
# Example: loss = 0.5 (lower is better)

# STEP 4: BACKWARD PASS - Calculate how to improve
loss.backward()
# Figures out: "If I adjust weight A by -0.01 and weight B by +0.03,
# the model will predict better next time"

# STEP 5: UPDATE WEIGHTS - Apply the improvements
optimizer.step()
# Actually changes the model's internal parameters

# STEP 6: RESET - Prepare for next batch
optimizer.zero_grad()
# Clear old calculations so they don't interfere
```

The overall training loop repeats these steps:

```text
For each EPOCH (complete pass through all data):
    For each BATCH (small group of examples):
        1. Load batch      → GPU
        2. Forward pass    → Get predictions
        3. Calculate loss  → See how wrong
        4. Backward pass   → Calculate adjustments
        5. Update weights  → Improve model
        6. Track progress  → Log loss

    After each epoch:
        ✓ Evaluate on validation data
        ✓ Save checkpoint if best model
        ✓ Check if model is improving
```
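The "after each epoch" bookkeeping often looks something like the sketch below; the function names (`evaluate`, `save_if_best`) and the checkpoint path are illustrative, not necessarily what the script uses:

```python
import torch

def evaluate(model, val_loader, device):
    """Average loss on the validation set, with gradient tracking switched off."""
    model.eval()
    total_loss, num_batches = 0.0, 0
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch["input_ids"].to(device)
            outputs = model(input_ids=input_ids, labels=input_ids)
            total_loss += outputs.loss.item()
            num_batches += 1
    model.train()
    return total_loss / max(num_batches, 1)

def save_if_best(model, val_loss, best_val_loss, path):
    """Keep a checkpoint only when validation loss improves; return the new best value."""
    if val_loss < best_val_loss:
        torch.save(model.state_dict(), path)
        print(f"✓ New best model! (Previous best: {best_val_loss})")
        return val_loss
    return best_val_loss
```

Inside the epoch loop you would call `evaluate(...)` once training finishes for the epoch, then pass its result to `save_if_best(...)`.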
- Python 3.8 or higher
- CUDA-capable GPU (recommended) or CPU
- 8GB+ RAM (16GB+ recommended)
```bash
pip install torch transformers tqdm numpy
```

Or install specific versions:

```bash
pip install torch==2.0.0 transformers==4.30.0 tqdm==4.65.0 numpy==1.24.0
```

The script needs text data for training. You have two options:
The script includes sample data by default. Just run it!
Create a text file with one example per line:

```text
Machine learning is a subset of artificial intelligence.
Natural language processing enables computers to understand human language.
Deep learning uses neural networks with multiple layers.
```
Then modify the script:
```python
# In the train() function, replace:
all_texts = load_sample_data()

# With:
all_texts = load_data_from_file("path/to/your/data.txt")
```
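If your copy of the script doesn't already define `load_data_from_file`, a minimal version (assuming one training example per line, as above) could look like this:

```python
def load_data_from_file(path):
    """Read one training example per line, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```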
Edit the `TrainingConfig` class in the script:

```python
class TrainingConfig:
    # Model selection
    MODEL_NAME = "gpt2"       # Options: gpt2, gpt2-medium, gpt2-large

    # Training parameters
    BATCH_SIZE = 4            # Reduce if out of memory
    NUM_EPOCHS = 3            # How many times to see all data
    LEARNING_RATE = 5e-5      # How fast to learn
    MAX_LENGTH = 512          # Maximum text length

    # System
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
```

Memory guidelines:

- GPU with 6GB VRAM: `BATCH_SIZE = 2`, `gpt2`
- GPU with 8GB VRAM: `BATCH_SIZE = 4`, `gpt2`
- GPU with 12GB VRAM: `BATCH_SIZE = 8`, `gpt2-medium`
- GPU with 24GB VRAM: `BATCH_SIZE = 16`, `gpt2-large`
```bash
python local/llm-training-script.py
```

The script will:
- Load the pre-trained model
- Prepare your data
- Train for the specified epochs
- Save checkpoints periodically
- Save the final model
- Generate sample text to test
After training, use the model for text generation:
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load your trained model
model = GPT2LMHeadModel.from_pretrained("./trained_model")
tokenizer = GPT2Tokenizer.from_pretrained("./trained_model")

# Generate text
prompt = "Machine learning is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_length=100)
text = tokenizer.decode(output[0], skip_special_tokens=True)
print(text)
```

During training, you'll see output like this:
```text
================================================================================
Epoch 1/3
================================================================================
Training: 100%|████████████| 250/250 [10:30<00:00, loss=2.5, lr=0.00005]

Epoch 1 Results:
  Train Loss: 2.4532
  Val Loss: 2.3124
  ✓ New best model! (Previous best: inf)
  Checkpoint saved: ./checkpoints/checkpoint_epoch1_step0.pt
```
| Indicator | Meaning | What to Do |
|---|---|---|
| Loss going down | ✅ Model is learning | Continue training |
| Loss staying flat | ⚠️ Model has stopped improving | Stop training or adjust learning rate |
| Loss going up | ❌ Something is wrong | Reduce learning rate or check data |
| Train loss << Val loss | ❌ Overfitting | Reduce epochs or add regularization |
Example of healthy training:

```text
Epoch 1: Train=2.5, Val=2.3   ✓ Learning
Epoch 2: Train=2.1, Val=2.0   ✓ Still improving
Epoch 3: Train=1.9, Val=1.8   ✓ Great progress!
```

Example of overfitting:

```text
Epoch 1: Train=2.5, Val=2.3   ✓ Learning
Epoch 2: Train=1.8, Val=2.1   ⚠️ Val getting worse
Epoch 3: Train=1.2, Val=2.4   ❌ Definitely overfitting!
```
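A common way to automate the "stop training" decision in the second case is early stopping: quit as soon as validation loss hasn't improved for a few epochs. A small sketch (the `patience` value is just an example):

```python
def should_stop_early(val_losses, patience=2):
    """Return True once validation loss has failed to improve for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_so_far = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_so_far

# Using the overfitting example above:
print(should_stop_early([2.3, 2.1, 2.4]))       # False: 2.1 was an improvement within the last 2 epochs
print(should_stop_early([2.3, 2.1, 2.4, 2.5]))  # True: no improvement for 2 epochs in a row
```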
```text
RuntimeError: CUDA out of memory
```

Solutions:
- Reduce `BATCH_SIZE` (try 2 or 1)
- Reduce `MAX_LENGTH` (try 256 or 128)
- Use a smaller model (`gpt2` instead of `gpt2-medium`)
- Increase `GRADIENT_ACCUMULATION_STEPS`
```text
Epoch 1: Train Loss: nan
```

Solutions:
- Reduce `LEARNING_RATE` (try 1e-5 or 5e-6)
- Enable gradient clipping (should be on by default)
- Check your data for errors or corrupted text
- If mixed precision is enabled, try disabling it (FP16 overflow is a common cause of NaN loss)
If training is running very slowly:

Solutions:
- Enable mixed precision: `USE_MIXED_PRECISION = True`
- Increase `BATCH_SIZE` if you have the memory
- Reduce `MAX_LENGTH` if your texts are short anyway
- Use a GPU if you're on CPU
- Reduce `NUM_WORKERS` if the CPU is the bottleneck
If the model isn't learning (loss barely moves or generated text is poor):

Solutions:
- Train for more epochs
- Increase `LEARNING_RATE` (try 5e-5 or 1e-4)
- Check that you have enough training data (1000+ examples at a minimum)
- Verify data quality and variety
- Try a different model architecture
| ML Concept | Real-World Analogy |
|---|---|
| Model | Student's brain |
| Forward pass | Answering a question on a test |
| Loss | Points lost on the test |
| Backward pass | Reviewing what you got wrong |
| Optimizer | Updating your study notes |
| Epoch | Going through all flashcards once |
| Batch | A handful of flashcards at a time |
| Learning rate | How quickly you change your understanding |
| Overfitting | Memorizing answers instead of understanding concepts |
| Validation set | Practice test with new questions |
| Checkpoint | Saving your progress in a video game |
Basics:
Hugging Face:
Advanced:
- Attention Is All You Need (Original Transformer paper)
- Stanford CS224N: NLP with Deep Learning
- Training is iterative: The model learns by making predictions, seeing errors, and adjusting
- Loss goes down = learning: Watch the loss metrics to know if training is working
- Start small: Use small models and datasets to experiment before scaling up
- Validation is crucial: Always check performance on unseen data
- Patience is key: Training takes time, and models improve gradually
✅ DO:
- Start with a pre-trained model (fine-tuning is easier)
- Monitor both training and validation loss
- Save checkpoints frequently
- Test on simple examples first
- Use a GPU if possible
❌ DON'T:
- Train for too many epochs (causes overfitting)
- Use a learning rate that's too high
- Ignore validation loss increases
- Train on data that's too different from what you'll use
- Skip data preprocessing and cleaning
If you have questions or run into issues:
- Check the Troubleshooting section
- Review the Hugging Face Forums
- Search Stack Overflow
- Read the code comments for detailed explanations
This code is provided for educational purposes. The models (GPT-2, etc.) have their own licenses from their creators.
Happy Training! 🚀
Remember: Every expert was once a beginner. Take your time understanding these concepts, and don't hesitate to experiment!