GPU Programming 101 🚀

A comprehensive, hands-on educational project for mastering GPU programming with CUDA and HIP

From beginner fundamentals to professional-grade optimization techniques

📑 Table of Contents

📋 Project Overview
🏗️ GPU Programming Architecture
✨ Key Features
🚀 Quick Start
🎯 Learning Path
📚 Modules
🛠️ Prerequisites
🐳 Docker Development
🔧 Build System
📊 Performance Expectations
🐛 Troubleshooting
📖 Documentation
🤝 Contributing
📄 License

📋 Project Overview

GPU Programming 101 is a complete educational resource for learning modern GPU programming. This project provides:

9 comprehensive modules covering beginner to expert topics
71 working code examples in both CUDA and HIP
Cross-platform support for NVIDIA and AMD GPUs
Comprehensive development environment with Docker
Professional tooling including profilers, debuggers, and CI/CD

Perfect for students, researchers, and developers looking to master GPU computing.

🏗️ GPU Programming Architecture

Understanding how GPU code travels from high-level source to hardware execution is crucial for effective GPU development. This section gives an overview of the CUDA and HIP/ROCm software-hardware stacks.

Architecture Overview Diagram

┌───────────────────────────────────────────────────────────────────────────────────┐
│                                APPLICATION LAYER                                  │
├───────────────────────────────────────────────────────────────────────────────────┤
│  High-Level Code (C++/CUDA/HIP)                                                   │
│  ┌─────────────────────┐    ┌─────────────────────┐    ┌─────────────────────┐    │
│  │   CUDA C++ Code     │    │    HIP C++ Code     │    │   OpenCL/SYCL       │    │
│  │   (.cu files)       │    │   (.hip files)      │    │   (Cross-platform)  │    │
│  │                     │    │                     │    │                     │    │
│  │ __global__ kernels  │    │ __global__ kernels  │    │ kernel functions    │    │
│  │ cudaMalloc()        │    │ hipMalloc()         │    │ clCreateBuffer()    │    │
│  │ cudaMemcpy()        │    │ hipMemcpy()         │    │ clEnqueueNDRange()  │    │
│  └─────────────────────┘    └─────────────────────┘    └─────────────────────┘    │
└───────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌───────────────────────────────────────────────────────────────────────────────────┐
│                              COMPILATION LAYER                                    │
├───────────────────────────────────────────────────────────────────────────────────┤
│  Compiler Frontend                                                                │
│  ┌─────────────────────┐    ┌─────────────────────┐    ┌─────────────────────┐    │
│  │      NVCC           │    │      HIP Clang      │    │    LLVM/Clang       │    │
│  │  (NVIDIA Compiler)  │    │   (AMD Compiler)    │    │   (Open Standard)   │    │
│  │                     │    │                     │    │                     │    │
│  │ • Parse CUDA syntax │    │ • Parse HIP syntax  │    │ • Parse OpenCL/SYCL │    │
│  │ • Host/Device split │    │ • Host/Device split │    │ • Generate SPIR-V   │    │
│  │ • Generate PTX      │    │ • Generate GCN ASM  │    │ • Target backends   │    │
│  └─────────────────────┘    └─────────────────────┘    └─────────────────────┘    │
└───────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌───────────────────────────────────────────────────────────────────────────────────┐
│                           INTERMEDIATE REPRESENTATION                             │
├───────────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────────┐    ┌─────────────────────┐    ┌─────────────────────┐    │
│  │        PTX          │    │      GCN ASM        │    │      SPIR-V         │    │
│  │ (Parallel Thread    │    │  (Graphics Core     │    │  (Standard Portable │    │
│  │  Execution)         │    │   Next Assembly)    │    │   IR - Vulkan)      │    │
│  │                     │    │                     │    │                     │    │
│  │ • Virtual ISA       │    │ • AMD GPU ISA       │    │ • Cross-platform    │    │
│  │ • Device agnostic   │    │ • RDNA/CDNA arch    │    │ • Vendor neutral    │    │
│  │ • JIT compilation   │    │ • Direct execution  │    │ • Multiple targets  │    │
│  └─────────────────────┘    └─────────────────────┘    └─────────────────────┘    │
└───────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌───────────────────────────────────────────────────────────────────────────────────┐
│                               DRIVER LAYER                                        │
├───────────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────────┐    ┌─────────────────────┐    ┌─────────────────────┐    │
│  │    CUDA Driver      │    │     ROCm Driver     │    │   OpenCL Driver     │    │
│  │                     │    │                     │    │                     │    │
│  │ • PTX → SASS JIT    │    │ • GCN → Machine     │    │ • SPIR-V → Native   │    │
│  │ • Memory management │    │ • Memory management │    │ • Memory management │    │
│  │ • Kernel launch     │    │ • Kernel launch     │    │ • Kernel launch     │    │
│  │ • Context mgmt      │    │ • Context mgmt      │    │ • Context mgmt      │    │
│  └─────────────────────┘    └─────────────────────┘    └─────────────────────┘    │
└───────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌───────────────────────────────────────────────────────────────────────────────────┐
│                              HARDWARE LAYER                                       │
├───────────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────────┐    ┌─────────────────────┐                               │
│  │    NVIDIA GPU       │    │      AMD GPU        │                               │
│  │                     │    │                     │                               │
│  │ ┌─────────────────┐ │    │ ┌─────────────────┐ │    ┌─────────────────────┐    │
│  │ │   SM (Cores)    │ │    │ │   CU (Cores)    │ │    │   Intel Xe Cores    │    │
│  │ │ ┌─────────────┐ │ │    │ │ ┌─────────────┐ │ │    │ ┌─────────────────┐ │    │
│  │ │ │FP32 | INT32 │ │ │    │ │ │FP32 | INT32 │ │ │    │ │  Vector Engines │ │    │
│  │ │ │FP64 | BF16  │ │ │    │ │ │FP64 | BF16  │ │ │    │ │  Matrix Engines │ │    │
│  │ │ │Tensor Cores │ │ │    │ │ │Matrix Cores │ │ │    │ │  Ray Tracing    │ │    │
│  │ │ └─────────────┘ │ │    │ │ └─────────────┘ │ │    │ └─────────────────┘ │    │
│  │ └─────────────────┘ │    │ └─────────────────┘ │    └─────────────────────┘    │
│  │                     │    │                     │                               │
│  │ Memory Hierarchy:   │    │ Memory Hierarchy:   │    Memory Hierarchy:          │
│  │ • L1 Cache (KB)     │    │ • L1 Cache (KB)     │    • L1 Cache                 │
│  │ • L2 Cache (MB)     │    │ • L2 Cache (MB)     │    • L2 Cache                 │
│  │ • Global Mem (GB)   │    │ • Global Mem (GB)   │    • Global Memory            │
│  │ • Shared Memory     │    │ • LDS (Local Data   │    • Shared Local Memory      │
│  │ • Constant Memory   │    │   Store)            │    • Constant Memory          │
│  │ • Texture Memory    │    │ • Constant Memory   │                               │
│  └─────────────────────┘    └─────────────────────┘                               │
└───────────────────────────────────────────────────────────────────────────────────┘

Compilation Pipeline Deep Dive

1. Source Code → Frontend Parsing

CUDA: NVCC separates host (CPU) and device (GPU) code, parses CUDA extensions
HIP: HIP Clang compiles HIP code for AMD GPUs; on NVIDIA GPUs the HIP API maps onto CUDA and the code is compiled through NVCC
OpenCL/SYCL: LLVM-based compilation with cross-platform intermediate representation

2. Frontend → Intermediate Representation

High-Level Code                    Intermediate Form
─────────────────                 ───────────────────
__global__ void kernel()    →     PTX (NVIDIA)
{                                 GCN Assembly (AMD)  
    int id = threadIdx.x;         SPIR-V (OpenCL/Vulkan)
    output[id] = input[id] * 2;   LLVM IR (SYCL)
}

3. Runtime Compilation & Optimization

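NVIDIA: PTX → SASS (GPU-specific machine code), either ahead of time by NVCC for the targeted architectures or via driver JIT compilation at load time
AMD: HIP Clang compiles ahead of time to native machine code for the targeted GCN/RDNA/CDNA architectures; the ROCm runtime loads the resulting code objects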
Optimizations: Register allocation, memory coalescing, instruction scheduling

4. Hardware Execution Model

Abstraction Level	NVIDIA Term	AMD Term	Description
Thread	Thread	Work-item	Single execution unit
Thread Group	Warp (32 threads)	Wavefront (64 threads on GCN/CDNA; 32 or 64 on RDNA)	SIMD execution group
Thread Block	Block	Work-group	Shared memory + synchronization
Grid	Grid	NDRange	Collection of all thread blocks
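
The hierarchy in the table can be made concrete with a small host-side model. This is a minimal sketch in plain C++, not code from the modules: `ThreadCoord`, `locate`, and `blocks_for` are hypothetical names, and the launch is assumed to be one-dimensional. It mirrors the index arithmetic a kernel typically performs with `blockIdx.x`, `blockDim.x`, and `threadIdx.x`.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of the GPU thread hierarchy (illustrative names only).
struct ThreadCoord {
    std::uint32_t global_id;  // unique index across the whole 1-D grid
    std::uint32_t warp_id;    // which warp/wavefront inside the block
    std::uint32_t lane_id;    // position inside that warp/wavefront
};

// Mirrors the usual kernel expression blockIdx.x * blockDim.x + threadIdx.x.
// warp_size is 32 for NVIDIA warps and 32 or 64 for AMD wavefronts.
ThreadCoord locate(std::uint32_t block_idx, std::uint32_t block_dim,
                   std::uint32_t thread_idx, std::uint32_t warp_size) {
    assert(warp_size > 0 && thread_idx < block_dim);
    return ThreadCoord{block_idx * block_dim + thread_idx,
                       thread_idx / warp_size,
                       thread_idx % warp_size};
}

// Number of blocks needed to cover n elements (ceiling division).
std::uint32_t blocks_for(std::uint32_t n, std::uint32_t block_dim) {
    assert(block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}
```

For example, `locate(2, 256, 70, 32)` describes thread 70 of block 2: global index 582, warp 2, lane 6. With a 64-wide wavefront the same thread falls in wavefront 1. `blocks_for(1000, 256)` gives 4 blocks, so the last block has idle threads that a kernel must guard with a bounds check.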

5. Memory Architecture Mapping

Programming Model              Hardware Implementation
─────────────────              ─────────────────────────
Global Memory        →         GPU DRAM (HBM/GDDR)
Shared Memory        →         On-chip SRAM (48-164KB per SM/CU)
Local Memory         →         GPU DRAM (spilled registers)
Constant Memory      →         Cached read-only GPU DRAM
Texture Memory       →         Cached GPU DRAM with interpolation
Registers            →         On-chip register file (32K-64K per SM/CU)

Performance Implications

Understanding this architecture helps optimize GPU code:

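Memory Coalescing: Arranging accesses so that neighboring threads in a warp/wavefront touch neighboring addresses, which the hardware can merge into fewer memory transactions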
Occupancy: Balancing registers, shared memory, and thread blocks per SM/CU
Divergence: Minimizing different execution paths within warps/wavefronts
Latency Hiding: Using enough threads to hide memory access latency
Memory Hierarchy: Optimal use of each memory type based on access patterns
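
To see why coalescing matters, the sketch below counts how many aligned memory segments a single warp touches for a given access stride. It is a simplified host-side model under stated assumptions: the base address is aligned to the segment size, every thread reads one element, and the segment size is a parameter (128 bytes is used in the examples as an assumption, real hardware granularity varies by architecture). `segments_touched` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts the distinct aligned segments one warp touches when thread t
// reads the element at index t * stride. Fewer segments means the accesses
// coalesce into fewer memory transactions.
std::size_t segments_touched(std::size_t warp_size, std::size_t stride,
                             std::size_t elem_bytes, std::size_t segment_bytes) {
    assert(segment_bytes > 0);
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < warp_size; ++t) {
        std::size_t byte_addr = t * stride * elem_bytes;  // base address assumed 0
        segments.insert(byte_addr / segment_bytes);
    }
    return segments.size();
}
```

With 32 threads reading 4-byte floats and 128-byte segments, a stride of 1 touches a single segment, a stride of 2 touches 2, and a stride of 32 touches 32, one segment per thread. The same data volume therefore costs up to 32 times more transactions in this model.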

This architectural knowledge is essential for writing efficient GPU code and is covered progressively throughout our modules.
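
Occupancy, listed above, can be estimated with simple arithmetic. The following is a rough host-side sketch, assuming per-SM/CU limits are known and ignoring the allocation granularity that real hardware applies to registers and shared memory. `SmLimits`, `KernelUsage`, `resident_blocks`, and `occupancy` are hypothetical names, and the limit values used in the example are illustrative rather than taken from a specific GPU. For real kernels, the CUDA runtime offers `cudaOccupancyMaxActiveBlocksPerMultiprocessor`.

```cpp
#include <algorithm>
#include <cassert>

struct SmLimits {         // per-SM/CU limits (example values are assumptions)
    int max_threads;      // resident threads
    int max_blocks;       // resident blocks
    int registers;        // 32-bit registers in the register file
    int shared_bytes;     // shared memory / LDS capacity in bytes
};

struct KernelUsage {
    int threads_per_block;
    int regs_per_thread;
    int shared_bytes_per_block;
};

// Resident blocks per SM/CU: the tightest of the four resource limits.
int resident_blocks(const SmLimits& sm, const KernelUsage& k) {
    assert(k.threads_per_block > 0 && k.regs_per_thread > 0);
    int by_threads = sm.max_threads / k.threads_per_block;
    int by_regs = sm.registers / (k.regs_per_thread * k.threads_per_block);
    int by_shared = k.shared_bytes_per_block > 0
                        ? sm.shared_bytes / k.shared_bytes_per_block
                        : sm.max_blocks;  // no shared memory used: not a limit
    return std::min({sm.max_blocks, by_threads, by_regs, by_shared});
}

// Occupancy: fraction of the SM/CU's thread slots that are filled.
double occupancy(const SmLimits& sm, const KernelUsage& k) {
    return static_cast<double>(resident_blocks(sm, k) * k.threads_per_block) /
           sm.max_threads;
}
```

With limits of 2048 threads, 32 blocks, 65536 registers, and 64 KB of shared memory, a 256-thread block using 32 registers per thread reaches full occupancy in this model. Doubling register use to 64 per thread halves it, and requesting 32 KB of shared memory per block caps the SM at 2 resident blocks.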

✨ Key Features

Feature	Description
🎯 Complete Curriculum	9 progressive modules from basics to advanced topics
💻 Cross-Platform	Full CUDA and HIP support for NVIDIA and AMD GPUs
🐳 Docker Ready	Complete containerized development environment with CUDA 13.0.1 & ROCm 7.0.1
🔧 Professional Quality	Professional build systems, auto-detection, testing, and profiling
📊 Performance Focus	Optimization techniques and benchmarking throughout
🌐 Community Driven	Open source with comprehensive contribution guidelines
🧪 Advanced Libraries	Support for Thrust, MIOpen, and production ML frameworks

🚀 Quick Start

Option 1: Docker (Recommended)

Get started immediately without installing CUDA/ROCm on your host system:

# Clone the repository
git clone https://github.com/AIComputing101/gpu-programming-101.git
cd gpu-programming-101

# Auto-detect your GPU and start development environment
./docker/scripts/run.sh --auto

# Inside container: verify GPU access and start learning
/workspace/test-gpu.sh
cd modules/module1 && make && ./build/01_vector_addition_cuda

Option 2: Native Installation

For direct system installation:

# Prerequisites: CUDA 13.0+ or ROCm 7.0+, GCC 9+, Make

# Clone and build
git clone https://github.com/AIComputing101/gpu-programming-101.git
cd gpu-programming-101

# Verify your setup
make check-system

# Build and run first example
make module1
cd modules/module1/examples
./01_vector_addition_cuda

🎯 Learning Path

Choose your track based on your experience level:

👶 Beginner Track (Modules 1-3) - GPU fundamentals, memory management, first kernels
🔥 Intermediate Track (Modules 4-5) - Advanced programming, performance optimization
🚀 Advanced Track (Modules 6-9) - Parallel algorithms, domain applications, production deployment

Each track builds on the previous one, so start with the appropriate level for your background.

📚 Modules

Our comprehensive curriculum progresses from fundamental concepts to advanced optimization techniques:

Module	Level	Focus Area	Key Topics	Examples
Module 1	👶 Beginner	GPU Fundamentals	Architecture, Memory, First Kernels	13
Module 2	👶→🔥	Memory Optimization	Coalescing, Shared Memory, Texture	10
Module 3	🔥 Intermediate	Execution Models	Warps, Occupancy, Synchronization	12
Module 4	🔥→🚀	Advanced Programming	Streams, Multi-GPU, Unified Memory	9
Module 5	🚀 Advanced	Performance Engineering	Profiling, Bottleneck Analysis	5
Module 6	🚀 Advanced	Parallel Algorithms	Reduction, Scan, Convolution	10
Module 7	🚀 Expert	Algorithmic Patterns	Sorting, Graph Algorithms	4
Module 8	🚀 Expert	Domain Applications	ML, Scientific Computing	4
Module 9	🚀 Expert	Production Deployment	Libraries, Integration, Scaling	4

📈 Progressive Learning Path: 71 Examples • Beginner to Expert

Learning Progression

Module 1: Hello GPU World          Module 6: Parallel Algorithms
    ↓                                 ↓
Module 2: Memory Mastery          Module 7: Advanced Patterns  
    ↓                                 ↓
Module 3: Execution Deep Dive     Module 8: Real Applications
    ↓                                 ↓
Module 4: Advanced Features       Module 9: Production Ready
    ↓                             
Module 5: Performance Tuning

📚 View All Modules →

🛠️ Prerequisites

Hardware Requirements

NVIDIA GPU Systems

Minimum GPU: GTX 1060 6GB, GTX 1650, RTX 2060 or better
Recommended GPU: RTX 3070/4070 (8-12GB), RTX 3080/4080 (10-16GB)
Professional/Advanced: RTX 4090 (24GB), RTX A6000 (48GB), Tesla/Quadro series
Architecture Support: Maxwell, Pascal, Volta, Turing, Ampere, Ada Lovelace, Hopper
Compute Capability: 5.0+ (Maxwell architecture or newer); CUDA 13.x toolkits target 7.5+ (Turing or newer), so pre-Turing GPUs need a CUDA 12.x toolkit
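
As a quick reference for the architecture list and the 5.0 minimum above, the sketch below maps a compute capability to its architecture name. It is a hypothetical helper, not part of the project: `architecture_name` and `meets_minimum` are illustrative names, and only the architectures listed above are covered. In a CUDA program the major and minor numbers would typically come from the `major` and `minor` fields of `cudaDeviceProp`.

```cpp
#include <string>

// Maps an NVIDIA compute capability (major.minor) to its architecture name.
// Only the architectures named in this document are covered.
std::string architecture_name(int major, int minor) {
    switch (major) {
        case 5: return "Maxwell";
        case 6: return "Pascal";
        case 7: return minor >= 5 ? "Turing" : "Volta";
        case 8: return minor >= 9 ? "Ada Lovelace" : "Ampere";
        case 9: return "Hopper";
        default: return "unknown";
    }
}

// The stated minimum is compute capability 5.0, so any 5.x or newer qualifies.
bool meets_minimum(int major) {
    return major >= 5;
}
```

For instance, a GTX 1060 reports 6.1 (Pascal) and an RTX 2060 reports 7.5 (Turing), so both pass `meets_minimum`.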

AMD GPU Systems

Minimum GPU: RX 580 8GB, RX 6600, RX 7600 or better
Recommended GPU: RX 6700 XT/7700 XT (12GB+), RX 6800 XT/7800 XT (16GB+)
Professional/Advanced: RX 7900 XTX (24GB), Radeon PRO W7800 (48GB), Instinct MI series
Architecture Support: RDNA2, RDNA3, RDNA4, GCN 5.0+, CDNA series
ROCm Compatibility: Officially supported AMD GPUs only

System Memory & CPU

Minimum RAM: 16GB system RAM
Recommended RAM: 32GB+ for advanced modules and multi-GPU setups
Professional Setup: 64GB+ for large-scale scientific computing
CPU Requirements:
- Intel: Haswell (2013) or newer for PCIe atomics support
- AMD: Zen 1 (2017) or newer for PCIe atomics support
Storage: 20GB+ free space for Docker containers and examples

Software Requirements

Operating System Support

Linux (Recommended): Ubuntu 22.04/24.04 LTS, RHEL 8/9, SLES 15 SP5
Windows: Windows 10/11 with WSL2 recommended for optimal compatibility
macOS: macOS 12+ (Metal Performance Shaders for basic GPU compute)

GPU Computing Platforms

CUDA Toolkit: 13.0+ recommended (Docker uses CUDA 13.0.1)
- Driver Requirements:
  - Linux: 580+ for CUDA 13.0+
  - Windows: 580+ for CUDA 13.0+
ROCm Platform: 7.0+ (Docker uses ROCm 7.0)
- Driver Requirements: Latest AMDGPU-PRO or open-source AMDGPU drivers
- Kernel Support: Linux kernel 5.4+ recommended

Development Environment

Compilers:
- GCC: 9.0+ (GCC 11+ recommended for C++17 features)
- Clang: 10.0+ (Clang 14+ recommended)
- MSVC: 2019+ (2022 17.10+ for CUDA 13.0+ support)
Build Tools: Make 4.0+, CMake 3.18+ (optional)
Docker: 20.10+ with GPU runtime support (nvidia-container-toolkit or ROCm containers)

Additional Tools (Included in Docker)

Profiling: Nsight Compute, Nsight Systems (NVIDIA), rocprof (AMD)
Debugging: cuda-gdb, rocgdb, compute-sanitizer
Libraries: cuBLAS, cuFFT, rocBLAS, rocFFT (for advanced modules)
Algorithm & ML Libraries: Thrust (NVIDIA parallel algorithms), MIOpen (AMD deep learning primitives)
System Management: NVML (NVIDIA), ROCm SMI (AMD) for hardware monitoring

Performance Expectations by Hardware Tier

Hardware Tier	Example GPUs	VRAM	Expected Performance	Suitable Modules
Entry Level	GTX 1060 6GB, RX 580 8GB	6-8GB	10-50x CPU speedup	Modules 1-3
Mid-Range	RTX 3060 Ti, RX 6700 XT	8-12GB	50-200x CPU speedup	Modules 1-6
High-End	RTX 4070 Ti, RX 7800 XT	12-16GB	100-500x CPU speedup	All modules
Professional	RTX 4090, RX 7900 XTX	24GB	200-1000x+ CPU speedup	All modules + research

Programming Knowledge

C/C++: Intermediate level (pointers, memory management, basic templates)
Parallel Programming: Basic understanding of threads and synchronization helpful
Command Line: Comfortable with terminal/shell operations
Mathematics: Linear algebra and calculus basics beneficial for advanced modules
Version Control: Basic Git knowledge for contributing

Network Requirements (Docker Setup)

Internet Connection: Required for initial Docker image downloads (~8GB total)
Bandwidth: 50+ Mbps recommended for efficient container downloads
Storage: Additional 20GB for Docker images and build cache

🐳 Docker Development

Experience the full development environment with zero setup:

# Build development containers
./docker/scripts/build.sh --all

# Start interactive development
./docker/scripts/run.sh cuda    # For NVIDIA GPUs
./docker/scripts/run.sh rocm    # For AMD GPUs
./docker/scripts/run.sh --auto  # Auto-detect GPU type

Docker Benefits:

🎯 Zero host configuration required
🔧 Complete development environment (compilers, debuggers, profilers)
🌐 Cross-platform testing (test your code on both CUDA and HIP)
📦 Isolated and reproducible builds
🧹 Easy cleanup when done

Container Specifications:

CUDA: NVIDIA CUDA 13.0.1 on Ubuntu 24.04
ROCm: AMD ROCm 7.0 on Ubuntu 24.04
Libraries: Professional toolchains with debugging support

📖 Complete Docker Guide →

🔧 Build System

Our advanced build system features automatic GPU vendor detection and optimized configurations:

Project-Wide Commands

make all           # Build all modules with auto-detection
make test          # Run comprehensive tests  
make clean         # Clean all artifacts
make check-system  # Verify GPU setup and dependencies
make status        # Show module completion status

Module-Specific Commands

cd modules/module1/examples
make               # Build all examples with vendor auto-detection
make test          # Run module tests
make profile       # Performance profiling
make debug         # Debug builds with extra checks

Advanced Build Features

Automatic GPU Detection: Detects NVIDIA/AMD hardware and builds accordingly
Professional Optimization: -O3, fast math, architecture-specific optimizations
Debug Support: Full debugging symbols and validation checks
Library Management: Automatic detection of optional dependencies (NVML, MIOpen)
Cross-Platform: Single Makefile supports both CUDA and HIP builds

📊 Performance Expectations

Module Level	Typical GPU Speedup	Memory Efficiency	Code Quality
Beginner	10-100x	60-80%	Educational
Intermediate	50-500x	80-95%	Optimized
Advanced	100-1000x	85-95%	Professional
Expert	500-5000x	95%+	Library-Quality
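
The figures in this table can be checked against your own measurements with a few lines of arithmetic. The sketch below is one possible reading, not the project's benchmarking code: it assumes "memory efficiency" means achieved bandwidth divided by the device's peak bandwidth, which the table does not define, and `speedup`, `achieved_gbps`, and `memory_efficiency` are hypothetical helper names.

```cpp
#include <cassert>

// Speedup of a GPU run over a CPU baseline for the same workload.
double speedup(double cpu_seconds, double gpu_seconds) {
    assert(gpu_seconds > 0.0);
    return cpu_seconds / gpu_seconds;
}

// Achieved bandwidth in GB/s for a kernel that moves bytes_moved bytes.
double achieved_gbps(double bytes_moved, double kernel_seconds) {
    assert(kernel_seconds > 0.0);
    return bytes_moved / kernel_seconds / 1e9;
}

// Memory efficiency: achieved bandwidth as a fraction of the device peak.
double memory_efficiency(double achieved, double peak_gbps) {
    assert(peak_gbps > 0.0);
    return achieved / peak_gbps;
}
```

For a vector addition over n floats, `bytes_moved` is 3 * n * 4 (two arrays read, one written). A kernel that moves 4e9 bytes in 0.5 s achieves 8 GB/s, which is 50% efficiency against a 16 GB/s peak. These numbers are chosen for easy arithmetic, not as realistic GPU figures.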

🐛 Troubleshooting

Common Issues & Solutions

GPU Not Detected

# NVIDIA
nvidia-smi  # Should show your GPU
export PATH=/usr/local/cuda/bin:$PATH

# AMD  
rocm-smi   # Should show your GPU
export HIP_PLATFORM=amd

Compilation Errors

# Check CUDA installation
nvcc --version
make check-cuda

# Check HIP installation  
hipcc --version
make check-hip

Docker Issues

# Test Docker GPU access
./docker/scripts/test.sh

# Rebuild containers
./docker/scripts/build.sh --clean --all

📖 Documentation

Document	Description
README.md	Main project documentation and getting started guide
CONTRIBUTING.md	How to contribute to the project
Docker Guide	Complete Docker setup and usage
Module READMEs	Individual module documentation

🤝 Contributing

We welcome contributions from the community! This project thrives on:

📝 New Examples: Implementing additional GPU algorithms
🐛 Bug Fixes: Improving existing code and documentation
📚 Documentation: Enhancing explanations and tutorials
🔧 Optimizations: Performance improvements and best practices
🌐 Platform Support: Cross-platform compatibility improvements

📖 Contributing Guidelines → • 🐛 Report Issues → • 💡 Request Features →

🏆 Community & Support

🌟 Star this project if you find it helpful!
🐛 Report bugs using our issue templates
💬 Join discussions in GitHub Discussions
📧 Get help from the community and maintainers

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📚 Citation

If you use this project in your research, education, or publications, please cite it as:

BibTeX

@misc{gpu-programming-101,
  title={GPU Programming 101: A Comprehensive Educational Project for CUDA and HIP},
  author={{Stephen Shao}},
  year={2025},
  howpublished={\url{https://github.com/AIComputing101/gpu-programming-101}},
  note={A complete GPU programming educational resource with 71 comprehensive examples covering fundamentals through advanced optimization techniques for NVIDIA CUDA and AMD HIP platforms}
}

IEEE Format

Stephen Shao, "GPU Programming 101: A Comprehensive Educational Project for CUDA and HIP," GitHub, 2025. [Online]. Available: https://github.com/AIComputing101/gpu-programming-101

🙏 Acknowledgments

🎯 NVIDIA and AMD for excellent GPU computing ecosystems
📚 GPU computing community for sharing knowledge and best practices
👥 Contributors who make this project better every day

Ready to unlock the power of GPU computing?

🚀 Get Started Now • 📚 View Modules • 🐳 Try Docker

⭐ Star this project • 🍴 Fork and contribute • 📢 Share with others

Built with ❤️ for AI Computing 101
