Welcome to the repository for the graded projects of the Parallel and High Performance Computing course at EPFL, Spring 2025. This repository contains all code, reports, and results for the course assignments.
This repository showcases solutions to various parallel programming and high performance computing problems, including:
- Parallelization using MPI and CUDA
- Performance optimization and profiling
- Theoretical analysis of run time
- Scalability analysis
- Real-world scientific computing applications
This was the main project for the course. The full report is available here. All the MPI code is located here, while all the CUDA code is available here.
- Objective: Parallelize the solution of shallow water equations using MPI and CUDA to enhance computational performance.
- Theoretical Analysis: Computational cost analysis of grid initialization and time-step calculations.
- Domain Subdivision: Divided the 2D grid into subgrids for each processor to handle computations locally.
- Communication: Utilized MPI's Cartesian communicator for efficient neighbor communication and halo updates (see the sketch after this list).
- Performance Metrics:
  - Strong Scaling: Demonstrated significant speedup with increasing processor counts, initially following Amdahl's law but deviating due to communication overhead.
  - Weak Scaling: Showed efficiency and speedup trends with constant work per processor, highlighting growing communication costs as the processor count increased.
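For illustration, the snippet below is a minimal sketch, not the project's exact code, of how a 2D Cartesian communicator and a single halo exchange can be set up. `MPI_Dims_create`, `MPI_Cart_create`, `MPI_Cart_shift`, and `MPI_Sendrecv` are standard MPI calls; the local grid size and variable names are purely illustrative.

```cpp
// Minimal sketch of a 2D Cartesian communicator with one halo exchange.
// Not the project's actual code; sizes and names are illustrative.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Let MPI choose a balanced 2D process grid.
    int dims[2] = {0, 0};
    MPI_Dims_create(size, 2, dims);

    // Non-periodic Cartesian communicator over the process grid.
    int periods[2] = {0, 0};
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

    // Ranks of the four neighbors (MPI_PROC_NULL at domain boundaries).
    int north, south, west, east;
    MPI_Cart_shift(cart, 0, 1, &north, &south);
    MPI_Cart_shift(cart, 1, 1, &west, &east);

    // Example halo exchange for one contiguous boundary row:
    // send our top interior row north, receive our bottom halo row from south.
    const int nx = 128;  // illustrative local subgrid width
    std::vector<double> send_row(nx, rank), recv_row(nx, 0.0);
    MPI_Sendrecv(send_row.data(), nx, MPI_DOUBLE, north, 0,
                 recv_row.data(), nx, MPI_DOUBLE, south, 0,
                 cart, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```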
- Conceptual Similarity: As with the MPI domain subdivision, the grid was divided into blocks, with computation parallelized within each block.
- Optimizations:
  - Parallelized the vast majority of the computational functions.
  - Implemented a two-kernel approach for computing the global maximum, inspired by NVIDIA's reduction techniques (see the sketch after this list).
- Performance Results: Illustrated significant time reduction with increasing threads per block, eventually plateauing due to serial bottlenecks.
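The sketch below shows the spirit of the two-kernel maximum reduction: a first kernel reduces each block to a partial maximum in shared memory, and a second launch reduces those partials to a single value. It is a simplified illustration, not the report's actual kernels; it assumes a power-of-two block size and that the number of blocks does not exceed the block size.

```cuda
// Simplified two-pass maximum reduction in the spirit of NVIDIA's
// shared-memory reduction examples; not the project's exact kernels.
#include <cfloat>

__global__ void block_max(const double *in, double *block_out, int n) {
    extern __shared__ double sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element (or -DBL_MAX past the end of the array).
    sdata[tid] = (i < n) ? in[i] : -DBL_MAX;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x assumed a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] = fmax(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    // Thread 0 writes this block's partial maximum.
    if (tid == 0) block_out[blockIdx.x] = sdata[0];
}

// Host-side sketch: the first pass produces one partial maximum per block,
// the second pass reduces those partials with a single block
// (assumes blocks <= threads; otherwise iterate).
void global_max(const double *d_in, double *d_partial, double *d_out,
                int n, int threads) {
    int blocks = (n + threads - 1) / threads;
    block_max<<<blocks, threads, threads * sizeof(double)>>>(d_in, d_partial, n);
    block_max<<<1, threads, threads * sizeof(double)>>>(d_partial, d_out, blocks);
}
```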
In addition to the final project, two smaller projects were completed for hands-on learning of MPI and CUDA.
This project focuses on parallelizing the conjugate gradient method, specifically targeting the matrix-vector multiplication function. Using MPI, the matrix indices were divided among processors to distribute the workload. Evaluation showed a notable speedup with up to 25 processors, beyond which a plateau was observed. Scaling analysis, applying Amdahl's and Gustafson's laws, revealed that memory and communication bottlenecks limit scalability.
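As an illustration of this kind of distribution, here is a minimal sketch of a row-partitioned dense matrix-vector product with MPI; the project's exact partitioning and data layout may differ, and the function and variable names are hypothetical.

```cpp
// Minimal sketch of a row-partitioned dense matrix-vector product with MPI.
// The project's actual partitioning and data layout may differ.
#include <mpi.h>
#include <vector>

// Each rank owns `local_rows` consecutive rows of A (row-major) and computes
// its slice of y = A * x; an Allgather then rebuilds the full result vector.
void parallel_matvec(const std::vector<double> &A_local,  // local_rows x n
                     const std::vector<double> &x,        // full vector, length n
                     std::vector<double> &y,              // full result, length n
                     int local_rows, int n, MPI_Comm comm) {
    std::vector<double> y_local(local_rows, 0.0);
    for (int i = 0; i < local_rows; ++i)
        for (int j = 0; j < n; ++j)
            y_local[i] += A_local[i * n + j] * x[j];

    // Assumes every rank owns the same number of rows; otherwise use
    // MPI_Allgatherv with per-rank counts and displacements.
    MPI_Allgather(y_local.data(), local_rows, MPI_DOUBLE,
                  y.data(), local_rows, MPI_DOUBLE, comm);
}
```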
The full report is available here and all the code is located here.
This project focuses on transitioning from CBLAS to CUDA to parallelize linear algebra functions on GPUs. The primary goal was to create CUDA kernels for these functions to enhance computational performance. Timing the conjugate gradient function with varying threads per block revealed that performance improved initially but plateaued after 8 threads per block and degraded significantly beyond 256 threads, likely due to memory bottlenecks.
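As a simple illustration of this CBLAS-to-CUDA transition, the sketch below shows how a CBLAS-style `daxpy` (y = alpha*x + y) could be replaced by a CUDA kernel with a configurable number of threads per block. It is not the project's actual kernel; the wrapper name and launch configuration are illustrative.

```cuda
// Illustrative CUDA replacement for a CBLAS-style daxpy (y = alpha*x + y);
// the project's actual kernels and launch configuration may differ.
__global__ void daxpy_kernel(int n, double alpha, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = alpha * x[i] + y[i];   // one element per thread
}

// Host wrapper mirroring the cblas_daxpy call it replaces (unit strides,
// device pointers d_x and d_y assumed already allocated and filled).
void daxpy(int n, double alpha, const double *d_x, double *d_y,
           int threads_per_block) {
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    daxpy_kernel<<<blocks, threads_per_block>>>(n, alpha, d_x, d_y);
    cudaDeviceSynchronize();                 // wait for the kernel to finish
}
```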
The full report is available here and all the code is located here.
Each project has a Makefile to compile the code. Running `make` in the corresponding directory will compile the code. Below are the links to the Makefiles.
To run the code and the Python visualizations, SLURM job files are available for each project. They were created so that the projects could be run on SLURM clusters; nevertheless, the same instructions can be run locally. Below are the links to the SLURM job files.
This project is licensed under the MIT License. See LICENSE for details.