Commit 7535bb8

- docker GitHub workflow
- CUDA checks
- readme
- requirements
1 parent b9cfece commit 7535bb8

10 files changed

+267
-59
lines changed

.github/workflows/docker.yml

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+name: docker
+
+on:
+  push:
+    branches:
+      - main
+
+jobs:
+  build:
+    strategy:
+      matrix:
+        target: [docker, singularity]
+
+    runs-on: ubuntu-latest
+
+    steps:
+
+      - name: Checkout repo
+        uses: actions/checkout@v4
+
+      - name: Setup Docker buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Docker Hub authentication
+        uses: docker/login-action@v3
+        with:
+          username: ${{ secrets.DOCKERHUB_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_TOKEN }}
+
+      - name: Build and push
+        uses: docker/build-push-action@v6
+        with:
+          target: ${{matrix.target}}
+          tags: gcodega/matmul:cuda12.4-${{matrix.target}}
+          cache-to: type=inline
+          cache-from: |
+            type=registry, ref=gcodega/matmul:cuda12.4-docker
+            type=registry, ref=gcodega/matmul:cuda12.4-singularity
+          push: true

Dockerfile

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
+# ---- Base stage ----
+# Install Python and OpenMPI.
+FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS base
+
+ENV CUDA_HOME=/usr/local/cuda
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends python3.11 python3.11-dev wget curl openssh-client && \
+    rm -rf /var/lib/apt/lists/* && \
+    update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1 && \
+    curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
+    python get-pip.py && \
+    rm get-pip.py
+
+WORKDIR /opt
+ENV MPI_HOME=/opt/openmpi
+ENV PATH=$MPI_HOME/bin:$PATH
+ENV LD_LIBRARY_PATH=$MPI_HOME/lib:$LD_LIBRARY_PATH
+RUN wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.8.tar.gz && \
+    tar -xzf openmpi-4.1.8.tar.gz && \
+    rm openmpi-4.1.8.tar.gz && \
+    cd openmpi-4.1.8 && \
+    ./configure --prefix=$MPI_HOME --with-cuda=$CUDA_HOME && \
+    make -j4 install
+
+# ---- Build stage for Docker ----
+# Setting a user that is not root.
+FROM base AS docker
+RUN groupadd -g 1001 matmul && \
+    useradd -u 1001 -g matmul tony
+
+ENV HOME=/home/tony
+WORKDIR $HOME
+RUN cp /root/.bashrc . && \
+    cp /root/.profile . && \
+    chown -R tony:matmul $HOME && \
+    mkdir .local app && \
+    chown -R tony:matmul .local app
+
+WORKDIR $HOME/app
+USER tony
+ENV PATH=$HOME/.local/bin:$PATH
+
+COPY --chown=tony:matmul requirements.txt .
+RUN python -m pip install --no-cache-dir --no-binary=mpi4py -r requirements.txt
+
+COPY --chown=tony:matmul . .
+RUN python -m pip install -e .
+
+# ---- Build stage for Singularity ----
+# Not setting a user is recommended for compatibility
+# with Singularity, since the container won't run as root.
+FROM base AS singularity
+ENV HOME=/shared-folder
+WORKDIR $HOME
+RUN cp /root/.bashrc . && \
+    cp /root/.profile . && \
+    chmod -R a+rwx $HOME
+
+WORKDIR $HOME/app
+COPY --chmod=777 requirements.txt .
+RUN python -m pip install --no-cache-dir --no-binary=mpi4py -r requirements.txt
+
+COPY --chmod=777 . .
+RUN python -m pip install -e .
+

README.md

Lines changed: 31 additions & 6 deletions
@@ -6,14 +6,26 @@ Code for the exam in Development tools for Scientific Computing, SISSA, a.y. 202
 ---
 
 ## Parallel matrix-matrix multiplication
-The goal here was to implement matrix-matrix multiplication in distributed memory. The core idea is to split the matrices we want to multiply between MPI processes, and let each process compute a chunk of the result. Since we are in distributed memory, some communication is required for each process to properly compute its chunk of the result, but ultimately each process needs some matrix-matrix multiplication routine to do its work. The efficiency of the underliying multiplication routine can severely affect the performance of the distributed algorithm. Different matrix-matrix multiplication routines are provided in `src/matmul/routines.py`, all of which can be used in the distributed routine. The actual distributed machinery is provided in `scripts/run.py`.
+The goal here was to implement matrix-matrix multiplication in distributed memory. Details about the implementation are [a bit further down](#notes-on-the-implementation), but the general idea is to split the matrices we want to multiply between MPI processes, and let each process compute a chunk of the result. The whole distributed multiplication requires several steps, such as computing the workload for each process, initialising the data, communicating and computing individual chunks, hence it is not straightforward to write a single `distributed_multiply` routine. In fact, this code has no such routine; the whole distributed machinery is instead provided in `scripts/run.py`.
+
+`src/matmul/routines.py` contains a number of matrix-matrix multiplication routines (serial, parallel, tiled, GPU-accelerated) that can be used in the distributed algorithm. The performance of the distributed algorithm depends on the performance of the base routine.
 
 All the code is implemented in Python. NumPy is employed to manipulate the matrices, while Numba is used to JIT compile routines in serial, parallel, CPU and GPU code. The MPI is provided by mpi4py.
 
-This package tries to install mpi4py with `pip`, which requires a working installation of MPI on the machine. Also, for GPU computing, a recent version of the CUDA Toolkit is required (see [Numba](https://numba.readthedocs.io/en/stable/cuda/overview.html) for details).
+## Installation
+**NOTE:** Installing mpi4py with `pip` requires a working installation of MPI on the machine. Also, for GPU computing, a recent version of the CUDA Toolkit is required (see [Numba](https://numba.readthedocs.io/en/stable/cuda/overview.html) for details).
 
-### NVHPC
-As it turns out, NVHPC ships with all is needed here. One issue is that mpi4py is not really meant to be compiled with nvc by default. If you have issues while installing you may want to try this
+### From GitHub
+Clone this repo locally and then run
+```bash
+python -m pip install --no-cache-dir --no-binary=mpi4py -r requirements.txt
+python -m pip install .
+```
+Optionally install dependencies for testing (`test`), profiling (`profile`) or both (`dev`) with
+```bash
+python -m pip install .[<DEPENDENCY>]
+```
+As it turns out, [NVHPC](https://developer.nvidia.com/hpc-sdk) ships with all that is needed here. One issue is that mpi4py is not really meant to be compiled with nvc by default. If you have issues while installing, you may want to try this:
 ``` bash
 CFLAGS=-noswitcherror python -m pip install --no-cache-dir --no-binary=mpi4py mpi4py
 ```
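The README text in the hunk above lists "computing the workload for each process" as the first step of the distributed multiplication. A minimal sketch of one common way to do this is an even split of the rows, with the remainder spread over the first ranks. The helper name `row_workload` is hypothetical and is not taken from `scripts/run.py`, which may partition the matrices differently.

```python
def row_workload(n_rows: int, n_procs: int) -> list[tuple[int, int]]:
    """Return the (start, count) of rows assigned to each rank.

    Hypothetical helper: rows are split as evenly as possible and the
    first `n_rows % n_procs` ranks receive one extra row.
    """
    if n_procs <= 0:
        raise ValueError("n_procs must be positive")
    base, extra = divmod(n_rows, n_procs)
    workload = []
    start = 0
    for rank in range(n_procs):
        count = base + (1 if rank < extra else 0)
        workload.append((start, count))
        start += count
    return workload


if __name__ == "__main__":
    # 10 rows over 4 processes: the first two ranks get 3 rows, the others 2.
    print(row_workload(10, 4))
```

In an MPI program each rank would typically pick its own entry with `workload[comm.Get_rank()]`; that indexing is likewise an assumption about usage, not something shown in this commit.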
@@ -23,7 +35,20 @@ export CUDA_HOME=$NVHPC_ROOT/cuda/12.0
 export NUMBAPRO_NVVM=$NVHPC_ROOT/cuda/12.0/nvvm/lib64
 export NUMBAPRO_LIBDEVICE=$NVHPC_ROOT/cuda/12.0/nvvm/libdevice
 ```
-Finally, should you run any of this code on an HPC facility and submit a SLURM job, note that SLURM's `srun` might not work with mpi4py, and you may need to use `mpirun` instead.
+
+### From DockerHub
+You can get container images with this code from DockerHub as well. The images are built with CUDA 12.4 and still require NVIDIA drivers on the host machine to run.
+If you plan on running a Docker container you can get the image with
+```bash
+docker pull gcodega/matmul:cuda12.4-docker
+```
+If you plan on running a Singularity container, you can get a different tag
+```bash
+docker pull gcodega/matmul:cuda12.4-singularity
+```
+Note that to run Docker with CUDA support you may need the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), whereas Singularity natively supports CUDA.
+
+The tags only differ in that the Docker image has some custom user (tony:matmul), whereas the Singularity image runs as root. Not setting a different user in the Singularity image is actually recommended, as Singularity containers inherit the user from the host, and setting a user different from root may cause issues with environments inside the container. Also, if you want to run the code on some HPC facility you may want to use the Singularity image, as it can interact with the host MPI and run on multiple nodes. Note that in this case performance may not be optimal, as the OpenMPI inside the container is not optimized for any specific machine.
 
 ## Run some tests
 To check out how different routines perform you can run `scripts/run.py`. After installing the package, you can modify `examples/config.yaml` by specifying the following parameters:
@@ -38,7 +63,7 @@ You can run the script with
 ```
 mpirun -n <ntasks> python scripts/run.py --config experiments/config
 ```
-Note that if you want to run this in serial you still need to use `mpirun -n 1 ...`
+Note that if you want to run this in serial you still need to use `mpirun -n 1 ...`. Moreover, should you run any of this code on an HPC facility and submit a SLURM job, also note that SLURM's `srun` might not work with mpi4py, and you may need to use `mpirun` instead (in `shell/submit.sbatch` you can find the script that I used to submit jobs on Ulysses at SISSA). Finally, when running through Singularity you may need to specify absolute paths for the scripts (all source code is in `/shared-folder/app`).
 
 ### Profiling
 The script will print to screen the time spent in multiplying the matrices (i.e. no communication time or others). You can get more insights by profiling the code with kernprof. The script in `shell/submit.sh` lets you run one instance of kernprof for each MPI task and save the results on different files. You can select the number of threads for parallel routines by changing `NUMBA_NUM_THREADS` and customize the output path for kernprof. Run the script as

experiments/config.yaml

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 device: cpu
-size: 4096
+size: 256
 function:
-  routine: matmul_numba_serial
+  routine: matmul_numba_cpu
   block_size: 32
 print: False

requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 llvmlite==0.44.0
-mpi4py==4.0.3 --no-binary=mpi4py
+mpi4py==4.0.3
 numba==0.61.0
 numpy==2.1.3
 PyYAML==6.0.2

scripts/run.py

Lines changed: 6 additions & 3 deletions
@@ -1,5 +1,5 @@
 from functools import wraps
-from warnings import warn
+import warnings
 import numpy as np
 from numba import cuda
 
@@ -9,14 +9,15 @@
 mpi4py.rc.finalize = False
 from mpi4py import MPI
 
-from matmul.utils import create_block, read_config
+from matmul.utils import create_block, read_config, custom_warning
 import argparse
 import importlib
 
 try:
     from line_profiler import profile
 except ModuleNotFoundError:
-    warn("Did not find line_profiler. Please install it to access profiling information.")
+    warnings.formatwarning = custom_warning
+    warnings.warn("Did not find line_profiler. Please install it to access profiling information.")
     def profile(f,*args,**kwargs):
         def wrapper(*args,**kwargs):
             f(*args,**kwargs)
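The hunk above assigns `custom_warning` to `warnings.formatwarning`, but `matmul.utils.custom_warning` itself is not part of this diff. As a rough sketch of what such a formatter could look like, the function below uses the argument list that the standard library passes to `warnings.formatwarning`; its body is a guess and only illustrates the mechanism.

```python
import warnings


def custom_warning(message, category, filename, lineno, line=None):
    """Hypothetical formatter: show only the category and the message.

    The signature matches what `warnings.formatwarning` is called with;
    the real `matmul.utils.custom_warning` is not shown in this commit.
    """
    return f"{category.__name__}: {message}\n"


if __name__ == "__main__":
    original = warnings.formatwarning
    warnings.formatwarning = custom_warning
    try:
        warnings.warn("Did not find line_profiler.")
    finally:
        # Restore the default formatter so other code is unaffected.
        warnings.formatwarning = original
```

Note that assigning to `warnings.formatwarning` is process-wide, so every warning emitted afterwards (including those from third-party libraries) is formatted the same way unless the original function is restored.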
@@ -225,6 +226,8 @@ def main_gpu(params: dict):
             raise ValueError(f"Specified routine '{routine}' is incompatible with device 'cpu'. Compatible routines are {cpu_routines}.")
         main_cpu(params)
     elif params["device"] == "gpu" :
+        if not cuda.is_available():
+            raise RuntimeError("Trying to run on GPU but CUDA is not available")
         if not routine in gpu_routines:
             raise ValueError(f"Specified routine '{routine}' is incompatible with device 'gpu'. Compatible routines are {gpu_routines}.")
         main_gpu(params)
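The dispatch logic in this hunk can be isolated into a small function, which makes the order of the checks easier to test: CUDA availability is verified before the routine name. The sketch below is not the code in `scripts/run.py`; `check_device` is a hypothetical helper, the routine lists are passed in explicitly, and CUDA availability is a plain boolean so that the example runs without a GPU or Numba.

```python
def check_device(device, routine, cpu_routines, gpu_routines, cuda_available):
    """Validate a device/routine pair and return the device name.

    Hypothetical helper mirroring the checks in the hunk above. In real
    code `cuda_available` would come from `numba.cuda.is_available()`.
    """
    if device == "cpu":
        if routine not in cpu_routines:
            raise ValueError(
                f"Specified routine '{routine}' is incompatible with device 'cpu'."
            )
    elif device == "gpu":
        # Fail early with a clear message when no CUDA device is usable.
        if not cuda_available:
            raise RuntimeError("Trying to run on GPU but CUDA is not available")
        if routine not in gpu_routines:
            raise ValueError(
                f"Specified routine '{routine}' is incompatible with device 'gpu'."
            )
    else:
        raise ValueError(f"Unknown device '{device}'")
    return device


if __name__ == "__main__":
    cpu = ["matmul_numba_serial", "matmul_numba_cpu"]
    gpu = ["matmul_numba_gpu", "matmul_numba_block_gpu"]
    print(check_device("cpu", "matmul_numba_cpu", cpu, gpu, cuda_available=False))
```

The `else` branch for unknown devices is an addition of this sketch; the diff does not show how the script handles other values.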

shell/submit.sbatch

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+#!/bin/bash
+#SBATCH --partition=gpu2
+#SBATCH --nodes=1
+#SBATCH --ntasks-per-node=1
+#SBATCH --cpus-per-task=1
+#SBATCH --gpus-per-node=0
+#SBATCH --gpus-per-task=0
+#SBATCH --mem=20G
+#SBATCH --time=00:10:00
+#SBATCH --output=%x.o%j.%N
+#SBATCH --error=%x.e%j.%N
+#SBATCH --job-name=matmul
+
+# Print job details
+NOW=`date +%H:%M:%S-%a-%d/%b/%Y`
+echo '------------------------------------------------------'
+echo 'This job is allocated on '$SLURM_JOB_CPUS_PER_NODE' cpu(s) and '$SLURM_GPUS_PER_NODE' gpu(s)'
+echo 'Job is running on node(s): '
+echo $SLURM_JOB_NODELIST
+echo '------------------------------------------------------'
+#
+# ==== End of Info part (say things) ===== #
+#
+
+cd $SLURM_SUBMIT_DIR # here we go into the submission directory
+export SLURM_NTASKS_PER_NODE=1 # need to export this, not for all clusters but Ulysses has a bug :/
+
+# load a bunch of modules
+module use /opt/contrib/mathlab/modules
+module load miniconda3
+source $HOME/.bashrc
+module load nvhpc-hpcx/23.1
+
+export CUDA_HOME=$NVHPC_ROOT/cuda/12.0
+export NUMBAPRO_NVVM=$NVHPC_ROOT/cuda/12.0/nvvm/lib64
+export NUMBAPRO_LIBDEVICE=$NVHPC_ROOT/cuda/12.0/nvvm/libdevice
+echo "$CUDA_HOME"
+echo "$NUMBAPRO_NVVM"
+echo "$NUMBAPRO_LIBDEVICE"
+
+conda activate matmul
+
+# set number of threads according to available resources
+export NUMBA_NUM_THREADS=1
+
+echo "Starting at $(date +%H:%M:%S-%a-%d/%b/%Y)"
+# Run the script
+mpirun -n $SLURM_NTASKS --bind-to socket --map-by socket python scripts/run.py --config=experiments/config
+#mpirun -n $SLURM_NTASKS --bind-to socket --map-by socket pytest
+#mpirun -n $SLURM_NTASKS --bind-to socket --map-by socket --report-bindings \
+#    bash -c 'kernprof -lz -o "3_20000_rank${OMPI_COMM_WORLD_RANK}.lprof" scripts/run.py --config=experiments/config'
+echo "Finished at $(date +%H:%M:%S-%a-%d/%b/%Y)"
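The batch script hard-codes `NUMBA_NUM_THREADS=1` next to a comment about setting it "according to available resources". One possible way to derive the value from the allocation, sketched here in Python rather than shell, is to read `SLURM_CPUS_PER_TASK`, which Slurm exports when `--cpus-per-task` is given. The function name `numba_threads_from_slurm` is hypothetical, and this is an alternative to the script's approach, not what the commit does.

```python
import os


def numba_threads_from_slurm(env=None) -> int:
    """Pick a thread count from SLURM_CPUS_PER_TASK, falling back to 1."""
    env = os.environ if env is None else env
    value = env.get("SLURM_CPUS_PER_TASK", "1")
    try:
        threads = int(value)
    except ValueError:
        # Unparseable value: fall back to a single thread.
        threads = 1
    return max(threads, 1)


if __name__ == "__main__":
    # The variable should be set before Numba is imported, since Numba
    # reads its configuration from the environment at import time.
    os.environ["NUMBA_NUM_THREADS"] = str(numba_threads_from_slurm())
    print(os.environ["NUMBA_NUM_THREADS"])
```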

src/matmul/__init__.py

Lines changed: 12 additions & 6 deletions
@@ -1,11 +1,17 @@
+import warnings
+
 __all__ = [
     'matmul',
     'matmul_numba_serial',
     'matmul_numba_cpu',
-    'matmul_numba_gpu',
     'matmul_numba_block_serial',
-    'matmul_numba_block_cpu',
-    'matmul_numba_block_gpu']
-
-
-from .routines import matmul, matmul_numba_serial, matmul_numba_cpu, matmul_numba_gpu, matmul_numba_block_serial, matmul_numba_block_cpu, matmul_numba_block_gpu
+    'matmul_numba_block_cpu']
+from .utils import custom_warning
+from .routines import matmul, matmul_numba_serial, matmul_numba_cpu, matmul_numba_block_serial, matmul_numba_block_cpu
+try:
+    from .routines import matmul_numba_gpu, matmul_numba_block_gpu
+    __all__.append('matmul_numba_gpu')
+    __all__.append('matmul_numba_block_gpu')
+except ImportError:
+    warnings.formatwarning = custom_warning
+    warnings.warn("CUDA not found: GPU functions won't be available.")
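The `except ImportError` branch works together with the change in `src/matmul/routines.py`: the GPU kernels are only defined when `cuda.is_available()` is true, and importing a name that a module does not define raises `ImportError`. The sketch below reproduces this mechanism with a stand-in module (`fake_routines`, a made-up name) so that it runs without Numba or CUDA.

```python
import sys
import types

# Stand-in for `matmul.routines` on a machine without CUDA: only a CPU
# routine is defined (module and function bodies are illustrative).
fake_routines = types.ModuleType("fake_routines")
fake_routines.matmul_numba_cpu = lambda A, B, C: None
sys.modules["fake_routines"] = fake_routines

exported = ["matmul_numba_cpu"]
from fake_routines import matmul_numba_cpu  # always available

try:
    # The name is missing from the module, so this raises ImportError.
    from fake_routines import matmul_numba_gpu
    exported.append("matmul_numba_gpu")
    gpu_available = True
except ImportError:
    gpu_available = False

print(exported, gpu_available)
```

One consequence of this design is that any other `ImportError` raised while importing the GPU names would also be reported as "CUDA not found".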

src/matmul/routines.py

Lines changed: 42 additions & 41 deletions
@@ -66,51 +66,52 @@ def matmul_numba_block_serial(A,B,C, bs=64):
                         for j in range(jj,jmax):
                             C[i,j] += A[i,k] * B[k,j]
 
-@cuda.jit(void(float64[:,::1],float64[:,::1],float64[:,:]), cache=True, debug=False)
-def matmul_numba_gpu(A,B,C):
-    # this only has effect if function is compiled with debug = True
-    assert (A.shape[0] == C.shape[0]) and (A.shape[1] == B.shape[0]) and (B.shape[1] == C.shape[1]), "Matrices have incompatible shapes"
-    i, j = cuda.grid(ndim=2)
-    if i < C.shape[0] and j < C.shape[1]:
-        tmp = 0.
-        for k in range(B.shape[0]):
-            tmp += A[i,k] * B[k,j]
-        C[i,j] = tmp
+if cuda.is_available():
+    @cuda.jit(void(float64[:,::1],float64[:,::1],float64[:,:]), cache=True, debug=False)
+    def matmul_numba_gpu(A,B,C):
+        # this only has effect if function is compiled with debug = True
+        assert (A.shape[0] == C.shape[0]) and (A.shape[1] == B.shape[0]) and (B.shape[1] == C.shape[1]), "Matrices have incompatible shapes"
+        i, j = cuda.grid(ndim=2)
+        if i < C.shape[0] and j < C.shape[1]:
+            tmp = 0.
+            for k in range(B.shape[0]):
+                tmp += A[i,k] * B[k,j]
+            C[i,j] = tmp
 
-BLOCK_SIZE = 16
-@cuda.jit(void(float64[:,::1],float64[:,::1],float64[:,:]), cache=True, debug=False)
-def matmul_numba_block_gpu(A,B,C):
-    # this only has effect if function is compiled with debug = True
-    assert (A.shape[0] == C.shape[0]) and (A.shape[1] == B.shape[0]) and (B.shape[1] == C.shape[1]), "Matrices have incompatible shapes"
+    BLOCK_SIZE = 16
+    @cuda.jit(void(float64[:,::1],float64[:,::1],float64[:,:]), cache=True, debug=False)
+    def matmul_numba_block_gpu(A,B,C):
+        # this only has effect if function is compiled with debug = True
+        assert (A.shape[0] == C.shape[0]) and (A.shape[1] == B.shape[0]) and (B.shape[1] == C.shape[1]), "Matrices have incompatible shapes"
 
-    bi = cuda.blockIdx.y
-    bj = cuda.blockIdx.x
-    ti = cuda.threadIdx.y
-    tj = cuda.threadIdx.x
-    bh = cuda.blockDim.y
-    bw = cuda.blockDim.x
-    gi = bi*bh + ti
-    gj = bj*bw + tj
-    nblocks = (A.shape[1] + BLOCK_SIZE - 1)//BLOCK_SIZE
+        bi = cuda.blockIdx.y
+        bj = cuda.blockIdx.x
+        ti = cuda.threadIdx.y
+        tj = cuda.threadIdx.x
+        bh = cuda.blockDim.y
+        bw = cuda.blockDim.x
+        gi = bi*bh + ti
+        gj = bj*bw + tj
+        nblocks = (A.shape[1] + BLOCK_SIZE - 1)//BLOCK_SIZE
 
-    Ashared = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE),dtype=float64)
-    Bshared = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE),dtype=float64)
-    tmp = 0.
-    for b in range(nblocks):
+        Ashared = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE),dtype=float64)
+        Bshared = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE),dtype=float64)
+        tmp = 0.
+        for b in range(nblocks):
 
-        Ashared[ti,tj] = 0
-        Bshared[ti,tj] = 0
-
-        if gi < A.shape[0] and (tj + b*BLOCK_SIZE) < A.shape[1]:
-            Ashared[ti,tj] = A[gi,tj + b*BLOCK_SIZE]
-        if (ti + b*BLOCK_SIZE) < B.shape[0] and gj < B.shape[1]:
-            Bshared[ti,tj] = B[ti + b*BLOCK_SIZE,gj]
-        cuda.syncthreads()
+            Ashared[ti,tj] = 0
+            Bshared[ti,tj] = 0
+
+            if gi < A.shape[0] and (tj + b*BLOCK_SIZE) < A.shape[1]:
+                Ashared[ti,tj] = A[gi,tj + b*BLOCK_SIZE]
+            if (ti + b*BLOCK_SIZE) < B.shape[0] and gj < B.shape[1]:
+                Bshared[ti,tj] = B[ti + b*BLOCK_SIZE,gj]
+            cuda.syncthreads()
 
-        for k in range(BLOCK_SIZE):
-            tmp += Ashared[ti,k] * Bshared[k,tj]
-        cuda.syncthreads()
+            for k in range(BLOCK_SIZE):
+                tmp += Ashared[ti,k] * Bshared[k,tj]
+            cuda.syncthreads()
 
-    if gi < C.shape[0] and gj < C.shape[1]:
-        C[gi,gj] = tmp
+        if gi < C.shape[0] and gj < C.shape[1]:
+            C[gi,gj] = tmp
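To check the indexing of `matmul_numba_block_gpu` without a GPU, the tile-by-tile accumulation can be emulated on the CPU. The sketch below is a plain-Python model written for this purpose (`tiled_matmul` is a hypothetical name; it is not one of the package routines): the loops over `ti` and `tj` stand in for the threads of one block, `acc[ti][tj]` for each thread's `tmp`, and the zero-initialised tiles for the shared arrays, including the zero padding of partial tiles at the matrix edges.

```python
def tiled_matmul(A, B, tile=2):
    """Multiply nested-list matrices tile by tile, zero-padding partial tiles."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    ntiles = (k + tile - 1) // tile  # same ceiling division as `nblocks`
    for bi in range(0, n, tile):          # block row offset
        for bj in range(0, m, tile):      # block column offset
            acc = [[0.0] * tile for _ in range(tile)]
            for b in range(ntiles):
                # Load one tile of A and one of B; entries that fall
                # outside the matrices stay zero.
                a_tile = [[0.0] * tile for _ in range(tile)]
                b_tile = [[0.0] * tile for _ in range(tile)]
                for ti in range(tile):
                    for tj in range(tile):
                        gi, gj = bi + ti, bj + tj
                        if gi < n and tj + b * tile < k:
                            a_tile[ti][tj] = A[gi][tj + b * tile]
                        if ti + b * tile < k and gj < m:
                            b_tile[ti][tj] = B[ti + b * tile][gj]
                # Accumulate the partial products of this pair of tiles.
                for ti in range(tile):
                    for tj in range(tile):
                        for kk in range(tile):
                            acc[ti][tj] += a_tile[ti][kk] * b_tile[kk][tj]
            # Write back only the entries that exist in C.
            for ti in range(tile):
                for tj in range(tile):
                    gi, gj = bi + ti, bj + tj
                    if gi < n and gj < m:
                        C[gi][gj] = acc[ti][tj]
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[1, 0], [0, 1], [1, 1]]
    print(tiled_matmul(A, B, tile=2))
```

Because the padded entries are zero, the inner loop can always run over the full tile width, which is the same reason the kernel loops over `range(BLOCK_SIZE)` regardless of the matrix size.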
