- docker GitHub workflow

Gabriele-Codega · Gabriele-Codega · commit 7535bb87e82d · 2025-03-25T15:08:44.000+01:00
- CUDA checks
- readme
- requirements
diff --git a/.github/workflows/docker.yml b/.github/workflows/docker.yml
@@ -0,0 +1,39 @@
+name: docker
+
+on:
+  push:
+    branches:
+      - main
+
+jobs:
+  build:
+    strategy:
+      matrix:
+        target: [docker, singularity]
+
+    runs-on: ubuntu-latest
+
+    steps:
+
+      - name: Checkout repo
+        uses: actions/checkout@v4
+
+      - name: Setup Docker buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Docker Hub authentication
+        uses: docker/login-action@v3
+        with:
+          username: ${{ secrets.DOCKERHUB_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_TOKEN }}
+
+      - name: Build and push
+        uses: docker/build-push-action@v6
+        with:
+          target: ${{matrix.target}}
+          tags: gcodega/matmul:cuda12.4-${{matrix.target}}
+          cache-to: type=inline
+          cache-from: |
+                      type=registry,ref=gcodega/matmul:cuda12.4-docker
+                      type=registry,ref=gcodega/matmul:cuda12.4-singularity
+          push: true
diff --git a/Dockerfile b/Dockerfile
@@ -0,0 +1,66 @@
+# ---- Base stage ----
+# Install Python and OpenMPI.
+FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS base
+
+ENV CUDA_HOME=/usr/local/cuda
+
+RUN apt-get update && \
+	apt-get install -y --no-install-recommends python3.11 python3.11-dev wget curl openssh-client && \
+	rm -rf /var/lib/apt/lists/* && \
+	update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1 && \
+	curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
+	python get-pip.py && \
+	rm get-pip.py
+
+WORKDIR /opt
+ENV MPI_HOME=/opt/openmpi
+ENV PATH=$MPI_HOME/bin:$PATH
+ENV LD_LIBRARY_PATH=$MPI_HOME/lib:$LD_LIBRARY_PATH
+RUN wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.8.tar.gz && \
+	tar -xzf openmpi-4.1.8.tar.gz && \
+	rm openmpi-4.1.8.tar.gz && \
+	cd openmpi-4.1.8 && \
+	./configure --prefix=$MPI_HOME --with-cuda=$CUDA_HOME && \
+	make -j4 install
+
+# ---- Build stage for Docker ----
+# Setting a user that is not root.
+FROM base AS docker
+RUN groupadd -g 1001 matmul && \
+	useradd -u 1001 -g matmul tony
+
+ENV HOME=/home/tony
+WORKDIR $HOME
+RUN cp /root/.bashrc . && \
+	cp /root/.profile . && \
+	chown -R tony:matmul $HOME && \
+	mkdir .local app && \
+	chown -R tony:matmul .local app
+
+WORKDIR $HOME/app
+USER tony
+ENV PATH=$HOME/.local/bin:$PATH
+
+COPY --chown=tony:matmul requirements.txt .
+RUN python -m pip install --no-cache-dir --no-binary=mpi4py -r requirements.txt 
+
+COPY --chown=tony:matmul . .
+RUN python -m pip install -e .
+
+# ---- Build stage for Singularity ----
+# No user is set here, as recommended for Singularity: the container
+# inherits the user from the host, so it won't run as root.
+FROM base AS singularity
+ENV HOME=/shared-folder
+WORKDIR $HOME
+RUN cp /root/.bashrc . && \
+	cp /root/.profile . && \
+	chmod -R a+rwx $HOME
+
+WORKDIR $HOME/app
+COPY --chmod=777 requirements.txt .
+RUN python -m pip install --no-cache-dir --no-binary=mpi4py -r requirements.txt
+
+COPY --chmod=777 . .
+RUN python -m pip install -e .
+
diff --git a/README.md b/README.md
@@ -6,14 +6,26 @@ Code for the exam in Development tools for Scientific Computing, SISSA, a.y. 202
 ---
 
 ## Parallel matrix-matrix multiplication
-The goal here was to implement matrix-matrix multiplication in distributed memory. The core idea is to split the matrices we want to multiply between MPI processes, and let each process compute a chunk of the result. Since we are in distributed memory, some communication is required for each process to properly compute its chunk of the result, but ultimately each process needs some matrix-matrix multiplication routine to do its work. The efficiency of the underliying multiplication routine can severely affect the performance of the distributed algorithm. Different matrix-matrix multiplication routines are provided in `src/matmul/routines.py`, all of which can be used in the distributed routine. The actual distributed machinery is provided in `scripts/run.py`.
+The goal here was to implement matrix-matrix multiplication in distributed memory. Details about the implementation are [a bit further down](#notes-on-the-implementation), but the general idea is to split the matrices we want to multiply between MPI processes, and let each process compute a chunk of the result. The whole distributed multiplication requires several steps, such as computing the workload for each process, initialising the data, communicating, and computing individual chunks, so it is not straightforward to write a single `distributed_multiply` routine. In fact, this code has no such routine; the whole distributed machinery is provided in `scripts/run.py`.
+
+`src/matmul/routines.py` provides a number of matrix-matrix multiplication routines (serial, parallel, tiled, GPU-accelerated) that can be used in the distributed algorithm. The performance of the distributed algorithm depends on the performance of the base routine.
 
 All the code is implemented in Python. NumPy is employed to manipulate the matrices, while Numba is used to JIT compile routines in serial, parallel, CPU and GPU code. The MPI is provided by mpi4py.
 
-This package tries to install mpi4py with `pip`, which requires a working installation of MPI on the machine. Also, for GPU computing, a recent version of the CUDA Toolkit is required (see [Numba](https://numba.readthedocs.io/en/stable/cuda/overview.html) for details).
+## Installation
+**NOTE:** Installing mpi4py with `pip` requires a working installation of MPI on the machine. Also, for GPU computing, a recent version of the CUDA Toolkit is required (see [Numba](https://numba.readthedocs.io/en/stable/cuda/overview.html) for details).
 
-### NVHPC
-As it turns out, NVHPC ships with all is needed here. One issue is that mpi4py is not really meant to be compiled with nvc by default. If you have issues while installing you may want to try this
+### From GitHub
+Clone this repo locally and then run
+```bash
+python -m pip install --no-cache-dir --no-binary=mpi4py -r requirements.txt
+python -m pip install .
+```
+Optionally install dependencies for testing (`test`), profiling (`profile`) or both (`dev`) with
+```bash
+python -m pip install .[<DEPENDENCY>]
+```
+As it turns out, [NVHPC](https://developer.nvidia.com/hpc-sdk) ships with all that is needed here. One issue is that mpi4py is not really meant to be compiled with nvc by default. If you have issues while installing, you may want to try this
 ``` bash
 CFLAGS=-noswitcherror python -m pip install --no-cache-dir --no-binary=mpi4py mpi4py
 ```
@@ -23,7 +35,20 @@ export CUDA_HOME=$NVHPC_ROOT/cuda/12.0
 export NUMBAPRO_NVVM=$NVHPC_ROOT/cuda/12.0/nvvm/lib64
 export NUMBAPRO_LIBDEVICE=$NVHPC_ROOT/cuda/12.0/nvvm/libdevice
 ```
-Finally, should you run any of this code on an HPC facility and submit a SLURM job, note that SLURM's `srun` might not work with mpi4py, and you may need to use `mpirun` instead.
+
+### From DockerHub
+You can get container images with this code from DockerHub as well. The images are built with CUDA 12.4 and still require NVIDIA drivers on the host machine to run.
+If you plan on running a Docker container you can get the image with
+```bash
+docker pull gcodega/matmul:cuda12.4-docker
+```
+If you plan on running a Singularity container, you can get a different tag
+```bash
+docker pull gcodega/matmul:cuda12.4-singularity
+```
+Note that to run Docker with CUDA support you may need the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), whereas Singularity natively supports CUDA.
+
+The tags only differ in that the Docker image defines a custom user (tony:matmul), whereas the Singularity image keeps the default root user. Not setting a different user in the Singularity image is actually recommended, as Singularity containers inherit the user from the host, and setting a user different from root may cause issues with environments inside the container. Also, if you want to run the code on an HPC facility you may want to use the Singularity image, as it can interact with the host MPI and run on multiple nodes. Note that in this case performance may not be optimal, as the OpenMPI inside the container is not optimized for any specific machine.
 
 ## Run some tests
 To check out how different routines perform you can run `scripts/run.py`. After installing the package, you can modify `examples/config.yaml` by specifying the following parameters:
@@ -38,7 +63,7 @@ You can run the script with
 ```
 mpirun -n <ntasks> python scripts/run.py --config experiments/config
 ```
-Note that if you want to run this in serial you still need to use `mpirun -n 1 ...`
+Note that if you want to run this in serial you still need to use `mpirun -n 1 ...`. Moreover, if you run this code on an HPC facility through a SLURM job, SLURM's `srun` might not work with mpi4py, and you may need to use `mpirun` instead (`shell/submit.sbatch` is the script that I used to submit jobs on Ulysses at SISSA). Finally, when running through Singularity you may need to specify absolute paths for the scripts (all source code is in `/shared-folder/app`).
 
 ### Profiling
 The script will print to screen the time spent in multiplying the matrices (i.e. no communication time or others). You can get more insights by profiling the code with kernprof. The script in `shell/submit.sh` lets you run one instance of kernprof for each MPI task and save the results on different files. You can select the number of threads for parallel routines by changing `NUMBA_NUM_THREADS` and customize the output path for kernprof. Run the script as
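
Note on the README text above: it mentions "computing the workload for each process" as one of the steps of the distributed multiplication. The actual logic lives in `scripts/run.py` and is not shown in this diff, so the following is only a minimal sketch of one common way to do it, assuming the rows are split as evenly as possible between ranks. The helper name `split_rows` and its parameters are hypothetical.

```python
def split_rows(n_rows: int, n_procs: int) -> list[tuple[int, int]]:
    """Return one (start, stop) row range per process (hypothetical helper)."""
    base, extra = divmod(n_rows, n_procs)
    ranges = []
    start = 0
    for rank in range(n_procs):
        # The first `extra` ranks take one additional row each,
        # so the sizes differ by at most one row.
        count = base + (1 if rank < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


if __name__ == "__main__":
    print(split_rows(10, 3))
```

With more processes than rows, the trailing ranks get empty ranges, which the caller would have to handle.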
diff --git a/experiments/config.yaml b/experiments/config.yaml
@@ -1,6 +1,6 @@
 device: cpu
-size: 4096
+size: 256
 function: 
-  routine: matmul_numba_serial
+  routine: matmul_numba_cpu
   block_size: 32
 print: False
diff --git a/requirements.txt b/requirements.txt
@@ -1,5 +1,5 @@
 llvmlite==0.44.0
-mpi4py==4.0.3 --no-binary=mpi4py
+mpi4py==4.0.3
 numba==0.61.0
 numpy==2.1.3
 PyYAML==6.0.2
diff --git a/scripts/run.py b/scripts/run.py
@@ -1,5 +1,5 @@
 from functools import wraps
-from warnings import warn
+import warnings
 import numpy as np
 from numba import cuda
 
@@ -9,14 +9,15 @@
 mpi4py.rc.finalize = False
 from mpi4py import MPI
 
-from matmul.utils import create_block, read_config
+from matmul.utils import create_block, read_config, custom_warning
 import argparse
 import importlib
 
 try:
     from line_profiler import profile
 except ModuleNotFoundError:
-    warn("Did not find line_profiler. Please install it to access profiling information.")
+    warnings.formatwarning = custom_warning
+    warnings.warn("Did not find line_profiler. Please install it to access profiling information.")
     def profile(f,*args,**kwargs):
         def wrapper(*args,**kwargs):
             f(*args,**kwargs)
@@ -225,6 +226,8 @@ def main_gpu(params: dict):
             raise ValueError(f"Specified routine '{routine}' is incompatible with device 'cpu'. Compatible routines are {cpu_routines}.")
         main_cpu(params)
     elif params["device"] == "gpu" :
+        if not cuda.is_available():
+            raise RuntimeError("Trying to run on GPU but CUDA is not available")
         if not routine in gpu_routines:
             raise ValueError(f"Specified routine '{routine}' is incompatible with device 'gpu'. Compatible routines are {gpu_routines}.")
         main_gpu(params)
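
Note on the `line_profiler` fallback earlier in this file's diff: as far as the visible hunk shows, the stand-in `profile` calls `f` without returning its result. If any decorated function returns a value, a fallback along these lines might be safer. This is a sketch, not the code in the commit; `add` is a made-up function used only to exercise the decorator.

```python
from functools import wraps

try:
    from line_profiler import profile
except ImportError:  # also covers ModuleNotFoundError
    def profile(func):
        """No-op stand-in used when line_profiler is not installed."""
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Pass the return value through unchanged.
            return func(*args, **kwargs)
        return wrapper


@profile
def add(a, b):
    # Made-up example function, not part of the package.
    return a + b
```

Using `wraps` keeps the decorated function's name and docstring, which helps when reading tracebacks or profiler output.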
diff --git a/shell/submit.sbatch b/shell/submit.sbatch
@@ -0,0 +1,52 @@
+#!/bin/bash
+#SBATCH --partition=gpu2
+#SBATCH --nodes=1
+#SBATCH --ntasks-per-node=1
+#SBATCH --cpus-per-task=1
+#SBATCH --gpus-per-node=0
+#SBATCH --gpus-per-task=0
+#SBATCH --mem=20G
+#SBATCH --time=00:10:00
+#SBATCH --output=%x.o%j.%N
+#SBATCH --error=%x.e%j.%N
+#SBATCH --job-name=matmul
+
+# Print job details
+NOW=`date +%H:%M:%S-%a-%d/%b/%Y`
+echo '------------------------------------------------------'
+echo 'This job is allocated on '$SLURM_JOB_CPUS_PER_NODE' cpu(s) and '$SLURM_GPUS_PER_NODE' gpu(s)'
+echo 'Job is running on node(s): '
+echo  $SLURM_JOB_NODELIST
+echo '------------------------------------------------------'
+#
+# ==== End of Info part (say things) ===== #
+#
+
+cd $SLURM_SUBMIT_DIR            # here we go into the submission directory
+export SLURM_NTASKS_PER_NODE=1  # need to export this, not for all clusters but Ulysses has a bug :/
+
+# load a bunch of modules
+module use /opt/contrib/mathlab/modules
+module load miniconda3
+source $HOME/.bashrc
+module load nvhpc-hpcx/23.1
+
+export CUDA_HOME=$NVHPC_ROOT/cuda/12.0
+export NUMBAPRO_NVVM=$NVHPC_ROOT/cuda/12.0/nvvm/lib64
+export NUMBAPRO_LIBDEVICE=$NVHPC_ROOT/cuda/12.0/nvvm/libdevice
+echo "$CUDA_HOME"
+echo "$NUMBAPRO_NVVM"
+echo "$NUMBAPRO_LIBDEVICE"
+
+conda activate matmul
+
+# set number of threads according to available resources
+export NUMBA_NUM_THREADS=1
+
+echo "Starting at $(date +%H:%M:%S-%a-%d/%b/%Y)"
+# Run the script
+mpirun -n $SLURM_NTASKS --bind-to socket --map-by socket python scripts/run.py --config=experiments/config
+#mpirun -n $SLURM_NTASKS --bind-to socket --map-by socket pytest
+#mpirun -n $SLURM_NTASKS --bind-to socket --map-by socket --report-bindings \
+#    bash -c 'kernprof -lz -o "3_20000_rank${OMPI_COMM_WORLD_RANK}.lprof" scripts/run.py --config=experiments/config'
+echo "Finished at $(date +%H:%M:%S-%a-%d/%b/%Y)"
diff --git a/src/matmul/__init__.py b/src/matmul/__init__.py
@@ -1,11 +1,17 @@
+import warnings
+
 __all__ = [
         'matmul',
         'matmul_numba_serial',
         'matmul_numba_cpu',
-        'matmul_numba_gpu',
         'matmul_numba_block_serial',
-        'matmul_numba_block_cpu',
-        'matmul_numba_block_gpu']
-
-
-from .routines import matmul, matmul_numba_serial, matmul_numba_cpu, matmul_numba_gpu, matmul_numba_block_serial, matmul_numba_block_cpu, matmul_numba_block_gpu
+        'matmul_numba_block_cpu']
+from .utils import custom_warning
+from .routines import matmul, matmul_numba_serial, matmul_numba_cpu, matmul_numba_block_serial, matmul_numba_block_cpu 
+try:
+    from .routines import matmul_numba_gpu, matmul_numba_block_gpu
+    __all__.append('matmul_numba_gpu')
+    __all__.append('matmul_numba_block_gpu')
+except ImportError:
+    warnings.formatwarning = custom_warning
+    warnings.warn("CUDA not found: GPU functions won't be available.")
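
Note on the warning setup above: `custom_warning` is defined in `src/matmul/utils.py`, whose diff is not shown here. The sketch below assumes it follows the signature that `warnings.formatwarning` expects and returns a one-line message. Both the formatter body and the `optional_attr` helper are hypothetical, and the standard `math` module stands in for `.routines` so that the example runs anywhere.

```python
import importlib
import warnings


def custom_warning(message, category, filename, lineno, line=None):
    """Hypothetical one-line formatter; the real one is in matmul/utils.py."""
    return f"{category.__name__}: {message}\n"


def optional_attr(module_name, attr_name):
    """Return module.attr, or None (with a warning) if it cannot be loaded."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr_name)
    except (ImportError, AttributeError):
        # Replacing formatwarning changes the output of every later warning.
        warnings.formatwarning = custom_warning
        warnings.warn(f"{module_name}.{attr_name} not found: feature disabled.")
        return None


sqrt = optional_attr("math", "sqrt")             # present
missing = optional_attr("math", "no_such_name")  # absent, emits a warning
```

Since `warnings.formatwarning` is a module-level hook, assigning it affects the whole process, not just this package.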
diff --git a/src/matmul/routines.py b/src/matmul/routines.py
@@ -66,51 +66,52 @@ def matmul_numba_block_serial(A,B,C, bs=64):
                         for j in range(jj,jmax):
                             C[i,j] += A[i,k] * B[k,j]
 
-@cuda.jit(void(float64[:,::1],float64[:,::1],float64[:,:]), cache=True, debug=False)
-def matmul_numba_gpu(A,B,C):
-    # this only has effect if function is compiled with debug = True
-    assert (A.shape[0] == C.shape[0]) and (A.shape[1] == B.shape[0]) and (B.shape[1] == C.shape[1]), "Matrices have incompatible shapes"
-    i, j = cuda.grid(ndim=2)
-    if i < C.shape[0] and j < C.shape[1]:
-        tmp = 0.
-        for k in range(B.shape[0]):
-            tmp += A[i,k] * B[k,j]
-        C[i,j] = tmp
+if cuda.is_available():
+    @cuda.jit(void(float64[:,::1],float64[:,::1],float64[:,:]), cache=True, debug=False)
+    def matmul_numba_gpu(A,B,C):
+        # this only has effect if function is compiled with debug = True
+        assert (A.shape[0] == C.shape[0]) and (A.shape[1] == B.shape[0]) and (B.shape[1] == C.shape[1]), "Matrices have incompatible shapes"
+        i, j = cuda.grid(ndim=2)
+        if i < C.shape[0] and j < C.shape[1]:
+            tmp = 0.
+            for k in range(B.shape[0]):
+                tmp += A[i,k] * B[k,j]
+            C[i,j] = tmp
 
-BLOCK_SIZE = 16
-@cuda.jit(void(float64[:,::1],float64[:,::1],float64[:,:]), cache=True, debug=False)
-def matmul_numba_block_gpu(A,B,C):
-    # this only has effect if function is compiled with debug = True
-    assert (A.shape[0] == C.shape[0]) and (A.shape[1] == B.shape[0]) and (B.shape[1] == C.shape[1]), "Matrices have incompatible shapes"
+    BLOCK_SIZE = 16
+    @cuda.jit(void(float64[:,::1],float64[:,::1],float64[:,:]), cache=True, debug=False)
+    def matmul_numba_block_gpu(A,B,C):
+        # this only has effect if function is compiled with debug = True
+        assert (A.shape[0] == C.shape[0]) and (A.shape[1] == B.shape[0]) and (B.shape[1] == C.shape[1]), "Matrices have incompatible shapes"
 
-    bi = cuda.blockIdx.y
-    bj = cuda.blockIdx.x
-    ti = cuda.threadIdx.y
-    tj = cuda.threadIdx.x
-    bh = cuda.blockDim.y
-    bw = cuda.blockDim.x
-    gi = bi*bh + ti
-    gj = bj*bw + tj
-    nblocks = (A.shape[1] + BLOCK_SIZE - 1)//BLOCK_SIZE
+        bi = cuda.blockIdx.y
+        bj = cuda.blockIdx.x
+        ti = cuda.threadIdx.y
+        tj = cuda.threadIdx.x
+        bh = cuda.blockDim.y
+        bw = cuda.blockDim.x
+        gi = bi*bh + ti
+        gj = bj*bw + tj
+        nblocks = (A.shape[1] + BLOCK_SIZE - 1)//BLOCK_SIZE
 
-    Ashared = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE),dtype=float64)
-    Bshared = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE),dtype=float64)
-    tmp = 0.
-    for b in range(nblocks):
+        Ashared = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE),dtype=float64)
+        Bshared = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE),dtype=float64)
+        tmp = 0.
+        for b in range(nblocks):
 
-        Ashared[ti,tj] = 0
-        Bshared[ti,tj] = 0
-        
-        if gi < A.shape[0] and (tj + b*BLOCK_SIZE) < A.shape[1]:
-            Ashared[ti,tj] = A[gi,tj + b*BLOCK_SIZE]
-        if (ti + b*BLOCK_SIZE) < B.shape[0] and gj < B.shape[1]:
-            Bshared[ti,tj] = B[ti + b*BLOCK_SIZE,gj]
-        cuda.syncthreads()
+            Ashared[ti,tj] = 0
+            Bshared[ti,tj] = 0
+            
+            if gi < A.shape[0] and (tj + b*BLOCK_SIZE) < A.shape[1]:
+                Ashared[ti,tj] = A[gi,tj + b*BLOCK_SIZE]
+            if (ti + b*BLOCK_SIZE) < B.shape[0] and gj < B.shape[1]:
+                Bshared[ti,tj] = B[ti + b*BLOCK_SIZE,gj]
+            cuda.syncthreads()
 
-        for k in range(BLOCK_SIZE):
-            tmp += Ashared[ti,k] * Bshared[k,tj]
-        cuda.syncthreads()
+            for k in range(BLOCK_SIZE):
+                tmp += Ashared[ti,k] * Bshared[k,tj]
+            cuda.syncthreads()
 
-    if gi < C.shape[0] and gj < C.shape[1]:
-        C[gi,gj] = tmp
+        if gi < C.shape[0] and gj < C.shape[1]:
+            C[gi,gj] = tmp
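
Note on `matmul_numba_block_gpu`: the tile loop and the zero padding of `Ashared`/`Bshared` can be checked on the CPU with a plain Python reference. The sketch below is not part of the package; `matmul_tiled` and `TILE` are hypothetical names, and it mirrors only the indexing of the kernel (one accumulator per output element, tiles along the shared dimension), not its parallelism or shared memory.

```python
TILE = 2  # plays the role of BLOCK_SIZE; small so the example is easy to trace


def matmul_tiled(A, B, tile=TILE):
    """Multiply list-of-lists matrices, walking the shared dimension in tiles."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    n_tiles = (m + tile - 1) // tile  # same ceil-division as `nblocks`
    for i in range(n):          # corresponds to gi
        for j in range(p):      # corresponds to gj
            tmp = 0.0
            for b in range(n_tiles):
                for k in range(tile):
                    kk = b * tile + k
                    # Entries past the edge behave like the zero-filled
                    # slots of the shared tiles.
                    a = A[i][kk] if kk < m else 0.0
                    bval = B[kk][j] if kk < m else 0.0
                    tmp += a * bval
            C[i][j] = tmp
    return C


if __name__ == "__main__":
    print(matmul_tiled([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]]))
```

Choosing a shared dimension that is not a multiple of the tile size (3 with `TILE = 2` here) exercises the padding branch.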
 
diff --git a/src/matmul/utils.py b/src/matmul/utils.py