
Commit 7cb22b9

Merge pull request #679 from mbareford/main
Containerised CPE update
2 parents 36045e1 + 0a84d60 commit 7cb22b9

5 files changed: +219 -35 lines changed


docs/user-guide/containers.md

Lines changed: 210 additions & 34 deletions
@@ -433,28 +433,27 @@ but can be made accessible by running `module use` with the right path.

```bash
module use /work/y07/shared/archer2-lmod/others/dev
-module load ccpe/23.12
+module load ccpe/25.03
```

The purpose of the `ccpe` module(s) is to allow developers to check that their code compiles with the
latest Cray Programming Environment (CPE) releases. The CPE release installed on ARCHER2 (currently
-CPE 22.12) will typically be older than the latest available. A more recent containerised CPE therefore
-gives developers the opportunity to try out the latest compilers and libraries before the ARCHER CPE
+CPE 23.09) will typically be older than the latest available. A more recent containerised CPE therefore
+gives developers the opportunity to try out the latest compilers and libraries before the ARCHER2 CPE
is upgraded.

!!! note
    The Containerised CPEs support CCE and GCC compilers, but not AOCC compilers.

-The `ccpe/23.12` module then provides access to CPE 23.12 via a Singularity image file, located at
-`/work/y07/shared/utils/dev/ccpe/23.12/cpe_23.12.sif`. Singularity containers can be run such that locations
+The `ccpe/25.03` module provides access to CPE 25.03 via a Singularity image file, located at
+`/work/y07/shared/utils/dev/ccpe/25.03/cpe_25.03.sif`. Singularity containers can be run such that locations
on the host file system are still visible. This means source code stored on `/work` can be compiled from
inside the CPE container. And any output resulting from the compilation, such as object files, libraries
and executables, can be written to `/work` also. This ability to bind to locations on the host is
necessary as the container is immutable, i.e., you cannot write files to the container itself.

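As an illustration only (not part of this diff), once the `ccpe/25.03` module is loaded you could open an interactive shell inside the containerised CPE and bind in one of your own directories; `/work/t01/t01/auser` below is a placeholder for your own `/work` directory.

```bash
# Illustrative sketch: start an interactive shell in the containerised CPE,
# binding a host /work directory so that its files are visible inside.
singularity shell --bind /work/t01/t01/auser \
    /work/y07/shared/utils/dev/ccpe/25.03/cpe_25.03.sif
```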

-Any executable resulting from a containerised CPE build can be run from within the container,
-allowing the developer to test the performance of the containerised libraries, e.g., `libmpi_cray`,
-`libpmi2`, `libfabric`.
+Any executable resulting from a containerised CPE build should also be run from within the container,
+allowing one to test the performance of the containerised libraries, e.g., `libmpi_cray`, `libpmi2`, `libfabric`.

We'll now show how to build and run a simple Hello World MPI example using a containerised CPE.

@@ -536,17 +535,17 @@ Examples of these files are given below.
    ```

The `ldd` command at the end of the build script is simply there to confirm that the code is indeed linked to
-containerised libraries that form part of the CPE 23.12 release.
+containerised libraries that form part of the CPE 25.03 release.

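As a rough illustration of the kind of check meant here (assuming the executable is called `helloworld`, as in this example), the final line of `build.sh` might be something like:

```bash
# Hypothetical check: list the shared libraries the executable is linked against
# and pick out the MPI and libfabric entries -- the reported paths should point
# at libraries inside the container rather than at the host CPE installation.
ldd ./helloworld | grep -E 'libmpi|libfabric'
```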

-The next step is to launch a job (via `sbatch`) on a serial node that instantiates the containerised CPE 23.12
+The next step is to launch a job (via `sbatch`) on a serial node that instantiates the containerised CPE 25.03
image and builds the Hello World MPI code.

=== "submit-build.slurm"
    ```slurm
    #!/bin/bash

    #SBATCH --job-name=ccpe-build
-    #SBATCH --ntasks=8
+    #SBATCH --ntasks=1
    #SBATCH --time=00:10:00
    #SBATCH --account=<budget code>
    #SBATCH --partition=serial
@@ -556,23 +555,21 @@ image and builds the Hello World MPI code.
    export OMP_NUM_THREADS=1

    module use /work/y07/shared/archer2-lmod/others/dev
-    module load ccpe/23.12
-
-    BUILD_CMD="${CCPE_BUILDER} ${SLURM_SUBMIT_DIR}/build.sh"
+    module load ccpe/25.03

    singularity exec --cleanenv \
-        --bind ${CCPE_BIND_ARGS},${SLURM_SUBMIT_DIR} --env LD_LIBRARY_PATH=${CCPE_LD_LIBRARY_PATH} \
-        ${CCPE_IMAGE_FILE} ${BUILD_CMD}
+        --bind ${CCPE_BIND_ARGS},${PWD} --env LD_LIBRARY_PATH=${CCPE_LD_LIBRARY_PATH} \
+        ${CCPE_IMAGE_FILE} ${CCPE_BUILDER} ${PWD}/build.sh
    ```

The `CCPE` environment variables shown above (e.g., `CCPE_BUILDER` and `CCPE_IMAGE_FILE`) are set by the
-loading of the `ccpe/23.12` module. The `CCPE_BUILDER` variable holds the path to the script that prepares the
+loading of the `ccpe/25.03` module. The `CCPE_BUILDER` variable holds the path to the script that prepares the
containerised environment prior to running the `build.sh` script. You can run `cat ${CCPE_BUILDER}` to take
a closer look at what is going on.

!!! note
-    Passing the `${SLURM_SUBMIT_DIR}` path to Singularity via the `--bind` option allows the CPE container
-    to access the source code and write out the executable using locations on the host.
+    Passing the `${PWD}` path to Singularity via the `--bind` option allows the CPE container
+    to access the source code and write out the executable within the current working directory on the host.

Running the newly-built code is similarly straightforward; this time the containerised CPE is launched on the
compute nodes using the `srun` command.
@@ -594,13 +591,11 @@ compute nodes using the `srun` command.
    export OMP_NUM_THREADS=1

    module use /work/y07/shared/archer2-lmod/others/dev
-    module load ccpe/23.12
-
-    RUN_CMD="${SLURM_SUBMIT_DIR}/helloworld"
+    module load ccpe/25.03

-    srun --distribution=block:block --hint=nomultithread --chdir=${SLURM_SUBMIT_DIR} \
-        singularity exec --bind ${CCPE_BIND_ARGS},${SLURM_SUBMIT_DIR} --env LD_LIBRARY_PATH=${CCPE_LD_LIBRARY_PATH} \
-        ${CCPE_IMAGE_FILE} ${RUN_CMD}
+    srun --distribution=block:block --hint=nomultithread --chdir=${PWD} \
+        singularity exec --bind ${CCPE_BIND_ARGS},${PWD} --env LD_LIBRARY_PATH=${CCPE_LD_LIBRARY_PATH} \
+        ${CCPE_IMAGE_FILE} ${PWD}/helloworld
    ```

If you wish you can at runtime replace a containerised library with its host equivalent. You may for example decide to
@@ -611,9 +606,9 @@ do this for a low-level communications library such as `libfabric` or `libpmi`.
    source ${CCPE_SET_HOST_PATH} "/opt/cray/pe/pmi" "6.1.8" "lib"
    ```

-As of April 2024, the version of PMI available on ARCHER2 is 6.1.8 (CPE 22.12), and so the command above would allow
-you to isolate the impact of the containerised PMI library, which for CPE 23.12 is PMI 6.1.13. To see how the setting
-of the host library is done, simply run `cat ${CCPE_SET_HOST_PATH}` after loading the `ccpe` module.
+As of August 2025, the versions of PMI available on ARCHER2 are 6.1.8 (CPE 22.12) and 6.1.12 (CPE 23.09), and so the
+command above would allow you to isolate the impact of the containerised PMI library, which for CPE 25.03 is PMI 6.1.15.
+To see how the setting of the host library is done, simply run `cat ${CCPE_SET_HOST_PATH}` after loading the `ccpe` module.

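Similarly, if you instead wanted to select the CPE 23.09 host PMI mentioned above, the corresponding command would presumably be the following (assuming the 6.1.12 release sits under the same `/opt/cray/pe/pmi` directory layout):

```bash
# Assumed variant of the command above, pointing at the CPE 23.09 host PMI instead
source ${CCPE_SET_HOST_PATH} "/opt/cray/pe/pmi" "6.1.12" "lib"
```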

An MPI code that just prints a message from each rank is obviously very simple. Real-world codes such as CP2K or GROMACS
will often require additional software for compilation, e.g., Intel MKL libraries or tools that control the build process
@@ -635,18 +630,199 @@ software is installed.
    export OMP_NUM_THREADS=1

    module use /work/y07/shared/archer2-lmod/others/dev
-    module load ccpe/23.12
+    module load ccpe/25.03

-    CMAKE_DIR="/work/y07/shared/utils/core/cmake/3.21.3"
-
-    BUILD_CMD="${CCPE_BUILDER} ${SLURM_SUBMIT_DIR}/build.sh"
+    CMAKE_DIR="/work/y07/shared/utils/core/cmake/3.29.4"

    singularity exec --cleanenv \
-        --bind ${CCPE_BIND_ARGS},${CMAKE_DIR},${SLURM_SUBMIT_DIR} \
+        --bind ${CCPE_BIND_ARGS},${CMAKE_DIR},${PWD} \
        --env LD_LIBRARY_PATH=${CCPE_LD_LIBRARY_PATH} \
-        ${CCPE_IMAGE_FILE} ${BUILD_CMD}
+        ${CCPE_IMAGE_FILE} ${CCPE_BUILDER} ${PWD}/build.sh
    ```

The `submit-cmake-build.slurm` script shows how the `--bind` option can be used to make the `CMake` installation on ARCHER2
accessible from within the container. The `build.sh` script can then call the `cmake` command directly (once the `CMake`
bin directory has been added to the `PATH` environment variable).

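The CMake-based `build.sh` itself is not reproduced here, but a minimal sketch might look like the following. The CMake path matches the `CMAKE_DIR` bound in `submit-cmake-build.slurm` (with `--cleanenv` the variable itself is not visible inside the container, so the path is written out explicitly), and the compiler choices and build directory are placeholders to adapt to your own project.

```bash
#!/bin/bash
# Hypothetical sketch of a CMake-driven build.sh -- adapt paths and options to your project.

# Make the bound CMake installation visible inside the container.
export PATH=/work/y07/shared/utils/core/cmake/3.29.4/bin:${PATH}

# Configure and build, assuming the Cray compiler wrappers (cc/ftn) are available
# inside the containerised CPE.
cmake -B build -DCMAKE_C_COMPILER=cc -DCMAKE_Fortran_COMPILER=ftn
cmake --build build
```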
+
+### Containerised ROCm
+
+ROCm is AMD's software support for GPU programming; ROCm 5.2.3 is currently installed on ARCHER2.
+Newer versions of ROCm can be accessed via the containerised CPE modules. For example, `ccpe/23.12/rocm/5.6.0` provides access to ROCm 5.6.0 (with CPE 23.12).
+In this way, ARCHER2 users can test more up-to-date ROCm compilers that target the AMD MI210 GPU platform, e.g. `amdclang`, `amdclang++`, `amdflang`.
+The same applies to ROCm-integrated software frameworks such as PyTorch.
+
+We'll now present a scenario showing how one can make use of the `ccpe/23.12/rocm/5.6.0` module to train a neural network using
+Python code that requires PyTorch 2.2.0. This is of interest since the version of ROCm directly installed on ARCHER2, 5.2.3, limits users
+to versions of PyTorch no newer than 1.13.1.
+
+!!! note
+    An overview of the differences between PyTorch versions 2.2.0 and 1.13.1 can be found in the [official PyTorch release notes](https://github.com/pytorch/pytorch/releases?page=2).
+
+We first set up a local custom Python environment from within the container, such that the environment's package files are written to the
+host ARCHER2 `/work` file system. We'll then install the PyTorch 2.2.0 packages to this custom environment.
+
+=== "submit-rocm-build.slurm"
+    ```slurm
+    #!/bin/bash
+
+    #SBATCH --job-name=ccpe-rocm-build
+    #SBATCH --ntasks=8
+    #SBATCH --time=00:10:00
+    #SBATCH --account=<budget code>
+    #SBATCH --partition=serial
+    #SBATCH --qos=serial
+    #SBATCH --export=none
+
+    export OMP_NUM_THREADS=1
+
+    module use /work/y07/shared/archer2-lmod/others/dev
+    module load ccpe/23.12/rocm/5.6.0
+
+    singularity exec --cleanenv \
+        --bind ${CCPE_BIND_ARGS},${PWD} --env LD_LIBRARY_PATH=${CCPE_LD_LIBRARY_PATH} \
+        ${CCPE_IMAGE_FILE} \
+        ${CCPE_ROCM_BUILDER} ${PWD} mypyenv pip-install.sh
+    ```
+=== "pip-install.sh"
+    ```bash
+    #!/bin/bash
+
+    pip install --user --upgrade pip scipy
+
+    pip install --user torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 torchtext==0.17.0 --index-url https://download.pytorch.org/whl/rocm5.6
+
+    pip install --user torchopt matplotlib
+
+    # downgrade numpy since the 2.2.0 torch modules were compiled with numpy 1.x
+    pip install --user "numpy<2"
+    ```
+
+The `CCPE` environment variables shown above (e.g., `CCPE_ROCM_BUILDER` and `CCPE_IMAGE_FILE`) are set by the loading of the `ccpe/23.12/rocm/5.6.0` module.
+The `CCPE_ROCM_BUILDER` variable holds the path to the script that prepares the containerised environment prior to the installation of the various Python packages
+listed in `pip-install.sh`. You can run `cat ${CCPE_ROCM_BUILDER}` (after loading the `ccpe/23.12/rocm/5.6.0` module) to take a closer look at what is going on.
+
+Run `sbatch submit-rocm-build.slurm` to establish the containerised Python environment. This should take 3-4 minutes to complete.

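As an optional sanity check (not part of the original recipe), you could append a line such as the one below to `pip-install.sh` to confirm that the intended PyTorch build has been picked up; `torch.version.hip` reports the ROCm/HIP version the wheel was built against.

```bash
# Assumed check: print the installed PyTorch version and its ROCm/HIP build string
python3 -c "import torch; print(torch.__version__, torch.version.hip)"
```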
+
+We're now ready to run some Python code that makes use of the `torch.func` API, introduced in PyTorch 2.0.0. This API enables the development of
+purely functional (stateless) neural network models. The code example below, developed by Mario Dagreda, trains a Physics Informed Neural Network (PINN) to solve a one-dimensional wave equation
+using `torch.func` and `torchopt` (a functional NN optimiser). Please clone [Mario's basic-pinn](https://github.com/madagra/basic-pinn.git) repository to obtain the code.
+
+```bash
+git clone https://github.com/madagra/basic-pinn.git
+```
+
+!!! note
+    Mario Dagreda has also published two articles on Medium relevant to the example described here, [Introduction to PINNs](https://medium.com/data-science/solving-differential-equations-with-neural-networks-afdcf7b8bcc4) and [A Primer on Functional PyTorch](https://medium.com/data-science/introduction-to-functional-pytorch-b5bf739e1e6e).
+
+You will see that the code repo you've just cloned targets the CPU, and so we'll need to change the code to ensure that the training and evaluation of the wave equation
+is indeed done on the GPU. Basically, this requires us to call the `.to(DEVICE)` method (with `DEVICE` set to `cuda`) so that the PINN model is moved to the GPU. The same is true
+for the input and evaluation data. In addition, we need to ensure that the model output is transferred back to the CPU so that it can be plotted: this is done using the `cpu()` method.
+
+The two source files that need to be changed are located in the repository file tree at `./basic-pinn/basic_pinn`, see below for details.
+Any code that does not need to change is indicated by an ellipsis (`...`).
+
+=== "wave_equation_1d.py"
+    ```python
+
+    ...
+
+    if __name__ == "__main__":
+
+        ### Add code to initialise DEVICE ###
+        DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+        print(f"DEVICE: {DEVICE}")
+
+        ...
+
+        def domain_sampler() -> tuple[torch.Tensor, torch.Tensor]:
+            ### Append ".to(DEVICE)" to line below ###
+            x = torch.FloatTensor(config.batch_size).uniform_(domain_x[0], domain_x[1]).to(DEVICE)
+            ### Append ".to(DEVICE)" to the FloatTensor constructor in line below ###
+            t, _ = torch.sort(torch.FloatTensor(config.batch_size).uniform_(domain_t[0], domain_t[1]).to(DEVICE))
+            t_and_x = torch.cartesian_prod(t, x)
+            return t_and_x[:, 0], t_and_x[:, 1]
+
+        # MLP model
+        ### Append ".to(DEVICE)" to line below ###
+        model = LinearNN(num_layers=config.num_hidden, num_neurons=config.dim_hidden, num_inputs=2).to(DEVICE)
+
+        ...
+
+        ### Append ".to(DEVICE)" to line below ###
+        x_eval = torch.arange(domain_x[0], domain_x[1], 0.01).to(DEVICE)
+        ### Append ".to(DEVICE)" to line below ###
+        t_eval = torch.arange(domain_t[0], domain_t[1], 0.1).to(DEVICE)
+
+        _, ani = animate_2d_solution(x_eval, t_eval, opt_params, f, show=True)
+
+        ani.save("wave_equation_1d.gif", writer="pillow")
+    ```
+=== "plotting.py"
+    ```python
+
+    ...
+
+    def animate_2d_solution(
+        x_eval: Tensor,
+        t_eval: Tensor,
+        opt_params: tuple,
+        fn: Callable,
+        show: bool = True
+    ) -> tuple[Figure, FuncAnimation]:
+        """
+        Animate the solution of a 2-dimension problem in time and space
+        ...
+        """
+
+        ...
+
+        def init() -> tuple:
+            ax.set_xlim(x_eval[0].item(), x_eval[-1].item())
+            ### Replace "detach()" with "cpu().detach()" in line below ###
+            y_values = [fn(x_eval, t * torch.ones_like(x_eval), params=opt_params).cpu().detach().numpy() for t in t_eval]
+            ax.set_ylim(min(map(min, y_values)), max(map(max, y_values)))
+            return line,
+
+        def animate(frame: int) -> tuple:
+            t = t_eval[frame]
+            y = fn(x_eval, t * torch.ones_like(x_eval), params=opt_params)
+            ### Replace "detach()" with "cpu().detach()" in line below ###
+            line.set_data(x_eval.cpu().detach().numpy(), y.cpu().detach().numpy())
+            return line,
+
+        ...
+    ```
+
+Once you've completed the code edits, you can submit the Slurm script below to initiate the training of the PINN on a GPU.
+
+=== "submit-rocm-run.slurm"
+    ```slurm
+    #!/bin/bash
+
+    #SBATCH --job-name=pinn-wave-eqn
+    #SBATCH --nodes=1
+    #SBATCH --gpus=1
+    #SBATCH --time=00:10:00
+    #SBATCH --account=<budget code>
+    #SBATCH --partition=gpu
+    #SBATCH --qos=gpu-shd
+    #SBATCH --export=none
+
+    export OMP_NUM_THREADS=1
+
+    module use /work/y07/shared/archer2-lmod/others/dev
+    module load ccpe/23.12/rocm/5.6.0
+
+    singularity exec \
+        --bind ${PWD},${CCPE_HOST_ROOT} \
+        --env LD_LIBRARY_PATH=${CCPE_LD_LIBRARY_PATH} \
+        ${CCPE_IMAGE_FILE} \
+        ${CCPE_ROCM_RUNNER} ${PWD} mypyenv \
+        ${PWD}/basic-pinn/basic_pinn/wave_equation_1d.py --batch-size 50 --learning-rate 0.0075 --num-epochs 1500
+    ```
+
+The `cpe_23.12-rocm_5.6.0.sif` container image file (referenced by `${CCPE_IMAGE_FILE}`) is instantiated on the GPU node where it runs the `${CCPE_ROCM_RUNNER}` script,
+which activates the containerised custom Python environment before executing the `wave_equation_1d.py` code (courtesy of [Mario Dagreda](https://github.com/madagra/basic-pinn.git)).
+The run should take 2-3 minutes.
+
+The output is a GIF animation (`wave_equation_1d.gif`) that shows an oscillating wave as inferred from the trained PINN.

docs/user-guide/dev-environment.md

Lines changed: 1 addition & 1 deletion
@@ -630,7 +630,7 @@ repository](https://github.com/PE-Cray).
Later PE releases may sometimes be available via a containerised form. This allows developers to check that their code compiles and runs
using CPE releases that have not yet been installed on ARCHER2.

-CPE 23.12 is currently available as a Singularity container, see [Using Containerised HPE Cray Programming Environments](containers.md/#using-containerised-hpe-cray-programming-environments) for further details.
+CPE 25.03 is currently available as a Singularity container, see [Using Containerised HPE Cray Programming Environments](containers.md/#using-containerised-hpe-cray-programming-environments) for further details.

### Switching to a different HPE Cray Programming Environment (CPE) release

docs/user-guide/gpu.md

Lines changed: 3 additions & 0 deletions
@@ -114,6 +114,9 @@ HIPIFY (`hipify-clang` or `hipify-perl` command), which enables
translation of CUDA to HIP code. See also the [section below on
HIPIFY](#hipify).

+!!! note
+    ARCHER2 currently provides access to a legacy version of ROCm, `rocm/5.2.3`. However, it is now possible to use a more recent version via a containerised HPE Cray Programming Environment module, `ccpe/23.12/rocm/5.6.0`; see [Containerised ROCm](containers.md/#containerised-rocm) for more details.
+

### GPU target

docs/user-guide/machine-learning.md

Lines changed: 3 additions & 0 deletions
@@ -28,6 +28,9 @@ A binary install of PyTorch 1.13.1 suitable for ROCm 5.2.3 has been installed ac

This install can be accessed by loading the `pytorch/1.13.1-gpu` module.

+!!! note
+    For GPU work, ARCHER2 currently provides access to a legacy version of [ROCm](gpu.md#rocm), `rocm/5.2.3`. This means that users cannot run a version of PyTorch more recent than 1.13.1 on the GPU nodes. However, it is possible to run PyTorch 2.2.0 via a containerised HPE Cray Programming Environment module that features ROCm 5.6.0; see [Containerised ROCm](containers.md/#containerised-rocm) for details.
+
As DeepCam is an [MLPerf](https://ieeexplore.ieee.org/document/9238612) benchmark, you may wish to base a local python environment on `pytorch/1.13.1-gpu`
so that you have the opportunity to install additional python packages that support MLPerf logging, as well as extra features pertinent to DeepCam (e.g., dynamic learning rates).

docs/user-guide/python.md

Lines changed: 2 additions & 0 deletions
@@ -137,6 +137,8 @@ ensuring that the Python packages will be gathered from the local virtual enviro
The `extend-venv-activate` command becomes available (i.e., its location is placed on the path) only when the ML module is loaded.
The ML modules are themselves based on `cray-python`. For example, `tensorflow/2.12.0` is based on the `cray-python/3.9.13.1` module.

+Further info about running ML frameworks on ARCHER2 can be found on the [Machine Learning page](machine-learning.md).
+
## Conda on ARCHER2

Conda-based Python distributions (e.g. Anaconda, Mamba, Miniconda) are an extremely popular way of installing and
