The `submit-cmake-build.slurm` script shows how the `--bind` option can be used to make the `CMake` installation on ARCHER2
accessible from within the container. The `build.sh` script can then call the `cmake` command directly (once the `CMake`
bin directory has been added to the `PATH` environment variable).
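
As a rough sketch of how this can look (the CMake install path below is a placeholder rather than the actual ARCHER2 location, and `${CCPE_IMAGE_FILE}` is assumed to point at the containerised CPE image), the bind and `PATH` setup might be:

```bash
# Hypothetical host path for the CMake install; substitute the real location on ARCHER2.
CMAKE_HOST_DIR=/work/y07/shared/utils/core/cmake/3.21.3

# Bind the host CMake directory into the container, add its bin directory to PATH,
# and check that the cmake command is now visible inside the container.
singularity exec --bind ${CMAKE_HOST_DIR}:${CMAKE_HOST_DIR} ${CCPE_IMAGE_FILE} \
    bash -c "export PATH=${CMAKE_HOST_DIR}/bin:\${PATH} && cmake --version"
```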

### Containerised ROCm

ROCm is AMD's software support for GPU programming; ROCm 5.2.3 is currently installed on ARCHER2.
Newer versions of ROCm can be accessed via the containerised CPE modules. For example, `ccpe/23.12/rocm/5.6.0` provides access to ROCm 5.6.0 (with CPE 23.12).
In this way, ARCHER2 users can test more up-to-date ROCm compilers that target the AMD MI210 GPU platform, e.g. `amdclang`, `amdclang++`, `amdflang`.
The same applies to ROCm-integrated software frameworks such as PyTorch.
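
As an illustrative sketch of what this enables (assuming the ROCm compilers sit on the default `PATH` inside the image; the module location is the one used later in this section), you could query the containerised compiler versions directly:

```bash
module use /work/y07/shared/archer2-lmod/others/dev   # location of the containerised CPE modules
module load ccpe/23.12/rocm/5.6.0                     # defines CCPE_IMAGE_FILE, among other variables

# Query the ROCm 5.6.0 compilers shipped inside the container image.
singularity exec ${CCPE_IMAGE_FILE} amdclang --version
singularity exec ${CCPE_IMAGE_FILE} amdclang++ --version
```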

We'll now present a scenario showing how one can make use of the `ccpe/23.12/rocm/5.6.0` module to train a neural network using
Python code that requires PyTorch 2.2.0. This is of interest since the version of ROCm directly installed on ARCHER2, 5.2.3, limits users
to versions of PyTorch no newer than 1.13.1.

!!! note
    An overview of the differences between PyTorch versions 2.2.0 and 1.13.1 can be found in the [official PyTorch release notes](https://github.com/pytorch/pytorch/releases?page=2).

We first set up a local custom Python environment from within the container, such that the environment's package files are written to the
host ARCHER2 `/work` file system. We'll then install the PyTorch 2.2.0 packages into this custom environment.

=== "submit-rocm-build.slurm"
    ```slurm
    #!/bin/bash

    #SBATCH --job-name=ccpe-rocm-build
    #SBATCH --ntasks=8
    #SBATCH --time=00:10:00
    #SBATCH --account=<budget code>
    #SBATCH --partition=serial
    #SBATCH --qos=serial
    #SBATCH --export=none

    export OMP_NUM_THREADS=1

    module use /work/y07/shared/archer2-lmod/others/dev

    ...

    # downgrade numpy since the 2.2.0 torch modules were compiled with numpy 1.x
    pip install --user "numpy<2"
    ```

The `CCPE` environment variables shown above (e.g., `CCPE_ROCM_BUILDER` and `CCPE_IMAGE_FILE`) are set when the `ccpe/23.12/rocm/5.6.0` module is loaded.
The `CCPE_ROCM_BUILDER` variable holds the path to the script that prepares the containerised environment prior to the installation of the various Python packages
listed in `pip-install.sh`. You can run `cat ${CCPE_ROCM_BUILDER}` (after loading the `ccpe/23.12/rocm/5.6.0` module) to take a closer look at what is going on.
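
For example, the following login-node commands list the `CCPE` variables defined by the module and display the builder script (variable names other than the two mentioned above may differ):

```bash
module use /work/y07/shared/archer2-lmod/others/dev
module load ccpe/23.12/rocm/5.6.0

env | grep "^CCPE_"        # list the CCPE environment variables set by the module
cat ${CCPE_ROCM_BUILDER}   # inspect the script that prepares the containerised environment
```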

Run `sbatch submit-rocm-build.slurm` to establish the containerised Python environment. This should take 3-4 minutes to complete.
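
Submission and monitoring follow the usual ARCHER2 Slurm workflow, for example:

```bash
sbatch submit-rocm-build.slurm   # submit the build job to the serial partition
squeue -u $USER                  # monitor progress; job output is written to a slurm-<jobid>.out file
```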

We're now ready to run some Python code that makes use of the `torch.func` API, introduced in PyTorch 2.0.0. This API enables the development of
purely functional (stateless) neural network models. The code example below, developed by Mario Dagreda, trains a Physics-Informed Neural Network (PINN) to solve a one-dimensional wave equation
using `torch.func` and `torchopt` (a functional NN optimiser). Please clone [Mario's basic-pinn](https://github.com/madagra/basic-pinn.git) repository to obtain the code.
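
For example, cloning into your directory on `/work` (the path below follows the usual ARCHER2 layout and should be adjusted to your own project and username):

```bash
cd /work/<project code>/<project code>/<username>
git clone https://github.com/madagra/basic-pinn.git
```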

Mario Dagreda has also published two articles on Medium relevant to the example described here: [Introduction to PINNs](https://medium.com/data-science/solving-differential-equations-with-neural-networks-afdcf7b8bcc4) and [A Primer on Functional PyTorch](https://medium.com/data-science/introduction-to-functional-pytorch-b5bf739e1e6e).

You will see that the code repo you've just cloned targets the CPU and so we'll need to change the code to ensure that the training and evaluation of the wave equation
is indeed done on the GPU. Essentially, this requires us to call the `to(DEVICE)` method so that the PINN model is moved to the GPU. The same is true
for the input and evaluation data. In addition, we need to ensure that the model output is transferred back to the CPU so that it can be plotted: this is done using the `cpu()` method.

The two source files that need to be changed are located in the repository file tree at `./basic-pinn/basic_pinn`; see below for details.
Any code that does not need to change is indicated by an ellipsis (`...`).

=== "wave_equation_1d.py"
    ```python

    ...

    if __name__ == "__main__":

        ### Add code to initialise DEVICE ###
        DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        ...
    ```

The `cpe_23.12-rocm_5.6.0.sif` container image file (referenced by `${CCPE_IMAGE_FILE}`) is instantiated on the GPU node, where it runs the `${CCPE_ROCM_RUNNER}` script,
which activates the containerised custom Python environment before executing the `wave_equation_1d.py` code (courtesy of [Mario Dagreda](https://github.com/madagra/basic-pinn.git)).
The run should take 2-3 minutes.

The output is a GIF animation (`wave_equation_1d.gif`) that shows an oscillating wave as inferred from the trained PINN.

Later PE releases may sometimes be available in containerised form. This allows developers to check that their code compiles and runs
using CPE releases that have not yet been installed on ARCHER2.
CPE 25.03 is currently available as a Singularity container, see [Using Containerised HPE Cray Programming Environments](containers.md/#using-containerised-hpe-cray-programming-environments) for further details.
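
A quick way to see which containerised CPE releases are available is to interrogate the module system; this sketch assumes the `ccpe` modules are published under the same dev module tree used in the containers documentation:

```bash
module use /work/y07/shared/archer2-lmod/others/dev   # assumed location of the ccpe modules
module avail ccpe                                     # list the containerised CPE releases
```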

### Switching to a different HPE Cray Programming Environment (CPE) release

`docs/user-guide/gpu.md`

... HIPIFY (`hipify-clang` or `hipify-perl` command), which enables
translation of CUDA to HIP code. See also the [section below on
HIPIFY](#hipify).
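
As a minimal illustration (the CUDA source file name is a placeholder), `hipify-perl` writes the HIP translation of a CUDA source file to standard output:

```bash
# Translate a CUDA source file (saxpy.cu is a hypothetical example) into HIP.
hipify-perl saxpy.cu > saxpy_hip.cpp
```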

!!! note
    ARCHER2 currently provides access to a legacy version of ROCm, `rocm/5.2.3`. However, it is now possible to use a more recent version via a containerised HPE Cray Programming Environment module, `ccpe/23.12/rocm/5.6.0`, see [Containerised ROCm](containers.md/#containerised-rocm) for more details.

`docs/user-guide/machine-learning.md`

A binary install of PyTorch 1.13.1 suitable for ROCm 5.2.3 has been installed ...

This install can be accessed by loading the `pytorch/1.13.1-gpu` module.

!!! note
    For GPU, ARCHER2 currently provides access to a legacy version of [ROCm](gpu.md#rocm), `rocm/5.2.3`. This means that users cannot run a version of PyTorch more recent than 1.13.1 on the GPU. However, it is possible to run PyTorch 2.2.0 via a containerised HPE Cray Programming Environment module, one that features ROCm 5.6.0, see [Containerised ROCm](containers.md/#containerised-rocm) for details.
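
For instance, a quick check that the module supplies the expected PyTorch build (assuming `python` resolves to the module's Python) might look like this:

```bash
module load pytorch/1.13.1-gpu

# Confirm the PyTorch version provided by the module.
python -c "import torch; print(torch.__version__)"
```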

As DeepCam is an [MLPerf](https://ieeexplore.ieee.org/document/9238612) benchmark, you may wish to base a local python environment on `pytorch/1.13.1-gpu`
so that you have the opportunity to install additional python packages that support MLPerf logging, as well as extra features pertinent to DeepCam (e.g., dynamic learning rates).