Using MPICH on Aurora@ALCF
This page describes how to build and use MPICH on the 'Aurora' machine at Argonne. Aurora uses Intel CPUs and GPUs with the Cray Slingshot interconnect. Support for Slingshot is provided via the system libfabric.
As of 03/01/2024, it is best to build the MPICH git main branch for use on Aurora.
./autogen.sh
./configure --prefix=/home/raffenet/proj/mpich/i \
--with-device=ch4:ofi --with-libfabric=$(pkg-config --variable=prefix libfabric) \
--with-ze --enable-ze-native=pvc \
--with-pm=no --with-pmi=pmix --with-pmix=/usr \
--with-file-system=daos+lustre+ufs \
--with-daos=/usr \
--with-cart=/usr \
CC=icx CXX=icpx FC=ifx
make -j16 install
A correctly configured MPICH build should print the following message in configure output.
*****************************************************
***
*** device : ch4:ofi
*** shm feature : auto
*** gpu support : ZE
***
*****************************************************
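The check above can be scripted if you save the configure output to a file. The following is a minimal sketch, not part of MPICH: check_mpich_summary is a hypothetical helper name, and configure.out is an assumed file that you create yourself (for example with tee). It only looks for the device and gpu support lines shown in the banner.

```shell
# Sketch: confirm the configure summary reports the expected device and GPU support.
# check_mpich_summary is a hypothetical helper; $1 is a saved copy of the configure output.
check_mpich_summary() {
    log="$1"
    # The device line must report ch4:ofi.
    grep -Eq '^\*\*\*[[:space:]]+device[[:space:]]*:[[:space:]]*ch4:ofi' "$log" \
        || { echo "device is not ch4:ofi"; return 1; }
    # The gpu support line must report ZE.
    grep -Eq '^\*\*\*[[:space:]]+gpu support[[:space:]]*:[[:space:]]*ZE' "$log" \
        || { echo "gpu support is not ZE"; return 1; }
    echo "configure summary OK"
}

# Example usage (configure.out is an assumed file name):
#   ./configure <options> 2>&1 | tee configure.out
#   check_mpich_summary configure.out
```

If the helper reports that gpu support is not ZE, the usual suspects are a missing --with-ze flag or a Level Zero installation that configure could not find.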
On Aurora, MPICH is configured without a process manager (--with-pm=no). Instead, we can use the mpiexec that is available as part of Cray's PALS package. Setting PALS_PMI=pmix in the execution environment is required for processes to properly query the runtime for job information. When launching processes on a single host, MPIR_CVAR_SINGLE_HOST_ENABLED=0 must also be set in the environment.
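The environment settings described above can be wrapped in a small shell function. This is a sketch under assumptions: setup_pals_env is a hypothetical helper name, the host count is passed in by hand rather than detected, and ./hello is a placeholder for your own MPI program.

```shell
# Sketch: export the variables described above before calling the PALS mpiexec.
# setup_pals_env is a hypothetical helper; $1 is the number of hosts in the job (default 1).
setup_pals_env() {
    nhosts="${1:-1}"
    # Needed so MPICH processes can query the runtime for job information.
    export PALS_PMI=pmix
    # Needed only when all processes are launched on a single host.
    if [ "$nhosts" -eq 1 ]; then
        export MPIR_CVAR_SINGLE_HOST_ENABLED=0
    fi
}

# Example usage (./hello is a placeholder program):
#   setup_pals_env 1
#   mpiexec -n 4 ./hello
```

The function leaves MPIR_CVAR_SINGLE_HOST_ENABLED untouched for multi-host jobs, so any value you set yourself is preserved.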
- gdb: on Aurora the debugger is not called gdb; it is gdb-oneapi
- gdb4hpc: a pretty nice parallel wrapper around gdb, though it will probably time out if you have more than a few dozen processes
- It's possible (maybe only with large counts) to trigger an unaligned memory error in the ZE code:
.../modules/yaksa/src/backend/ze/hooks/yaksuri_zei_type_hooks.c:101:30: runtime error: load of misaligned address 0xfbc50007672a3559 for type 'struct _ze_module_handle_t *', which requires 8 byte alignment
0xfbc50007672a3559: note: pointer points here
<memory cannot be printed>
Segmentation fault
If you don't need GPU processing, you can work around this by configuring with --without-ze.
The CXI provider is open source, but development happens in more than one repository. As of April 2025, libfabric has been successfully built and tested on Aurora using the v1.22.x-ss branch from the https://github.com/HewlettPackard/shs-libfabric repository.
NOTE: You must unload the system libfabric module or else prepend your installation to LD_LIBRARY_PATH to ensure the correct library is linked. We have seen issues where module unload libfabric does not actually modify LD_LIBRARY_PATH, so care is needed to ensure you link the correct library at runtime.
module unload libfabric
git clone -b v1.22.x-ss https://github.com/HewlettPackard/shs-libfabric
cd shs-libfabric
# a git checkout has no configure script until autogen.sh generates it
./autogen.sh
# disable verbs and efa to avoid picking up unnecessary dependencies
./configure --enable-cxi --disable-verbs --disable-efa --with-ze --prefix=<path/to/install>
make -j16 install
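Because module unload libfabric may leave LD_LIBRARY_PATH unchanged, it can help to prepend the custom installation explicitly and then check what the loader resolves. The sketch below is an illustration only: use_libfabric is a hypothetical helper name, it assumes the libraries were installed to the default lib directory under the --prefix given to configure, and the paths in the usage comment are placeholders.

```shell
# Sketch: put a custom libfabric ahead of the system one at runtime.
# use_libfabric is a hypothetical helper; $1 is the install prefix passed to --prefix.
use_libfabric() {
    # Assumes the default layout, with libraries under <prefix>/lib.
    libdir="$1/lib"
    # Prepend our directory, keeping any existing entries after it.
    export LD_LIBRARY_PATH="$libdir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Example usage (paths and ./hello are placeholders):
#   use_libfabric "$HOME/install/libfabric"
#   ldd ./hello | grep libfabric   # should point at the custom install
```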