
Using MPICH on Aurora@ALCF

Rob Latham edited this page Jun 24, 2025 · 17 revisions

This page describes how to build and use MPICH on the 'Aurora' machine at Argonne. Aurora uses Intel CPUs and GPUs with the Cray Slingshot interconnect. Support for Slingshot is provided via the system libfabric.

Prerequisite

As of 03/01/2024, it is best to build the MPICH git main branch for use on Aurora.

Build MPICH

Build ZE-enabled MPICH for use with the Cray PALS launcher

Since you are on Aurora, you presumably want DAOS support as well, so the configure line below enables it.

./autogen.sh
./configure --prefix=/home/raffenet/proj/mpich/i \
    --with-device=ch4:ofi --with-libfabric=$(pkg-config --variable=prefix libfabric) \
    --with-ze --enable-ze-native=pvc \
    --with-pm=no --with-pmi=pmix --with-pmix=/usr \
    --with-file-system=ufs+lustre+daos \
    --with-daos=/usr \
    --with-cart=/usr \
    CC=icx CXX=icpx FC=ifx
    
make -j16 install

A correctly configured MPICH build should print the following message in configure output.

*****************************************************
***
*** device      : ch4:ofi
*** shm feature : auto
*** gpu support : ZE
***
*****************************************************
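One way to check for this summary without scrolling back through the output is to save the configure output to a file and search it. The following is a minimal sketch, assuming the output was captured in a file named configure.out (for example with `./configure ... 2>&1 | tee configure.out`); the file name and the helper function name are hypothetical and not part of MPICH.

```shell
# Return success only if the saved configure output reports both the
# ch4:ofi device and ZE GPU support. The function name is illustrative.
check_mpich_summary() {
    log=$1
    grep -q 'device *: ch4:ofi' "$log" && grep -q 'gpu support *: ZE' "$log"
}

# configure.out is an assumed file name; adjust to wherever the output was saved.
if [ -f configure.out ]; then
    if check_mpich_summary configure.out; then
        echo "configure summary looks correct"
    else
        echo "configure summary is missing ch4:ofi or ZE support"
    fi
fi
```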

Running MPI Applications

On Aurora, MPICH is configured without a process manager (--with-pm=no). Instead, use the mpiexec provided by Cray's PALS package. Setting PALS_PMI=pmix in the execution environment is required so that processes can query the runtime for job information. When launching all processes on a single host, you must also set MPIR_CVAR_SINGLE_HOST_ENABLED=0 in the environment.
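The environment setup described above might look like the following in a job script. This is a sketch under stated assumptions: the node count, ranks per node, and application name (./my_mpi_app) are hypothetical, and the -n and --ppn options are assumed to be accepted by the PALS mpiexec on your system. The launch itself is guarded so the snippet only prints the intended command when mpiexec or the binary is unavailable.

```shell
# Required so processes can query the runtime for job information.
export PALS_PMI=pmix

# Hypothetical job shape and binary, for illustration only.
NNODES=1
RANKS_PER_NODE=4
APP=./my_mpi_app

# Single-host launches additionally need this MPICH control variable.
if [ "$NNODES" -eq 1 ]; then
    export MPIR_CVAR_SINGLE_HOST_ENABLED=0
fi

NRANKS=$((NNODES * RANKS_PER_NODE))

# Launch only if the PALS mpiexec and the application are actually present.
if command -v mpiexec >/dev/null 2>&1 && [ -x "$APP" ]; then
    mpiexec -n "$NRANKS" --ppn "$RANKS_PER_NODE" "$APP"
else
    echo "would run: mpiexec -n $NRANKS --ppn $RANKS_PER_NODE $APP"
fi
```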

Debugging

  • gdb: on Aurora the debugger is installed as gdb-oneapi, not gdb.
  • gdb4hpc: a fairly nice parallel wrapper around gdb, though it will probably time out if you have more than a few dozen processes.

Common Issues

  • It's possible (maybe only with large counts) to trigger an unaligned memory error in the ZE code:
.../modules/yaksa/src/backend/ze/hooks/yaksuri_zei_type_hooks.c:101:30: runtime error: load of misaligned address 0xfbc50007672a3559 for type 'struct _ze_module_handle_t *', which requires 8 byte alignment
0xfbc50007672a3559: note: pointer points here
<memory cannot be printed>
Segmentation fault

If you don't need GPU processing, you can work around this by configuring with --without-ze.

Building libfabric for use on Aurora

The CXI provider is open source, but development happens in more than one repository. As of April 2025, libfabric has been successfully built and tested on Aurora using the v1.22.x-ss branch from the https://github.com/HewlettPackard/shs-libfabric repository.

NOTE: You must either unload the system libfabric module or prepend your installation to LD_LIBRARY_PATH to ensure the correct library is linked. We have seen cases where module unload libfabric does not actually modify LD_LIBRARY_PATH, so care is needed to ensure you link the correct library at runtime.

module unload libfabric
git clone -b v1.22.x-ss https://github.com/HewlettPackard/shs-libfabric
cd shs-libfabric
# a git checkout has no configure script yet; generate it first
./autogen.sh
# disable verbs and efa to avoid picking up unnecessary dependencies
./configure --enable-cxi --disable-verbs --disable-efa --with-ze --prefix=<path/to/install>
make -j16 install
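
Because module unload libfabric may leave LD_LIBRARY_PATH untouched, it can help to prepend the new installation explicitly and confirm what will be resolved at runtime. The sketch below assumes the library was installed under a lib subdirectory of the prefix; the prefix path and the application name are hypothetical placeholders.

```shell
# Hypothetical install prefix; substitute the --prefix passed to configure.
LIBFABRIC_PREFIX="$HOME/install/libfabric"

# Put the custom build ahead of any system copy, preserving existing entries.
export LD_LIBRARY_PATH="$LIBFABRIC_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# The first entry is searched first by the dynamic loader.
first_entry=${LD_LIBRARY_PATH%%:*}
echo "first LD_LIBRARY_PATH entry: $first_entry"

# If an MPI binary is available (hypothetical name), show which libfabric it resolves.
APP=./my_mpi_app
if [ -x "$APP" ]; then
    ldd "$APP" | grep libfabric || echo "libfabric not found in ldd output"
fi
```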