
Using MPICH on Crusher@OLCF

Yanfei Guo edited this page Sep 28, 2022 · 5 revisions

This page describes how to build and use MPICH on the 'Crusher' machine at Oak Ridge. Crusher is an AMD CPU/GPU machine with a Slingshot interconnect. The libfabric-based device (ch4:ofi) works best here. Performance is on par with Cray MPI on Crusher.

Prerequisites

Building MPICH with GPU support on Crusher requires the following tools (default versions on Crusher as of 09/27/2022 in parentheses).

  • gcc (gcc/7.5.0)
  • ROCm (rocm/5.1.0)

Build MPICH

Build ROCm-enabled MPICH with Cray PMI (srun/PALS)

module load rocm 
./configure --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.0.0 --with-pmi=pmi2 --with-pmilib=cray --with-craypmi=/opt/cray/pe/pmi/default \
  --with-hip=$ROCM_PATH/hip 
make -j 8
make install

# $ROCM_PATH is set by the rocm module.

Build ROCm-enabled MPICH with hydra

module load rocm
./configure --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.0.0 \
  --with-hip=$ROCM_PATH/hip 
make -j 8
make install

# $ROCM_PATH is set by the rocm module.
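
Both configure lines above depend on `$ROCM_PATH`, so it can help to fail early when the rocm module has not been loaded. The following is a minimal sketch; the helper name `require_rocm_path` is hypothetical and not part of MPICH or the Crusher environment.

```shell
# Hypothetical helper: stop before configure if ROCM_PATH is unusable.
require_rocm_path() {
    # ROCM_PATH is normally exported by "module load rocm".
    if [ -z "${ROCM_PATH:-}" ]; then
        echo "ROCM_PATH is not set; run 'module load rocm' first" >&2
        return 1
    fi
    # The configure lines pass $ROCM_PATH/hip to --with-hip.
    if [ ! -d "$ROCM_PATH/hip" ]; then
        echo "no hip directory under $ROCM_PATH" >&2
        return 1
    fi
    echo "using HIP from $ROCM_PATH/hip"
}
```

One way to use it is to chain it in front of configure, for example `require_rocm_path && ./configure ...` with the options shown above.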

A correctly configured MPICH build should print the following message in the configure output.

*****************************************************
***
*** device      : ch4:ofi
*** shm feature : auto
*** gpu support : HIP
***
*****************************************************
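
The summary above can also be checked mechanically. The sketch below assumes the configure output was saved to a file (for example with `./configure ... 2>&1 | tee configure.out`; the file name and the helper name `check_mpich_config` are hypothetical).

```shell
# Hypothetical helper: verify the device and GPU lines of a saved configure log.
check_mpich_config() {
    log="$1"
    # Expect the "device : ch4:ofi" line from the summary banner.
    grep -Eq 'device[[:space:]]*:[[:space:]]*ch4:ofi' "$log" || {
        echo "device is not ch4:ofi" >&2
        return 1
    }
    # Expect the "gpu support : HIP" line from the summary banner.
    grep -Eq 'gpu support[[:space:]]*:[[:space:]]*HIP' "$log" || {
        echo "HIP GPU support not enabled" >&2
        return 1
    }
    echo "configure summary looks correct"
}
```

Running `check_mpich_config configure.out` before `make` avoids spending build time on a configuration without HIP support.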

Running MPI Applications

Running MPI with srun

module load rocm
export MPIR_CVAR_CH4_OFI_ENABLE_MULTI_NIC_STRIPING=0
# Launch two ranks, each on a separate node with its own GPU
srun -n2 --ntasks-per-node=1 --gpus-per-node=1 --gpu-bind=closest \
    ./test/mpi/pt2pt/pingping \
    -type=MPI_INT -sendcnt=512 -recvcnt=1024 -seed=78 -testsize=4  -sendmem=device -recvmem=device

For more srun options, see the "Running Jobs" section of the Crusher User Guide.

Running MPI with hydra

module load rocm

mpiexec -np 2 -ppn 1 -gpus-per-proc=1 \
    -genv MPIR_CVAR_CH4_OFI_ENABLE_MULTI_NIC_STRIPING=0 \
    ./test/mpi/pt2pt/pingping \
    -type=MPI_INT -sendcnt=512 -recvcnt=1024 -seed=78 -testsize=4  -sendmem=device -recvmem=device

Common Issues

  1. "key [-NONEXIST-KEY] was not found" message. Error messages like the following are common and expected.
Wed Sep 28 12:01:00 2022: [PE_0]:_pmi2_kvs_get:key [-NONEXIST-KEY] was not found.
Wed Sep 28 12:01:00 2022: [PE_0]:PMI2_KVS_Get:_pmi2_kvs_get failed
Wed Sep 28 12:01:00 2022: [PE_1]:_pmi2_kvs_get:key [-NONEXIST-KEY] was not found.
Wed Sep 28 12:01:00 2022: [PE_1]:PMI2_KVS_Get:_pmi2_kvs_get failed
Wed Sep 28 12:01:00 2022: [PE_0]:_pmi2_kvs_get:key [-NONEXIST-KEY] was not found.
Wed Sep 28 12:01:00 2022: [PE_0]:PMI2_KVS_Get:_pmi2_kvs_get failed
Wed Sep 28 12:01:00 2022: [PE_1]:_pmi2_kvs_get:key [-NONEXIST-KEY] was not found.
Wed Sep 28 12:01:00 2022: [PE_1]:PMI2_KVS_Get:_pmi2_kvs_get failed
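
Since these messages are expected, they can be filtered out of a job's output to make real errors easier to spot. This is a minimal sketch, not an MPICH or Cray PMI feature: the function name `filter_pmi_noise` is hypothetical, and it assumes the messages keep exactly the wording shown above.

```shell
# Hypothetical filter: drop the expected Cray PMI2 "key not found" lines.
filter_pmi_noise() {
    # -F matches the strings literally, so the brackets need no escaping.
    # grep exits with 1 when every line was filtered; treat that as success.
    grep -v -F \
        -e '_pmi2_kvs_get:key [-NONEXIST-KEY] was not found' \
        -e 'PMI2_KVS_Get:_pmi2_kvs_get failed' || [ "$?" -eq 1 ]
}
```

A possible use is `srun ... 2>&1 | filter_pmi_noise`, which merges both output streams because it is not stated here which one the messages are written to. Note that the pipeline then reports the filter's exit status rather than srun's unless `set -o pipefail` is enabled in a shell that supports it.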