Using MPICH on Crusher@OLCF
Yanfei Guo edited this page Sep 28, 2022
This page describes how to build and use MPICH on the 'Crusher' machine at Oak Ridge. Crusher is an AMD CPU/GPU machine with a Slingshot interconnect. The 'Libfabric' device (ch4:ofi) works best here. Performance is on par with Cray MPI on Crusher.
MPICH needs the following tools to build on Crusher with GPU support. The default versions on Crusher as of 09/27/2022 are shown in parentheses.
- gcc (gcc/7.5.0)
- ROCm (rocm/5.1.0)
To launch with srun, configure MPICH with Cray PMI:
module load rocm
./configure --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.0.0 --with-pmi=pmi2 --with-pmilib=cray --with-craypmi=/opt/cray/pe/pmi/default \
--with-hip=$ROCM_PATH/hip
make -j 8
make install
# $ROCM_PATH is set by the rocm module.
To launch with MPICH's own mpiexec, configure without the Cray PMI options:
module load rocm
./configure --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.0.0 \
--with-hip=$ROCM_PATH/hip
make -j 8
make install
# $ROCM_PATH is set by the rocm module.
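Both configure lines rely on `$ROCM_PATH` being set. A minimal sketch of a guard that could be run before `./configure`, assuming a POSIX shell; the function name `require_rocm_path` is hypothetical and not part of MPICH or the module system:

```shell
# Hypothetical helper: fail early if the rocm module has not set ROCM_PATH
# or if the HIP directory passed to --with-hip does not exist.
require_rocm_path() {
    if [ -z "${ROCM_PATH:-}" ]; then
        echo "ROCM_PATH is not set; run 'module load rocm' first" >&2
        return 1
    fi
    if [ ! -d "$ROCM_PATH/hip" ]; then
        echo "no hip directory under $ROCM_PATH" >&2
        return 1
    fi
    echo "using HIP from $ROCM_PATH/hip"
}
```

One possible use is `require_rocm_path && ./configure ...`, so that configure only starts when the path is usable.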
A correctly configured MPICH build should print the following message in the configure output.
*****************************************************
***
*** device : ch4:ofi
*** shm feature : auto
*** gpu support : HIP
***
*****************************************************
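The banner above can scroll past quickly in a long configure run. As a sketch, assuming the configure output was saved to a file (for example with `./configure ... 2>&1 | tee configure.log`), the GPU line can be checked afterwards. The function name `check_gpu_support` and the log file name are hypothetical:

```shell
# Hypothetical helper: report whether a saved configure log shows HIP GPU support.
# Argument: path to a file holding the captured configure output.
check_gpu_support() {
    log="$1"
    if grep -Eq 'gpu support[[:space:]]*:[[:space:]]*HIP' "$log"; then
        echo "HIP support enabled"
    else
        echo "HIP support not found in $log" >&2
        return 1
    fi
}
```

Calling `check_gpu_support configure.log` returns a non-zero status when the line is missing, which makes it usable in a build script.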
Run with srun:
module load rocm
export MPIR_CVAR_CH4_OFI_ENABLE_MULTI_NIC_STRIPING=0
# Launch two ranks, each on a separate node with its own GPU
srun -n2 --ntasks-per-node=1 --gpus-per-node=1 --gpu-bind=closest \
./test/mpi/pt2pt/pingping \
-type=MPI_INT -sendcnt=512 -recvcnt=1024 -seed=78 -testsize=4 -sendmem=device -recvmem=device
For more srun options, see the Running Jobs section of the Crusher User Guide.
Run with mpiexec:
module load rocm
mpiexec -np 2 -ppn 1 -gpus-per-proc=1 \
-genv MPIR_CVAR_CH4_OFI_ENABLE_MULTI_NIC_STRIPING=0 \
./test/mpi/pt2pt/pingping \
-type=MPI_INT -sendcnt=512 -recvcnt=1024 -seed=78 -testsize=4 -sendmem=device -recvmem=device
- "key [-NONEXIST-KEY] was not found" messages. Error messages like the following are common and expected.
Wed Sep 28 12:01:00 2022: [PE_0]:_pmi2_kvs_get:key [-NONEXIST-KEY] was not found.
Wed Sep 28 12:01:00 2022: [PE_0]:PMI2_KVS_Get:_pmi2_kvs_get failed
Wed Sep 28 12:01:00 2022: [PE_1]:_pmi2_kvs_get:key [-NONEXIST-KEY] was not found.
Wed Sep 28 12:01:00 2022: [PE_1]:PMI2_KVS_Get:_pmi2_kvs_get failed
Wed Sep 28 12:01:00 2022: [PE_0]:_pmi2_kvs_get:key [-NONEXIST-KEY] was not found.
Wed Sep 28 12:01:00 2022: [PE_0]:PMI2_KVS_Get:_pmi2_kvs_get failed
Wed Sep 28 12:01:00 2022: [PE_1]:_pmi2_kvs_get:key [-NONEXIST-KEY] was not found.
Wed Sep 28 12:01:00 2022: [PE_1]:PMI2_KVS_Get:_pmi2_kvs_get failed
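Since these messages are expected, they can be filtered out of the job output when they obscure the test results. A minimal sketch, assuming the two message texts shown above and a standard `grep`; the function name `filter_pmi_noise` is hypothetical:

```shell
# Hypothetical helper: drop the expected PMI2 key-lookup messages from a stream.
# -F matches the patterns as fixed strings, so the brackets need no escaping.
filter_pmi_noise() {
    grep -F -v \
        -e '_pmi2_kvs_get:key [-NONEXIST-KEY] was not found' \
        -e 'PMI2_KVS_Get:_pmi2_kvs_get failed' \
        || [ $? -eq 1 ]   # grep exits 1 when every line was filtered; treat that as success
}
```

It would be used as `srun ... 2>&1 | filter_pmi_noise`; the `2>&1` is there on the assumption that the messages may arrive on stderr. Note that the pipeline then reports the filter's exit status rather than srun's.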