This issue was reported to me by Victor:
I’m running the IPPL PenningTrap example on Polaris (ALCF) with CUDA/Kokkos enabled and appear to be hitting a multi-node hang in the particle layout update / MPI RMA path.
The application works correctly on a single node, but hangs consistently as soon as communication crosses nodes (even with just 2 nodes and 1 rank per node).
System / Environment
Machine: Polaris (ALCF)
MPI: Cray MPICH / OFI CXI provider
GPUs: NVIDIA A100
Backend: Kokkos::Cuda
FFT: cuFFT linked
CUDA-aware MPI enabled
Launch Configuration
- Single-node working case:
1 node
4 MPI ranks/node
1 GPU per rank
Works correctly
- Multi-node failing case:
2 nodes
1 rank/node OR 4 ranks/node
Hangs in the same location
64 nodes, 4 ranks/node, 512^3 grid, 1,073,741,824 particles
Runtime Output
The application reaches:
Pre-step{0}> Done
and then hangs indefinitely.
CPU utilization remains at ~100% for each rank, while GPU utilization stays at 0%, although GPU memory is allocated successfully.
GPU Status
Each rank correctly binds to a separate GPU and allocates memory. Example from nvidia-smi:
GPU memory allocated (~1GB/rank)
GPU utilization: 0%
So GPU affinity appears correct and CUDA initialization succeeds.
Relevant Environment Variables
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
I also tested:
disabling GPU-aware MPI (MPICH_GPU_SUPPORT_ENABLED=0) → crashes
forcing NIC mappings
FI_CXI_RX_MATCH_MODE=software
1 rank/node vs 4 ranks/node
None resolved the hang.
Backtrace
Both ranks show the exact same stack trace:
#0 __GI_sched_yield()
#1 cxip_cntr_get_ct_success()
#2 cxip_cntr_read()
#3 MPIDI_OFI_win_do_progress()
#4 PMPI_Win_fence()
#5 MPI_Win_fence() at darshan-apmpi.c:1067
#6 ippl::ParticleSpatialLayout<double,3,...>::update<..., Kokkos::Cuda>()
#7 PenningTrapManager<double,3>::LeapFrogStep()
#8 main()
Importantly:
both ranks are stuck in the same MPI_Win_fence()
this occurs only inter-node
single-node execution works correctly
Additional Notes
ldd confirms linkage against:
libcufft.so
libcudart.so
libmpi_gtl_cuda.so
so CUDA/cuFFT/GPU-aware MPI appear properly linked.
This currently looks like either:
an issue in IPPL’s inter-node MPI RMA path (ParticleSpatialLayout::update()), or
a compatibility issue with Cray MPICH OFI/CXI RMA behavior on Polaris.
Do you know if there is:
a non-RMA / two-sided communication path for particle updates,
a known issue with MPI_Win_fence() on Polaris,
or additional runtime flags/settings we should try?