IPPL PenningTrap example on Polaris (ALCF) #504

@aaadelmann

This issue was reported to me by Victor:

I’m running the IPPL PenningTrap example on Polaris (ALCF) with CUDA/Kokkos enabled and appear to be hitting a multi-node hang in the particle layout update / MPI RMA path.

The application works correctly on a single node, but hangs consistently as soon as communication crosses nodes (even with just 2 nodes and 1 rank per node).

System / Environment

  • Machine: Polaris (ALCF)

  • MPI: Cray MPICH / OFI CXI provider

  • GPUs: NVIDIA A100

  • Backend: Kokkos::Cuda

  • FFT: cuFFT linked

  • CUDA-aware MPI enabled

Launch Configuration

  • Single-node working case: 1 node, 4 MPI ranks/node, 1 GPU per rank. Works correctly.

  • Multi-node failing case: 2 nodes, with either 1 rank/node or 4 ranks/node. Hangs in the same location.

  • Target larger run: 64 nodes, 4 ranks/node, 512^3 grid, 1,073,741,824 particles.
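
For a sense of scale, the target run can be broken down per rank and per cell. The sketch below only does that arithmetic, assuming particles are spread evenly across ranks (which a Penning trap distribution may well not satisfy); `RunSize` and the helper names are hypothetical and not part of IPPL.

```cpp
#include <cstdint>

// Hypothetical helper type; only the numbers come from the report.
struct RunSize {
    std::uint64_t nodes;
    std::uint64_t ranksPerNode;
    std::uint64_t gridPerDim;   // cells along each of the 3 dimensions
    std::uint64_t particles;
};

constexpr std::uint64_t totalRanks(const RunSize& r) {
    return r.nodes * r.ranksPerNode;
}

constexpr std::uint64_t totalCells(const RunSize& r) {
    return r.gridPerDim * r.gridPerDim * r.gridPerDim;
}

// Assumes a perfectly even particle distribution across ranks.
constexpr std::uint64_t particlesPerRank(const RunSize& r) {
    return r.particles / totalRanks(r);
}

constexpr std::uint64_t particlesPerCell(const RunSize& r) {
    return r.particles / totalCells(r);
}

// Target run: 64 nodes, 4 ranks/node, 512^3 grid, 1,073,741,824 particles.
constexpr RunSize target{64, 4, 512, 1073741824ULL};

static_assert(totalRanks(target) == 256, "64 nodes x 4 ranks/node");
static_assert(totalCells(target) == 134217728ULL, "512^3 cells");
static_assert(particlesPerRank(target) == 4194304ULL, "2^30 / 256");
static_assert(particlesPerCell(target) == 8, "2^30 / 2^27");
```

So the target is 256 ranks with roughly 4.2 million particles each, at 8 particles per cell on average.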

Runtime Output

The application reaches:

Pre-step{0}> Done

and then hangs indefinitely.

CPU utilization remains at ~100% for each rank, while GPU utilization stays at 0%, although GPU memory is allocated successfully.

GPU Status

Each rank correctly binds to a separate GPU and allocates memory. Example from nvidia-smi:

GPU memory allocated (~1 GB/rank)
GPU utilization: 0%

So GPU affinity appears correct and CUDA initialization succeeds.

Relevant Environment Variables

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1

I also tested:

  • disabling GPU-aware MPI (MPICH_GPU_SUPPORT_ENABLED=0) → crashes

  • forcing NIC mappings

  • FI_CXI_RX_MATCH_MODE=software

  • 1 rank/node vs 4 ranks/node

None resolved the hang.

Backtrace

Both ranks show the exact same stack trace:

#0 __GI_sched_yield()
#1 cxip_cntr_get_ct_success()
#2 cxip_cntr_read()
#3 MPIDI_OFI_win_do_progress()
#4 PMPI_Win_fence()
#5 MPI_Win_fence() at darshan-apmpi.c:1067
#6 ippl::ParticleSpatialLayout<double,3,...>::update<..., Kokkos::Cuda>()
#7 PenningTrapManager<double,3>::LeapFrogStep()
#8 main()

Importantly:

  • both ranks are stuck in the same MPI_Win_fence()

  • this occurs only inter-node

  • single-node execution works correctly

Additional Notes

ldd confirms linkage against libcufft.so, libcudart.so, and libmpi_gtl_cuda.so, so CUDA, cuFFT, and GPU-aware MPI appear properly linked.

This currently looks like either:

an issue in IPPL’s inter-node MPI RMA path (ParticleSpatialLayout::update()), or

a compatibility issue with Cray MPICH OFI/CXI RMA behavior on Polaris.
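
A standalone reproducer outside IPPL might help separate these two possibilities. The following is a minimal sketch, not IPPL's actual communication code: each rank exposes a single int in an MPI window and puts its rank into its right-hand neighbour between two MPI_Win_fence() calls. It deliberately uses host memory, so it exercises only the inter-node fence/put path and not GPU-aware transfers; whether ParticleSpatialLayout::update() uses device buffers for its windows is not something this sketch assumes or tests.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank exposes one int in host memory (assumption: host buffers
    // only, to keep GPU-aware MPI out of this test).
    int* slot = nullptr;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &slot, &win);
    *slot = -1;

    const int right = (rank + 1) % size;          // target of our put
    const int left = (rank + size - 1) % size;    // rank that writes to us
    const int payload = rank;

    MPI_Win_fence(0, win);   // open the access/exposure epoch
    MPI_Put(&payload, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);   // complete the put; the reported hang is in a fence

    const bool ok = (*slot == left);
    std::printf("rank %d of %d: received %d, expected %d (%s)\n",
                rank, size, *slot, left, ok ? "ok" : "MISMATCH");

    MPI_Win_free(&win);
    MPI_Finalize();
    return ok ? 0 : 1;
}
```

If this also hangs with 2 nodes and 1 rank per node, the problem is more likely in the MPI/OFI layer; if it completes, a variant using device buffers would be the natural next step before looking at IPPL itself.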

Do you know if there is:

  • a non-RMA / two-sided communication path for particle updates,

  • a known issue with MPI_Win_fence() on Polaris,

  • or additional runtime flags/settings we should try?
