IPPL PenningTrap example on Polaris (ALCF)

This issue was reported to me by Victor:

I’m running the IPPL PenningTrap example on Polaris (ALCF) with CUDA/Kokkos enabled and appear to be hitting a multi-node hang in the particle layout update / MPI RMA path.

The application works correctly on a single node, but hangs consistently as soon as communication crosses nodes (even with just 2 nodes and 1 rank per node).

### System / Environment

- Machine: Polaris (ALCF)

- MPI: Cray MPICH / OFI CXI provider

- GPUs: NVIDIA A100

- Backend: Kokkos::Cuda

- FFT: cuFFT linked

- CUDA-aware MPI enabled

### Launch Configuration

- Single-node working case:
  - 1 node
  - 4 MPI ranks/node
  - 1 GPU per rank
  - Works correctly
- Multi-node failing case:
  - 2 nodes
  - 1 rank/node OR 4 ranks/node
  - Hangs in the same location
- Target larger run:
  - 64 nodes, 4 ranks/node, 512^3 grid, 1,073,741,824 particles
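For reference, a quick sizing of the target run. This is only a sketch: it assumes particles and grid points are spread evenly across ranks, which is an assumption and not something the report states, since the actual split depends on IPPL's domain decomposition and load balancing.

```python
# Figures taken from the "Target larger run" configuration.
nodes = 64
ranks_per_node = 4
grid_points_per_dim = 512
total_particles = 1_073_741_824

total_ranks = nodes * ranks_per_node        # one GPU per rank
total_grid_points = grid_points_per_dim ** 3

# Assumption: perfectly even distribution across ranks.
particles_per_rank = total_particles // total_ranks
grid_points_per_rank = total_grid_points // total_ranks
particles_per_grid_point = total_particles // total_grid_points

print(f"ranks:                    {total_ranks}")
print(f"grid points:              {total_grid_points:,}")
print(f"particles per rank:       {particles_per_rank:,}")
print(f"grid points per rank:     {grid_points_per_rank:,}")
print(f"particles per grid point: {particles_per_grid_point}")
```

Under that assumption the run uses 256 ranks with about 4.2 million particles each, i.e. 8 particles per grid point.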

### Runtime Output

The application reaches:

`Pre-step{0}> Done`

and then hangs indefinitely.

CPU utilization remains at ~100% for each rank, while GPU utilization stays at 0%, although GPU memory is allocated successfully.

### GPU Status

Each rank correctly binds to a separate GPU and allocates memory. Example from `nvidia-smi`:

- GPU memory allocated (~1GB/rank)
- GPU utilization: 0%

So GPU affinity appears correct and CUDA initialization succeeds.

### Relevant Environment Variables

- `export MPICH_GPU_SUPPORT_ENABLED=1`
- `export OMP_NUM_THREADS=1`

I also tested:

- disabling GPU-aware MPI (`MPICH_GPU_SUPPORT_ENABLED=0`) → crashes
- forcing NIC mappings
- `FI_CXI_RX_MATCH_MODE=software`
- 1 rank/node vs 4 ranks/node

None of these resolved the hang.

### Backtrace

Both ranks show the exact same stack trace:

#0  __GI_sched_yield()
#1  cxip_cntr_get_ct_success()
#2  cxip_cntr_read()
#3  MPIDI_OFI_win_do_progress()
#4  PMPI_Win_fence()
#5  MPI_Win_fence() at darshan-apmpi.c:1067
#6  ippl::ParticleSpatialLayout<double,3,...>::update<..., Kokkos::Cuda>()
#7  PenningTrapManager<double,3>::LeapFrogStep()
#8  main()

Importantly:

- both ranks are stuck in the same `MPI_Win_fence()`
- this occurs only inter-node
- single-node execution works correctly
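With more than two ranks, the "same stack on every rank" check can be scripted. Below is a minimal sketch, assuming each rank's trace is available as text in the gdb-style `#N  function()` format shown above. The helper names `frames`, `same_stack`, and `first_mpi_frame` are hypothetical and not part of IPPL or any MPI tooling.

```python
import re

# Trace copied from the report; in practice there would be one string per rank.
TRACE = r"""
#0  __GI_sched_yield()
#1  cxip_cntr_get_ct_success()
#2  cxip_cntr_read()
#3  MPIDI_OFI_win_do_progress()
#4  PMPI_Win_fence()
#5  MPI_Win_fence() at darshan-apmpi.c:1067
#6  ippl::ParticleSpatialLayout<double,3,...>::update<..., Kokkos::Cuda>()
#7  PenningTrapManager<double,3>::LeapFrogStep()
#8  main()
"""

def frames(trace):
    """Return the function text of each '#N  func()' line, innermost first."""
    out = []
    for line in trace.splitlines():
        m = re.match(r"#(\d+)\s+(.*)", line.strip())
        if m:
            # Drop a trailing ' at file:line' source location if present.
            out.append(m.group(2).split(" at ")[0].strip())
    return out

def same_stack(traces):
    """True if every rank's trace has an identical frame list."""
    parsed = [frames(t) for t in traces]
    return bool(parsed) and all(p == parsed[0] for p in parsed[1:])

def first_mpi_frame(frame_list):
    """Innermost frame that is a public MPI/PMPI entry point, or None."""
    for name in frame_list:
        if name.startswith(("MPI_", "PMPI_")):
            return name
    return None

rank_traces = [TRACE, TRACE]  # rank 0 and rank 1
print(same_stack(rank_traces))
print(first_mpi_frame(frames(TRACE)))
```

For the trace above, this reports that both ranks match and that the innermost MPI entry point is `PMPI_Win_fence()`.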

### Additional Notes

`ldd` confirms linkage against:

- `libcufft.so`
- `libcudart.so`
- `libmpi_gtl_cuda.so`

so CUDA/cuFFT/GPU-aware MPI appear properly linked.

This currently looks like either:

- an issue in IPPL’s inter-node MPI RMA path (`ParticleSpatialLayout::update()`), or
- a compatibility issue with Cray MPICH OFI/CXI RMA behavior on Polaris.

Do you know if there is:

- a non-RMA / two-sided communication path for particle updates,
- a known issue with `MPI_Win_fence()` on Polaris,
- or additional runtime flags/settings we should try?
