ROCm/6.2.4 causes occasional segmentation faults on Frontier during MPI_Init (see OLCFDEV-1655) #7075

@dqwu

Description

On Frontier, the craygnu-hipcc and craygnu-mphipcc compilers currently use ROCm/6.2.4. However, according to OLCFDEV-1655, this version may cause occasional segmentation faults during MPI_Init:

Resolution: This issue is resolved in ROCm/5.5.1, 5.6.0, and 5.7.1, but has re-appeared in >= ROCm/6.1.x.

Recent ne1024 SCREAM decadal runs built with craygnu-hipcc and craygnu-mphipcc have reproduced this issue.

Possible Workarounds

  1. Use an older ROCm version
    ROCm/5.5.1, 5.6.0, and 5.7.1 are confirmed to avoid this issue according to OLCFDEV-1655.
    However, the newest ROCm versions are preferred, and these older versions are known to cause build errors.

  2. Restore "-DSCREAM_SYSTEM_WORKAROUND=1" for craygnu-hipcc and craygnu-mphipcc
    This workaround (suggested by OLCFDEV-1655) was previously applied to crayclang-scream_frontier-scream-gpu.cmake in "Frontier: Additional post-maintenance updates" (scream#2923) and later disabled in "Frontier: disable hipInit before MPI_Init" (scream#2943).

Proposed Fix

Until this issue is fixed in a future ROCm version, reintroduce -DSCREAM_SYSTEM_WORKAROUND=1 for craygnu-hipcc and craygnu-mphipcc in the following files to avoid the segmentation faults:

craygnu-hipcc.cmake

-string(APPEND CPPDEFS " -DLINUX -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DSCREAM_SYSTEM_WORKAROUND_P3_PART2")
+string(APPEND CPPDEFS " -DLINUX -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DSCREAM_SYSTEM_WORKAROUND_P3_PART2 -DSCREAM_SYSTEM_WORKAROUND=1")

craygnu-mphipcc.cmake

-string(APPEND CPPDEFS " -DLINUX -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU")
+string(APPEND CPPDEFS " -DLINUX -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DSCREAM_SYSTEM_WORKAROUND=1")
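
One pitfall when verifying this change: craygnu-hipcc.cmake already defines -DSCREAM_SYSTEM_WORKAROUND_P3_PART2, so a plain substring search for SCREAM_SYSTEM_WORKAROUND reports a match even when the new define is missing. The following is a minimal sketch of a stricter check, not part of E3SM or SCREAM; the names `has_workaround` and `WORKAROUND_FLAG` are hypothetical, and it assumes the define stays on a CPPDEFS line as in the diffs above.

```python
import re

# Hypothetical helper, not part of the E3SM/SCREAM tooling.
# Matches the exact define as a whole whitespace-delimited token, so
# look-alikes such as -DSCREAM_SYSTEM_WORKAROUND_P3_PART2 do not count.
WORKAROUND_FLAG = re.compile(r"(?<!\S)-DSCREAM_SYSTEM_WORKAROUND=1(?!\S)")


def has_workaround(cmake_text: str) -> bool:
    """Return True if a non-comment CPPDEFS line carries the exact define."""
    for line in cmake_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip CMake comment lines
        if "CPPDEFS" not in stripped:
            continue  # assumption: the define is only added via CPPDEFS
        # The flags sit inside a quoted string; turn quotes and the closing
        # parenthesis into spaces so each flag is whitespace-delimited.
        flags = stripped.replace('"', " ").replace(")", " ")
        if WORKAROUND_FLAG.search(flags):
            return True
    return False


# The craygnu-hipcc.cmake line before and after the proposed change.
before = ('string(APPEND CPPDEFS " -DLINUX -DFORTRANUNDERSCORE -DNO_R16 '
          '-DCPRGNU -DSCREAM_SYSTEM_WORKAROUND_P3_PART2")')
after = ('string(APPEND CPPDEFS " -DLINUX -DFORTRANUNDERSCORE -DNO_R16 '
         '-DCPRGNU -DSCREAM_SYSTEM_WORKAROUND_P3_PART2 '
         '-DSCREAM_SYSTEM_WORKAROUND=1")')
```

Passing the contents of each machine file to `has_workaround` should return True only once the define from the diffs above is present; here it is False for `before` and True for `after`.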
