Description
On Frontier, the craygnu-hipcc and craygnu-mphipcc compilers currently use ROCm/6.2.4. However, according to OLCFDEV-1655, this version may cause occasional segmentation faults during MPI_Init:
Resolution: This issue is resolved in ROCm/5.5.1, 5.6.0, and 5.7.1, but has re-appeared in >= ROCm/6.1.x.
This issue has been observed in recent ne1024 SCREAM decadal runs using craygnu-hipcc and craygnu-mphipcc.
Possible Workarounds
- Use an older ROCm version
  ROCm/5.5.1, 5.6.0, and 5.7.1 are confirmed to avoid this issue according to OLCFDEV-1655.
  However, the newest ROCm versions are preferred, and these older versions are known to cause build errors.
- Restore "-DSCREAM_SYSTEM_WORKAROUND=1" for craygnu-hipcc and craygnu-mphipcc
  This workaround (suggested by OLCFDEV-1655) was previously applied to crayclang-scream_frontier-scream-gpu.cmake in scream#2923 ("Frontier: Additional post-maintenance updates") and later disabled in scream#2943 ("Frontier: disable hipInit before MPI_Init").
Proposed Fix
Until this issue is fixed in a future ROCm version, reintroduce -DSCREAM_SYSTEM_WORKAROUND=1 for craygnu-hipcc and craygnu-mphipcc in the following files to avoid the segmentation faults:
craygnu-hipcc.cmake
-string(APPEND CPPDEFS " -DLINUX -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DSCREAM_SYSTEM_WORKAROUND_P3_PART2")
+string(APPEND CPPDEFS " -DLINUX -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DSCREAM_SYSTEM_WORKAROUND_P3_PART2 -DSCREAM_SYSTEM_WORKAROUND=1")
craygnu-mphipcc.cmake
-string(APPEND CPPDEFS " -DLINUX -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU")
+string(APPEND CPPDEFS " -DLINUX -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DSCREAM_SYSTEM_WORKAROUND=1")
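One possible variation on the diffs above, offered only as a sketch: append the workaround flag in its own statement so that it carries a comment and can be removed in a single line once a fixed ROCm version is available. The wording of the comment is an assumption, not text from either file. The same added lines would apply to both craygnu-hipcc.cmake and craygnu-mphipcc.cmake, leaving the existing CPPDEFS line untouched.

```cmake
# Temporary workaround for OLCFDEV-1655: occasional segmentation faults
# during MPI_Init with ROCm >= 6.1.x. Remove this line once a ROCm
# version containing the fix is in use on Frontier.
string(APPEND CPPDEFS " -DSCREAM_SYSTEM_WORKAROUND=1")
```

Because string(APPEND) concatenates onto the existing variable, the resulting CPPDEFS value is the same as in the single-line form proposed above.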