PIO_INTERNAL_ERROR with ERS_P256.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion #6486

@ndkeen

Description

With ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion, the test uses 128 tasks on 1 node and was a little slow. I wanted to try using 2 nodes (256 tasks), but hit an error described here.

Note that, to improve the speed of these tests in the meantime, I went ahead with a PR increasing the task count to 192 (still using 2 nodes):
After #6484, we now use 192 tasks for all components.

To reproduce the error:

ERS_P256.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion

Note I don't see the error with an SMS test, so it seems to be related to writing restarts.

I was seeing:

128: PIO: FATAL ERROR: Aborting... An error occured, Waiting on pending requests on file (./ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t.elm.r.1850-01-07-00000.nc, ncid=54) failed (Number of pending requests on file = 23, Number of variables with pending requests = 23, Number of request blocks = 3, Current block being waited on = 1, Number of requests in current block = 11).. Size of I/O request exceeds INT_MAX (err=-237). Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-pelayout-minor-adjustment/externals/scorpio/src/clib/pio_darray_int.c: 2189)
128: Obtained 10 stack frames.
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x17e301c]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x17e325e]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x17e27f5]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x1833a24]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x1817bb5]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x1833cd2]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x1818b55]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x17c5d60]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x10ca3d5]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x10e2643]
128: MPICH ERROR [Rank 128] [job id 27028879.0] [Fri Jun 21 11:00:33 2024] [nid006900] - Abort(-1) (rank 128 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 128

Jayesh asked: "Does increasing the number of I/O processes (./xmlchange PIO_NUMTASKS=16) fix the issue with ERS.hcru_hcru.I20TRGSWCNPRDCTCBC? Looks like an error from PnetCDF on the total size of the pending writes from a single process being > INT_MAX."
I've not tried that, but I don't think we would want to use it going forward.
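For reference, the suggested workaround would be run from the case directory of the failing test (untried here; the idea is that spreading the pending writes across more I/O processes shrinks the per-process request size below INT_MAX). This is a config fragment, not a verified fix:

```shell
# From the case directory of the failing ERS test (workaround suggested
# above, not yet tried). Raise the number of PIO I/O tasks, then rerun:
./xmlchange PIO_NUMTASKS=16
./case.submit
```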
