Description
With ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion, the test uses 128 tasks on 1 node and was a little slow. I wanted to try using 2 nodes (256 tasks), but hit the error described here.
Note that to still improve speed of these tests, I went ahead with a PR to increase tasks to 192 (still using 2 nodes):
After #6484, we are now using 192 tasks for all components.
To reproduce the error:
ERS_P256.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion
Note that I don't see the error with an SMS test, so it seems to be related to writing restarts.
I was seeing:
128: PIO: FATAL ERROR: Aborting... An error occured, Waiting on pending requests on file (./ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t.elm.r.1850-01-07-00000.nc, ncid=54) failed (Number of pending requests on file = 23, Number of variables with pending requests = 23, Number of request blocks = 3, Current block being waited on = 1, Number of requests in current block = 11).. Size of I/O request exceeds INT_MAX (err=-237). Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-pelayout-minor-adjustment/externals/scorpio/src/clib/pio_darray_int.c: 2189)
128: Obtained 10 stack frames.
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x17e301c]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x17e325e]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x17e27f5]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x1833a24]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x1817bb5]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x1833cd2]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x1818b55]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x17c5d60]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x10ca3d5]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x10e2643]
128: MPICH ERROR [Rank 128] [job id 27028879.0] [Fri Jun 21 11:00:33 2024] [nid006900] - Abort(-1) (rank 128 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 128
Jayesh asked "Does increasing the number of I/O processes (./xmlchange PIO_NUMTASKS=16) fix the issue with ERS.hcru_hcru.I20TRGSWCNPRDCTCBC ? Looks like an error from PnetCDF on the total size of the pending writes from a single process being > INT_MAX"
I've not tried that, but I don't think we would want to use that setting going forward.
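To illustrate why more I/O processes might help, here is a rough back-of-the-envelope sketch in Python. It assumes the pending restart data is split evenly across I/O tasks, which may not match how SCORPIO/PnetCDF actually distributes it. The 20 GiB total, the task counts, and the helper names are hypothetical and were not measured from this test.

```python
# Sketch only: estimates per-I/O-task request size against the C INT_MAX limit.
# Assumption: pending bytes are divided evenly across I/O tasks.

INT_MAX = 2**31 - 1  # largest value of a 32-bit signed C int


def bytes_per_io_task(total_pending_bytes: int, num_io_tasks: int) -> int:
    """Bytes each I/O task would handle under an even split (rounded up)."""
    if num_io_tasks <= 0:
        raise ValueError("num_io_tasks must be positive")
    return -(-total_pending_bytes // num_io_tasks)  # ceiling division


def exceeds_int_max(total_pending_bytes: int, num_io_tasks: int) -> bool:
    """True if a single task's share would be larger than INT_MAX bytes."""
    return bytes_per_io_task(total_pending_bytes, num_io_tasks) > INT_MAX


def min_io_tasks(total_pending_bytes: int) -> int:
    """Smallest task count that keeps every even share at or below INT_MAX."""
    return max(1, -(-total_pending_bytes // INT_MAX))


if __name__ == "__main__":
    total = 20 * 2**30  # hypothetical 20 GiB of pending restart writes
    for ntasks in (8, 16):  # hypothetical I/O task counts
        share = bytes_per_io_task(total, ntasks)
        print(ntasks, share, "exceeds" if share > INT_MAX else "ok")
    print("minimum I/O tasks:", min_io_tasks(total))
```

Under these assumptions, doubling the number of I/O tasks halves each task's share, which is consistent with the suggestion to raise PIO_NUMTASKS. It does not say anything about whether that is a good long-term setting.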