Description
I discovered a problem when continuing an EAMxx F case for less than a day via STOP_OPTION=nsteps in order to get a restart file closer to a failure I was debugging. When I submitted the run, I immediately got the following run-time error:
25: FAIL:
25: (m_output_control.frequency_units=="nsteps" ? timestamp.get_num_steps()<=m_output_control.next_write_ts.get_num_steps() : timestamp<=m_output_control.next_write_ts)
25: /global/u1/w/whannah/E3SM/E3SM_SRC3/components/eamxx/src/share/io/eamxx_output_manager.cpp:372
25: Error! The input timestamp is past the next scheduled write timestamp.
25: - current time stamp : 0001-08-12-03600
25: - next write time stamp: 0001-08-12-28800
25: The most likely cause is an output frequency that is faster than the atm timestep.
25: Try to increase 'frequency' and/or 'frequency_units' in your output yaml file.
25:
25: MPICH ERROR [Rank 25] [job id 42508958.0] [Fri Sep 5 11:14:13 2025] [nid004530] - Abort(1) (rank 25 in comm 432): application called MPI_Abort(comm=0xC4000001, 1) - process 25
This was pretty confusing because I expected running for less than a day to be trivial. I modified the message for this error (see components/eamxx/src/share/io/eamxx_output_manager.cpp) to include the time step values that were being compared:
// Ensure we did not go past the scheduled write time without hitting it
EKAT_REQUIRE_MSG (
    (m_output_control.frequency_units=="nsteps"
        ? timestamp.get_num_steps()<=m_output_control.next_write_ts.get_num_steps()
        : timestamp<=m_output_control.next_write_ts),
    "Error! The input timestamp is past the next scheduled write timestamp.\n"
    " - current step : " + std::to_string(timestamp.get_num_steps()) + "\n"
    " - next write step : " + std::to_string(m_output_control.next_write_ts.get_num_steps()) + "\n"
    " - current time stamp : " + timestamp.to_string() + "\n"
    " - next write time stamp: " + m_output_control.next_write_ts.to_string() + "\n"
    "The most likely cause is an output frequency that is faster than the atm timestep.\n"
    "Try to increase 'frequency' and/or 'frequency_units' in your output yaml file.\n");
This resulted in the following values:
83: Error! The input timestamp is past the next scheduled write timestamp.
83: - current step : 5353
83: - next write step : 8
83: - current time stamp : 0001-08-12-03600
83: - next write time stamp: 0001-08-12-28800
This revealed the root of the problem: the next write step value was incorrect. Looking through the code, it seems that the last_write_ts value defined in the IOControl struct is not properly initialized for the restart stream. It appears that last_write_ts is always initialized to zero, even when a run does not start at time step 0. Additionally, this only takes effect when REST_OPTION=nsteps.
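To make the suspected failure mode concrete, here is a minimal sketch of the step-count comparison. It is not the EAMxx code: `StepControl`, `zero_initialized`, and `seeded_from_restart` are hypothetical names, and the frequency of 8 steps and the resume step of 5352 are assumptions chosen to match the reported values (next write step 8, current step 5353).

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the restart stream's output control.
// Only step counts are modeled; this is NOT the EAMxx IOControl struct.
struct StepControl {
  int frequency;        // write every `frequency` steps (assumed to be 8 here)
  int last_write_step;  // step count of the most recent write

  // Next scheduled write, measured in steps.
  int next_write_step() const {
    assert(frequency > 0);
    return last_write_step + frequency;
  }

  // Mirrors the nsteps branch of the failing check: current <= next write.
  bool not_past_next_write(int current_step) const {
    return current_step <= next_write_step();
  }
};

// Last write step left at zero on a continued run (the suspected bug).
inline StepControl zero_initialized(int frequency) {
  return StepControl{frequency, 0};
}

// Last write step seeded from the step count the run resumes from
// (one possible fix, assumed for illustration).
inline StepControl seeded_from_restart(int frequency, int restart_step) {
  return StepControl{frequency, restart_step};
}
```

Under these assumptions, `zero_initialized(8)` schedules the next write at step 8, so the check fails as soon as the continued run reaches step 5353. `seeded_from_restart(8, 5352)` schedules it at step 5360 and the check passes.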
This also turns out to be a blind spot for our standard restart tests, such as ERS or ERP, because we automatically disable writing restarts from the second run via REST_OPTION=never in the test definition (see cime/CIME/SystemTests/ers.py for an example). By commenting out this one line of the test definition, I can get an ERS test that fails on the second run when it is set to run for less than one day.