EAMxx: Model fails to run twice with REST_OPTION=nsteps #7682

@whannah1

Description

I discovered a problem when continuing an EAMxx F case for less than a day via STOP_OPTION=nsteps in order to get a restart file closer to a failure I was debugging. When I submitted the run I immediately got the following run-time error:

25:  FAIL:
25: (m_output_control.frequency_units=="nsteps" ? timestamp.get_num_steps()<=m_output_control.next_write_ts.get_num_steps() : timestamp<=m_output_control.next_write_ts)
25: /global/u1/w/whannah/E3SM/E3SM_SRC3/components/eamxx/src/share/io/eamxx_output_manager.cpp:372
25: Error! The input timestamp is past the next scheduled write timestamp.
25:   - current time stamp   : 0001-08-12-03600
25:   - next write time stamp: 0001-08-12-28800
25: The most likely cause is an output frequency that is faster than the atm timestep.
25: Try to increase 'frequency' and/or 'frequency_units' in your output yaml file.
25:
25: MPICH ERROR [Rank 25] [job id 42508958.0] [Fri Sep  5 11:14:13 2025] [nid004530] - Abort(1) (rank 25 in comm 432): application called MPI_Abort(comm=0xC4000001, 1) - process 25

This was pretty confusing because I thought running less than a day was a trivial thing. I modified the print statement for this error (see components/eamxx/src/share/io/eamxx_output_manager.cpp) to include the time step values that were being compared:

  // Ensure we did not go past the scheduled write time without hitting it
  EKAT_REQUIRE_MSG (
      (m_output_control.frequency_units=="nsteps"
          ? timestamp.get_num_steps()<=m_output_control.next_write_ts.get_num_steps()
          : timestamp<=m_output_control.next_write_ts),
      "Error! The input timestamp is past the next scheduled write timestamp.\n"
      "  - current step         : " + std::to_string(timestamp.get_num_steps()) + "\n"
      "  - next write step      : " + std::to_string(m_output_control.next_write_ts.get_num_steps()) + "\n"
      "  - current time stamp   : " + timestamp.to_string() + "\n"
      "  - next write time stamp: " + m_output_control.next_write_ts.to_string() + "\n"
      "The most likely cause is an output frequency that is faster than the atm timestep.\n"
      "Try to increase 'frequency' and/or 'frequency_units' in your output yaml file.\n");

This resulted in the following values:

83: Error! The input timestamp is past the next scheduled write timestamp.
83:   - current step         : 5353
83:   - next write step      : 8
83:   - current time stamp   : 0001-08-12-03600
83:   - next write time stamp: 0001-08-12-28800

This revealed the root of the problem: the next write step (8) is wrong, since the run is already at step 5353, even though the printed time stamps themselves are in the right order. Looking through the code, it seems that the last_write_ts value defined in the IOControl struct is not properly initialized for the restart stream. It appears that last_write_ts is always initialized to zero, even when a run does not start at time step 0. Additionally, this only comes into effect when REST_OPTION=nsteps, presumably because that is the only case in which the check compares step counts rather than time stamps.
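To make the suspected mechanism concrete, here is a minimal standalone sketch of the step-count comparison. It is not the EAMxx implementation: `StepControl`, `next_write_step`, `on_schedule`, and `run_example` are hypothetical names, the frequency of 8 steps is inferred from the reported next write step, and the restart step of 5352 is an assumption based on the reported current step of 5353.

```cpp
#include <cassert>

// Hypothetical stand-in for the output-control state (NOT the real IOControl).
struct StepControl {
  int frequency = 8;        // assumed: restart written every 8 steps
  int last_write_step = 0;  // zero unless seeded from the restart state

  // Next scheduled write, measured in model steps.
  int next_write_step() const { return last_write_step + frequency; }

  // Same shape as the nsteps branch of the failing check:
  // the current step must not be past the next scheduled write.
  bool on_schedule(int current_step) const {
    return current_step <= next_write_step();
  }
};

// Walks through the two cases with the numbers from the error message.
void run_example() {
  const int restart_step = 5352;              // assumed step at restart
  const int current_step = restart_step + 1;  // 5353, as reported

  // Case 1: last write step left at zero after a restart.
  StepControl zeroed;
  assert(zeroed.next_write_step() == 8);
  assert(!zeroed.on_schedule(current_step));  // check fails, as in the issue

  // Case 2: last write step seeded from the restart step.
  StepControl seeded;
  seeded.last_write_step = restart_step;
  assert(seeded.next_write_step() == 5360);
  assert(seeded.on_schedule(current_step));   // check passes
}
```

If the diagnosis above is right, seeding the last write step from the restarted run's step count (rather than zero) would be one way to make the check pass; whether that is the appropriate fix in the actual code is left open here.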

This also turns out to be a blind spot for our standard restart tests, such as ERS or ERP, because we automatically disable writing restarts from the second run via REST_OPTION=never in the test definition (see cime/CIME/SystemTests/ers.py for an example). By commenting out this one line of the test definition I can get an ERS test that fails on the second run when set to run for less than one day.

Labels

EAMxx (issues related to EAMxx)
