EAMxx: Model fails to run twice with REST_OPTION=nsteps #7682

I discovered a problem when continuing an EAMxx F case for less than a day via `STOP_OPTION=nsteps`, which I did to get a restart file closer to a failure I was debugging. As soon as the submitted run started, it hit the following run-time error:
```
25:  FAIL:
25: (m_output_control.frequency_units=="nsteps" ? timestamp.get_num_steps()<=m_output_control.next_write_ts.get_num_steps() : timestamp<=m_output_control.next_write_ts)
25: /global/u1/w/whannah/E3SM/E3SM_SRC3/components/eamxx/src/share/io/eamxx_output_manager.cpp:372
25: Error! The input timestamp is past the next scheduled write timestamp.
25:   - current time stamp   : 0001-08-12-03600
25:   - next write time stamp: 0001-08-12-28800
25: The most likely cause is an output frequency that is faster than the atm timestep.
25: Try to increase 'frequency' and/or 'frequency_units' in your output yaml file.
25:
25: MPICH ERROR [Rank 25] [job id 42508958.0] [Fri Sep  5 11:14:13 2025] [nid004530] - Abort(1) (rank 25 in comm 432): application called MPI_Abort(comm=0xC4000001, 1) - process 25
```

This was pretty confusing, because I thought running for less than a day was a trivial thing. I modified the message for this error (see `components/eamxx/src/share/io/eamxx_output_manager.cpp`) to also print the step counts being compared:
```
  // Ensure we did not go past the scheduled write time without hitting it
  EKAT_REQUIRE_MSG (
      (m_output_control.frequency_units=="nsteps"
          ? timestamp.get_num_steps()<=m_output_control.next_write_ts.get_num_steps()
          : timestamp<=m_output_control.next_write_ts),
      "Error! The input timestamp is past the next scheduled write timestamp.\n"
      "  - current step         : " + std::to_string(timestamp.get_num_steps()) + "\n"
      "  - next write step      : " + std::to_string(m_output_control.next_write_ts.get_num_steps()) + "\n"
      "  - current time stamp   : " + timestamp.to_string() + "\n"
      "  - next write time stamp: " + m_output_control.next_write_ts.to_string() + "\n"
      "The most likely cause is an output frequency that is faster than the atm timestep.\n"
      "Try to increase 'frequency' and/or 'frequency_units' in your output yaml file.\n");
```

This resulted in the following values:
```
83: Error! The input timestamp is past the next scheduled write timestamp.
83:   - current step         : 5353
83:   - next write step      : 8
83:   - current time stamp   : 0001-08-12-03600
83:   - next write time stamp: 0001-08-12-28800
```

This revealed the root of the problem: the `next write step` value (8) is far behind the current step (5353), so it cannot be correct. Looking through the code, it seems that the `last_write_ts` value defined in the `IOControl` struct is not properly initialized for the restart stream. It appears that `last_write_ts` is always initialized to zero, even when a run does not start at time step 0. Additionally, this only takes effect when `REST_OPTION=nsteps`.
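To make the suspected failure mode concrete, here is a minimal sketch of the step-count comparison. It is not the EAMxx implementation: `StepControl`, `next_write_step`, `write_schedule_ok`, and the two helper constructors are hypothetical names, the frequency of 8 steps is an assumption inferred from the printed `next write step`, and the resume step of 5352 is likewise assumed for illustration.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the restart stream's control data.
// These are NOT the actual EAMxx IOControl fields.
struct StepControl {
  int frequency;        // write every `frequency` steps (assumed to be 8 here)
  int last_write_step;  // step count recorded at the last write
};

// Next scheduled write, measured in model steps.
int next_write_step(const StepControl& c) {
  assert(c.frequency > 0);
  return c.last_write_step + c.frequency;
}

// Mirrors the "nsteps" branch of the failing check:
// the current step must not be past the next scheduled write.
bool write_schedule_ok(const StepControl& c, int current_step) {
  return current_step <= next_write_step(c);
}

// Continued run whose last-write step was left at zero (the suspected bug).
StepControl zero_initialized() {
  return StepControl{8, 0};
}

// Same run with the last-write step set to the step the run resumes from.
StepControl restart_initialized(int resume_step) {
  return StepControl{8, resume_step};
}
```

Under these assumptions, `write_schedule_ok(zero_initialized(), 5353)` is false, since the next write is scheduled for step 8, which matches the values printed above. With the last-write step taken from the resume point, for example `restart_initialized(5352)`, the next write lands at step 5360 and the check passes.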

This also turns out to be a blind spot for our standard restart tests, such as ERS or ERP, because we automatically disable writing restarts from the second run via `REST_OPTION=never` in the test definition (see `cime/CIME/SystemTests/ers.py` for an example). By commenting out this one line of the test definition, I can get an ERS test that fails on the second run when it is set to run for less than one day.