Skip to content

Conversation

glemieux
Copy link
Contributor

@glemieux glemieux commented Aug 7, 2025

Description:

This pull request resolves an exact restart failure that was found during testing of e3sm-project#7231: E3SM-Project/E3SM#7231 (comment). The issue was due to a problem in the use of the _hold restart variables to reconstruct gpp_site and ar_site. The reconstruction relied on the patch/cohort structure to accumulate these cohort-level _hold variables, but the structure did not match the pre-restart structure as it was passed through a fuse/termination process prior to restart write. We are now using the gpp_acc and aresp_acc directly, which have been added to the restart, which match the structure.

This also adds the site-level mass_balance to address E3SM-Project/E3SM#7278.

Collaborators:

@rgknox

Expectation of Answer Changes:

Checklist

If this is your first time contributing, please read the CONTRIBUTING document.

All checklist items must be checked to enable merging this pull request:

Contributor

  • The in-code documentation has been updated with descriptive comments
  • The documentation has been assessed to determine if updates are necessary

Integrator

  • FATES PASS/FAIL regression tests were run
  • Evaluation of test results for answer changes was performed and results provided
  • FATES-CLM6 Code Freeze: satellite phenology regression tests are b4b

If satellite phenology regressions are not b4b, please hold merge and notify the FATES development team.

Documentation

Test Results:

CTSM (or) E3SM (specify which) test hash-tag:

CTSM (or) E3SM (specify which) baseline hash-tag:

FATES baseline hash-tag:

Test Output:

@glemieux
Copy link
Contributor Author

@rgknox noted that I should add an isnew check to avoid including new cohorts into the summations, which alleviated a development issue in which I was seeing NaNs for the timestep post restart. There are still slight differences during exact restart however on the order of E-10.

@glemieux
Copy link
Contributor Author

glemieux commented Sep 10, 2025

@rgknox and I reviewed this and determined that the restart issue seen in E3SM-Project/E3SM#7482 is that we can't reconstruct the gpp_site and ar_site from the restart _hold values. This is because the the calculation of these site-level variables happens in ed_integrate_state_variables which happens prior to cohort termination and fusing before the restart write. So when trying to loop through the cohort structure to reconstruct the site-level values from the restarted hold values will be off.

The near-term fix here is to avoid reconstructing with the hold values and adding gpp_site and ar_site directly to the restart structure.

@rgknox rgknox marked this pull request as ready for review September 14, 2025 18:20
@rgknox
Copy link
Contributor

rgknox commented Sep 15, 2025

tested fates suite on derecho, all expected pass

@glemieux glemieux moved this from Under Review to Finding Reviewers in FATES Pull Request Planning and Status Sep 15, 2025
@glemieux glemieux requested a review from ckoven September 15, 2025 16:51
@glemieux glemieux moved this from Finding Reviewers to Under Review in FATES Pull Request Planning and Status Sep 15, 2025
Copy link
Contributor

@rosiealice rosiealice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have looked through this and it all looks good to me.

n.b. that in my NBP code I have been using npp_site and not gpp_site-rh_site in the interface, so hopefully this issue won't be having an effect in NorESM.

Copy link
Contributor

@ckoven ckoven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rgknox and @glemieux, this looks great to me!

@glemieux glemieux moved this from Under Review to Final Testing in FATES Pull Request Planning and Status Sep 18, 2025
@glemieux
Copy link
Contributor Author

glemieux commented Sep 20, 2025

@rgknox I ran the full e3sm fates test list on perlmutter and found that we have a new exact restart error, but only for fates_cold_logging this time. The FATES_CBALANCE_ERROR is not B4B on the time step immediately following restart, but only for that timestep. All other later values for this variable are B4B.

 FATES_CBALANCE_ERROR   (lon,lat,time)  t_index =     33    33
          4     3312  (    58,    10,     1) (    24,    25,     1) (    59,    11,     1) (    63,    24,     1)
                1144   3.980355057296992E-17  -3.552713599391162E-17 5.3E-18 -1.583095810303794E-17 1.3E-03 -1.081439471957639E-17
                1144   3.980355057296992E-17  -3.552713599391162E-17         -1.056767817571509E-17         -5.551115205843844E-18
                3312  (    58,    10,     1) (    24,    25,     1)
          avg abs field values:    8.183425185801436E-18    rms diff: 3.1E-19   avg rel diff(npos):  1.3E-03
                                   8.165021244352744E-18                        avg decimal digits(ndif):  0.4 worst:  0.3
 RMS FATES_CBALANCE_ERROR             3.1122E-19            NORMALIZED  3.8074E-02

Since this addresses the initial NEP exact restart issue, I'm tempted to simply file a new issue and integrate this PR, especially since the restart error doesn't propagate over time. Note that we don't have this testmod in the e3sm_land_developer list so this won't show up in the integration tests for E3SM-Project/E3SM#7231, so it wouldn't hold things up from the E3SM integrator's perspective.

One other note: I recently ran the e3sm fates tests list for the API 41 update (E3SM-Project/E3SM#7665) and did not see this issue with fates_cold_logging.

Results: e3sm-tests/pr7231-1448-fates.fates.pm-cpu..Eea69a01050-Ff847ef79

@glemieux
Copy link
Contributor Author

Since this addresses the initial NEP exact restart issue, I'm tempted to simply file a new issue and integrate this PR, especially since the restart error doesn't propagate over time. Note that we don't have this testmod in the e3sm_land_developer list so this won't show up in the integration tests for E3SM-Project/E3SM#7231, so it wouldn't hold things up from the E3SM integrator's perspective.

During SE meeting discussion today, @rgknox and I agreed that this should move forward. I'll integrate this today and update the E3SM-Project/E3SM#7231 to point to the resulting tag. I'll also make a new issue using the comment above.

@glemieux glemieux merged commit f6355b1 into NGEET:main Sep 22, 2025
1 check passed
@github-project-automation github-project-automation bot moved this from Final Testing to Ready to Integrate in FATES Pull Request Planning and Status Sep 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Ready to Integrate
Development

Successfully merging this pull request may close these issues.

4 participants