Skip to content

Conversation

bartgol
Copy link
Contributor

@bartgol bartgol commented Jul 1, 2025

When updating the views after subview info changes, we also need to update the stored subview (for later check).

Also, modify model_restart test, so that we don't re-run the 1st part of the run. Instead, make the base run also
write the restart file (like CIME does). Note: this requires running less than 2*rest_freq steps, or else the rpointer
file will be nuked, and the rest run will NOT find the first .r file. Hence, run 2+1 steps (with restart every 2) rather
than 1+1.


While trying to debug PR #7424, I noticed that the model_restart test was failing also without rrtmgp, hinting that the pr mods were not the problem. After some digging, I realized that master would also fail if run with 2+1 steps (restart file written after 2 steps). Printing bfb hashes, I saw that homme's output on physics grid after restart was diffing, but the dyn grid fields were bfb, hinting to some issue with remap.

The core problem seems to be that inside the PD remapper, when updating the stored views in case subview info changed, we do not update the stored subview info. This would ONLY affect D->P remap (changing the subview is not allowed for P->D), which is ONLY used when pg2 is not used (pg2 cases use a different remap).

What bugs me is that an ERS test with 2+1 steps and ne4 grids (no pg2) does produce BFB output (although I am not 100% sure I did things correctly, as CIME was not picking up the scream nc file, so I had to do some hacks to get the base and rest nc file separately).

@tcclevenger Let me know what your thoughts are (on both the fix, as well as why a CIME case may not reproduce the non-bfbness).

@bartgol bartgol requested a review from tcclevenger July 1, 2025 18:08
@bartgol bartgol self-assigned this Jul 1, 2025
@bartgol bartgol added bug fix PR EAMxx Issues related to EAMxx labels Jul 1, 2025
@tcclevenger
Copy link
Contributor

Just a thought, have you tested if this new model_restart test setup fails w/o the subview info fix, or is it just the old test setup that fails? If it doesn't fail with the new setup, that would at least explain why CIME isn't DIFFing w/o the fix.

const auto& new_svi = f.get_header().get_alloc_properties().get_subview_info();
if ( not(svi==new_svi) ) {
get_view(fname,f);
svi = new_svi;
Copy link
Contributor

@tcclevenger tcclevenger Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly sure where this subfield info is used, but seems to make sense to me that we'd need this fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sv info is used right above, in the if statement, to decide whether we should update the view or not. Honestly, I think we could just always update the views, but we can keep the logic as is. In the P->D remap, otoh, we do NOT allow changing the subview info, so storing it is also needed there.

@bartgol
Copy link
Contributor Author

bartgol commented Jul 1, 2025

Just a thought, have you tested if this new model_restart test setup fails w/o the subview info fix, or is it just the old test setup that fails? If it doesn't fail with the new setup, that would at least explain why CIME isn't DIFFing w/o the fix.

The new setup does fail without the fix. In fact, I was working on just changing the setup separately (because I was getting annoyed of running twice the first 2 steps, when trying to debug the rad PR), and that's when I saw it was failing. So yes, the PD fix does matter for this setup.

I am still a bit uneasy b/c of the ERS test that does not fail though...

@bartgol bartgol marked this pull request as draft July 1, 2025 19:48
@bartgol bartgol marked this pull request as ready for review July 1, 2025 19:48
bartgol added 2 commits July 1, 2025 15:10
- Run only 2 tests, like CIME: a base run, and a rest run
- Run for 3 steps, write restart at 2: with a 1+1 run the base
  run will write a second .r file at the end, nuking rpointer,
  and the rest run will not find the first .r file in rpointer
@bartgol bartgol force-pushed the bartgol/eamxx/homme-pd-remap-fix branch from e6401ab to 785cff2 Compare July 1, 2025 21:11
Copy link
Contributor

@tcclevenger tcclevenger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix makes sense, but it is odd that CIME ERS wasn't failing.. Probably ok since it is just a bug in non-PG2 run.

@bartgol
Copy link
Contributor Author

bartgol commented Jul 2, 2025

Ok, I found out why CIME tests were passing in master. The reason is that in our ne4_ne4 test, we have se_tstep=600, while the atm dt is 1800. Hence, dynamics cycles through the states time levels an integer number of times, effectively getting back to the same time slice every time. Hence, even if the PD remapper was not updating the time slices subview info, everything worked. In the standalone test, however, we used dt_dyn=dt_atm, causing the slices ids to change (between before and after Hommexx calls).

I ran ERS_Ln3_ne4_ne4.F2010-SCREAMv1 in master changing dt_remap_factor=1 and se_tstep=1800 and got a base-rest DIFF. With this branch, even with those settings, the base-rest test is a PASS.

As for the CI cime case DIFF, I verified the same happens in master (I am tracking down the PR that introduced the DIFF, to make sure it's expected). Meanwhile, I think we can merge this.

Edit: I confirmed that the DIFF in the ci job is due to #7247, which was a CC PR, so the DIFF was expected. I just triggered a baseline generation for the CI workflow. I'll merge this now.

bartgol added a commit that referenced this pull request Jul 3, 2025
When updating the views after subview info changes, we also need to update the stored subview (for later check).

Also, modify model_restart test, so that we don't re-run the 1st part of the run. Instead, make the base run also
write the restart file (like CIME does). Note: this requires running less than 2*rest_freq steps, or else the rpointer
file will be nuked, and the rest run will NOT find the first .r file. Hence, run 2+1 steps (with restart every 2) rather
than 1+1.
@bartgol bartgol merged commit 203df99 into master Jul 3, 2025
17 of 19 checks passed
@bartgol bartgol deleted the bartgol/eamxx/homme-pd-remap-fix branch July 3, 2025 00:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug fix PR EAMxx Issues related to EAMxx
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants