-
Notifications
You must be signed in to change notification settings - Fork 435
EAMxx: fix a nano bug in PhysicsDynamicsRemapper #7477
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Just a thought, have you tested if this new model_restart test setup fails w/o the subview info fix, or is it just the old test setup that fails? If it doesn't fail with the new setup, that would at least explain why CIME isn't DIFFing w/o the fix. |
const auto& new_svi = f.get_header().get_alloc_properties().get_subview_info(); | ||
if ( not(svi==new_svi) ) { | ||
get_view(fname,f); | ||
svi = new_svi; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactly sure where this subfield info is used, but seems to make sense to me that we'd need this fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sv info is used right above, in the if statement, to decide whether we should update the view or not. Honestly, I think we could just always update the views, but we can keep the logic as is. In the P->D remap, otoh, we do NOT allow changing the subview info, so storing it is also needed there.
The new setup does fail without the fix. In fact, I was working on just changing the setup separately (because I was getting annoyed of running twice the first 2 steps, when trying to debug the rad PR), and that's when I saw it was failing. So yes, the PD fix does matter for this setup. I am still a bit uneasy b/c of the ERS test that does not fail though... |
- Run only 2 tests, like CIME: a base run, and a rest run - Run for 3 steps, write restart at 2: with a 1+1 run the base run will write a second .r file at the end, nuking rpointer, and the rest run will not find the first .r file in rpointer
e6401ab
to
785cff2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix makes sense, but it is odd that CIME ERS wasn't failing.. Probably ok since it is just a bug in non-PG2 run.
Ok, I found out why CIME tests were passing in master. The reason is that in our ne4_ne4 test, we have se_tstep=600, while the atm dt is 1800. Hence, dynamics cycles through the states time levels an integer number of times, effectively getting back to the same time slice every time. Hence, even if the PD remapper was not updating the time slices subview info, everything worked. In the standalone test, however, we used dt_dyn=dt_atm, causing the slices ids to change (between before and after Hommexx calls). I ran ERS_Ln3_ne4_ne4.F2010-SCREAMv1 in master changing dt_remap_factor=1 and se_tstep=1800 and got a base-rest DIFF. With this branch, even with those settings, the base-rest test is a PASS. As for the CI cime case DIFF, I verified the same happens in master (I am tracking down the PR that introduced the DIFF, to make sure it's expected). Meanwhile, I think we can merge this. Edit: I confirmed that the DIFF in the ci job is due to #7247, which was a CC PR, so the DIFF was expected. I just triggered a baseline generation for the CI workflow. I'll merge this now. |
When updating the views after subview info changes, we also need to update the stored subview (for later check). Also, modify model_restart test, so that we don't re-run the 1st part of the run. Instead, make the base run also write the restart file (like CIME does). Note: this requires running less than 2*rest_freq steps, or else the rpointer file will be nuked, and the rest run will NOT find the first .r file. Hence, run 2+1 steps (with restart every 2) rather than 1+1.
When updating the views after subview info changes, we also need to update the stored subview (for later check).
Also, modify model_restart test, so that we don't re-run the 1st part of the run. Instead, make the base run also
write the restart file (like CIME does). Note: this requires running less than 2*rest_freq steps, or else the rpointer
file will be nuked, and the rest run will NOT find the first .r file. Hence, run 2+1 steps (with restart every 2) rather
than 1+1.
While trying to debug PR #7424, I noticed that the model_restart test was failing also without rrtmgp, hinting that the pr mods were not the problem. After some digging, I realized that master would also fail if run with 2+1 steps (restart file written after 2 steps). Printing bfb hashes, I saw that homme's output on physics grid after restart was diffing, but the dyn grid fields were bfb, hinting to some issue with remap.
The core problem seems to be that inside the PD remapper, when updating the stored views in case subview info changed, we do not update the stored subview info. This would ONLY affect D->P remap (changing the subview is not allowed for P->D), which is ONLY used when pg2 is not used (pg2 cases use a different remap).
What bugs me is that an ERS test with 2+1 steps and ne4 grids (no pg2) does produce BFB output (although I am not 100% sure I did things correctly, as CIME was not picking up the scream nc file, so I had to do some hacks to get the base and rest nc file separately).
@tcclevenger Let me know what your thoughts are (on both the fix, as well as why a CIME case may not reproduce the non-bfbness).