Skip to content

Conversation

vanroekel
Copy link
Contributor

@vanroekel vanroekel commented Apr 28, 2025

This PR addresses two bugs found when coupling and running a F20TR case
on a LANL institutional machine at ne120 resolution with DEBUG enabled.

Addresses #7298

[BFB]

@vanroekel vanroekel changed the title Addresses oro drag bugs Fixes bugs in new orographic wave drag scheme Apr 28, 2025
!-----------------------------------------------------------------------

ncol=state%ncol
allocate(dx(ncol),dy(ncol))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like you haven't added a corresponding deallocate statement?
I think the original approach of using pcols is preferable to an allocatable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what problem the original code dx(pcols), dy(pcols) would cause. Their size (pcols) as declared are always no smaller than dx(ids:ide) (essenrtially dx(1:ncol) ). When the array is referenced inside subroutine od2d, they will never go out of bound.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confess I don't understand either but I was getting invalid memory reference errors, when I printed out it seemed ncols > pcols so it seemed possible to go out of bounds, but I can test further.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if ncol>pcols then we have much bigger problems.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just confirmed that ncol is not greater than pcols, a relief. Also on my machine allocating dxmeter and dymeter as pcols didn't fix the issue. I'm trying dxmeter(its:ite) now for the declaration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I've dug into this a bit more. The stack trace on failure is on this

delx = dxmeter
dely = dymeter

delx and dely are declared its:ite. I tried declaring dxmeter and dymeter with pcols, this failed, but the second dxmeter(its:ite) worked. I confess I don't understand why this was tripping a failure for the lines above. The assignment seems to have the same extent in memory. It should be something like delx(1:ncols) = dxmeter(:pcols). Are there cases when pcols != ncols?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's pretty common for the compiler to point to an adjacent line. I don't get it either, but I've seen it many times.

It is actually almost always the case that pcols!=ncol. This is because pcols is known at compile time and is a single value, whereas ncol varies across MPI ranks and is specific to each "chunk". This is the way that we can control the load balancing for the atmosphere physics. in other words pcols is the max that ncol can be for any given chunk.

It is best always assume that pcols>ncol. This will help you understand why it is sometimes preferable to use allocatables vs variables with static size of pcols.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the explanation @whannah1 very helpful. is the its:ite approach okay? It appears dx and dy are declared pcols but delx and dely are ncols, so it does make sense (I think) that

delx = dxmeter

could fail when ncols < pcols. It's not clear what the best approach is then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think changing that line to the following should work:

delx(its:ite) = dxmeter(its:ite)

The way things are called we get its=1,ite=ncol, so I think explicitly using its:ite everywhere we can should clear up any issues like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a much better idea. I can make that change.

Comment on lines 396 to 398
do k = 2, pver
if((pblheight >= height_array(k).and.pblheight <height_array(k-1)))then
pblh_get_level_idx = k+1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem like a valid fix to me, but maybe I'm missing something.

It seems we could still have a situation where pblh_get_level_idx never gets set and the initialized value of -1 gets passed up, right?

if so then the variable kpbl2d_reverse_in could end up with a value of pverp+1 which would be bad.

      kpbl2d_in(i)=pblh_get_level_idx(zbot(i,:)-(state%phis(i)/gravit),pblh(i))
      kpbl2d_reverse_in(i)=pverp-kpbl2d_in(i)!pverp-k

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder under what circumstances a pbl could not be found. pblheight becomes <=0? or height_array(pverp) < 0? height_array(pverp) as calculated should always be 0 unless there is a round off.

Regarding the calculations, it is certainly safe to start k from 2 or even a much larger value for our model. Ideally it should be scanned bottom up and stop once found. The original if-logic is good in locating the layer interface no higher than the pbl_height value, and I feel the original logic is easier to follow in setting the level in EAM's top-down indexing. (@whannah1 , is it correct kpbl2d_in(i) is to for the index for the interface rather than the mid-level?)

But the calculation of kpbl2d_reverse_in(i) is indeed off. Thanks @vanroekel for catching it. The reversed (bottom-up indexing) should always be
kpbl2d_reverse_in(i)=pverp-kpbl2d_in(i)+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are totally right. I don't understand how this let my code run. L398 should be pblh_get_level_idx = k I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vanroekel , the developer @jinboxie confirmed kpbl2d_in is for the interface level index; and since height_array is defined as bottom interface of each layer, yes, L398 should be pblh_get_level_idx = k. Furthermore, your changes to the if-logic is more suitable, because the original one referencing k+1 will go out of bound of height_array(pver).

I also need to take my word back on modifying kpbl2d_reverse_in. The original is good. I thought height_array was defined just like zi, but actually it is for bottom interface only with the upper bound index pver.

Can you do a new test by changing L398 alone to pblh_get_level_idx = k?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I've made that change and am testing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wlin7 is there a way I can verify this fix? Is there a PBL diagnostic variable I can include and a way to verify if the index chosen is right? the code built and ran with this change. Do you want me to do any further testing?

@rljacob
Copy link
Member

rljacob commented May 15, 2025

@vanroekel update?

@rljacob rljacob marked this pull request as draft May 29, 2025 17:30
@rljacob
Copy link
Member

rljacob commented May 29, 2025

needs additional review and testing.

@vanroekel
Copy link
Contributor Author

@whannah1 or @wlin7 what do you want me to do on this PR? Should I close it and assign the issue to someone on the atmosphere team? I can also do more testing. As I've looked through this code further, I see lots of mix and match of ncols and pcol (e.g. one loop initializes an array over ncols, but the next loops to pcol or vice versa). Also, in the pbl height routine, there is no check if the pbl height is never found. It seems possible uninitialized values could be returned.

Let me know if you want me to push further changes and testing or just close this to allow a more detailed investigation. I will note that @ndkeen encountered this same issue on perlmutter-cpu with gcc and in DEBUG. The fixes here allowed that to pass.

@ndkeen
Copy link
Contributor

ndkeen commented May 29, 2025

Yes, for gnu compiler and DEBUG, see error (not with intel).

And I saw runtime fails with gnu in optimized builds for certain pe layouts. not all. This fix corrected both issues

@whannah1
Copy link
Contributor

@vanroekel

@whannah1 or @wlin7 what do you want me to do on this PR?

I'm leaning towards merging this PR as is.

As I've looked through this code further, I see lots of mix and match of ncols and pcol

This doesn't surprise me. I did a fair bit of refactoring to the code that Jinbo contributed, but there were several parts that I intentionally didn't try to improve because I didn't have the bandwidth. At this point I feel that it's one of those things where we shouldn't fix it until it breaks, unless someone else can justify spending time to give it a thorough clean-up. We'll probably need to do this eventually for porting to eamxx.

Also, in the pbl height routine, there is no check if the pbl height is never found. It seems possible uninitialized values could be returned.

Adding a simple check would be a good idea. Just checking whether that variable is still set to the initial value of -1 should be easy.

endif
enddo

if (.not. found) then
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@whannah1 here is my very simple pblh check. Is this along the lines you are envisioning? I need to update this to use a better way to stop the model, which I'm happy to make that change if this structure works. I can also unwind this commit from the PR and leave that addition for later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely close to what I was thinking, but a simpler version would be:

if (.not.found) call endrun('ERROR - pblh_get_level_idx: pbl top index not found')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! That's what I needed to look up on how to stop the model. I'll make that change.

pblh_get_level_idx = k+1
do k = 2, pver
if((pblheight >= height_array(k).and.pblheight <height_array(k-1)))then
pblh_get_level_idx = k - 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vanroekel , the change to pblh_get_level_idx = k - 1 via commit b7289da moves it one level up. What is the consideration leading to this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wlin7 yes that's not right at all. Just to confirm L397 should be pblh_get_level_idx = k?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, @vanroekel . If no other special consideration, let us change it back to pblh_get_level_idx = k , to maintain consistency with the original implementation. This is also suggested by the developer @jinboxie. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wlin7 I've made the correction and am about to launch a test of this change. Will report back later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With line 397 changed back to pblh_get_level_idx = k, a multi-year test with v3.LR confirms this PR is BFB with current master, as expected, when the new oro gravity drag scheme is turned on. That also means out-of-bound violation never occur even with the original code in the test (v3.LR.F2010 for 6 years on chrysalis).

@rljacob
Copy link
Member

rljacob commented Jul 14, 2025

@xuezhengllnl this is the PR you want in for v3.HR? Is it finished? Is it BFB and just fixes a potential problem?

@vanroekel vanroekel force-pushed the vanroekel/eam/oro-drag-fixes branch from 032e2b4 to 384a7ce Compare July 14, 2025 19:09
Copy link
Contributor

@wlin7 wlin7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All look good. And the changes have been verified BFB with this PR (and final changes to Line 397) pre-merged to master for a multi-year v3.LR test that turns on new orographic drag scheme.

@vanroekel vanroekel marked this pull request as ready for review July 16, 2025 19:32
@wlin7 wlin7 added the BFB PR leaves answers BFB label Jul 16, 2025
wlin7 added a commit that referenced this pull request Jul 16, 2025
…7299)

Fixes bugs in new orographic wave drag scheme

This PR addresses two bugs found when coupling and running a F20TR case
on a LANL institutional machine at ne120 resolution with DEBUG enabled.

Addresses #7298

[BFB]
@wlin7
Copy link
Contributor

wlin7 commented Jul 16, 2025

Merged to next.

@wlin7 wlin7 merged commit 8624b1d into E3SM-Project:master Jul 17, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants