-
Notifications
You must be signed in to change notification settings - Fork 437
Fixes bugs in new orographic wave drag scheme #7299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes bugs in new orographic wave drag scheme #7299
Conversation
!----------------------------------------------------------------------- | ||
|
||
ncol=state%ncol | ||
allocate(dx(ncol),dy(ncol)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like you haven't added a corresponding deallocate statement?
I think the original approach of using pcols
is preferable to an allocatable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what problem the original code dx(pcols), dy(pcols)
would cause. Their size (pcols) as declared are always no smaller than dx(ids:ide)
(essenrtially dx(1:ncol)
). When the array is referenced inside subroutine od2d, they will never go out of bound.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confess I don't understand either but I was getting invalid memory reference errors, when I printed out it seemed ncols > pcols so it seemed possible to go out of bounds, but I can test further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if ncol>pcols
then we have much bigger problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just confirmed that ncol is not greater than pcols, a relief. Also on my machine allocating dxmeter and dymeter as pcols didn't fix the issue. I'm trying dxmeter(its:ite)
now for the declaration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I've dug into this a bit more. The stack trace on failure is on this
delx = dxmeter
dely = dymeter
delx
and dely
are declared its:ite
. I tried declaring dxmeter and dymeter with pcols, this failed, but the second dxmeter(its:ite) worked. I confess I don't understand why this was tripping a failure for the lines above. The assignment seems to have the same extent in memory. It should be something like delx(1:ncols) = dxmeter(:pcols). Are there cases when pcols != ncols?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's pretty common for the compiler to point to an adjacent line. I don't get it either, but I've seen it many times.
It is actually almost always the case that pcols!=ncol
. This is because pcols
is known at compile time and is a single value, whereas ncol
varies across MPI ranks and is specific to each "chunk". This is the way that we can control the load balancing for the atmosphere physics. in other words pcols
is the max that ncol
can be for any given chunk.
It is best always assume that pcols>ncol
. This will help you understand why it is sometimes preferable to use allocatables vs variables with static size of pcols
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the explanation @whannah1 very helpful. is the its:ite
approach okay? It appears dx
and dy
are declared pcols
but delx
and dely
are ncols
, so it does make sense (I think) that
delx = dxmeter
could fail when ncols < pcols
. It's not clear what the best approach is then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think changing that line to the following should work:
delx(its:ite) = dxmeter(its:ite)
The way things are called we get its=1,ite=ncol
, so I think explicitly using its:ite
everywhere we can should clear up any issues like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a much better idea. I can make that change.
do k = 2, pver | ||
if((pblheight >= height_array(k).and.pblheight <height_array(k-1)))then | ||
pblh_get_level_idx = k+1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem like a valid fix to me, but maybe I'm missing something.
It seems we could still have a situation where pblh_get_level_idx
never gets set and the initialized value of -1
gets passed up, right?
if so then the variable kpbl2d_reverse_in
could end up with a value of pverp+1
which would be bad.
kpbl2d_in(i)=pblh_get_level_idx(zbot(i,:)-(state%phis(i)/gravit),pblh(i))
kpbl2d_reverse_in(i)=pverp-kpbl2d_in(i)!pverp-k
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder under what circumstances a pbl could not be found. pblheight becomes <=0? or height_array(pverp) < 0
? height_array(pverp) as calculated should always be 0 unless there is a round off.
Regarding the calculations, it is certainly safe to start k from 2 or even a much larger value for our model. Ideally it should be scanned bottom up and stop once found. The original if-logic is good in locating the layer interface no higher than the pbl_height value, and I feel the original logic is easier to follow in setting the level in EAM's top-down indexing. (@whannah1 , is it correct kpbl2d_in(i)
is to for the index for the interface rather than the mid-level?)
But the calculation of kpbl2d_reverse_in(i)
is indeed off. Thanks @vanroekel for catching it. The reversed (bottom-up indexing) should always be
kpbl2d_reverse_in(i)=pverp-kpbl2d_in(i)+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you are totally right. I don't understand how this let my code run. L398 should be pblh_get_level_idx = k
I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vanroekel , the developer @jinboxie confirmed kpbl2d_in is for the interface level index; and since height_array is defined as bottom interface of each layer, yes, L398 should be pblh_get_level_idx = k. Furthermore, your changes to the if-logic is more suitable, because the original one referencing k+1
will go out of bound of height_array(pver)
.
I also need to take my word back on modifying kpbl2d_reverse_in. The original is good. I thought height_array
was defined just like zi
, but actually it is for bottom interface only with the upper bound index pver
.
Can you do a new test by changing L398 alone to pblh_get_level_idx = k
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I've made that change and am testing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wlin7 is there a way I can verify this fix? Is there a PBL diagnostic variable I can include and a way to verify if the index chosen is right? the code built and ran with this change. Do you want me to do any further testing?
@vanroekel update? |
needs additional review and testing. |
@whannah1 or @wlin7 what do you want me to do on this PR? Should I close it and assign the issue to someone on the atmosphere team? I can also do more testing. As I've looked through this code further, I see lots of mix and match of ncols and pcol (e.g. one loop initializes an array over ncols, but the next loops to pcol or vice versa). Also, in the pbl height routine, there is no check if the pbl height is never found. It seems possible uninitialized values could be returned. Let me know if you want me to push further changes and testing or just close this to allow a more detailed investigation. I will note that @ndkeen encountered this same issue on perlmutter-cpu with gcc and in DEBUG. The fixes here allowed that to pass. |
Yes, for gnu compiler and DEBUG, see error (not with intel). And I saw runtime fails with gnu in optimized builds for certain pe layouts. not all. This fix corrected both issues |
I'm leaning towards merging this PR as is.
This doesn't surprise me. I did a fair bit of refactoring to the code that Jinbo contributed, but there were several parts that I intentionally didn't try to improve because I didn't have the bandwidth. At this point I feel that it's one of those things where we shouldn't fix it until it breaks, unless someone else can justify spending time to give it a thorough clean-up. We'll probably need to do this eventually for porting to eamxx.
Adding a simple check would be a good idea. Just checking whether that variable is still set to the initial value of -1 should be easy. |
endif | ||
enddo | ||
|
||
if (.not. found) then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@whannah1 here is my very simple pblh check. Is this along the lines you are envisioning? I need to update this to use a better way to stop the model, which I'm happy to make that change if this structure works. I can also unwind this commit from the PR and leave that addition for later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely close to what I was thinking, but a simpler version would be:
if (.not.found) call endrun('ERROR - pblh_get_level_idx: pbl top index not found')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! That's what I needed to look up on how to stop the model. I'll make that change.
pblh_get_level_idx = k+1 | ||
do k = 2, pver | ||
if((pblheight >= height_array(k).and.pblheight <height_array(k-1)))then | ||
pblh_get_level_idx = k - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vanroekel , the change to pblh_get_level_idx = k - 1
via commit b7289da moves it one level up. What is the consideration leading to this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @wlin7 yes that's not right at all. Just to confirm L397 should be pblh_get_level_idx = k
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, @vanroekel . If no other special consideration, let us change it back to pblh_get_level_idx = k
, to maintain consistency with the original implementation. This is also suggested by the developer @jinboxie. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @wlin7 I've made the correction and am about to launch a test of this change. Will report back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With line 397 changed back to pblh_get_level_idx = k
, a multi-year test with v3.LR confirms this PR is BFB with current master, as expected, when the new oro gravity drag scheme is turned on. That also means out-of-bound violation never occur even with the original code in the test (v3.LR.F2010
for 6 years on chrysalis).
@xuezhengllnl this is the PR you want in for v3.HR? Is it finished? Is it BFB and just fixes a potential problem? |
032e2b4
to
384a7ce
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All look good. And the changes have been verified BFB with this PR (and final changes to Line 397) pre-merged to master for a multi-year v3.LR test that turns on new orographic drag scheme.
Merged to next. |
This PR addresses two bugs found when coupling and running a F20TR case
on a LANL institutional machine at ne120 resolution with DEBUG enabled.
Addresses #7298
[BFB]