Skip to content

Conversation

PeterHjortLauritzen
Copy link
Collaborator

Closes #1360

@PeterHjortLauritzen
Copy link
Collaborator Author

PeterHjortLauritzen commented Aug 18, 2025

I am not getting B4B in this test:

ERP_D_Ln9.ne30pg3_ne30pg3_mg17.FHISTC_LTso.derecho_intel.cam-outfrq9s

 ./create_test --output-root /glade/derecho/scratch/pel/ --project P93300042 ERP_D_Ln9.ne30pg3_ne30pg3_mg17.FHISTC_LTso.derecho_intel.cam-outfrq9s

(low top FHISTC with CSLAM)

@johnmauff: Could these changes be round-off (order of operation changes?)

end do
end do
if (present(fotherpanel)) then
fotherpanel (1-nht:nc+nht,1-nht:0 ,1)=fcube(1-nht:nc+nht,1-nht:0 )
do halo=1,nhr
ftmp(:) = fcube(:,halo)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that the copy to ftmp is not needed here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

!
! fill in "n" on Figure above
!
do halo=1,nhr
ftmp(:) = fcube(:,halo)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ftmp(:) is not used.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@johnmauff
Copy link
Collaborator

johnmauff commented Aug 18, 2025 via email

Comment on lines +654 to +658
if(ns==3) then
dotproduct = DotProduct_3(w,f)
else
dotproduct = DotProduct_gen(w,f,ns)
endif
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnmauff I guess the NBFB answer comes from this function? When ns = 3, the dot product is hard coded for performance optimization, but the truncation error will be different from the general version since they are now computed in a single line.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried the general algorithm but that did not change BFB differences with the baseline.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PeterHjortLauritzen I was able to produce BFB result with the baseline after doing the following things:

  1. Fixed the typo (confirm by @johnmauff ) at line 1797 (https://github.com/ESCOMP/CAM/pull/1365/files#diff-8beef36006cafdecbf26406cf4357fc0797eb6c7316d9c6aa0a2486fade88f25R1797).
  2. Merged the latest cam_development branch into the dennis_perf_cslam1 branch (currently the dennis_perf_cslam1 branch is based on the tag cam6_4_089).

@PeterHjortLauritzen
Copy link
Collaborator Author

PeterHjortLauritzen commented Aug 20, 2025

I am not getting B4B in this test:

ERP_D_Ln9.ne30pg3_ne30pg3_mg17.FHISTC_LTso.derecho_intel.cam-outfrq9s

 ./create_test --output-root /glade/derecho/scratch/pel/ --project P93300042 ERP_D_Ln9.ne30pg3_ne30pg3_mg17.FHISTC_LTso.derecho_intel.cam-outfrq9s

(low top FHISTC with CSLAM)

@johnmauff: Could these changes be round-off (order of operation changes?)

I ran the baroclinic wave test case (FKESSLER) and compared the optimized version against the baseline/trunk. Below is PS at day 10:

Screenshot 2025-08-20 at 11 56 55 AM

For comparison, here is a pertlim test (in this case perturbing PS by 1E-14):

Screenshot 2025-08-20 at 2 35 56 PM

The pertlim test produces errors about 100× smaller. It’s unclear whether it matters that the optimized code introduces round-off errors at every time step, whereas the pertlim test only introduces them at initialization. I'll keep looking/thinking ...

UPDATE: all tests were due to code bug ... all tests (I am running) are now BFB

@sjsprecious
Copy link
Collaborator

@PeterHjortLauritzen I am curious: I thought threading was not supported for the SE dycore (#941). Thus why the ERP test still works here?

@sjsprecious
Copy link
Collaborator

@PeterHjortLauritzen I ran the ERP_D_Ln9.ne30pg3_ne30pg3_mg17.FHISTC_LTso.derecho_intel.cam-outfrq9s test on Derecho and found that the second run cut the node number by half but did not change the thread number (still 1). Thus this is actually an ERS test?

do halo=1,nhr
! ftmp(:) = fcube(nc+1-halo,:) ! copy to a temporary
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PeterHjortLauritzen @johnmauff I think this line should not be commented out?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I missed that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sjsprecious ! Now the results make sense ... see plots above (in a couple of minutes)

@PeterHjortLauritzen
Copy link
Collaborator Author

@PeterHjortLauritzen I ran the ERP_D_Ln9.ne30pg3_ne30pg3_mg17.FHISTC_LTso.derecho_intel.cam-outfrq9s test on Derecho and found that the second run cut the node number by half but did not change the thread number (still 1). Thus this is actually an ERS test?

Don't know; I just took the test from the CAM test list without thinking much about the actual test (correct! Threading is currently broken in the SE dycore). @nusbaume Do you know the answer to @sjsprecious 's question?

@PeterHjortLauritzen
Copy link
Collaborator Author

All the tests I am running are BFB now! Thanks @johnmauff and @sjsprecious ... @cacraigucar: This PR is ready to go ...

@nusbaume
Copy link
Collaborator

Hi @PeterHjortLauritzen @sjsprecious, an ERP test takes the default thread and task count from the first case and divides it by 2 for the second case if the number is greater than one. You can see that logic in the CIME code here:

https://github.com/ESMCI/cime/blob/master/CIME/SystemTests/erp.py#L32

Given that the default configuration for an SE dycore run is one thread but multiple MPI tasks, it ends up adjusting the tasks but not the threads (which is what allows the test to run with this CAM configuration in the first place).

Also, my understanding is that the difference between ERS and ERP is that an ERS test won't change the task layout at all and simply checks that the restart run is bit-for-bit, while the ERP test will halve the processor count for the restarted run before checking if the results are bit-for-bit.

Anyways, I hope that helps, and thanks again for getting these improvements into CAM!

@sjsprecious
Copy link
Collaborator

Hi @PeterHjortLauritzen @sjsprecious, an ERP test takes the default thread and task count from the first case and divides it by 2 for the second case if the number is greater than one. You can see that logic in the CIME code here:

https://github.com/ESMCI/cime/blob/master/CIME/SystemTests/erp.py#L32

Given that the default configuration for an SE dycore run is one thread but multiple MPI tasks, it ends up adjusting the tasks but not the threads (which is what allows the test to run with this CAM configuration in the first place).

Also, my understanding is that the difference between ERS and ERP is that an ERS test won't change the task layout at all and simply checks that the restart run is bit-for-bit, while the ERP test will halve the processor count for the restarted run before checking if the results are bit-for-bit.

Anyways, I hope that helps, and thanks again for getting these improvements into CAM!

Thanks @nusbaume for your clarification. Clearly I misunderstood the ERS and ERP tests before, but now it is clear to me.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PeterHjortLauritzen could you look at line 487 of this file? The denominator is in a form that is inconsistent with other polynomial evaluations. Is this missing a parens? I.e. should it be invtmp = 1.0_r8 / (recons(6,i,j) + spherecentroid(1,i,j))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

4 participants