Skip to content

Conversation

cjvogl
Copy link
Contributor

@cjvogl cjvogl commented Jul 2, 2025

The PR refactors the zonal average diagnostic (recently introduced in #7261) so each column is assigned to a zonal bin during the diagnostic initialization, as opposed to each time the diagnostic is computed.

The assignment is done via mapping from zonal bin index to a list of column indices, which belong to that zonal bin. The mapping is implemented as an Int ** view where the (i,j)-th entry is

  • j=0: the number of columns in the i-th zonal bin
  • j>0: a column index that belongs to the i-th zonal bin

@cjvogl cjvogl added enhancement EAMxx Issues related to EAMxx wip work in progress labels Jul 2, 2025
@cjvogl cjvogl marked this pull request as ready for review July 11, 2025 19:36
@cjvogl cjvogl removed the wip work in progress label Jul 11, 2025
@bartgol bartgol added the CI: approved Allow gh actions PR testing on ghci-snl-* machines label Jul 14, 2025
@cjvogl cjvogl force-pushed the improvement-eamxx-zonal_average_diagnostic branch from dcdcea8 to d138e88 Compare July 21, 2025 17:46
Copy link

github-actions bot commented Jul 21, 2025

PR Preview Action v1.6.2

🚀 View preview at
https://E3SM-Project.github.io/E3SM/pr-preview/pr-7481/

Built to branch gh-pages at 2025-09-22 23:12 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@rljacob
Copy link
Member

rljacob commented Jul 24, 2025

note: need to address test fails.

@cjvogl
Copy link
Contributor Author

cjvogl commented Jul 24, 2025

note: need to address test fails.

Perhaps I'm merely stating the obvious, but the test fails are because my GitHub account is not approved to run them. Perhaps @tcclevenger can restart them? Or I'm happy to if given approval.

[2025-07-22T01:26:10Z ERROR verification::job_approver] Initiating User cjvogl is not approved to run jobs on this machine.

@bartgol bartgol added CI: approved Allow gh actions PR testing on ghci-snl-* machines and removed CI: approved Allow gh actions PR testing on ghci-snl-* machines labels Jul 24, 2025
@bartgol
Copy link
Contributor

bartgol commented Jul 24, 2025

note: need to address test fails.

Perhaps I'm merely stating the obvious, but the test fails are because my GitHub account is not approved to run them. Perhaps @tcclevenger can restart them? Or I'm happy to if given approval.

[2025-07-22T01:26:10Z ERROR verification::job_approver] Initiating User cjvogl is not approved to run jobs on this machine.

Ah, I didn't notice that. The CI runner script checks the time of last approval (or the time the CI: Approved label was set), to make sure it was after the last commit. I re-set the label, and re-triggered the jobs. They should run now.

@cjvogl cjvogl added CI: approved Allow gh actions PR testing on ghci-snl-* machines and removed CI: approved Allow gh actions PR testing on ghci-snl-* machines labels Jul 24, 2025
@cjvogl
Copy link
Contributor Author

cjvogl commented Jul 24, 2025

@bartgol , my apologies: I had another errant, doubly-nested KOKKOS_LAMBDA macro that was tripping up the cuda compiler. Can you re-run the tests? (I did reset the CI:Approved flag in an attempt to help)

@tcclevenger tcclevenger added CI: approved Allow gh actions PR testing on ghci-snl-* machines and removed CI: approved Allow gh actions PR testing on ghci-snl-* machines labels Jul 24, 2025
@cjvogl cjvogl removed the CI: approved Allow gh actions PR testing on ghci-snl-* machines label Jul 25, 2025
@cjvogl
Copy link
Contributor Author

cjvogl commented Aug 1, 2025

This is ready for the CI tests again... apologies to @tcclevenger that I didn't catch the bug that was caught in the last round of CI tests: the tests oddly completely fine with running with more than 1 thread (but caught the bug with the 1 thread that the CI uses).

@cjvogl
Copy link
Contributor Author

cjvogl commented Aug 1, 2025

Err, let me first address the merge conflicts.

@cjvogl cjvogl force-pushed the improvement-eamxx-zonal_average_diagnostic branch from 90f8ab5 to 5bddc2f Compare August 2, 2025 00:10
@cjvogl
Copy link
Contributor Author

cjvogl commented Aug 2, 2025

@tcclevenger , can you please add the CI: Approved flag to run the tests?

@bartgol bartgol added the CI: approved Allow gh actions PR testing on ghci-snl-* machines label Aug 2, 2025
@rljacob
Copy link
Member

rljacob commented Aug 14, 2025

Needs a rebase and test fails should be looked at.

@cjvogl cjvogl removed the CI: approved Allow gh actions PR testing on ghci-snl-* machines label Aug 14, 2025
@cjvogl cjvogl force-pushed the improvement-eamxx-zonal_average_diagnostic branch from 5bddc2f to d5bfa35 Compare August 14, 2025 20:50
@cjvogl
Copy link
Contributor Author

cjvogl commented Aug 14, 2025

Needs a rebase and test fails should be looked at.

Thanks for reminding me about this PR. I did the rebase, but I'll need to work on access to a GPU cluster to reproduce the CUDA failures. Until then, I have an idea of what the offending code is but am not sure what the best path forward is. I'll highlight that in inline comments to follow this one.

Comment on lines +213 to +220
for (int col_i=0; col_i < ncols; col_i++)
{
if ((lat_lower <= lat_view(col_i)) && (lat_view(col_i) < lat_upper))
{
bin_to_cols_view(bin_i, 0) += 1;
bin_to_cols_view(bin_i, bin_to_cols_view(bin_i,0)) = col_i;
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this loop is what is causing the CUDA errors (I'll confirm this once I have access to an appropriate machine).

The loop is a plain old for loop because the process of building lists of column indices is inherently serial (i.e., a parallel for loop would run into many race conditions). Looking through the kokkos documentation, I believe this could be made into a parallel_for with the use of either

  1. atomic operations .. although there would technically need to be two atomic operations
  2. a View with an Atomic memory trait
    Any thoughts on one versus the other is greatly appreciated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably accomplish this through a parallel_reduce(), but I would need to take a closer look at the structure to confirm. I'll take a look at this Monday, and review the PR as a whole.

Copy link
Contributor

@bartgol bartgol Aug 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is done during initialization, I am not too concerned about this. That said, you could split the work into 3 kernels:

  1. compute the bin index each col belongs to
pfor (policy, KOKKOS_LAMBDA (int i) {
  int ibin = ... // figure out the index
  col_bin(icol) = ibin;
});
  1. Run a || for over cols with a || reduce to compute the bins size. @tcclevenger prob knows how to use an array as accumulator. Something that looks like
preduce(policy,KOKKOS_LAMBDA(int icol, Array& local_sizes) {
  int ibin = col_bin(icol);
  ++local_sizes[ibin]'
},bin_sizes);
  1. Proceed with the gathering of indices:
pfor(policy,KOKKOS_LAMBDA(int ibin){
  for (int icol=0, k=0; icol<ncols; ++icol) {
    if (col_bin(icol)!=ibin) continue;
      bin_columns(bin,k) = icol;
      ++k;
  }
});

Between 2 and 3 you can also check that the max of bin_sizes is indeed less then max_ncols_per_bin. Your code may be attempting to access the view out of bounds if the max was not large enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not really make the code faster, but it removes the need for a max_ncols_per_bin, as you could create the diagnostic after step 2, once you know all the bins sizes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code may be attempting to access the view out of bounds if the max was not large enough.

This was more-or-less the issue: I was using std::max on device, which was resulting in the max being zero. This should be corrected now in 73dc4dc

This does not really make the code faster, but it removes the need for a max_ncols_per_bin, as you could create the diagnostic after step 2, once you know all the bins sizes.

This makes sense to me. The only reason I opted to keep the original approach is that I don't have experience using an array as an accumulator.

@mahf708 mahf708 added the CI: approved Allow gh actions PR testing on ghci-snl-* machines label Aug 14, 2025
@rljacob
Copy link
Member

rljacob commented Sep 18, 2025

@cjvogl what is the status of this PR?

@cjvogl
Copy link
Contributor Author

cjvogl commented Sep 22, 2025

@cjvogl what is the status of this PR?

@rljacob , I believe I have just addressed the remaining issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI: approved Allow gh actions PR testing on ghci-snl-* machines EAMxx Issues related to EAMxx enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants