-
Notifications
You must be signed in to change notification settings - Fork 435
EAMxx: assign columns to bins during intialization of zonal average #7481
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
dcdcea8
to
d138e88
Compare
|
note: need to address test fails. |
Perhaps I'm merely stating the obvious, but the test fails are because my GitHub account is not approved to run them. Perhaps @tcclevenger can restart them? Or I'm happy to if given approval.
|
Ah, I didn't notice that. The CI runner script checks the time of last approval (or the time the |
@bartgol , my apologies: I had another errant, doubly-nested KOKKOS_LAMBDA macro that was tripping up the cuda compiler. Can you re-run the tests? (I did reset the CI:Approved flag in an attempt to help) |
This is ready for the CI tests again... apologies to @tcclevenger that I didn't catch the bug that was caught in the last round of CI tests: the tests oddly completely fine with running with more than 1 thread (but caught the bug with the 1 thread that the CI uses). |
Err, let me first address the merge conflicts. |
90f8ab5
to
5bddc2f
Compare
@tcclevenger , can you please add the |
Needs a rebase and test fails should be looked at. |
…cit capture of encompassing class)
5bddc2f
to
d5bfa35
Compare
Thanks for reminding me about this PR. I did the rebase, but I'll need to work on access to a GPU cluster to reproduce the CUDA failures. Until then, I have an idea of what the offending code is but am not sure what the best path forward is. I'll highlight that in inline comments to follow this one. |
for (int col_i=0; col_i < ncols; col_i++) | ||
{ | ||
if ((lat_lower <= lat_view(col_i)) && (lat_view(col_i) < lat_upper)) | ||
{ | ||
bin_to_cols_view(bin_i, 0) += 1; | ||
bin_to_cols_view(bin_i, bin_to_cols_view(bin_i,0)) = col_i; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this loop is what is causing the CUDA errors (I'll confirm this once I have access to an appropriate machine).
The loop is a plain old for loop because the process of building lists of column indices is inherently serial (i.e., a parallel for loop would run into many race conditions). Looking through the kokkos documentation, I believe this could be made into a parallel_for
with the use of either
- atomic operations .. although there would technically need to be two atomic operations
- a
View
with anAtomic
memory trait
Any thoughts on one versus the other is greatly appreciated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could probably accomplish this through a parallel_reduce()
, but I would need to take a closer look at the structure to confirm. I'll take a look at this Monday, and review the PR as a whole.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is done during initialization, I am not too concerned about this. That said, you could split the work into 3 kernels:
- compute the bin index each col belongs to
pfor (policy, KOKKOS_LAMBDA (int i) {
int ibin = ... // figure out the index
col_bin(icol) = ibin;
});
- Run a || for over cols with a || reduce to compute the bins size. @tcclevenger prob knows how to use an array as accumulator. Something that looks like
preduce(policy,KOKKOS_LAMBDA(int icol, Array& local_sizes) {
int ibin = col_bin(icol);
++local_sizes[ibin]'
},bin_sizes);
- Proceed with the gathering of indices:
pfor(policy,KOKKOS_LAMBDA(int ibin){
for (int icol=0, k=0; icol<ncols; ++icol) {
if (col_bin(icol)!=ibin) continue;
bin_columns(bin,k) = icol;
++k;
}
});
Between 2 and 3 you can also check that the max of bin_sizes is indeed less then max_ncols_per_bin. Your code may be attempting to access the view out of bounds if the max was not large enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not really make the code faster, but it removes the need for a max_ncols_per_bin, as you could create the diagnostic after step 2, once you know all the bins sizes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code may be attempting to access the view out of bounds if the max was not large enough.
This was more-or-less the issue: I was using std::max
on device, which was resulting in the max being zero. This should be corrected now in 73dc4dc
This does not really make the code faster, but it removes the need for a max_ncols_per_bin, as you could create the diagnostic after step 2, once you know all the bins sizes.
This makes sense to me. The only reason I opted to keep the original approach is that I don't have experience using an array as an accumulator.
@cjvogl what is the status of this PR? |
The PR refactors the zonal average diagnostic (recently introduced in #7261) so each column is assigned to a zonal bin during the diagnostic initialization, as opposed to each time the diagnostic is computed.
The assignment is done via mapping from zonal bin index to a list of column indices, which belong to that zonal bin. The mapping is implemented as an
Int **
view where the(i,j)
-th entry isj=0
: the number of columns in thei
-th zonal binj>0
: a column index that belongs to thei
-th zonal bin