Skip to content

Conversation

cjvogl
Copy link
Contributor

@cjvogl cjvogl commented Apr 17, 2025

adds a zonal average average diagnostic (zonal_avg), where the given field is averaged in the zonal direction over uniform latitude ranges. The user can specify how many ranges to use via the num_lat_vals parameter.

The diagnostic uses the CMP tag with name "bin", so I/O is updated to specifying dimension name binN where N is the number of bins.

[BFB]

Copy link
Contributor

@bartgol bartgol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks quite useful. I have one question: do you really care about having a new FieldTag (BIN) or do you just care about the name it gets in the output nc file?

If you are ok with using the existing CMP (with a different string attached, like "bin"), I can quickly add a PR (or push to your branch, whatever you prever), so that we can do that.

Edit: I'm not opposed to add a new tag BIN, since it carries a different meaning than CMP. If we keep BIN though, we need to be a bit more careful with the LayoutType calculation, since it may impact downstream remappers. If we piggy back on CMP+"bin", we can remove the LayoutType issue. Either way works for me.

@@ -74,3 +74,6 @@ CreateDiagTest(atm_backtend "atm_backtend_test.cpp")

# Test horizontal averaging
CreateDiagTest(horiz_avg "horiz_avg_test.cpp")

# Test zonal averaging
CreateDiagTest(zonal_avg "zonal_avg_test.cpp")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the non-local nature of the diag, we should prob test for 2+ ranks as well:

  MPI_RANKS 1 ${SCREAM_TEST_MAX_RANKS})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be addressed in b610ba4

// TODO: doing cuda-aware MPI allreduce would be ~10% faster
Kokkos::fence();
result.sync_to_host();
comm->all_reduce(result.template get_internal_view_data<Real, Host>(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so we're clear, so far this diagnostic will NOT be PE-invariant. That said, @oksanaguba is working on implementing bridges to the repro_sum f90 module, so we can revisit this later. The horiz_avg diag already suffers from the same issue.

Copy link
Contributor Author

@cjvogl cjvogl Apr 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense: the horiz_avg diag is the one I modeled this diag after. I'll keep an eye out for those bridges!

@cjvogl
Copy link
Contributor Author

cjvogl commented Apr 18, 2025

I'm not opposed to add a new tag BIN, since it carries a different meaning than CMP. If we keep BIN though, we need to be a bit more careful with the LayoutType calculation, since it may impact downstream remappers. If we piggy back on CMP+"bin", we can remove the LayoutType issue. Either way works for me.

I'm thinking your CMP+"bin" suggestion is the way to go for now... I suppose if there ends up being a lot of uses of CMP+"bin" in other diagnostics, that could be a good time to introduce BIN.

Thanks for the quick review!

@cjvogl
Copy link
Contributor Author

cjvogl commented Apr 18, 2025

If you are ok with using the existing CMP (with a different string attached, like "bin"), I can quickly add a PR (or push to your branch, whatever you prever), so that we can do that.

I have already worked through something similar as a warm up for trying to introduce a tag, so while I appreciate the offer, I think I can hammer that out quickly!

@cjvogl cjvogl changed the title EAMxx: add zonal average diagnostic and BIN tag EAMxx: add zonal average diagnostic and update IO for custom CMP-like tag names Apr 18, 2025
Copy link
Contributor

@bartgol bartgol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I would consider moving the compute_zonal_avg fcn to the field_utils.hpp method. Feel free to disregard though.


class ZonalAvgDiag : public AtmosphereDiagnostic {

// TODO: comment this, noting it's a utility function that could exist elsewhere
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may consider moving this fcn into share/field/field_utils.hpp. That's where we also have horiz_contraction, and other utilities. Not sure if we will ever use it outside of this diag class, but since it has no dependency on the class internals (it's a static method), we may as well put it out there...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up change it to a local function in #7481 per the suggestions in a few spots of this PR (e.g. #7261 (comment))

@bartgol bartgol added enhancement EAMxx Issues related to EAMxx labels Apr 18, 2025
Copy link
Contributor

@mahf708 mahf708 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things (one easy to fix, the other speculative and more involved)

  1. This is currently not requestable, right? How would one get the zonal average of T_mid? You will need to add the appropriate regex in components/eamxx/src/share/io/eamxx_io_utils.cpp

  2. Is there urgency to make use of this diagnostic impl? If not, let's reconsider the overarching design. Obviously, calculating the zonal average on a structured lat–lon grid is much easier and a much more straightforward reduction. We have been thinking about adding support for structured grids for remapping purposes (e.g., see #7198). If there's no hurry to do this implementation, I'd rather us consider doing this via a remap to an appropriate (tbd, customizable) lat–lon grid, then doing a simple reduction of the resulting remapped field on the LON dim. Generally speaking, most people end up remapping the output to structured grids anyway, and we might as well support that. If we go this route, we will not need the bin dimensionality. Ultimately, while workable, I think a zonal average on unstructured (i.e., not lat–lat) data is a conceptually flawed reduction (imho).

Please fix number 1 and let us know what you think about number 2

@bartgol
Copy link
Contributor

bartgol commented Apr 21, 2025

  1. Is there urgency to make use of this diagnostic impl? If not, let's reconsider the overarching design. Obviously, calculating the zonal average on a structured lat–lon grid is much easier and a much more straightforward reduction. We have been thinking about adding support for structured grids for remapping purposes (e.g., see EAMxx: introducing EACE (EAMxx ACE) compset #7198). If there's no hurry to do this implementation, I'd rather us consider doing this via a remap to an appropriate (tbd, customizable) lat–lon grid, then doing a simple reduction of the resulting remapped field on the LON dim. Generally speaking, most people end up remapping the output to structured grids anyway, and we might as well support that. If we go this route, we will not need the bin dimensionality. Ultimately, while workable, I think a zonal average on unstructured (i.e., not lat–lat) data is a conceptually flawed reduction (imho).

Please fix number 1 and let us know what you think about number 2

I think remapping to a lat-lon grid, and then doing a reduction would be conceptually the same thing though. Not only that, it would require two rounds of MPI, one for the sparse mat-vec for pg2->latlon, and one to globally reduce the zonal band.

@cjvogl
Copy link
Contributor Author

cjvogl commented Apr 21, 2025

This is currently not requestable, right? How would one get the zonal average of T_mid? You will need to add the appropriate regex in components/eamxx/src/share/io/eamxx_io_utils.cpp

@mahf708 , that is correct. I did add the regex in ed34800, but there is now a conflict with master (I presume, at the very least, with the renaming of scream_io_utils to eamxx_io_utils). I recall there was a policy strongly against merging master into development/feature branches, with a preference for rebasing instead... is that still the case? (i.e., should I merge or rebase)?

@mahf708
Copy link
Contributor

mahf708 commented Apr 21, 2025

Kinda yes, I think we all prefer rebasing :)

@cjvogl cjvogl force-pushed the feature-eamxx-zonal_average_diagnostic branch from ed34800 to 8bd8be0 Compare April 21, 2025 21:31
Copy link
Contributor

@mahf708 mahf708 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still like to hear your thoughts about my point 2 above.

Also, to double check, have you tested how this looks like once outputted in a netcdf file to make sure you're happy with how it looks? No worries if not

[edit: undraft/draft to get the ci to run]

@mahf708 mahf708 marked this pull request as ready for review April 21, 2025 21:43
@mahf708 mahf708 marked this pull request as draft April 21, 2025 21:43
@cjvogl cjvogl force-pushed the feature-eamxx-zonal_average_diagnostic branch from 8bd8be0 to 773ff5b Compare May 8, 2025 23:11
@cjvogl cjvogl marked this pull request as ready for review May 10, 2025 00:45
@mahf708 mahf708 marked this pull request as draft May 10, 2025 01:03
@mahf708 mahf708 marked this pull request as ready for review May 10, 2025 01:03
Copy link
Contributor

@mahf708 mahf708 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you'd like to address the linter, feel free to do so, but if I were you, I would only address (i.e., i'd leave the older files for now)

components/eamxx/src/diagnostics/tests/zonal_avg_test.cpp
components/eamxx/src/diagnostics/zonal_avg.hpp
components/eamxx/src/diagnostics/zonal_avg.cpp

Also, would you be willing to rebase to make this into a few coherent commits?

using ESU = ekat::ExeSpaceUtils<typename KT::ExeSpace>;
switch(result_layout.rank())
{
case 1: {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should add support for rank-0 fields as well? If one wants to do a zonal avg of a horiz slice of a 2d quantity, it' be much better to do the slice first, then the zonal avg, since on a 2d field we actually would get coalesced access on GPU in this diagnostic (all other ranks suffer from bad GPU access, but there's not much to do about it).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I got confused. The output field will ALWAYS have rank>=1, since it has the number of bins as a dimension. Hence, disregard my comment.


namespace scream {

void ZonalAvgDiag::compute_zonal_sum(const Field &result,
Copy link
Contributor

@bartgol bartgol May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this diagnostic could be made much more perf by pre-computing the list of cols that belong to each lat bin. Then, at runtime do

pfor(nbins) {
   auto ibin = policy.league_rank
   preduce(MDPolicy(bin_size,dim1,dim2,..),[&](int idx, idim, jdim,...,Real& accum){
      auto icol = bin_cols(idx)
      accum += f(icol,idim,jdim,...);
   },bin(ibin,idim1,idim2,..));
}

Notice that this would maintain coalesced access on GPU, without adding a race condition in the output.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it is now, the non-coalesced access, combined with the recalculation of what's in each bin every time, probably gives mildly bad perf on GPU.

We can integrate as is, if you want, but we should make a priority to fix this.

Copy link
Contributor Author

@cjvogl cjvogl May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. I'm unclear on how the parallel reduce can be over the non-column dimensions when there is a separate zonal average for each of the non-column dimensions. Should perhaps the pfor in your suggestion use a MDPolicy over the bins and non-column dimensions? Or maybe I'm missing something straightforward here?

Perhaps a less abstract version of the question, is that I dont see how the parallel reduce would be able to write to bin(ibin, idim, jdim,...) when idim and jdim are local to the lambda function for the reduction.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My snippet is wrong, since MDPolicy only works for outermost loops. We'd have to use a TeamVectorRange, and do some index arithmetics manually. But MDP vs TVR is not the point anyways.

The point was to have this pseudocode:

loop_over_num_bins 
   par_reduce (over bin_size)
     icol = get-current-index-within-bin
       pfor (over any remaining dim)
         accum(<other-dims>) += f(icol,<other-dims>)

I am not entierly sure we can achieve this, since it would require accum to potentially be an Ndim array. But this 3-level hierarchy would give coalesced access on GPU. I have to check with kk devs if this can be achieved.

Regardless, I think it's fine to merge without worrying about this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that makes sense. I appreciate the explanation.

I have been working on pre-computing which columns are in each zonal band (so each thread team only iterates across the columns in the respective zonal band). Assuming this is still worth adding to the PR (instead of the current approach of having each thread team iterate over all the columns) is there an appropriate data structure to store the column indices? I have it implemented as a simple vector of vectors... but I assume there's a better way and am still coming up to speed with memory management in Kokkos.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can ignore my last request: I'm now seeing a Field can have an int** view... which seems to be the way to go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, I think you can merge, and then improve. But of course, improving and then merging is perfectly fine.

As for the index indirection, you could use a 2d view along with a 1d small view:

  • bin_size(num_bins): a 1d view where bin_size(ibin) is the number of cols in bin ibin.
  • col_lids(num_bins,max_bin_sz): a 2d where col_lids(ibin,k) is the local id of the k-th col in this bin.

This would require a 2 phases setup: a first pass to count the size of each bin (and create bin_size), and a second pass to get the lids of each col in each bin. This would have a bit more storage than a vector-of-vectors (or in this case a view-of-views), but considering all the other storage requirements in eamxx, it may not be too bad.

A view of views would also work though. We do use that in Hommexx's boundary exchange structures, so you can def go down that route. Just be sure to cleanup all the inner views when the diag is destroyed, as the outer view destructor does not properly destroy the inner ones.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions! For now, I'll address the linting requests and rebase requests so this can be merged, with the plan to add those improvements in the next PR.

Copy link
Contributor Author

@cjvogl cjvogl Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a pass at this in #7481. I'm open to any and all suggestions as I'm just getting my head around Kokkos in general.

@bartgol bartgol added the CI: approved Allow gh actions PR testing on ghci-snl-* machines label May 20, 2025
@mahf708 mahf708 added the CI: approved Allow gh actions PR testing on ghci-snl-* machines label May 21, 2025
@mahf708 mahf708 marked this pull request as draft May 21, 2025 03:50
@mahf708 mahf708 marked this pull request as ready for review May 21, 2025 03:50
@mahf708
Copy link
Contributor

mahf708 commented May 21, 2025

@cjvogl, code doesn't compile; let us know if you'd like help fixing stuff (pretty ez to test locally or on pm-gpu if you'd like, see somewhere under https://docs.e3sm.org/E3SM/EAMxx/developer/dev_quickstart). Shouldn't be too hard to fix, but obvz can be pretty tedious if you don't have a setup working on a variety of machines, etc.

otherwise, looks good 🙌

@mahf708 mahf708 marked this pull request as draft May 21, 2025 18:14
@mahf708 mahf708 marked this pull request as ready for review May 21, 2025 18:15
@mahf708 mahf708 requested a review from bartgol May 21, 2025 19:10
// - the first dimension for field, weight, and lat is for the columns (COL)
// - the first dimension for result is for the zonal bins (CMP,"bin")
// - field and result must be the same dimension, up to 3
static void compute_zonal_sum(const Field &result, const Field &field, const Field &weight,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a method at all? Can it just be a local function in the cpp file and be done with it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in #7481

Copy link
Contributor

@bartgol bartgol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to address the dim name issue that Naser identified. After that, I think this is good to be merged. We can address the issue of how to handle tags that don't have a fixed length associated with them at a later stage. I have to think about how to best do it.

@mahf708 mahf708 added the BFB PR leaves answers BFB label May 24, 2025
@mahf708 mahf708 force-pushed the feature-eamxx-zonal_average_diagnostic branch from 346bff5 to 23a74df Compare May 24, 2025 03:28
@mahf708 mahf708 requested a review from bartgol May 24, 2025 03:28
@mahf708
Copy link
Contributor

mahf708 commented May 24, 2025

@bartgol pls review again if possible :)

@cjvogl I fixed a few minor things and I added docs. Pls take a look before we merge

I decided to remove the output from the standalone test for now, and punted on converting the method to a local fxn

Copy link

github-actions bot commented May 24, 2025

PR Preview Action v1.6.1

🚀 View preview at
https://E3SM-Project.github.io/E3SM/pr-preview/pr-7261/

Built to branch gh-pages at 2025-05-28 16:00 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@@ -140,7 +140,7 @@ create_diagnostic (const std::string& diag_field_name,
std::regex vert_layer ("(z|geopotential|height)_(mid|int)$");
std::regex horiz_avg ("([A-Za-z0-9_]+)_horiz_avg$");
std::regex vert_contract ("([A-Za-z0-9_]+)_vert_(avg|sum)(_((dp|dz)_weighted))?$");
std::regex zonal_avg (R"(([A-Za-z0-9_]+)_zonal_avg_with_(\d+)_bins$)");
std::regex zonal_avg (R"(([A-Za-z0-9_]+)_zonal_avg(_with_(\d+)_bins)?$)");
Copy link
Contributor

@mahf708 mahf708 May 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any last-minute thoughts about the naming here? I am not sure how I feel about "with X bins"

Maybe X_zonal_avg_in_Y_lats?

Also, we don't make the "lat" available and instead we rely on users to figure it out. Tha'ts ok, but we should likely add a helper string in the output ... lemme know what you think and I will address it (here or later)

Edit: I think it may make sense to rename "binN" to "latN" and then we can endow this dim with its actual values (the index is there automatically). But can do later, not necessarily now...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would vote for "bandN" or "zoneN". "latN" seems to suggest a single latitude value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a different note: why is _with_ part of the captured group? I don't think using a default number of bins is a great idea, since ppl may forget to set that. I'd rather error out right away, and have users fix the diag name to specify the number of bins.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to do a "default" zonal decomp, maybe we should use more "meaningful" (and not uniform) bounds, to get something like 2x polar zones, 2x temperate zones, and 1-2 tropical zones?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just followed the description of the PR. I am fine with forcing a no-default and erroring out. So, let's see what @cjvogl prefers and we can tweak it.

If you want to do a "default" zonal decomp, maybe we should use more "meaningful"
(and not uniform) bounds, to get something like 2x polar zones, 2x temperate zones, and 1-2 tropical zones?

can you elaborate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, doing 180 bins as default seems very arbitrary, and also maybe too little of a size for bin 1 and 180... I was thinking maybe something like

  • polar N: [66,90]
  • temperate N: [35,66.5]
  • sub-tropical N: [23.5,35]
  • tropical: [-23.5,23.5]
  • sub-tropical S: [-35,-23.5]
  • temperate S: [-66.5,-35]
  • polar S: [-90,-66.5]

They are still somewhat arbitrary, but probably more linked to relevant regions of analysis?

Copy link
Contributor

@mahf708 mahf708 May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's often the case people tend to look at zonal means like designed here, e.g., e3sm_diags zonal plots and zonal-annual plots. The default of 180 (one-lat-deg spacing) is pretty reasonable if we want to keep a default. However, I am fine with getting rid of the defaults, and forcing users to specify one.

What you're listing (polar, mid-lat, subtrop, trop, etc.) is a derivative of the more general zonal average stuff, but it is more tightly related to horiz avg. I would consider adding those regions as special cases of horiz_avg instead of here...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would vote for "bandN" or "zoneN". "latN" seems to suggest a single latitude value.

The motivation behind bin is that one can think of a zonal average as a histogram, and thus a name like "bin" could be reused if someone wants to have a histogram in a different dimension than along latitude. As a concrete example, we intend to introduce a diagnostic where one can output a histogram in the vertical direction to reduce the output size from a high resolution run.

The default of 180 (one-lat-deg spacing) is pretty reasonable if we want to keep a default.

The one-lat-deg spacing was the original idea, but I was not beholden to it (or any default).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. But the [-90,-89) band (e.g.) will have much fewer cols than the [-1,0) band, so is the histogram meaningful there? But that's beyond the scope, i am ok with bin as a name. I will integrate this as soon as testing is done.

@bartgol
Copy link
Contributor

bartgol commented May 28, 2025

@cjvogl @mahf708 Unless I hear from you, I will integrate this at the end of the day.

@cjvogl
Copy link
Contributor Author

cjvogl commented May 28, 2025

I fixed a few minor things and I added docs. Pls take a look before we merge

The changes look great: thanks for them and adding the docs @mahf708 !

Unless I hear from you, I will integrate this at the end of the day.

This looks good to me! Thanks for all the work on this @bartgol

bartgol added a commit that referenced this pull request May 28, 2025
…7261)

Adds a zonal average average diagnostic (zonal_avg),
where the given field is averaged in the zonal direction
over uniform latitude ranges. The user can specify how many
ranges to use via the num_lat_vals parameter.

The diagnostic uses the CMP tag with name "bin", so I/O is updated
to specifying dimension name binN where N is the number of bins.

[BFB]
@bartgol bartgol merged commit bf2eb9b into master May 28, 2025
20 of 22 checks passed
@bartgol bartgol deleted the feature-eamxx-zonal_average_diagnostic branch May 28, 2025 23:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BFB PR leaves answers BFB CI: approved Allow gh actions PR testing on ghci-snl-* machines EAMxx Issues related to EAMxx enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants