EAMxx: histogram diagnostic #7615

cjvogl · 2025-08-18T21:01:59Z

Implementation of online histograms. As added to the documentation...

To select the histogram, you need to suffix a field name X with
_histogram_ followed by the values specifying the edges of the bins,
separated by _. For example, the histogram specified by
T_mid_histogram_250_270_290_310 has 5 bins defined, effectively, by
($-\infty$, 250), [250, 270), [270, 290), [290, 310), and [310, $\infty$).

The implementation distributes the histogram bins for parallelism (each thread goes over all columns), although an alternative would be to distribute the columns and leverage an atomic for the bin. I am open to discussion on this.

github-actions · 2025-08-18T21:06:09Z

PR Preview Action v1.6.2
🚀 View preview at https://E3SM-Project.github.io/E3SM/pr-preview/pr-7615/
Built to branch `gh-pages` at 2025-09-08 23:49 UTC. Preview will be ready when the GitHub Pages deployment is complete.

mahf708 · 2025-08-18T22:15:53Z

@cjvogl i'm all for a general statistics utility¹(including letting those bins be either linear or logarithmic) but some quick concerns:

does this scale well in your experience/testing?
at least for the small-bin-number example you give, how is this different from doing count-based conditional sampling each bin which we have in the mainline code now? count_where_T_mid_gt_300_where_T_mid_lt_310 ... etc.

I would go for something like that has the syntax of [Field]_(hist|pdf|cdf|...)_[NumBins]_(lin|log) ↩

cjvogl · 2025-08-18T23:33:44Z

@cjvogl i'm all for a general statistics utility1(including letting those bins be either linear or logarithmic) but some quick concerns:
* does this scale well in your experience/testing?

Admittedly, I haven't done scaling tests; however, I can say that the implementation is parallel in bins (i.e., the bins are distributed over threads), so I would expect scaling roughly proportional to the number of columns. The other approach I considered is parallel in columns with atomics to avoid race conditions in increment the bin counters; however, I am not sure the potential improvement in performance will be realized w.r.t. to number of columns

* at least for the small-bin-number example you give, how is this different from doing count-based conditional sampling each bin which we have in the mainline code now? `count_where_T_mid_gt_300_where_T_mid_lt_310` ... etc.

It's not all that different really for a small number of bins. For a larger number of bins, we felt it more convenient to have the histogram stored as a single output variable rather than across numerous variables. I suppose there might be additional benefit to computing the bins in parallel (as implemented) rather than serially as separate diagnostics (assuming the diagnostics are evaluated in serial).

In general, I'm certainly open to discussion about these concerns, other concerns, and/or implementation discussion

mahf708 · 2025-08-19T00:24:25Z

For a larger number of bins, we felt it more convenient to have the histogram stored as a single output variable rather than across numerous variables.

Yeah, that's good motivation.

How do you feel about supporting a generic [Field]_(hist|pdf|cdf|...)_[NumBins]_(lin|log) or [Field]_(hist|pdf|cdf|...)_[NumBins]_(lin|log)_[MIN]_[MAX] instead of specifying the bins with a long chain of numbers? The tedious name actually isn't a big deal. I am more concerned with more generalizability. But feel free to say 'no thanks' for now (can be convinced with the argument of no need to overgeneralize with no need right away)

I will review soon

mahf708

looks good to me

two comments but nothing blocking

How do you feel about supporting a generic [Field](hist|pdf|cdf|...)[NumBins](lin|log) or [Field](hist|pdf|cdf|...)[NumBins](lin|log)[MIN][MAX] instead of specifying the bins with a long chain of numbers? The tedious name actually isn't a big deal. I am more concerned with more generalizability. But feel free to say 'no thanks' for now (can be convinced with the argument of no need to overgeneralize with no need right away)
at some point we should move these nifty utils to field utils, but not rush to do stuff right now

mahf708 · 2025-08-19T00:30:02Z

components/eamxx/src/share/io/eamxx_io_utils.cpp

@@ -131,7 +131,7 @@ create_diagnostic (const std::string& diag_field_name,
  // Note: the number for field_at_p/h can match positive integer/floating-point numbers
  // Start with a generic for a field name allowing for all letters, all numbers, dash, dot, plus, minus, product, and division
  // Escaping all the special ones just in case
-  std::string generic_field = "([A-Za-z0-9_.+\\-\\*\\÷]+)"; 
+  std::string generic_field = "([A-Za-z0-9_.+\\-\\*\\÷]+)";


ugh!!! I need to get those \\*\\÷ out (don't worry about it here)

components/eamxx/src/share/io/eamxx_io_utils.cpp

mahf708 · 2025-08-19T00:33:40Z

components/eamxx/src/diagnostics/histogram.cpp

+  auto histogram_layout = m_diagnostic_output.get_header().get_identifier().get_layout();
+  const int num_bins = histogram_layout.dim(0);
+
+  m_diagnostic_output.deep_copy(sp(0.0));


deep_copy isn't cheap, wanna move it to init impl?

Good catch! After going back, I'm not sure any initialization is needed, so I removed it entirely in 71ec166

mahf708 · 2025-08-19T00:35:52Z

components/eamxx/src/diagnostics/histogram.cpp

+
+  m_diagnostic_output.deep_copy(sp(0.0));
+  auto bin_values_view = m_bin_values.get_view<Real *>();
+  auto histogram_view = m_diagnostic_output.get_view<Real *>();


did you consider making this an int? I don't know if making an int is a good idea (not sure how it would chain against other ops like vert_avg or just even averaging in IO ops...)

Good question: my original implementation used ints but then I ran into IO issues. I don't quite remember the error but I can reproduce if it would be helpful?

Our IO layer will likely either crash or produce garbage if we try to do the time average of an int variable. As for chaining with other diags, that's also a question mark. I am ok keeping the data type Real for now. We have lots of cleanup work to do in the diags/IO layer anyways, so we can just add this to the list...

Ah right, that makes a lot of sense. Thanks for the clarification!

mahf708 · 2025-08-19T00:37:08Z

components/eamxx/src/diagnostics/histogram.cpp

+            const int bin_i = tm.league_rank();
+            const Real bin_lower = bin_values_view(bin_i);
+            const Real bin_upper = bin_values_view(bin_i+1);
+            histogram_view(bin_i) = 0;


no need for this, right?

Good catch! This is removed in 71ec166.

mahf708 · 2025-08-19T00:37:16Z

components/eamxx/src/diagnostics/histogram.cpp

+            const int bin_i = tm.league_rank();
+            const Real bin_lower = bin_values_view(bin_i);
+            const Real bin_upper = bin_values_view(bin_i+1);
+            histogram_view(bin_i) = 0;


no need, right?

Another good catch! This is also removed in 71ec166.

mahf708 · 2025-08-19T00:38:35Z

components/eamxx/src/diagnostics/histogram.cpp

+  }
+
+  // TODO: use device-side MPI calls
+  // TODO: the dev ptr causes problems; revisit this later


at some point we should actually revisit this ;) I put this TODO here because I wasn't interested in investigating an error I was getting, but I should actually look into it again... thanks for keeping the TODO

whannah1 · 2025-08-19T14:37:47Z

@cjvogl I don't see any area weighting being applied here. Did I miss it? This seems like an important thing to include. Otherwise the area under the curve of the histogram won't match the area-weighted global mean.

mahf708 · 2025-08-19T15:05:43Z

@cjvogl I don't see any area weighting being applied here. Did I miss it? This seems like an important thing to include. Otherwise the area under the curve of the histogram won't match the area-weighted global mean.

hmmm, this is just counting the occurrence though; why would you need to scale that by area?

edit: count_where_T_mid_gt_300_where_T_mid_lt_310_vert_avg_dp_weighted_horiz_avg does scale by both dp in the vertical and area in the horizontal.... but that's kinda explicit. I am not sure which one is more correct for a histogram...

bartgol · 2025-08-19T17:04:34Z

@cjvogl how urgent is this PR? After the last sprint with diags capabilities, I was thinking to pause new diags, and implement a more flexible yaml specification paradigm. Eg, I was thinking of a syntax along the line of

fields:
  physics_pg2:
     fields_names:
       - my_diag_1
       - my_diag_2
diagnostics:
  my_diag_1:
    type: horiz_avg
    input: T_mid
  my_diag_2:
    type: sampling
    input: my_diag_3
    condition: qv GT 0.00001
  my_diag_3:
    type: vert_derivative
    coordinate: pressure
    input: T_mid

This more verbose framework should make it easier to request a histogram diag:

  physics_pg2:
     fields_names:
       - my_hist
diagnostics:
  my_hist:
    type: histogram
    bins: [-9999, 0, 10, 100, 1000, 999999]

The current way of requesting diags makes requesting histograms a bit hard, imho. So, if not too urgent, maybe we can postpone this? I would think the new syntax may be avail in like ~2 months?

cjvogl · 2025-08-22T15:30:32Z

@cjvogl I don't see any area weighting being applied here. Did I miss it? This seems like an important thing to include. Otherwise the area under the curve of the histogram won't match the area-weighted global mean.

hmmm, this is just counting the occurrence though; why would you need to scale that by area?

edit: count_where_T_mid_gt_300_where_T_mid_lt_310_vert_avg_dp_weighted_horiz_avg does scale by both dp in the vertical and area in the horizontal.... but that's kinda explicit. I am not sure which one is more correct for a histogram...

First, a disclaimer: I wrote this while on a flight... so it's probably way longer than it needs to be.

I was going for a histogram defined along the lines of given a data set ${v_1,\ldots,v_n}$ the histogram value $h_i$ for the bin $[V_{i}^-,V_{i}^+)$ are the number of $v_k$ s.t. $V_{i}^- \leq v_k < V_{i}^+$, so all the histogram values should add up to $n$, i.e., $\sum_{i=1}^M h_i = n$. I believe the area under the curve can be viewed as

$h_i (V_{i}^+ - V_{i}^-)$ approximates $\int_{V_{i}^-}^{V_{i}^+} h(v) dv$, i.e., the expected count of values between $V_{i}^-$ and $V_{i}^+$ for $n$ randomly sampled points
$\sum_{i=1}^M h_i (V_{i}^+ - V_{i}^-)$ approximates $\int_{V_1^-}^{V_M^+} h(v) dv$, i.e., the expected count of values between $V_1^-$ and $V_M^+$ for $n$ randomly sampled points

That all said, I'm happy to support a different flavor of histrogram/distribution measurement... @whannah1 , let me know if there's a different type of measurement that would be more informative than the histogram as defined above.

cjvogl · 2025-08-22T15:33:07Z

@cjvogl how urgent is this PR? After the last sprint with diags capabilities, I was thinking to pause new diags, and implement a more flexible yaml specification paradigm. Eg, I was thinking of a syntax along the line of
fields:
  physics_pg2:
     fields_names:
       - my_diag_1
       - my_diag_2
diagnostics:
  my_diag_1:
    type: horiz_avg
    input: T_mid
  my_diag_2:
    type: sampling
    input: my_diag_3
    condition: qv GT 0.00001
  my_diag_3:
    type: vert_derivative
    coordinate: pressure
    input: T_mid
This more verbose framework should make it easier to request a histogram diag:
  physics_pg2:
     fields_names:
       - my_hist
diagnostics:
  my_hist:
    type: histogram
    bins: [-9999, 0, 10, 100, 1000, 999999]
The current way of requesting diags makes requesting histograms a bit hard, imho. So, if not too urgent, maybe we can postpone this? I would think the new syntax may be avail in like ~2 months?

This new syntax looks very promising. I'm checking on the timeline with the higher ups (and putting a todo item here for myself so I don't forget to follow up).

check on timeline for PR

AaronDonahue · 2025-08-22T17:17:40Z

@cjvogl how urgent is this PR? After the last sprint with diags capabilities, I was thinking to pause new diags, and implement a more flexible yaml specification paradigm. Eg, I was thinking of a syntax along the line of

@bartgol would this new syntax also support my task of collapsing a bunch of conditional diags into a single variable? For example, binning over the vertical?

mahf708 · 2025-08-22T17:40:48Z

Before implementing a new syntax for IO, you'd want to consult widely with users. I think the proposed syntax is a regression in quality; the main problem is quadrupling in verbosity (LOC), which I think is unacceptable on top of our already very verbose way of requesting output...

For the purposes here, I suggest to @cjvogl to reconsider the hist syntax to be something more like: field_histogram_MinVal_MaxVal_NumBins_(Lin|Log)

But I approve of the current way to get things rolling --- note there's a cuda bug somewhere, but it is easily fixable.

cjvogl · 2025-08-26T17:15:34Z

@cjvogl how urgent is this PR? After the last sprint with diags capabilities, I was thinking to pause new diags, and implement a more flexible yaml specification paradigm. Eg, I was thinking of a syntax along the line of
fields:
  physics_pg2:
     fields_names:
       - my_diag_1
       - my_diag_2
diagnostics:
  my_diag_1:
    type: horiz_avg
    input: T_mid
  my_diag_2:
    type: sampling
    input: my_diag_3
    condition: qv GT 0.00001
  my_diag_3:
    type: vert_derivative
    coordinate: pressure
    input: T_mid
This more verbose framework should make it easier to request a histogram diag:
  physics_pg2:
     fields_names:
       - my_hist
diagnostics:
  my_hist:
    type: histogram
    bins: [-9999, 0, 10, 100, 1000, 999999]
The current way of requesting diags makes requesting histograms a bit hard, imho. So, if not too urgent, maybe we can postpone this? I would think the new syntax may be avail in like ~2 months?
This new syntax looks very promising. I'm checking on the timeline with the higher ups (and putting a todo item here for myself so I don't forget to follow up).
* [ ]  check on timeline for PR

@bartgol, I got some input on the timeline for this diagnostic: we'd like to have the diagnostic by end of the FY... so I would vote for merging with the current syntax and updating as needed in a future PR.

cjvogl · 2025-08-26T17:17:19Z

Before implementing a new syntax for IO, you'd want to consult widely with users. I think the proposed syntax is a regression in quality; the main problem is quadrupling in verbosity (LOC), which I think is unacceptable on top of our already very verbose way of requesting output...

For the purposes here, I suggest to @cjvogl to reconsider the hist syntax to be something more like: field_histogram_MinVal_MaxVal_NumBins_(Lin|Log)

But I approve of the current way to get things rolling --- note there's a cuda bug somewhere, but it is easily fixable.

@mahf708 , I check with the main customer of this diagnostic and they need greater generality than uniform linear or logarithmic bin spacing. I'll take a look at the cuda bug.

mahf708 · 2025-08-27T12:57:57Z

@mahf708 , I check with the main customer of this diagnostic and they need greater generality than uniform linear or logarithmic bin spacing. I'll take a look at the cuda bug.

That's fine for now. But FYI, in general, I couldn't care less about a specific "main customer" and their wishes in adding code to the mainline eamxx. The code added to eamxx must be generically appropriate for both science uses and dev practices. Ideally, it should reflect practices in the wider community. For the latter, I usually try to see how numpy does things and follow, e.g., https://numpy.org/doc/stable/reference/generated/numpy.histogram.html. In this case, your implementation aligns with the first example usage case of numpy, which is good enough as the default.

In the future, we will allow users to specify diags like this without adding code to eamxx at all... but that will wait a long while.

bartgol · 2025-08-27T14:21:25Z

components/eamxx/src/diagnostics/histogram.cpp

+            const Real bin_upper = bin_values_view(bin_i+1);
+            Kokkos::parallel_reduce(Kokkos::TeamVectorRange(tm, d1),
+                [&](int i, Real &val) {
+                  if ((bin_lower <= field_view(i)) && (field_view(i) < bin_upper))


What happens if an entry falls outside of all bins? E.g., there's a fillValue, or maybe the input bins were a bit too narrow? The current impl simply "skips" this entry, without any error/warning, are you ok with that?

Also, this impl seems to rely on the fact that bins boundaries are well ordered. Perhaps we should check that in the constructor, to avoid odd behaviors in case of a typo like F_100_200_30 (missing a 0 for the last value).

The current impl simply "skips" this entry, without any error/warning, are you ok with that?

That was intentional ("closed bins" at the ends of the histogram), with the initial idea that the user could put in large magnitude values on end bins. That said, we were also looking at considering the end bins as "open" to capture all the values... I suppose the latter is necessary if someone wants to do something like normalize the histogram by the total number of values (to get an approximate PDF). I'm now in favor of the open end bins, so I'll double check with collaborators and implement the change.

Perhaps we should check that in the constructor, to avoid odd behaviors in case of a typo like F_100_200_30 (missing a 0 for the last value).

Good call, I'll implement a check that the bin boundaries are monotonically increasing

I've implement both the "infinite" extent (open ended bins) and the monotonicity check in e1a7a01... with the documentation updated in 7a873bb

bartgol · 2025-08-27T14:23:01Z

components/eamxx/src/diagnostics/histogram.cpp

+  // extract bin values from configuration
+  std::istringstream stream(m_bin_configuration);
+  std::string token;
+  std::vector<Real> bin_values;


FYI, you can just do

auto bin_values = ekat::split(m_bin_configuration,"_");

This utility is in the header ekat_string_utils.hpp, and mimics python's strings split method.

Refactored to leverage ekat::split in 4e3985a. I'll note in the refactor, namely in adding the appropriate include to histogram.cpp, a few missing includes in other EAMxx files were uncovered and addressed in b6dc9b3

We recently had a PR going in to fix the missing includes in the ml correction processes.

… monotonic bin values (test updated too)

rljacob · 2025-09-18T17:21:20Z

Still has testing fails.

…d of host view

cjvogl · 2025-09-20T01:20:50Z

Still has testing fails.

I finally have access to a machine with a CUDA compiler to address the failures. I believe ffea867 addresses the issues, but I do not yet have a bank on said machine to run the full tests (although the unit tests pass)

mahf708 approved these changes Aug 19, 2025

View reviewed changes

mahf708 added the CI: approved Allow gh actions PR testing on ghci-snl-* machines label Aug 19, 2025

mahf708 marked this pull request as draft August 19, 2025 00:41

mahf708 marked this pull request as ready for review August 19, 2025 00:41

rljacob assigned bartgol Aug 21, 2025

rljacob added the EAMxx Issues related to EAMxx label Aug 21, 2025

mahf708 approved these changes Aug 22, 2025

View reviewed changes

bartgol reviewed Aug 27, 2025

View reviewed changes

cjvogl force-pushed the feature-eamxx-histogram_diagnostic branch from 665c06e to 7a873bb Compare September 8, 2025 23:28

cjvogl added 7 commits September 8, 2025 16:29

added histogram diagnostic that is passing initial tests

e1389de

histogram diagnostic now supports up to rank 3 and passes unit tests

9d43dd6

updated histogram diag test to support multiple MPI ranks

744ba86

added regex to eamxx_io_utils for histogram diag

6ed0565

converted histogram diag to use Real instead of Int

b6beac7

addressed some issues in histogram after rebase

240379d

added documentation for histogram diagnostic

b0e527c

cjvogl added 5 commits September 8, 2025 16:30

removed unnecessary initializations in histogram diagnostic

382b4ea

addressed missing includes

69e7ab0

leveraged ekat::split utility in histogram diagnostic

42f744f

added infinite extents to histogram diagnostics, along with check for…

ee367fb

… monotonic bin values (test updated too)

updated diagnostic documentation for histogram

26f9853

cjvogl force-pushed the feature-eamxx-histogram_diagnostic branch from 7a873bb to 26f9853 Compare September 8, 2025 23:30

correcting rebase issue in histogram diagnostics

2598f09

bartgol added CI: approved Allow gh actions PR testing on ghci-snl-* machines and removed CI: approved Allow gh actions PR testing on ghci-snl-* machines labels Sep 9, 2025

fixed issue in histogram diagnostic where device view was used instea…

ffea867

…d of host view

EAMxx: histogram diagnostic #7615

Are you sure you want to change the base?

EAMxx: histogram diagnostic #7615

Conversation

cjvogl commented Aug 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Aug 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Built to branch gh-pages at 2025-09-08 23:49 UTC. Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

mahf708 commented Aug 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Footnotes

Uh oh!

cjvogl commented Aug 18, 2025

Uh oh!

mahf708 commented Aug 19, 2025

Uh oh!

mahf708 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

whannah1 commented Aug 19, 2025

Uh oh!

mahf708 commented Aug 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bartgol commented Aug 19, 2025

Uh oh!

cjvogl commented Aug 22, 2025

Uh oh!

cjvogl commented Aug 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AaronDonahue commented Aug 22, 2025

Uh oh!

mahf708 commented Aug 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cjvogl commented Aug 26, 2025

Uh oh!

cjvogl commented Aug 26, 2025

Uh oh!

mahf708 commented Aug 27, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

cjvogl commented Aug 18, 2025 •

edited

Loading

github-actions bot commented Aug 18, 2025 •

edited

Loading

Built to branch `gh-pages` at 2025-09-08 23:49 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

mahf708 commented Aug 18, 2025 •

edited

Loading

mahf708 left a comment •

edited

Loading

mahf708 commented Aug 19, 2025 •

edited

Loading

cjvogl commented Aug 22, 2025 •

edited

Loading

mahf708 commented Aug 22, 2025 •

edited

Loading

bartgol Sep 9, 2025 •

edited

Loading