Skip to content

Conversation

cjvogl
Copy link
Contributor

@cjvogl cjvogl commented Aug 18, 2025

Implementation of online histograms. As added to the documentation...

To select the histogram, you need to suffix a field name X with
_histogram_ followed by the values specifying the edges of the bins,
separated by _. For example, the histogram specified by
T_mid_histogram_250_270_290_310 has 5 bins defined, effectively, by
($-\infty$, 250), [250, 270), [270, 290), [290, 310), and [310, $\infty$).

The implementation distributes the histogram bins for parallelism (each thread goes over all columns), although an alternative would be to distribute the columns and leverage an atomic for the bin. I am open to discussion on this.

Copy link

github-actions bot commented Aug 18, 2025

PR Preview Action v1.6.2

🚀 View preview at
https://E3SM-Project.github.io/E3SM/pr-preview/pr-7615/

Built to branch gh-pages at 2025-09-08 23:49 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@mahf708
Copy link
Contributor

mahf708 commented Aug 18, 2025

@cjvogl i'm all for a general statistics utility1(including letting those bins be either linear or logarithmic) but some quick concerns:

  • does this scale well in your experience/testing?
  • at least for the small-bin-number example you give, how is this different from doing count-based conditional sampling each bin which we have in the mainline code now? count_where_T_mid_gt_300_where_T_mid_lt_310 ... etc.

Footnotes

  1. I would go for something like that has the syntax of [Field]_(hist|pdf|cdf|...)_[NumBins]_(lin|log)

@cjvogl
Copy link
Contributor Author

cjvogl commented Aug 18, 2025

@cjvogl i'm all for a general statistics utility1(including letting those bins be either linear or logarithmic) but some quick concerns:

* does this scale well in your experience/testing?

Admittedly, I haven't done scaling tests; however, I can say that the implementation is parallel in bins (i.e., the bins are distributed over threads), so I would expect scaling roughly proportional to the number of columns. The other approach I considered is parallel in columns with atomics to avoid race conditions in increment the bin counters; however, I am not sure the potential improvement in performance will be realized w.r.t. to number of columns

* at least for the small-bin-number example you give, how is this different from doing count-based conditional sampling each bin which we have in the mainline code now? `count_where_T_mid_gt_300_where_T_mid_lt_310` ... etc.

It's not all that different really for a small number of bins. For a larger number of bins, we felt it more convenient to have the histogram stored as a single output variable rather than across numerous variables. I suppose there might be additional benefit to computing the bins in parallel (as implemented) rather than serially as separate diagnostics (assuming the diagnostics are evaluated in serial).

In general, I'm certainly open to discussion about these concerns, other concerns, and/or implementation discussion

@mahf708
Copy link
Contributor

mahf708 commented Aug 19, 2025

For a larger number of bins, we felt it more convenient to have the histogram stored as a single output variable rather than across numerous variables.

Yeah, that's good motivation.

How do you feel about supporting a generic [Field]_(hist|pdf|cdf|...)_[NumBins]_(lin|log) or [Field]_(hist|pdf|cdf|...)_[NumBins]_(lin|log)_[MIN]_[MAX] instead of specifying the bins with a long chain of numbers? The tedious name actually isn't a big deal. I am more concerned with more generalizability. But feel free to say 'no thanks' for now (can be convinced with the argument of no need to overgeneralize with no need right away)

I will review soon

Copy link
Contributor

@mahf708 mahf708 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me

two comments but nothing blocking

  • How do you feel about supporting a generic [Field](hist|pdf|cdf|...)[NumBins](lin|log) or [Field](hist|pdf|cdf|...)[NumBins](lin|log)[MIN][MAX] instead of specifying the bins with a long chain of numbers? The tedious name actually isn't a big deal. I am more concerned with more generalizability. But feel free to say 'no thanks' for now (can be convinced with the argument of no need to overgeneralize with no need right away)
  • at some point we should move these nifty utils to field utils, but not rush to do stuff right now

@@ -131,7 +131,7 @@ create_diagnostic (const std::string& diag_field_name,
// Note: the number for field_at_p/h can match positive integer/floating-point numbers
// Start with a generic for a field name allowing for all letters, all numbers, dash, dot, plus, minus, product, and division
// Escaping all the special ones just in case
std::string generic_field = "([A-Za-z0-9_.+\\-\\*\\÷]+)";
std::string generic_field = "([A-Za-z0-9_.+\\-\\*\\÷]+)";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh!!! I need to get those \\*\\÷ out (don't worry about it here)

auto histogram_layout = m_diagnostic_output.get_header().get_identifier().get_layout();
const int num_bins = histogram_layout.dim(0);

m_diagnostic_output.deep_copy(sp(0.0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deep_copy isn't cheap, wanna move it to init impl?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! After going back, I'm not sure any initialization is needed, so I removed it entirely in 71ec166


m_diagnostic_output.deep_copy(sp(0.0));
auto bin_values_view = m_bin_values.get_view<Real *>();
auto histogram_view = m_diagnostic_output.get_view<Real *>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you consider making this an int? I don't know if making an int is a good idea (not sure how it would chain against other ops like vert_avg or just even averaging in IO ops...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question: my original implementation used ints but then I ran into IO issues. I don't quite remember the error but I can reproduce if it would be helpful?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our IO layer will likely either crash or produce garbage if we try to do the time average of an int variable. As for chaining with other diags, that's also a question mark. I am ok keeping the data type Real for now. We have lots of cleanup work to do in the diags/IO layer anyways, so we can just add this to the list...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, that makes a lot of sense. Thanks for the clarification!

const int bin_i = tm.league_rank();
const Real bin_lower = bin_values_view(bin_i);
const Real bin_upper = bin_values_view(bin_i+1);
histogram_view(bin_i) = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for this, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! This is removed in 71ec166.

const int bin_i = tm.league_rank();
const Real bin_lower = bin_values_view(bin_i);
const Real bin_upper = bin_values_view(bin_i+1);
histogram_view(bin_i) = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another good catch! This is also removed in 71ec166.

}

// TODO: use device-side MPI calls
// TODO: the dev ptr causes problems; revisit this later
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at some point we should actually revisit this ;) I put this TODO here because I wasn't interested in investigating an error I was getting, but I should actually look into it again... thanks for keeping the TODO

@mahf708 mahf708 added the CI: approved Allow gh actions PR testing on ghci-snl-* machines label Aug 19, 2025
@mahf708 mahf708 marked this pull request as draft August 19, 2025 00:41
@mahf708 mahf708 marked this pull request as ready for review August 19, 2025 00:41
@whannah1
Copy link
Contributor

@cjvogl I don't see any area weighting being applied here. Did I miss it? This seems like an important thing to include. Otherwise the area under the curve of the histogram won't match the area-weighted global mean.

@mahf708
Copy link
Contributor

mahf708 commented Aug 19, 2025

@cjvogl I don't see any area weighting being applied here. Did I miss it? This seems like an important thing to include. Otherwise the area under the curve of the histogram won't match the area-weighted global mean.

hmmm, this is just counting the occurrence though; why would you need to scale that by area?

edit: count_where_T_mid_gt_300_where_T_mid_lt_310_vert_avg_dp_weighted_horiz_avg does scale by both dp in the vertical and area in the horizontal.... but that's kinda explicit. I am not sure which one is more correct for a histogram...

@bartgol
Copy link
Contributor

bartgol commented Aug 19, 2025

@cjvogl how urgent is this PR? After the last sprint with diags capabilities, I was thinking to pause new diags, and implement a more flexible yaml specification paradigm. Eg, I was thinking of a syntax along the line of

fields:
  physics_pg2:
     fields_names:
       - my_diag_1
       - my_diag_2
diagnostics:
  my_diag_1:
    type: horiz_avg
    input: T_mid
  my_diag_2:
    type: sampling
    input: my_diag_3
    condition: qv GT 0.00001
  my_diag_3:
    type: vert_derivative
    coordinate: pressure
    input: T_mid

This more verbose framework should make it easier to request a histogram diag:

  physics_pg2:
     fields_names:
       - my_hist
diagnostics:
  my_hist:
    type: histogram
    bins: [-9999, 0, 10, 100, 1000, 999999]

The current way of requesting diags makes requesting histograms a bit hard, imho. So, if not too urgent, maybe we can postpone this? I would think the new syntax may be avail in like ~2 months?

@rljacob rljacob added the EAMxx Issues related to EAMxx label Aug 21, 2025
@cjvogl
Copy link
Contributor Author

cjvogl commented Aug 22, 2025

@cjvogl I don't see any area weighting being applied here. Did I miss it? This seems like an important thing to include. Otherwise the area under the curve of the histogram won't match the area-weighted global mean.

hmmm, this is just counting the occurrence though; why would you need to scale that by area?

edit: count_where_T_mid_gt_300_where_T_mid_lt_310_vert_avg_dp_weighted_horiz_avg does scale by both dp in the vertical and area in the horizontal.... but that's kinda explicit. I am not sure which one is more correct for a histogram...

First, a disclaimer: I wrote this while on a flight... so it's probably way longer than it needs to be.

I was going for a histogram defined along the lines of given a data set ${v_1,\ldots,v_n}$ the histogram value $h_i$ for the bin $[V_{i}^-,V_{i}^+)$ are the number of $v_k$ s.t. $V_{i}^- \leq v_k &lt; V_{i}^+$, so all the histogram values should add up to $n$, i.e., $\sum_{i=1}^M h_i = n$. I believe the area under the curve can be viewed as

  • $h_i (V_{i}^+ - V_{i}^-)$ approximates $\int_{V_{i}^-}^{V_{i}^+} h(v) dv$, i.e., the expected count of values between $V_{i}^-$ and $V_{i}^+$ for $n$ randomly sampled points
  • $\sum_{i=1}^M h_i (V_{i}^+ - V_{i}^-)$ approximates $\int_{V_1^-}^{V_M^+} h(v) dv$, i.e., the expected count of values between $V_1^-$ and $V_M^+$ for $n$ randomly sampled points

That all said, I'm happy to support a different flavor of histrogram/distribution measurement... @whannah1 , let me know if there's a different type of measurement that would be more informative than the histogram as defined above.

@cjvogl
Copy link
Contributor Author

cjvogl commented Aug 22, 2025

@cjvogl how urgent is this PR? After the last sprint with diags capabilities, I was thinking to pause new diags, and implement a more flexible yaml specification paradigm. Eg, I was thinking of a syntax along the line of

fields:
  physics_pg2:
     fields_names:
       - my_diag_1
       - my_diag_2
diagnostics:
  my_diag_1:
    type: horiz_avg
    input: T_mid
  my_diag_2:
    type: sampling
    input: my_diag_3
    condition: qv GT 0.00001
  my_diag_3:
    type: vert_derivative
    coordinate: pressure
    input: T_mid

This more verbose framework should make it easier to request a histogram diag:

  physics_pg2:
     fields_names:
       - my_hist
diagnostics:
  my_hist:
    type: histogram
    bins: [-9999, 0, 10, 100, 1000, 999999]

The current way of requesting diags makes requesting histograms a bit hard, imho. So, if not too urgent, maybe we can postpone this? I would think the new syntax may be avail in like ~2 months?

This new syntax looks very promising. I'm checking on the timeline with the higher ups (and putting a todo item here for myself so I don't forget to follow up).

  • check on timeline for PR

@AaronDonahue
Copy link
Contributor

@cjvogl how urgent is this PR? After the last sprint with diags capabilities, I was thinking to pause new diags, and implement a more flexible yaml specification paradigm. Eg, I was thinking of a syntax along the line of

@bartgol would this new syntax also support my task of collapsing a bunch of conditional diags into a single variable? For example, binning over the vertical?

@mahf708
Copy link
Contributor

mahf708 commented Aug 22, 2025

Before implementing a new syntax for IO, you'd want to consult widely with users. I think the proposed syntax is a regression in quality; the main problem is quadrupling in verbosity (LOC), which I think is unacceptable on top of our already very verbose way of requesting output...

For the purposes here, I suggest to @cjvogl to reconsider the hist syntax to be something more like: field_histogram_MinVal_MaxVal_NumBins_(Lin|Log)

But I approve of the current way to get things rolling --- note there's a cuda bug somewhere, but it is easily fixable.

@cjvogl
Copy link
Contributor Author

cjvogl commented Aug 26, 2025

@cjvogl how urgent is this PR? After the last sprint with diags capabilities, I was thinking to pause new diags, and implement a more flexible yaml specification paradigm. Eg, I was thinking of a syntax along the line of

fields:
  physics_pg2:
     fields_names:
       - my_diag_1
       - my_diag_2
diagnostics:
  my_diag_1:
    type: horiz_avg
    input: T_mid
  my_diag_2:
    type: sampling
    input: my_diag_3
    condition: qv GT 0.00001
  my_diag_3:
    type: vert_derivative
    coordinate: pressure
    input: T_mid

This more verbose framework should make it easier to request a histogram diag:

  physics_pg2:
     fields_names:
       - my_hist
diagnostics:
  my_hist:
    type: histogram
    bins: [-9999, 0, 10, 100, 1000, 999999]

The current way of requesting diags makes requesting histograms a bit hard, imho. So, if not too urgent, maybe we can postpone this? I would think the new syntax may be avail in like ~2 months?

This new syntax looks very promising. I'm checking on the timeline with the higher ups (and putting a todo item here for myself so I don't forget to follow up).

* [ ]  check on timeline for PR

@bartgol, I got some input on the timeline for this diagnostic: we'd like to have the diagnostic by end of the FY... so I would vote for merging with the current syntax and updating as needed in a future PR.

@cjvogl
Copy link
Contributor Author

cjvogl commented Aug 26, 2025

Before implementing a new syntax for IO, you'd want to consult widely with users. I think the proposed syntax is a regression in quality; the main problem is quadrupling in verbosity (LOC), which I think is unacceptable on top of our already very verbose way of requesting output...

For the purposes here, I suggest to @cjvogl to reconsider the hist syntax to be something more like: field_histogram_MinVal_MaxVal_NumBins_(Lin|Log)

But I approve of the current way to get things rolling --- note there's a cuda bug somewhere, but it is easily fixable.

@mahf708 , I check with the main customer of this diagnostic and they need greater generality than uniform linear or logarithmic bin spacing. I'll take a look at the cuda bug.

@mahf708
Copy link
Contributor

mahf708 commented Aug 27, 2025

@mahf708 , I check with the main customer of this diagnostic and they need greater generality than uniform linear or logarithmic bin spacing. I'll take a look at the cuda bug.

That's fine for now. But FYI, in general, I couldn't care less about a specific "main customer" and their wishes in adding code to the mainline eamxx. The code added to eamxx must be generically appropriate for both science uses and dev practices. Ideally, it should reflect practices in the wider community. For the latter, I usually try to see how numpy does things and follow, e.g., https://numpy.org/doc/stable/reference/generated/numpy.histogram.html. In this case, your implementation aligns with the first example usage case of numpy, which is good enough as the default.

In the future, we will allow users to specify diags like this without adding code to eamxx at all... but that will wait a long while.

const Real bin_upper = bin_values_view(bin_i+1);
Kokkos::parallel_reduce(Kokkos::TeamVectorRange(tm, d1),
[&](int i, Real &val) {
if ((bin_lower <= field_view(i)) && (field_view(i) < bin_upper))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if an entry falls outside of all bins? E.g., there's a fillValue, or maybe the input bins were a bit too narrow? The current impl simply "skips" this entry, without any error/warning, are you ok with that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this impl seems to rely on the fact that bins boundaries are well ordered. Perhaps we should check that in the constructor, to avoid odd behaviors in case of a typo like F_100_200_30 (missing a 0 for the last value).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current impl simply "skips" this entry, without any error/warning, are you ok with that?

That was intentional ("closed bins" at the ends of the histogram), with the initial idea that the user could put in large magnitude values on end bins. That said, we were also looking at considering the end bins as "open" to capture all the values... I suppose the latter is necessary if someone wants to do something like normalize the histogram by the total number of values (to get an approximate PDF). I'm now in favor of the open end bins, so I'll double check with collaborators and implement the change.

Perhaps we should check that in the constructor, to avoid odd behaviors in case of a typo like F_100_200_30 (missing a 0 for the last value).

Good call, I'll implement a check that the bin boundaries are monotonically increasing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've implement both the "infinite" extent (open ended bins) and the monotonicity check in e1a7a01... with the documentation updated in 7a873bb

// extract bin values from configuration
std::istringstream stream(m_bin_configuration);
std::string token;
std::vector<Real> bin_values;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, you can just do

auto bin_values = ekat::split(m_bin_configuration,"_");

This utility is in the header ekat_string_utils.hpp, and mimics python's strings split method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored to leverage ekat::split in 4e3985a. I'll note in the refactor, namely in adding the appropriate include to histogram.cpp, a few missing includes in other EAMxx files were uncovered and addressed in b6dc9b3

Copy link
Contributor

@bartgol bartgol Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We recently had a PR going in to fix the missing includes in the ml correction processes.

@cjvogl cjvogl force-pushed the feature-eamxx-histogram_diagnostic branch from 665c06e to 7a873bb Compare September 8, 2025 23:28
@cjvogl cjvogl force-pushed the feature-eamxx-histogram_diagnostic branch from 7a873bb to 26f9853 Compare September 8, 2025 23:30
@bartgol bartgol added CI: approved Allow gh actions PR testing on ghci-snl-* machines and removed CI: approved Allow gh actions PR testing on ghci-snl-* machines labels Sep 9, 2025
@rljacob
Copy link
Member

rljacob commented Sep 18, 2025

Still has testing fails.

@cjvogl
Copy link
Contributor Author

cjvogl commented Sep 20, 2025

Still has testing fails.

I finally have access to a machine with a CUDA compiler to address the failures. I believe ffea867 addresses the issues, but I do not yet have a bank on said machine to run the full tests (although the unit tests pass)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI: approved Allow gh actions PR testing on ghci-snl-* machines EAMxx Issues related to EAMxx
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants