-
Notifications
You must be signed in to change notification settings - Fork 434
EAMxx: histogram diagnostic #7615
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
@cjvogl i'm all for a general statistics utility1(including letting those bins be either linear or logarithmic) but some quick concerns:
Footnotes
|
Admittedly, I haven't done scaling tests; however, I can say that the implementation is parallel in bins (i.e., the bins are distributed over threads), so I would expect scaling roughly proportional to the number of columns. The other approach I considered is parallel in columns with atomics to avoid race conditions in increment the bin counters; however, I am not sure the potential improvement in performance will be realized w.r.t. to number of columns
It's not all that different really for a small number of bins. For a larger number of bins, we felt it more convenient to have the histogram stored as a single output variable rather than across numerous variables. I suppose there might be additional benefit to computing the bins in parallel (as implemented) rather than serially as separate diagnostics (assuming the diagnostics are evaluated in serial). In general, I'm certainly open to discussion about these concerns, other concerns, and/or implementation discussion |
Yeah, that's good motivation. How do you feel about supporting a generic I will review soon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me
two comments but nothing blocking
- How do you feel about supporting a generic [Field](hist|pdf|cdf|...)[NumBins](lin|log) or [Field](hist|pdf|cdf|...)[NumBins](lin|log)[MIN][MAX] instead of specifying the bins with a long chain of numbers? The tedious name actually isn't a big deal. I am more concerned with more generalizability. But feel free to say 'no thanks' for now (can be convinced with the argument of no need to overgeneralize with no need right away)
- at some point we should move these nifty utils to field utils, but not rush to do stuff right now
@@ -131,7 +131,7 @@ create_diagnostic (const std::string& diag_field_name, | |||
// Note: the number for field_at_p/h can match positive integer/floating-point numbers | |||
// Start with a generic for a field name allowing for all letters, all numbers, dash, dot, plus, minus, product, and division | |||
// Escaping all the special ones just in case | |||
std::string generic_field = "([A-Za-z0-9_.+\\-\\*\\÷]+)"; | |||
std::string generic_field = "([A-Za-z0-9_.+\\-\\*\\÷]+)"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ugh!!! I need to get those \\*\\÷
out (don't worry about it here)
auto histogram_layout = m_diagnostic_output.get_header().get_identifier().get_layout(); | ||
const int num_bins = histogram_layout.dim(0); | ||
|
||
m_diagnostic_output.deep_copy(sp(0.0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deep_copy isn't cheap, wanna move it to init impl?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! After going back, I'm not sure any initialization is needed, so I removed it entirely in 71ec166
|
||
m_diagnostic_output.deep_copy(sp(0.0)); | ||
auto bin_values_view = m_bin_values.get_view<Real *>(); | ||
auto histogram_view = m_diagnostic_output.get_view<Real *>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you consider making this an int? I don't know if making an int is a good idea (not sure how it would chain against other ops like vert_avg or just even averaging in IO ops...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question: my original implementation used ints but then I ran into IO issues. I don't quite remember the error but I can reproduce if it would be helpful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our IO layer will likely either crash or produce garbage if we try to do the time average of an int variable. As for chaining with other diags, that's also a question mark. I am ok keeping the data type Real for now. We have lots of cleanup work to do in the diags/IO layer anyways, so we can just add this to the list...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, that makes a lot of sense. Thanks for the clarification!
const int bin_i = tm.league_rank(); | ||
const Real bin_lower = bin_values_view(bin_i); | ||
const Real bin_upper = bin_values_view(bin_i+1); | ||
histogram_view(bin_i) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need for this, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! This is removed in 71ec166.
const int bin_i = tm.league_rank(); | ||
const Real bin_lower = bin_values_view(bin_i); | ||
const Real bin_upper = bin_values_view(bin_i+1); | ||
histogram_view(bin_i) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another good catch! This is also removed in 71ec166.
} | ||
|
||
// TODO: use device-side MPI calls | ||
// TODO: the dev ptr causes problems; revisit this later |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at some point we should actually revisit this ;) I put this TODO here because I wasn't interested in investigating an error I was getting, but I should actually look into it again... thanks for keeping the TODO
@cjvogl I don't see any area weighting being applied here. Did I miss it? This seems like an important thing to include. Otherwise the area under the curve of the histogram won't match the area-weighted global mean. |
hmmm, this is just counting the occurrence though; why would you need to scale that by area? edit: |
@cjvogl how urgent is this PR? After the last sprint with diags capabilities, I was thinking to pause new diags, and implement a more flexible yaml specification paradigm. Eg, I was thinking of a syntax along the line of fields:
physics_pg2:
fields_names:
- my_diag_1
- my_diag_2
diagnostics:
my_diag_1:
type: horiz_avg
input: T_mid
my_diag_2:
type: sampling
input: my_diag_3
condition: qv GT 0.00001
my_diag_3:
type: vert_derivative
coordinate: pressure
input: T_mid This more verbose framework should make it easier to request a histogram diag: physics_pg2:
fields_names:
- my_hist
diagnostics:
my_hist:
type: histogram
bins: [-9999, 0, 10, 100, 1000, 999999] The current way of requesting diags makes requesting histograms a bit hard, imho. So, if not too urgent, maybe we can postpone this? I would think the new syntax may be avail in like ~2 months? |
First, a disclaimer: I wrote this while on a flight... so it's probably way longer than it needs to be. I was going for a histogram defined along the lines of given a data set
That all said, I'm happy to support a different flavor of histrogram/distribution measurement... @whannah1 , let me know if there's a different type of measurement that would be more informative than the histogram as defined above. |
This new syntax looks very promising. I'm checking on the timeline with the higher ups (and putting a todo item here for myself so I don't forget to follow up).
|
@bartgol would this new syntax also support my task of collapsing a bunch of conditional diags into a single variable? For example, binning over the vertical? |
Before implementing a new syntax for IO, you'd want to consult widely with users. I think the proposed syntax is a regression in quality; the main problem is quadrupling in verbosity (LOC), which I think is unacceptable on top of our already very verbose way of requesting output... For the purposes here, I suggest to @cjvogl to reconsider the hist syntax to be something more like: But I approve of the current way to get things rolling --- note there's a cuda bug somewhere, but it is easily fixable. |
@bartgol, I got some input on the timeline for this diagnostic: we'd like to have the diagnostic by end of the FY... so I would vote for merging with the current syntax and updating as needed in a future PR. |
@mahf708 , I check with the main customer of this diagnostic and they need greater generality than uniform linear or logarithmic bin spacing. I'll take a look at the cuda bug. |
That's fine for now. But FYI, in general, I couldn't care less about a specific "main customer" and their wishes in adding code to the mainline eamxx. The code added to eamxx must be generically appropriate for both science uses and dev practices. Ideally, it should reflect practices in the wider community. For the latter, I usually try to see how numpy does things and follow, e.g., https://numpy.org/doc/stable/reference/generated/numpy.histogram.html. In this case, your implementation aligns with the first example usage case of numpy, which is good enough as the default. In the future, we will allow users to specify diags like this without adding code to eamxx at all... but that will wait a long while. |
const Real bin_upper = bin_values_view(bin_i+1); | ||
Kokkos::parallel_reduce(Kokkos::TeamVectorRange(tm, d1), | ||
[&](int i, Real &val) { | ||
if ((bin_lower <= field_view(i)) && (field_view(i) < bin_upper)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if an entry falls outside of all bins? E.g., there's a fillValue, or maybe the input bins were a bit too narrow? The current impl simply "skips" this entry, without any error/warning, are you ok with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this impl seems to rely on the fact that bins boundaries are well ordered. Perhaps we should check that in the constructor, to avoid odd behaviors in case of a typo like F_100_200_30
(missing a 0 for the last value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current impl simply "skips" this entry, without any error/warning, are you ok with that?
That was intentional ("closed bins" at the ends of the histogram), with the initial idea that the user could put in large magnitude values on end bins. That said, we were also looking at considering the end bins as "open" to capture all the values... I suppose the latter is necessary if someone wants to do something like normalize the histogram by the total number of values (to get an approximate PDF). I'm now in favor of the open end bins, so I'll double check with collaborators and implement the change.
Perhaps we should check that in the constructor, to avoid odd behaviors in case of a typo like F_100_200_30 (missing a 0 for the last value).
Good call, I'll implement a check that the bin boundaries are monotonically increasing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// extract bin values from configuration | ||
std::istringstream stream(m_bin_configuration); | ||
std::string token; | ||
std::vector<Real> bin_values; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, you can just do
auto bin_values = ekat::split(m_bin_configuration,"_");
This utility is in the header ekat_string_utils.hpp
, and mimics python's strings split method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We recently had a PR going in to fix the missing includes in the ml correction processes.
665c06e
to
7a873bb
Compare
… monotonic bin values (test updated too)
7a873bb
to
26f9853
Compare
Still has testing fails. |
I finally have access to a machine with a CUDA compiler to address the failures. I believe ffea867 addresses the issues, but I do not yet have a bank on said machine to run the full tests (although the unit tests pass) |
Implementation of online histograms. As added to the documentation...
To select the histogram, you need to suffix a field name
$-\infty$ , 250), [250, 270), [270, 290), [290, 310), and [310, $\infty$ ).
X
with_histogram_
followed by the values specifying the edges of the bins,separated by
_
. For example, the histogram specified byT_mid_histogram_250_270_290_310
has 5 bins defined, effectively, by(
The implementation distributes the histogram bins for parallelism (each thread goes over all columns), although an alternative would be to distribute the columns and leverage an atomic for the bin. I am open to discussion on this.