-
Notifications
You must be signed in to change notification settings - Fork 435
EAMxx: fix define_var call for avg count var in IO #7533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Is it possible it was compiler-dependent? |
I'm not sure. Looking at CDash, the test passed on mappy, chrysalis, and frontier, but failed on pm (both cpu and gpu) and polaris. pm-cpu and mappy both use GNU, albeit with different versions (12.3 on pm and 11.4 on mappy). I certainly cannot exclude that the compiler version makes a difference here. The "fix" in this PR is not a bad idea anyways (no point in saving integer counts as doubles), so I think we can integrate it. But I would like to hear from scorpio folks (I pinged them on slack) whether the observed behavior in master is expected or surprising. |
Note: the v1 CI tests failed b/c the data type of some vars has changes. Namely, these are the avg count variables, which used to have data type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine and thorough to me.
The one minor thing I would recommend is taking a look at the preceding lines and seeing if you want to change stuff in light of the changes at the end, for example, in the diff, do you want to keep checking unitless? (you're no longer writing unitless in the define_var op, right?
diff
// Now register the average count variables (if any)
std::string unitless = "unitless";
for (const auto& f : m_avg_counts) {
const auto& name = f.name();
const auto& dimnames = m_vars_dims.at(name);
if (mode==scorpio::FileMode::Append) {
// Similar to the regular fields above, check that the var is in the file, and has the right properties
EKAT_REQUIRE_MSG (scorpio::has_var(filename,name),
"Error! Cannot append, due to variable missing from the file.\n"
" - filename : " + filename + "\n"
" - varname : " + name + "\n");
const auto& var = scorpio::get_var(filename,name);
EKAT_REQUIRE_MSG (var.dim_names()==dimnames,
"Error! Cannot append, due to variable dimensions mismatch.\n"
" - filename : " + filename + "\n"
" - varname : " + name + "\n"
" - var dims : " + ekat::join(dimnames,",") + "\n"
" - var dims from file: " + ekat::join(var.dim_names(),",") + "\n");
EKAT_REQUIRE_MSG (var.units==unitless,
"Error! Cannot append, due to variable units mismatch.\n"
" - filename : " + filename + "\n"
" - varname : " + name + "\n"
" - var units: " + unitless + "\n"
" - var units from file: " + var.units + "\n");
EKAT_REQUIRE_MSG (var.time_dep==m_add_time_dim,
"Error! Cannot append, due to time dependency mismatch.\n"
" - filename : " + filename + "\n"
" - varname : " + name + "\n"
" - var time dep: " + (m_add_time_dim ? "yes" : "no") + "\n"
" - var time dep from file: " + (var.time_dep ? "yes" : "no") + "\n");
} else {
// Note, unlike with regular output variables, for the average counting
// variables we don't need to add all of the extra metadata. So we simply
// define the variable.
- scorpio::define_var(filename, name, unitless, dimnames,
- "real",fp_precision, m_add_time_dim);
+ scorpio::define_var(filename, name, dimnames, "int", m_add_time_dim);
}
}
Ah, good call. I think nobody is using the "append" feature (as we never got around to debug why it was hanging at large MPI count), but we should try to keep it as clean as possible for when (if?) we ever revive it. I'll push a quick fix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad you found this!
This bug was somehow causing ERS/ERP fails, since avg count vars were full of zeros in restart files. [BFB] except for the avg count vars data type in EAMxx output
This bug was somehow causing ERS/ERP fails, since avg count vars were full of zeros in restart files.
[BFB] except for the avg count vars data type in EAMxx output
I am not entirely sure why pm-cpu was failing while mappy wasn't. I also am not entirely sure why we were getting 0's in the restart file, as scorpio should handle the conversion int->double internally when calling PIOc_write_darray. Either way, there's no point writing avg count vars with double data type, so this change is an improvement nevertheless.
Note: I verified this fixes things on pm-cpu with a small ne4 test. I am currently running ne30 to see if there's some other issue other than this.
Edit:
ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_gnu.eamxx-L128--eamxx-output-preset-4.20250718_081619_c5nsr3
now passes the base-rest phase on pm-cpu with this PR.