Skip to content

Conversation

bartgol
Copy link
Contributor

@bartgol bartgol commented Jul 18, 2025

This bug was somehow causing ERS/ERP fails, since avg count vars were full of zeros in restart files.

[BFB] except for the avg count vars data type in EAMxx output


I am not entirely sure why pm-cpu was failing while mappy wasn't. I also am not entirely sure why we were getting 0's in the restart file, as scorpio should handle the conversion int->double internally when calling PIOc_write_darray. Either way, there's no point writing avg count vars with double data type, so this change is an improvement nevertheless.

Note: I verified this fixes things on pm-cpu with a small ne4 test. I am currently running ne30 to see if there's some other issue other than this.

Edit: ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_gnu.eamxx-L128--eamxx-output-preset-4.20250718_081619_c5nsr3 now passes the base-rest phase on pm-cpu with this PR.

@bartgol bartgol requested a review from AaronDonahue July 18, 2025 18:24
@bartgol bartgol self-assigned this Jul 18, 2025
@bartgol bartgol added bug fix PR BFB PR leaves answers BFB EAMxx Issues related to EAMxx labels Jul 18, 2025
@ndkeen
Copy link
Contributor

ndkeen commented Jul 18, 2025

Is it possible it was compiler-dependent?

@bartgol
Copy link
Contributor Author

bartgol commented Jul 18, 2025

Is it possible it was compiler-dependent?

I'm not sure. Looking at CDash, the test passed on mappy, chrysalis, and frontier, but failed on pm (both cpu and gpu) and polaris. pm-cpu and mappy both use GNU, albeit with different versions (12.3 on pm and 11.4 on mappy). I certainly cannot exclude that the compiler version makes a difference here.

The "fix" in this PR is not a bad idea anyways (no point in saving integer counts as doubles), so I think we can integrate it. But I would like to hear from scorpio folks (I pinged them on slack) whether the observed behavior in master is expected or surprising.

@bartgol bartgol requested review from jgfouca and mahf708 July 18, 2025 20:20
@bartgol
Copy link
Contributor Author

bartgol commented Jul 18, 2025

Note: the v1 CI tests failed b/c the data type of some vars has changes. Namely, these are the avg count variables, which used to have data type double and not have data type int. This DIFF is expected.

Copy link
Contributor

@mahf708 mahf708 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine and thorough to me.

The one minor thing I would recommend is taking a look at the preceding lines and seeing if you want to change stuff in light of the changes at the end, for example, in the diff, do you want to keep checking unitless? (you're no longer writing unitless in the define_var op, right?

diff

  // Now register the average count variables (if any)
  std::string unitless = "unitless";
  for (const auto& f : m_avg_counts) {
    const auto& name = f.name();
    const auto& dimnames = m_vars_dims.at(name);
    if (mode==scorpio::FileMode::Append) {
      // Similar to the regular fields above, check that the var is in the file, and has the right properties
      EKAT_REQUIRE_MSG (scorpio::has_var(filename,name),
          "Error! Cannot append, due to variable missing from the file.\n"
          "  - filename : " + filename + "\n"
          "  - varname  : " + name + "\n");
      const auto& var = scorpio::get_var(filename,name);
      EKAT_REQUIRE_MSG (var.dim_names()==dimnames,
          "Error! Cannot append, due to variable dimensions mismatch.\n"
          "  - filename : " + filename + "\n"
          "  - varname  : " + name + "\n"
          "  - var dims : " + ekat::join(dimnames,",") + "\n"
          "  - var dims from file: " + ekat::join(var.dim_names(),",") + "\n");
      EKAT_REQUIRE_MSG (var.units==unitless,
          "Error! Cannot append, due to variable units mismatch.\n"
          "  - filename : " + filename + "\n"
          "  - varname  : " + name + "\n"
          "  - var units: " + unitless + "\n"
          "  - var units from file: " + var.units + "\n");
      EKAT_REQUIRE_MSG (var.time_dep==m_add_time_dim,
          "Error! Cannot append, due to time dependency mismatch.\n"
          "  - filename : " + filename + "\n"
          "  - varname  : " + name + "\n"
          "  - var time dep: " + (m_add_time_dim ? "yes" : "no") + "\n"
          "  - var time dep from file: " + (var.time_dep ? "yes" : "no") + "\n");
    } else {
      // Note, unlike with regular output variables, for the average counting
      // variables we don't need to add all of the extra metadata.  So we simply
      // define the variable.
-     scorpio::define_var(filename, name, unitless, dimnames,
-                         "real",fp_precision, m_add_time_dim);
+     scorpio::define_var(filename, name, dimnames, "int", m_add_time_dim);
    }
  }

@bartgol
Copy link
Contributor Author

bartgol commented Jul 18, 2025

The one minor thing I would recommend is taking a look at the preceding lines and seeing if you want to change stuff in light of the changes at the end, for example, in the diff, do you want to keep checking unitless? (you're no longer writing unitless in the define_var op, right?

Ah, good call. I think nobody is using the "append" feature (as we never got around to debug why it was hanging at large MPI count), but we should try to keep it as clean as possible for when (if?) we ever revive it. I'll push a quick fix.

Copy link
Member

@jgfouca jgfouca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad you found this!

bartgol added a commit that referenced this pull request Jul 18, 2025
This bug was somehow causing ERS/ERP fails, since avg count
vars were full of zeros in restart files.

[BFB] except for the avg count vars data type in EAMxx output
@bartgol bartgol merged commit 8cd4028 into master Jul 18, 2025
15 of 18 checks passed
@bartgol bartgol deleted the bartgol/eamxx/fix-avg-count-bug branch July 18, 2025 22:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BFB PR leaves answers BFB bug fix PR EAMxx Issues related to EAMxx
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants