Skip to content

Conversation

meng630
Copy link
Contributor

@meng630 meng630 commented Aug 27, 2025

Replace the multi-year linoz and oxid files with single-year to fix NBFB restart in MAM4xx

[BFB]

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR updates MAM4xx chemistry input files by replacing multi-year linoz and oxid data files with single-year versions to resolve NBFB restart issues in MAM4xx simulations.

  • Replace multi-year linoz files (1850-2015) with single-year 2010 versions
  • Replace multi-year oxid files (1850-2015) with single-year 2015 versions
  • Update file naming conventions to reflect the new temporal scope

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

<mam4_linoz_file_name type="file" doc="LINOZ chemistry file"> ${DIN_LOC_ROOT}/atm/scream/mam4xx/linoz/ne30pg2/linoz2010_2010JPL_CMIP6_10deg_58km_ne30pg2_c20250826.nc</mam4_linoz_file_name>
<mam4_linoz_file_name hgrid="ne4np4.pg2" type="file" doc="LINOZ chemistry file"> ${DIN_LOC_ROOT}/atm/scream/mam4xx/linoz/ne4pg2/linoz2010_2010JPL_CMIP6_10deg_58km_ne4pg2_c20250826.nc</mam4_linoz_file_name>
<!--Invariants-->
<mam4_oxid_ymd type="integer" > 20150101 </mam4_oxid_ymd>
Copy link
Preview

Copilot AI Aug 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The oxid year (2015) and linoz year (2010) are now inconsistent. Consider documenting why different years are used for these related chemistry datasets or using the same year for consistency.

Copilot uses AI. Check for mistakes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have year 2010 for oxid, so use year 2015 instead.

@singhbalwinder singhbalwinder changed the title Update linoz and oxid files in XML EAMxx: Update linoz and oxid files in XML Aug 27, 2025
@singhbalwinder singhbalwinder self-assigned this Aug 27, 2025
@singhbalwinder singhbalwinder added EAMxx Issues related to EAMxx MAM4xx MAM4xx related changes BFB PR leaves answers BFB labels Aug 27, 2025
Copy link
Contributor

@singhbalwinder singhbalwinder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks!

@singhbalwinder
Copy link
Contributor

The test is failing due to the missing newly added data files. The file download issue will be resolved when we merge PR #7636.

@meng630
Copy link
Contributor Author

meng630 commented Aug 29, 2025

@singhbalwinder The failing check is because my account is not approved to run on snl machine. Could you please try it again? Since the automatic download PR is already merged, checks should all pass this time. Thanks!

@singhbalwinder
Copy link
Contributor

We need to rebase this branch for the failing REP test to pass. I will rebase it.

@singhbalwinder
Copy link
Contributor

@meng630: I thought this PR should be BFB. Do you think changing these data files (that have essentially the same data) can still make it NBFB?

@meng630
Copy link
Contributor Author

meng630 commented Aug 29, 2025

@singhbalwinder Yeah, it should be BFB as we are using the same input data as before.

However, the linoz file might be the reason for the DIFF in the check. The original linoz file has two time indices: date (string) and time (float), but they do not correspond to the same timestamp. The date looks correct, but the time is based on odd assumptions. We've fixed this in the new files.

So, if the model interpolates linoz data using time, it could be pulling the wrong timestamp before. With the corrected timestamps in this PR, NBFB is likely. We can verify this by commenting out the linoz file change, but I don't know how to test that via Github.

@meng630 meng630 requested a review from odiazib August 29, 2025 18:42
@meng630
Copy link
Contributor Author

meng630 commented Aug 29, 2025

To tag @odiazib as he should be aware of the linoz time issue.

<mam4_oxid_file_name type="file" doc="File containing oxidants data">${DIN_LOC_ROOT}/atm/scream/mam4xx/invariants/ne30pg2/oxid_1.9x2.5_L26_1850-2015_ne30pg2_c20241009.nc</mam4_oxid_file_name>
<mam4_oxid_file_name hgrid="ne4np4.pg2" type="file" doc="File containing oxidants data">${DIN_LOC_ROOT}/atm/scream/mam4xx/invariants/ne4pg2/oxid_1.9x2.5_L26_1850-2015_ne4pg2_c20241009.nc</mam4_oxid_file_name>
<mam4_oxid_file_name type="file" doc="File containing oxidants data">${DIN_LOC_ROOT}/atm/scream/mam4xx/invariants/ne30pg2/oxid_1.9x2.5_L26_2015_ne30pg2_c20250813.nc</mam4_oxid_file_name>
<mam4_oxid_file_name hgrid="ne4np4.pg2" type="file" doc="File containing oxidants data">${DIN_LOC_ROOT}/atm/scream/mam4xx/invariants/ne4pg2/oxid_1.9x2.5_L26_2015_ne4pg2_c20250813.nc</mam4_oxid_file_name>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a different date, 20250813, for these files. In the Linoz files, we have 20250826. I am just checking if we have fixed the PIO_IOTYPE_PNETCDF warning in these files.

<mam4_linoz_file_name type="file" doc="LINOZ chemistry file"> ${DIN_LOC_ROOT}/atm/scream/mam4xx/linoz/ne30pg2/linoz1850-2015_2010JPL_CMIP6_10deg_58km_ne30pg2_c20240724.nc</mam4_linoz_file_name>
<mam4_linoz_file_name hgrid="ne4np4.pg2" type="file" doc="LINOZ chemistry file"> ${DIN_LOC_ROOT}/atm/scream/mam4xx/linoz/ne4pg2/linoz1850-2015_2010JPL_CMIP6_10deg_58km_ne4pg2_c20240724.nc</mam4_linoz_file_name>
<mam4_linoz_file_name type="file" doc="LINOZ chemistry file"> ${DIN_LOC_ROOT}/atm/scream/mam4xx/linoz/ne30pg2/linoz2010_2010JPL_CMIP6_10deg_58km_ne30pg2_c20250826.nc</mam4_linoz_file_name>
<mam4_linoz_file_name hgrid="ne4np4.pg2" type="file" doc="LINOZ chemistry file"> ${DIN_LOC_ROOT}/atm/scream/mam4xx/linoz/ne4pg2/linoz2010_2010JPL_CMIP6_10deg_58km_ne4pg2_c20250826.nc</mam4_linoz_file_name>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this new version does not produce the PIO_IOTYPE_PNETCDF warning. However, I am just checking to see if this is correct.

@singhbalwinder
Copy link
Contributor

I will try reverting the LINOZ files to see if the diffs go away. If they go away, I will reinstate the files and label this PR NBFB. Does that sound right, @meng630 and @odiazib?

@odiazib
Copy link
Contributor

odiazib commented Aug 29, 2025

@meng630 @singhbalwinder Note that in the master branch, Linoz is not using the DataInterpolation class. Thus, the time variable is not used for interpolation. In addition, the standalone tests are BFB. If fixing the time variable in the Linoz file is producing DIFFs in the CIME case, I would expect that standalone tests would also produce DIFFs.

I ran an NE30 case4 on Frontier using the master branch and new files for Linoz and oxidant. This case was crashing after a few months. @meng630, can you try running a case for more than 1 year?

@meng630
Copy link
Contributor Author

meng630 commented Aug 29, 2025

I can run ne256 for several months, but the model eventually crashed due to a very high nc. Then I increased the upper bound from 1e11 to 1.5e11. It has been running for more than a year now. @odiazib

@meng630
Copy link
Contributor Author

meng630 commented Aug 29, 2025

Ah..It still got DIFF 🧐

@singhbalwinder
Copy link
Contributor

In addition, the standalone tests are BFB.

That is a great point. The standalone tests are using ne2 files. @meng630, is there an easy way to check if the ne4pg2 files have the data for the "right" year?

@meng630
Copy link
Contributor Author

meng630 commented Aug 29, 2025

@singhbalwinder You can ncdump -v date file.nc to check out the year of the data. The oxid file is from year 2015 at ne4pg2. I also picked random points from oxid_1.9x2.5_L26_1850-2015_ne4pg2_c20241009.nc and oxid_1.9x2.5_L26_2015_ne4pg2_c20250813.nc. Their values are identical.

Files were generated using the same script across resolutions. It's odd ne2 is BFB while ne4 not. Regarding the ne2 test, were you using the single-year file or the old multi-year version? I didn't change any variables relevant to ne2 in the XML. Actually, I didn't see any ne2 over there. Is the standalone mam4xx code different from the master branch?

@singhbalwinder
Copy link
Contributor

Thanks for the quick check! The ne2 files are not in the namelist XML file. They are in the YAML file for each test. That reminds me that we should change those as well. Perhaps that is why those tests are passing. I will change just the OXID files to see if the standalone tests also fail.

@meng630
Copy link
Contributor Author

meng630 commented Aug 29, 2025

Another thing I found is, this check failed often in recent PRs. For example, #7632 has nothing to do with aerosols, but with the same DIFF fail as here in the check runs https://github.com/E3SM-Project/E3SM/actions/runs/17189725178/job/48763699453 @singhbalwinder

@singhbalwinder
Copy link
Contributor

Yeah, this check was failing because of a CMAKE issue that is fixed now. I have a "dummy PR", which I rebased and ran tests again. All the tests are passing for that "dummy PR", so all the tests should work for this PR as well.

I have now updated the oxid file for the standalone test. Let's see if it still passes.

@singhbalwinder
Copy link
Contributor

@meng630, could you please check if I used the right file for the OXIDs? The tests are failing for the standalone tests as well now.

@meng630
Copy link
Contributor Author

meng630 commented Aug 29, 2025

@singhbalwinder The oxid file is right. Can we get more information from test failures? I just saw a CMake Error, but I didn't fully understand what it meant.

@singhbalwinder
Copy link
Contributor

Here is the info:

Build type full_debug failed at testing time. Here's a list of failed tests:
183:mam4_aero_microphys_standalone_baseline_cmp

Error(s) occurred during test phase
OVERALL STATUS: FAIL

It is a baseline fail comparison.

@odiazib
Copy link
Contributor

odiazib commented Aug 30, 2025

@singhbalwinder @meng630

There is a bug in the current master branch related to the nBFB issue during restart.

The bug occurs during initialization at the following location:
eamxx_mam_microphysics_process_interface.cpp#L621

The routine update_tracer_data_from_file updates data for time interpolation using a time index. Currently, the code passes curr_month, which is a value ranging from 0 to 11. However, for files spanning multiple years, we need to use an offset index to adjust the value to the correct year.

To address this issue, the code should be updated as follows:

scream::mam_coupling::update_tracer_data_from_file(
      LinozDataReader_, curr_month + linoz_data_.offset_time_index_, *LinozHorizInterp_, linoz_data_);

A similar fix is required for the oxidant reader.

I recommend not addressing this bug in the current PR because we are using a one-year file, and this block of code will eventually be removed by the DataInterpolation PR.

This bug also explains the nBFB issue during restart, as the initialization routine is invoked at every restart.

I fixed this bug using @meng630's PR and compared the outputs of the microphysics standalone test. I observed that the outputs of this test, whether using multiple-year files or one-year files, are BFB. Therefore, I also recommend updating this PR to be nBFB.

@singhbalwinder
Copy link
Contributor

Thanks, @odiazib. If I am understanding correctly, the bug fix above will fix the restart NBFB issue for both single and multi-year files. Since we are using single-year files in this PR, this fix should not matter for this PR.

We can tag this PR NBFB, but we need to first understand why this PR is NBFB. I think this PR should be BFB, as we are using the same data as before, and the model is running just for a few time steps/days. There must be a reason why we see these BFB fails, and we need to find that out before merging this PR.

@odiazib
Copy link
Contributor

odiazib commented Sep 1, 2025

Thanks, @odiazib. If I am understanding correctly, the bug fix above will fix the restart NBFB issue for both single and multi-year files. Since we are using single-year files in this PR, this fix should not matter for this PR.

Yes, adding the offset index will resolve this NBFB issue for multi-year files.

We can tag this PR NBFB, but we need to first understand why this PR is NBFB. I think this PR should be BFB, as we are using the same data as before, and the model is running just for a few time steps/days. There must be a reason why we see these BFB fails, and we need to find that out before merging this PR.

This PR is NBFB because the wrong time index was being used during the initialization of the reader files for Linoz and oxidant.

For example, in the standalone test, where the case starts in the 8th month, the time index is 7. Currently, this index in the multi-year file (oxidant file) corresponds to the 8th month of the year 1850. On the other hand, this 7th index in the single-year file corresponds to the 8th month of the year 2015.

For interpolation, two-time sections are required. In the multi-year file, the first section is from the 8th month of the year 1850 (which is incorrect), and the second section is from the 9th month of the year 2015. In the single-year file, the first section is from the 8th month of the year 2015, and the second section is from the 9th month of the year 2015.

To ensure correctness, we can also create a PR that addresses the bug in the initialization of the time section. This PR will be NBFB. Once this PR is merged, Meng’s PR should be BFB.

tcclevenger added a commit that referenced this pull request Sep 10, 2025
For a multi-year NetCDF file, we also need to add the time offset
index. This issue was producing NBFB results, as noted in PR #7638.

Without this fix, the model was incorrectly reading the data from
the very first year in a multi-year file. For example, for oxidants
and LINOZ files, the model was incorrectly reading the data from the
year 1850 as its initial time sample (instead of 2015). The second
time sample (as we need two time samples to interpolate linearly) was
correctly picked up from 2015.

With this fix, both time samples are now correctly read from the
intended year (2015 in this example), ensuring proper interpolation
and model behavior.

[NBFB for EAMxx]
tcclevenger added a commit that referenced this pull request Sep 10, 2025
For a multi-year NetCDF file, we also need to add the time offset
index. This issue was producing NBFB results, as noted in PR #7638.
    
Without this fix, the model was incorrectly reading the data from
the very first year in a multi-year file. For example, for oxidants
and LINOZ files, the model was incorrectly reading the data from the
year 1850 as its initial time sample (instead of 2015). The second
time sample (as we need two time samples to interpolate linearly) was
correctly picked up from 2015.
    
With this fix, both time samples are now correctly read from the
intended year (2015 in this example), ensuring proper interpolation
and model behavior.
    
[NBFB for EAMxx]
singhbalwinder added a commit that referenced this pull request Sep 12, 2025
)

EAMxx: Update linoz and oxid files in XML

Replace the multi-year linoz and oxid files with single-year to fix NBFB
restart in MAM4xx

[BFB]

* meng630/mam4/fix_nbfb:
  Added the new single year LINOZ files back
  Update oxid file for the microphysics standalone test
  Revert LINOZ files to check BFBness
  Update linoz files to cdf format
  Update linoz and oxid files in XML
@singhbalwinder singhbalwinder merged commit 54aaa58 into E3SM-Project:master Sep 12, 2025
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BFB PR leaves answers BFB EAMxx Issues related to EAMxx MAM4xx MAM4xx related changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants