Skip to content

Conversation

jonbob
Copy link
Contributor

@jonbob jonbob commented Jun 24, 2025

Fix the abort routines in mpas-framework so that log_abort calls mpas_dmpar_global_abort and mpas_dmpar_global_abort itself uses shr_sys_abort when running as a coupled model.

Fixes #7447
[BFB]

@jonbob jonbob requested review from xylar and matthewhoffman June 24, 2025 18:08
@jonbob jonbob self-assigned this Jun 24, 2025
@jonbob jonbob added bug fix PR BFB PR leaves answers BFB mpas-framework labels Jun 24, 2025
@jonbob
Copy link
Contributor Author

jonbob commented Jun 24, 2025

@xylar -- I had tested this using the setup from Issue #7447 and found that it no longer hangs. But I didn't try mpas-ocean standalone

@jonbob
Copy link
Contributor Author

jonbob commented Jun 24, 2025

@matthewhoffman -- I changed the log_abort routine to call mpas_dmpar_global_abort instead of having its own abort logic, but that is not necessary if there's a reason it had been calling mpi routines

@matthewhoffman
Copy link
Contributor

@jonbob , thanks for taking care of this. The changes you implemented are what I was anticipating. I do not remember if there was a reason the log module does its own abort rather than use the pre-existing mpas_dmpar_global_abort routine, but I cannot think of an important one now.

One change in behavior is that previously, a critical error sent to the log module would not result in log.abort files but now it will, and they will just contain the message Log_abort. That could be a little confusing, but I don't have a major concern with that.

@jonbob
Copy link
Contributor Author

jonbob commented Jun 25, 2025

Thanks @matthewhoffman. I'm happy to change the behavior of how critical errors get handled in the log module, if you can think of a better approach. Or at least a more meaningful error message in the corresponding log.abort files?

@matthewhoffman
Copy link
Contributor

@jonbob , I don't have a strong opinion on if and how that should be adjusted. Let's see what @xylar says, and I'll add @mark-petersen to the discussion.

@xylar
Copy link
Contributor

xylar commented Jun 27, 2025

Hmm, I think we almost certainly were not calling mpas_dmpar_global_abort() from mpas_log_write() (via log_abort()) precisely because of the redundant logging. I think it will be confusing and irritating to have yet another useless log file per core when we have a system abort.

My suggestion would be to make the mesg argument to mpas_dmpar_global_abort() optional and only to write log*.abort files if it is present. Then, we would not pass it within log_abort().

@xylar
Copy link
Contributor

xylar commented Jun 28, 2025

@jonbob, that looks good to me. Let me test on the run that was crashing for me to make sure the job ends.

Copy link
Contributor

@xylar xylar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the 20250628.GMPAS-JRA1p5-DIB-PISMF.TL319_SOwISC12to30E3r4.chrysalis that crashed for me as I reported in #7447 (comment).

This time, the jobs seems to have aborted as expected.

Copy link
Contributor

@matthewhoffman matthewhoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonbob , this final adjustment looks like what I expected. In testing with standalone MALI, I got a build error due to leading whitespace in the preprocessor directives that were indented, so I pushed a commit to fix that. After that, I confirmed that MALI compass tests pass and that a situation that leads to an abort from the log module behaves as expected. So everything looks good from the standalone MALI perspective.

(Note that I ran into a compass/scorpio issue in compiling standalone MALI with master using the compass env on compass/main, so I tested this on MALI-Dev/develop after cherry-picking the commits from this PR. Xylar is aware of the compass env issue, but we determined it is not urgent to address.)

@jonbob
Copy link
Contributor Author

jonbob commented Jul 1, 2025

Thanks for catching that, @matthewhoffman

jonbob added a commit that referenced this pull request Jul 2, 2025
Fix the abort routines in mpas-framework to use shr_sys_abort when coupled

Fix the abort routines in mpas-framework so that log_abort calls
mpas_dmpar_global_abort and mpas_dmpar_global_abort itself uses
shr_sys_abort when running as a coupled model.

Fixes #7447
[BFB]
@jonbob
Copy link
Contributor Author

jonbob commented Jul 2, 2025

Passes:

  • SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.chrysalis_intel.allactive-wcprod
  • e3sm_ice_developer on chrysalis

merged to next

@jonbob jonbob merged commit 7934034 into master Jul 3, 2025
3 checks passed
@jonbob jonbob deleted the jonbob/mpas-framework/fix-log-abort branch July 3, 2025 18:34
@jonbob
Copy link
Contributor Author

jonbob commented Jul 3, 2025

merged to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crashes due to MPAS-Ocean state validation do not stop E3SM or standalone MPAS-Ocean runs
3 participants