@mjschmidt271 mjschmidt271 commented Aug 20, 2025

This appears to be working correctly now.

One thing to note is that I currently have it configured to run on AMD MI250 or MI210 GPUs. This gets jobs picked up faster, since we have different nodes containing one or the other, and both belong to the same AMD_GFX90A architecture. However, if anyone would like finer control over which card is used for a run, I can look into changing that.

The caveat is that a handful of dropmixnuc-related tests are failing for the HIP build. However, all tests did run, and the failures are listed below (the workflow output is enormous, which will be fixed by the changes to the mam_x_validation compare script 🎉).

dropmixnuc failures

HIP Autotest Results - Release/Double

2025-08-20T19:59:36.2050758Z 94% tests passed, 38 tests failed out of 646
2025-08-20T19:59:36.2051102Z 
2025-08-20T19:59:36.2051210Z Label Time Summary:
2025-08-20T19:59:36.2051508Z aero_emissions         =   5.14 sec*proc (18 tests)
2025-08-20T19:59:36.2052015Z amicphys_1subarea      =   5.76 sec*proc (20 tests)
2025-08-20T19:59:36.2052459Z compared2standalone    =   0.62 sec*proc (2 tests)
2025-08-20T19:59:36.2052904Z mo_drydep              =   6.45 sec*proc (22 tests)
2025-08-20T19:59:36.2053323Z nuc_tests_new          =   1.67 sec*proc (6 tests)
2025-08-20T19:59:36.2053605Z 
2025-08-20T19:59:36.2053969Z Total Test time (real) = 209.64 sec
2025-08-20T19:59:36.2054345Z 
2025-08-20T19:59:36.2054463Z The following tests FAILED:
2025-08-20T19:59:36.2054909Z 	421 - run_stand_dropmixnuc_ts_1407 (Subprocess aborted)
2025-08-20T19:59:36.2055328Z 	422 - validate_stand_dropmixnuc_ts_1407 (Failed)
2025-08-20T19:59:36.2055829Z 	423 - run_dropmixnuc_ts_1400 (Subprocess aborted)
2025-08-20T19:59:36.2056197Z 	424 - validate_dropmixnuc_ts_1400 (Failed)
2025-08-20T19:59:36.2056601Z 	425 - run_dropmixnuc_ts_1401 (Subprocess aborted)
2025-08-20T19:59:36.2056964Z 	426 - validate_dropmixnuc_ts_1401 (Failed)
2025-08-20T19:59:36.2057349Z 	427 - run_dropmixnuc_ts_1402 (Subprocess aborted)
2025-08-20T19:59:36.2057726Z 	428 - validate_dropmixnuc_ts_1402 (Failed)
2025-08-20T19:59:36.2058328Z 	429 - run_dropmixnuc_ts_1403 (Subprocess aborted)
2025-08-20T19:59:36.2058796Z 	430 - validate_dropmixnuc_ts_1403 (Failed)
2025-08-20T19:59:36.2059158Z 	431 - run_dropmixnuc_ts_1404 (Subprocess aborted)
2025-08-20T19:59:36.2059562Z 	432 - validate_dropmixnuc_ts_1404 (Failed)
2025-08-20T19:59:36.2059922Z 	433 - run_dropmixnuc_ts_1405 (Subprocess aborted)
2025-08-20T19:59:36.2060299Z 	434 - validate_dropmixnuc_ts_1405 (Failed)
2025-08-20T19:59:36.2060643Z 	435 - run_dropmixnuc_ts_1406 (Subprocess aborted)
2025-08-20T19:59:36.2061013Z 	436 - validate_dropmixnuc_ts_1406 (Failed)
2025-08-20T19:59:36.2061377Z 	437 - run_dropmixnuc_ts_1407 (Subprocess aborted)
2025-08-20T19:59:36.2061828Z 	438 - validate_dropmixnuc_ts_1407 (Failed)
2025-08-20T19:59:36.2062182Z 	439 - run_dropmixnuc_ts_1408 (Subprocess aborted)
2025-08-20T19:59:36.2062604Z 	440 - validate_dropmixnuc_ts_1408 (Failed)
2025-08-20T19:59:36.2063031Z 	441 - run_dropmixnuc_ts_1409 (Subprocess aborted)
2025-08-20T19:59:36.2063368Z 	442 - validate_dropmixnuc_ts_1409 (Failed)
2025-08-20T19:59:36.2063700Z 	443 - run_dropmixnuc_ts_1410 (Subprocess aborted)
2025-08-20T19:59:36.2064030Z 	444 - validate_dropmixnuc_ts_1410 (Failed)
2025-08-20T19:59:36.2064370Z 	445 - run_dropmixnuc_ts_1411 (Subprocess aborted)
2025-08-20T19:59:36.2064691Z 	446 - validate_dropmixnuc_ts_1411 (Failed)
2025-08-20T19:59:36.2065024Z 	447 - run_dropmixnuc_ts_1412 (Subprocess aborted)
2025-08-20T19:59:36.2065374Z 	448 - validate_dropmixnuc_ts_1412 (Failed)
2025-08-20T19:59:36.2065794Z 	449 - run_dropmixnuc_ts_1413 (Subprocess aborted)
2025-08-20T19:59:36.2066121Z 	450 - validate_dropmixnuc_ts_1413 (Failed)
2025-08-20T19:59:36.2066483Z 	451 - run_dropmixnuc_ts_1414 (Subprocess aborted)
2025-08-20T19:59:36.2066838Z 	452 - validate_dropmixnuc_ts_1414 (Failed)
2025-08-20T19:59:36.2067189Z Errors while running CTest
2025-08-20T19:59:36.2067486Z 	453 - run_dropmixnuc_ts_1415 (Subprocess aborted)
2025-08-20T19:59:36.2067862Z 	454 - validate_dropmixnuc_ts_1415 (Failed)
2025-08-20T19:59:36.2068212Z 	455 - run_dropmixnuc_ts_1416 (Subprocess aborted)
2025-08-20T19:59:36.2068603Z 	456 - validate_dropmixnuc_ts_1416 (Failed)
2025-08-20T19:59:36.2068986Z 	457 - run_dropmixnuc_ts_1417 (Subprocess aborted)
2025-08-20T19:59:36.2069321Z 	458 - validate_dropmixnuc_ts_1417 (Failed)
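
For anyone who wants to pull the failing tests out of a log like the one above without scrolling through the full workflow output, here is a minimal Python sketch. It is not part of this PR or of the mam_x_validation script; the function name `parse_failed_tests` and the regular expression are assumptions based only on the line format shown above (an optional timestamp, a test number, a test name, and a status in parentheses).

```python
import re

# Matches lines such as
#   "2025-08-20T19:59:36.2054909Z \t421 - run_stand_dropmixnuc_ts_1407 (Subprocess aborted)"
# The leading timestamp is optional, so plain CTest output also matches.
FAILED_LINE = re.compile(
    r"^(?:\S+Z\s+)?\s*(?P<num>\d+)\s+-\s+(?P<name>\S+)\s+\((?P<status>[^)]+)\)"
)


def parse_failed_tests(log_text):
    """Return (number, name, status) tuples for each failed-test line in the log."""
    failures = []
    for line in log_text.splitlines():
        match = FAILED_LINE.match(line)
        if match:
            failures.append((int(match["num"]), match["name"], match["status"]))
    return failures


if __name__ == "__main__":
    sample = (
        "2025-08-20T19:59:36.2054463Z The following tests FAILED:\n"
        "2025-08-20T19:59:36.2054909Z \t421 - run_stand_dropmixnuc_ts_1407 (Subprocess aborted)\n"
        "2025-08-20T19:59:36.2067189Z Errors while running CTest\n"
        "2025-08-20T19:59:36.2055328Z \t422 - validate_stand_dropmixnuc_ts_1407 (Failed)\n"
    )
    for number, name, status in parse_failed_tests(sample):
        print(number, name, status)
```

Lines that do not follow the numbered format, such as the "Errors while running CTest" message interleaved in the list, are skipped.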

@mjschmidt271 mjschmidt271 requested a review from mam4xxSNL August 20, 2025 19:14

codecov bot commented Aug 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.42%. Comparing base (3d91b39) to head (a4696d1).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #476   +/-   ##
=======================================
  Coverage   93.42%   93.42%           
=======================================
  Files         303      303           
  Lines       25171    25171           
  Branches     2766     2766           
=======================================
  Hits        23517    23517           
  Misses       1654     1654           

@mam4xxSNL mam4xxSNL changed the title TESTING - AMD/HIP Autotesting Add AMD/HIP Autotesting Aug 20, 2025
@mjschmidt271 mjschmidt271 force-pushed the mjs/add-hip-autotester branch from 0b01e95 to 73416a7 Compare August 20, 2025 20:35
@mjschmidt271

Update: I'm debugging a little weirdness with the AMD runners, but this is otherwise good to go.

@singhbalwinder singhbalwinder left a comment

Looks good! Thanks, Mike!

@mjschmidt271 mjschmidt271 force-pushed the mjs/add-hip-autotester branch from a4696d1 to 24f926c Compare August 25, 2025 21:35
@mjschmidt271

@singhbalwinder @odiazib @overfelt @jaelynlitz

For anyone not closely following this PR, it is rebased onto the branch for PR #477, which brings in the mam_x_validation PR that updates the Python script used for comparing MAM/MAM4xx results. I did this because it should make it easier to take a closer look at test results and debug the issues.

However... it does produce a lot more failing tests in this PR, many of which are duplicated in #477, and I will add a similar list of failing tests to that PR.

What do we want to do about the tests that newly fail with the change in the compare script, and about the potentially newly exposed failures on AMD MI200-series GPUs?

My thoughts are:

  • Change the tolerances for the failing tests, as a part of this PR
    • I will post the information below to an Issue so we know which tests need a closer look.
    • This will help get this PR merged and also keep known failures from flagging PR auto-testing.
  • Below are lists of the failing tests for each architecture and links to the results.
    • I have not yet given a close look to the reasons these tests are failing.
    • The lists show the Release/Double configuration for each case, and I have not determined whether the behavior differs between Debug and Release.
    • Note that single-precision tests are passing in some cases, but this appears to be because far fewer tests run in that configuration.
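
To make the "change the tolerances" option concrete, here is a minimal sketch of a tolerance-based comparison in Python. This is not the logic in the mam_x_validation compare script, which I have not reproduced here; the function name `within_tolerance` and the default values are hypothetical and only illustrate how relative and absolute tolerances interact.

```python
import math


def within_tolerance(reference, computed, rel_tol=1e-9, abs_tol=0.0):
    """Return True if every pair of values agrees to the given tolerances.

    Hypothetical helper: math.isclose treats two values as close when
    |a - b| <= max(rel_tol * max(|a|, |b|), abs_tol).
    """
    if len(reference) != len(computed):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, computed)
    )


if __name__ == "__main__":
    ref = [1.0, 2.0]
    out = [1.0, 2.0 + 1e-7]
    print(within_tolerance(ref, out, rel_tol=1e-9))  # tight relative tolerance
    print(within_tolerance(ref, out, rel_tol=1e-6))  # loosened relative tolerance
    # Values near zero need an absolute tolerance; a relative one alone cannot pass.
    print(within_tolerance([0.0], [1e-15], abs_tol=1e-12))
```

Loosening `rel_tol` or adding `abs_tol` per test would silence a known failure, which is why the list of affected tests should also go into an Issue for a closer look.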

CUDA Fails - Release/Double

Link to results

 48 - validate_mer07_veh02_nuc_mosaic_1box (Failed)
 60 - validate_calcsize_compute_dry_volume (Failed)
 62 - validate_stand_modal_aero_calcsize_sub (Failed)
 74 - validate_ma_precpprod (Failed)
 90 - validate_compute_massflux_small (Failed)
134 - validate_pcarbon_aging_1subarea (Failed)
202 - validate_calc_1_impact_rate_ts_0 (Failed)
204 - validate_modal_aero_bcscavcoef_get_ts_355 (Failed)
212 - validate_baseline_aero_model_wetdep_ts_379 (Failed)
214 - wetdep_compare_clddiag_output (Failed)
222 - wetdep_compare_wetdep_prevap_130_output (Failed)
224 - wetdep_compare_wetdep_prevap_230_output (Failed)
234 - wetdep_compare_wetdep_scavenging_true_output (Failed)
236 - wetdep_compare_wetdep_scavenging_false_output (Failed)
240 - wetdep_compare_rain_mix_ratio_output (Failed)
246 - wetdep_compare_wetdep_resusp_130_output (Failed)
248 - wetdep_compare_wetdep_resusp_230_output (Failed)
364 - validate_linmat_ts_355 (Failed)
366 - validate_nlnmat_ts_355 (Failed)
368 - validate_imp_prod_loss_ts_355 (Failed)
370 - validate_newton_raphson_iter_ts_355 (Failed)
408 - validate_maxsattype1_merged (Failed)
410 - validate_maxsattype2_merged (Failed)
498 - validate_lin_strat_chem_solve_ts_1415 (Failed)
500 - validate_lin_strat_sfcsink_ts_1415_multicol (Failed)
502 - validate_lin_strat_sfcsinkmulticol_merged (Failed)
512 - validate_chm_diags_ts_355 (Failed)
538 - validate_calc_het_rates_merged (Failed)
540 - validate_calc_precip_rescale_merged (Failed)
546 - validate_sethet_merged (Failed)
554 - validate_calc_sox_aqueous_ts_355_merged (Failed)
566 - validate_calc_diag_spec_ts_355 (Failed)
580 - validate_modal_aero_lw_ts_355 (Failed)
584 - validate_update_aod_spec_ts_355 (Failed)
586 - validate_aer_rad_props_lw_ts_355 (Failed)
588 - validate_aer_rad_props_sw_ts_355 (Failed)
590 - validate_volcanic_cmip_sw_ts_355 (Failed)
616 - validate_mam_soaexch_1subarea_ts_379 (Failed)
618 - validate_gas_aer_uptkrates_1box1gas_ts_379 (Failed)
620 - validate_mam_gasaerexch_1subarea_ts_379 (Failed)
622 - validate_vert_interp_ts_300 (Failed)
624 - validate_vert_interp_col_ts_300 (Failed)

HIP Fails - Release/Double

Link to results

 48 - validate_mer07_veh02_nuc_mosaic_1box (Failed)
 60 - validate_calcsize_compute_dry_volume (Failed)
 62 - validate_stand_modal_aero_calcsize_sub (Failed)
 74 - validate_ma_precpprod (Failed)
 90 - validate_compute_massflux_small (Failed)
134 - validate_pcarbon_aging_1subarea (Failed)
202 - validate_calc_1_impact_rate_ts_0 (Failed)
204 - validate_modal_aero_bcscavcoef_get_ts_355 (Failed)
212 - validate_baseline_aero_model_wetdep_ts_379 (Failed)
214 - wetdep_compare_clddiag_output (Failed)
222 - wetdep_compare_wetdep_prevap_130_output (Failed)
224 - wetdep_compare_wetdep_prevap_230_output (Failed)
234 - wetdep_compare_wetdep_scavenging_true_output (Failed)
236 - wetdep_compare_wetdep_scavenging_false_output (Failed)
240 - wetdep_compare_rain_mix_ratio_output (Failed)
246 - wetdep_compare_wetdep_resusp_130_output (Failed)
248 - wetdep_compare_wetdep_resusp_230_output (Failed)
364 - validate_linmat_ts_355 (Failed)
366 - validate_nlnmat_ts_355 (Failed)
368 - validate_imp_prod_loss_ts_355 (Failed)
370 - validate_newton_raphson_iter_ts_355 (Failed)
408 - validate_maxsattype1_merged (Failed)
410 - validate_maxsattype2_merged (Failed)
498 - validate_lin_strat_chem_solve_ts_1415 (Failed)
500 - validate_lin_strat_sfcsink_ts_1415_multicol (Failed)
502 - validate_lin_strat_sfcsinkmulticol_merged (Failed)
512 - validate_chm_diags_ts_355 (Failed)
538 - validate_calc_het_rates_merged (Failed)
540 - validate_calc_precip_rescale_merged (Failed)
546 - validate_sethet_merged (Failed)
554 - validate_calc_sox_aqueous_ts_355_merged (Failed)
566 - validate_calc_diag_spec_ts_355 (Failed)
580 - validate_modal_aero_lw_ts_355 (Failed)
584 - validate_update_aod_spec_ts_355 (Failed)
586 - validate_aer_rad_props_lw_ts_355 (Failed)
588 - validate_aer_rad_props_sw_ts_355 (Failed)
590 - validate_volcanic_cmip_sw_ts_355 (Failed)
616 - validate_mam_soaexch_1subarea_ts_379 (Failed)
618 - validate_gas_aer_uptkrates_1box1gas_ts_379 (Failed)
620 - validate_mam_gasaerexch_1subarea_ts_379 (Failed)
622 - validate_vert_interp_ts_300 (Failed)
624 - validate_vert_interp_col_ts_300 (Failed)

The cause of the following failures is unclear: each run_ test reports "Subprocess aborted", so the matching validate_ test fails by default.

421 - run_stand_dropmixnuc_ts_1407 (Subprocess aborted)
422 - validate_stand_dropmixnuc_ts_1407 (Failed)
423 - run_dropmixnuc_ts_1400 (Subprocess aborted)
424 - validate_dropmixnuc_ts_1400 (Failed)
425 - run_dropmixnuc_ts_1401 (Subprocess aborted)
426 - validate_dropmixnuc_ts_1401 (Failed)
427 - run_dropmixnuc_ts_1402 (Subprocess aborted)
428 - validate_dropmixnuc_ts_1402 (Failed)
429 - run_dropmixnuc_ts_1403 (Subprocess aborted)
430 - validate_dropmixnuc_ts_1403 (Failed)
431 - run_dropmixnuc_ts_1404 (Subprocess aborted)
432 - validate_dropmixnuc_ts_1404 (Failed)
433 - run_dropmixnuc_ts_1405 (Subprocess aborted)
434 - validate_dropmixnuc_ts_1405 (Failed)
435 - run_dropmixnuc_ts_1406 (Subprocess aborted)
436 - validate_dropmixnuc_ts_1406 (Failed)
437 - run_dropmixnuc_ts_1407 (Subprocess aborted)
438 - validate_dropmixnuc_ts_1407 (Failed)
439 - run_dropmixnuc_ts_1408 (Subprocess aborted)
440 - validate_dropmixnuc_ts_1408 (Failed)
441 - run_dropmixnuc_ts_1409 (Subprocess aborted)
442 - validate_dropmixnuc_ts_1409 (Failed)
443 - run_dropmixnuc_ts_1410 (Subprocess aborted)
444 - validate_dropmixnuc_ts_1410 (Failed)
445 - run_dropmixnuc_ts_1411 (Subprocess aborted)
446 - validate_dropmixnuc_ts_1411 (Failed)
447 - run_dropmixnuc_ts_1412 (Subprocess aborted)
448 - validate_dropmixnuc_ts_1412 (Failed)
449 - run_dropmixnuc_ts_1413 (Subprocess aborted)
450 - validate_dropmixnuc_ts_1413 (Failed)
451 - run_dropmixnuc_ts_1414 (Subprocess aborted)
452 - validate_dropmixnuc_ts_1414 (Failed)
453 - run_dropmixnuc_ts_1415 (Subprocess aborted)
454 - validate_dropmixnuc_ts_1415 (Failed)
455 - run_dropmixnuc_ts_1416 (Subprocess aborted)
456 - validate_dropmixnuc_ts_1416 (Failed)
457 - run_dropmixnuc_ts_1417 (Subprocess aborted)
458 - validate_dropmixnuc_ts_1417 (Failed)
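
One way to separate these "failed by default" validate_ tests from validate_ tests that failed on the comparison itself is sketched below in Python. It assumes the naming convention visible in the list above, where run_<case> pairs with validate_<case>; the function name `split_validate_failures` is hypothetical, and tests without a validate_ prefix (for example the wetdep_compare_ tests) are ignored.

```python
def split_validate_failures(failures):
    """Split validate_ failures by whether the matching run_ test aborted.

    `failures` is an iterable of (name, status) pairs copied from a CTest summary.
    Returns (blocked, compared): `blocked` validate_ tests had no output to check
    because their run_ test aborted; `compared` tests failed for another reason.
    """
    failures = list(failures)
    aborted_cases = {
        name[len("run_"):]
        for name, status in failures
        if name.startswith("run_") and status == "Subprocess aborted"
    }
    blocked, compared = [], []
    for name, _status in failures:
        if not name.startswith("validate_"):
            continue
        case = name[len("validate_"):]
        (blocked if case in aborted_cases else compared).append(name)
    return blocked, compared


if __name__ == "__main__":
    sample = [
        ("run_dropmixnuc_ts_1400", "Subprocess aborted"),
        ("validate_dropmixnuc_ts_1400", "Failed"),
        ("validate_linmat_ts_355", "Failed"),
    ]
    blocked, compared = split_validate_failures(sample)
    print("blocked by aborted run:", blocked)
    print("failed on comparison:", compared)
```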

CPU Fails - Release/Double

Link to results

 48 - validate_mer07_veh02_nuc_mosaic_1box (Failed)     nuc_tests_new
 60 - validate_calcsize_compute_dry_volume (Failed)
 62 - validate_stand_modal_aero_calcsize_sub (Failed)
 74 - validate_ma_precpprod (Failed)
 90 - validate_compute_massflux_small (Failed)
134 - validate_pcarbon_aging_1subarea (Failed)
202 - validate_calc_1_impact_rate_ts_0 (Failed)
204 - validate_modal_aero_bcscavcoef_get_ts_355 (Failed)
212 - validate_baseline_aero_model_wetdep_ts_379 (Failed)
214 - wetdep_compare_clddiag_output (Failed)
222 - wetdep_compare_wetdep_prevap_130_output (Failed)
224 - wetdep_compare_wetdep_prevap_230_output (Failed)
234 - wetdep_compare_wetdep_scavenging_true_output (Failed)
236 - wetdep_compare_wetdep_scavenging_false_output (Failed)
240 - wetdep_compare_rain_mix_ratio_output (Failed)
246 - wetdep_compare_wetdep_resusp_130_output (Failed)
248 - wetdep_compare_wetdep_resusp_230_output (Failed)
364 - validate_linmat_ts_355 (Failed)
366 - validate_nlnmat_ts_355 (Failed)
368 - validate_imp_prod_loss_ts_355 (Failed)
370 - validate_newton_raphson_iter_ts_355 (Failed)
408 - validate_maxsattype1_merged (Failed)
410 - validate_maxsattype2_merged (Failed)
498 - validate_lin_strat_chem_solve_ts_1415 (Failed)
500 - validate_lin_strat_sfcsink_ts_1415_multicol (Failed)
502 - validate_lin_strat_sfcsinkmulticol_merged (Failed)
512 - validate_chm_diags_ts_355 (Failed)
538 - validate_calc_het_rates_merged (Failed)
540 - validate_calc_precip_rescale_merged (Failed)
546 - validate_sethet_merged (Failed)
554 - validate_calc_sox_aqueous_ts_355_merged (Failed)
566 - validate_calc_diag_spec_ts_355 (Failed)
580 - validate_modal_aero_lw_ts_355 (Failed)
584 - validate_update_aod_spec_ts_355 (Failed)
586 - validate_aer_rad_props_lw_ts_355 (Failed)
588 - validate_aer_rad_props_sw_ts_355 (Failed)
590 - validate_volcanic_cmip_sw_ts_355 (Failed)
616 - validate_mam_soaexch_1subarea_ts_379 (Failed)
618 - validate_gas_aer_uptkrates_1box1gas_ts_379 (Failed)
620 - validate_mam_gasaerexch_1subarea_ts_379 (Failed)
622 - validate_vert_interp_ts_300 (Failed)
624 - validate_vert_interp_col_ts_300 (Failed)
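
Since the CUDA, HIP, and CPU lists above appear to overlap heavily, with the dropmixnuc aborts showing up only for HIP, a quick set comparison can show which failures are shared and which are specific to one build. The Python sketch below is only an illustration; the function name `architecture_specific_failures` is hypothetical, and the sample data is a short excerpt of the lists above, not the full results.

```python
def architecture_specific_failures(failures_by_arch):
    """Return (common, specific) for a mapping of build label -> failing test names.

    `common` holds names that fail on every build; `specific[arch]` holds the
    names that fail on `arch` but are not common to all builds.
    """
    sets = {arch: set(names) for arch, names in failures_by_arch.items()}
    common = set.intersection(*sets.values()) if sets else set()
    specific = {arch: names - common for arch, names in sets.items()}
    return common, specific


if __name__ == "__main__":
    # Abbreviated excerpt of the lists in this comment.
    excerpt = {
        "CUDA": ["validate_linmat_ts_355", "validate_sethet_merged"],
        "HIP": [
            "validate_linmat_ts_355",
            "validate_sethet_merged",
            "run_dropmixnuc_ts_1400",
            "validate_dropmixnuc_ts_1400",
        ],
        "CPU": ["validate_linmat_ts_355", "validate_sethet_merged"],
    }
    common, specific = architecture_specific_failures(excerpt)
    print("common to all builds:", sorted(common))
    for arch, names in specific.items():
        print(arch, "only:", sorted(names))
```

Failures common to every build point toward the compare script or the tolerances, while build-specific ones are the better candidates for architecture-related debugging.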
