@bartgol bartgol commented Sep 5, 2024

This ensures that the whole job is terminated. The MPI standard does not guarantee that an uncaught exception will cause the whole MPI job to terminate.

Note: this may fix the very long hangs we sometimes saw when the code crashed but the job did not terminate.

Fixes #2964
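
As a rough illustration of the pattern described in this PR (not the actual EAMxx code): a wrapper runs the function, and any exception that escapes is reported and turned into a job-wide abort. The names `run_guarded`, `AbortFn`, and `failing_task` are hypothetical; the abort hook stands in for a call such as `MPI_Abort(MPI_COMM_WORLD, errorcode)` so that the sketch compiles without an MPI installation.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical abort hook. In an MPI program it would typically print `what`
// and then call MPI_Abort(MPI_COMM_WORLD, errorcode), which does not return.
using AbortFn = std::function<void(const std::string& what, int errorcode)>;

// Runs f. If any exception escapes f, forwards a description to on_fatal
// instead of letting the exception propagate out of this rank.
// Returns true only if f completed normally.
inline bool run_guarded(const std::function<void()>& f, const AbortFn& on_fatal) {
  assert(f && on_fatal);  // both callables must be set
  try {
    f();
    return true;
  } catch (const std::exception& e) {
    // Standard exceptions carry a message worth printing.
    on_fatal(e.what(), 1);
  } catch (...) {
    // Anything else (e.g. `throw 42;`) has no message to report.
    on_fatal("unknown exception type", 1);
  }
  return false;  // reached only if on_fatal returned (e.g. in a test)
}

// Demo task that always fails with a standard exception.
inline void failing_task() { throw std::runtime_error("boom"); }
```

The two handlers duplicate the abort call, which is the trade-off discussed in the review thread: a single `catch (...)` would cover every type but could not print `e.what()`.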

@bartgol bartgol added the AT: AUTOMERGE, code usability, and MPI (Related to handling of MPI data interfaces/calls.) labels Sep 5, 2024
@bartgol bartgol requested a review from ambrad September 5, 2024 23:09
@bartgol bartgol self-assigned this Sep 5, 2024
// Execute wrapped function
try {
  f();
} catch (std::exception &e) {
Member

Any reason not to change this to catch (...) for greater generality?

Contributor Author

Yeah, we can do that

Contributor Author

Ah, but I want to print the exception message.

Contributor Author (bartgol, Sep 5, 2024)

I'll do a rethrow, then add catch (...) underneath. That didn't work, so I'm duplicating a bit of the code to cover both needs.

Member

Does that work? Or do you need to nest try blocks?

Member

I'm fine with this, but you might consider nesting try-catch blocks to avoid redundant code:

  try {
    try {
      throw std::exception();
    } catch (const std::exception& e) {
      printf("hi\n");
      throw;
    }
  } catch (...) {
    printf("hello\n");
  }
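
A self-contained variant of the nested try-catch suggestion above, for illustration only. `run_nested` and `failing_task` are hypothetical helpers; the function records which handlers ran so the control flow can be checked. The inner handler sees the message and rethrows, and the outer `catch (...)` runs the shared abort path for every exception type, which is where a real wrapper would presumably call `MPI_Abort`.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Runs f inside nested try blocks and returns the list of handlers that ran,
// in order. An empty list means f completed without throwing.
inline std::vector<std::string> run_nested(const std::function<void()>& f) {
  assert(f);  // f must be a non-empty callable
  std::vector<std::string> trace;
  try {
    try {
      f();
    } catch (const std::exception& e) {
      // Only standard exceptions reach this handler; record the message.
      trace.push_back(std::string("message: ") + e.what());
      throw;  // rethrow the same exception object to the outer handler
    }
  } catch (...) {
    // Shared path for every exception type, written once.
    // A real wrapper would abort the whole job here.
    trace.push_back("abort");
  }
  return trace;
}

// Demo task that always fails with a standard exception.
inline void failing_task() { throw std::runtime_error("boom"); }
```

Compared with two sibling handlers, the abort logic appears only once; the cost is one extra level of nesting.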

@bartgol bartgol requested a review from ambrad September 5, 2024 23:30

E3SM-Bot commented Sep 5, 2024

Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects:

Pull Request Auto Testing STARTING

Build Information

Test Name: SCREAM_PullRequest_Autotester_Mappy

  • Build Num: 5813
  • Status: STARTED

Jenkins Parameters

Parameter Name | Value
PR_LABELS | AT: AUTOMERGE;code usability;MPI
PULLREQUESTNUM | 2984
SCREAM_SOURCE_REPO | https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA | 70222af
SCREAM_TARGET_BRANCH | master
SCREAM_TARGET_REPO | https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA | 42ab514
TEST_REPO_ALIAS | SCREAM

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: 6039
  • Status: STARTED

Jenkins Parameters

Parameter Name | Value
PR_LABELS | AT: AUTOMERGE;code usability;MPI
PULLREQUESTNUM | 2984
SCREAM_SOURCE_REPO | https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA | 70222af
SCREAM_TARGET_BRANCH | master
SCREAM_TARGET_REPO | https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA | 42ab514
TEST_REPO_ALIAS | SCREAM

Using Repos:

Repo: SCREAM (E3SM-Project/scream)
  • Branch: bartgol/eamxx/mpi-abort-on-exception
  • SHA: 70222af
  • Mode: TEST_REPO

Pull Request Author: bartgol

E3SM-Bot commented Sep 6, 2024

Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED

Note: Testing will normally be attempted again in approx. 2 Hrs. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run.

Pull Request Auto Testing has FAILED

Build Information

Test Name: SCREAM_PullRequest_Autotester_Mappy

  • Build Num: 5813
  • Status: FAILED

Jenkins Parameters

Parameter Name | Value
PR_LABELS | AT: AUTOMERGE;code usability;MPI
PULLREQUESTNUM | 2984
SCREAM_SOURCE_REPO | https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA | 70222af
SCREAM_TARGET_BRANCH | master
SCREAM_TARGET_REPO | https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA | 42ab514
TEST_REPO_ALIAS | SCREAM

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: 6039
  • Status: PASSED

Jenkins Parameters

Parameter Name | Value
PR_LABELS | AT: AUTOMERGE;code usability;MPI
PULLREQUESTNUM | 2984
SCREAM_SOURCE_REPO | https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA | 70222af
SCREAM_TARGET_BRANCH | master
SCREAM_TARGET_REPO | https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA | 42ab514
TEST_REPO_ALIAS | SCREAM
SCREAM_PullRequest_Autotester_Mappy # 5813 FAILED (last 100 lines of console output)

	at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2915)
	at java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3410)
	at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:954)
	at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:392)
	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:50)
	at hudson.remoting.Command.readFrom(Command.java:142)
	at hudson.remoting.Command.readFrom(Command.java:128)
	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)
Caused: java.io.IOException: Unexpected termination of the channel
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:75)
Caused: java.io.IOException: Backing channel 'mappy' is disconnected.
	at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:215)
	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:285)
	at jdk.proxy2/jdk.proxy2.$Proxy99.isAlive(Unknown Source)
	at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1212)
	at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1204)
	at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:195)
	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:145)
	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:92)
	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
	at hudson.model.Build$BuildExecution.build(Build.java:199)
	at hudson.model.Build$BuildExecution.doRun(Build.java:164)
	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:526)
	at hudson.model.Run.execute(Run.java:1894)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
	at hudson.model.ResourceController.execute(ResourceController.java:101)
	at hudson.model.Executor.run(Executor.java:446)
FATAL: Unable to delete script file /tmp/jenkins8876704622228920525.sh
java.io.EOFException
	at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2915)
	at java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3410)
	at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:954)
	at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:392)
	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:50)
	at hudson.remoting.Command.readFrom(Command.java:142)
	at hudson.remoting.Command.readFrom(Command.java:128)
	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)
Caused: java.io.IOException: Unexpected termination of the channel
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:75)
Caused: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@647653a5:mappy": Remote call on mappy failed. The channel is closing down or has closed down
	at hudson.remoting.Channel.call(Channel.java:1035)
	at hudson.FilePath.act(FilePath.java:1229)
	at hudson.FilePath.act(FilePath.java:1218)
	at hudson.FilePath.delete(FilePath.java:1765)
	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:163)
	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:92)
	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
	at hudson.model.Build$BuildExecution.build(Build.java:199)
	at hudson.model.Build$BuildExecution.doRun(Build.java:164)
	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:526)
	at hudson.model.Run.execute(Run.java:1894)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
	at hudson.model.ResourceController.execute(ResourceController.java:101)
	at hudson.model.Executor.run(Executor.java:446)
Build step 'Execute shell' marked build as failure
ERROR: Unable to tear down: Channel "hudson.remoting.Channel@647653a5:mappy": Remote call on mappy failed. The channel is closing down or has closed down
java.io.EOFException
	at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2915)
	at java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3410)
	at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:954)
	at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:392)
	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:50)
	at hudson.remoting.Command.readFrom(Command.java:142)
	at hudson.remoting.Command.readFrom(Command.java:128)
	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)
Caused: java.io.IOException: Unexpected termination of the channel
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:75)
Caused: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@647653a5:mappy": Remote call on mappy failed. The channel is closing down or has closed down
	at hudson.remoting.Channel.call(Channel.java:1035)
	at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1121)
	at hudson.Launcher$ProcStarter.start(Launcher.java:506)
	at PluginClassLoader for ssh-agent//com.cloudbees.jenkins.plugins.sshagent.exec.ExecRemoteAgent.stop(ExecRemoteAgent.java:116)
	at PluginClassLoader for ssh-agent//com.cloudbees.jenkins.plugins.sshagent.SSHAgentBuildWrapper$SSHAgentEnvironment.tearDown(SSHAgentBuildWrapper.java:343)
	at hudson.model.AbstractBuild$AbstractBuildExecution.tearDownBuildEnvironments(AbstractBuild.java:566)
	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:530)
	at hudson.model.Run.execute(Run.java:1894)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
	at hudson.model.ResourceController.execute(ResourceController.java:101)
	at hudson.model.Executor.run(Executor.java:446)
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script  : #!/bin/bash -le

cd $WORKSPACE/${BUILD_ID}/

./scream/components/eamxx/scripts/jenkins/jenkins_cleanup.sh

# We're having issues with some test-launcher jobs hanging forever, so let's make sure we clean up all pending test-launcher jobs

squeue -o"%.7i %u %40j" | grep e3sm-jenkins | grep test-launcher | awk '{ print $1 }' | xargs -r scancel

Exception when executing the batch command : no workspace from node hudson.slaves.DumbSlave[mappy] which is computer hudson.slaves.SlaveComputer@65e5f986 and has channel null
Build step 'Post build task' marked build as failure
Sending e-mails to: lbertag@sandia.gov
Finished: FAILURE

SCREAM_PullRequest_Autotester_Weaver # 6039 PASSED (last 100 lines of console output)

142/157 Test #142: model_initial .........................................................   Passed    6.62 sec
        Start 143: model_restart
143/157 Test #143: model_restart .........................................................   Passed    7.79 sec
        Start 144: restarted_vs_monolithic_check_np1
144/157 Test #144: restarted_vs_monolithic_check_np1 .....................................   Passed    0.11 sec
        Start 145: homme_shoc_cld_spa_p3_rrtmgp_np1
145/157 Test #145: homme_shoc_cld_spa_p3_rrtmgp_np1 ......................................   Passed    6.85 sec
        Start 146: homme_shoc_cld_spa_p3_rrtmgp_baseline_cmp
146/157 Test #146: homme_shoc_cld_spa_p3_rrtmgp_baseline_cmp .............................   Passed    0.18 sec
        Start 147: homme_shoc_cld_spa_p3_rrtmgp_128levels_np1
147/157 Test #147: homme_shoc_cld_spa_p3_rrtmgp_128levels_np1 ............................   Passed    9.88 sec
        Start 148: homme_shoc_cld_spa_p3_rrtmgp_128levels_tend_check_np1
148/157 Test #148: homme_shoc_cld_spa_p3_rrtmgp_128levels_tend_check_np1 .................   Passed    1.69 sec
        Start 149: homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp
149/157 Test #149: homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp ...................   Passed    0.69 sec
        Start 150: homme_shoc_cld_spa_p3_rrtmgp_pg2_dp_np1
150/157 Test #150: homme_shoc_cld_spa_p3_rrtmgp_pg2_dp_np1 ...............................   Passed   14.91 sec
        Start 151: homme_shoc_cld_spa_p3_rrtmgp_pg2_dp_baseline_cmp
151/157 Test #151: homme_shoc_cld_spa_p3_rrtmgp_pg2_dp_baseline_cmp ......................   Passed    0.13 sec
        Start 152: homme_shoc_cld_p3_mam_optics_rrtmgp_np1
152/157 Test #152: homme_shoc_cld_p3_mam_optics_rrtmgp_np1 ...............................   Passed   19.35 sec
        Start 153: homme_shoc_cld_p3_mam_optics_rrtmgp_baseline_cmp
153/157 Test #153: homme_shoc_cld_p3_mam_optics_rrtmgp_baseline_cmp ......................   Passed    0.16 sec
        Start 154: homme_shoc_cld_mam_aci_p3_mam_optics_rrtmgp_mam_drydep_np1
154/157 Test #154: homme_shoc_cld_mam_aci_p3_mam_optics_rrtmgp_mam_drydep_np1 ............   Passed   20.36 sec
        Start 155: homme_shoc_cld_mam_aci_p3_mam_optics_rrtmgp_mam_drydep_baseline_cmp
155/157 Test #155: homme_shoc_cld_mam_aci_p3_mam_optics_rrtmgp_mam_drydep_baseline_cmp ...   Passed    0.20 sec
        Start 156: homme_shoc_cld_spa_p3_rrtmgp_mam4_wetscav_np1
156/157 Test #156: homme_shoc_cld_spa_p3_rrtmgp_mam4_wetscav_np1 .........................   Passed   40.91 sec
        Start 157: homme_shoc_cld_spa_p3_rrtmgp_mam4_wetscav_baseline_cmp
157/157 Test #157: homme_shoc_cld_spa_p3_rrtmgp_mam4_wetscav_baseline_cmp ................   Passed    0.18 sec

100% tests passed, 0 tests failed out of 157

Label Time Summary:
baseline_cmp = 146.73 sec*proc (23 tests)
baseline_gen = 345.73 sec*proc (25 tests)
bfbhash = 1.02 sec*proc (1 test)
check = 1.02 sec*proc (1 test)
cld = 46.19 sec*proc (7 tests)
cld_fraction = 1.33 sec*proc (1 test)
cxx baseline_cmp = 7.81 sec*proc (2 tests)
diagnostics = 48.50 sec*proc (23 tests)
driver = 94.22 sec*proc (16 tests)
dynamics = 8.11 sec*proc (3 tests)
fail = 34.42 sec*proc (5 tests)
io = 63.66 sec*proc (14 tests)
mam4_aci = 22.66 sec*proc (4 tests)
mam4_constituent_fluxes = 8.82 sec*proc (1 test)
mam4_drydep = 3.99 sec*proc (1 test)
mam4_optics = 4.70 sec*proc (1 test)
mam4_srf_online_emiss = 8.82 sec*proc (1 test)
mam4_wetscav = 22.53 sec*proc (2 tests)
nudging = 7.59 sec*proc (2 tests)
p3 = 107.37 sec*proc (12 tests)
p3_sk = 60.21 sec*proc (2 tests)
physics = 210.90 sec*proc (27 tests)
remap = 5.87 sec*proc (1 test)
rrtmgp = 45.64 sec*proc (11 tests)
shoc = 59.26 sec*proc (13 tests)
spa = 9.11 sec*proc (4 tests)
surface_coupling = 5.15 sec*proc (1 test)

Total Test time (real) = 866.74 sec

Testing '''70222af46f2d41d247ce0cda4c327c439b7e6e04''' for test '''full_sp_debug'''

RUN: taskset -c 52-103 sh -c '''SCREAM_BUILD_PARALLEL_LEVEL=52 CTEST_PARALLEL_LEVEL=1 ctest -V --output-on-failure --resource-spec-file /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6039/scream/components/eamxx/ctest-build/full_sp_debug/ctest_resource_file.json -DNO_SUBMIT=True -DBUILD_WORK_DIR=/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6039/scream/components/eamxx/ctest-build/full_sp_debug -DBUILD_NAME_MOD=full_sp_debug -S /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6039/scream/components/eamxx/cmake/ctest_script.cmake -DCTEST_SITE=weaver -DCMAKE_COMMAND="-C /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6039/scream/components/eamxx/cmake/machine-files/weaver.cmake -DNetCDF_Fortran_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-fortran/4.6.1/gcc/11.3.0/openmpi/4.1.6/5tv5psl -DNetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-c/4.9.2/gcc/11.3.0/openmpi/4.1.6/pyuuqd3 -DPnetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/parallel-netcdf/1.12.3/gcc/11.3.0/openmpi/4.1.6/2s52shy -DCMAKE_BUILD_TYPE=Debug -DEKAT_DEFAULT_BFB=True -DSCREAM_DOUBLE_PRECISION=False -DEKAT_DISABLE_TPL_WARNINGS='''''''''ON''''''''' -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpifort -DSCREAM_DYNAMICS_DYCORE=HOMME -DSCREAM_TEST_MAX_TOTAL_THREADS=1 -DSCREAM_BASELINES_DIR=/home/projects/e3sm/scream/pr-autotester/master-baselines/weaver/full_sp_debug" '''
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6039/scream/components/eamxx/ctest-build/full_sp_debug

Testing '''70222af46f2d41d247ce0cda4c327c439b7e6e04''' for test '''release'''

RUN: taskset -c 104-155 sh -c '''SCREAM_BUILD_PARALLEL_LEVEL=52 CTEST_PARALLEL_LEVEL=1 ctest -V --output-on-failure --resource-spec-file /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6039/scream/components/eamxx/ctest-build/release/ctest_resource_file.json -DNO_SUBMIT=True -DBUILD_WORK_DIR=/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6039/scream/components/eamxx/ctest-build/release -DBUILD_NAME_MOD=release -S /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6039/scream/components/eamxx/cmake/ctest_script.cmake -DCTEST_SITE=weaver -DCMAKE_COMMAND="-C /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6039/scream/components/eamxx/cmake/machine-files/weaver.cmake -DNetCDF_Fortran_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-fortran/4.6.1/gcc/11.3.0/openmpi/4.1.6/5tv5psl -DNetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-c/4.9.2/gcc/11.3.0/openmpi/4.1.6/pyuuqd3 -DPnetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/parallel-netcdf/1.12.3/gcc/11.3.0/openmpi/4.1.6/2s52shy -DCMAKE_BUILD_TYPE=Release -DEKAT_DISABLE_TPL_WARNINGS='''''''''ON''''''''' -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpifort -DSCREAM_DYNAMICS_DYCORE=HOMME -DSCREAM_TEST_MAX_TOTAL_THREADS=1 -DSCREAM_BASELINES_DIR=/home/projects/e3sm/scream/pr-autotester/master-baselines/weaver/release" '''
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6039/scream/components/eamxx/ctest-build/release

Testing '''70222af46f2d41d247ce0cda4c327c439b7e6e04''' for test '''full_debug'''

RUN: taskset -c 0-51 sh -c '''SCREAM_BUILD_PARALLEL_LEVEL=52 CTEST_PARALLEL_LEVEL=1 ctest -V --output-on-failure --resource-spec-file /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6039/scream/components/eamxx/ctest-build/full_debug/ctest_resource_file.json -DNO_SUBMIT=True -DBUILD_WORK_DIR=/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6039/scream/components/eamxx/ctest-build/full_debug -DBUILD_NAME_MOD=full_debug -S /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6039/scream/components/eamxx/cmake/ctest_script.cmake -DCTEST_SITE=weaver -DCMAKE_COMMAND="-C /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6039/scream/components/eamxx/cmake/machine-files/weaver.cmake -DNetCDF_Fortran_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-fortran/4.6.1/gcc/11.3.0/openmpi/4.1.6/5tv5psl -DNetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-c/4.9.2/gcc/11.3.0/openmpi/4.1.6/pyuuqd3 -DPnetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/parallel-netcdf/1.12.3/gcc/11.3.0/openmpi/4.1.6/2s52shy -DCMAKE_BUILD_TYPE=Debug -DEKAT_DEFAULT_BFB=True -DKokkos_ENABLE_DEBUG_BOUNDS_CHECK=True -DEKAT_DISABLE_TPL_WARNINGS='''''''''ON''''''''' -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpifort -DSCREAM_DYNAMICS_DYCORE=HOMME -DSCREAM_TEST_MAX_TOTAL_THREADS=1 -DSCREAM_BASELINES_DIR=/home/projects/e3sm/scream/pr-autotester/master-baselines/weaver/full_debug" '''
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6039/scream/components/eamxx/ctest-build/full_debug
OVERALL STATUS: PASS
Starting analysis on weaver with cmd: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6039/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
RUN: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6039/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6039/scream/components/eamxx
Completed analysis on weaver'

+ [[ 0 != 0 ]]
+ [[ 1 == 0 ]]
+ [[ weaver == \m\a\p\p\y ]]
+ set +x
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script  : #!/bin/bash -le

cd $WORKSPACE/${BUILD_ID}/

./scream/components/eamxx/scripts/jenkins/jenkins_cleanup.sh
[SCREAM_PullRequest_Autotester_Weaver] $ /bin/bash -le /tmp/jenkins16414706543195495297.sh
POST BUILD TASK : SUCCESS
END OF POST BUILD TASK : 0
Finished: SUCCESS

@bartgol bartgol added the AT: RETEST and CI: skip eamxx-v1 (Skip eamxx CIME testing for this PR) labels Sep 6, 2024

E3SM-Bot commented Sep 6, 2024

Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing.

E3SM-Bot commented Sep 6, 2024

Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects:

Pull Request Auto Testing STARTING

Build Information

Test Name: SCREAM_PullRequest_Autotester_Mappy

  • Build Num: 5815
  • Status: STARTED

Jenkins Parameters

Parameter Name | Value
PR_LABELS | AT: RETEST;AT: AUTOMERGE;AT: Skip weaver;AT: Skip v1 Testing;code usability;MPI
PULLREQUESTNUM | 2984
SCREAM_SOURCE_REPO | https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA | 70222af
SCREAM_TARGET_BRANCH | master
SCREAM_TARGET_REPO | https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA | 42ab514
TEST_REPO_ALIAS | SCREAM

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: -1
  • Status: SKIPPED

Jenkins Parameters

Parameter Name Value

Using Repos:

Repo: SCREAM (E3SM-Project/scream)
  • Branch: bartgol/eamxx/mpi-abort-on-exception
  • SHA: 70222af
  • Mode: TEST_REPO

Pull Request Author: bartgol

E3SM-Bot commented Sep 6, 2024

Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED

Note: Testing will normally be attempted again in approx. 2 Hrs. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run.

Pull Request Auto Testing has FAILED

Build Information

Test Name: SCREAM_PullRequest_Autotester_Mappy

  • Build Num: 5815
  • Status: FAILED

Jenkins Parameters

Parameter Name | Value
PR_LABELS | AT: RETEST;AT: AUTOMERGE;AT: Skip weaver;AT: Skip v1 Testing;code usability;MPI
PULLREQUESTNUM | 2984
SCREAM_SOURCE_REPO | https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA | 70222af
SCREAM_TARGET_BRANCH | master
SCREAM_TARGET_REPO | https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA | 42ab514
TEST_REPO_ALIAS | SCREAM

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: -1
  • Status: SKIPPED

Jenkins Parameters

Parameter Name Value
SCREAM_PullRequest_Autotester_Mappy # 5815 FAILED (last 100 lines of console output)

+ tee JENKINS_2024-09-05_203055
+ IFS=';'
+ read -r -a labels
+ skip_testing=0
+ test_scripts=0
+ test_v0=0
+ test_v1=1
+ test_SA=1
+ skip_mappy=0
+ skip_weaver=0
+ skip_blake=0
+ is_at_run=0
+ '[' 6 -gt 0 ']'
+ for label in '"${labels[@]}"'
+ '[' 'AT: RETEST' == 'AT: Integrate Without Testing' ']'
+ '[' 'AT: RETEST' == 'AT: Skip Stand-Alone Testing' ']'
+ '[' 'AT: RETEST' == 'AT: Skip v1 Testing' ']'
+ '[' 'AT: RETEST' == scripts ']'
+ '[' 'AT: RETEST' == SCREAMv0 ']'
+ '[' 'AT: RETEST' == 'AT: Skip mappy' ']'
+ '[' 'AT: RETEST' == 'AT: Skip weaver' ']'
+ '[' 'AT: RETEST' == 'AT: Skip blake' ']'
+ for label in '"${labels[@]}"'
+ '[' 'AT: AUTOMERGE' == 'AT: Integrate Without Testing' ']'
+ '[' 'AT: AUTOMERGE' == 'AT: Skip Stand-Alone Testing' ']'
+ '[' 'AT: AUTOMERGE' == 'AT: Skip v1 Testing' ']'
+ '[' 'AT: AUTOMERGE' == scripts ']'
+ '[' 'AT: AUTOMERGE' == SCREAMv0 ']'
+ '[' 'AT: AUTOMERGE' == 'AT: Skip mappy' ']'
+ '[' 'AT: AUTOMERGE' == 'AT: Skip weaver' ']'
+ '[' 'AT: AUTOMERGE' == 'AT: Skip blake' ']'
+ for label in '"${labels[@]}"'
+ '[' 'AT: Skip weaver' == 'AT: Integrate Without Testing' ']'
+ '[' 'AT: Skip weaver' == 'AT: Skip Stand-Alone Testing' ']'
+ '[' 'AT: Skip weaver' == 'AT: Skip v1 Testing' ']'
+ '[' 'AT: Skip weaver' == scripts ']'
+ '[' 'AT: Skip weaver' == SCREAMv0 ']'
+ '[' 'AT: Skip weaver' == 'AT: Skip mappy' ']'
+ '[' 'AT: Skip weaver' == 'AT: Skip weaver' ']'
+ skip_weaver=1
+ for label in '"${labels[@]}"'
+ '[' 'AT: Skip v1 Testing' == 'AT: Integrate Without Testing' ']'
+ '[' 'AT: Skip v1 Testing' == 'AT: Skip Stand-Alone Testing' ']'
+ '[' 'AT: Skip v1 Testing' == 'AT: Skip v1 Testing' ']'
+ test_v1=0
+ for label in '"${labels[@]}"'
+ '[' 'code usability' == 'AT: Integrate Without Testing' ']'
+ '[' 'code usability' == 'AT: Skip Stand-Alone Testing' ']'
+ '[' 'code usability' == 'AT: Skip v1 Testing' ']'
+ '[' 'code usability' == scripts ']'
+ '[' 'code usability' == SCREAMv0 ']'
+ '[' 'code usability' == 'AT: Skip mappy' ']'
+ '[' 'code usability' == 'AT: Skip weaver' ']'
+ '[' 'code usability' == 'AT: Skip blake' ']'
+ for label in '"${labels[@]}"'
+ '[' MPI == 'AT: Integrate Without Testing' ']'
+ '[' MPI == 'AT: Skip Stand-Alone Testing' ']'
+ '[' MPI == 'AT: Skip v1 Testing' ']'
+ '[' MPI == scripts ']'
+ '[' MPI == SCREAMv0 ']'
+ '[' MPI == 'AT: Skip mappy' ']'
+ '[' MPI == 'AT: Skip weaver' ']'
+ '[' MPI == 'AT: Skip blake' ']'
+ '[' 0 -eq 0 ']'
+ v0_fail=0
+ v1_fail=0
+ sa_fail=0
+ cov_fail=0
+ scripts_fail=0
+ memcheck_fail=0
+ cd /ascldap/users/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5815/scream/components/eamxx/scripts/jenkins/../..
+ source scripts/jenkins/mappy_setup
++ module purge
+++ /usr/bin/modulecmd bash purge
++ eval
++ source /projects/sems/modulefiles/utils/sems-archive-modules-init.sh
/ascldap/users/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5815/scream/components/eamxx/scripts/jenkins/jenkins_common_impl.sh: line 2: /projects/sems/modulefiles/utils/sems-archive-modules-init.sh: No such file or directory
Build step 'Execute shell' marked build as failure
$ ssh-agent -k
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 36228 killed;
[ssh-agent] Stopped.
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script  : #!/bin/bash -le

cd $WORKSPACE/${BUILD_ID}/

./scream/components/eamxx/scripts/jenkins/jenkins_cleanup.sh

# We're having issues with some test-launcher jobs hanging forever, so let's make sure we clean up all pending test-launcher jobs

squeue -o"%.7i %u %40j" | grep e3sm-jenkins | grep test-launcher | awk '{ print $1 }' | xargs -r scancel

[SCREAM_PullRequest_Autotester_Mappy] $ /bin/bash -le /tmp/jenkins14507071293668054559.sh
POST BUILD TASK : FAILURE
END OF POST BUILD TASK : 0
Sending e-mails to: lbertag@sandia.gov
Finished: FAILURE

SCREAM_PullRequest_Autotester_Weaver # -1 SKIPPED

bartgol commented Sep 6, 2024

The AT is failing b/c of FS issues on mappy. I'm not going to add RETEST, since the issue will likely linger for a while. We can retest/merge next week, it's not urgent.

AaronDonahue commented

@bartgol, is this just waiting for the AT to get back up to full functionality? Could we manually test and then manually merge so that the PR doesn't linger?

E3SM-Bot commented

Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing.

E3SM-Bot commented

Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects:

Pull Request Auto Testing STARTING

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: -1
  • Status: SKIPPED

Jenkins Parameters

Parameter Name Value

Build Information

Test Name: SCREAM_PullRequest_Autotester_Mappy

  • Build Num: 5840
  • Status: STARTED

Jenkins Parameters

Parameter Name | Value
PR_LABELS | AT: RETEST;AT: AUTOMERGE;AT: Skip weaver;AT: Skip v1 Testing;code usability;MPI
PULLREQUESTNUM | 2984
SCREAM_SOURCE_REPO | https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA | 70222af
SCREAM_TARGET_BRANCH | master
SCREAM_TARGET_REPO | https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA | 798bfa6
TEST_REPO_ALIAS | SCREAM

Using Repos:

Repo: SCREAM (E3SM-Project/scream)
  • Branch: bartgol/eamxx/mpi-abort-on-exception
  • SHA: 70222af
  • Mode: TEST_REPO

Pull Request Author: bartgol

E3SM-Bot commented

Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED

Pull Request Auto Testing has PASSED

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: -1
  • Status: SKIPPED

Jenkins Parameters

Parameter Name Value

Build Information

Test Name: SCREAM_PullRequest_Autotester_Mappy

  • Build Num: 5840
  • Status: PASSED

Jenkins Parameters

Parameter Name | Value
PR_LABELS | AT: RETEST;AT: AUTOMERGE;AT: Skip weaver;AT: Skip v1 Testing;code usability;MPI
PULLREQUESTNUM | 2984
SCREAM_SOURCE_REPO | https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA | 70222af
SCREAM_TARGET_BRANCH | master
SCREAM_TARGET_REPO | https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA | 798bfa6
TEST_REPO_ALIAS | SCREAM

@E3SM-Bot E3SM-Bot merged commit b610754 into master Sep 23, 2024
7 checks passed
@E3SM-Bot E3SM-Bot deleted the bartgol/eamxx/mpi-abort-on-exception branch September 23, 2024 22:39

Successfully merging this pull request may close these issues.

Call MPI_Abort on exceptions.