@ambrad ambrad commented Oct 23, 2024

Fix either a long-latent issue or a compiler-side issue that was recently triggered by an unrelated PR. Edit: Further analysis and testing strongly suggest this is a compiler-side issue for craygnuamdgpu but not crayclang-scream.

Add a test for 128 levels to hold us over until the default for all tests is changed to 128 levels.

Add an option to the global state hasher to hash a user-provided array. This let me hash temporary workspace as part of isolating the issue.

Fixes #3053.
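For readers unfamiliar with the idea of hashing an array to track down bit-for-bit differences, here is a minimal sketch. It is not the EAMxx global state hasher: the function name `hash_array` and the choice of FNV-1a are assumptions made purely for illustration. The point is only that hashing the raw bytes of a workspace array at two points (or in two builds) shows whether its contents are bitwise identical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not the EAMxx hasher): FNV-1a over the raw bytes of
// an array of doubles. Any single-byte change in the input changes the hash.
std::uint64_t hash_array(const double* data, std::size_t n) {
  assert(data != nullptr || n == 0);
  std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
  for (std::size_t i = 0; i < n; ++i) {
    unsigned char bytes[sizeof(double)];
    std::memcpy(bytes, &data[i], sizeof(double));  // bitwise view of the value
    for (unsigned char b : bytes) {
      h ^= b;
      h *= 0x100000001b3ULL;  // FNV-1a 64-bit prime
    }
  }
  return h;
}
```

Comparing `hash_array(work, n)` before and after a suspect kernel, or between a passing and a failing configuration, narrows down where the first bitwise difference appears.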

@ambrad ambrad added the bugfix and GPU (PRs that make changes specifically for GPUs) labels Oct 23, 2024
@ambrad ambrad requested a review from ndkeen October 23, 2024 19:44

if (k+1 == nlev_packs) zi_grid(i,nlevi_v)[nlevi_p] = 0;
});
zi_grid(i,nlevi_v)[nlevi_p] = 0;
Member Author


Placing a team_barrier before this line would also work.

Contributor


I'm puzzled. Why was this line buggy? All threads would execute that line. Although that is unnecessary (so I'm ok with the change), it would not be incorrect, since they all set the same value and there was a barrier right after.

Member Author

@ambrad ambrad Oct 23, 2024


I agree. If the previous kernel were over nlevi rather than nlev, I could make an argument about memory consistency. But since it's over nlev, I can't think of any explanation. Maybe this fix is actually working around a compiler-side issue.

The evidence for this specific line being the issue is that the _sfc quantities a few lines below this code block were all wrong. It's possible there's some other bug I'm not seeing that this fix handles. If so, that bug must affect these _sfc quantities.

I'll weaken my suggestion that other C++ devs study this PR since it may just be working around a compiler-side problem.
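To make the indexing in this thread concrete, here is a minimal sketch under stated assumptions: scalar level `idx` is stored in pack `idx / n` at slot `idx % n`, there is one more interface than there are levels (`nlevi = nlev + 1`), and `nlev_packs` is the ceiling of `nlev / n`. The helper names are hypothetical; only `nlev`, `nlevi`, `nlev_packs`, `nlevi_v`, and `nlevi_p` come from the discussion and the diff. It checks whether a loop over `nlev_packs` would ever touch the pack holding the last interface entry.

```cpp
#include <cassert>

// Hypothetical helper type: location of one scalar level in a packed array.
struct PackIndex {
  int v;  // pack index (nlevi_v in the diff)
  int p;  // slot within the pack (nlevi_p in the diff)
};

// Location of the last interface level, assuming idx / n and idx % n packing.
PackIndex last_interface_index(int nlevi, int pack_size) {
  assert(nlevi > 0 && pack_size > 0);
  return {(nlevi - 1) / pack_size, (nlevi - 1) % pack_size};
}

// Number of packs needed to hold nlev scalars (ceiling division).
int num_packs(int nlev, int pack_size) {
  assert(nlev > 0 && pack_size > 0);
  return (nlev + pack_size - 1) / pack_size;
}

// True if a loop over k in [0, nlev_packs) visits the pack that holds the
// last interface entry. Assumes nlevi = nlev + 1.
bool loop_covers_last_interface(int nlev, int pack_size) {
  const PackIndex idx = last_interface_index(nlev + 1, pack_size);
  return idx.v < num_packs(nlev, pack_size);
}
```

Under these assumptions, with a pack size of 1 and 128 levels the loop covers packs 0 through 127 while the last interface entry sits in pack 128, so no iteration of the loop over nlev writes to it. That is consistent with the remark above that a loop over nlev gives no obvious memory-consistency explanation.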

@ambrad ambrad force-pushed the ambrad/eamxx/shoc-0set-bugfix branch from 1445261 to 663216f Compare October 23, 2024 20:52
bartgol previously approved these changes Oct 23, 2024
Contributor

@bartgol bartgol left a comment


If this impl appears to deterministically fix a non-determinism, I'm ok with merging, even if it remains a bit mysterious.

@E3SM-Bot
Collaborator

Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects:

Pull Request Auto Testing STARTING

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: 6214
  • Status: STARTED

Jenkins Parameters

Parameter Name Value
PR_LABELS bugfix;AT: PRE-TEST INSPECTED;GPU
PULLREQUESTNUM 3058
SCREAM_SOURCE_REPO https://github.com/ambrad/scream
SCREAM_SOURCE_SHA d57208c
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 69c16bb
TEST_REPO_ALIAS SCREAM

Build Information

Test Name: SCREAM_PullRequest_Autotester_Mappy

  • Build Num: 5959
  • Status: STARTED

Jenkins Parameters

Parameter Name Value
PR_LABELS bugfix;AT: PRE-TEST INSPECTED;GPU
PULLREQUESTNUM 3058
SCREAM_SOURCE_REPO https://github.com/ambrad/scream
SCREAM_SOURCE_SHA d57208c
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 69c16bb
TEST_REPO_ALIAS SCREAM

Using Repos:

Repo: SCREAM (ambrad/scream)
  • Branch: ambrad/eamxx/shoc-0set-bugfix
  • SHA: d57208c
  • Mode: TEST_REPO

Pull Request Author: ambrad

@E3SM-Bot
Collaborator

Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED

Note: Testing will normally be attempted again in approx. 2 Hrs. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run.

Pull Request Auto Testing has FAILED

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: 6214
  • Status: PASSED

Jenkins Parameters

Parameter Name Value
PR_LABELS bugfix;AT: PRE-TEST INSPECTED;GPU
PULLREQUESTNUM 3058
SCREAM_SOURCE_REPO https://github.com/ambrad/scream
SCREAM_SOURCE_SHA d57208c
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 69c16bb
TEST_REPO_ALIAS SCREAM

Build Information

Test Name: SCREAM_PullRequest_Autotester_Mappy

  • Build Num: 5959
  • Status: FAILED

Jenkins Parameters

Parameter Name Value
PR_LABELS bugfix;AT: PRE-TEST INSPECTED;GPU
PULLREQUESTNUM 3058
SCREAM_SOURCE_REPO https://github.com/ambrad/scream
SCREAM_SOURCE_SHA d57208c
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 69c16bb
TEST_REPO_ALIAS SCREAM
SCREAM_PullRequest_Autotester_Weaver # 6214 PASSED (last 100 lines of console output)

        Start 143: model_restart
143/157 Test #143: model_restart .........................................................   Passed    6.98 sec
        Start 144: restarted_vs_monolithic_check_np1
144/157 Test #144: restarted_vs_monolithic_check_np1 .....................................   Passed    0.10 sec
        Start 145: homme_shoc_cld_spa_p3_rrtmgp_np1
145/157 Test #145: homme_shoc_cld_spa_p3_rrtmgp_np1 ......................................   Passed   11.49 sec
        Start 146: homme_shoc_cld_spa_p3_rrtmgp_baseline_cmp
146/157 Test #146: homme_shoc_cld_spa_p3_rrtmgp_baseline_cmp .............................   Passed    0.12 sec
        Start 147: homme_shoc_cld_spa_p3_rrtmgp_128levels_np1
147/157 Test #147: homme_shoc_cld_spa_p3_rrtmgp_128levels_np1 ............................   Passed    8.56 sec
        Start 148: homme_shoc_cld_spa_p3_rrtmgp_128levels_tend_check_np1
148/157 Test #148: homme_shoc_cld_spa_p3_rrtmgp_128levels_tend_check_np1 .................   Passed    1.53 sec
        Start 149: homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp
149/157 Test #149: homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp ...................   Passed    0.60 sec
        Start 150: homme_shoc_cld_spa_p3_rrtmgp_pg2_dp_np1
150/157 Test #150: homme_shoc_cld_spa_p3_rrtmgp_pg2_dp_np1 ...............................   Passed   17.32 sec
        Start 151: homme_shoc_cld_spa_p3_rrtmgp_pg2_dp_baseline_cmp
151/157 Test #151: homme_shoc_cld_spa_p3_rrtmgp_pg2_dp_baseline_cmp ......................   Passed    0.09 sec
        Start 152: homme_shoc_cld_p3_mam_optics_rrtmgp_np1
152/157 Test #152: homme_shoc_cld_p3_mam_optics_rrtmgp_np1 ...............................   Passed   16.11 sec
        Start 153: homme_shoc_cld_p3_mam_optics_rrtmgp_baseline_cmp
153/157 Test #153: homme_shoc_cld_p3_mam_optics_rrtmgp_baseline_cmp ......................   Passed    0.21 sec
        Start 154: homme_shoc_cld_mam_aci_p3_mam_optics_rrtmgp_mam_drydep_np1
154/157 Test #154: homme_shoc_cld_mam_aci_p3_mam_optics_rrtmgp_mam_drydep_np1 ............   Passed   16.52 sec
        Start 155: homme_shoc_cld_mam_aci_p3_mam_optics_rrtmgp_mam_drydep_baseline_cmp
155/157 Test #155: homme_shoc_cld_mam_aci_p3_mam_optics_rrtmgp_mam_drydep_baseline_cmp ...   Passed    0.17 sec
        Start 156: homme_shoc_cld_spa_p3_rrtmgp_mam4_wetscav_np1
156/157 Test #156: homme_shoc_cld_spa_p3_rrtmgp_mam4_wetscav_np1 .........................   Passed   31.87 sec
        Start 157: homme_shoc_cld_spa_p3_rrtmgp_mam4_wetscav_baseline_cmp
157/157 Test #157: homme_shoc_cld_spa_p3_rrtmgp_mam4_wetscav_baseline_cmp ................   Passed    0.15 sec

100% tests passed, 0 tests failed out of 157

Label Time Summary:
baseline_cmp = 138.24 sec*proc (23 tests)
baseline_gen = 337.64 sec*proc (25 tests)
bfbhash = 0.92 sec*proc (1 test)
check = 0.89 sec*proc (1 test)
cld = 46.24 sec*proc (7 tests)
cld_fraction = 1.17 sec*proc (1 test)
cxx baseline_cmp = 7.63 sec*proc (2 tests)
diagnostics = 43.21 sec*proc (23 tests)
driver = 91.96 sec*proc (16 tests)
dynamics = 6.36 sec*proc (3 tests)
fail = 30.92 sec*proc (5 tests)
io = 50.16 sec*proc (14 tests)
mam4_aci = 28.58 sec*proc (4 tests)
mam4_constituent_fluxes = 7.96 sec*proc (1 test)
mam4_drydep = 3.48 sec*proc (1 test)
mam4_optics = 4.15 sec*proc (1 test)
mam4_srf_online_emiss = 7.96 sec*proc (1 test)
mam4_wetscav = 24.41 sec*proc (2 tests)
nudging = 7.14 sec*proc (2 tests)
p3 = 109.81 sec*proc (12 tests)
p3_sk = 34.50 sec*proc (2 tests)
physics = 183.13 sec*proc (27 tests)
remap = 2.84 sec*proc (1 test)
rrtmgp = 45.89 sec*proc (11 tests)
shoc = 57.65 sec*proc (13 tests)
spa = 7.90 sec*proc (4 tests)
surface_coupling = 4.25 sec*proc (1 test)

Total Test time (real) = 799.24 sec

Testing '''f37b7193126288c00030acb7e16672b3f1521f00''' for test '''full_sp_debug'''

RUN: taskset -c 52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103 sh -c '''SCREAM_BUILD_PARALLEL_LEVEL=52 CTEST_PARALLEL_LEVEL=1 ctest -V --output-on-failure --resource-spec-file /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6214/scream/components/eamxx/ctest-build/full_sp_debug/ctest_resource_file.json -DNO_SUBMIT=True -DBUILD_WORK_DIR=/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6214/scream/components/eamxx/ctest-build/full_sp_debug -DBUILD_NAME_MOD=full_sp_debug -S /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6214/scream/components/eamxx/cmake/ctest_script.cmake -DCTEST_SITE=weaver -DCMAKE_COMMAND="-C /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6214/scream/components/eamxx/cmake/machine-files/weaver.cmake -DNetCDF_Fortran_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-fortran/4.6.1/gcc/11.3.0/openmpi/4.1.6/5tv5psl -DNetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-c/4.9.2/gcc/11.3.0/openmpi/4.1.6/pyuuqd3 -DPnetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/parallel-netcdf/1.12.3/gcc/11.3.0/openmpi/4.1.6/2s52shy -DCMAKE_BUILD_TYPE=Debug -DEKAT_DEFAULT_BFB=True -DSCREAM_DOUBLE_PRECISION=False -DEKAT_DISABLE_TPL_WARNINGS='''''''''ON''''''''' -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpifort -DSCREAM_DYNAMICS_DYCORE=HOMME -DSCREAM_TEST_MAX_TOTAL_THREADS=1 -DSCREAM_BASELINES_DIR=/home/projects/e3sm/scream/pr-autotester/master-baselines/weaver/full_sp_debug" '''
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6214/scream/components/eamxx/ctest-build/full_sp_debug

Testing '''f37b7193126288c00030acb7e16672b3f1521f00''' for test '''full_debug'''

RUN: taskset -c 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51 sh -c '''SCREAM_BUILD_PARALLEL_LEVEL=52 CTEST_PARALLEL_LEVEL=1 ctest -V --output-on-failure --resource-spec-file /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6214/scream/components/eamxx/ctest-build/full_debug/ctest_resource_file.json -DNO_SUBMIT=True -DBUILD_WORK_DIR=/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6214/scream/components/eamxx/ctest-build/full_debug -DBUILD_NAME_MOD=full_debug -S /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6214/scream/components/eamxx/cmake/ctest_script.cmake -DCTEST_SITE=weaver -DCMAKE_COMMAND="-C /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6214/scream/components/eamxx/cmake/machine-files/weaver.cmake -DNetCDF_Fortran_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-fortran/4.6.1/gcc/11.3.0/openmpi/4.1.6/5tv5psl -DNetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-c/4.9.2/gcc/11.3.0/openmpi/4.1.6/pyuuqd3 -DPnetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/parallel-netcdf/1.12.3/gcc/11.3.0/openmpi/4.1.6/2s52shy -DCMAKE_BUILD_TYPE=Debug -DEKAT_DEFAULT_BFB=True -DKokkos_ENABLE_DEBUG_BOUNDS_CHECK=True -DEKAT_DISABLE_TPL_WARNINGS='''''''''ON''''''''' -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpifort -DSCREAM_DYNAMICS_DYCORE=HOMME -DSCREAM_TEST_MAX_TOTAL_THREADS=1 -DSCREAM_BASELINES_DIR=/home/projects/e3sm/scream/pr-autotester/master-baselines/weaver/full_debug" '''
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6214/scream/components/eamxx/ctest-build/full_debug

Testing '''f37b7193126288c00030acb7e16672b3f1521f00''' for test '''release'''

RUN: taskset -c 104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155 sh -c '''SCREAM_BUILD_PARALLEL_LEVEL=52 CTEST_PARALLEL_LEVEL=1 ctest -V --output-on-failure --resource-spec-file /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6214/scream/components/eamxx/ctest-build/release/ctest_resource_file.json -DNO_SUBMIT=True -DBUILD_WORK_DIR=/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6214/scream/components/eamxx/ctest-build/release -DBUILD_NAME_MOD=release -S /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6214/scream/components/eamxx/cmake/ctest_script.cmake -DCTEST_SITE=weaver -DCMAKE_COMMAND="-C /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6214/scream/components/eamxx/cmake/machine-files/weaver.cmake -DNetCDF_Fortran_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-fortran/4.6.1/gcc/11.3.0/openmpi/4.1.6/5tv5psl -DNetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-c/4.9.2/gcc/11.3.0/openmpi/4.1.6/pyuuqd3 -DPnetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/parallel-netcdf/1.12.3/gcc/11.3.0/openmpi/4.1.6/2s52shy -DCMAKE_BUILD_TYPE=Release -DEKAT_DISABLE_TPL_WARNINGS='''''''''ON''''''''' -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpifort -DSCREAM_DYNAMICS_DYCORE=HOMME -DSCREAM_TEST_MAX_TOTAL_THREADS=1 -DSCREAM_BASELINES_DIR=/home/projects/e3sm/scream/pr-autotester/master-baselines/weaver/release" '''
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6214/scream/components/eamxx/ctest-build/release
OVERALL STATUS: PASS
Starting analysis on weaver with cmd: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6214/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
RUN: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6214/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6214/scream/components/eamxx
Completed analysis on weaver'

+ [[ 0 != 0 ]]
+ [[ 1 == 0 ]]
+ [[ weaver == \m\a\p\p\y ]]
+ set +x
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash -le

cd $WORKSPACE/${BUILD_ID}/

./scream/components/eamxx/scripts/jenkins/jenkins_cleanup.sh
[SCREAM_PullRequest_Autotester_Weaver] $ /bin/bash -le /tmp/jenkins6293765846833960453.sh
POST BUILD TASK : SUCCESS
END OF POST BUILD TASK : 0
Sending e-mails to: lbertag@sandia.gov
Finished: SUCCESS

SCREAM_PullRequest_Autotester_Mappy # 5959 FAILED (last 100 lines of console output)

Running as SYSTEM
Building remotely on mappy in workspace /ascldap/users/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy
[ssh-agent] Looking for ssh-agent implementation...
$ ssh-agent
SSH_AUTH_SOCK=/tmp/ssh-XXXXXXKWUbL1/agent.1588474
SSH_AGENT_PID=1588476
[ssh-agent] Started.
$ ssh-agent -k
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 1588476 killed;
[ssh-agent] Stopped.
FATAL: Failed to create a temp file on /ascldap/users/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy@tmp
java.io.IOException: No space left on device
	at java.base/java.io.UnixFileSystem.createFileExclusively0(Native Method)
	at java.base/java.io.UnixFileSystem.createFileExclusively(UnixFileSystem.java:258)
	at java.base/java.io.File.createTempFile(File.java:2184)
	at Jenkins v2.462.2//hudson.FilePath$CreateTextTempFile.invoke(FilePath.java:1688)
Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to mappy
		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1826)
		at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356)
		at hudson.remoting.Channel.call(Channel.java:1042)
		at hudson.FilePath.act(FilePath.java:1229)
		at hudson.FilePath.act(FilePath.java:1218)
		at hudson.FilePath.createTextTempFile(FilePath.java:1659)
		at hudson.FilePath.createTextTempFile(FilePath.java:1632)
		at PluginClassLoader for ssh-agent//com.cloudbees.jenkins.plugins.sshagent.exec.ExecRemoteAgent.addIdentity(ExecRemoteAgent.java:74)
		at PluginClassLoader for ssh-agent//com.cloudbees.jenkins.plugins.sshagent.SSHAgentBuildWrapper$SSHAgentEnvironment.add(SSHAgentBuildWrapper.java:324)
		at PluginClassLoader for ssh-agent//com.cloudbees.jenkins.plugins.sshagent.SSHAgentBuildWrapper.preCheckout(SSHAgentBuildWrapper.java:227)
		at jenkins.scm.SCMCheckoutStrategy.preCheckout(SCMCheckoutStrategy.java:75)
		at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:520)
		at hudson.model.Run.execute(Run.java:1894)
		at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
		at hudson.model.ResourceController.execute(ResourceController.java:101)
		at hudson.model.Executor.run(Executor.java:446)
Caused: java.io.IOException: Failed to create a temporary directory in /ascldap/users/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy@tmp
	at Jenkins v2.462.2//hudson.FilePath$CreateTextTempFile.invoke(FilePath.java:1690)
	at Jenkins v2.462.2//hudson.FilePath$CreateTextTempFile.invoke(FilePath.java:1665)
	at Jenkins v2.462.2//hudson.FilePath$FileCallableWrapper.call(FilePath.java:3615)
	at hudson.remoting.UserRequest.perform(UserRequest.java:211)
	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
	at hudson.remoting.Request$2.run(Request.java:377)
	at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
Caused: java.io.IOException: Failed to create a temp file on /ascldap/users/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy@tmp
	at hudson.FilePath.createTextTempFile(FilePath.java:1661)
	at hudson.FilePath.createTextTempFile(FilePath.java:1632)
	at PluginClassLoader for ssh-agent//com.cloudbees.jenkins.plugins.sshagent.exec.ExecRemoteAgent.addIdentity(ExecRemoteAgent.java:74)
	at PluginClassLoader for ssh-agent//com.cloudbees.jenkins.plugins.sshagent.SSHAgentBuildWrapper$SSHAgentEnvironment.add(SSHAgentBuildWrapper.java:324)
	at PluginClassLoader for ssh-agent//com.cloudbees.jenkins.plugins.sshagent.SSHAgentBuildWrapper.preCheckout(SSHAgentBuildWrapper.java:227)
	at jenkins.scm.SCMCheckoutStrategy.preCheckout(SCMCheckoutStrategy.java:75)
	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:520)
	at hudson.model.Run.execute(Run.java:1894)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
	at hudson.model.ResourceController.execute(ResourceController.java:101)
	at hudson.model.Executor.run(Executor.java:446)
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script  : #!/bin/bash -le

cd $WORKSPACE/${BUILD_ID}/

./scream/components/eamxx/scripts/jenkins/jenkins_cleanup.sh

# We're having issues with some test-launcher jobs hanging forever. So let's make sure we clean all pending test-launcher jobs

squeue -o"%.7i %u %40j" | grep e3sm-jenkins | grep test-launcher | awk '{ print $1 }' | xargs -r scancel

[SCREAM_PullRequest_Autotester_Mappy] $ /bin/bash -le /tmp/jenkins3307510291294681704.sh
/tmp/jenkins3307510291294681704.sh: line 3: cd: /ascldap/users/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5959/: No such file or directory
POST BUILD TASK : FAILURE
END OF POST BUILD TASK : 0
Sending e-mails to: lbertag@sandia.gov
Finished: FAILURE

@E3SM-Bot
Collaborator

Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing.

@E3SM-Bot
Collaborator

Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects:

Pull Request Auto Testing STARTING

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: 6218
  • Status: STARTED

Jenkins Parameters

Parameter Name Value
PR_LABELS AT: RETEST;bugfix;AT: PRE-TEST INSPECTED;GPU
PULLREQUESTNUM 3058
SCREAM_SOURCE_REPO https://github.com/ambrad/scream
SCREAM_SOURCE_SHA d57208c
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 69c16bb
TEST_REPO_ALIAS SCREAM

Build Information

Test Name: SCREAM_PullRequest_Autotester_Mappy

  • Build Num: 5963
  • Status: STARTED

Jenkins Parameters

Parameter Name Value
PR_LABELS AT: RETEST;bugfix;AT: PRE-TEST INSPECTED;GPU
PULLREQUESTNUM 3058
SCREAM_SOURCE_REPO https://github.com/ambrad/scream
SCREAM_SOURCE_SHA d57208c
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 69c16bb
TEST_REPO_ALIAS SCREAM

Using Repos:

Repo: SCREAM (ambrad/scream)
  • Branch: ambrad/eamxx/shoc-0set-bugfix
  • SHA: d57208c
  • Mode: TEST_REPO

Pull Request Author: ambrad

@ambrad
Member Author

ambrad commented Oct 24, 2024

Looks like the issue is triggered by craygnuamdgpu but not crayclang-scream.

@E3SM-Bot
Collaborator

Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED

Pull Request Auto Testing has PASSED

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: 6218
  • Status: PASSED

Jenkins Parameters

Parameter Name Value
PR_LABELS AT: RETEST;bugfix;AT: PRE-TEST INSPECTED;GPU
PULLREQUESTNUM 3058
SCREAM_SOURCE_REPO https://github.com/ambrad/scream
SCREAM_SOURCE_SHA d57208c
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 69c16bb
TEST_REPO_ALIAS SCREAM

Build Information

Test Name: SCREAM_PullRequest_Autotester_Mappy

  • Build Num: 5963
  • Status: PASSED

Jenkins Parameters

Parameter Name Value
PR_LABELS AT: RETEST;bugfix;AT: PRE-TEST INSPECTED;GPU
PULLREQUESTNUM 3058
SCREAM_SOURCE_REPO https://github.com/ambrad/scream
SCREAM_SOURCE_SHA d57208c
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 69c16bb
TEST_REPO_ALIAS SCREAM

@E3SM-Bot
Collaborator

Status Flag 'Pre-Merge Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging
THE LAST COMMIT TO THIS PULL REQUEST HAS NOT BEEN REVIEWED YET!

@E3SM-Bot
Collaborator

All Jobs Finished; status = PASSED, target_sha=d61d592cb04727b1e227a1d37aaa96cb6c314b99, However Inspection must be performed before merge can occur...

@ambrad
Member Author

ambrad commented Oct 24, 2024

@ndkeen is it OK if I merge this?

@ambrad ambrad changed the title from "EAMxx/SHOC: Fix a subtle GPU bug." to "EAMxx/SHOC: Work around an GPU issue." Oct 24, 2024
@ambrad ambrad changed the title from "EAMxx/SHOC: Work around an GPU issue." to "EAMxx/SHOC: Work around a GPU issue." Oct 24, 2024
@ambrad ambrad merged commit 686c582 into E3SM-Project:master Oct 24, 2024
4 of 5 checks passed
brhillman pushed a commit that referenced this pull request Oct 31, 2024
Successfully merging this pull request may close these issues.

Cases using 128 vertical levels on frontier not BFB