EAMxx/SHOC: Work around a GPU issue. #3058
if (k+1 == nlev_packs) zi_grid(i,nlevi_v)[nlevi_p] = 0;
});
zi_grid(i,nlevi_v)[nlevi_p] = 0;
Placing a team_barrier before this line would also work.
I'm puzzled. Why was this line buggy? All threads would execute that line. Although it was unnecessary (so I'm OK with the change), it was not incorrect, since all threads set the same value and there was a barrier right after.
I agree. If the previous kernel were over nlevi rather than nlev, I could make an argument about memory consistency. But since it's over nlev, I can't think of any explanation. Maybe this fix is actually working around a compiler-side issue.
The evidence for this specific line being the issue is that the _sfc quantities a few lines below this code block were all wrong. It's possible there's some other bug I'm not seeing that this fix handles. If so, that bug must affect these _sfc quantities.
I'll weaken my suggestion that other C++ devs study this PR since it may just be working around a compiler-side problem.
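For readers following the thread, here is a minimal serial sketch of the two variants being compared. It is plain C++ with no Kokkos and no packs. The function names `fill_before` and `fill_after` and the placeholder height formula are hypothetical, invented for illustration only. It shows why the reviewers expected the two forms to be equivalent; it does not reproduce the GPU misbehavior.

```cpp
#include <vector>

// Hypothetical scalar stand-in for one column of zi_grid.
// nlev midpoint levels, nlev + 1 interface levels.

// "Before": the bottom interface value is written inside the loop over nlev,
// guarded by a check for the last iteration.
std::vector<double> fill_before(int nlev) {
  std::vector<double> zi(nlev + 1, -1.0);
  for (int k = 0; k < nlev; ++k) {
    zi[k] = static_cast<double>(nlev - k);  // placeholder, not the real formula
    if (k + 1 == nlev) zi[nlev] = 0.0;
  }
  return zi;
}

// "After": the bottom interface value is written once, after the loop.
std::vector<double> fill_after(int nlev) {
  std::vector<double> zi(nlev + 1, -1.0);
  for (int k = 0; k < nlev; ++k) {
    zi[k] = static_cast<double>(nlev - k);  // placeholder, not the real formula
  }
  zi[nlev] = 0.0;
  return zi;
}
```

In serial code, `fill_before(128)` and `fill_after(128)` return identical arrays, which is consistent with the suspicion above that the original form was not logically wrong.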
Fix either a long-latent issue or a compiler-side issue that was recently triggered by an unrelated PR. Add a test for 128 levels to hold us over until the default for all tests is changed to 128 levels. Add an option to global state hasher to hash a user-provided array. This let me hash temporary workspace as part of isolating the issue.
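As an illustration of the "hash a user-provided array" idea, here is a minimal sketch. It is not the EAMxx global state hasher and does not use its API; `hash_array` is a hypothetical helper, and FNV-1a is swapped in as a simple stand-in hash. The point is only that hashing the raw bit patterns of a workspace array lets two runs or two builds be compared exactly.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: FNV-1a (64-bit) over the bit patterns of the values.
// Any difference in a single value's bits changes the result.
std::uint64_t hash_array(const std::vector<double>& a) {
  static_assert(sizeof(double) == sizeof(std::uint64_t),
                "sketch assumes 64-bit doubles");
  std::uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
  for (double x : a) {
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // reinterpret without UB
    for (int b = 0; b < 8; ++b) {
      h ^= (bits >> (8 * b)) & 0xffu;     // mix in one byte at a time
      h *= 1099511628211ULL;              // FNV-1a 64-bit prime
    }
  }
  return h;
}
```

Printing such a hash for a temporary array right after each kernel, in a passing and a failing configuration, narrows down the first point at which the two diverge.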
1445261 to 663216f
If this impl appears to deterministically fix a non-determinism, I'm ok with merging, even if it remains a bit mysterious.
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING
Test Names: SCREAM_PullRequest_Autotester_Weaver, SCREAM_PullRequest_Autotester_Mappy
Pull Request Author: ambrad
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED
Note: Testing will normally be attempted again in approx. 2 Hrs. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run.
Pull Request Auto Testing has FAILED
SCREAM_PullRequest_Autotester_Weaver # 6214 PASSED
SCREAM_PullRequest_Autotester_Mappy # 5959 FAILED
Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing.
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING
Test Names: SCREAM_PullRequest_Autotester_Weaver, SCREAM_PullRequest_Autotester_Mappy
Pull Request Author: ambrad
Looks like the issue is triggered by craygnuamdgpu but not crayclang-scream.
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED
Pull Request Auto Testing has PASSED
Test Names: SCREAM_PullRequest_Autotester_Weaver, SCREAM_PullRequest_Autotester_Mappy
Status Flag 'Pre-Merge Inspection' - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging
All Jobs Finished; status = PASSED, target_sha=d61d592cb04727b1e227a1d37aaa96cb6c314b99, However Inspection must be performed before merge can occur...
@ndkeen is it OK if I merge this?
EAMxx/SHOC: Fix a subtle GPU bug.
Fix either a long-latent issue or a compiler-side issue that was recently triggered by an unrelated PR. Edit: Further analysis and testing strongly suggest this is a compiler-side issue that affects craygnuamdgpu but not crayclang-scream.
Add a test for 128 levels to hold us over until the default for all tests is changed to 128 levels.
Add an option to global state hasher to hash a user-provided array. This let me hash temporary workspace as part of isolating the issue.
Fixes #3053.