Commit 24f926c

update READMEs
1 parent 57c5c05 commit 24f926c

2 files changed: +34 -18 lines changed

.github/workflows/AT2-README.md

Lines changed: 17 additions & 10 deletions
@@ -1,5 +1,6 @@
 # Autotester2 (AT2) Workflow for MAM4xx
-This document contains a brief description of how AT2 is used to automate testing on SNL hardware.
+
+This document contains a brief description of how AT2 is used to automate testing on SNL hardware.
 Additionally, any helpful notes and TODOs may be kept here to assist developers.
 
 ## Overview
@@ -10,9 +11,10 @@ This is done for security/policy reasons and ensures that only those with approv
 
 ### Test Hardware and Compiler Configurations
 
-| Test Name | GPU Brand | GPU Type | Micoarchitecture | Compute Capability | Machine | Compilers |
-| -------------------- | --------- | -------- | ---------------- | ------------------ | ------- | ---------------------------- |
-| gcc_12-3-0_cuda_12-1 | NVIDIA | H100 | Hopper | 9.0 | blake | `gcc` 12.3.0/`nvcc` 12.1.105 |
+| Test Name | GPU Brand | GPU Type | Microarchitecture | Compute Capability | Machine | OS | Compilers |
+| --------------------------------- | --------- | ------------| ----------------- | ------------------ | ------- | ------ | -------------------------------- |
+| GPU AT2 gcc 12.3 cuda 12.1 | NVIDIA | H100 | Hopper | 9.0 | blake | RHEL8 | `gcc` 12.3.0/`nvcc` 12.1.105 |
+| GPU AT2 gcc 13.3 hip 6.2 | AMD | MI250/MI210 | AMD_GFX90A | N/A | caraway | RHEL9 | `gcc` 13.3.0/`hipcc` 6.2.41133-0 |
 
 ### The Flow of the CI Workflow
 
@@ -24,7 +26,8 @@ As of now, the image is of a UBI 8 system, with Spack-installed compilers and al
 
 #### Triggering the Testing Workflow
 
-This autotesting workflow is triggered by opening a pull request to `main` and also by a handful of actions on such a PR that is already open, including:
+This autotesting workflow is triggered by opening a pull request to `main` and
+also by a handful of actions on such a PR that is already open, including:
 
 - `reopened`
 - `ready_for_review`
@@ -40,8 +43,8 @@ or
 
 > **Actions** -> `<Previously-run SNL-AT2 Workflow/Job>` -> **Re-run `[all,this]` job(s)**.
 
-The AT2 configuration on `blake` currently attempts to keep 3 runners available
-to accept jobs at all times.
+The AT2 configuration on `blake` and `caraway` currently attempts to keep 3
+runners per machine available to accept jobs at all times.
 This workflow is configured to allow concurrent testing, so up to 3 test-matrix
 configurations can run at once.
 The concurrency setting is also configured to kill any active job if another
@@ -58,13 +61,17 @@ instance of this workflow is started for the same PR ref.
 
 ## Development Details
 
-Most of the required configuration is provided by the AT2 docs and instructional Confluence page (on the Sandia network :confused:--reach out if you need access).
+Most of the required configuration is provided by the AT2 docs and instructional
+Confluence page (on the Sandia network :confused:--reach out if you need access).
 However, some non-obvious choices and configurations are listed here.
 
-- To add some info to the testing output, we employ a custom action, cribbed from E3SM/EAMxx, that prints out the workflow's trigger.
+- To add some info to the testing output, we employ a custom action, cribbed
+from E3SM/EAMxx, that prints out the workflow's trigger.
 
 ### Hacks
 
+- [ ] FIXME(@mjs): This should not be necessary any more, after the changes to the haero build. `build-haero.sh` should be functional for this build now.
+
 - For whatever reason, Skywalker does not like building in the `gcc_12-3-0_cuda_12-1` container for the H100 GPU.
 - This appears to be an issue of the (Haero?) build not auto-detecting the correct Compute Capability (CC 9.0 => `sm_90`).
 - To overcome this, we first obtain the CC flag via `nvidia-smi` within the testing container.
@@ -77,4 +84,4 @@ However, some non-obvious choices and configurations are listed here.
 - One token used to fetch and read/write runner information.
 - **Expires 11 April 2026**
 - One token used fetch and read repository information via the API.
-- **Expires 2 May 2025**
+- **Expires 6 May 2026**
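
Reviewer note on the "Hacks" hunk above: the README says the Compute Capability is obtained via `nvidia-smi` and mapped as CC 9.0 => `sm_90`, but the diff does not show how. The following is only a minimal sketch of that mapping, not the repository's actual script. The function names `cc_to_sm_flag` and `query_compute_capability` are hypothetical, and the use of the `compute_cap` query field is an assumption about how the value might be read (it requires a reasonably recent NVIDIA driver).

```python
import subprocess


def cc_to_sm_flag(compute_capability: str) -> str:
    """Convert a Compute Capability string such as "9.0" into "sm_90".

    Hypothetical helper: the README only states the mapping CC 9.0 => sm_90.
    """
    major, sep, minor = compute_capability.strip().partition(".")
    if sep != "." or not major.isdigit() or not minor.isdigit():
        raise ValueError(f"unrecognized compute capability: {compute_capability!r}")
    return f"sm_{major}{minor}"


def query_compute_capability(gpu_index: int = 0) -> str:
    """Ask nvidia-smi for one GPU's Compute Capability (assumed query form)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=compute_cap",
            "--format=csv,noheader",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    # With noheader and a single GPU selected, the first line is the value.
    return result.stdout.splitlines()[0].strip()


if __name__ == "__main__":
    # Only meaningful on a machine with an NVIDIA GPU and driver installed.
    print(cc_to_sm_flag(query_compute_capability()))
```

On the H100 runner described in the table, a reported value of `9.0` would yield `sm_90`; values such as the AMD row's `N/A` are rejected with a `ValueError` rather than turned into a bogus flag.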

.github/workflows/README.md

Lines changed: 17 additions & 8 deletions
@@ -12,17 +12,18 @@ To do this, testing is initialized via the top-level workflow, `MAM4xx Autoteste
 
 #### GPU-based Testing
 
-| Test Name | GPU Brand | GPU Type | Microarchitecture | Compute Capability | Machine | Compilers |
-| --------------------------------- | --------- | -------- | ---------------- | ------------------ | ------- | ---------------------------- |
-| GPU AT2 gcc 12.3 cuda 12.1 | NVIDIA | H100 | Hopper | 9.0 | blake | `gcc` 12.3.0/`nvcc` 12.1.105 |
+| Test Name | GPU Brand | GPU Type | Microarchitecture | Compute Capability | Machine | OS | Compilers |
+| --------------------------------- | --------- | ------------| ----------------- | ------------------ | ------- | ------ | -------------------------------- |
+| GPU AT2 gcc 12.3 cuda 12.1 | NVIDIA | H100 | Hopper | 9.0 | blake | RHEL8 | `gcc` 12.3.0/`nvcc` 12.1.105 |
+| GPU AT2 gcc 13.3 hip 6.2 | AMD | MI250/MI210 | AMD_GFX90A | N/A | caraway | RHEL9 | `gcc` 13.3.0/`hipcc` 6.2.41133-0 |
 
 #### CPU-based Testing
 
-**Note:** These are the current specs for GitHub's Ubuntu 22.04 runner and are subject to change.
+**Note:** These are the *current* specs for GitHub's Ubuntu 22.04 runner and are subject to change.
 
-| Test Name | OS | Machine | Compiler |
-| -------------------------------------------- | -------------------- | -------------- | ---------- |
-| GitHub CPU Auto-test Ubuntu 22.04[^gh-ubu2204] | Linux - Ubuntu 22.04 | GitHub Runners | `gcc` 12.3 |
+| Test Name | OS | Machine | Compiler |
+| --------------------------------------- | -------------------- | -------------- | ---------- |
+| CPU GH-runner Ubuntu 22.04[^gh-ubu2204] | Linux - Ubuntu 22.04 | GitHub Runners | `gcc` 12.3 |
 
 ### The Flow of the CI Workflow
 
@@ -48,6 +49,13 @@ Based on the trigger and/or inputs, `MAM4xx Autotester` dispatches sub-workflows
 - ***Note:*** AT2 = "Autotester 2," the second generation of a Sandia-developed GitHub-based testing product.
 - See the [AT2 README](./AT2-README.md) for details about the implementation of the AT2 product.
 
+#### GPU AT2 `gcc` 13.3 `hip` 6.2
+
+- This is largely identical to the above CUDA-based workflow, the salient difference being that we run on AMD hardware, using the `hipcc` C++ compiler.
+- The `caraway` machine has 2 different AMD_GFX90A-architecture MI200-series GPUs available, MI210 and MI250.
+- As of the time of writing, autotesting jobs are assigned one or the other based on availability, to speed up matters.
+- ***Note:*** This could change based on future needs.
+
 #### GitHub CPU Auto-test Ubuntu 22.04
 
 - The full version of this test runs a "matrix-strategy" test running all combinations of
@@ -86,6 +94,7 @@ The current options when manually triggering a workflow are:
 - Test Machine Architecture
 - Current Options:
 - `GPU-NVIDIA_H100`
+- `GPU-AMD_MI200-series`
 - `CPU-Ubuntu_22-04`
 - `ALL`
 - Floating-point Precision
@@ -135,7 +144,7 @@ Refer to the section on [Other Types of Job Control](./AT2-README.md#other-types
 - [x] Unify all CI into a single top-level yaml file that calls the sub-cases.
 - This should provide finer control over what runs and when.
 - @mjschmidt271
-- [ ] Add testing for AMD GPUs on `caraway`.
+- [x] Add testing for AMD GPUs on `caraway`.
 - @jaelynlitz - WIP
 
 ### Low-priority
