# GPU Health Check & Pre-Check Blueprint

This blueprint provides a pre-check for validating GPU health before running production or research workloads. The focus is on delivering a **diagnostic tool** that runs on both single-node and multi-node environments, ensuring that your infrastructure is ready for demanding experiments.

The workflow includes:
- **Data types** as input (`fp8`, `fp16`, `fp32`, `fp64`)
- **Custom Functions** for GPU diagnostics
- **GPU-Burn** for stress testing
- **Results** collected in JSON files (and optionally PDF reports)

By following this blueprint, you can identify and localize issues such as thermal throttling, power irregularities, or GPU instability before they impact your main workloads.

---

## 1. Architecture Overview

Below is a simplified overview:

<img width="893" alt="image" src="https://github.com/user-attachments/assets/e44f7ffe-19cf-48be-a026-e27fddfbed3c" />

### Key Points

- **Data Types**: You can specify one of several floating-point precisions (`fp8`, `fp16`, `fp32`, `fp64`).
- **Custom Functions**: Diagnostic functions that measure performance metrics such as throughput and memory bandwidth.
- **Single-Node vs. Multi-Node**: Tests can run on a single machine or scale to multiple machines.
- **GPU-Burn**: A specialized stress-testing tool for pushing GPUs to their maximum performance limits.
- **Results**: Output is aggregated into JSON files (and optionally PDFs) for analysis.

---

## 2. Health Check Blueprint

This blueprint aims to give you confidence that your GPUs are healthy. The key checks include:

1. Compute Throughput
   - Dense matrix multiplications or arithmetic operations stress the GPU cores.
   - Ensures sustained performance without degradation.

2. Memory Bandwidth
   - Reading/writing large chunks of data (e.g., `torch.rand()`) tests memory throughput.
   - Verifies the memory subsystem operates at expected speeds.

3. Temperature & Thermal Stability
   - Uses commands like `nvidia-smi` to monitor temperature.
   - Checks for throttling under load.

4. Power Consumption
   - Monitors power draw (e.g., `nvidia-smi --query-gpu=power.draw --format=csv`).
   - Identifies irregular or excessive power usage.

5. GPU Utilization
   - Ensures GPU cores (including Tensor Cores) are fully engaged during tests.
   - Confirms no unexpected idle time.

6. Error Detection
   - Checks for hardware errors or CUDA-related issues.
   - Asserts numerical correctness to ensure no silent failures.

7. Multi-GPU Testing
   - Validates multi-GPU or multi-node setups.
   - Ensures the entire environment is consistent and stable.

8. Mixed Precision Testing
   - Uses AMP for fp8 or fp16 operations (e.g., `torch.cuda.amp.autocast()`).
   - Confirms performance and compatibility with mixed-precision ops.
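
The temperature and power checks above (items 3 and 4) can be sketched by querying `nvidia-smi` in CSV mode and parsing its output. This is a minimal sketch, not the blueprint's actual implementation: `query_gpu_stats` and `parse_gpu_stats` are hypothetical helpers, and the `sample` string stands in for real `nvidia-smi` output.

```python
import subprocess

def query_gpu_stats():
    """Query temperature (C) and power draw (W) for each GPU via nvidia-smi.

    Returns a list of (temperature, power) tuples, one per GPU.
    """
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=temperature.gpu,power.draw",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_stats(out)

def parse_gpu_stats(csv_text):
    """Parse 'temperature, power' CSV lines into (int, float) tuples."""
    stats = []
    for line in csv_text.strip().splitlines():
        temp, power = (field.strip() for field in line.split(","))
        stats.append((int(temp), float(power)))
    return stats

# Example: two GPUs, as nvidia-smi would report them with the flags above.
sample = "41, 62.50\n44, 61.13\n"
print(parse_gpu_stats(sample))  # [(41, 62.5), (44, 61.13)]
```

Keeping the parsing separate from the subprocess call makes the check easy to unit-test without GPU hardware present.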

---

## 3. Data Types and How They Work

- `fp8`, `fp16`: Lower precision can offer speedups but requires checks for numerical stability.
- `fp32` (single precision): Standard for most deep learning tasks; tests confirm typical GPU operations.
- `fp64` (double precision): Used in HPC/scientific workloads; verifies performance and accuracy at high precision.

Depending on the dtype you select, the script either runs a set of Custom Functions or launches GPU-Burn to push the hardware to its limits. The results are saved as JSON for analysis.
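
That dispatch can be sketched as a simple dtype-to-mode mapping. This is one plausible reading (GPU-Burn is described later as running in float32/float64 mode, so lower precisions are assumed to take the Custom Functions path); the function and dtype-set names are illustrative, not the script's actual interface.

```python
# Assumed split: low-precision dtypes exercise the custom diagnostic
# functions, while float32/float64 hand off to GPU-Burn.
CUSTOM_FUNCTION_DTYPES = {"float8", "float16"}
GPU_BURN_DTYPES = {"float32", "float64"}

def select_test_mode(dtype: str) -> str:
    """Return which test path a given --dtype value should take."""
    if dtype in CUSTOM_FUNCTION_DTYPES:
        return "custom_functions"
    if dtype in GPU_BURN_DTYPES:
        return "gpu_burn"
    raise ValueError(f"unsupported dtype: {dtype}")

print(select_test_mode("float16"))  # custom_functions
print(select_test_mode("float64"))  # gpu_burn
```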

---

## 4. Custom Functions

These Python-based diagnostic functions systematically measure:

- Throughput (matrix multiplies, convolution stubs, etc.)
- Memory bandwidth (large tensor reads/writes)
- Temperature (via `nvidia-smi` or other sensors)
- Power usage
- GPU utilization
- Error detection (assert checks, error logs)
- Multi-GPU orchestration (parallel usage correctness)
- Mixed precision compatibility (AMP in PyTorch)

They can run on a single node or across multiple nodes, with each run producing structured JSON output.
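
The structured JSON output might look like the following; the exact schema here is an assumption for illustration, and `write_result` is a hypothetical helper, not part of the actual tool.

```python
import json
import os
import tempfile
import time

def write_result(path, gpu_index, metrics):
    """Write one diagnostic run's metrics as a structured JSON record."""
    record = {
        "gpu_index": gpu_index,
        "timestamp": time.time(),
        "metrics": metrics,  # e.g. throughput, bandwidth, peak temperature
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return record

# Illustrative metric names and values only.
out_path = os.path.join(tempfile.gettempdir(), "gpu0_result.json")
record = write_result(
    out_path, 0,
    {"matmul_tflops": 31.2, "mem_bandwidth_gbs": 540.0, "max_temp_c": 68},
)
print(record["metrics"]["max_temp_c"])  # 68
```

One record per GPU per run keeps the files easy to aggregate later into a node-level or fleet-level report.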

---

## 5. GPU-Burn

[GPU-Burn](https://github.com/wilicc/gpu-burn) is a stress-testing tool designed to push GPUs to their maximum performance limits. It is typically used to:

- Validate hardware stability
- Identify potential overheating or faulty components
- Confirm GPUs can handle extreme workloads without errors or throttling

When you run GPU-Burn in float32 or float64 mode, its output can be captured in a log file, then parsed into JSON or PDF summaries for reporting.
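
Parsing that log might look like the sketch below. It assumes GPU-Burn's per-GPU summary lines take the form `GPU 0: OK` / `GPU 1: FAULTY` (check your version's actual output before relying on this), and the `sample_log` is fabricated for illustration.

```python
import json
import re

# Matches assumed per-GPU summary lines, e.g. "GPU 0: OK" or "GPU 1: FAULTY".
SUMMARY_RE = re.compile(r"^GPU\s+(\d+):\s+(OK|FAULTY)\s*$", re.MULTILINE)

def parse_gpu_burn_log(log_text):
    """Extract per-GPU pass/fail results from a GPU-Burn log."""
    return {int(idx): status == "OK"
            for idx, status in SUMMARY_RE.findall(log_text)}

# Illustrative log excerpt, not real GPU-Burn output.
sample_log = """\
Burning for 60 seconds.
100.0%  proc'd: 4096 (30012 Gflop/s)   errors: 0   temps: 71 C
GPU 0: OK
GPU 1: FAULTY
"""
print(json.dumps(parse_gpu_burn_log(sample_log)))
```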

---


## 6. Usage

1. Clone the Blueprint & Install Dependencies

   ```bash
   git clone <repo_url>
   cd <repo_name>
   docker build -t gpu-healthcheck .
   ```

2. Run the Pre-Check
   - Single-Node Example (fp16):

     ```bash
     docker run --gpus all -it -v $(pwd)/results:/app/testing_results gpu-healthcheck --dtype float16 --expected_gpus A10:2,A100:0,H100:0
     ```

   - GPU-Burn Stress Test (float32):

     ```bash
     docker run --gpus all -it -v $(pwd)/results:/app/testing_results gpu-healthcheck --dtype float32 --expected_gpus A10:2,A100:0,H100:0
     ```

3. Examine Results
   - JSON output is located in the `results/` directory.
   - PDF summaries are also generated.
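
The `--expected_gpus` argument is a comma-separated list of `MODEL:COUNT` pairs. Checking it against the detected hardware might look like the sketch below; `check_gpus` and the plain-dict `detected` input are illustrative assumptions, not the script's actual interface.

```python
def parse_expected_gpus(spec: str) -> dict:
    """Parse an --expected_gpus spec like 'A10:2,A100:0,H100:0'."""
    expected = {}
    for pair in spec.split(","):
        model, count = pair.split(":")
        expected[model.strip()] = int(count)
    return expected

def check_gpus(spec: str, detected: dict) -> bool:
    """Return True if detected GPU counts match the expected spec."""
    expected = parse_expected_gpus(spec)
    return all(detected.get(model, 0) == count
               for model, count in expected.items())

print(parse_expected_gpus("A10:2,A100:0,H100:0"))  # {'A10': 2, 'A100': 0, 'H100': 0}
print(check_gpus("A10:2,A100:0,H100:0", {"A10": 2}))  # True
```

A zero count (e.g. `A100:0`) asserts that no GPUs of that model are present, which catches misprovisioned nodes as well as missing ones.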

---

## 7. Implementing It into OCI AI Blueprints

Below is an example JSON file that can be used to deploy this blueprint into OCI AI Blueprints:

```json
{
  "recipe_id": "healthcheck",
  "recipe_mode": "job",
  "deployment_name": "healthcheck",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:healthcheck_v0.3",
  "recipe_node_shape": "VM.GPU.A10.2",
  "output_object_storage": [
    {
      "bucket_name": "healthcheck2",
      "mount_location": "/healthcheck_results",
      "volume_size_in_gbs": 20
    }
  ],
  "recipe_container_command_args": [
    "--dtype", "float16", "--output_dir", "/healthcheck_results", "--expected_gpus", "A10:2,A100:0,H100:0"
  ],
  "recipe_replica_count": 1,
  "recipe_nvidia_gpu_count": 2,
  "recipe_node_pool_size": 1,
  "recipe_node_boot_volume_size_in_gbs": 200,
  "recipe_ephemeral_storage_size": 100,
  "recipe_shared_memory_volume_size_limit_in_mb": 1000,
  "recipe_use_shared_node_pool": true
}
```

### Explanation of Healthcheck Recipe Fields

| Field | Type | Example Value | Description |
|-------|------|---------------|-------------|
| `recipe_id` | string | `"healthcheck"` | Identifier for the recipe |
| `recipe_mode` | string | `"job"` | Whether the recipe runs as a one-time job or a service |
| `deployment_name` | string | `"healthcheck"` | Name of the deployment/job |
| `recipe_image_uri` | string | `"iad.ocir.io/.../healthcheck_v0.3"` | URI of the container image stored in OCI Container Registry |
| `recipe_node_shape` | string | `"VM.GPU.A10.2"` | Compute shape to use for this job |
| `output_object_storage.bucket_name` | string | `"healthcheck2"` | Name of the Object Storage bucket to write results |
| `output_object_storage.mount_location` | string | `"/healthcheck_results"` | Directory inside the container where the bucket will be mounted |
| `output_object_storage.volume_size_in_gbs` | integer | `20` | Storage volume size (GB) for the mounted bucket |
| `recipe_container_command_args` | list | `["--dtype", "float16", ...]` | Arguments passed to the container |
| `--dtype` | string | `"float16"` | Precision type for computations (e.g., float16, float32) |
| `--output_dir` | string | `"/healthcheck_results"` | Directory for writing output (maps to mounted bucket) |
| `--expected_gpus` | string | `"A10:2,A100:0,H100:0"` | Expected GPU types and counts |
| `recipe_replica_count` | integer | `1` | Number of replicas (containers) to run |
| `recipe_nvidia_gpu_count` | integer | `2` | Number of GPUs to allocate |
| `recipe_node_pool_size` | integer | `1` | Number of nodes to provision |
| `recipe_node_boot_volume_size_in_gbs` | integer | `200` | Size of the boot volume (GB) |
| `recipe_ephemeral_storage_size` | integer | `100` | Ephemeral scratch storage size (GB) |
| `recipe_shared_memory_volume_size_limit_in_mb` | integer | `1000` | Size of shared memory volume (`/dev/shm`) in MB |
| `recipe_use_shared_node_pool` | boolean | `true` | Whether to run on a shared node pool |

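A quick sanity check before submitting a recipe might verify the required fields and their types. The required-field list below is inferred from the example recipe above, not an official OCI AI Blueprints schema, and `validate_recipe` is a hypothetical helper.

```python
# Fields and types taken from the example recipe above; this is an
# inferred subset, not an official schema.
REQUIRED_FIELDS = {
    "recipe_id": str,
    "recipe_mode": str,
    "deployment_name": str,
    "recipe_image_uri": str,
    "recipe_node_shape": str,
    "recipe_replica_count": int,
    "recipe_nvidia_gpu_count": int,
}

def validate_recipe(recipe: dict) -> list:
    """Return a list of problems; an empty list means the recipe looks sane."""
    problems = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in recipe:
            problems.append(f"missing field: {field}")
        elif not isinstance(recipe[field], ftype):
            problems.append(f"wrong type for {field}: expected {ftype.__name__}")
    return problems

print(validate_recipe({"recipe_id": "healthcheck"}))  # lists the missing fields
```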
## 8. Contact

For questions or additional information, open an issue in this repository or contact the maintainers directly.