
Commit c423ceb

Merge branch 'Sadra-Recipes' into whisper-readme-polish
2 parents 9731c3a + 1f5f5fa

22 files changed: +17106 -0 lines

docs/cpu_inference/readme.md

Lines changed: 169 additions & 0 deletions
@@ -0,0 +1,169 @@
# CPU Inference with Ollama

This blueprint explains how to run large language models on CPUs using Ollama. It covers two main deployment strategies:

- Serving pre-saved models directly from Object Storage
- Pulling models from Ollama and saving them to Object Storage

---

## Why CPU Inference?

CPU inference is ideal for:

- Low-throughput or cost-sensitive deployments
- Offline testing and validation
- Prototyping without a GPU dependency

---

## Supported Models

Ollama supports several high-quality open-source LLMs. Below is a small set of commonly used models:
| Model Name | Description                    |
|------------|--------------------------------|
| gemma      | Lightweight open LLM by Google |
| llama2     | Meta’s large language model    |
| mistral    | Open-weight, performant LLM    |
| phi3       | Microsoft’s compact LLM        |
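If you have the Ollama CLI installed locally, any of these models can be pulled and smoke-tested before being wired into a recipe. This is an optional local check, not part of the blueprint itself; model names match the table above, and download sizes vary by model.

```bash
# Pull the model weights locally, then run a quick prompt against them.
ollama pull gemma
ollama run gemma "What is the capital of France?"
```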
---

## Deploying with OCI AI Blueprints

### Running Ollama Models from Object Storage

If you've already pushed your model to **Object Storage**, use the following service-mode recipe to run it. Ensure your model files are in the **blob + manifest** format used by Ollama.

#### Recipe Configuration

| Field | Description |
|-------------------------------------|--------------------------------------------------|
| recipe_id | `cpu_inference` – Identifier for the recipe |
| recipe_mode | `service` – Used for long-running inference |
| deployment_name | Custom name for the deployment |
| recipe_image_uri | URI for the container image in OCIR |
| recipe_node_shape | OCI compute shape, e.g., `BM.Standard.E4.128` |
| input_object_storage | Object Storage bucket mounted as input |
| recipe_container_env | List of environment variables |
| recipe_replica_count | Number of replicas |
| recipe_container_port | Port exposed by the container |
| recipe_node_pool_size | Number of nodes in the pool |
| recipe_node_boot_volume_size_in_gbs | Boot volume size in GB |
| recipe_container_command_args | Arguments for the container command |
| recipe_ephemeral_storage_size | Temporary scratch storage in GB |

#### Sample Recipe (Service Mode)

```json
{
  "recipe_id": "cpu_inference",
  "recipe_mode": "service",
  "deployment_name": "gemma and BME4 service",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:cpu_inference_service_v0.2",
  "recipe_node_shape": "BM.Standard.E4.128",
  "input_object_storage": [
    {
      "bucket_name": "ollama-models",
      "mount_location": "/models",
      "volume_size_in_gbs": 20
    }
  ],
  "recipe_container_env": [
    { "key": "MODEL_NAME", "value": "gemma" },
    { "key": "PROMPT", "value": "What is the capital of France?" }
  ],
  "recipe_replica_count": 1,
  "recipe_container_port": "11434",
  "recipe_node_pool_size": 1,
  "recipe_node_boot_volume_size_in_gbs": 200,
  "recipe_container_command_args": [
    "--input_directory", "/models", "--model_name", "gemma"
  ],
  "recipe_ephemeral_storage_size": 100
}
```
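The recipe above is just a JSON document, so it can be submitted to an OCI AI Blueprints deployment endpoint with `curl`. The sketch below is an assumption about a typical setup, not an interface defined by this README: the host, path, and authentication header are placeholders you should replace with the values from your own Blueprints installation.

```bash
# Hypothetical example: <BLUEPRINTS_API> and <TOKEN> are placeholders for your
# own OCI AI Blueprints endpoint and credentials; the /deployment path is an
# assumption, not something defined by this recipe.
curl -s -X POST "https://<BLUEPRINTS_API>/deployment" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d @cpu_inference_service.json   # the sample recipe above, saved to a file
```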
---

### Accessing the API

Once deployed, send inference requests to the model via the exposed port:

```bash
curl http://<PUBLIC_IP>:11434/api/generate -d '{
  "model": "gemma",
  "prompt": "What is the capital of France?",
  "stream": false
}'
```
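Because the request sets `"stream": false`, the server answers with a single JSON object, so the generated text can be extracted directly with `jq` (assuming `jq` is installed on the client):

```bash
# Print only the generated answer from the non-streaming response.
curl -s http://<PUBLIC_IP>:11434/api/generate -d '{
  "model": "gemma",
  "prompt": "What is the capital of France?",
  "stream": false
}' | jq -r '.response'
```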
### Example Public Inference Calls

```bash
curl -L -X POST https://cpu-inference-mismistral.130-162-199-33.nip.io/api/generate \
  -d '{ "model": "mistral", "prompt": "What is the capital of Germany?" }' \
  | jq -r 'select(.response) | .response' | paste -sd " "

curl -L -k -X POST https://cpu-inference-mistral-flexe4.130-162-199-33.nip.io/api/generate \
  -d '{ "model": "mistral", "prompt": "What is the capital of Germany?" }' \
  | jq -r 'select(.response) | .response' | paste -sd " "
```

---
### Pulling from Ollama and Saving to Object Storage

To download a model from Ollama and store it in Object Storage, use the job-mode recipe below.

#### Recipe Configuration

| Field | Description |
|-------------------------------------|----------------------------------------------|
| recipe_id | `cpu_inference` – Same recipe base |
| recipe_mode | `job` – One-time job to save a model |
| deployment_name | Custom name for the saving job |
| recipe_image_uri | OCIR URI of the saver image |
| recipe_node_shape | Compute shape used for the job |
| output_object_storage | Where to store pulled models |
| recipe_container_env | Environment variables including model name |
| recipe_replica_count | Set to 1 |
| recipe_container_port | Typically `11434` for Ollama |
| recipe_node_pool_size | Set to 1 |
| recipe_node_boot_volume_size_in_gbs | Boot volume size in GB |
| recipe_container_command_args | Set output directory and model name |
| recipe_ephemeral_storage_size | Temporary scratch storage in GB |

#### Sample Recipe (Job Mode)

```json
{
  "recipe_id": "cpu_inference",
  "recipe_mode": "job",
  "deployment_name": "gemma and BME4 saver",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:cpu_inference_saver_v0.2",
  "recipe_node_shape": "BM.Standard.E4.128",
  "output_object_storage": [
    {
      "bucket_name": "ollama-models",
      "mount_location": "/models",
      "volume_size_in_gbs": 20
    }
  ],
  "recipe_container_env": [
    { "key": "MODEL_NAME", "value": "gemma" },
    { "key": "PROMPT", "value": "What is the capital of France?" }
  ],
  "recipe_replica_count": 1,
  "recipe_container_port": "11434",
  "recipe_node_pool_size": 1,
  "recipe_node_boot_volume_size_in_gbs": 200,
  "recipe_container_command_args": [
    "--output_directory", "/models", "--model_name", "gemma"
  ],
  "recipe_ephemeral_storage_size": 100
}
```
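After the saver job completes, you can confirm that the Ollama blobs and manifest actually landed in the bucket. This is an optional check using the OCI CLI (assuming it is installed and configured with access to the `ollama-models` bucket); exact object names depend on the model you pulled.

```bash
# List the objects written by the saver job.
oci os object list --bucket-name ollama-models --all --query 'data[].name' --output table
```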
---

## Final Notes

- Ensure all OCI IAM permissions are set to allow Object Storage access (see the example policy after this list).
- Confirm that the bucket region and deployment region match.
- Use the job-mode recipe once to save a model, and the service-mode recipe repeatedly to serve it.
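As an illustration of the first point, Object Storage access is typically granted through an IAM policy on the principal your nodes run as; a statement of the form `Allow dynamic-group <your-dynamic-group> to manage objects in compartment <your-compartment> where target.bucket.name = 'ollama-models'` covers the bucket used in the recipes above. The dynamic group and compartment names are placeholders; adapt them to how your tenancy is organized.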

docs/healthcheck/readme.md

Lines changed: 202 additions & 0 deletions
@@ -0,0 +1,202 @@
# GPU Health Check & Pre-Check Blueprint

This blueprint provides a **pre-check workflow** for GPU health validation before running production or research workloads. The focus is on delivering a **diagnostic tool** that can run on both single-node and multi-node environments, ensuring that your infrastructure is ready for demanding experiments.

The workflow includes:

- **Data types** as input (`fp8`, `fp16`, `fp32`, `fp64`)
- **Custom Functions** for GPU diagnostics
- **GPU-Burn** for stress testing
- **Results** collected in JSON files (and optionally PDF reports)

By following this blueprint, you can identify and localize issues such as thermal throttling, power irregularities, or GPU instability before they impact your main workloads.

---
## 1. Architecture Overview

Below is a simplified overview:

<img width="893" alt="image" src="https://github.com/user-attachments/assets/e44f7ffe-19cf-48be-a026-e27fddfbed3c" />

### Key Points

- Data Types: You can specify one of several floating-point precisions (`fp8`, `fp16`, `fp32`, `fp64`).
- Custom Functions: Diagnostic functions that measure performance metrics such as throughput, memory bandwidth, etc.
- Single-Node vs. Multi-Node: Tests can be run on a single machine or scaled to multiple machines.
- GPU-Burn: A specialized stress-testing tool for pushing GPUs to their maximum performance limits.
- Results: Output is aggregated into JSON files (and optionally PDFs) for analysis.

---
33+
34+
This blueprint aims to give you confidence that your GPUs are healthy. The key checks include:
35+
36+
1. Compute Throughput
37+
- Dense matrix multiplications or arithmetic operations stress the GPU cores.
38+
- Ensures sustained performance without degradation.
39+
40+
2. Memory Bandwidth
41+
- Reading/writing large chunks of data (e.g., `torch.rand()`) tests memory throughput.
42+
- Verifies the memory subsystem operates at expected speeds.
43+
44+
3. Temperature & Thermal Stability
45+
- Uses commands like nvidia-smi to monitor temperature.
46+
- Checks for throttling under load.
47+
48+
4. Power Consumption
49+
- Monitors power draw (e.g., `nvidia-smi --query-gpu=power.draw --format=csv`).
50+
- Identifies irregular or excessive power usage.
51+
52+
5. GPU Utilization
53+
- Ensures GPU cores (including Tensor Cores) are fully engaged during tests.
54+
- Confirms no unexpected idle time.
55+
56+
6. Error Detection
57+
- Checks for hardware errors or CUDA-related issues.
58+
- Asserts numerical correctness to ensure no silent failures.
59+
60+
7. Multi-GPU Testing
61+
- Validates multi-GPU or multi-node setups.
62+
- Ensures the entire environment is consistent and stable.
63+
64+
8. Mixed Precision Testing
65+
- Uses AMP for fp8 or fp16 operations (e.g., `torch.cuda.amp.autocast()`).
66+
- Confirms performance and compatibility with mixed-precision ops.
67+
68+
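The temperature, power, and utilization checks above are easy to reproduce by hand while a workload is running. The snippet below is a minimal monitoring sketch, not part of the blueprint's own scripts; it assumes `nvidia-smi` is on the PATH and samples the same metrics every five seconds.

```bash
# Sample temperature, power draw, utilization, and memory use every 5 seconds.
# Stop with Ctrl-C; redirect to a file to keep a log alongside the JSON results.
nvidia-smi \
  --query-gpu=timestamp,index,name,temperature.gpu,power.draw,utilization.gpu,memory.used \
  --format=csv \
  -l 5
```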
---
## 3. Data Types and How They Work

- `fp8`, `fp16`: Lower precision can offer speedups but requires checks for numerical stability.
- `fp32` (single precision): Standard for most deep learning tasks; tests confirm typical GPU operations.
- `fp64` (double precision): Used in HPC/scientific workloads; verifies performance and accuracy at high precision.

Depending on the dtype you select, the script either runs a set of Custom Functions or launches GPU-Burn to push the hardware to its limits. The results are saved in JSON for analysis.
---

## 4. Custom Functions

These Python-based diagnostic functions systematically measure:

- Throughput (matrix multiplies, convolution stubs, etc.)
- Memory bandwidth (large tensor reads/writes)
- Temperature (via `nvidia-smi` or other sensors)
- Power usage
- GPU utilization
- Error detection (assert checks, error logs)
- Multi-GPU orchestration (parallel usage correctness)
- Mixed precision compatibility (AMP in PyTorch)

They can run on a single node or multiple nodes, with each run producing structured JSON output.
---

## 5. GPU-Burn

[GPU-Burn](https://github.com/wilicc/gpu-burn) is a stress-testing tool designed to push GPUs to their maximum performance limits. It is typically used to:

- Validate hardware stability
- Identify potential overheating or faulty components
- Confirm GPUs can handle extreme workloads without errors or throttling

When you run GPU-Burn in float32 or float64 mode, its output can be captured in a log file, then parsed into JSON or PDF summaries for reporting.
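Outside the container, GPU-Burn can also be built and run directly for a quick sanity check. This is a standalone sketch based on the upstream project, not part of this blueprint's image; the burn duration and the `-d` (double precision) flag follow the gpu-burn README, so consult that README if your version differs.

```bash
# Build GPU-Burn from source (requires the CUDA toolkit and make).
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
make

# 300-second float32 burn, capturing output for later parsing.
./gpu_burn 300 | tee gpu_burn_fp32.log

# Same test in double precision (float64).
./gpu_burn -d 300 | tee gpu_burn_fp64.log
```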
---

## 6. Usage

1. Clone the Blueprint & Install Dependencies

   ```bash
   git clone <repo_url>
   cd <repo_name>
   docker build -t gpu-healthcheck .
   ```

2. Run the Pre-Check

   - Single Node Example (fp16):

     ```bash
     docker run --gpus all -it -v $(pwd)/results:/app/testing_results gpu-healthcheck --dtype float16 --expected_gpus A10:2,A100:0,H100:0
     ```

   - GPU-Burn Stress Test (float32):

     ```bash
     docker run --gpus all -it -v $(pwd)/results:/app/testing_results gpu-healthcheck --dtype float32 --expected_gpus A10:2,A100:0,H100:0
     ```

3. Examine Results

   - JSON output is located in the `results/` directory.
   - PDF summaries will also be generated.
---
## 7. Implementing it into OCI AI Blueprints

This is an example of a JSON recipe that can be used to deploy this health check with OCI AI Blueprints:

```json
{
  "recipe_id": "healthcheck",
  "recipe_mode": "job",
  "deployment_name": "healthcheck",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:healthcheck_v0.3",
  "recipe_node_shape": "VM.GPU.A10.2",
  "output_object_storage": [
    {
      "bucket_name": "healthcheck2",
      "mount_location": "/healthcheck_results",
      "volume_size_in_gbs": 20
    }
  ],
  "recipe_container_command_args": [
    "--dtype", "float16", "--output_dir", "/healthcheck_results", "--expected_gpus", "A10:2,A100:0,H100:0"
  ],
  "recipe_replica_count": 1,
  "recipe_nvidia_gpu_count": 2,
  "recipe_node_pool_size": 1,
  "recipe_node_boot_volume_size_in_gbs": 200,
  "recipe_ephemeral_storage_size": 100,
  "recipe_shared_memory_volume_size_limit_in_mb": 1000,
  "recipe_use_shared_node_pool": true
}
```
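Because the recipe mounts the `healthcheck2` bucket at `/healthcheck_results`, the JSON (and any PDF) reports end up in Object Storage once the job finishes. Assuming the OCI CLI is installed and configured with access to that bucket, they can be pulled down for review; the target directory below is just an example.

```bash
# Download every report the healthcheck job wrote to the bucket.
oci os object bulk-download \
  --bucket-name healthcheck2 \
  --download-dir ./healthcheck_results
```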
---

## Explanation of Healthcheck Recipe Fields

| Field | Type | Example Value | Description |
|------------------------------------------------|---------|--------------------------------------------------------------------------------------------------|-------------|
| `recipe_id` | string | `"healthcheck"` | Identifier for the recipe |
| `recipe_mode` | string | `"job"` | Whether the recipe runs as a one-time job or a service |
| `deployment_name` | string | `"healthcheck"` | Name of the deployment/job |
| `recipe_image_uri` | string | `"iad.ocir.io/.../healthcheck_v0.3"` | URI of the container image stored in OCI Container Registry |
| `recipe_node_shape` | string | `"VM.GPU.A10.2"` | Compute shape to use for this job |
| `output_object_storage.bucket_name` | string | `"healthcheck2"` | Name of the Object Storage bucket to write results to |
| `output_object_storage.mount_location` | string | `"/healthcheck_results"` | Directory inside the container where the bucket will be mounted |
| `output_object_storage.volume_size_in_gbs` | integer | `20` | Storage volume size (GB) for the mounted bucket |
| `recipe_container_command_args` | list | `["--dtype", "float16", "--output_dir", "/healthcheck_results", "--expected_gpus", "A10:2,A100:0,H100:0"]` | Arguments passed to the container |
| `--dtype` | string | `"float16"` | Precision type for computations (e.g., float16, float32) |
| `--output_dir` | string | `"/healthcheck_results"` | Directory for writing output (maps to the mounted bucket) |
| `--expected_gpus` | string | `"A10:2,A100:0,H100:0"` | Expected GPU types and counts |
| `recipe_replica_count` | integer | `1` | Number of replicas (containers) to run |
| `recipe_nvidia_gpu_count` | integer | `2` | Number of GPUs to allocate |
| `recipe_node_pool_size` | integer | `1` | Number of nodes to provision |
| `recipe_node_boot_volume_size_in_gbs` | integer | `200` | Size of the boot volume (GB) |
| `recipe_ephemeral_storage_size` | integer | `100` | Ephemeral scratch storage size (GB) |
| `recipe_shared_memory_volume_size_limit_in_mb` | integer | `1000` | Size of the shared memory volume (`/dev/shm`) in MB |
| `recipe_use_shared_node_pool` | boolean | `true` | Whether to run on a shared node pool |
## 8. Contact

For questions or additional information, open an issue in this blueprint or contact the maintainers directly.
