Sadra recipes #44

169 changes: 169 additions & 0 deletions docs/cpu_inference/readme.md
# CPU Inference with Ollama

This blueprint explains how to run large language models on CPUs using Ollama. It covers two main deployment strategies:
- Serving pre-saved models directly from Object Storage
- Pulling models from Ollama and saving them to Object Storage

---

## Why CPU Inference?

CPU inference is ideal for:
- Low-throughput or cost-sensitive deployments
- Offline testing and validation
- Prototyping without GPU dependency

---

## Supported Models

Ollama supports several high-quality open-source LLMs. Below is a small set of commonly used models:

| Model Name | Description |
|------------|--------------------------------|
| gemma | Lightweight open LLM by Google |
| llama2 | Meta’s large language model |
| mistral | Open-weight performant LLM |
| phi3 | Microsoft’s compact LLM |
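
The model names above are the identifiers Ollama itself uses, so they are the same values passed to `--model_name` in the recipes below. If you want to fetch one manually with the Ollama CLI, a minimal sketch (available tags and sizes vary by model):

```bash
# Pull a model by name from the Ollama library
ollama pull gemma

# List the models available locally after the pull
ollama list
```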

---

## Deploying with OCI AI Blueprint

### Running Ollama Models from Object Storage

If you've already pushed your model to **Object Storage**, use the following service-mode recipe to run it. Ensure your model files are in the **blob + manifest** format used by Ollama.

#### Recipe Configuration

| Field | Description |
|--------------------------------|------------------------------------------------|
| recipe_id | `cpu_inference` – Identifier for the recipe |
| recipe_mode | `service` – Used for long-running inference |
| deployment_name | Custom name for the deployment |
| recipe_image_uri | URI for the container image in OCIR |
| recipe_node_shape | OCI shape, e.g., `BM.Standard.E4.128` |
| input_object_storage | Object Storage bucket mounted as input |
| recipe_container_env | List of environment variables |
| recipe_replica_count | Number of replicas |
| recipe_container_port | Port to expose the container |
| recipe_node_pool_size | Number of nodes in the pool |
| recipe_node_boot_volume_size_in_gbs | Boot volume size in GB |
| recipe_container_command_args | Arguments for the container command |
| recipe_ephemeral_storage_size | Temporary scratch storage |

#### Sample Recipe (Service Mode)
```json
{
"recipe_id": "cpu_inference",
"recipe_mode": "service",
"deployment_name": "gemma and BME4 service",
"recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:cpu_inference_service_v0.2",
"recipe_node_shape": "BM.Standard.E4.128",
"input_object_storage": [
{
"bucket_name": "ollama-models",
"mount_location": "/models",
"volume_size_in_gbs": 20
}
],
"recipe_container_env": [
{ "key": "MODEL_NAME", "value": "gemma" },
{ "key": "PROMPT", "value": "What is the capital of France?" }
],
"recipe_replica_count": 1,
"recipe_container_port": "11434",
"recipe_node_pool_size": 1,
"recipe_node_boot_volume_size_in_gbs": 200,
"recipe_container_command_args": [
"--input_directory", "/models", "--model_name", "gemma"
],
"recipe_ephemeral_storage_size": 100
}
```
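
Once the recipe JSON is ready, it is submitted to your OCI AI Blueprints control-plane API. The exact base URL and authentication depend on your installation; the `deployment` path below is an assumption, shown only as a sketch:

```bash
# Save the recipe above as cpu_inference_service.json, then submit it.
# <BLUEPRINTS_API_URL> is a placeholder for your installation's API endpoint,
# and the /deployment path is an assumption; adjust both to match your setup.
curl -X POST "https://<BLUEPRINTS_API_URL>/deployment" \
  -H "Content-Type: application/json" \
  -d @cpu_inference_service.json
```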

---

### Accessing the API

Once deployed, send inference requests to the model via the exposed port:

```bash
curl http://<PUBLIC_IP>:11434/api/generate -d '{
"model": "gemma",
"prompt": "What is the capital of France?",
"stream": false
}'
```
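
Before sending prompts, you can confirm the model is visible to the server by querying Ollama's standard model-listing endpoint:

```bash
# List the models currently available to the Ollama server
curl http://<PUBLIC_IP>:11434/api/tags
```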

### Example Public Inference Calls
```bash
curl -L -X POST https://cpu-inference-mismistral.130-162-199-33.nip.io/api/generate \
-d '{ "model": "mistral", "prompt": "What is the capital of Germany?" }' \
| jq -r 'select(.response) | .response' | paste -sd " "

curl -L -k -X POST https://cpu-inference-mistral-flexe4.130-162-199-33.nip.io/api/generate \
-d '{ "model": "mistral", "prompt": "What is the capital of Germany?" }' \
| jq -r 'select(.response) | .response' | paste -sd " "
```
---

### Pulling from Ollama and Saving to Object Storage

To download a model from Ollama and store it in Object Storage, use the job-mode recipe below.

#### Recipe Configuration

| Field | Description |
|--------------------------------|------------------------------------------------|
| recipe_id | `cpu_inference` – Same recipe base |
| recipe_mode | `job` – One-time job to save a model |
| deployment_name | Custom name for the saving job |
| recipe_image_uri | OCIR URI of the saver image |
| recipe_node_shape | Compute shape used for the job |
| output_object_storage | Where to store pulled models |
| recipe_container_env | Environment variables including model name |
| recipe_replica_count | Set to 1 |
| recipe_container_port | Typically `11434` for Ollama |
| recipe_node_pool_size | Set to 1 |
| recipe_node_boot_volume_size_in_gbs | Size in GB |
| recipe_container_command_args | Set output directory and model name |
| recipe_ephemeral_storage_size | Temporary storage |

#### Sample Recipe (Job Mode)
```json
{
"recipe_id": "cpu_inference",
"recipe_mode": "job",
"deployment_name": "gemma and BME4 saver",
"recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:cpu_inference_saver_v0.2",
"recipe_node_shape": "BM.Standard.E4.128",
"output_object_storage": [
{
"bucket_name": "ollama-models",
"mount_location": "/models",
"volume_size_in_gbs": 20
}
],
"recipe_container_env": [
{ "key": "MODEL_NAME", "value": "gemma" },
{ "key": "PROMPT", "value": "What is the capital of France?" }
],
"recipe_replica_count": 1,
"recipe_container_port": "11434",
"recipe_node_pool_size": 1,
"recipe_node_boot_volume_size_in_gbs": 200,
"recipe_container_command_args": [
"--output_directory", "/models", "--model_name", "gemma"
],
"recipe_ephemeral_storage_size": 100
}
```
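
After the job completes, you can verify that the model files landed in the bucket, for example with the OCI CLI. The `manifests/` prefix below assumes the saver image mirrors Ollama's blob-and-manifest layout; adjust it if your image organizes output differently:

```bash
# List everything the saver job wrote to the bucket named in the recipe above
oci os object list --bucket-name ollama-models --all

# Narrow the listing to manifest entries (prefix is an assumption about layout)
oci os object list --bucket-name ollama-models --prefix manifests/ --all
```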

---

## Final Notes

- Ensure OCI IAM permissions allow the deployment to read from and write to the Object Storage buckets used above (see the example policy after this list).
- Confirm that the bucket region and the deployment region match.
- Use the job-mode recipe once to save a model, then use the service-mode recipe repeatedly to serve it.
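
A minimal sketch of granting that Object Storage access with the OCI CLI, assuming the blueprint nodes are covered by a dynamic group (all names and OCIDs are placeholders to adapt to your tenancy):

```bash
# Create a policy allowing the blueprint's dynamic group to manage objects
# in the target compartment; names and OCIDs below are placeholders.
oci iam policy create \
  --compartment-id <COMPARTMENT_OCID> \
  --name ollama-models-object-access \
  --description "Object Storage access for the CPU inference blueprint" \
  --statements '["Allow dynamic-group <DYNAMIC_GROUP_NAME> to manage objects in compartment <COMPARTMENT_NAME>"]'
```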
202 changes: 202 additions & 0 deletions docs/healthcheck/readme.md
# GPU Health Check & Pre-Check Blueprint

This blueprint provides a **pre-check** for GPU health validation before running production or research workloads. The focus is on delivering a **diagnostic tool** that runs in both single-node and multi-node environments, ensuring that your infrastructure is ready for demanding experiments.

The workflow includes:
- **Data types** as input (`fp8`, `fp16`, `fp32`, `fp64`)
- **Custom Functions** for GPU diagnostics
- **GPU-Burn** for stress testing
- **Results** collected in JSON files (and optionally PDF reports)

By following this blueprint, you can identify and localize issues such as thermal throttling, power irregularities, or GPU instability before they impact your main workloads.

---

## 1. Architecture Overview

Below is a simplified overview:

<img width="893" alt="image" src="https://github.com/user-attachments/assets/e44f7ffe-19cf-48be-a026-e27fddfbed3c" />


### Key Points

- Data Types: You can specify one of several floating-point precisions (`fp8`, `fp16`, `fp32`, `fp64`).
- Custom Functions: Diagnostic functions that measure performance metrics such as throughput, memory bandwidth, etc.
- Single-Node vs. Multi-Node: Tests can be run on a single machine or scaled to multiple machines.
- GPU-Burn: A specialized stress-testing tool for pushing GPUs to their maximum performance limits.
- Results: Output is aggregated into JSON files (and optionally PDFs) for analysis.

---

## 2. Health Check Blueprint

This blueprint aims to give you confidence that your GPUs are healthy. The key checks include:

1. Compute Throughput
- Dense matrix multiplications or arithmetic operations stress the GPU cores.
- Ensures sustained performance without degradation.

2. Memory Bandwidth
- Reading/writing large chunks of data (e.g., `torch.rand()`) tests memory throughput.
- Verifies the memory subsystem operates at expected speeds.

3. Temperature & Thermal Stability
- Uses commands like `nvidia-smi` to monitor temperature.
- Checks for throttling under load (see the monitoring sketch after this list).

4. Power Consumption
- Monitors power draw (e.g., `nvidia-smi --query-gpu=power.draw --format=csv`).
- Identifies irregular or excessive power usage.

5. GPU Utilization
- Ensures GPU cores (including Tensor Cores) are fully engaged during tests.
- Confirms no unexpected idle time.

6. Error Detection
- Checks for hardware errors or CUDA-related issues.
- Asserts numerical correctness to ensure no silent failures.

7. Multi-GPU Testing
- Validates multi-GPU or multi-node setups.
- Ensures the entire environment is consistent and stable.

8. Mixed Precision Testing
- Uses AMP for fp8 or fp16 operations (e.g., `torch.cuda.amp.autocast()`).
- Confirms performance and compatibility with mixed-precision ops.
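
A minimal sketch of the kind of monitoring behind the temperature and power checks above, using standard `nvidia-smi` query fields (sampling interval and field list can be adjusted):

```bash
# Sample temperature, power draw, utilization, and active throttle reasons
# every 5 seconds, appending CSV rows to a log for later analysis
nvidia-smi \
  --query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,clocks_throttle_reasons.active \
  --format=csv -l 5 >> gpu_monitor.csv
```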

---

## 3. Data Types and How They Work

- `fp8`, `fp16`: Lower precision can offer speedups but requires checks for numerical stability.
- `fp32` (single precision): Standard for most deep learning tasks; tests confirm typical GPU operations.
- `fp64` (double precision): Used in HPC/scientific workloads; verifies performance and accuracy at high precision.

Depending on the dtype you select, the script either runs a set of Custom Functions or launches GPU-Burn to push the hardware to its limits. The results are saved in JSON for analysis.

---

## 4. Custom Functions

These Python-based diagnostic functions systematically measure:

- Throughput (matrix multiplies, convolution stubs, etc.)
- Memory bandwidth (large tensor reads/writes)
- Temperature (via nvidia-smi or other sensors)
- Power usage
- GPU utilization
- Error detection (assert checks, error logs)
- Multi-GPU orchestration (parallel usage correctness)
- Mixed precision compatibility (AMP in PyTorch)

They can run on a single node or multiple nodes, with each run producing structured JSON output.

---

## 5. GPU-Burn

[GPU-Burn](https://github.com/wilicc/gpu-burn) is a stress-testing tool designed to push GPUs to their maximum performance limits. It is typically used to:

- Validate hardware stability
- Identify potential overheating or faulty components
- Confirm GPUs can handle extreme workloads without errors or throttling

When you run GPU-Burn in float32 or float64 mode, its output can be captured in a log file, then parsed into JSON or PDF summaries for reporting.
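
For reference, a minimal sketch of running GPU-Burn directly and capturing its output (this assumes the tool has been built from the repository above; flags can vary slightly between versions):

```bash
# Stress all visible GPUs for 5 minutes in float32 and keep the output for parsing
./gpu_burn 300 | tee gpu_burn_fp32.log

# Repeat in float64 (doubles) mode
./gpu_burn -d 300 | tee gpu_burn_fp64.log
```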

---


## 6. Usage

1. Clone the Blueprint & Install Dependencies

```bash
git clone <repo_url>
cd <repo_name>
docker build -t gpu-healthcheck .
```

2. Run the Pre-Check

- Single Node Example (fp16):

```bash
docker run --gpus all -it -v $(pwd)/results:/app/testing_results gpu-healthcheck --dtype float16 --expected_gpus A10:2,A100:0,H100:0
```

- GPU-Burn Stress Test (float32):

```bash
docker run --gpus all -it -v $(pwd)/results:/app/testing_results gpu-healthcheck --dtype float32 --expected_gpus A10:2,A100:0,H100:0
```

3. Examine Results

- JSON output is located in the `results/` directory.
- PDF summaries, when enabled, are generated alongside the JSON files.
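
A quick way to inspect the JSON results (a minimal sketch; exact file names and schema depend on the run configuration):

```bash
# Pretty-print every JSON result produced by the health check
for f in results/*.json; do
  echo "== $f =="
  jq . "$f"
done
```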

---
## 7. Implementing It in OCI AI Blueprints

The following is an example of a JSON recipe that can be used to deploy this health check with OCI AI Blueprints:

```json
{
"recipe_id": "healthcheck",
"recipe_mode": "job",
"deployment_name": "healthcheck",
"recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:healthcheck_v0.3",
"recipe_node_shape": "VM.GPU.A10.2",
"output_object_storage": [
{
"bucket_name": "healthcheck2",
"mount_location": "/healthcheck_results",
"volume_size_in_gbs": 20
}
],
"recipe_container_command_args": [
"--dtype", "float16", "--output_dir", "/healthcheck_results", "--expected_gpus", "A10:2,A100:0,H100:0"
],
"recipe_replica_count": 1,
"recipe_nvidia_gpu_count": 2,
"recipe_node_pool_size": 1,
"recipe_node_boot_volume_size_in_gbs": 200,
"recipe_ephemeral_storage_size": 100,
"recipe_shared_memory_volume_size_limit_in_mb": 1000,
"recipe_use_shared_node_pool": true
}
```
---

### Explanation of Healthcheck Recipe Fields

| Field | Type | Example Value | Description |
|---------------------------------------|-------------|-------------------------------------------------------------------------------|-------------|
| `recipe_id` | string | `"healthcheck"` | Identifier for the recipe |
| `recipe_mode` | string | `"job"` | Whether the recipe runs as a one-time job or a service |
| `deployment_name` | string | `"healthcheck"` | Name of the deployment/job |
| `recipe_image_uri` | string | `"iad.ocir.io/.../healthcheck_v0.3"` | URI of the container image stored in OCI Container Registry |
| `recipe_node_shape` | string | `"VM.GPU.A10.2"` | Compute shape to use for this job |
| `output_object_storage.bucket_name` | string | `"healthcheck2"` | Name of the Object Storage bucket to write results |
| `output_object_storage.mount_location`| string | `"/healthcheck_results"` | Directory inside the container where the bucket will be mounted |
| `output_object_storage.volume_size_in_gbs` | integer | `20` | Storage volume size (GB) for the mounted bucket |
| `recipe_container_command_args` | list | `[--dtype, float16, --output_dir, /healthcheck_results, --expected_gpus, A10:2,A100:0,H100:0]` | Arguments passed to the container |
| `--dtype` | string | `"float16"` | Precision type for computations (e.g. float16, float32) |
| `--output_dir` | string | `"/healthcheck_results"` | Directory for writing output (maps to mounted bucket) |
| `--expected_gpus` | string | `"A10:2,A100:0,H100:0"` | Expected GPU types and counts |
| `recipe_replica_count` | integer | `1` | Number of replicas (containers) to run |
| `recipe_nvidia_gpu_count` | integer | `2` | Number of GPUs to allocate |
| `recipe_node_pool_size` | integer | `1` | Number of nodes to provision |
| `recipe_node_boot_volume_size_in_gbs`| integer | `200` | Size of the boot volume (GB) |
| `recipe_ephemeral_storage_size` | integer | `100` | Ephemeral scratch storage size (GB) |
| `recipe_shared_memory_volume_size_limit_in_mb` | integer | `1000` | Size of shared memory volume (`/dev/shm`) in MB |
| `recipe_use_shared_node_pool` | boolean | `true` | Whether to run on a shared node pool |

## 8. Contact

For questions or additional information, open an issue in this repository or contact the maintainers directly.