Sadra recipes #44

169 changes: 169 additions & 0 deletions docs/cpu_inference/readme.md
# CPU Inference with Ollama

This blueprint explains how to run large language models on CPUs using Ollama. It covers two main deployment strategies:
- Serving pre-saved models directly from Object Storage
- Pulling models from Ollama and saving them to Object Storage

---

## Why CPU Inference?

CPU inference is ideal for:
- Low-throughput or cost-sensitive deployments
- Offline testing and validation
- Prototyping without GPU dependency

---

## Supported Models

Ollama supports several high-quality open-source LLMs. Below is a small set of commonly used models:

| Model Name | Description |
|------------|--------------------------------|
| gemma | Lightweight open LLM by Google |
| llama2 | Meta’s large language model |
| mistral | Open-weight performant LLM |
| phi3 | Microsoft’s compact LLM |
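
The model names above are the identifiers Ollama itself uses, so they are the same values passed to `--model_name` in the recipes below. If you want to fetch one manually with the Ollama CLI, a minimal sketch (available tags and sizes vary by model):

```bash
# Pull a model by name from the Ollama library
ollama pull gemma

# List the models available locally after the pull
ollama list
```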

---

## Deploying with OCI AI Blueprint

### Running Ollama Models from Object Storage

If you've already pushed your model to **Object Storage**, use the following service-mode recipe to run it. Ensure your model files are in the **blob + manifest** format used by Ollama.

#### Recipe Configuration

| Field | Description |
|--------------------------------|------------------------------------------------|
| recipe_id | `cpu_inference` – Identifier for the recipe |
| recipe_mode | `service` – Used for long-running inference |
| deployment_name | Custom name for the deployment |
| recipe_image_uri | URI for the container image in OCIR |
| recipe_node_shape | OCI shape, e.g., `BM.Standard.E4.128` |
| input_object_storage | Object Storage bucket mounted as input |
| recipe_container_env | List of environment variables |
| recipe_replica_count | Number of replicas |
| recipe_container_port | Port to expose the container |
| recipe_node_pool_size | Number of nodes in the pool |
| recipe_node_boot_volume_size_in_gbs | Boot volume size in GB |
| recipe_container_command_args | Arguments for the container command |
| recipe_ephemeral_storage_size | Temporary scratch storage |

#### Sample Recipe (Service Mode)
```json
{
"recipe_id": "cpu_inference",
"recipe_mode": "service",
"deployment_name": "gemma and BME4 service",
"recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:cpu_inference_service_v0.2",
"recipe_node_shape": "BM.Standard.E4.128",
"input_object_storage": [
{
"bucket_name": "ollama-models",
"mount_location": "/models",
"volume_size_in_gbs": 20
}
],
"recipe_container_env": [
{ "key": "MODEL_NAME", "value": "gemma" },
{ "key": "PROMPT", "value": "What is the capital of France?" }
],
"recipe_replica_count": 1,
"recipe_container_port": "11434",
"recipe_node_pool_size": 1,
"recipe_node_boot_volume_size_in_gbs": 200,
"recipe_container_command_args": [
"--input_directory", "/models", "--model_name", "gemma"
],
"recipe_ephemeral_storage_size": 100
}
```
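
Once the recipe JSON is ready, it is submitted to your OCI AI Blueprints control-plane API. The exact base URL and authentication depend on your installation; the `deployment` path below is an assumption, shown only as a sketch:

```bash
# Save the recipe above as cpu_inference_service.json, then submit it.
# <BLUEPRINTS_API_URL> is a placeholder for your installation's API endpoint,
# and the /deployment path is an assumption; adjust both to match your setup.
curl -X POST "https://<BLUEPRINTS_API_URL>/deployment" \
  -H "Content-Type: application/json" \
  -d @cpu_inference_service.json
```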

---

### Accessing the API

Once deployed, send inference requests to the model via the exposed port:

```bash
curl http://<PUBLIC_IP>:11434/api/generate -d '{
"model": "gemma",
"prompt": "What is the capital of France?",
"stream": false
}'
```
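
Before sending prompts, you can confirm the model is visible to the server by querying Ollama's standard model-listing endpoint:

```bash
# List the models currently available to the Ollama server
curl http://<PUBLIC_IP>:11434/api/tags
```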

### Example Public Inference Calls
```bash
curl -L -X POST https://cpu-inference-mismistral.130-162-199-33.nip.io/api/generate \
-d '{ "model": "mistral", "prompt": "What is the capital of Germany?" }' \
| jq -r 'select(.response) | .response' | paste -sd " "

curl -L -k -X POST https://cpu-inference-mistral-flexe4.130-162-199-33.nip.io/api/generate \
-d '{ "model": "mistral", "prompt": "What is the capital of Germany?" }' \
| jq -r 'select(.response) | .response' | paste -sd " "
```
---

### Pulling from Ollama and Saving to Object Storage

To download a model from Ollama and store it in Object Storage, use the job-mode recipe below.

#### Recipe Configuration

| Field | Description |
|--------------------------------|------------------------------------------------|
| recipe_id | `cpu_inference` – Same recipe base |
| recipe_mode | `job` – One-time job to save a model |
| deployment_name | Custom name for the saving job |
| recipe_image_uri | OCIR URI of the saver image |
| recipe_node_shape | Compute shape used for the job |
| output_object_storage | Where to store pulled models |
| recipe_container_env | Environment variables including model name |
| recipe_replica_count | Set to 1 |
| recipe_container_port | Typically `11434` for Ollama |
| recipe_node_pool_size | Set to 1 |
| recipe_node_boot_volume_size_in_gbs | Size in GB |
| recipe_container_command_args | Set output directory and model name |
| recipe_ephemeral_storage_size | Temporary storage |

#### Sample Recipe (Job Mode)
```json
{
"recipe_id": "cpu_inference",
"recipe_mode": "job",
"deployment_name": "gemma and BME4 saver",
"recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:cpu_inference_saver_v0.2",
"recipe_node_shape": "BM.Standard.E4.128",
"output_object_storage": [
{
"bucket_name": "ollama-models",
"mount_location": "/models",
"volume_size_in_gbs": 20
}
],
"recipe_container_env": [
{ "key": "MODEL_NAME", "value": "gemma" },
{ "key": "PROMPT", "value": "What is the capital of France?" }
],
"recipe_replica_count": 1,
"recipe_container_port": "11434",
"recipe_node_pool_size": 1,
"recipe_node_boot_volume_size_in_gbs": 200,
"recipe_container_command_args": [
"--output_directory", "/models", "--model_name", "gemma"
],
"recipe_ephemeral_storage_size": 100
}
```
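
After the job completes, you can verify that the model files landed in the bucket, for example with the OCI CLI. The `manifests/` prefix below assumes the saver image mirrors Ollama's blob-and-manifest layout; adjust it if your image organizes output differently:

```bash
# List everything the saver job wrote to the bucket named in the recipe above
oci os object list --bucket-name ollama-models --all

# Narrow the listing to manifest entries (prefix is an assumption about layout)
oci os object list --bucket-name ollama-models --prefix manifests/ --all
```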

---

## Final Notes

- Ensure OCI IAM permissions allow the deployment to read from and write to the Object Storage buckets used above (see the example policy after this list).
- Confirm that the bucket region and the deployment region match.
- Use the job-mode recipe once to save a model, then use the service-mode recipe repeatedly to serve it.
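
A minimal sketch of granting that Object Storage access with the OCI CLI, assuming the blueprint nodes are covered by a dynamic group (all names and OCIDs are placeholders to adapt to your tenancy):

```bash
# Create a policy allowing the blueprint's dynamic group to manage objects
# in the target compartment; names and OCIDs below are placeholders.
oci iam policy create \
  --compartment-id <COMPARTMENT_OCID> \
  --name ollama-models-object-access \
  --description "Object Storage access for the CPU inference blueprint" \
  --statements '["Allow dynamic-group <DYNAMIC_GROUP_NAME> to manage objects in compartment <COMPARTMENT_NAME>"]'
```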
202 changes: 202 additions & 0 deletions docs/healthcheck/readme.md
# GPU Health Check & Pre-Check Blueprint

This blueprint provides a **pre-check** for GPU health validation before running production or research workloads. The focus is on delivering a **diagnostic tool** that runs in both single-node and multi-node environments, ensuring that your infrastructure is ready for demanding experiments.

The workflow includes:
- **Data types** as input (`fp8`, `fp16`, `fp32`, `fp64`)
- **Custom Functions** for GPU diagnostics
- **GPU-Burn** for stress testing
- **Results** collected in JSON files (and optionally PDF reports)

By following this blueprint, you can identify and localize issues such as thermal throttling, power irregularities, or GPU instability before they impact your main workloads.

---

## 1. Architecture Overview

Below is a simplified overview:

<img width="893" alt="image" src="https://github.com/user-attachments/assets/e44f7ffe-19cf-48be-a026-e27fddfbed3c" />


### Key Points

- Data Types: You can specify one of several floating-point precisions (`fp8`, `fp16`, `fp32`, `fp64`).
- Custom Functions: Diagnostic functions that measure performance metrics such as throughput, memory bandwidth, etc.
- Single-Node vs. Multi-Node: Tests can be run on a single machine or scaled to multiple machines.
- GPU-Burn: A specialized stress-testing tool for pushing GPUs to their maximum performance limits.
- Results: Output is aggregated into JSON files (and optionally PDFs) for analysis.

---

## 2. Health Check Blueprint

This blueprint aims to give you confidence that your GPUs are healthy. The key checks include:

1. Compute Throughput
- Dense matrix multiplications or arithmetic operations stress the GPU cores.
- Ensures sustained performance without degradation.

2. Memory Bandwidth
- Reading/writing large chunks of data (e.g., `torch.rand()`) tests memory throughput.
- Verifies the memory subsystem operates at expected speeds.

3. Temperature & Thermal Stability
- Uses commands like `nvidia-smi` to monitor temperature.
- Checks for throttling under load (see the monitoring sketch after this list).

4. Power Consumption
- Monitors power draw (e.g., `nvidia-smi --query-gpu=power.draw --format=csv`).
- Identifies irregular or excessive power usage.

5. GPU Utilization
- Ensures GPU cores (including Tensor Cores) are fully engaged during tests.
- Confirms no unexpected idle time.

6. Error Detection
- Checks for hardware errors or CUDA-related issues.
- Asserts numerical correctness to ensure no silent failures.

7. Multi-GPU Testing
- Validates multi-GPU or multi-node setups.
- Ensures the entire environment is consistent and stable.

8. Mixed Precision Testing
- Uses AMP for fp8 or fp16 operations (e.g., `torch.cuda.amp.autocast()`).
- Confirms performance and compatibility with mixed-precision ops.
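
A minimal sketch of the kind of monitoring behind the temperature and power checks above, using standard `nvidia-smi` query fields (sampling interval and field list can be adjusted):

```bash
# Sample temperature, power draw, utilization, and active throttle reasons
# every 5 seconds, appending CSV rows to a log for later analysis
nvidia-smi \
  --query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,clocks_throttle_reasons.active \
  --format=csv -l 5 >> gpu_monitor.csv
```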

---

## 3. Data Types and How They Work

- `fp8`, `fp16`: Lower precision can offer speedups but requires checks for numerical stability.
- `fp32` (single precision): Standard for most deep learning tasks; tests confirm typical GPU operations.
- `fp64` (double precision): Used in HPC/scientific workloads; verifies performance and accuracy at high precision.

Depending on the dtype you select, the script either runs a set of Custom Functions or launches GPU-Burn to push the hardware to its limits. The results are saved in JSON for analysis.

---

## 4. Custom Functions

These Python-based diagnostic functions systematically measure:

- Throughput (matrix multiplies, convolution stubs, etc.)
- Memory bandwidth (large tensor reads/writes)
- Temperature (via nvidia-smi or other sensors)
- Power usage
- GPU utilization
- Error detection (assert checks, error logs)
- Multi-GPU orchestration (parallel usage correctness)
- Mixed precision compatibility (AMP in PyTorch)

They can run on a single node or multiple nodes, with each run producing structured JSON output.

---

## 5. GPU-Burn

[GPU-Burn](https://github.com/wilicc/gpu-burn) is a stress-testing tool designed to push GPUs to their maximum performance limits. It is typically used to:

- Validate hardware stability
- Identify potential overheating or faulty components
- Confirm GPUs can handle extreme workloads without errors or throttling

When you run GPU-Burn in float32 or float64 mode, its output can be captured in a log file, then parsed into JSON or PDF summaries for reporting.
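
For reference, a minimal sketch of running GPU-Burn directly and capturing its output (this assumes the tool has been built from the repository above; flags can vary slightly between versions):

```bash
# Stress all visible GPUs for 5 minutes in float32 and keep the output for parsing
./gpu_burn 300 | tee gpu_burn_fp32.log

# Repeat in float64 (doubles) mode
./gpu_burn -d 300 | tee gpu_burn_fp64.log
```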

---


## 6. Usage

1. Clone the Blueprint & Install Dependencies

```bash
git clone <repo_url>
cd <repo_name>
docker build -t gpu-healthcheck .
```

2. Run the Pre-Check

- Single Node Example (fp16):

```bash
docker run --gpus all -it -v $(pwd)/results:/app/testing_results gpu-healthcheck --dtype float16 --expected_gpus A10:2,A100:0,H100:0
```

- GPU-Burn Stress Test (float32):

```bash
docker run --gpus all -it -v $(pwd)/results:/app/testing_results gpu-healthcheck --dtype float32 --expected_gpus A10:2,A100:0,H100:0
```

3. Examine Results

- JSON output is located in the `results/` directory.
- PDF summaries, when enabled, are generated alongside the JSON files.
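
A quick way to inspect the JSON results (a minimal sketch; exact file names and schema depend on the run configuration):

```bash
# Pretty-print every JSON result produced by the health check
for f in results/*.json; do
  echo "== $f =="
  jq . "$f"
done
```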

---
## 7. Implementing It in OCI AI Blueprints

The following is an example of a JSON recipe that can be used to deploy this health check with OCI AI Blueprints:

```json
{
"recipe_id": "healthcheck",
"recipe_mode": "job",
"deployment_name": "healthcheck",
"recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:healthcheck_v0.3",
"recipe_node_shape": "VM.GPU.A10.2",
"output_object_storage": [
{
"bucket_name": "healthcheck2",
"mount_location": "/healthcheck_results",
"volume_size_in_gbs": 20
}
],
"recipe_container_command_args": [
"--dtype", "float16", "--output_dir", "/healthcheck_results", "--expected_gpus", "A10:2,A100:0,H100:0"
],
"recipe_replica_count": 1,
"recipe_nvidia_gpu_count": 2,
"recipe_node_pool_size": 1,
"recipe_node_boot_volume_size_in_gbs": 200,
"recipe_ephemeral_storage_size": 100,
"recipe_shared_memory_volume_size_limit_in_mb": 1000,
"recipe_use_shared_node_pool": true
}
```
---

### Explanation of Healthcheck Recipe Fields

| Field | Type | Example Value | Description |
|---------------------------------------|-------------|-------------------------------------------------------------------------------|-------------|
| `recipe_id` | string | `"healthcheck"` | Identifier for the recipe |
| `recipe_mode` | string | `"job"` | Whether the recipe runs as a one-time job or a service |
| `deployment_name` | string | `"healthcheck"` | Name of the deployment/job |
| `recipe_image_uri` | string | `"iad.ocir.io/.../healthcheck_v0.3"` | URI of the container image stored in OCI Container Registry |
| `recipe_node_shape` | string | `"VM.GPU.A10.2"` | Compute shape to use for this job |
| `output_object_storage.bucket_name` | string | `"healthcheck2"` | Name of the Object Storage bucket to write results |
| `output_object_storage.mount_location`| string | `"/healthcheck_results"` | Directory inside the container where the bucket will be mounted |
| `output_object_storage.volume_size_in_gbs` | integer | `20` | Storage volume size (GB) for the mounted bucket |
| `recipe_container_command_args` | list | `[--dtype, float16, --output_dir, /healthcheck_results, --expected_gpus, A10:2,A100:0,H100:0]` | Arguments passed to the container |
| `--dtype` | string | `"float16"` | Precision type for computations (e.g. float16, float32) |
| `--output_dir` | string | `"/healthcheck_results"` | Directory for writing output (maps to mounted bucket) |
| `--expected_gpus` | string | `"A10:2,A100:0,H100:0"` | Expected GPU types and counts |
| `recipe_replica_count` | integer | `1` | Number of replicas (containers) to run |
| `recipe_nvidia_gpu_count` | integer | `2` | Number of GPUs to allocate |
| `recipe_node_pool_size` | integer | `1` | Number of nodes to provision |
| `recipe_node_boot_volume_size_in_gbs`| integer | `200` | Size of the boot volume (GB) |
| `recipe_ephemeral_storage_size` | integer | `100` | Ephemeral scratch storage size (GB) |
| `recipe_shared_memory_volume_size_limit_in_mb` | integer | `1000` | Size of shared memory volume (`/dev/shm`) in MB |
| `recipe_use_shared_node_pool` | boolean | `true` | Whether to run on a shared node pool |

## 8. Contact

For questions or additional information, open an issue in this repository or contact the maintainers directly.