
Ray Fundamentals Learning Repository πŸš€

Python 3.11+ Β· Ray 2.46.0 Β· uv

πŸ“– Hands-on learning repository following the Anyscale Introduction to Ray Course

This repository contains practical Python implementations and examples designed to complement the Anyscale "Introduction to Ray" course. Each script demonstrates key concepts from distributed computing with Ray, progressing from basic tasks to advanced machine learning workflows.

πŸ“– Course Overview

This repository implements examples for the following Anyscale course modules:

| Course Module | Repository Files | Key Concepts |
|---|---|---|
| πŸ—οΈ Ray Core Fundamentals | ray_core.py, ray_actors.py | Remote functions, ObjectRefs, Actors |
| ⚑ Ray Core Advanced | ray_advanced.py | Object Store, Runtime Environments, Resource Management |
| πŸ“Š Ray Data Processing | ray_data.py | Distributed data processing, ETL pipelines |
| πŸ€– Ray AI Libraries | ray_ai.py | XGBoost integration, Distributed ML workflows |
| 🎯 Ray Tune Optimization | ray_tune.py, ray_tune_torch.py | Hyperparameter tuning, AutoML, Experiment tracking |
| πŸ”₯ Ray Train & PyTorch | ray_torch.py, ray_torch_ddp.py | Distributed training, Model parallelism, DDP |
| 🌐 Ray Serve Deployment | ray_serve.py | ML model serving, API endpoints, Scalable inference |
| πŸ§ͺ Testing & Utilities | ray_minimal_test.py, shell scripts | Memory management, Cleanup utilities |

🎯 Learning Objectives

After working through this repository, you'll understand:

  • Ray's distributed computing model and core abstractions
  • How to write scalable remote functions and stateful actors
  • Object store patterns and memory management strategies
  • Integration of Ray with popular ML libraries (XGBoost, PyTorch)
  • Best practices for distributed training and model serving
  • Troubleshooting and resource optimization techniques

πŸ›€οΈ Learning Path

🟒 Beginner Level (Start Here)

Estimated time: 2-3 hours

  1. πŸ“‹ Prerequisites Check

    python --version  # Should be 3.11+
    uv --version      # Package manager
  2. πŸš€ Ray Basics - ray_minimal_test.py

    • Verify Ray installation
    • Understand basic Ray initialization
    • Simple remote functions
  3. πŸ”§ Core Concepts - ray_core.py

    • Remote functions (@ray.remote)
    • Object references (ray.get, ray.put)
    • Common patterns and anti-patterns

🟑 Intermediate Level

Estimated time: 3-4 hours

  1. πŸ‘₯ Stateful Actors - ray_actors.py

    • Actor lifecycle and state management
    • Actor handles and communication
    • Use cases for actors vs tasks
  2. ⚑ Advanced Features - ray_advanced.py

    • Distributed object store
    • Runtime environments
    • Resource allocation and fractional resources
    • Nested tasks and patterns

πŸ”΄ Advanced Level

Estimated time: 4-5 hours

  1. πŸ€– ML Workflows - ray_ai.py

    • Ray integration with XGBoost
    • Distributed data processing
    • Model training and evaluation
  2. πŸ”₯ Distributed Training - ray_torch.py

    • Ray Train with PyTorch
    • Distributed data parallel training
    • Checkpointing and metrics

🏁 Checkpoints

  • βœ… Can create and call remote functions
  • βœ… Understand ObjectRefs and object store
  • βœ… Can implement and use Ray actors
  • βœ… Familiar with runtime environments
  • βœ… Can integrate Ray with ML libraries
  • βœ… Understand distributed training patterns

πŸ“ Project Structure

ray_fundamentals/
β”œβ”€β”€ πŸ“„ README.md                 # This comprehensive guide
β”œβ”€β”€ πŸ“¦ pyproject.toml           # Project dependencies & config
β”œβ”€β”€ 🐍 Python Learning Modules:
β”‚   β”œβ”€β”€ ray_minimal_test.py     # βœ… Installation verification
β”‚   β”œβ”€β”€ ray_core.py             # πŸ—οΈ Remote functions & ObjectRefs
β”‚   β”œβ”€β”€ ray_actors.py           # πŸ‘₯ Stateful actors & communication
β”‚   β”œβ”€β”€ ray_advanced.py         # ⚑ Object store & runtime environments
β”‚   β”œβ”€β”€ ray_ai.py               # πŸ€– XGBoost ML workflow
β”‚   └── ray_torch.py            # πŸ”₯ PyTorch distributed training
β”œβ”€β”€ πŸ› οΈ Utility Scripts:
β”‚   β”œβ”€β”€ cleanup_ray.sh          # 🧹 Clean Ray temp files
β”‚   └── run_ray_safe.sh         # πŸ›‘οΈ Run with memory limits
└── πŸ“š Documentation:
    └── docs/
        β”œβ”€β”€ ray_resources.md    # CPU/GPU resource allocation
        └── ray_runtime_notes.md # Runtime environment deep-dive

πŸ“‹ Detailed File Descriptions

| File | Purpose | Key Concepts | Prerequisites |
|---|---|---|---|
| ray_minimal_test.py | πŸ§ͺ Verify setup | Ray initialization, basic remote functions | Python basics |
| ray_core.py | πŸ—οΈ Foundation concepts | @ray.remote, ray.get(), ray.put(), anti-patterns | None |
| ray_actors.py | πŸ‘₯ Stateful computing | Actor classes, state management, handles | ray_core.py |
| ray_advanced.py | ⚑ Advanced patterns | Object store, runtime envs, resources, nested tasks | ray_actors.py |
| ray_ai.py | πŸ€– ML integration | XGBoost + Ray, distributed ML workflows | ML basics, pandas |
| ray_torch.py | πŸ”₯ Distributed training | Ray Train, PyTorch DDP, checkpointing | PyTorch knowledge |

πŸš€ Quick Start

System Requirements

  • Python: 3.11 or higher
  • Memory: 4GB+ RAM recommended
  • OS: Linux, macOS, or Windows with WSL

1. Setup Environment

# Clone or navigate to this repository
cd ray_fundamentals

# Install dependencies using uv (recommended)
uv sync

# Alternative: using pip
pip install -r requirements.txt

2. Verify Installation

# Test Ray installation with minimal example
uv run ray_minimal_test.py
# or: python ray_minimal_test.py

Expected output:

Ray initialized successfully!
Available resources: {'CPU': 1.0, 'memory': 256000000}
Simple task result: 84
Small matrix test passed: True
Ray shutdown successfully!
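
If you are curious what such a verification script typically contains, here is a rough sketch that produces output in the same spirit (the actual ray_minimal_test.py may check different things):

```python
import numpy as np
import ray

ray.init(num_cpus=1)
print("Ray initialized successfully!")
print("Available resources:", ray.available_resources())

@ray.remote
def double(x):
    return 2 * x

print("Simple task result:", ray.get(double.remote(42)))  # 84

@ray.remote
def matmul(a, b):
    return a @ b

m = np.ones((64, 64))
print("Small matrix test passed:", ray.get(matmul.remote(m, m)).shape == (64, 64))

ray.shutdown()
print("Ray shutdown successfully!")
```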

3. Start Learning

Begin with the Learning Path above, starting with ray_core.py:

uv run ray_core.py

🧠 Key Ray Concepts

πŸ”§ Remote Functions (Tasks)

@ray.remote
def compute_task(data):
    return process(data)

# Schedule task execution
future = compute_task.remote(my_data)
result = ray.get(future)  # Retrieve result

Files: ray_core.py, ray_advanced.py

πŸ‘₯ Actors (Stateful Workers)

@ray.remote
class StatefulWorker:
    def __init__(self):
        self.state = {}

    def update(self, key, value):
        self.state[key] = value

    def get(self, key):
        return self.state.get(key)

# Create actor instance and call its methods remotely
worker = StatefulWorker.remote()
worker.update.remote("key", "value")
print(ray.get(worker.get.remote("key")))  # -> "value"

Files: ray_actors.py

πŸ—ƒοΈ Object Store

# Store large objects once, reference many times
large_data = ray.put(massive_dataset)
results = [process_data.remote(large_data) for _ in range(10)]

Files: ray_advanced.py

πŸ€– ML Integration

# Distributed training with Ray Train
from ray.train.xgboost import XGBoostTrainer
trainer = XGBoostTrainer(
    datasets={"train": train_dataset},
    params={"objective": "reg:squarederror"}
)
result = trainer.fit()

Files: ray_ai.py, ray_torch.py

▢️ Running Examples

Basic Execution

# Method 1: Using uv (recommended)
uv run <script_name>.py

# Method 2: Direct python execution
python <script_name>.py

# Method 3: With custom memory limits
./run_ray_safe.sh  # Runs ray_advanced.py with memory constraints

Memory-Safe Execution

For systems with limited RAM:

# Use the provided safe execution script
./run_ray_safe.sh

# Or set environment variables manually
export RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE=1
export RAY_memory_usage_threshold=0.6
python ray_advanced.py
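
The same limits can also be applied from Python when calling ray.init; a small sketch with illustrative values:

```python
import ray

# Cap resources explicitly instead of letting Ray auto-detect them.
ray.init(
    num_cpus=2,
    object_store_memory=200 * 1024 * 1024,  # ~200 MB object store
    _temp_dir="/tmp/ray_fundamentals",      # keep Ray's scratch files contained
)
print(ray.available_resources())
ray.shutdown()
```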

Example Execution Sequence

# 1. Verify installation
uv run ray_minimal_test.py

# 2. Learn core concepts
uv run ray_core.py

# 3. Explore actors
uv run ray_actors.py

# 4. Advanced patterns
./run_ray_safe.sh  # runs ray_advanced.py

# 5. ML workflows
uv run ray_ai.py

# 6. Distributed training
uv run ray_torch.py

# 7. Cleanup (if needed)
./cleanup_ray.sh

πŸ“š Documentation

The docs/ directory contains additional learning resources:

πŸ“„ Available Documentation

| Document | Description | Key Topics |
|---|---|---|
| ray_resources.md | CPU/GPU resource management | num_cpus, num_gpus, resource allocation |
| ray_runtime_notes.md | Runtime environments deep dive | Environment isolation, pip vs uv, Docker containers |

πŸ“– Reading Order

  1. Start with code examples
  2. Reference ray_resources.md when working with ray_advanced.py
  3. Review ray_runtime_notes.md for production deployment insights

πŸ› οΈ Utilities & Troubleshooting

🧹 Cleanup Scripts

| Script | Purpose | Usage |
|---|---|---|
| cleanup_ray.sh | Remove Ray temporary files | ./cleanup_ray.sh |
| run_ray_safe.sh | Execute with memory limits | ./run_ray_safe.sh |

🚨 Common Issues & Solutions

| Issue | Symptoms | Solution |
|---|---|---|
| Memory errors | Ray crashes, OOM kills | Use run_ray_safe.sh or reduce data sizes |
| Port conflicts | "Address already in use" | Run ray stop or ./cleanup_ray.sh |
| Import errors | Module not found | Ensure uv sync completed successfully |
| Slow startup | Long initialization times | Clean temp files with cleanup_ray.sh |

πŸ” Debugging Tips

# Check Ray status
ray status

# View Ray dashboard (if available)
# Open browser to: http://localhost:8265

# Monitor system resources
htop  # or top on macOS

# Check disk usage
df -h /tmp  # Ray uses /tmp by default
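
The same information is available from a Python session through Ray's API:

```python
import ray

ray.init(ignore_reinit_error=True)

# Cluster-wide totals vs. what is currently free.
print("Total:", ray.cluster_resources())
print("Free: ", ray.available_resources())

# Per-node details: address, liveness, and declared resources.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"], node["Resources"])
```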

βš™οΈ Configuration

Environment Variables:

# Memory management
export RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE=1
export RAY_memory_usage_threshold=0.6

# Disable warnings
export RAY_DISABLE_IMPORT_WARNING=1

# Custom temp directory
export RAY_TMPDIR=/path/to/custom/tmp
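
If you prefer to keep configuration in Python, the same variables can be set via os.environ before Ray starts (a sketch using the same illustrative values):

```python
import os

# Must be set before Ray is initialized (ideally before `import ray`).
os.environ["RAY_memory_usage_threshold"] = "0.6"
os.environ["RAY_DISABLE_IMPORT_WARNING"] = "1"
os.environ["RAY_TMPDIR"] = "/path/to/custom/tmp"

import ray

ray.init()
```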

πŸ”— Additional Resources

πŸ“š Official Documentation

πŸŽ“ Learning Resources

πŸ—οΈ Architecture & Best Practices

πŸš€ Production Deployment


πŸ“ Development Notes

  • Coding Style: Follows PEP 8 with extensive inline documentation
  • Version Control: Uses jj (Jitijiji) as an experimental alternative to Git
  • Package Management: Primary dependency management via uv for faster installs
  • Testing Strategy: Executable examples with assertions and print statements for validation

🀝 Contributing

This is a personal learning repository, but suggestions and improvements are welcome! Feel free to:

  • Report issues or errors in examples
  • Suggest additional Ray concepts to explore
  • Share alternative approaches or optimizations

🎯 Happy Learning! Start your Ray journey with the Learning Path and dive into distributed computing! πŸš€
