# Semblance Curation

A comprehensive platform for building, maintaining, and curating machine learning datasets. It provides an end-to-end solution for data collection, annotation, preprocessing, and quality control, with built-in support for multi-modal data types.
## Prerequisites

- Docker Engine 24.0.0+
- Docker Compose v2.20.0+
- NVIDIA GPU (recommended)
- 32GB RAM minimum
- 100GB storage minimum
## Quick Start

1. Clone the repository:

   ```bash
   git clone https://github.com/eooo-io/semblance-curation.git
   cd semblance-curation
   ```

2. Copy and configure environment variables:

   ```bash
   cp env-example .env
   # Edit .env with your preferred settings
   ```

3. Start the services:

   ```bash
   # For production
   docker compose up -d

   # For development
   docker compose -f docker-compose.yml -f docker-compose.override.yml up -d
   ```
4. Access the services:
   - Label Studio: http://localhost:8080
   - Jupyter Lab: http://localhost:8888
   - MinIO Console: http://localhost:9001
   - Grafana: http://localhost:3000
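To confirm everything started cleanly, check container status and tail a service's logs; a minimal sketch (the `jupyter` service name matches the compose service referenced elsewhere in this document):

```bash
# List services and their current state
docker compose ps

# Follow logs for a single service
docker compose logs -f jupyter
```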
For detailed documentation, visit docs.semblance-curation.io.
## Key Features

- Multi-modal data handling (text, audio, images, video)
- Built-in annotation tools
- Local LLM inference capabilities
- Comprehensive monitoring and quality control
- Cloud-ready deployment options
- High availability configuration
- Extensive API support
## Overview

Semblance Curation is designed for organizations and researchers who need to build, maintain, and curate their own machine learning datasets, with built-in support for text, audio, images, and video. Use it to:
- Build and maintain proprietary ML training datasets
- Curate and clean existing datasets
- Annotate data with custom labels and metadata
- Version and track data lineage
- Perform quality control on training data
- Deploy local LLM inference for data processing
## Recommended System Requirements

- 32GB RAM
- 8+ CPU cores
- NVIDIA GPU with 8GB+ VRAM (recommended)
- 500GB+ SSD storage
- Ubuntu 20.04+ or similar Linux distribution
For detailed deployment instructions and requirements, see our deployment documentation.
## Support

- Documentation: docs.semblance-curation.io
- Issues: GitHub Issues
- Discussions: GitHub Discussions
## Features

- Process and store text, voice, and video data
- Scalable storage solutions for various data types
- Efficient data annotation and labeling workflows
- Local LLM inference using Ollama (see the example after this list)
- GPU-accelerated processing
- Vector-based similarity search with Weaviate
- Comprehensive data annotation tools with Argilla
- Powerful text search with Elasticsearch
- Structured data storage in PostgreSQL
- High-performance caching with Redis
- Scalable object storage using MinIO
- Vector database for semantic search
- Data versioning and lineage tracking
- Automated data quality checks
- Real-time data pipeline monitoring
- Interactive Jupyter Notebook environment
- Comprehensive data science toolkit
- Containerized architecture for consistency
- Full GPU support for ML workloads
- Distributed training support
- Experiment tracking and model versioning
- Automated ML pipeline orchestration
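As a quick illustration of the Ollama-based inference feature, the sketch below pulls a model and runs a one-off generation against Ollama's HTTP API on its default port; the `llama3` model name is an assumption, substitute whichever model you use:

```bash
# Pull a model into the Ollama service (model name is illustrative)
curl http://localhost:11434/api/pull -d '{"name": "llama3"}'

# Run a one-off generation
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Summarize this record in one sentence.",
  "stream": false
}'
```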
## Requirements

- Docker and Docker Compose
- NVIDIA drivers (for GPU support)
- Git
- Minimum 16GB RAM recommended
- NVIDIA GPU with CUDA support (optional but recommended)
- 50GB+ available storage
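Before starting the stack, it is worth verifying that Docker can actually reach the GPU. A minimal check, assuming the NVIDIA Container Toolkit is installed (the CUDA image tag is illustrative):

```bash
# Confirm the host driver is working
nvidia-smi

# Confirm containers can access the GPU
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```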
## Advanced Configuration

### MLflow Experiment Tracking

```yaml
# Add to docker-compose.yml
mlflow:
  image: ghcr.io/mlflow/mlflow:latest
  ports:
    - "5000:5000"
  environment:
    - MLFLOW_TRACKING_URI=postgresql://postgres:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}
  depends_on:
    - postgres
```
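Once the service is up, runs can be logged from any notebook in the stack. A minimal sketch using the standard MLflow client API (the experiment name and values are arbitrary):

```python
import mlflow

# Point the client at the tracking server defined above
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("curation-smoke-test")

with mlflow.start_run():
    mlflow.log_param("batch_size", 32)
    mlflow.log_metric("loss", 0.42)
```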
### Ray Distributed Computing

```yaml
# Add to docker-compose.yml
ray-head:
  image: rayproject/ray:latest
  ports:
    - "8265:8265"   # Ray dashboard
    - "10001:10001" # Ray client server
  command: ray start --head --dashboard-host=0.0.0.0
```
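With the head node running, work can be submitted through the Ray client port exposed above. A minimal sketch (assumes the local `ray` package version matches the container's):

```python
import ray

# Connect to the head node via the Ray client server (port 10001)
ray.init("ray://localhost:10001")

@ray.remote
def square(x):
    return x * x

# Fan a trivial workload out across the cluster
print(ray.get([square.remote(i) for i in range(4)]))
```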
### Weights & Biases

```bash
# Add to your .env file
WANDB_API_KEY=your_key_here
```
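With the key in the environment, runs are logged with the standard `wandb` client; a minimal sketch (the project name is arbitrary):

```python
import wandb

# wandb.init() picks up WANDB_API_KEY from the environment
run = wandb.init(project="semblance-curation", config={"batch_size": 32})
run.log({"loss": 0.42})
run.finish()
```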
### GPU Configuration

```bash
# Add to your .env file
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
CUDA_VISIBLE_DEVICES=0,1  # Specify GPUs to use
```
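Note that these variables alone do not attach a GPU to a container; each GPU-enabled service also needs a device reservation in the compose file. A sketch of the standard Compose syntax (using `ollama` as an example service):

```yaml
# Add to each GPU-enabled service in docker-compose.yml
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```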
### Container Hardening

```yaml
# Add to docker-compose.yml for each service
security_opt:
  - no-new-privileges:true
ulimits:
  nproc: 65535
  nofile:
    soft: 65535
    hard: 65535
```
### PostgreSQL Tuning

```yaml
# Add to docker-compose.yml for database services
command: >
  -c max_connections=200
  -c shared_buffers=2GB
  -c effective_cache_size=6GB
  -c maintenance_work_mem=512MB
  -c checkpoint_completion_target=0.9
  -c wal_buffers=16MB
  -c default_statistics_target=100
  -c random_page_cost=1.1
  -c effective_io_concurrency=200
  -c work_mem=6553kB
  -c min_wal_size=1GB
  -c max_wal_size=4GB
```
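After restarting the database you can confirm the settings took effect (assumes the service is named `postgres` in the compose file):

```bash
docker compose exec postgres psql -U "${POSTGRES_USER}" -d "${POSTGRES_DB}" \
  -c "SHOW shared_buffers;" -c "SHOW max_connections;"
```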
### Prometheus Configuration

Create `prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'semblance-services'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'docker'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: [__meta_docker_container_name]
        regex: '/(.*)'
        target_label: container_name

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```
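Prometheus ships a linter, so the file can be validated before mounting it; one way to run it without a local install (re-using the same image):

```bash
docker run --rm -v "$(pwd)/prometheus:/etc/prometheus" \
  --entrypoint promtool prom/prometheus:latest \
  check config /etc/prometheus/prometheus.yml
```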
### Grafana Dashboards

Create `grafana/provisioning/dashboards/ml-metrics.json`:

```json
{
  "dashboard": {
    "title": "ML Pipeline Metrics",
    "panels": [
      {
        "title": "Model Training Progress",
        "type": "graph",
        "metrics": ["training_loss", "validation_loss"]
      },
      {
        "title": "GPU Utilization",
        "type": "gauge",
        "metrics": ["gpu_memory_used", "gpu_utilization"]
      },
      {
        "title": "Data Pipeline Throughput",
        "type": "stat",
        "metrics": ["records_processed_per_second"]
      }
    ]
  }
}
```
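For Grafana to load dashboards from that directory, a provisioning provider file is also required; a minimal sketch (the provider name is arbitrary, the path assumes the default container layout):

```yaml
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: "ml-dashboards"
    orgId: 1
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards
```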
### Security Policies

Create `security/security-policies.yml`:

```yaml
# Docker security options
security_opt:
  - seccomp:security/seccomp-profile.json
  - apparmor:security/apparmor-profile
  - no-new-privileges:true

# Network policies
networks:
  curation-net:
    driver: overlay
    attachable: true
    driver_opts:
      encrypted: "true"
    ipam:
      driver: default
      config:
        - subnet: 172.16.0.0/24

# Service-specific security
services:
  postgres:
    security_opt:
      - no-new-privileges:true
    environment:
      - POSTGRES_PASSWORD_FILE=/run/secrets/db_password
    secrets:
      - db_password
    configs:
      - source: postgres_config
        target: /etc/postgresql/postgresql.conf

secrets:
  db_password:
    file: ./secrets/db_password.txt
  ssl_cert:
    file: ./secrets/ssl_cert.pem

configs:
  postgres_config:
    file: ./configs/postgresql.conf
```
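The secret files referenced above must exist before the stack starts; one way to generate them locally (commands are illustrative, use a proper certificate authority in production):

```bash
mkdir -p secrets

# Random database password
openssl rand -base64 32 > secrets/db_password.txt

# Self-signed certificate for local testing only
openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
  -keyout secrets/ssl_key.pem -out secrets/ssl_cert.pem \
  -subj "/CN=localhost"
```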
### Monitoring Stack

Add these services to enable comprehensive monitoring:

```yaml
# Add to docker-compose.yml
prometheus:
  image: prom/prometheus:latest
  ports:
    - "9090:9090"

grafana:
  image: grafana/grafana:latest
  ports:
    - "3000:3000"

loki:
  image: grafana/loki:latest
  ports:
    - "3100:3100"

jaeger:
  image: jaegertracing/all-in-one:latest
  ports:
    - "16686:16686"
```
### VS Code Dev Container

Create a `.devcontainer/devcontainer.json`:

```json
{
  "name": "Semblance Dev Environment",
  "dockerComposeFile": ["../docker-compose.yml"],
  "service": "jupyter",
  "workspaceFolder": "/home/jovyan/work",
  "extensions": [
    "ms-python.python",
    "ms-toolsai.jupyter",
    "ms-azuretools.vscode-docker"
  ]
}
```
## Getting Started

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/semblance-curation.git
   cd semblance-curation
   ```

2. Copy the environment file:

   ```bash
   cp env-example .env
   ```

3. Configure your environment variables in `.env`

4. Start the services:

   ```bash
   docker compose up -d
   ```
## Service Ports

- Argilla: 6900
- Elasticsearch: 9200
- Weaviate: 8080
- PostgreSQL: 5432
- Redis: 6379
- MinIO: 9000 (API) / 9001 (Console)
- Ollama: 11434
- Jupyter: 8888
## Architecture

The platform consists of several containerized services:
- Argilla: Data annotation and curation
- Elasticsearch: Text search and analytics
- Weaviate: Vector database for ML features
- PostgreSQL: Structured data storage
- Redis: Caching and real-time features
- MinIO: Object storage for large files
- Ollama: Local LLM inference
- Jupyter: Interactive development environment
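From the Jupyter service, the data stores are reachable on the ports listed above. A minimal connectivity sketch, assuming the v1 Argilla and v3 Weaviate Python clients (the API key is a placeholder; adjust the calls for the client versions you pin):

```python
import argilla as rg
import weaviate

# Argilla annotation server (API key is a placeholder)
rg.init(api_url="http://localhost:6900", api_key="argilla.apikey")

# Weaviate vector database
client = weaviate.Client("http://localhost:8080")
print(client.is_ready())
```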
## Security

- All services run in isolated containers
- Configurable authentication for each service
- Secure data storage with volume persistence
- Environment-based configuration
## Documentation

Each component has its own documentation.
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Argilla for the data annotation platform
- Weaviate for the vector database
- Ollama for local LLM capabilities
- All other open-source projects that made this possible
## Best Practices

### Data Processing

- Use data streaming for large datasets (see the sketch after this list)
- Implement incremental processing
- Configure appropriate batch sizes
- Use parallel processing where possible
- Implement caching strategies
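As an illustration of the streaming recommendation, the Hugging Face `datasets` library can iterate a large corpus without materializing it; a minimal sketch (the dataset name and `process` helper are illustrative):

```python
from datasets import load_dataset

# Stream records instead of downloading the full dataset
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, record in enumerate(stream):
    process(record)  # hypothetical per-record processing step
    if i >= 1_000:
        break
```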
### Model Training

- Use experiment tracking (MLflow/W&B)
- Implement model versioning
- Set up automated testing
- Use distributed training for large models
- Implement model monitoring
### Deployment

- Use rolling updates
- Implement health checks (see the example after this list)
- Set up automated backups
- Configure auto-scaling
- Monitor resource usage
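For the health-check recommendation above, Docker Compose supports per-service checks; a sketch for the PostgreSQL service (intervals are illustrative):

```yaml
# Add to docker-compose.yml
postgres:
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
    interval: 30s
    timeout: 5s
    retries: 5
```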
## Example Training Pipeline

Create `pipelines/training_pipeline.py`:

```python
import mlflow
from ray import tune
from ray.tune.integration.mlflow import MLflowLoggerCallback


def train_model(config):
    mlflow.start_run(nested=True)

    # Data loading with versioning (load_dataset is a project-specific helper)
    dataset = load_dataset(
        path=config["data_path"],
        version=config["data_version"]
    )

    # Model configuration (create_model is a project-specific helper)
    model = create_model(
        architecture=config["model_arch"],
        params=config["model_params"]
    )

    # Training loop with metrics
    for epoch in range(config["num_epochs"]):
        metrics = train_epoch(model, dataset)

        # Log metrics to MLflow
        mlflow.log_metrics(metrics)

        # Report to Ray Tune
        tune.report(
            loss=metrics["loss"],
            accuracy=metrics["accuracy"]
        )


# Configure distributed training
training_config = {
    "num_epochs": 100,
    "batch_size": 32,
    "learning_rate": tune.loguniform(1e-4, 1e-1),
    "model_arch": "transformer",
    "data_version": "v1.0"
}

# Launch training
analysis = tune.run(
    train_model,
    config=training_config,
    num_samples=10,
    callbacks=[MLflowLoggerCallback()]
)
```
## Backup Configuration

Create `backup/backup-config.yml`:

```yaml
backup_schedule:
  postgres:
    frequency: "0 2 * * *"  # Daily at 2 AM
    retention: "30d"
    command: |
      pg_dump -Fc -d ${POSTGRES_DB} -U ${POSTGRES_USER} > \
        /backups/postgres/$(date +%Y%m%d).dump

  elasticsearch:
    frequency: "0 3 * * *"  # Daily at 3 AM
    retention: "30d"
    command: |
      curator_cli snapshot --name $(date +%Y%m%d) \
        --repository es_backup_repo

  minio:
    frequency: "0 4 * * *"  # Daily at 4 AM
    retention: "30d"
    command: |
      mc mirror /data /backups/minio/$(date +%Y%m%d)
```
## Disaster Recovery

Create `docs/disaster_recovery.md` with the recovery procedures:

### 1. Database Recovery

```bash
# Restore PostgreSQL
pg_restore -d ${POSTGRES_DB} -U ${POSTGRES_USER} /backups/postgres/latest.dump

# Restore Elasticsearch
curl -X POST "localhost:9200/_snapshot/es_backup_repo/latest/_restore"

# Restore MinIO
mc mirror /backups/minio/latest /data
```

### 2. Volume Recovery

```bash
# Stop affected services
docker compose stop affected_service

# Remove corrupted volumes
docker volume rm affected_volume

# Restore from backup
./scripts/restore.sh --service affected_service --backup latest

# Restart services
docker compose up -d
```

### 3. Verification

```bash
# Verify database integrity
./scripts/verify_db.sh

# Check data consistency
./scripts/verify_data.sh

# Validate service health
./scripts/health_check.sh
```
## Resource Alerts

Create `monitoring/resource-alerts.yml`:

```yaml
alerts:
  high_memory:
    threshold: 85%
    duration: 5m
    action: "scale_up_memory"
  high_cpu:
    threshold: 90%
    duration: 5m
    action: "scale_up_cpu"
  disk_space:
    threshold: 80%
    duration: 10m
    action: "notify_admin"

metrics:
  collection_interval: 30s
  retention_period: 30d
  exporters:
    - prometheus
    - grafana
```