Production-ready tooling and utilities for custom LLM training, fine-tuning, and deployment.
This repository contains Asoba's complete infrastructure for training custom language models. It currently supports the Qwen and Mistral model families, with an extensible architecture for future model integrations.
- Data Collection & Corpus Building - Automated scrapers and collectors for domain-specific training data
- Training Configurations - Hardware-optimized configs mapping models to tech stacks and instance types
- One-Shot Training Scripts - Streamlined deployment across various AWS instance types
- Monitoring & Validation - Real-time training progress tracking and quality assurance
| Model Family | Status | Hardware | Config |
|---|---|---|---|
| Qwen | ✅ Production | g5.xlarge+ | `qwen/` |
| Mistral | ✅ Production | g5.2xlarge+ | `mistral/` |
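The matrix above can also be read as a small registry, which is how new model families would slot in. The `MODEL_MATRIX` structure and `resolve_config` helper below are a hypothetical illustration, not repository code:

```python
# Hypothetical registry mirroring the support matrix above; adding a new
# model family would mean adding one entry here plus its config directory.
MODEL_MATRIX = {
    "qwen":    {"status": "production", "min_instance": "g5.xlarge",  "config_dir": "qwen/"},
    "mistral": {"status": "production", "min_instance": "g5.2xlarge", "config_dir": "mistral/"},
}

def resolve_config(family: str) -> dict:
    """Return the hardware/config entry for a supported model family."""
    try:
        return MODEL_MATRIX[family]
    except KeyError:
        raise ValueError(
            f"Unsupported model family: {family!r}; supported: {sorted(MODEL_MATRIX)}"
        ) from None

print(resolve_config("qwen")["min_instance"])  # g5.xlarge
```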
```bash
# Qwen training (recommended)
./scripts/qwen/deploy_qwen_verbosity_training_to_gpu.sh

# Mistral training with operatives-last processing
./scripts/mistral/deploy_mistral_to_g5.sh
```
```
├── scripts/
│   ├── qwen/                 # Qwen model training pipeline
│   ├── mistral/              # Mistral model training pipeline
│   ├── corpus-generation/    # Domain-specific data collection
│   └── monitoring/           # Production monitoring with alerts
├── data/
│   ├── corpus/               # Pre-built training datasets
│   ├── collectors/           # Data processing utilities
│   └── validation/           # Quality assurance pipelines
├── infrastructure/           # AWS deployment automation
├── training/                 # QLoRA trainers and frameworks
├── config/                   # Hardware-optimized configurations
└── tests/                    # Comprehensive test coverage
```
- IAC/DevOps - Infrastructure as Code, CI/CD, containerization
- Policy Analysis - Government policy, insurance, academic research
- Security/Compliance - Cybersecurity frameworks, compliance standards
- NSFW Content - Adult content classification and moderation
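As an illustration of how scraped documents might be bucketed by domain, here is a hypothetical keyword router; the keyword lists and the `route_document` helper are assumptions, and the real collectors under `scripts/corpus-generation/` are domain-specific:

```python
# Hypothetical keyword-based router for scraped documents; illustrative only.
DOMAIN_KEYWORDS = {
    "iac-devops": ("terraform", "kubernetes", "ci/cd", "dockerfile"),
    "policy": ("regulation", "statute", "underwriting", "policy brief"),
    "security": ("cve", "nist", "soc 2", "threat model"),
}

def route_document(text: str) -> str:
    """Assign a document to the domain with the most keyword hits."""
    lowered = text.lower()
    scores = {
        domain: sum(lowered.count(kw) for kw in keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified"

print(route_document("Terraform module for Kubernetes CI/CD"))  # iac-devops
```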
```bash
# Collect domain-specific corpus
./scripts/corpus-generation/iac-devops-corpus/corpus-builders/create_final_iac_corpus.py

# Validate corpus quality
./data/validation/universal_validation_pipeline.py
```
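As an illustration of the kind of gates a validation pipeline applies, here is a minimal, assumed sketch (a JSONL corpus with a `text` field is an assumption; the real checks live in `data/validation/universal_validation_pipeline.py`):

```python
# Hypothetical quality gates: drop exact duplicates and too-short records.
import hashlib
import json
import sys

def validate_corpus(path: str, min_chars: int = 200) -> dict:
    """Scan a JSONL corpus, counting records kept vs. dropped."""
    seen, kept, dropped = set(), 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            text = json.loads(line).get("text", "")
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if len(text) < min_chars or digest in seen:
                dropped += 1          # too short, or an exact duplicate
                continue
            seen.add(digest)
            kept += 1
    return {"kept": kept, "dropped": dropped}

if __name__ == "__main__":
    print(validate_corpus(sys.argv[1]))
```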
- Golden Config: Optimized for CLAUDE.md methodology compliance (illustrative settings sketched below)
- Hardware: g5.xlarge minimum, g5.2xlarge+ recommended
- Specialization: IAC/DevOps, code generation, system prompts
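A minimal sketch of what such a QLoRA configuration can look like; the hyperparameter values below are illustrative assumptions, not the repository's actual golden-config values, which live under `config/`:

```python
# Illustrative QLoRA settings for a single A10G (24 GB); values are assumed.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit base weights (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,                                   # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

NF4 quantization keeps the frozen base weights at roughly half a byte per parameter, which is what lets a single A10G host the fine-tune.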
- Operatives-Last Processing: Handles 3M+ file collections efficiently (see the sketch below)
- Hardware: g5.2xlarge minimum for stable training
- Specialization: Policy analysis, multi-domain reasoning
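The exact semantics of operatives-last processing are defined by the Mistral pipeline scripts; as a rough, assumed reading, it streams the bulk of the collection first and defers a designated "operatives" subset to a final pass. A hypothetical sketch, where the `marker` criterion is an assumption:

```python
# Hypothetical two-pass ordering over a large file collection.
from pathlib import Path
from typing import Iterable, Iterator

def operatives_last(paths: Iterable[Path], marker: str = "operatives") -> Iterator[Path]:
    """Yield non-operative files first, then the deferred operative subset."""
    deferred = []
    for p in paths:
        if marker in str(p):
            deferred.append(p)      # queue for the final pass
        else:
            yield p                 # stream everything else immediately
    yield from deferred
```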
```bash
# Deploy training instance with automatic setup
./infrastructure/auto-deploy-mistral.sh

# Set up QLoRA training environment
./infrastructure/setup_qlora_instance.sh
```
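After setup, a quick sanity check of the GPU and the QLoRA dependency stack can save a failed launch. This is an assumed workflow, not a repository script:

```python
# Post-setup sanity check: verify CUDA visibility and key library versions.
import torch
import transformers
import peft
import bitsandbytes

assert torch.cuda.is_available(), "No CUDA device visible"
print("GPU:", torch.cuda.get_device_name(0))
print("transformers", transformers.__version__,
      "| peft", peft.__version__,
      "| bitsandbytes", bitsandbytes.__version__)
```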
| Instance Type | vCPUs | RAM | GPU | Best For |
|---|---|---|---|---|
| g5.xlarge | 4 | 16 GB | 1x A10G (24 GB) | Development, small models |
| g5.2xlarge | 8 | 32 GB | 1x A10G (24 GB) | Production training |
| g5.4xlarge | 16 | 64 GB | 1x A10G (24 GB) | Large model fine-tuning |
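All three instance types carry the same 24 GB A10G, so the GPU budget is fixed; what the larger types buy is CPU and RAM headroom for data loading. A rough, assumption-laden estimate of why 24 GB suffices for QLoRA on a 7B model:

```python
# Back-of-envelope VRAM estimate for QLoRA on a 7B model (all numbers rough).
params = 7e9
base_weights = params * 0.5      # 4-bit base weights: ~0.5 bytes/param = 3.5 GB
adapters     = 0.2e9             # LoRA adapters + optimizer state (assumed)
overhead     = 6e9               # activations, KV cache, CUDA context (assumed)
total_gb = (base_weights + adapters + overhead) / 1e9
print(f"~{total_gb:.1f} GB needed vs. 24 GB on an A10G")  # ~9.7 GB needed
```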
Production-grade monitoring with failure detection and Slack alerts:
```bash
# Monitor with alerts (recommended)
./scripts/monitoring/production_monitor.sh mistral-20250804-171621

# Basic monitoring without alerts
python3 scripts/monitoring/monitor.py --run-id mistral-20250804-171621

# One-time status check
python3 scripts/monitoring/monitor.py --run-id mistral-20250804-171621 --once
```
Features:
- Silent failure detection with dual heartbeat monitoring (see the sketch after this list)
- Actionable Slack alerts with remediation steps
- Automatic error capture via S3 sentinels
- Direct S3 console links for quick debugging
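A minimal sketch of the dual-heartbeat idea; the file names, staleness threshold, and local-filesystem layout are assumptions (the production monitor reads S3 sentinels):

```python
# Hypothetical dual-heartbeat check: the trainer writes one heartbeat file,
# a sidecar process writes another; which one goes stale tells you what failed.
import time
from pathlib import Path

STALE_AFTER = 600  # seconds without an update before we assume failure

def is_stale(heartbeat: Path) -> bool:
    """True if a heartbeat file is missing or hasn't been touched recently."""
    return (not heartbeat.exists()
            or time.time() - heartbeat.stat().st_mtime > STALE_AFTER)

def check_run(run_dir: Path) -> str:
    trainer_stale = is_stale(run_dir / "trainer.heartbeat")
    sidecar_stale = is_stale(run_dir / "sidecar.heartbeat")
    if trainer_stale and sidecar_stale:
        return "instance down"             # both stale: host/network failure
    if trainer_stale:
        return "silent training failure"   # sidecar alive but trainer hung
    return "healthy"

print(check_run(Path("/tmp/runs/mistral-20250804-171621")))
```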
```bash
# Run comprehensive test suite
pytest tests/

# Validate training configurations
./scripts/qwen/validate_qwen_styles.py
./scripts/mistral/validate_mistral_golden_config.py
```
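A hypothetical test in the spirit of the suite; the config file layout and schema here are assumptions, not the repository's actual structure:

```python
# Assumed example: every JSON config under config/ should name a validated
# g5-family instance type.
import json
from pathlib import Path

import pytest

CONFIG_DIR = Path("config")

@pytest.mark.parametrize("path", sorted(CONFIG_DIR.glob("*.json")))
def test_config_declares_supported_instance(path):
    config = json.loads(path.read_text())
    assert config.get("instance_type", "").startswith("g5."), (
        f"{path.name}: expected a validated g5-family instance type"
    )
```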
- Follow CLAUDE.md methodology: Explore → Plan → Code → Commit
- All training data must be from authentic, real-world sources
- Maintain comprehensive test coverage
- Hardware configs must be validated across instance types
Built by Asoba for production LLM training at scale.