Asoba Model Training Pipeline

Production-ready tooling and utilities for custom LLM training, fine-tuning, and deployment.

Overview

This repository contains Asoba's complete infrastructure for training custom language models. It currently supports the Qwen and Mistral model families, with an extensible architecture for future model integrations.

Core Capabilities

  • Data Collection & Corpus Building - Automated scrapers and collectors for domain-specific training data
  • Training Configurations - Hardware-optimized configs mapping models to tech stacks and instance types
  • One-Shot Training Scripts - Streamlined deployment across various AWS instance types
  • Monitoring & Validation - Real-time training progress tracking and quality assurance

Quick Start

Current Model Support

Model Family   Status          Hardware       Config
Qwen           ✅ Production   g5.xlarge+     qwen/
Mistral        ✅ Production   g5.2xlarge+    mistral/

Training a Model

# Qwen training (recommended)
./scripts/qwen/deploy_qwen_verbosity_training_to_gpu.sh

# Mistral training with operatives-last processing
./scripts/mistral/deploy_mistral_to_g5.sh

Repository Structure

├── scripts/
│   ├── qwen/                    # Qwen model training pipeline
│   ├── mistral/                 # Mistral model training pipeline  
│   ├── corpus-generation/       # Domain-specific data collection
│   └── monitoring/              # Production monitoring with alerts
├── data/
│   ├── corpus/                  # Pre-built training datasets
│   ├── collectors/              # Data processing utilities
│   └── validation/              # Quality assurance pipelines
├── infrastructure/              # AWS deployment automation
├── training/                    # QLoRA trainers and frameworks
├── config/                      # Hardware-optimized configurations
└── tests/                       # Comprehensive test coverage

Corpus Collection

Supported Domains

  • IAC/DevOps - Infrastructure as Code, CI/CD, containerization
  • Policy Analysis - Government policy, insurance, academic research
  • Security/Compliance - Cybersecurity frameworks, compliance standards
  • NSFW Content - Adult content classification and moderation

Usage

# Collect domain-specific corpus
./scripts/corpus-generation/iac-devops-corpus/corpus-builders/create_final_iac_corpus.py

# Validate corpus quality
./data/validation/universal_validation_pipeline.py
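
Before training, it is worth spot-checking the collected datasets. The commands below assume JSONL files under data/corpus/; adjust the path and format to whatever your corpus builder actually emits:

# Count records and pretty-print the first one (assumes JSONL output)
wc -l data/corpus/*.jsonl
head -n 1 "$(ls data/corpus/*.jsonl | head -n 1)" | python3 -m json.tool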

Training Pipelines

Qwen Pipeline

  • Golden Config: Optimized for CLAUDE.md methodology compliance
  • Hardware: g5.xlarge minimum, g5.2xlarge+ recommended
  • Specialization: IAC/DevOps, code generation, system prompts

→ Qwen Training Guide
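
Before running the Qwen deploy script, it helps to confirm that your AWS credentials are active and that g5 capacity exists in the target region. A minimal pre-flight check (the region below is illustrative, not part of the pipeline):

# Confirm credentials and g5 availability before deploying
aws sts get-caller-identity
aws ec2 describe-instance-type-offerings \
    --location-type availability-zone \
    --filters Name=instance-type,Values=g5.xlarge,g5.2xlarge \
    --region us-east-1 \
    --query 'InstanceTypeOfferings[].[InstanceType,Location]' \
    --output table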

Mistral Pipeline

  • Operatives-Last Processing: Handles 3M+ file collections efficiently
  • Hardware: g5.2xlarge minimum for stable training
  • Specialization: Policy analysis, multi-domain reasoning

→ Mistral Training Guide
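
For 3M+ file collections, a common approach is to stage the corpus in S3 ahead of time so the training instance can pull it at startup. A sketch with a placeholder bucket name (the data path the deploy script actually expects is covered in the Mistral Training Guide):

# Stage a large corpus to S3 before launching (bucket name is a placeholder)
aws s3 sync data/corpus/ s3://your-training-data-bucket/corpus/ --only-show-errors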

Infrastructure

One-Shot Deployment

# Deploy training instance with automatic setup
./infrastructure/auto-deploy-mistral.sh

# Set up the QLoRA training environment
./infrastructure/setup_qlora_instance.sh

Hardware Configurations

Instance Type   vCPUs   Memory   GPU        Best For
g5.xlarge       4       16GB     1x A10G    Development, small models
g5.2xlarge      8       32GB     1x A10G    Production training
g5.4xlarge      16      64GB     1x A10G    Large model fine-tuning
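
Once a training instance is up, a quick check confirms the A10G is visible before committing to a multi-hour run. This is a generic sanity check, assuming the setup scripts have installed the NVIDIA driver and PyTorch:

# Verify the GPU and CUDA are usable
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"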

Monitoring

Production-grade monitoring with failure detection and Slack alerts:

# Monitor with alerts (recommended)
./scripts/monitoring/production_monitor.sh mistral-20250804-171621

# Basic monitoring without alerts
python3 scripts/monitoring/monitor.py --run-id mistral-20250804-171621

# One-time status check
python3 scripts/monitoring/monitor.py --run-id mistral-20250804-171621 --once

Features:

  • Silent failure detection with dual heartbeat monitoring
  • Actionable Slack alerts with remediation steps
  • Automatic error capture via S3 sentinels
  • Direct S3 console links for quick debugging
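
Slack alerts require an incoming webhook. The variable name below is illustrative (check production_monitor.sh for the one it actually reads); sending a test message verifies the webhook before starting a long run:

# Send a test alert to the webhook (SLACK_WEBHOOK_URL is an illustrative name)
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/XXX/YYY/ZZZ"
curl -X POST -H 'Content-type: application/json' \
    --data '{"text":"Training monitor webhook test"}' "$SLACK_WEBHOOK_URL"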

→ Production Monitoring Guide

Development

Testing

# Run comprehensive test suite
pytest tests/

# Validate training configurations
./scripts/qwen/validate_qwen_styles.py
./scripts/mistral/validate_mistral_golden_config.py

Contributing

  1. Follow CLAUDE.md methodology: Explore → Plan → Code → Commit
  2. All training data must be from authentic, real-world sources
  3. Maintain comprehensive test coverage
  4. Hardware configs must be validated across instance types

Built by Asoba for production LLM training at scale.
