Get BitNet neural network inference running in under 5 minutes.
This guide gets you from zero to running BitNet 1-bit quantized neural network inference immediately. For comprehensive development setup, see development/.
# Check Rust version (1.92.0+ required)
rustc --version
# Clone repository
git clone https://github.com/EffortlessMetrics/BitNet-rs
cd BitNet-rs

# CPU inference (fastest setup)
cargo build --release --no-default-features --features cpu
# OR GPU inference (if CUDA available)
cargo build --release --no-default-features --features gpu

# Download Microsoft's 1.58-bit quantized model (QK256 GGML I2_S format)
cargo run -p xtask -- download-model --id microsoft/bitnet-b1.58-2B-4T-gguf --file ggml-model-i2_s.gguf

What is QK256? This model uses GGML-compatible I2_S quantization with 256-element blocks and separate scale tensors. BitNet-rs automatically detects the quantization flavor and routes to the appropriate kernels.
BitNet-rs automatically discovers and loads tokenizers from GGUF files:
# Verify GGUF model with automatic tokenizer discovery
cargo run -p xtask -- verify --model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf
# Or specify tokenizer explicitly if needed
cargo run -p xtask -- verify --model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf --tokenizer models/microsoft-bitnet-b1.58-2B-4T-gguf/tokenizer.json

What Just Happened?
- BitNet-rs extracted tokenizer metadata from GGUF file
- Detected model architecture (BitNet, LLaMA, GPT-2, etc.)
- Resolved vocabulary size (32K, 128K, or custom)
- Applied model-specific tokenizer configuration
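The discovery flow above can be sketched as a small metadata lookup. This is an illustration only, not the real BitNet-rs API: the function name and the `tokenizer.ggml.token_count` key are assumptions (`general.architecture` is a standard GGUF key).

```rust
use std::collections::HashMap;

/// Illustrative stand-in for GGUF metadata key/value pairs.
/// Resolves (architecture, vocab size) the way the loader is described
/// to: read the architecture key, then derive the vocabulary size.
fn resolve_tokenizer(meta: &HashMap<&str, String>) -> (String, usize) {
    // 1. Detect the model architecture from the standard GGUF key.
    let arch = meta
        .get("general.architecture")
        .cloned()
        .unwrap_or_else(|| "unknown".to_string());
    // 2. Resolve the vocabulary size from embedded tokenizer metadata
    //    (key name here is hypothetical).
    let vocab: usize = meta
        .get("tokenizer.ggml.token_count")
        .and_then(|v| v.parse().ok())
        .unwrap_or(32_000); // fall back to a common 32K default
    (arch, vocab)
}

fn main() {
    let mut meta = HashMap::new();
    meta.insert("general.architecture", "bitnet".to_string());
    meta.insert("tokenizer.ggml.token_count", "128256".to_string());
    let (arch, vocab) = resolve_tokenizer(&meta);
    println!("arch={arch} vocab={vocab}");
}
```

The same pattern explains why no `--tokenizer` flag is needed for standard models: everything required is already in the GGUF metadata.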
# Generate text with automatic tokenizer discovery
cargo run -p xtask -- infer --model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf --prompt "BitNet is a neural network architecture that" --deterministic
# Stream inference (real-time generation) with automatic tokenizer
cargo run -p xtask -- infer --model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf --prompt "Explain 1-bit quantization:" --stream
# Or specify tokenizer explicitly if needed
cargo run -p xtask -- infer --model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf --tokenizer models/microsoft-bitnet-b1.58-2B-4T-gguf/tokenizer.json --prompt "Test" --deterministic

For maximum inference throughput on your hardware:
# Build with native CPU optimizations (recommended for production)
RUSTFLAGS="-C target-cpu=native -C opt-level=3 -C lto=thin" \
cargo build --release --no-default-features --features cpu,full-cli
# Run with full CPU parallelization and reduced log noise
RAYON_NUM_THREADS=$(nproc) RUST_LOG=warn \
cargo run --release -p bitnet-cli --no-default-features --features cpu,full-cli -- run \
--model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf \
--prompt "Explain 1-bit quantization" --max-tokens 128 --temperature 0.7
# Deterministic math sanity check (validates model correctness)
RAYON_NUM_THREADS=1 RUST_LOG=warn \
cargo run --release -p bitnet-cli --no-default-features --features cpu,full-cli -- run \
--model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf \
--prompt "Answer with a single digit: 2+2=" --max-tokens 1 \
--temperature 0.0 --greedy

Expected output from math check: 4
Performance Tuning:
- RUSTFLAGS="-C target-cpu=native": Enable all CPU instructions (AVX2/AVX-512/NEON)
- -C opt-level=3: Maximum optimization (aggressive inlining, vectorization)
- -C lto=thin: Link-time optimization for better performance
- RAYON_NUM_THREADS=$(nproc): Use all CPU cores (production inference)
- RAYON_NUM_THREADS=1: Single-threaded (deterministic results for validation)
- RUST_LOG=warn: Reduce logging overhead (shows only warnings/errors)
Before you start, understand the performance characteristics of different quantization formats:
| Quantization Format | Status | CPU Performance | Use Case | Time for 128 tokens |
|---|---|---|---|---|
| I2_S BitNet32-F16 | ✅ Production | SIMD-optimised | Recommended | Hardware-dependent |
| I2_S QK256 (GGML) | ⚠️ MVP | ~0.1 tok/s (scalar) | Validation only | ~20 minutes |
| TL1/TL2 | 🚧 Experimental | SIMD-optimised | Research | Hardware-dependent |
The microsoft/bitnet-b1.58-2B-4T-gguf model uses QK256 format, which is currently MVP-only with scalar kernels.
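The "~20 minutes" figure in the table follows directly from the quoted scalar throughput. A quick back-of-envelope check (assuming the ~0.1 tok/s rate above):

```rust
/// Minutes needed to generate `tokens` tokens at `tok_per_s` throughput.
fn generation_minutes(tokens: f64, tok_per_s: f64) -> f64 {
    tokens / tok_per_s / 60.0
}

fn main() {
    // 128 tokens at the QK256 scalar rate of ~0.1 tok/s:
    let minutes = generation_minutes(128.0, 0.1);
    println!("{minutes:.1} minutes"); // 21.3 minutes, i.e. "~20 minutes"
}
```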
If you're using QK256 models (like microsoft/bitnet-b1.58-2B-4T-gguf):
# ✅ Quick validation (4-16 tokens) - RECOMMENDED
cargo run -p bitnet-cli --features cpu,full-cli -- run \
--model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf \
--prompt "What is 2+2?" \
--max-tokens 8 # Keep this small for QK256
# ❌ Long generation (128+ tokens) - WILL BE VERY SLOW
# This will take 20+ minutes with QK256 scalar kernels

Why is QK256 slow?
- Uses scalar (non-SIMD) kernels for correctness validation
- SIMD optimizations planned for v0.2.0 (≥3× improvement target)
- This is expected MVP behavior, not a bug
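To make "scalar kernels" concrete, here is a sketch of what scalar dequantization of one QK256-style block looks like: 256 two-bit codes packed four per byte, mapped to ternary values and multiplied by a per-block scale, one element at a time (no SIMD). The code-to-value mapping here is an assumption for illustration; the real on-disk layout is defined by the GGML I2_S format.

```rust
/// Scalar dequantization of one QK256-style block: 256 two-bit codes
/// packed four per byte (64 bytes), plus one separately stored scale.
/// The code→value mapping {0,1,2} → {-1,0,+1} is illustrative only.
fn dequant_block_scalar(packed: &[u8; 64], scale: f32) -> Vec<f32> {
    let mut out = Vec::with_capacity(256);
    for &byte in packed {
        for shift in [0, 2, 4, 6] {
            let code = (byte >> shift) & 0b11; // extract one 2-bit code
            let ternary = code as i32 - 1;     // map to {-1, 0, +1}
            out.push(ternary as f32 * scale);  // apply the block scale
        }
    }
    out
}

fn main() {
    // All codes = 0b10 (→ +1) with scale 0.5 should yield 256 values of 0.5.
    let packed = [0b1010_1010u8; 64];
    let vals = dequant_block_scalar(&packed, 0.5);
    assert_eq!(vals.len(), 256);
    assert!(vals.iter().all(|&v| v == 0.5));
}
```

A SIMD version would process many codes per instruction, which is where the ≥3× v0.2.0 improvement target comes from.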
For production inference, use I2_S BitNet32-F16 models instead.
# Benchmark inference throughput with CPU optimization
RUSTFLAGS="-C target-cpu=native -C opt-level=3 -C lto=thin" \
cargo build --release --no-default-features --features cpu,full-cli
RAYON_NUM_THREADS=$(nproc) RUST_LOG=warn \
cargo run --release -p xtask -- benchmark \
--model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf --tokens 16 # Reduced for QK256

Expected Performance:
- I2_S BitNet32-F16: SIMD-optimised; performance varies by hardware
- I2_S QK256: ~0.1 tok/s (MVP scalar kernels, validation only)
- Memory usage: ~2GB for 2B parameter model
For production deployments with QK256 models, use strict loader mode to ensure proper model loading:
# Enable strict loader (fail-fast on model loading errors)
export BITNET_DISABLE_MINIMAL_LOADER=1
# Verify model loads correctly with enhanced GGUF loader
cargo run -p xtask -- verify --model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf
# Run inference with strict validation
cargo run -p bitnet-cli --no-default-features --features cpu,full-cli -- run \
--model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf \
--prompt "What is 2+2?" \
--max-tokens 16

Why Strict Mode? The strict loader prevents silent fallback to the minimal loader, which may use incorrect default values (e.g., 32 layers, 0 kv_heads) if the enhanced loader fails. This ensures production inference uses accurate model dimensions.
QK256 is a GGML-compatible I2_S quantization format with 256-element blocks and separate scale tensors. BitNet-rs provides automatic format detection and strict validation modes for production deployments.
The loader automatically detects QK256 format based on tensor size patterns. When a tensor's size matches the QK256 quantization scheme (256-element blocks with separate scales), the loader routes to QK256-specific kernels without requiring explicit configuration.
How it works:
- Loader examines tensor dimensions during GGUF parsing
- Calculates expected size for different quantization formats
- Prioritizes QK256 (GgmlQk256NoScale) for close matches
- Routes to appropriate dequantization kernels automatically
Benefits:
- Zero configuration required for standard QK256 models
- Seamless compatibility with GGML ecosystem
- Automatic fallback to other I2_S flavors if needed
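The size-matching heuristic described above can be sketched as follows. The byte counts, tolerance, and enum names besides `GgmlQk256NoScale` are assumptions for illustration, not the loader's real API.

```rust
/// Illustrative size-based flavor detection: QK256 packs 256 elements
/// into 64 bytes (2 bits each) with scales in a separate tensor, while
/// a BitNet32-F16 block is assumed here to be 8 packed bytes plus a
/// 2-byte f16 scale per 32 elements.
#[derive(Debug, PartialEq)]
enum I2sFlavor {
    GgmlQk256NoScale,
    BitNet32F16,
    Unknown,
}

fn detect_flavor(n_elements: usize, tensor_bytes: usize) -> I2sFlavor {
    let qk256_bytes = n_elements / 4;            // 2 bits per element
    let bn32_bytes = (n_elements / 32) * (8 + 2); // packed codes + f16 scale
    // "Close match" = within 0.1% of the expected payload size.
    let close = |expected: usize| {
        (tensor_bytes as f64 - expected as f64).abs() / expected as f64 <= 0.001
    };
    if close(qk256_bytes) {
        I2sFlavor::GgmlQk256NoScale // checked first: QK256 is prioritized
    } else if close(bn32_bytes) {
        I2sFlavor::BitNet32F16
    } else {
        I2sFlavor::Unknown
    }
}

fn main() {
    // A 256,000-element tensor occupying exactly 64,000 bytes matches QK256.
    assert_eq!(detect_flavor(256_000, 64_000), I2sFlavor::GgmlQk256NoScale);
}
```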
Enforce exact QK256 alignment (reject tensors with >0.1% size deviation) for production validation:
# Enable strict loader with BITNET_DISABLE_MINIMAL_LOADER environment variable
export BITNET_DISABLE_MINIMAL_LOADER=1
# Run inference with strict validation
cargo run -p bitnet-cli --no-default-features --features cpu,full-cli -- run \
--model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf \
--tokenizer models/microsoft-bitnet-b1.58-2B-4T-gguf/tokenizer.json \
--strict-loader \
--prompt "Test" \
--max-tokens 16

Use strict mode when:
- Validating model exports for production deployment
- Debugging model loading issues
- Running CI/CD parity tests
What strict mode enforces:
- Exact tensor size alignment (no tolerance for size mismatches)
- Fail-fast on quantization format detection errors
- Prevents silent fallback to minimal loader defaults
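The fail-fast behavior can be sketched as a single validation function (names and the exact error message are hypothetical; the 0.1% tolerance comes from the strict-mode description above):

```rust
/// Illustrative strict-loader check: reject any tensor whose byte size
/// deviates more than 0.1% from the exact QK256 expectation, instead of
/// silently falling back to minimal-loader defaults.
fn check_qk256_strict(n_elements: usize, tensor_bytes: usize) -> Result<(), String> {
    let expected = n_elements / 4; // 2 bits per element, packed
    let deviation =
        (tensor_bytes as f64 - expected as f64).abs() / expected as f64;
    if deviation > 0.001 {
        return Err(format!(
            "tensor size {tensor_bytes} deviates {:.2}% from expected {expected} bytes",
            deviation * 100.0
        ));
    }
    Ok(())
}

fn main() {
    // Exact match loads fine.
    assert!(check_qk256_strict(25_600, 6_400).is_ok());
    // A 1% oversized tensor is rejected rather than loaded with guessed dims.
    assert!(check_qk256_strict(25_600, 6_464).is_err());
}
```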
Learn more: See howto/use-qk256-models.md for comprehensive QK256 usage guide.
BitNet-rs generates receipts for every inference run, proving real computation with kernel IDs:
# 1. Run parity validation (generates receipt)
scripts/parity_smoke.sh models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf
# 2. Check receipt location (automatically created with timestamp)
# Receipt path: docs/baselines/<YYYY-MM-DD>/parity-bitnetcpp.json
# 3. View receipt summary (if jq installed)
jq '{parity, tokenizer, validation}' docs/baselines/$(date +%Y-%m-%d)/parity-bitnetcpp.json
# 4. Verify parity metrics
# - cosine_similarity: ≥0.99 (Rust vs C++ agreement)
# - exact_match_rate: token-level agreement percentage
# - status: "ok" (parity passed) or "rust_only" (C++ unavailable)

Receipt Fields:
- validation.compute: "rust" (pure Rust kernels) or "cpp" (FFI fallback)
- parity.status: "ok" (validated), "rust_only" (no C++ ref), or "failed"
- parity.cpp_available: true if C++ reference was used for validation
- tokenizer.source: "rust" (always Rust tokenizer, even with FFI compute)
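The two parity metrics in the receipt are standard computations. A minimal sketch of the math (the real values are produced by the parity harness, not this code):

```rust
/// Cosine similarity between two logit vectors: dot(a,b) / (|a|·|b|).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Fraction of positions where Rust and C++ emitted the same token ID.
fn exact_match_rate(rust_tokens: &[u32], cpp_tokens: &[u32]) -> f32 {
    let matches = rust_tokens
        .iter()
        .zip(cpp_tokens)
        .filter(|(a, b)| a == b)
        .count();
    matches as f32 / rust_tokens.len() as f32
}

fn main() {
    // Nearly identical logit vectors clear the ≥0.99 cosine gate.
    let rust = [0.9f32, 0.1, 0.0];
    let cpp = [0.89f32, 0.11, 0.01];
    assert!(cosine_similarity(&rust, &cpp) >= 0.99);
    // Identical token streams give an exact_match_rate of 1.0.
    assert_eq!(exact_match_rate(&[1, 2, 3, 4], &[1, 2, 3, 4]), 1.0);
}
```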
Verify QK256 implementation against the Microsoft BitNet C++ reference:
# Set up C++ reference path
export BITNET_CPP_DIR=/path/to/bitnet.cpp
# Run comprehensive cross-validation
cargo run -p xtask -- crossval
# Or use quick parity smoke test
./scripts/parity_smoke.sh models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf

Receipt validation:
# View parity metrics from generated receipt
jq '.parity' docs/baselines/*/parity-bitnetcpp.json
# Expected output:
# {
# "cpp_available": true,
# "cosine_similarity": 0.9923,
# "exact_match_rate": 1.0,
# "status": "ok"
# }

Cross-validation ensures:
- Numerical equivalence between Rust and C++ implementations
- Cosine similarity ≥0.99 for output tensors
- Token-level agreement for autoregressive generation
- Receipt-based proof of parity validation
You've successfully:
- Built BitNet-rs with device-aware quantization and complete transformer implementation
- Downloaded a QK256 model (Microsoft's 1.58-bit GGUF in GGML I2_S format) with automatic flavor detection
- Automatic tokenizer discovery extracted tokenizer from GGUF metadata, detected model architecture, and applied optimal configuration
- Verified model compatibility with enhanced GGUF loader, strict mode validation, and comprehensive tensor validation
- Ran production-grade inference with pure-Rust QK256 kernels, real transformer weights, and autoregressive generation
- Benchmarked performance: run cargo run -p xtask -- benchmark --model <path> --tokens 128 to produce a verifiable receipt (typical CPU envelope: 10–25 tok/s for I2_S BitNet32-F16)
- Generated validation receipts with parity metrics, kernel IDs, and reproducible baselines in docs/baselines/
- QK256 Deep Dive: Comprehensive QK256 usage guide in howto/use-qk256-models.md
- I2_S Architecture: Understand dual-flavor quantization in explanation/i2s-dual-flavor.md
- Tokenizer Discovery: Learn about automatic tokenizer discovery in reference/tokenizer-discovery-api.md
- API Integration: See reference/real-model-api-contracts.md for Rust API usage
- Model Formats: Learn about GGUF, I2_S, TL1, TL2 quantization in explanation/
- GPU Setup: Enable CUDA acceleration in development/gpu-setup-guide.md
- Troubleshooting: Common issues in troubleshooting.md
# CPU build and test
cargo build --no-default-features --features cpu
cargo test --workspace --no-default-features --features cpu
# GPU build and test
cargo build --no-default-features --features gpu
cargo test --workspace --no-default-features --features gpu
# Download and verify model (automatic tokenizer discovery)
cargo run -p xtask -- download-model
cargo run -p xtask -- verify --model PATH
# Neural network inference with automatic tokenizer
cargo run -p xtask -- infer --model PATH --prompt "TEXT" --deterministic
cargo run -p xtask -- benchmark --model PATH --tokens 128
# Explicit tokenizer specification (optional)
cargo run -p xtask -- verify --model PATH --tokenizer PATH
cargo run -p xtask -- infer --model PATH --tokenizer PATH --prompt "TEXT"

Total time: ~5 minutes to working BitNet neural network inference