Live Demo: https://ml-failure-dashboard.vercel.app (uses mock data)
A dashboard for analyzing ML model failures, with a focus on identifying dangerous high-confidence errors. Built with React + TypeScript (frontend) and FastAPI (backend).
Full backend with real CIFAR-10 evaluation available locally (see Quick Start below).
Accuracy hides dangerous failure modes. This project focuses on identifying systematic confusions and overconfident errors in ML classifiers before deployment.
This dashboard's killer feature: finding high-confidence wrong predictions — the most dangerous errors in production.
- Model achieves ~88% accuracy on CIFAR-10
- 4.7% of predictions are high-confidence but wrong (the red danger zone)
- Confusion matrix shows cat↔dog and automobile↔truck are major confusion pairs
- Click "Only Confident Wrong" checkbox to reveal the killer feature
- These are predictions where the model is ≥80% confident but completely wrong
- Notice the confidence scores: 0.85, 0.91, 0.93 — dangerously overconfident
- Click any row to inspect in detail
- See the full top-3 predictions and their confidence breakdown
- The actual image from the CIFAR-10 test set is shown alongside
- ECE (Expected Calibration Error): 4.23% — shows model is fairly well calibrated
- Click confusion matrix cells to explore specific error slices (e.g., "cat → dog")
- Export filtered predictions as CSV/JSONL for deeper analysis
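The "Only Confident Wrong" filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the dashboard's actual implementation: the `Prediction` record, its field names, and the `confident_wrong` helper are hypothetical, and the 0.8 threshold is taken from the "≥80% confident" description.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    # Hypothetical record shape; the real API schema may differ.
    id: int
    true_label: str
    predicted_label: str
    confidence: float  # top-1 probability in [0, 1]


def confident_wrong(preds, threshold=0.8):
    """Return misclassified predictions with confidence >= threshold,
    sorted from most to least confident."""
    wrong = [
        p for p in preds
        if p.predicted_label != p.true_label and p.confidence >= threshold
    ]
    return sorted(wrong, key=lambda p: p.confidence, reverse=True)


# Toy data for illustration only (not real CIFAR-10 results).
preds = [
    Prediction(0, "cat", "cat", 0.97),           # correct, ignored
    Prediction(1, "cat", "dog", 0.91),           # confident and wrong
    Prediction(2, "truck", "automobile", 0.85),  # confident and wrong
    Prediction(3, "bird", "airplane", 0.55),     # wrong, but low confidence
]
flagged = confident_wrong(preds)
rate = len(flagged) / len(preds)  # share of all predictions that are flagged
```

Sorting by confidence puts the most overconfident mistakes first, which is usually the order in which they are worth inspecting.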
cd frontend
npm install
npm run dev
Open http://localhost:5173 - uses mock data by default.
Terminal 1 - Backend:
cd backend
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000
Terminal 2 - Frontend:
cd frontend
npm install
npm run dev
Then create frontend/.env.local:
VITE_USE_MOCKS=false
VITE_API_BASE=http://localhost:8000
The live demo at https://ml-failure-dashboard.vercel.app uses mock data for demonstration purposes.
To deploy with real backend:
- See DEPLOYMENT.md for detailed instructions
- Backend can be deployed to Railway, Render, or Fly.io
- Frontend is deployed to Vercel, with the backend URL supplied as an environment variable
- Model Overview - Accuracy, precision, recall, F1, confidence breakdown
- Confusion Matrix - Clickable cells to explore error slices
- Confidence Curve - Accuracy vs confidence buckets
- Errors by Class - Bar chart of error distribution
- Reliability Diagram - Model calibration with ECE (Expected Calibration Error)
- Slice Explorer - Click confusion matrix cells to filter to specific misclassifications
- Export - Download filtered predictions as CSV or JSONL
- Overconfident Errors Toggle - Highlight dangerous high-confidence mistakes
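To make the confusion matrix and Slice Explorer features concrete, here is a rough sketch of how confusion pairs can be ranked and how one cell (for example "cat → dog") can be turned into a filter. The function names and the toy labels are assumptions made for illustration; the dashboard's own code may be organized differently.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) pairs; each key is one confusion matrix cell."""
    return Counter(zip(true_labels, predicted_labels))


def top_confusions(counts, k=3):
    """Return the k largest off-diagonal cells as ((true, pred), count)."""
    errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    # Sort by count descending, then by label pair for a stable order.
    return sorted(errors.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


def slice_indices(true_labels, predicted_labels, true_cls, pred_cls):
    """Indices of samples in one cell, e.g. true 'cat' predicted as 'dog'."""
    return [
        i for i, (t, p) in enumerate(zip(true_labels, predicted_labels))
        if t == true_cls and p == pred_cls
    ]


# Toy labels for illustration only.
y_true = ["cat", "cat", "cat", "dog", "dog", "truck", "truck", "automobile"]
y_pred = ["dog", "dog", "cat", "cat", "dog", "automobile", "truck", "automobile"]

counts = confusion_counts(y_true, y_pred)
worst = top_confusions(counts, k=2)
cat_as_dog = slice_indices(y_true, y_pred, "cat", "dog")
```

Clicking a matrix cell in the UI corresponds to calling something like `slice_indices` with that cell's true and predicted class.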
ml-failure-dashboard/
├── frontend/ # React + TypeScript + Vite
│ ├── src/
│ │ ├── api/ # API client & types
│ │ ├── components/ # React components
│ │ └── pages/ # Dashboard page
│ ├── vercel.json # Vercel config
│ └── package.json
│
├── backend/ # FastAPI
│ ├── app/
│ │ ├── api/routes.py # API endpoints
│ │ ├── data/ # JSON artifacts
│ │ ├── models/ # Pydantic schemas
│ │ └── services/ # Data store & evaluator
│ └── requirements.txt
│
└── README.md
| Endpoint | Description |
|---|---|
| GET /api/overview | Model metrics & failure breakdown |
| GET /api/confusion-matrix | Confusion matrix data |
| GET /api/confidence-curve | Calibration curve data |
| GET /api/errors-by-class | Error distribution by class |
| GET /api/predictions | Paginated predictions with filters |
| GET /api/predictions/{id} | Single prediction by ID |
| GET /api/calibration | Reliability diagram data with ECE |
| GET /api/export | Export predictions as CSV or JSONL |
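For quick checks outside the frontend, the endpoints can also be called from a short script. The sketch below uses only the Python standard library and assumes the backend is running locally on port 8000 as in the Quick Start. The helpers `endpoint_url` and `fetch_json` are hypothetical names, and the response shapes are not documented here, so the code only decodes whatever JSON comes back.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumed local backend address (same value as VITE_API_BASE in .env.local).
API_BASE = "http://localhost:8000"


def endpoint_url(path, base=API_BASE):
    """Join the base URL and an endpoint path, tolerating stray slashes."""
    return urljoin(base.rstrip("/") + "/", path.lstrip("/"))


def fetch_json(path, base=API_BASE, timeout=10):
    """GET an endpoint and decode the JSON body (needs a running backend)."""
    with urlopen(endpoint_url(path, base), timeout=timeout) as resp:
        return json.load(resp)
```

With the backend up, `fetch_json("/api/overview")` should return the model metrics as a decoded JSON object. The filter parameters accepted by `/api/predictions` are not listed in this README, so check `app/api/routes.py` before adding a query string.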
Before deploying, generate real evaluation artifacts:
cd backend
source venv/bin/activate
# Install dependencies (if not done)
pip install -r requirements.txt
# Train model and generate artifacts (~5-10 minutes)
python -m app.services.evaluator --epochs 3 --seed 42
This will:
- ✅ Train a SimpleCNN on CIFAR-10 (reaches ~65-75% accuracy)
- ✅ Evaluate on 10,000 test images
- ✅ Generate all JSON artifacts in app/data/
- ✅ Save test images to app/static/images/test/
- ✅ Compute calibration (ECE) and reliability diagram data
Note: Use --epochs 5 for better accuracy (~80%), at the cost of a longer training run.
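The calibration step can be illustrated with the standard equal-width-bin definition of ECE: the weighted average, over confidence bins, of the gap between accuracy and mean confidence in each bin. The sketch below is an assumption about how such a metric is typically computed, not a copy of the evaluator's code. The function names, the 10-bin default, and the `[lower, upper)` bin edges are all illustrative choices.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Returns one dict per non-empty bin with its count, accuracy and
    mean confidence (the points of a reliability diagram).
    """
    sums = [{"count": 0, "conf": 0.0, "hits": 0} for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        # Bins are [lower, upper); a confidence of 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        sums[idx]["count"] += 1
        sums[idx]["conf"] += c
        sums[idx]["hits"] += int(ok)
    return [
        {
            "lower": i / n_bins,
            "upper": (i + 1) / n_bins,
            "count": s["count"],
            "accuracy": s["hits"] / s["count"],
            "avg_confidence": s["conf"] / s["count"],
        }
        for i, s in enumerate(sums) if s["count"] > 0
    ]


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin share) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = reliability_bins(confidences, correct, n_bins)
    return sum(
        b["count"] / n * abs(b["accuracy"] - b["avg_confidence"]) for b in bins
    )


# Toy example: two predictions at 0.95 (one wrong), two at 0.65 (both right).
confs = [0.95, 0.95, 0.65, 0.65]
hits = [True, False, True, True]
ece = expected_calibration_error(confs, hits)  # 0.5*0.45 + 0.5*0.35
```

In the toy example the 0.95 bin is overconfident (accuracy 0.5) and the 0.65 bin is underconfident (accuracy 1.0). Both gaps count toward ECE because the metric uses the absolute difference.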
This dashboard demonstrates:
- Full-stack development (React + FastAPI)
- ML/AI debugging and observability
- Production deployment (Vercel + Railway)
- Clean code architecture and documentation
Tech Stack:
- Frontend: React, TypeScript, Vite, TailwindCSS, Recharts
- Backend: FastAPI, Pydantic, PyTorch, scikit-learn
- Deployment: Vercel (frontend with mock data)
MIT



