This experiment suite runs multiple LLMs against MCP (Model Context Protocol) servers and evaluates their performance across a battery of queries.
- 🤖 Multi-Model Testing: Tests multiple AI models simultaneously
- 🔧 MCP Server Integration: Runs queries with configured MCP servers attached
- 📊 Streaming Output: Real-time colored console output with timestamps
- 💾 JSON Logging: Comprehensive results saved to timestamped JSON files
- 🎨 Colored Console: Beautiful colored output showing experiment progress
- Install dependencies:

      pip install -r requirements.txt
- Set your API key by copying the example environment file and filling in your secret values:

      cp env.example.py env.py
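What goes in `env.py` depends on your model providers. As a minimal sketch, assuming OpenRouter-style access (the variable name below is an assumption based on the model IDs in the sample output, not taken from the repo; check `env.example.py` for the names this project actually expects):

```python
# env.py -- hypothetical contents; the variable name is an assumption
# based on the OpenRouter-style model IDs in the sample output
# (e.g. "anthropic/claude-sonnet-4"). Use env.example.py as the
# authoritative template.
OPENROUTER_API_KEY = "sk-or-..."  # your secret key; never commit this file
```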
Run the complete experiment suite:

    python main.py
- Complete workflow:

      make run-battery BATTERY=test_pretrain_knowledge_with_no_mcp_servers
      make pipeline  # automatically processes the latest results → CSV
- Step-by-step:

      make run-battery BATTERY=test_pretrain_knowledge_with_no_mcp_servers
      make evaluate-latest
      make convert-latest-to-csv
- Manual file specification (a Python sketch of equivalent processing follows this list):

      make evaluate FILE=output/20250805_123456.json
      make convert-to-csv FILE=output/evaluations/evaluated_20250805_123456.json OUTPUT=my_results.csv
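The `*-latest` targets and `make pipeline` operate on the newest file in `output/`. As a rough illustration of what that involves, here is a minimal Python sketch that picks the most recent results file and flattens it to CSV. It assumes the JSON schema shown in the next section; the flattening logic is illustrative, not the project's actual converter:

```python
import csv
import json
from pathlib import Path

# Pick the most recently modified results file in output/
# (this mirrors what the *-latest targets presumably do).
latest = max(Path("output").glob("*.json"), key=lambda p: p.stat().st_mtime)
data = json.loads(latest.read_text())

# Flatten each per-query result into one CSV row, keeping a
# subset of the fields shown in the schema below.
fields = ["query", "model", "experiment", "success", "duration", "error"]
with open("my_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for result in data["results"]:
        writer.writerow(result)
```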
Results are saved to `output/[timestamp].json` with the following structure:
    {
      "timestamp": "20240115_143025",
      "experiment_start": "2024-01-15T14:30:25.123456",
      "total_queries": 12,
      "results": [
        {
          "query": "What is the capital of France?",
          "model": "anthropic/claude-sonnet-4",
          "experiment": "browsing",
          "timestamp": "2024-01-15T14:30:25.123456",
          "mcp_servers": ["context7"],
          "success": true,
          "response": "The capital of France is Paris...",
          "error": null,
          "duration": 2.34
        }
      ]
    }
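To sanity-check a run without going through the Makefile, you can aggregate a few of these fields directly. A minimal sketch using only the fields shown above (the file path is the example filename from this section):

```python
import json

# Load a results file (using the example filename from above).
with open("output/20240115_143025.json") as f:
    data = json.load(f)

results = data["results"]
succeeded = [r for r in results if r["success"]]
print(f"{len(succeeded)}/{data['total_queries']} queries succeeded")

# Mean wall-clock duration of the successful queries.
if succeeded:
    avg = sum(r["duration"] for r in succeeded) / len(succeeded)
    print(f"mean duration: {avg:.2f}s")
```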