MCP Overflow Experiment Suite

This experiment suite runs a set of queries against multiple LLMs, with or without MCP (Model Context Protocol) servers attached, to evaluate how each model performs.

Features

🤖 Multi-Model Testing: Tests multiple AI models simultaneously
🔧 MCP Server Integration: Applies MCP servers during query processing
📊 Streaming Output: Real-time colored console output with timestamps
💾 JSON Logging: Comprehensive results saved to timestamped JSON files
🎨 Colored Console: Beautiful colored output showing experiment progress

Setup

Install Dependencies:
```
pip install -r requirements.txt
```
Set API Key:
```
cp env.example.py env.py
```
Then populate env.py with your secret values.

Usage

Run the complete experiment suite:

```
python main.py
```

Complete workflow:

```
make run-battery BATTERY=test_pretrain_knowledge_with_no_mcp_servers
make pipeline  # Automatically processes latest results → CSV
```

Step-by-step:

```
make run-battery BATTERY=test_pretrain_knowledge_with_no_mcp_servers
make evaluate-latest
make convert-latest-to-csv
```

Manual file specification:

```
make evaluate FILE=output/20250805_123456.json
make convert-to-csv FILE=output/evaluations/evaluated_20250805_123456.json OUTPUT=my_results.csv
```
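
The `-latest` targets above need some way to decide which results file is the newest. The sketch below shows one way that lookup could work, assuming results files are named `YYYYMMDD_HHMMSS.json` as in the examples. The function name `latest_results_file` is hypothetical and is not taken from the makefile or scripts in this repository.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed filename pattern, based on examples such as output/20250805_123456.json.
TIMESTAMP_NAME = re.compile(r"^\d{8}_\d{6}\.json$")


def latest_results_file(output_dir: str = "output") -> Optional[Path]:
    """Return the newest timestamped results file in output_dir, or None if there is none."""
    candidates = [
        path
        for path in Path(output_dir).glob("*.json")
        if TIMESTAMP_NAME.match(path.name)  # skip files that are not timestamped results
    ]
    # Fixed-width timestamps sort chronologically when compared as plain strings.
    return max(candidates, key=lambda path: path.name, default=None)


if __name__ == "__main__":
    print(latest_results_file())
```

Sorting by filename rather than by modification time keeps the result stable if files are copied or checked out again later.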

Output Files

Results are saved to output/[timestamp].json with the following structure:

```
{
  "timestamp": "20240115_143025",
  "experiment_start": "2024-01-15T14:30:25.123456",
  "total_queries": 12,
  "results": [
    {
      "query": "What is the capital of France?",
      "model": "anthropic/claude-sonnet-4",
      "experiment": "browsing",
      "timestamp": "2024-01-15T14:30:25.123456",
      "mcp_servers": ["context7"],
      "success": true,
      "response": "The capital of France is Paris...",
      "error": null,
      "duration": 2.34
    }
  ]
}
```
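
As a quick way to inspect a results file with this structure, the sketch below counts runs per model and averages the duration of the successful ones. It relies only on the `results`, `model`, `success`, and `duration` fields shown above. The `summarize` function is a hypothetical helper, not part of `evaluate_results.py`, and the error string in the sample data is invented for illustration.

```python
import json
from collections import defaultdict


def summarize(results_doc: dict) -> dict:
    """Per-model run count, success count, and mean duration of successful runs."""
    stats = defaultdict(lambda: {"total": 0, "succeeded": 0, "durations": []})
    for result in results_doc["results"]:
        entry = stats[result["model"]]
        entry["total"] += 1
        if result["success"]:
            entry["succeeded"] += 1
            entry["durations"].append(result["duration"])
    return {
        model: {
            "total": entry["total"],
            "succeeded": entry["succeeded"],
            # None when a model has no successful runs to average over.
            "mean_duration": (
                sum(entry["durations"]) / len(entry["durations"])
                if entry["durations"]
                else None
            ),
        }
        for model, entry in stats.items()
    }


if __name__ == "__main__":
    # Illustrative data only; real files live under output/[timestamp].json.
    sample = json.loads("""
    {
      "results": [
        {"model": "anthropic/claude-sonnet-4", "success": true, "duration": 2.34},
        {"model": "anthropic/claude-sonnet-4", "success": false, "duration": 0.5,
         "error": "illustrative failure"}
      ]
    }
    """)
    print(json.dumps(summarize(sample), indent=2))
```

To run it against a real file, replace the sample with `json.load(open(path))` for a path such as `output/20250805_123456.json`.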

