This repository contains benchmark results for various OpenHands agents and LLM configurations.
Results are organized in the results/ directory with the following structure:
```
results/
├── YYYYMMDD_model_name/
│   ├── metadata.json
│   └── scores.json
```
Each agent directory follows the format: YYYYMMDD_model_name/
- `YYYYMMDD`: Submission date (e.g., `20251124`)
- `model_name`: Underscored version of the LLM model name (e.g., `gpt_4o_2024_11_20`)
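For scripting, the directory name can be derived from the submission date and the model identifier. Below is a minimal sketch, assuming the model slug is formed by replacing runs of non-alphanumeric characters with underscores; `result_dir_name` is a hypothetical helper, not part of this repository:

```python
import re
from datetime import date

def result_dir_name(model: str, submitted: date) -> str:
    """Build a directory name such as 20251124_gpt_4o_2024_11_20.

    Hypothetical helper: assumes the model slug is produced by replacing
    every run of non-alphanumeric characters with a single underscore.
    """
    slug = re.sub(r"[^A-Za-z0-9]+", "_", model).strip("_").lower()
    return f"{submitted:%Y%m%d}_{slug}"

print(result_dir_name("gpt-4o-2024-11-20", date(2025, 11, 24)))
# 20251124_gpt_4o_2024_11_20
```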
`metadata.json` contains the agent metadata and configuration:
```json
{
  "agent_name": "OpenHands CodeAct v2.0",
  "agent_version": "1.0.0",
  "model": "gpt-4o-2024-11-20",
  "openness": "closed_api_available",
  "tool_usage": "standard",
  "submission_time": "2025-11-24T19:56:00.092895"
}
```

Fields:
- `agent_name`: Display name of the agent
- `agent_version`: Semantic version number (e.g., "1.0.0", "1.0.2")
- `model`: LLM model used
- `openness`: Model availability type
  - `closed_api_available`: Commercial API-based models
  - `open_api_available`: Open-source models with API access
  - `open_weights_available`: Open-weights models that can be self-hosted
- `tool_usage`: Agent tooling type
  - `standard`: Standard tool usage
  - `custom_interface`: Custom tool interface
- `submission_time`: ISO 8601 timestamp
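Before committing, a metadata file can be sanity-checked against the field descriptions above. This is a minimal sketch, assuming Python 3.9+; `check_metadata` is a hypothetical helper, not part of this repository:

```python
import json
from pathlib import Path

# Allowed values, taken from the field descriptions above.
REQUIRED = {"agent_name", "agent_version", "model", "openness", "tool_usage", "submission_time"}
OPENNESS = {"closed_api_available", "open_api_available", "open_weights_available"}
TOOL_USAGE = {"standard", "custom_interface"}

def check_metadata(path: Path) -> list[str]:
    """Return a list of problems found in a metadata.json file (empty if it looks OK)."""
    meta = json.loads(path.read_text())
    problems = [f"missing field: {name}" for name in sorted(REQUIRED - meta.keys())]
    if meta.get("openness") not in OPENNESS:
        problems.append(f"unknown openness value: {meta.get('openness')!r}")
    if meta.get("tool_usage") not in TOOL_USAGE:
        problems.append(f"unknown tool_usage value: {meta.get('tool_usage')!r}")
    return problems

# Example:
# check_metadata(Path("results/20251124_gpt_4o_2024_11_20/metadata.json"))
```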
`scores.json` contains the benchmark scores and performance metrics:
```json
[
  {
    "benchmark": "swe-bench",
    "score": 45.1,
    "metric": "resolve_rate",
    "total_cost": 32.55,
    "total_runtime": 3600,
    "tags": ["bug_fixing"]
  },
  ...
]
```

Fields:
- `benchmark`: Benchmark identifier (e.g., "swe-bench", "commit0")
- `score`: Primary metric score (percentage or numeric value)
- `metric`: Type of metric (e.g., "resolve_rate", "success_rate")
- `total_cost`: Total API cost in USD
- `total_runtime`: Total runtime in seconds (optional)
- `tags`: Category tags for grouping (e.g., ["bug_fixing"], ["app_creation"])
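A typical way to consume these files is to walk the results/ directory and group scores by tag. A minimal sketch under that assumption; `scores_by_tag` is a hypothetical helper, not provided by this repository:

```python
import json
from collections import defaultdict
from pathlib import Path

def scores_by_tag(results_dir: Path) -> dict:
    """Collect scores from every results/*/scores.json, grouped by category tag."""
    grouped = defaultdict(list)
    for scores_file in sorted(results_dir.glob("*/scores.json")):
        for entry in json.loads(scores_file.read_text()):
            for tag in entry.get("tags", []):
                grouped[tag].append(entry["score"])
    return grouped

# Example: average score per category tag across all submissions.
# for tag, values in scores_by_tag(Path("results")).items():
#     print(tag, round(sum(values) / len(values), 1))
```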
The 1.0.0-dev1/ directory contains the original benchmark-centric JSONL files:
- `swe-bench.jsonl`
- `swe-bench-multimodal.jsonl`
- `commit0.jsonl`
- `multi-swe-bench.jsonl`
- `swt-bench.jsonl`
- `gaia.jsonl`
This format is maintained for backward compatibility.
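For readers who still consume the legacy files, here is a minimal loading sketch, assuming one JSON record per line; `read_jsonl` is a hypothetical helper and the record fields are not documented here:

```python
import json
from pathlib import Path

def read_jsonl(path: Path) -> list:
    """Read a legacy JSONL file: one JSON record per non-empty line."""
    return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]

# Example (file names as listed above):
# rows = read_jsonl(Path("1.0.0-dev1/swe-bench.jsonl"))
```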
- SWE-Bench: Resolving GitHub issues from real Python repositories
- SWE-Bench-Multimodal: Similar to SWE-Bench with multimodal inputs
- Commit0: Building applications from scratch based on specifications
- Multi-SWE-Bench: Full-stack web development tasks
- SWT-Bench: Generating comprehensive test suites
- GAIA: General AI assistant tasks requiring web search and reasoning
Results are grouped into 5 main categories on the leaderboard:
- Bug Fixing: SWE-Bench, SWE-Bench-Multimodal
- App Creation: Commit0
- Frontend Development: Multi-SWE-Bench
- Test Generation: SWT-Bench
- Information Gathering: GAIA
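To reproduce this grouping programmatically, a sketch like the one below can map score tags to leaderboard categories. Only "bug_fixing" and "app_creation" are documented tag names above; the remaining keys in the mapping are assumptions:

```python
# Assumed mapping from score tags to leaderboard categories. Only
# "bug_fixing" and "app_creation" appear verbatim in the schema above;
# the other tag names are illustrative placeholders.
TAG_TO_CATEGORY = {
    "bug_fixing": "Bug Fixing",
    "app_creation": "App Creation",
    "frontend": "Frontend Development",
    "test_generation": "Test Generation",
    "information_gathering": "Information Gathering",
}

def category_for(tags):
    """Return the first leaderboard category matching an entry's tags, if any."""
    return next((TAG_TO_CATEGORY[t] for t in tags if t in TAG_TO_CATEGORY), None)

print(category_for(["bug_fixing"]))  # Bug Fixing
```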
To add new benchmark results:
- Create a directory following the naming convention: `results/YYYYMMDD_model_name/`
- Add `metadata.json` with the agent configuration
- Add `scores.json` with the benchmark results
- Commit and push to the repository
Example:
```bash
# Create directory
mkdir -p results/20251124_gpt_4o_2024_11_20/

# Add metadata
cat > results/20251124_gpt_4o_2024_11_20/metadata.json << 'EOF'
{
  "agent_name": "OpenHands CodeAct v2.0",
  "agent_version": "1.0.0",
  "model": "gpt-4o-2024-11-20",
  "openness": "closed_api_available",
  "tool_usage": "standard",
  "submission_time": "2025-11-24T19:56:00.092895"
}
EOF

# Add scores
cat > results/20251124_gpt_4o_2024_11_20/scores.json << 'EOF'
[
  {
    "benchmark": "swe-bench",
    "score": 45.1,
    "metric": "resolve_rate",
    "total_cost": 32.55,
    "total_runtime": 3600,
    "tags": ["bug_fixing"]
  }
]
EOF

# Commit and push
git add results/20251124_gpt_4o_2024_11_20/
git commit -m "Add results for OpenHands CodeAct v2.0 with GPT-4o"
git push origin main
```

View the live leaderboard at: https://huggingface.co/spaces/OpenHands/openhands-index
MIT License - See repository for details.