SOP-Bench is a comprehensive benchmark for evaluating LLM-based agents on complex, multi-step Standard Operating Procedures (SOPs), which are fundamental to industrial automation. Built from 2,000+ tasks spanning 12 industrial domains, including healthcare, logistics, finance, and content moderation, SOP-Bench addresses the gap between existing benchmarks and real-world procedural complexity.
🏭 Human Expert-Authored SOPs: Authentic procedures crafted by domain experts reflecting real-world complexity
🤖 Human-AI Collaborative Framework: AI-generated artifacts (tools, APIs, datasets) with human validation
📊 Executable Interfaces: Ground-truth outputs enabling reproducible evaluation
🔧 Two Agent Architectures: Function-Calling (FC) and ReAct agents for systematic comparison
📈 11 Frontier Models Evaluated: Comprehensive benchmarking across Claude, GPT, Llama, and DeepSeek families
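To illustrate the ReAct architecture mentioned above, here is a minimal sketch of the Thought → Action → Observation loop such an agent runs. All names here (`run_react`, `lookup_order`, the scripted policy) are illustrative stand-ins, not SOP-Bench's actual API; in the benchmark, the policy would be an LLM call and the tools would be the benchmark's executable interfaces.

```python
def lookup_order(order_id: str) -> str:
    """Stub tool: stands in for a domain tool, e.g. a logistics lookup."""
    return f"order {order_id}: status=shipped"

TOOLS = {"lookup_order": lookup_order}

def run_react(policy, task: str, max_steps: int = 5) -> str:
    """ReAct loop: alternate Action -> Observation until the policy
    emits a final answer or the step budget runs out."""
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        step = policy(transcript)  # stand-in for an LLM call
        transcript.append(step)
        if step.startswith("Final:"):
            return step[len("Final:"):].strip()
        if step.startswith("Action:"):
            # Parse "Action: <tool_name> <argument>" and execute the tool
            name, _, arg = step[len("Action:"):].strip().partition(" ")
            transcript.append(f"Observation: {TOOLS[name](arg)}")
    return "max steps exceeded"

def scripted_policy(transcript):
    """Scripted policy for demonstration only: act once, then answer."""
    if not any(line.startswith("Observation:") for line in transcript):
        return "Action: lookup_order 42"
    return "Final: order 42 has shipped"

print(run_react(scripted_policy, "Where is order 42?"))
# -> order 42 has shipped
```

A Function-Calling (FC) agent differs mainly in that tool invocations arrive as structured calls from the model API rather than being parsed out of free text, which is one axis of the FC-vs-ReAct comparison.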
- [2026-02] 🎉 SOP-Bench submitted to KDD 2026 Datasets and Benchmarks Track.
See CONTRIBUTING for more information.
This library is licensed under the LICENSE NAME HERE License.