SOP-Bench is a comprehensive benchmark for evaluating LLM-based agents on complex, multi-step Standard Operating Procedures (SOPs), which are fundamental to industrial automation. Built from 2,000+ tasks spanning 12 industrial domains, including healthcare, logistics, finance, and content moderation, SOP-Bench addresses the gap between existing benchmarks and real-world procedural complexity.
🏭 Human Expert-Authored SOPs: Authentic procedures crafted by domain experts reflecting real-world complexity
🤖 Human-AI Collaborative Framework: AI-generated artifacts (tools, APIs, datasets) with human validation
📊 Executable Interfaces: Ground-truth outputs enabling reproducible evaluation
🔧 Two Agent Architectures: Function-Calling (FC) and ReAct agents for systematic comparison
📈 11 Frontier Models Evaluated: Comprehensive benchmarking across Claude, GPT, Llama, and DeepSeek families
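To illustrate the ReAct architecture mentioned above, here is a minimal sketch of the Thought → Action → Observation loop such an agent runs. All names here (`run_react`, `lookup_order`, the scripted policy) are illustrative stand-ins, not SOP-Bench's actual API; in the benchmark, the policy would be an LLM call and the tools would be the benchmark's executable interfaces.

```python
def lookup_order(order_id: str) -> str:
    """Stub tool: stands in for a domain tool, e.g. a logistics lookup."""
    return f"order {order_id}: status=shipped"

TOOLS = {"lookup_order": lookup_order}

def run_react(policy, task: str, max_steps: int = 5) -> str:
    """ReAct loop: alternate Action -> Observation until the policy
    emits a final answer or the step budget runs out."""
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        step = policy(transcript)  # stand-in for an LLM call
        transcript.append(step)
        if step.startswith("Final:"):
            return step[len("Final:"):].strip()
        if step.startswith("Action:"):
            # Parse "Action: <tool_name> <argument>" and execute the tool
            name, _, arg = step[len("Action:"):].strip().partition(" ")
            transcript.append(f"Observation: {TOOLS[name](arg)}")
    return "max steps exceeded"

def scripted_policy(transcript):
    """Scripted policy for demonstration only: act once, then answer."""
    if not any(line.startswith("Observation:") for line in transcript):
        return "Action: lookup_order 42"
    return "Final: order 42 has shipped"

print(run_react(scripted_policy, "Where is order 42?"))
# -> order 42 has shipped
```

A Function-Calling (FC) agent differs mainly in that tool invocations arrive as structured calls from the model API rather than being parsed out of free text, which is one axis of the FC-vs-ReAct comparison.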
- [2026-02] 🎉 SOP-Bench submitted to KDD 2026 Datasets and Benchmarks Track.
See CONTRIBUTING for more information.
This library is licensed under the LICENSE NAME HERE License.