amazon-science/SOP-Bench

SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents

Overview

SOP-Bench is a benchmark for evaluating LLM-based agents on complex, multi-step Standard Operating Procedures (SOPs) that are fundamental to industrial automation. It comprises more than 2,000 tasks across 12 industrial domains (including healthcare, logistics, finance, and content moderation) and addresses the gap between existing benchmarks and the procedural complexity of real-world operations.

Key Features

🏭 Human Expert-Authored SOPs: Authentic procedures crafted by domain experts reflecting real-world complexity

🤖 Human-AI Collaborative Framework: AI-generated artifacts (tools, APIs, datasets) with human validation

📊 Executable Interfaces: Ground-truth outputs enabling reproducible evaluation

🔧 Two Agent Architectures: Function-Calling (FC) and ReAct agents for systematic comparison

📈 11 Frontier Models Evaluated: Comprehensive benchmarking across Claude, GPT, Llama, and DeepSeek families
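
To illustrate how ground-truth outputs can make evaluation reproducible, here is a minimal sketch of a task success rate computed by comparing an agent's final output with the expected output. It is not the SOP-Bench API: `TaskResult`, `task_success_rate`, the task IDs, and the sample outputs are all hypothetical, and normalized exact match is only one possible scoring rule.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record for one benchmark task (not the SOP-Bench schema)."""
    task_id: str    # illustrative identifier
    expected: str   # ground-truth final output
    predicted: str  # final output produced by the agent


def normalize(text: str) -> str:
    """Trim whitespace and lowercase so trivial formatting differences are ignored."""
    return text.strip().lower()


def task_success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks whose normalized prediction equals the normalized ground truth."""
    if not results:
        return 0.0
    matches = sum(normalize(r.predicted) == normalize(r.expected) for r in results)
    return matches / len(results)


# Made-up sample data, for illustration only.
results = [
    TaskResult("task-001", "approve", "approve"),
    TaskResult("task-002", "reject", "Approve"),
    TaskResult("task-003", "escalate", " escalate "),
]
rate = task_success_rate(results)
print(f"Task success rate: {rate:.2f}")
```

Because the score depends only on stored outputs and a deterministic comparison, rerunning it on the same results yields the same number, which is the property the ground-truth outputs are meant to provide.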

News

  • [2026-02] 🎉 SOP-Bench submitted to KDD 2026 Datasets and Benchmarks Track.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the LICENSE NAME HERE License.
