WebMall: A Multi-Shop Benchmark for Evaluating Web Agents
This release introduces WebMall, a multi-shop benchmark for evaluating how well web agents handle e-commerce scenarios. The benchmark features:
• Two task sets: basic (search, compare, cart, checkout) and advanced (vague requirements, product compatibility, substitute finding)
• Local Docker setup for easy deployment of test environments
• Integration with BrowserGym and AgentLab for agent evaluation
• Support for multiple e-shop platforms
Visit our website (https://wbsg-uni-mannheim.github.io/WebMall/) for detailed documentation, task specifications, and initial results.
Requirements:
- Python 3.11 or 3.12
- Docker and docker-compose
- OpenAI or Anthropic API keys (only if you use their models)
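
The requirements above can be checked before setting up the environment. The following is a minimal sketch, not part of WebMall itself: the helper `check_requirements` is hypothetical, and the environment variable names `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` are assumed from the conventions of the respective SDKs rather than taken from the WebMall documentation.

```python
import os
import shutil
import sys

# Python versions listed in the requirements above.
SUPPORTED_PYTHON = {(3, 11), (3, 12)}
# Assumed variable names (SDK conventions); WebMall may expect others.
API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def check_requirements(version=None, which=shutil.which, environ=None):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    version = sys.version_info[:2] if version is None else tuple(version[:2])
    environ = os.environ if environ is None else environ
    problems = []

    if version not in SUPPORTED_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found; 3.11 or 3.12 is required"
        )

    if which("docker") is None:
        problems.append("docker not found on PATH")

    # Only the standalone binary is detected here; the `docker compose`
    # plugin form is not visible through a PATH lookup.
    if which("docker-compose") is None:
        problems.append("docker-compose not found on PATH")

    # API keys are optional: they matter only when using hosted models.
    if not any(environ.get(name) for name in API_KEY_VARS):
        problems.append("no OpenAI or Anthropic API key set (needed only for their models)")

    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("WARNING:", problem)
```

Running the script prints one warning per unmet requirement and nothing when everything is in place. The API key warning can be ignored when evaluating agents backed by other models.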
Full Changelog: https://github.com/wbsg-uni-mannheim/WebMall/commits/v0.7