2024-05-06: Initial release: Version 0.7 of the WebMall benchmark released.

@Aaron9812 released this 06 Jun 09:31

WebMall: A Multi-Shop Benchmark for Evaluating Web Agents

This release introduces WebMall, a multi-shop benchmark for evaluating how well web agents handle e-commerce tasks. The benchmark features:

• Two task sets: basic (search, compare, cart, checkout) and advanced (vague requirements, product compatibility, substitute finding)
• Local Docker setup for easy deployment of test environments
• Integration with BrowserGym and AgentLab for agent evaluation
• Support for multiple e-shop platforms

Visit our website (https://wbsg-uni-mannheim.github.io/WebMall/) for detailed documentation, task specifications, and initial results.

Requirements:

• Python 3.11 or 3.12
• Docker and docker-compose
• OpenAI or Anthropic API keys (only if you use their models)
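
The requirements above can be checked before setting anything up. The following is a minimal sketch, not part of WebMall itself: the function names are hypothetical, the check only looks for the standalone docker-compose binary on PATH (not the "docker compose" plugin), and OPENAI_API_KEY and ANTHROPIC_API_KEY are assumed to be the variable names in use because they are the ones the official SDKs read by default.

```python
import os
import shutil
import sys

# Python versions named in the release notes.
SUPPORTED_PYTHON = {(3, 11), (3, 12)}
# Assumption: conventional variable names read by the official SDKs.
API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def check_prerequisites(version=None, which=shutil.which):
    """Return a list of problems; an empty list means all checks passed."""
    version = tuple(version or sys.version_info[:2])
    problems = []

    if version not in SUPPORTED_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} is not 3.11 or 3.12")

    # Only checks that the executables are on PATH, not that the daemon runs.
    for tool in ("docker", "docker-compose"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")

    return problems


def available_api_keys(env=None):
    """Return the names of the API key variables that are set and non-empty."""
    env = os.environ if env is None else env
    return [name for name in API_KEY_VARS if env.get(name)]


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("missing:", problem)
    print("API keys set:", available_api_keys() or "none")
```

API keys are reported separately rather than as problems, since they are only needed when using OpenAI or Anthropic models.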

Full Changelog: https://github.com/wbsg-uni-mannheim/WebMall/commits/v0.7