
feat: Add Zebra Grid dataset support with ZeroEval alignment #2234

Open
max-yue wants to merge 1 commit into open-compass:main from max-yue:add-zebra-grid-dataset


max-yue commented Aug 11, 2025

📝 Description

Add support for the Zebra Grid logic puzzle dataset from allenai/ZebraLogicBench-private, using the exact ZeroEval prompt template and JSON extraction logic.

🚀 Features

  • ZebraGridDataset class for loading HuggingFace datasets
  • ZebraGridEvaluator with puzzle accuracy and cell accuracy metrics
  • Exact ZeroEval prompt template and JSON extraction logic
  • Complete evaluation configuration with example Qwen3-8B model
  • Achieved 82.4% puzzle accuracy on 1000 samples with Qwen3-8B (2.4 points below the official 84.8%)
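
To illustrate the JSON extraction step listed above, here is a minimal sketch of pulling the last parseable JSON object out of a model reply. It is not the ZeroEval code or the code in this PR: the function name `extract_last_json`, the `<think>...</think>` delimiters, and the `solution` / `House 1` layout in the usage example are assumptions made for illustration.

```python
import json
import re


def extract_last_json(text: str):
    """Return the last parseable JSON object found in `text`, or None.

    Hypothetical helper for illustration; the PR's actual extraction
    logic may differ.
    """
    # Assumption: a thinking model wraps its reasoning in <think>...</think>.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)

    decoder = json.JSONDecoder()
    result = None
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at `pos` and
            # returns it together with the index where parsing stopped.
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            pos = text.find("{", pos + 1)
            continue
        if isinstance(obj, dict):
            result = obj
        pos = text.find("{", end)
    return result


if __name__ == "__main__":
    reply = (
        "<think>try {Alice} in house 1</think>\n"
        'Final answer:\n{"reasoning": "clue 1", '
        '"solution": {"House 1": {"Name": "Alice"}}}'
    )
    print(extract_last_json(reply))
```

Scanning with `raw_decode` avoids a brittle regular expression over nested braces, and returning `None` lets the evaluator count an unparseable reply as unsolved instead of raising.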

🧪 Testing

Tested on 1000 samples with Qwen3-8B thinking model:

  • Puzzle Accuracy: 82.40%
  • Cell Accuracy: 83.74%
  • Puzzle accuracy is 2.4 points below the official 84.8%
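
The two metrics above can be sketched as follows, assuming each solution is a nested mapping of house to attribute to value and that a puzzle counts as solved only when every cell matches. The function name `score_puzzles`, the data layout, and the case-insensitive comparison are assumptions for illustration, not the PR's `ZebraGridEvaluator` implementation; only the metric names `puzzle_accuracy` and `cell_accuracy` come from the PR.

```python
def score_puzzles(predictions, references):
    """Compute puzzle-level and cell-level accuracy in percent.

    predictions: list of {house: {attribute: value}} dicts, or None when
                 no answer could be parsed (assumed layout).
    references:  list of {house: {attribute: value}} dicts.
    """
    solved = 0
    correct_cells = 0
    total_cells = 0

    for pred, ref in zip(predictions, references):
        pred = pred if isinstance(pred, dict) else {}
        puzzle_correct = 0
        puzzle_total = 0
        for house, attrs in ref.items():
            row = pred.get(house)
            if not isinstance(row, dict):
                row = {}
            for attr, truth in attrs.items():
                puzzle_total += 1
                guess = row.get(attr)
                # Assumption: compare case-insensitively, ignoring padding.
                if (isinstance(guess, str)
                        and guess.strip().lower() == truth.strip().lower()):
                    puzzle_correct += 1
        correct_cells += puzzle_correct
        total_cells += puzzle_total
        # A puzzle is solved only if all of its cells are correct.
        if puzzle_total > 0 and puzzle_correct == puzzle_total:
            solved += 1

    n = len(references)
    return {
        "puzzle_accuracy": 100.0 * solved / n if n else 0.0,
        "cell_accuracy": 100.0 * correct_cells / total_cells if total_cells else 0.0,
    }


if __name__ == "__main__":
    ref = {"House 1": {"Name": "Alice", "Pet": "cat"},
           "House 2": {"Name": "Bob", "Pet": "dog"}}
    wrong = {"House 1": {"Name": "Alice", "Pet": "cat"},
             "House 2": {"Name": "Bob", "Pet": "fish"}}
    print(score_puzzles([ref, wrong], [ref, ref]))
```

Under this definition a partially correct grid still earns cell credit, so cell accuracy is never lower than puzzle accuracy when all puzzles have the same grid size.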

📚 Usage

python run.py opencompass/configs/eval_zebra_grid.py --work-dir ./outputs/zebra_grid

🔗 Related

  • Based on allenai/ZebraLogicBench-private dataset
  • Aligned with ZeroEval evaluation framework
  • Compatible with the vLLM backend for efficient inference

- Add ZebraGridDataset class for loading allenai/ZebraLogicBench-private
- Implement ZebraGridEvaluator with puzzle accuracy and cell accuracy metrics
- Use exact ZeroEval prompt template and JSON extraction logic
- Include complete evaluation configuration and model setup
- Achieve 82.4% accuracy on 1000 samples, close to official 84.8%

Features:
- Supports HuggingFace datasets integration
- Compatible with the vLLM backend for efficient inference
- Provides detailed evaluation metrics (puzzle_accuracy, cell_accuracy)
- Includes example Qwen3-8B thinking model configuration

Usage:
python run.py opencompass/configs/eval_zebra_grid.py --work-dir ./outputs/zebra_grid