[Benchmark] Add support for SciDocBench benchmark #1511

Merged
mzr1996 merged 2 commits into open-compass:main from yuhangzang:main
Apr 10, 2026

@yuhangzang
Collaborator

  • Add SciDocBench dataset class with multi-page image handling
  • Support three evaluation methods: json_match, judge, exec_match
  • Add reasoning verification as secondary scoring pass
  • Register default judge model (gpt-4o-mini) in run.py
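A minimal sketch of how the three evaluation methods and the secondary reasoning pass might fit together. This is not the code merged in this PR: `score_sample`, the sample keys (`eval_method`, `prediction`, `answer`), and the injected `judge_fn`, `exec_fn`, and `reasoning_fn` callables are all hypothetical names chosen for illustration. Only `json_match` is implemented here, as a comparison of parsed JSON values; the judge and execution paths are left to the caller.

```python
import json
from typing import Callable, Dict, Optional


def json_match(prediction: str, answer: str) -> bool:
    """Compare two JSON strings by parsed value, not by raw text."""
    try:
        return json.loads(prediction) == json.loads(answer)
    except (TypeError, ValueError):
        # Unparseable output counts as a mismatch.
        return False


def score_sample(
    sample: Dict[str, str],
    judge_fn: Optional[Callable[[str, str], bool]] = None,
    exec_fn: Optional[Callable[[str, str], bool]] = None,
    reasoning_fn: Optional[Callable[[Dict[str, str]], float]] = None,
) -> Dict[str, float]:
    """Score one sample with the method it names (hypothetical layout)."""
    method = sample['eval_method']
    pred, ans = sample['prediction'], sample['answer']

    if method == 'json_match':
        answer_score = float(json_match(pred, ans))
    elif method == 'judge':
        # Assumed: judge_fn wraps a call to the judge model.
        answer_score = float(judge_fn(pred, ans))
    elif method == 'exec_match':
        # Assumed: exec_fn runs the prediction and compares results.
        answer_score = float(exec_fn(pred, ans))
    else:
        raise ValueError(f'Unknown eval_method: {method!r}')

    result = {'answer_score': answer_score}
    # Secondary pass: the reasoning score is kept in its own field
    # and does not modify the answer score.
    if reasoning_fn is not None:
        result['reasoning_score'] = float(reasoning_fn(sample))
    return result
```

Passing the judge and execution backends in as callables keeps the dispatcher testable without a model endpoint; whether the actual dataset class is structured this way is an assumption.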

…ith checkpoint resume

- Fix import ordering to pass isort pre-commit hook
- Use track_progress_rich for parallel judge/reasoning calls
- Add pkl checkpoint for resumable evaluation on interruption
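The checkpoint-and-resume idea can be sketched with the standard library alone. This is an illustrative sketch, not the PR's implementation: `load_checkpoint`, `evaluate_with_resume`, and the `save_every` parameter are hypothetical names, and the loop below is sequential, whereas the commit uses `track_progress_rich` for parallel calls.

```python
import os
import pickle
from typing import Any, Callable, Dict


def load_checkpoint(path: str) -> Dict[str, Any]:
    """Return previously saved results, or an empty dict if none exist."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    return {}


def _dump(results: Dict[str, Any], path: str) -> None:
    """Write to a temp file, then rename, so an interruption
    cannot leave a half-written checkpoint behind."""
    tmp = path + '.tmp'
    with open(tmp, 'wb') as f:
        pickle.dump(results, f)
    os.replace(tmp, path)


def evaluate_with_resume(
    items: Dict[str, Any],
    score_fn: Callable[[Any], Any],
    ckpt_path: str,
    save_every: int = 1,
) -> Dict[str, Any]:
    """Score only the items missing from the checkpoint."""
    results = load_checkpoint(ckpt_path)
    pending = [(k, v) for k, v in items.items() if k not in results]
    for n, (key, item) in enumerate(pending, start=1):
        results[key] = score_fn(item)
        if n % save_every == 0:
            _dump(results, ckpt_path)
    _dump(results, ckpt_path)  # final save covers any remainder
    return results
```

On a rerun after an interruption, keys already present in the pickle file are skipped, so only unfinished judge or reasoning calls are repeated.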
Collaborator

@mzr1996 mzr1996 left a comment

LGTM

@mzr1996 mzr1996 merged commit 6cd39bb into open-compass:main Apr 10, 2026
7 checks passed
mzr1996 pushed a commit that referenced this pull request Apr 21, 2026
* [Benchmark] Add support for SciDocBench benchmark (#1511)

- Add SciDocBench dataset with VQA-style evaluation
- Skip DATASET_MD5 hash check to accommodate TSV updates
- Decouple reasoning evaluation from answer score

* [Fix] SciDocBench: wrap long line in judge prompt to satisfy flake8 E501