Commit 6cd39bb
[Benchmark] Add support for SciDocBench benchmark (#1511)
* [Benchmark] Add support for SciDocBench benchmark
- Add SciDocBench dataset class with multi-page image handling
- Support three evaluation methods: json_match, judge, exec_match
- Add reasoning verification as secondary scoring pass
- Register default judge model (gpt-4o-mini) in run.py
* [Fix] SciDocBench: fix isort lint and add parallel judge evaluation with checkpoint resume
- Fix import ordering to pass isort pre-commit hook
- Use track_progress_rich for parallel judge/reasoning calls
- Add pkl checkpoint for resumable evaluation on interruption
1 parent a9343a1 commit 6cd39bb
3 files changed
Lines changed: 486 additions & 1 deletion
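The "parallel judge evaluation with checkpoint resume" item in the commit message can be illustrated with a minimal sketch. This is not the code from this commit: `evaluate_with_resume`, `judge_fn`, the dict-shaped `items`, and the checkpoint path are all hypothetical names, and the standard library's `ThreadPoolExecutor` stands in for `track_progress_rich`, whose exact call signature is not shown on this page.

```python
import os
import pickle
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_checkpoint(path):
    """Return previously saved results, or an empty dict if none exist."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {}


def save_checkpoint(path, results):
    """Write to a temp file, then rename, so an interruption cannot corrupt the pkl."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(results, f)
    os.replace(tmp, path)


def evaluate_with_resume(items, judge_fn, ckpt_path, nproc=4):
    """Score items in parallel, skipping keys already present in the checkpoint.

    items: dict mapping a sample key to the judge input (hypothetical shape).
    judge_fn: callable standing in for a judge-model request.
    """
    results = load_checkpoint(ckpt_path)
    pending = {k: v for k, v in items.items() if k not in results}
    with ThreadPoolExecutor(max_workers=nproc) as pool:
        futures = {pool.submit(judge_fn, v): k for k, v in pending.items()}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
            # Persist after every finished call; only the main thread writes.
            save_checkpoint(ckpt_path, results)
    return results
```

Calling `evaluate_with_resume` a second time with the same checkpoint path only submits the samples that have no stored result, which is the behaviour the "resumable evaluation on interruption" item describes.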