Description
Hi!
In the paper you write:
To improve evaluation stability, we repeat each task with three independent runs in all experiments. Then we select the best run according to the metrics in the following order: maximum SR, maximum VER, maximum CBS, and minimum Cost. We refer to the next metric in this order to break ties. For example, if two programs generated for a task both have SR = 0, we pick the one with higher VER.
However, this is not how the code works right now. The bug is in the tie-breaking mechanism of `calculate_metrics.py`. When breaking ties, the code incorrectly searches through all trajectories instead of searching only within the filtered candidates that passed all previous criteria. This means it could select a trajectory with the minimum cost that doesn't actually have the best `success_rate`, `valid_program`, and `codebert_score` values from the earlier filtering steps. The same issue exists in all tie-breaking sections.
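For clarity, here is a minimal sketch of the selection logic as described in the paper, where each tie-break narrows the candidate set instead of searching the full trajectory list again. The field names (`success_rate`, `valid_program`, `codebert_score`, `cost`) are illustrative and may not match the exact names used in `calculate_metrics.py`:

```python
def select_best_trajectory(trajectories):
    """Pick the best run: max SR, then max VER, then max CBS, then min Cost.

    Each step filters `candidates` further, so later tie-breaks only
    consider runs that already won all previous criteria.
    """
    candidates = list(trajectories)

    # 1. Maximize success rate (SR).
    best_sr = max(t["success_rate"] for t in candidates)
    candidates = [t for t in candidates if t["success_rate"] == best_sr]

    # 2. Among those, maximize valid-program rate (VER).
    best_ver = max(t["valid_program"] for t in candidates)
    candidates = [t for t in candidates if t["valid_program"] == best_ver]

    # 3. Among those, maximize CodeBERTScore (CBS).
    best_cbs = max(t["codebert_score"] for t in candidates)
    candidates = [t for t in candidates if t["codebert_score"] == best_cbs]

    # 4. Finally, minimize cost among the remaining candidates.
    return min(candidates, key=lambda t: t["cost"])
```

Equivalently, the whole ordering can be expressed as a single lexicographic key, e.g. `max(trajectories, key=lambda t: (t["success_rate"], t["valid_program"], t["codebert_score"], -t["cost"]))`.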
I propose a fix in #8.