Bug in the best run selection in calculate_metrics.py #9

Open
@piojanu

Description

Hi!

In the paper you write:

To improve evaluation stability, we repeat each task with three independent runs in all experiments. Then we select the best run according to the metrics in the following order: maximum SR, maximum VER, maximum CBS, and minimum Cost. We refer to the next metric in this order to break ties. For example, if two programs generated for a task both have SR = 0, we pick the one with higher VER.

However, this is not how the code works at the moment. The bug is in the tie-breaking logic of calculate_metrics.py: when breaking a tie, the code searches through all trajectories instead of only the filtered candidates that passed the previous criteria. As a result, it can select, for example, the trajectory with the minimum cost even though it does not have the best success_rate, valid_program, and codebert_score values from the earlier filtering steps. The same issue exists in all of the tie-breaking sections.

I propose a fix in #8.
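
For reference, the selection rule from the paper can be expressed as a single lexicographic key, which sidesteps the re-filtering problem entirely. This is only a minimal sketch: the field names are taken from the issue text above and may not match the actual keys used in calculate_metrics.py, and #8 may implement the fix differently.

```python
def select_best_run(trajectories):
    """Pick the best of the (typically three) runs for one task,
    following the order from the paper: max SR, max VER, max CBS, min Cost.

    Field names here are illustrative and assumed from the issue text.
    """
    return max(
        trajectories,
        key=lambda t: (
            t["success_rate"],    # maximum SR first
            t["valid_program"],   # then maximum VER
            t["codebert_score"],  # then maximum CBS
            -t["cost"],           # finally minimum Cost (negated for max())
        ),
    )
```

Because each later metric is only compared when all earlier ones are equal, every tie-break automatically stays within the candidates that won the previous criteria.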
