Description
Hi!
In the paper you write:
To improve evaluation stability, we repeat each task with three independent runs in all experiments. Then we select the best run according to the metrics in the following order: maximum SR, maximum VER, maximum CBS, and minimum Cost. We refer to the next metric in this order to break ties. For example, if two programs generated for a task both have SR = 0, we pick the one with higher VER.
However, this is not how the code works right now. The bug is in the tie-breaking mechanism of `calculate_metrics.py`. When breaking ties, the code incorrectly searches through all trajectories instead of searching only within the filtered candidates that passed all previous criteria. This means it could select a trajectory with the minimum cost that doesn't actually have the best `success_rate`, `valid_program`, and `codebert_score` values from the earlier filtering steps. The same issue exists in all tie-breaking sections.
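For clarity, here is a minimal sketch of the selection logic as described in the paper, where each tie-break narrows the candidate set instead of searching the full trajectory list again. The field names (`success_rate`, `valid_program`, `codebert_score`, `cost`) are illustrative and may not match the exact names used in `calculate_metrics.py`:

```python
def select_best_trajectory(trajectories):
    """Pick the best run: max SR, then max VER, then max CBS, then min Cost.

    Each step filters `candidates` further, so later tie-breaks only
    consider runs that already won all previous criteria.
    """
    candidates = list(trajectories)

    # 1. Maximize success rate (SR).
    best_sr = max(t["success_rate"] for t in candidates)
    candidates = [t for t in candidates if t["success_rate"] == best_sr]

    # 2. Among those, maximize valid-program rate (VER).
    best_ver = max(t["valid_program"] for t in candidates)
    candidates = [t for t in candidates if t["valid_program"] == best_ver]

    # 3. Among those, maximize CodeBERTScore (CBS).
    best_cbs = max(t["codebert_score"] for t in candidates)
    candidates = [t for t in candidates if t["codebert_score"] == best_cbs]

    # 4. Finally, minimize cost among the remaining candidates.
    return min(candidates, key=lambda t: t["cost"])
```

Equivalently, the whole ordering can be expressed as a single lexicographic key, e.g. `max(trajectories, key=lambda t: (t["success_rate"], t["valid_program"], t["codebert_score"], -t["cost"]))`.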
I propose a fix in #8.