feat: Checkpointing evaluation #850

### Feature description

Currently, evaluation allows specifying a subset of the dataset by defining a range (e.g. `data[:100]`). However, that range is always processed in full, without interruption. We'd like to introduce a checkpoint-based evaluation flow, in which the process periodically inspects intermediate results and decides whether to continue.

For example (as an initial idea that is not yet fully thought through), after each fixed number of batches the system could compute an aggregated metric and compare it against developer-defined criteria. If those criteria are not met, the evaluation would stop early (e.g. after 10 or 50 examples) instead of spending time on the remaining 1000. Conversely, if the checkpoint condition is satisfied, the evaluation proceeds to the next block.
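To make the idea concrete, here is a minimal Python sketch of such a loop. It does not use this project's API: `evaluate_with_checkpoints`, `score_fn`, `should_continue`, `checkpoint_every`, and `CheckpointResult` are hypothetical names made up for illustration. It also assumes the aggregated metric is a running mean of per-example scores, and it checkpoints on example counts rather than batches.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class CheckpointResult:
    scores: List[float]   # per-example scores collected before stopping
    stopped_early: bool   # True if a checkpoint criterion was not met
    examples_seen: int    # number of examples actually evaluated


def evaluate_with_checkpoints(
    examples: Sequence,
    score_fn: Callable[[object], float],
    should_continue: Callable[[float, int], bool],
    checkpoint_every: int = 10,
) -> CheckpointResult:
    """Hypothetical helper: evaluate examples, checking a criterion periodically.

    `should_continue(mean_score, examples_seen)` is the developer-defined
    criterion; returning False aborts the remaining evaluation.
    """
    if checkpoint_every <= 0:
        raise ValueError("checkpoint_every must be positive")

    scores: List[float] = []
    total = len(examples)
    for seen, example in enumerate(examples, start=1):
        scores.append(score_fn(example))
        # Only checkpoint at block boundaries, and only if work remains.
        if seen % checkpoint_every == 0 and seen < total:
            mean_score = sum(scores) / len(scores)
            if not should_continue(mean_score, seen):
                return CheckpointResult(scores, True, seen)
    return CheckpointResult(scores, False, len(scores))
```

A call might look like `evaluate_with_checkpoints(data[:1000], score_fn, lambda mean, n: mean >= 0.5, checkpoint_every=50)`, which would stop after the first 50 examples if their mean score is below 0.5. Passing the criterion as a callable leaves the choice of metric and threshold to the developer, as the proposal suggests.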

### Motivation

A checkpoint-based evaluation system would significantly reduce wasted computation time by allowing early termination when results are clearly unsatisfactory, while still enabling full evaluation when performance meets expectations.

### Additional context

_No response_
