Initial and final evaluation in finetune scripts do not accumulate over devices #2116

@mseeger

Description

Bug description

I am looking at litgpt/finetune/lora.py and litgpt/finetune/full.py. In the LoRA code, the periodic evaluation (L401-418) runs validate on each device and then accumulates the partial losses with all_reduce.

However, this does not happen for the initial evaluation (L310) or the final evaluation (L260). This seems wrong to me: the reported loss would just be the value computed on the rank-0 device.

Another issue is the val_loss value printed in L395, which never seems to be updated by the code in L401-418.
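To illustrate why the missing accumulation matters, here is a toy computation with hypothetical per-rank loss values (not taken from litgpt): each rank validates on a different shard of the data, so the per-rank losses differ, and reporting only rank 0's value is not the same as the global mean.

```python
# Hypothetical per-device validation losses (illustrative values only).
rank_losses = [2.10, 2.45, 1.98, 2.31]

# Without accumulation (as in the current initial/final evaluation),
# the reported loss is whatever rank 0 happened to compute:
reported_without_reduce = rank_losses[0]

# With an all_reduce using a mean op (as in the periodic evaluation),
# every rank would report the same global average:
reported_with_reduce = sum(rank_losses) / len(rank_losses)

print(round(reported_without_reduce, 2))  # 2.1
print(round(reported_with_reduce, 2))     # 2.21
```

In the real code the reduction would be done with something like torch.distributed.all_reduce (or the Fabric equivalent) on a tensor holding the local loss, but the arithmetic discrepancy is the same.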

I'd be happy to submit a PR fixing all of this, but first I wanted to check whether I am misunderstanding something here.

Reproduced in studio

No response

What operating system are you using?

macOS

LitGPT Version

0.5.9

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working)
