I am looking at litgpt/finetune/lora.py and litgpt/finetune/full.py. In the LoRA code, the periodic evaluation (L401-418) runs validate on each device, then combines the per-device results with all_reduce.
But this does not happen for the initial evaluation (L310) or the final evaluation (L260). This looks wrong to me: the reported loss values would just be the ones computed on the rank 0 device.
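To illustrate what I mean, here is a minimal sketch of the difference between reporting the rank 0 loss and reporting the loss averaged over all ranks. It assumes the reduction is done with Fabric's `all_reduce(..., reduce_op="mean")`; `evaluate_and_reduce` and `FakeFabric` are hypothetical names I made up so the snippet runs on a single process, and they are not taken from the litgpt code.

```python
def evaluate_and_reduce(fabric, validate_fn):
    """Run validation on this rank, then average the loss across all ranks."""
    local_loss = validate_fn()
    # Assumption: the fabric object exposes all_reduce(data, reduce_op="mean").
    return fabric.all_reduce(local_loss, reduce_op="mean")


class FakeFabric:
    """Hypothetical stand-in that simulates all_reduce in a single process."""

    def __init__(self, losses_on_other_ranks):
        # Losses that the other ranks would have computed on their data shards.
        self.others = list(losses_on_other_ranks)

    def all_reduce(self, data, reduce_op="mean"):
        values = [data] + self.others
        total = sum(values)
        return total / len(values) if reduce_op == "mean" else total


# Rank 0 sees loss 1.0, the three other ranks see 2.0, 3.0 and 4.0.
fabric = FakeFabric([2.0, 3.0, 4.0])
rank0_loss = 1.0
reduced = evaluate_and_reduce(fabric, lambda: rank0_loss)

print(f"rank 0 only: {rank0_loss}, averaged over ranks: {reduced}")
```

Without the reduction step, the initial and final evaluation would report 1.0 in this toy setup, while the periodic evaluation would report the average over all four ranks.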
Another issue is the val_loss value printed in L395, which never seems to be updated in L401-418.
I'd be happy to submit a PR fixing all this, but first wanted to check whether I am misunderstanding something here.