Thank you for this amazing project! I'm training a voice model and have run into an issue where several loss values suddenly become NaN partway through training.
During training, I observe the following loss pattern:
Normal training state:
INFO:mi-test:loss_disc=3.322, loss_gen=3.670, loss_fm=10.674, loss_mel=24.860, loss_kl=9.000
After some steps, losses suddenly become NaN:
INFO:mi-test:loss_disc=nan, loss_gen=nan, loss_fm=nan, loss_mel=23.440, loss_kl=9.000
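In case it helps narrow things down, here is roughly the kind of guard I was thinking of adding to the training step to see which loss term goes non-finite first and to skip that update instead of letting NaN reach the weights. The loss names match the log above, but `optim`, `scaler`, and the surrounding loop are placeholders, not the repo's actual trainer code:

```python
import torch

def guarded_step(losses: dict, loss_total: torch.Tensor, optim, scaler=None):
    """Skip the update if any loss term is non-finite, and report which one.

    `losses` maps names like 'loss_disc' or 'loss_gen' to scalar tensors;
    `optim` and the optional GradScaler come from the regular training loop.
    """
    bad = [name for name, val in losses.items() if not torch.isfinite(val).all()]
    if bad:
        # A non-finite loss would poison the weights; drop this batch instead.
        print(f"skipping step, non-finite losses: {bad}")
        optim.zero_grad(set_to_none=True)
        return False

    params = [p for group in optim.param_groups for p in group["params"]]
    optim.zero_grad(set_to_none=True)
    if scaler is not None:                      # fp16 / AMP path
        scaler.scale(loss_total).backward()
        scaler.unscale_(optim)
        torch.nn.utils.clip_grad_norm_(params, 5.0)
        scaler.step(optim)
        scaler.update()
    else:                                       # full-precision path
        loss_total.backward()
        torch.nn.utils.clip_grad_norm_(params, 5.0)
        optim.step()
    return True
```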
Is this a training collapse? Should I roll back to the previous checkpoint?
Could this be caused by training data quality issues?
What's the recommended solution? Should I:
- Lower the learning rate?
- Reduce batch size?
- Disable FP16 training?
- Clean the training dataset? (a rough sanity scan is sketched below)
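For the dataset-cleaning option, this is the kind of rough sanity scan I had in mind before re-training: flag files with non-finite samples, clipping, or near silence. The directory path and thresholds are placeholders, and it assumes `soundfile` and `numpy` are installed:

```python
from pathlib import Path

import numpy as np
import soundfile as sf

def scan_dataset(root: str, silence_db: float = -60.0) -> list[tuple[str, str]]:
    """Return (path, reason) pairs for audio files that may destabilize training."""
    suspect = []
    for wav in sorted(Path(root).rglob("*.wav")):
        audio, _sr = sf.read(wav, dtype="float32", always_2d=False)
        if audio.size == 0:
            suspect.append((str(wav), "empty file"))
            continue
        if not np.isfinite(audio).all():
            suspect.append((str(wav), "contains NaN/Inf samples"))
            continue
        peak = float(np.max(np.abs(audio)))
        if peak >= 1.0:
            suspect.append((str(wav), f"clipped (peak={peak:.3f})"))
        elif 20 * np.log10(peak + 1e-12) < silence_db:
            suspect.append((str(wav), "near-silent clip"))
    return suspect

if __name__ == "__main__":
    for path, reason in scan_dataset("dataset/speaker0"):  # placeholder path
        print(f"{reason}: {path}")
```

I'm not sure whether the data or the fp16/learning-rate settings are the more likely culprit here, so any guidance on which to try first would be appreciated.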