Summary
When the total loss becomes NaN, an error can be raised to stop the training immediately. Otherwise, time is wasted training a model whose parameters have become NaN, as seen in deepmodeling/dpgen#1460.
Detailed Description
- NaN can be checked when the total loss is already on the CPU (rather than on the GPU), to avoid an extra device-to-host transfer; for example, when the loss is written to `lcurve.out`, its value is already on the CPU (see the sketch after this list).
- Perform the check before the checkpoint is written, so that no checkpoint with NaN parameters is saved.
- Implement the feature for the TensorFlow, PyTorch, and PaddlePaddle backends.
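
A minimal sketch of the intended check, assuming the PyTorch backend; the function name `check_total_loss` and its placement in the training loop are illustrative, not existing DeePMD-kit API:

```python
import math


def check_total_loss(total_loss: float, step: int) -> None:
    """Raise if the total loss has become NaN.

    The caller passes a plain Python float that is already on the CPU
    (e.g. the value about to be written to lcurve.out), so this check
    adds no extra device-to-host transfer.
    """
    if math.isnan(total_loss):
        raise FloatingPointError(
            f"The total loss is NaN at step {step}; stopping training "
            "before a checkpoint with NaN parameters can be saved."
        )
```

The check would run right after the loss value is synchronized for writing to `lcurve.out` and right before the checkpoint is saved. For TensorFlow, `tf.debugging.check_numerics` provides a similar in-graph check, although it runs on the device rather than the CPU.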
Further Information, Files, and Links
No response