Summary
When the total loss becomes NaN, an error can be raised to stop the training immediately. Otherwise, time is wasted training a model whose parameters have become NaN, as seen in deepmodeling/dpgen#1460.
Detailed Description
- NaN can be checked when the total loss is already on the CPU (rather than on the GPU), to avoid an extra device-to-host transfer; for example, when the loss is written to `lcurve.out`, its value is already on the CPU (see the sketch after this list).
- Perform the check before the checkpoint is written, so that no checkpoint with NaN parameters is saved.
- Implement the feature for the TensorFlow, PyTorch, and PaddlePaddle backends.
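
A minimal sketch of the intended check, assuming the PyTorch backend; the function name `check_total_loss` and its placement in the training loop are illustrative, not existing DeePMD-kit API:

```python
import math


def check_total_loss(total_loss: float, step: int) -> None:
    """Raise if the total loss has become NaN.

    The caller passes a plain Python float that is already on the CPU
    (e.g. the value about to be written to lcurve.out), so this check
    adds no extra device-to-host transfer.
    """
    if math.isnan(total_loss):
        raise FloatingPointError(
            f"The total loss is NaN at step {step}; stopping training "
            "before a checkpoint with NaN parameters can be saved."
        )
```

The check would run right after the loss value is synchronized for writing to `lcurve.out` and right before the checkpoint is saved. For TensorFlow, `tf.debugging.check_numerics` provides a similar in-graph check, although it runs on the device rather than the CPU.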
Further Information, Files, and Links
No response