Training Progress stuck at end of Epoch #7182
OlfwayAdbayIgbay
started this conversation in General
-
I am training a 3D autoencoder, which is built like this:
Conv3d
Linear layer
Linear layer
ConvTranspose3d
I am trying to overfit a really small set of 54 samples with a batch size of 3.
Every epoch finishes in around 3 seconds, but when the epoch hits 100%, training pauses for around 20 seconds, sometimes longer.
Funnily enough, I do not get that problem when I get rid of the linear layers.
Also, the loss shoots up at the last step of training, even when shuffle=True is set in the dataloader.
Any ideas what I'm doing wrong?
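(This is not the poster's actual code, but a minimal sketch of an autoencoder shaped like the description above, Conv3d -> Linear -> Linear -> ConvTranspose3d. The channel counts, the 16x16x16 input resolution, and `latent_dim` are assumptions for illustration only.)

```python
import torch
import torch.nn as nn

class TinyAutoencoder3D(nn.Module):
    """Conv3d -> Linear -> Linear -> ConvTranspose3d, as described above.
    All shapes and channel counts are illustrative assumptions."""

    def __init__(self, in_channels=1, hidden_channels=8, latent_dim=128, grid=8):
        super().__init__()
        # Encoder: one 3D convolution that halves the spatial resolution (16^3 -> 8^3).
        self.encoder = nn.Conv3d(in_channels, hidden_channels, kernel_size=4, stride=2, padding=1)
        flat = hidden_channels * grid ** 3  # flattened feature size after the conv
        # Bottleneck: two linear layers (flatten -> latent -> flatten back).
        self.fc_down = nn.Linear(flat, latent_dim)
        self.fc_up = nn.Linear(latent_dim, flat)
        # Decoder: one transposed 3D convolution back to the input resolution.
        self.decoder = nn.ConvTranspose3d(hidden_channels, in_channels, kernel_size=4, stride=2, padding=1)
        self.hidden_channels, self.grid = hidden_channels, grid

    def forward(self, x):
        z = self.encoder(x)                       # (B, hidden_channels, 8, 8, 8)
        z = self.fc_down(z.flatten(start_dim=1))  # (B, latent_dim)
        z = self.fc_up(z)                         # (B, hidden_channels * 8^3)
        z = z.view(-1, self.hidden_channels, self.grid, self.grid, self.grid)
        return self.decoder(z)                    # (B, in_channels, 16, 16, 16)

if __name__ == "__main__":
    model = TinyAutoencoder3D()
    x = torch.randn(3, 1, 16, 16, 16)  # batch size 3, as in the description
    print(model(x).shape)              # torch.Size([3, 1, 16, 16, 16])
```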
Replies: 1 comment 1 reply
-
Regarding your pause of around 20 seconds: that's totally possible, since this is the time we store checkpoints, do all the logging aggregation, etc. This may also include reloading your dataloader and recreating workers if those aren't persistent. Regarding the loss: that's quite strange. Unfortunately, we cannot help you until we have some code to look at. Could you post an exemplary version?
1 reply
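On the worker-recreation point: a minimal sketch, assuming a standard PyTorch DataLoader, of keeping worker processes alive across epochs with `persistent_workers` instead of respawning them at every epoch boundary. The stand-in dataset and the `num_workers` value are illustrative, not taken from the thread.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in for the real 3D dataset: 54 random volumes.
dataset = TensorDataset(torch.randn(54, 1, 16, 16, 16))

loader = DataLoader(
    dataset,
    batch_size=3,
    shuffle=True,
    num_workers=2,            # assumed value; workers only matter here if num_workers > 0
    persistent_workers=True,  # keep worker processes alive between epochs instead of recreating them
)

for epoch in range(3):
    for (batch,) in loader:
        pass  # training step would go here
```

With `num_workers=0` there are no worker processes at all, so any remaining pause at the end of an epoch would come from checkpointing and logging aggregation rather than from worker startup.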