Training Progress stuck at end of Epoch #7182
OlfwayAdbayIgbay
started this conversation in General
-
I am training a 3D autoencoder, which is built like this:
Conv3d
Linear layer
Linear layer
ConvTranspose3d
I am trying to overfit a really small set of 54 samples with a batch size of 3.
Every epoch finishes in around 3 seconds, but when the epoch hits 100%, training pauses for around 20 seconds, sometimes longer.
Funnily enough, I do not get that problem when I get rid of the linear layers.
Also, the loss shoots up at the last step of training, even when shuffle=True is set in the dataloader.
Any ideas what I'm doing wrong?
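(This is not the poster's actual code, but a minimal sketch of an autoencoder shaped like the description above, Conv3d -> Linear -> Linear -> ConvTranspose3d. The channel counts, the 16x16x16 input resolution, and `latent_dim` are assumptions for illustration only.)

```python
import torch
import torch.nn as nn

class TinyAutoencoder3D(nn.Module):
    """Conv3d -> Linear -> Linear -> ConvTranspose3d, as described above.
    All shapes and channel counts are illustrative assumptions."""

    def __init__(self, in_channels=1, hidden_channels=8, latent_dim=128, grid=8):
        super().__init__()
        # Encoder: one 3D convolution that halves the spatial resolution (16^3 -> 8^3).
        self.encoder = nn.Conv3d(in_channels, hidden_channels, kernel_size=4, stride=2, padding=1)
        flat = hidden_channels * grid ** 3  # flattened feature size after the conv
        # Bottleneck: two linear layers (flatten -> latent -> flatten back).
        self.fc_down = nn.Linear(flat, latent_dim)
        self.fc_up = nn.Linear(latent_dim, flat)
        # Decoder: one transposed 3D convolution back to the input resolution.
        self.decoder = nn.ConvTranspose3d(hidden_channels, in_channels, kernel_size=4, stride=2, padding=1)
        self.hidden_channels, self.grid = hidden_channels, grid

    def forward(self, x):
        z = self.encoder(x)                       # (B, hidden_channels, 8, 8, 8)
        z = self.fc_down(z.flatten(start_dim=1))  # (B, latent_dim)
        z = self.fc_up(z)                         # (B, hidden_channels * 8^3)
        z = z.view(-1, self.hidden_channels, self.grid, self.grid, self.grid)
        return self.decoder(z)                    # (B, in_channels, 16, 16, 16)

if __name__ == "__main__":
    model = TinyAutoencoder3D()
    x = torch.randn(3, 1, 16, 16, 16)  # batch size 3, as in the description
    print(model(x).shape)              # torch.Size([3, 1, 16, 16, 16])
```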
Replies: 1 comment 1 reply
-
Regarding your pause of around 20 seconds: that's totally possible, since this is the time we store checkpoints, do all the logging aggregation, etc. This may also include reloading your dataloader and recreating workers if those aren't persistent. Regarding the loss: that's quite strange. Unfortunately, we cannot help you until we have some code to look at. Could you post an exemplary version?
1 reply
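On the worker-recreation point: a minimal sketch, assuming a standard PyTorch DataLoader, of keeping worker processes alive across epochs with `persistent_workers` instead of respawning them at every epoch boundary. The stand-in dataset and the `num_workers` value are illustrative, not taken from the thread.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in for the real 3D dataset: 54 random volumes.
dataset = TensorDataset(torch.randn(54, 1, 16, 16, 16))

loader = DataLoader(
    dataset,
    batch_size=3,
    shuffle=True,
    num_workers=2,            # assumed value; workers only matter here if num_workers > 0
    persistent_workers=True,  # keep worker processes alive between epochs instead of recreating them
)

for epoch in range(3):
    for (batch,) in loader:
        pass  # training step would go here
```

With `num_workers=0` there are no worker processes at all, so any remaining pause at the end of an epoch would come from checkpointing and logging aggregation rather than from worker startup.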