Potential dataloader memory leak and problems with multi-gpu training. #21
Comments
Thanks for your detailed response. About the first issue: it is a typing error, and the call just needs to be corrected. As for the second issue, I haven't faced an error like that, but I also occasionally have problems with the dataloader's num_workers. My fix is to set num_workers manually to a value like 4, 8, 16, or 32. If you use your own dataloader (dataset), check the code again. If you find a solution, please leave a comment.
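For reference, a minimal sketch of the presumed fix, assuming the typo is the parenthesized call to `os.environ` mentioned in the original report below (it is a mapping, so it must be indexed, not called):

```python
import os

# int(os.environ(['LOCAL_RANK']))  # raises TypeError: os.environ is not callable
local_rank = int(os.environ['LOCAL_RANK'])  # correct dictionary-style access
# Note: LOCAL_RANK is only present when the process is started by a DDP launcher.
```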
Thanks a lot for the reply. I have tried constraining num_workers to a power of two, simply rounding it down to the nearest power of two below the CPU count, along these lines:
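Roughly like this (a sketch; the helper name is illustrative, not code from the repo):

```python
import os

def largest_power_of_two_at_most(n: int) -> int:
    """Round n down to the nearest power of two (illustrative helper)."""
    p = 1
    while p * 2 <= n:
        p *= 2
    return p

cpu_count = os.cpu_count() or 1
num_workers = largest_power_of_two_at_most(cpu_count)  # e.g. 24 cores -> 16 workers
```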
But even if I do the above, whenever the batch size is above 64, the dataloader just freezes after a few epochs. The only way I can bypass this is num_workers=0 or a batch size of 32. For DDP, I do not think the logic is fully implemented. I found a way to properly get the local rank information, but now my training never starts.
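For what it's worth, this is the kind of DDP initialisation I mean (a sketch only; the function name and structure are illustrative and not the repo's actual setup_ddp):

```python
import os
import torch
from torch import distributed as dist

def setup_ddp() -> int:
    """Illustrative DDP setup using the LOCAL_RANK env var set by the launcher (e.g. torchrun)."""
    local_rank = int(os.environ['LOCAL_RANK'])  # provided by the launcher, one value per process
    torch.cuda.set_device(local_rank)           # bind this process to a single GPU
    dist.init_process_group(backend='nccl')     # env:// init reads RANK/WORLD_SIZE/MASTER_* vars
    return local_rank
```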
I do not have much of a problem with the single-GPU results, as they are already quite decent. But if you plan to keep maintaining this repo, the above two issues can be quite critical. If you only have a single GPU, it might not be possible to fix the DDP issue, though. I am a little surprised that I am the only one facing issues with the dataloader. It could be because I am testing on my own custom dataset, but its size is only 1/5 of the COCO-Stuff dataset.
Very sorry for the late reply. I do wish to maintain this repo, but due to time and resource constraints it is quite difficult.
Hi, first of all, thanks a lot for the great repo. All the models provided in the repo are very easy to use.
I have noticed a few problems with the training process and wanted to bring them to your attention.
The first issue is regarding multi-GPU training. I have two GPUs with 24 GB of VRAM each. I have tried this:
But setup_ddp() fails, suggesting that int(os.environ(['LOCAL_RANK'])) has the issue below:
When I train using the single-GPU command, things do run fine, but the dataloader crashes after a few epochs.
The above issue can only be avoided when I do the following:
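Roughly, the workaround amounts to settings like these (a sketch based on the batch size and num_workers values discussed in the comments; the dataset here is a dummy just to keep the snippet self-contained):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset so the snippet runs on its own; in practice this is the repo's segmentation dataset.
dataset = TensorDataset(torch.randn(256, 3, 64, 64), torch.zeros(256, dtype=torch.long))

# Workaround: either disable worker processes entirely (num_workers=0)
# or keep the batch size at 32 or below; anything above 64 eventually freezes for me.
loader = DataLoader(dataset, batch_size=32, num_workers=0, shuffle=True, pin_memory=True)

for images, labels in loader:
    pass  # training step would go here
```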
When the dataloader crashes, it freezes my entire computer, and I was wondering if you have any idea how to fix this issue.