Description
Hi, first of all thanks a lot for the great repo. All the models provided in the repo are very easy to use.
I have noticed a few problems during training and wanted to bring them to your attention.
The first issue is with multi-GPU training. I have two GPUs with 24 GB of VRAM each, and I tried this:
$ python -m torch.distributed.launch --nproc_per_node=2 --use_env tools/train.py --cfg configs/<CONFIG_FILE_NAME>.yaml
But setup_ddp() fails on int(os.environ(['LOCAL_RANK'])) with the following error:
TypeError: '_Environ' object is not callable
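For reference, this is roughly the change I would expect to fix it; everything except the os.environ indexing is just my guess at what setup_ddp() looks like, not the repo's actual code:

```python
import os
import torch
from torch import distributed as dist

def setup_ddp():
    if 'LOCAL_RANK' in os.environ:
        # os.environ is a mapping, so it must be indexed, not called:
        # os.environ(['LOCAL_RANK']) raises TypeError: '_Environ' object is not callable
        local_rank = int(os.environ['LOCAL_RANK'])
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend='nccl', init_method='env://')
        return local_rank
    return 0
```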
When I train with the single-GPU command, things do run fine, but the dataloader crashes after a few epochs:
Epoch: [1/200] Iter: [4/299] LR: 0.00010241 Loss: 10.58329177: 1%|▊ | 4/299 [00:18<14:23, 2.93s/it]Killed
(detection) philip@philip-Z390-UD: seg_library/tools$ /home/philip/anaconda3/envs/detection/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
The above issue can only be avoided when I do one of the following:
- Force num_workers to 0 instead of mp.cpu_count(), which makes training very slow
- Or make the batch size very small, which also slows down training
When the dataloader crashes, it freezes my entire computer. Do you have any idea how to fix this? As a middle ground I have been experimenting with capping the number of workers, as sketched below.
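This is only a sketch of what I am testing, not the repo's training code; the dataset, batch size, and worker cap are placeholders I chose myself:

```python
import multiprocessing as mp
from torch.utils.data import DataLoader

def make_loader(dataset, batch_size=8, max_workers=4):
    # Use a few workers rather than all CPU cores; each worker process holds
    # its own buffers, which is what seems to exhaust host RAM and get the
    # training process killed.
    num_workers = min(max_workers, mp.cpu_count())
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=True,
        persistent_workers=num_workers > 0,  # avoid re-forking workers every epoch
        drop_last=True,
    )
```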