
Potential dataloader memory leak and problems with multi-gpu training.  #21

Open
@chophilip21

Description


Hi, first of all, thanks a lot for the great repo. All the models provided in the repo are very easy to use.

I have noticed a few problems with training, and I wanted to bring them to your attention.

The first issue is regarding multi-GPU training. I have two GPUs with 24GB of VRAM each. I have tried this:

$ python -m torch.distributed.launch --nproc_per_node=2 --use_env tools/train.py --cfg configs/<CONFIG_FILE_NAME>.yaml

But setup_ddp() fails, and the traceback points at int(os.environ(['LOCAL_RANK'])) with the following error:

TypeError: '_Environ' object is not callable
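If it helps, I suspect the problem is just how LOCAL_RANK is read: os.environ is a mapping, so it needs subscript access rather than being called like a function. Below is a minimal sketch of what I would expect setup_ddp() to do; the body is my guess, not the repo's actual implementation.

```python
import os
import torch
import torch.distributed as dist

def setup_ddp():
    # Broken: int(os.environ(['LOCAL_RANK'])) raises
    #   TypeError: '_Environ' object is not callable
    # because os.environ is a mapping, not a callable.
    local_rank = int(os.environ['LOCAL_RANK'])  # set by torch.distributed.launch --use_env
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend='nccl', init_method='env://')
    return local_rank
```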

When I try training with the single-GPU command, things do run fine, but the dataloader crashes after a few epochs.

Epoch: [1/200] Iter: [4/299] LR: 0.00010241 Loss: 10.58329177:   1%|▊                                                                | 4/299 [00:18<14:23,  2.93s/it]Killed
(detection) philip@philip-Z390-UD: seg_library/tools$ /home/philip/anaconda3/envs/detection/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

The above issue can only be avoided when I do one of the following:

  1. Force num_workers to 0 instead of using mp.cpu_count() (which is super slow)
  2. Or make the batch size very small, which also slows down training.

When the dataloader crashes, it freezes my entire computer, and I am wondering if you have any idea how to fix the above issue.
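For reference, this is roughly the DataLoader configuration I have to fall back to right now to keep training alive; train_dataset and the batch size are placeholders, the real values come from the config file and tools/train.py.

```python
from torch.utils.data import DataLoader

# Workaround that avoids the crash on my machine, at the cost of speed:
# no worker processes and a small batch size.
train_loader = DataLoader(
    train_dataset,   # placeholder for the dataset built in tools/train.py
    batch_size=2,    # much smaller than the config default
    num_workers=0,   # mp.cpu_count() workers eventually exhaust RAM and get killed
    pin_memory=True,
    drop_last=True,
)
```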
