RuntimeError: Can't start new thread #280

@cgebbe

🐛 Bug

The following error occurred after 2h of training in one run and after 11h in another:

Traceback (most recent call last): 
  File "/home/ubuntu/deeplearning/semantic-segmentation/downstream/scripts/lightning/main.py", line 55, in <module> 
    main() 
  File "/home/ubuntu/deeplearning/semantic-segmentation/downstream/scripts/lightning/main.py", line 35, in main 
    _cli = LightningCLI( 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/cli.py", line 394, in __init__ 
    self._run_subcommand(self.subcommand) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/cli.py", line 701, in _run_subcommand 
    fn(**fn_kwargs) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit 
    call._call_and_handle_interrupt( 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt 
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch 
    return function(*args, **kwargs) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl 
    self._run(model, ckpt_path=ckpt_path) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 986, in _run 
    results = self._run_stage() 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1030, in _run_stage 
    self.fit_loop.run() 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run 
    self.advance() 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance 
    self.epoch_loop.run(self._data_fetcher) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run 
    self.advance(data_fetcher) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 212, in advance 
    batch, _, __ = next(data_fetcher) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fetchers.py", line 133, in __next__ 
    batch = super().__next__() 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fetchers.py", line 60, in __next__ 
    batch = next(self.iterator) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/utilities/combined_loader.py", line 341, in __next__ 
    out = next(self._iterator) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/utilities/combined_loader.py", line 78, in __next__ 
    out[i] = next(self.iterators[i]) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__ 
    data = self._next_data() 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data 
    return self._process_data(data) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data 
    data.reraise() 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise 
    raise exception 
RuntimeError: Caught RuntimeError in DataLoader worker process 15. 

Original Traceback (most recent call last): 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop 
    data = fetcher.fetch(index) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch 
    data.append(next(self.dataset_iter)) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/combined.py", line 231, in __next__ 
    return self._get_sample(dataset_index) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/combined.py", line 255, in _get_sample 
    sample = next(self._dataset_iters[dataset_index]) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/dataset.py", line 365, in __next__ 
    data = self.__getitem__( 
  File "/home/ubuntu/deeplearning/semantic-segmentation/downstream/datasets/litdata/lit_dataset.py", line 52, in __getitem__ 
    dct = super().__getitem__(index) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/dataset.py", line 335, in __getitem__ 
    return self.cache[index] 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/cache.py", line 140, in __getitem__ 
    return self._reader.read(index) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/reader.py", line 269, in read 
    self._prepare_thread.start() 
  File "/usr/lib/python3.10/threading.py", line 935, in start 
    _start_new_thread(self._bootstrap, ()) 
RuntimeError: can't start new thread 

Code sample

Unfortunately I can't provide a minimal code sample, but the main points are:

  • each dataset item is a dictionary containing numpy arrays
  • we use CombinedStreamingDataset with ~7000 small StreamingDataset instances. We need to select several subsets of these 7000 datasets, and combining them this way was how we did it (happy to learn about alternatives). While I know this is not optimal, it seemed to work fine at first (and also maxed out GPU utilization)
  • the dataset is wrapped in a simple torch.utils.data.DataLoader
  • the training loop is triggered using lightning.pytorch.cli.LightningCLI
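
A possible way to narrow this down, offered as a sketch rather than a confirmed diagnosis: the traceback ends in `threading.Thread.start()` called from `self._prepare_thread.start()` in litdata's `reader.py`, and CPython raises `can't start new thread` when the OS refuses to create another thread. If each of the ~7000 StreamingDataset instances starts its own prepare thread inside every DataLoader worker (an assumption, not verified against the litdata source), the per-user thread limit could be reached after a few hours. The helper below uses only the standard library; the name `thread_report` is hypothetical and not part of litdata or PyTorch.

```python
import os
import resource
import threading


def thread_report():
    """Collect thread counts and limits for the current process."""
    report = {
        "pid": os.getpid(),
        # Threads created through the `threading` module that are still alive.
        "python_threads": threading.active_count(),
        # Filled in below when /proc is available (Linux).
        "os_threads": None,
    }
    # On Linux, /proc/self/status has a "Threads:" line with the OS-level count,
    # which also includes threads started by native libraries.
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("Threads:"):
                    report["os_threads"] = int(line.split()[1])
                    break
    except OSError:
        pass  # not Linux, or /proc is not mounted
    # On Linux, RLIMIT_NPROC caps processes plus threads per user.
    # A value equal to resource.RLIM_INFINITY means "unlimited".
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    report["nproc_soft_limit"] = soft
    report["nproc_hard_limit"] = hard
    return report


if __name__ == "__main__":
    print(thread_report())
```

Calling `thread_report()` every few thousand samples from the dataset's `__getitem__` (or once from a `worker_init_fn`) and logging the result would show whether the thread count in each worker keeps growing until the crash.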

Environment

  • litdata version: 0.2.18
  • PyTorch Version (e.g., 1.0): [2.3.1](torch: 2.4.0+cu121)
  • OS (e.g., Linux): ubuntu 22.04
  • How you installed PyTorch (conda, pip, source): uv pip
  • Build command you used (if compiling from source):
  • Python version: 3.10
  • CUDA/cuDNN version: 12.2
  • GPU models and configuration: A10G
  • Any other relevant information:

Additional Info
