🐛 Bug
I got the following error after training for 2h or 11h:
```
Traceback (most recent call last):
File "/home/ubuntu/deeplearning/semantic-segmentation/downstream/scripts/lightning/main.py", line 55, in <module>
main()
File "/home/ubuntu/deeplearning/semantic-segmentation/downstream/scripts/lightning/main.py", line 35, in main
_cli = LightningCLI(
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/cli.py", line 394, in __init__
self._run_subcommand(self.subcommand)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/cli.py", line 701, in _run_subcommand
fn(**fn_kwargs)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
call._call_and_handle_interrupt(
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 986, in _run
results = self._run_stage()
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1030, in _run_stage
self.fit_loop.run()
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
self.advance()
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
self.epoch_loop.run(self._data_fetcher)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
self.advance(data_fetcher)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 212, in advance
batch, _, __ = next(data_fetcher)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fetchers.py", line 133, in __next__
batch = super().__next__()
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fetchers.py", line 60, in __next__
batch = next(self.iterator)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/utilities/combined_loader.py", line 341, in __next__
out = next(self._iterator)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/utilities/combined_loader.py", line 78, in __next__
out[i] = next(self.iterators[i])
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 15.
Original Traceback (most recent call last):
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
data.append(next(self.dataset_iter))
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/combined.py", line 231, in __next__
return self._get_sample(dataset_index)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/combined.py", line 255, in _get_sample
sample = next(self._dataset_iters[dataset_index])
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/dataset.py", line 365, in __next__
data = self.__getitem__(
File "/home/ubuntu/deeplearning/semantic-segmentation/downstream/datasets/litdata/lit_dataset.py", line 52, in __getitem__
dct = super().__getitem__(index)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/dataset.py", line 335, in __getitem__
return self.cache[index]
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/cache.py", line 140, in __getitem__
return self._reader.read(index)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/reader.py", line 269, in read
self._prepare_thread.start()
File "/usr/lib/python3.10/threading.py", line 935, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
```
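In case it helps with debugging: a quick check like the one below (nothing litdata-specific, just the standard library; the helper name is mine) could show whether threads accumulate in the training/worker processes over time before the limit is hit.

```python
# Minimal helper to watch thread growth over time; call it periodically
# (e.g., every N batches). Purely diagnostic, nothing litdata-specific.
import threading
import time

def log_thread_count(tag: str = "") -> None:
    threads = threading.enumerate()
    print(f"[{time.strftime('%H:%M:%S')}] {tag} live threads: {len(threads)}")
    for t in threads:
        print(f"  - {t.name} (daemon={t.daemon})")
```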
Code sample
Unfortunately I can't provide a minimal code sample, but the main points are (a rough sketch follows the list):
- each dataset item is a dictionary containing numpy arrays
- we use `CombinedStreamingDataset` with ~7000 small `StreamingDataset`s. The reason is that we need to specify several subsets of the 7000 datasets, and this is the way we found to do it (happy to learn about alternatives). While I know this is not optimal, it seemed to work fine at first (and also maxed out GPU utilization)
- the dataset is wrapped in a simple `torch.utils.data.DataLoader`
- the training loop is triggered using `lightning.pytorch.cli.LightningCLI`
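For reference, here is a rough sketch of the setup. The paths, subset names, and batch/worker sizes are placeholders; the real code builds ~7000 datasets.

```python
# Rough sketch of the setup described above (placeholder paths and sizes).
from torch.utils.data import DataLoader
from litdata import CombinedStreamingDataset, StreamingDataset

subset_dirs = [f"/data/optimized/{name}" for name in ("subset_a", "subset_b", "subset_c")]

datasets = [StreamingDataset(input_dir=d) for d in subset_dirs]  # ~7000 in practice
combined = CombinedStreamingDataset(datasets=datasets)

# Each item is a dict of numpy arrays; the combined dataset is wrapped in a
# plain torch DataLoader and driven by LightningCLI.
loader = DataLoader(combined, batch_size=8, num_workers=16)
```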
Environment
- litdata version: 0.2.18
- PyTorch Version (e.g., 1.0): 2.3.1 (torch: 2.4.0+cu121)
- OS (e.g., Linux): Ubuntu 22.04
- How you installed PyTorch (conda, pip, source): uv pip
- Build command you used (if compiling from source):
- Python version: 3.10
- CUDA/cuDNN version: 12.2
- GPU models and configuration: A10G
- Any other relevant information:
Additional Info
- The following dask issue looks very similar: RuntimeError: can't start new thread dask/dask#1780
- Is it possible that this issue could be circumvented by defining a thread pool with a maximum size? (see the sketch below)
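To illustrate what I mean (purely a sketch of the idea, not how litdata is implemented): instead of starting a fresh thread per read, the prepare work could be submitted to a bounded executor so the number of live threads stays capped. `prepare_fn` is a stand-in for whatever work currently runs on the per-read thread.

```python
# Illustrative only: a bounded pool instead of one thread per read.
# None of these names come from litdata.
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable

_PREPARE_POOL = ThreadPoolExecutor(max_workers=4)  # cap picked arbitrarily

def prepare_async(prepare_fn: Callable[..., Any], *args: Any) -> Future:
    # Submitting to the pool reuses worker threads, so the process never holds
    # more than max_workers prepare threads at once.
    return _PREPARE_POOL.submit(prepare_fn, *args)
```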