🐛 Bug
I got the following error after training for 2h or 11h:
```
Traceback (most recent call last):
File "/home/ubuntu/deeplearning/semantic-segmentation/downstream/scripts/lightning/main.py", line 55, in <module>
main()
File "/home/ubuntu/deeplearning/semantic-segmentation/downstream/scripts/lightning/main.py", line 35, in main
_cli = LightningCLI(
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/cli.py", line 394, in __init__
self._run_subcommand(self.subcommand)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/cli.py", line 701, in _run_subcommand
fn(**fn_kwargs)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
call._call_and_handle_interrupt(
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 986, in _run
results = self._run_stage()
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1030, in _run_stage
self.fit_loop.run()
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
self.advance()
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
self.epoch_loop.run(self._data_fetcher)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
self.advance(data_fetcher)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 212, in advance
batch, _, __ = next(data_fetcher)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fetchers.py", line 133, in __next__
batch = super().__next__()
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fetchers.py", line 60, in __next__
batch = next(self.iterator)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/utilities/combined_loader.py", line 341, in __next__
out = next(self._iterator)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/utilities/combined_loader.py", line 78, in __next__
out[i] = next(self.iterators[i])
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 15.
Original Traceback (most recent call last):
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
data.append(next(self.dataset_iter))
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/combined.py", line 231, in __next__
return self._get_sample(dataset_index)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/combined.py", line 255, in _get_sample
sample = next(self._dataset_iters[dataset_index])
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/dataset.py", line 365, in __next__
data = self.__getitem__(
File "/home/ubuntu/deeplearning/semantic-segmentation/downstream/datasets/litdata/lit_dataset.py", line 52, in __getitem__
dct = super().__getitem__(index)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/dataset.py", line 335, in __getitem__
return self.cache[index]
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/cache.py", line 140, in __getitem__
return self._reader.read(index)
File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/reader.py", line 269, in read
self._prepare_thread.start()
File "/usr/lib/python3.10/threading.py", line 935, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
```
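In case it helps with debugging: a quick check like the one below (nothing litdata-specific, just the standard library; the helper name is mine) could show whether threads accumulate in the training/worker processes over time before the limit is hit.

```python
# Minimal helper to watch thread growth over time; call it periodically
# (e.g., every N batches). Purely diagnostic, nothing litdata-specific.
import threading
import time

def log_thread_count(tag: str = "") -> None:
    threads = threading.enumerate()
    print(f"[{time.strftime('%H:%M:%S')}] {tag} live threads: {len(threads)}")
    for t in threads:
        print(f"  - {t.name} (daemon={t.daemon})")
```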
Code sample
Unfortunately I can't provide a minimal code sample, but the main points are (a rough sketch follows the list):
- each dataset item is a dictionary containing numpy arrays
- we use `CombinedStreamingDataset` with ~7000 small `StreamingDataset`s. The reason is that we need to specify several subsets of the 7000 datasets, and this is the way we found to do it (happy to learn about alternatives). While I know this is not optimal, it seemed to work fine at first (and also maxed out GPU utilization)
- the dataset is wrapped in a simple `torch.utils.data.DataLoader`
- the training loop is triggered using `lightning.pytorch.cli.LightningCLI`
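For reference, here is a rough sketch of the setup. The paths, subset names, and batch/worker sizes are placeholders; the real code builds ~7000 datasets.

```python
# Rough sketch of the setup described above (placeholder paths and sizes).
from torch.utils.data import DataLoader
from litdata import CombinedStreamingDataset, StreamingDataset

subset_dirs = [f"/data/optimized/{name}" for name in ("subset_a", "subset_b", "subset_c")]

datasets = [StreamingDataset(input_dir=d) for d in subset_dirs]  # ~7000 in practice
combined = CombinedStreamingDataset(datasets=datasets)

# Each item is a dict of numpy arrays; the combined dataset is wrapped in a
# plain torch DataLoader and driven by LightningCLI.
loader = DataLoader(combined, batch_size=8, num_workers=16)
```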
Environment
- litdata version: 0.2.18
- PyTorch Version (e.g., 1.0): 2.3.1 (torch: 2.4.0+cu121)
- OS (e.g., Linux): Ubuntu 22.04
- How you installed PyTorch (conda, pip, source): uv pip
- Build command you used (if compiling from source):
- Python version: 3.10
- CUDA/cuDNN version: 12.2
- GPU models and configuration: A10G
- Any other relevant information:
Additional Info
- The following dask issue looks very similar: RuntimeError: can't start new thread dask/dask#1780
- Is it possible that this issue could be circumvented by defining a thread pool with a maximum size? (see the sketch below)
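To illustrate what I mean (purely a sketch of the idea, not how litdata is implemented): instead of starting a fresh thread per read, the prepare work could be submitted to a bounded executor so the number of live threads stays capped. `prepare_fn` is a stand-in for whatever work currently runs on the per-read thread.

```python
# Illustrative only: a bounded pool instead of one thread per read.
# None of these names come from litdata.
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable

_PREPARE_POOL = ThreadPoolExecutor(max_workers=4)  # cap picked arbitrarily

def prepare_async(prepare_fn: Callable[..., Any], *args: Any) -> Future:
    # Submitting to the pool reuses worker threads, so the process never holds
    # more than max_workers prepare threads at once.
    return _PREPARE_POOL.submit(prepare_fn, *args)
```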