
Cannot use num_workers and prefetch_factor when using StatefulDataLoader (use_stateful_dataloader=True) #3110


Description

@hkproj

System Info

- `Accelerate` version: 0.34.2
- Platform: Linux-5.15.0-1057-aws-x86_64-with-glibc2.31
- `accelerate` bash location: /fsx/umar/miniconda3/envs/memory-efficient-transformers/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1999.99 GB
- GPU type: NVIDIA H100 80GB HBM3
- `Accelerate` default config:
        Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction
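
For reference, the Accelerator in this setup is created with the stateful dataloader enabled, roughly like this (a sketch; the actual config code is elided from the snippet below):

from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

# Wrap prepared dataloaders with torchdata's StatefulDataLoader
accelerator = Accelerator(
    dataloader_config=DataLoaderConfiguration(use_stateful_dataloader=True),
)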


from torch.utils.data import DataLoader
from transformers import DataCollatorForLanguageModeling

dataset_streaming = True
ds_train = ...  # Dataset loaded with streaming=True
train_batch_size = 12
collator = DataCollatorForLanguageModeling(...)
dataloader_num_workers = 4
dataloader_prefetch_factor = 10

dl_trainer = DataLoader(
    ds_train,
    batch_size=train_batch_size,
    collate_fn=collator,
    shuffle=not dataset_streaming,  # streaming (iterable) datasets cannot be shuffled by the DataLoader
    drop_last=True,
    num_workers=dataloader_num_workers,
    prefetch_factor=dataloader_prefetch_factor,
    pin_memory=True,
)

model, optimizer, scheduler, dl_eval, dl_trainer = accelerator.prepare(
    model, optimizer, scheduler, dl_eval, dl_trainer
)

for _, batch in enumerate(dl_trainer):
    training_loop()

A DataLoader initialized with num_workers > 0 raises the following error when iterating through the wrapped DataLoader:

[rank0]:     for _, batch in batch_enumerator:
[rank0]:   File "/fsx/umar/miniconda3/envs/memory-efficient-transformers/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
[rank0]:     for obj in iterable:
[rank0]:   File "/fsx/umar/miniconda3/envs/memory-efficient-transformers/lib/python3.10/site-packages/accelerate/data_loader.py", line 798, in __iter__
[rank0]:     next_batch, next_batch_info = self._fetch_batches(main_iterator)
[rank0]:   File "/fsx/umar/miniconda3/envs/memory-efficient-transformers/lib/python3.10/site-packages/accelerate/data_loader.py", line 751, in _fetch_batches
[rank0]:     self._update_state_dict()
[rank0]:   File "/fsx/umar/miniconda3/envs/memory-efficient-transformers/lib/python3.10/site-packages/accelerate/data_loader.py", line 479, in _update_state_dict
[rank0]:     self.adjust_state_dict_for_prefetch()
[rank0]:   File "/fsx/umar/miniconda3/envs/memory-efficient-transformers/lib/python3.10/site-packages/accelerate/data_loader.py", line 459, in adjust_state_dict_for_prefetch
[rank0]:     if self.dl_state_dict["_sampler_iter_yielded"] > 0:
[rank0]: KeyError: '_sampler_iter_yielded'
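
From the traceback, adjust_state_dict_for_prefetch in accelerate/data_loader.py assumes the StatefulDataLoader state dict always contains single-process keys such as _sampler_iter_yielded, which are apparently absent once num_workers > 0. A guarded lookup along these lines would at least avoid the KeyError, though the multi-worker state presumably needs its own adjustment logic (a sketch, not a proposed fix):

# accelerate/data_loader.py, adjust_state_dict_for_prefetch
# Current line that crashes when the key is missing:
#     if self.dl_state_dict["_sampler_iter_yielded"] > 0:
# Guarded variant, assuming a missing key means no adjustment is needed:
if self.dl_state_dict.get("_sampler_iter_yielded", 0) > 0:
    ...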

I also tried the latest development version of accelerate (https://github.com/huggingface/accelerate@9f9951325c69f0a6c7c8ab00df2ab8af23b3c1fa), but I still get the same error.

@muellerzr is aware of this issue.

Expected behavior

I'd like to be able to prefetch multiple batches, which is only possible by setting num_workers to a value greater than 0 (and optionally tuning prefetch_factor).
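
For context, PyTorch's own DataLoader only accepts prefetch_factor when worker processes are used, so there is no single-process workaround (minimal demonstration):

import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(8).float())

# OK: prefetching happens in the worker processes
DataLoader(ds, num_workers=2, prefetch_factor=10)

# Raises ValueError: prefetch_factor requires num_workers > 0
try:
    DataLoader(ds, num_workers=0, prefetch_factor=10)
except ValueError as e:
    print(e)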

Labels

enhancement (New feature or request), feature request (Request for a new feature to be added to Accelerate)
