datasets Iterable Dataset sharding support #3547

Open
@ValMystletainn

Description

When I'm using Hugging Face datasets, I think a great feature is the IterableDataset created by load_dataset(..., streaming=True). It has shard, shuffle, and state_dict, which are useful in multi-node training: I can just call shard so each node gets different data, which makes it easy to randomize and to recover from interruptions.
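
For illustration, a minimal sketch of the streaming workflow I mean (the dataset name, `world_size`, and `rank` are placeholders):

```python
from datasets import load_dataset

world_size, rank = 2, 0  # would normally come from the launcher / torch.distributed

ds = load_dataset("c4", "en", split="train", streaming=True)

ds = ds.shard(num_shards=world_size, index=rank)  # each rank reads distinct data
ds = ds.shuffle(seed=42, buffer_size=10_000)      # approximate shuffle of the stream

# resumable streaming: save the iteration position and restore it later
state = ds.state_dict()
ds.load_state_dict(state)
```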

However, this breaks when working with accelerate. When a dataloader wrapping an IterableDataset is passed to prepare, accelerate assumes the dataset instance is identical on every rank, and the IterableDatasetShard class does redundant sampling: every rank iterates over all the data, collects enough samples for n batches across all ranks, and yields only the sub-batches belonging to its own rank (src ref).
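
To make the point concrete, this is roughly the behaviour I'm describing (a simplified sketch, not the actual accelerate source):

```python
# every process consumes the *full* stream and keeps only 1/num_processes of it
def iterate_shard(dataset, batch_size, num_processes, process_index):
    buffer = []
    for sample in dataset:                            # all ranks read all samples
        buffer.append(sample)
        if len(buffer) == batch_size * num_processes:
            start = process_index * batch_size
            yield buffer[start:start + batch_size]    # keep only this rank's slice
            buffer = []
```

So the streaming work is duplicated num_processes times instead of being split across ranks.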

Not preparing the dataloader is a temporary workaround in this case, I think. But it would be nice to have automatic shard and shuffle support in accelerate's prepare, which is why I'm opening this issue. If this feature is accepted but no one picks it up, I'd be happy to submit a PR for it myself.
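
The workaround currently looks roughly like this (a hedged sketch; the dataset name and model are placeholders):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from accelerate import Accelerator
from datasets import load_dataset

accelerator = Accelerator()

ds = load_dataset("c4", "en", split="train", streaming=True)
ds = ds.shard(num_shards=accelerator.num_processes, index=accelerator.process_index)
ds = ds.shuffle(seed=42, buffer_size=1_000)

dataloader = DataLoader(ds, batch_size=8)

model = nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# prepare the model and optimizer only; leave the dataloader unprepared so
# accelerate does not wrap it in IterableDatasetShard
model, optimizer = accelerator.prepare(model, optimizer)
```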

I have searched the related issues and found #2859, but that issue is about adding load_state_dict for the dataloader.
