datasets Iterable Dataset sharding support #3547

Open
@ValMystletainn

Description

When I'm using Hugging Face datasets, I think a great feature is the IterableDataset created by load_dataset(..., streaming=True). It has shard, shuffle, and state_dict, which are useful in multi-node training: I can just call shard so each node gets different data, which makes it easy to randomize and to recover from interruptions.
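
For illustration, a minimal sketch of the streaming workflow I mean (the dataset name, `world_size`, and `rank` are placeholders):

```python
from datasets import load_dataset

world_size, rank = 2, 0  # would normally come from the launcher / torch.distributed

ds = load_dataset("c4", "en", split="train", streaming=True)

ds = ds.shard(num_shards=world_size, index=rank)  # each rank reads distinct data
ds = ds.shuffle(seed=42, buffer_size=10_000)      # approximate shuffle of the stream

# resumable streaming: save the iteration position and restore it later
state = ds.state_dict()
ds.load_state_dict(state)
```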

However, this breaks when working with accelerate. When a dataloader wrapping an IterableDataset is passed to prepare, accelerate assumes the dataset instance is identical on every rank, and the IterableDatasetShard class does redundant sampling: every rank iterates over all the data, collects enough samples for n batches across all ranks, and yields only the sub-batches belonging to its own rank (src ref).
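
To make the point concrete, this is roughly the behaviour I'm describing (a simplified sketch, not the actual accelerate source):

```python
# every process consumes the *full* stream and keeps only 1/num_processes of it
def iterate_shard(dataset, batch_size, num_processes, process_index):
    buffer = []
    for sample in dataset:                            # all ranks read all samples
        buffer.append(sample)
        if len(buffer) == batch_size * num_processes:
            start = process_index * batch_size
            yield buffer[start:start + batch_size]    # keep only this rank's slice
            buffer = []
```

So the streaming work is duplicated num_processes times instead of being split across ranks.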

Not preparing the dataloader is a temporary workaround in this case, I think. But it would be nice to have automatic shard and shuffle support in accelerate's prepare, which is why I'm opening this issue. If this feature is accepted but no one picks it up, I'd be happy to submit a PR for it myself.
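
The workaround currently looks roughly like this (a hedged sketch; the dataset name and model are placeholders):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from accelerate import Accelerator
from datasets import load_dataset

accelerator = Accelerator()

ds = load_dataset("c4", "en", split="train", streaming=True)
ds = ds.shard(num_shards=accelerator.num_processes, index=accelerator.process_index)
ds = ds.shuffle(seed=42, buffer_size=1_000)

dataloader = DataLoader(ds, batch_size=8)

model = nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# prepare the model and optimizer only; leave the dataloader unprepared so
# accelerate does not wrap it in IterableDatasetShard
model, optimizer = accelerator.prepare(model, optimizer)
```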

I have searched the related issues and found #2859, but that issue is about adding load_state_dict for the dataloader.
