Description
When I'm using Hugging Face datasets, I think a great feature is the IterableDataset created by load_dataset(..., streaming=True). It has shard, shuffle and state_dict, which are useful in multi-node training: I can just call shard to give each node different data, making it easy to randomize and to recover from checkpoints.
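For context, here is a minimal sketch of the workflow I mean. The dataset name and the rank/world_size values are placeholders (normally they would come from the distributed launcher), and exact signatures may vary across datasets versions:

```python
from datasets import load_dataset

rank, world_size = 0, 8  # placeholders; normally taken from the launcher

ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
ds = ds.shard(num_shards=world_size, index=rank)  # disjoint data per node
ds = ds.shuffle(seed=42, buffer_size=1000)        # buffered approximate shuffle

state = ds.state_dict()    # checkpoint the stream position...
ds.load_state_dict(state)  # ...and resume from it later
```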
However, it fails when working with accelerate. When a dataloader wrapping an IterableDataset is passed to prepare, accelerate assumes the dataset instance is identical on every rank, and the IterableDatasetShard class does redundant sampling: every rank samples all n batches and yields only the sub-batches belonging to its own rank (see the IterableDatasetShard source).
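To illustrate the overhead, here is a simplified sketch of that behavior (my own model of it, not the actual accelerate code): every rank consumes the entire stream and keeps only its own slice of each global batch.

```python
def sharded_iter(dataset, batch_size, rank, world_size):
    """Simplified model of IterableDatasetShard's slicing behavior."""
    buffer = []
    for example in dataset:  # every rank iterates the FULL dataset
        buffer.append(example)
        if len(buffer) == batch_size * world_size:
            # keep only this rank's sub-batch; the rest is thrown away
            yield buffer[rank * batch_size : (rank + 1) * batch_size]
            buffer = []
```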
Not preparing the dataloader is a temporary workaround in this case, I think. But it would be nice to have automatic shard and shuffle in accelerate's prepare, so I'm opening this issue for it, as sketched below. If this feature is approved but nobody picks it up, I can do the feature PR myself.
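A sketch of the workaround, assuming a placeholder dataset and toy model (num_processes and process_index are the standard Accelerator attributes): shard the stream manually and leave the dataloader out of prepare.

```python
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from accelerate import Accelerator

accelerator = Accelerator()

# shard the stream manually so each rank reads disjoint data
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
ds = ds.shard(num_shards=accelerator.num_processes, index=accelerator.process_index)

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters())

# prepare model/optimizer, but deliberately skip preparing the dataloader
model, optimizer = accelerator.prepare(model, optimizer)
loader = DataLoader(ds, batch_size=8)
```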
I have searched the related issues and found #2859, but that issue is aiming to add load_state_dict for the dataloader.