
Add IterableDataset.reshard() #7992

Merged
lhoestq merged 8 commits into main from iterable-dataset-reshard on Feb 4, 2026

Conversation

@lhoestq (Member) commented Feb 4, 2026

To increase the number of shards of a dataset, you can use [IterableDataset.reshard]:

>>> dataset
IterableDataset({
    features: ['label', 'title', 'content'],
    num_shards: 4
})
>>> dataset.reshard()
IterableDataset({
    features: ['label', 'title', 'content'],
    num_shards: 3600
})

The resharding mechanism depends on the dataset file format. For Parquet, for example, it reshards at the row-group level instead of using one file per shard.

Support for other formats can be added later (e.g. JSON Lines and CSV, which can be split by recovering line boundaries from arbitrary byte offsets).
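The line-boundary trick mentioned above can be sketched like this: pick arbitrary byte offsets, then advance each one to the next newline so every shard starts on a record boundary. The `shard_offsets` helper is a hypothetical name for illustration, not part of `datasets`.

```python
# Hedged sketch: splitting a JSON Lines payload into shards by recovering
# line boundaries from arbitrary byte offsets.
import json

data = b"".join(json.dumps({"i": i}).encode() + b"\n" for i in range(100))

def shard_offsets(raw: bytes, num_shards: int) -> list[int]:
    """Return shard start offsets, aligned to line boundaries."""
    offsets = [0]
    for k in range(1, num_shards):
        pos = k * len(raw) // num_shards  # arbitrary split point
        nl = raw.index(b"\n", pos)        # recover the line boundary
        offsets.append(nl + 1)            # shard starts after the newline
    return offsets

offsets = shard_offsets(data, 4)
bounds = offsets + [len(data)]
shards = [data[bounds[i]:bounds[i + 1]] for i in range(4)]
# Every record is preserved exactly once, in order.
records = [json.loads(line) for s in shards for line in s.splitlines()]
```

The same approach works for CSV, modulo care for quoted fields that may themselves contain newlines.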

Other details:

related to #7917

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq lhoestq merged commit 025593f into main Feb 4, 2026
13 of 15 checks passed
@lhoestq lhoestq deleted the iterable-dataset-reshard branch February 4, 2026 18:55


Development

Successfully merging this pull request may close these issues.

Data duplication with split_dataset_by_node and interleaved_dataset
concatenate_datasets does not preserve shuffling state
