
Add IterableDataset.reshard() #7992

Merged
lhoestq merged 8 commits into main from iterable-dataset-reshard on Feb 4, 2026

Conversation

@lhoestq (Member) commented Feb 4, 2026

To increase the number of shards of a dataset, you can use [IterableDataset.reshard]:

>>> dataset
IterableDataset({
    features: ['label', 'title', 'content'],
    num_shards: 4
})
>>> dataset.reshard()
IterableDataset({
    features: ['label', 'title', 'content'],
    num_shards: 3600
})

The resharding mechanism depends on the dataset file format. For Parquet, for example, it reshards at the row-group level instead of using one file per shard.

Support for other formats can be added later (e.g. JSON Lines and CSV, which can be split by recovering line boundaries from arbitrary byte offsets).
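The line-boundary trick mentioned above can be sketched like this: pick arbitrary byte offsets, then advance each one to the next newline so every shard starts on a record boundary. The `shard_offsets` helper is a hypothetical name for illustration, not part of `datasets`.

```python
# Hedged sketch: splitting a JSON Lines payload into shards by recovering
# line boundaries from arbitrary byte offsets.
import json

data = b"".join(json.dumps({"i": i}).encode() + b"\n" for i in range(100))

def shard_offsets(raw: bytes, num_shards: int) -> list[int]:
    """Return shard start offsets, aligned to line boundaries."""
    offsets = [0]
    for k in range(1, num_shards):
        pos = k * len(raw) // num_shards  # arbitrary split point
        nl = raw.index(b"\n", pos)        # recover the line boundary
        offsets.append(nl + 1)            # shard starts after the newline
    return offsets

offsets = shard_offsets(data, 4)
bounds = offsets + [len(data)]
shards = [data[bounds[i]:bounds[i + 1]] for i in range(4)]
# Every record is preserved exactly once, in order.
records = [json.loads(line) for s in shards for line in s.splitlines()]
```

The same approach works for CSV, modulo care for quoted fields that may themselves contain newlines.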

Other details:

related to #7917

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq lhoestq merged commit 025593f into main Feb 4, 2026
13 of 15 checks passed
@lhoestq lhoestq deleted the iterable-dataset-reshard branch February 4, 2026 18:55


Development

Successfully merging this pull request may close these issues.

Data duplication with split_dataset_by_node and interleaved_dataset
concatenate_datasets does not preserve shuffling state
