Fix batching for table-formatted datasets #8126

Merged
lhoestq merged 1 commit into huggingface:main from bluehyena:fix-8075-batch-table-formats on Apr 10, 2026

Conversation

@bluehyena
Contributor

Summary

Fix #8075

Fix .batch() on table-formatted Dataset and IterableDataset.

Before this change, calling .batch() on datasets formatted as pyarrow, pandas, or polars could fail because the batching path assumed dict-like inputs. This updates batching to use an Arrow-based path for table-style formats, so batching works regardless of whether the table format is applied before or after .batch(). These two forms now behave equivalently:

dataset.with_format(format_type).batch(n)

and

dataset.batch(n).with_format(format_type)

for pyarrow, pandas, and polars.
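
To make the "dict-like inputs" assumption concrete, here is a minimal, library-free sketch of a batching step that only works on a mapping of column names to values. It is an illustration under assumptions, not the code in `datasets`: `batch_rows_dict` is a hypothetical name, and `drop_last_batch` is used here only as an illustrative flag. Because it calls `.values()` and `.items()`, it would break on an input that does not behave like a mapping of column name to a list of values, which is the kind of mismatch a table-formatted batch can trigger.

```python
from typing import Any, Dict, Iterator, List


def batch_rows_dict(
    columns: Dict[str, List[Any]],
    batch_size: int,
    drop_last_batch: bool = False,
) -> Iterator[Dict[str, List[Any]]]:
    """Yield batches from a column-oriented dict (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # This relies on the mapping interface: .values() and .items().
    num_rows = len(next(iter(columns.values()))) if columns else 0
    for start in range(0, num_rows, batch_size):
        stop = min(start + batch_size, num_rows)
        # Optionally skip a trailing batch that is shorter than batch_size.
        if drop_last_batch and stop - start < batch_size:
            break
        yield {name: values[start:stop] for name, values in columns.items()}


data = {"id": [0, 1, 2, 3, 4], "text": ["a", "b", "c", "d", "e"]}
batches = list(batch_rows_dict(data, batch_size=2))
full_only = list(batch_rows_dict(data, batch_size=2, drop_last_batch=True))
```

With five rows and a batch size of 2, `batches` holds three batches (the last one has a single row), while `full_only` keeps only the two full batches.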

What changed

  • added a shared _batch_arrow_table helper to build batched Arrow tables
  • updated Dataset.batch() to route table-formatted batching through an Arrow .map(...) path
  • updated IterableDataset.batch() to do the same and then restore the original table format
  • added regression tests for pyarrow, pandas, and polars on both Dataset and IterableDataset

Tests

  • python -m pytest tests/test_arrow_dataset.py -k "test_dataset_batch or test_dataset_batch_with_table_format or test_dataset_batch_with_polars_format" -q
  • python -m pytest tests/test_iterable_dataset.py -k "test_iterable_dataset_batch or test_iterable_dataset_batch_with_table_format or test_iterable_dataset_batch_with_polars_format" -q
  • python -m ruff check src/datasets/arrow_dataset.py src/datasets/iterable_dataset.py src/datasets/table.py tests/test_arrow_dataset.py tests/test_iterable_dataset.py

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@lhoestq left a comment

lgtm !

@lhoestq merged commit 4775eeb into huggingface:main on Apr 10, 2026
2 of 14 checks passed

Development

Successfully merging this pull request may close these issues.

.batch() error on formatted datasets

3 participants