feat: use content defined chunking #7589

Open · wants to merge 18 commits into base: main
Conversation

kszucs
Member

@kszucs kszucs commented May 29, 2025

Use content defined chunking by default when writing parquet files.

  • set the parameters in io.parquet.ParquetDatasetReader
  • set the parameters in arrow_writer.ParquetWriter

This requires a new pyarrow pin, ">=21.0.0", which has now been released.
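To make the idea concrete, here is a toy sketch of what content defined chunking does. This is NOT the PR's implementation (the PR only passes a writer option through to pyarrow); it is an illustrative gear-hash chunker showing why boundaries chosen from the data itself survive insertions, which is what makes deduplicating storage effective:

```python
# Illustrative sketch only, not the PR's code: chunk boundaries are derived
# from a rolling hash of the content, so an insertion early in the stream
# does not shift every later chunk boundary.
import random

random.seed(0)
# Toy "gear" table: one pseudo-random 64-bit value per byte value.
GEAR = [random.getrandbits(64) for _ in range(256)]
MASK = (1 << 13) - 1  # expected chunk size around 8 KiB

def cdc_chunks(data: bytes, min_size: int = 1024, max_size: int = 65536):
    """Split `data` where a rolling hash hits a fixed bit pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & MASK) == 0) or length >= max_size:
            chunks.append(data[start : i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

original = bytes(random.getrandbits(8) for _ in range(200_000))
edited = original[:50] + b"inserted bytes" + original[50:]

a, b = cdc_chunks(original), cdc_chunks(edited)
# Only the chunk(s) around the insertion point differ; the chunker
# resynchronizes and all later chunks are byte-identical.
shared = set(a) & set(b)
print(f"{len(shared)} of {len(a)} original chunks reappear unchanged")
```

With fixed-size chunking, the same 14-byte insertion would shift every subsequent chunk boundary and nothing after the edit would deduplicate.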

@kszucs changed the title from "feat: use content defined chunking in io.parquet.ParquetDatasetReader" to "feat: use content defined chunking" on May 29, 2025

@kszucs kszucs force-pushed the cdc branch 2 times, most recently from 960db25 to ef901ea Compare June 8, 2025 15:06
@kszucs
Member Author

kszucs commented Jun 16, 2025

Need to set DEFAULT_MAX_BATCH_SIZE = 1024 * 1024

@kszucs
Member Author

kszucs commented Jul 24, 2025

We should consider enabling page indexes by default when writing parquet files to enable page pruning readers like the next dataset viewer huggingface/dataset-viewer#3199

@kszucs kszucs marked this pull request as ready for review July 25, 2025 10:59
@@ -183,7 +183,9 @@

 # Batch size constants. For more info, see:
 # https://github.com/apache/arrow/blob/master/docs/source/cpp/arrays.rst#size-limitations-and-recommendations
-DEFAULT_MAX_BATCH_SIZE = 1000
+DEFAULT_MAX_BATCH_SIZE = 1024 * 1024
Member Author

@kszucs kszucs Jul 25, 2025

This is Arrow's default row group size. If we choose too small a row group size, we don't benefit as much from CDC chunking.

@lhoestq
Member

lhoestq commented Aug 13, 2025

Need to set DEFAULT_MAX_BATCH_SIZE = 1024 * 1024

maybe we'll need to auto-tweak the row group size to aim for a [30MB-300MB] interval, or we can end up with multiple GBs row groups
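A minimal sketch of the auto-tweak suggested here, assuming the writer can estimate an average row width in bytes. The function name, the byte-size estimate, and the exact clamping strategy are all hypothetical, not from this PR:

```python
# Hypothetical sketch: clamp the row group size (in rows) so each row group
# lands roughly in the [30 MB, 300 MB] interval mentioned above. Names and
# defaults are illustrative, not the PR's code.
MIN_ROW_GROUP_BYTES = 30 * 1024**2
MAX_ROW_GROUP_BYTES = 300 * 1024**2
DEFAULT_MAX_BATCH_SIZE = 1024 * 1024  # rows, the new default in this PR

def tweak_row_group_size(estimated_row_bytes: float,
                         requested_rows: int = DEFAULT_MAX_BATCH_SIZE) -> int:
    """Return a row count whose estimated byte size stays in the target interval."""
    size = requested_rows * estimated_row_bytes
    if size > MAX_ROW_GROUP_BYTES:
        return max(1, int(MAX_ROW_GROUP_BYTES / estimated_row_bytes))
    if size < MIN_ROW_GROUP_BYTES:
        return int(MIN_ROW_GROUP_BYTES / estimated_row_bytes)
    return requested_rows

# Wide rows (e.g. 4 KB of text per row) would give 4 GB row groups at
# 1024 * 1024 rows, so they get clamped down; narrow rows get bumped up.
print(tweak_row_group_size(4096))  # -> 76800 rows (~300 MB row groups)
print(tweak_row_group_size(16))    # -> 1966080 rows (~30 MB row groups)
```

The estimate could come from the size of batches already written, refining the row count as the file grows.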

@severo
Collaborator

severo commented Aug 13, 2025

maybe we'll need to auto-tweak the row group size to aim for a [30MB-300MB] interval, or we can end up with multiple GBs row groups

We should consider enabling page indexes by default when writing parquet files to enable page pruning readers like the next dataset viewer huggingface/dataset-viewer#3199

Would it make sense to use the default row group size and expect that readers will rely on the page index to fetch only the required bits? Not sure if that exists in duckdb.

@lhoestq
Member

lhoestq commented Aug 13, 2025

would it make sense to use the default row group size, and expect the readers will rely on the pages index to fetch only the required bits? Not sure if it exists in duckdb.

Most frameworks read row group by row group; that's why we need them to be of reasonable size anyway.

@severo
Collaborator

severo commented Aug 13, 2025

We should consider enabling page indexes by default when writing parquet files to enable page pruning readers like the next dataset viewer huggingface/dataset-viewer#3199

where would the page indexes be stored? in the custom section in the Parquet file metadata? Is it standardized or ad hoc?

OK, I just RTFM:

write_page_index: bool, default False

Whether to write a page index in general for all columns. Writing statistics to the page index disables the old method of writing statistics to each data page header. The page index makes statistics-based filtering more efficient than the page header, as it gathers all the statistics for a Parquet file in a single place, avoiding scattered I/O. Note that the page index is not yet used on the read side by PyArrow.
