feat: use content defined chunking #7589

Open · wants to merge 18 commits into base: main
Conversation

kszucs
Member

@kszucs kszucs commented May 29, 2025

Use content defined chunking by default when writing parquet files.

  • set the parameters in io.parquet.ParquetDatasetReader
  • set the parameters in arrow_writer.ParquetWriter

This requires a new pyarrow pin, ">=21.0.0", which has now been released.
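To make the idea concrete, here is a toy sketch of what content defined chunking does. This is NOT the PR's implementation (the PR only passes a writer option through to pyarrow); it is an illustrative gear-hash chunker showing why boundaries chosen from the data itself survive insertions, which is what makes deduplicating storage effective:

```python
# Illustrative sketch only, not the PR's code: chunk boundaries are derived
# from a rolling hash of the content, so an insertion early in the stream
# does not shift every later chunk boundary.
import random

random.seed(0)
# Toy "gear" table: one pseudo-random 64-bit value per byte value.
GEAR = [random.getrandbits(64) for _ in range(256)]
MASK = (1 << 13) - 1  # expected chunk size around 8 KiB

def cdc_chunks(data: bytes, min_size: int = 1024, max_size: int = 65536):
    """Split `data` where a rolling hash hits a fixed bit pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & MASK) == 0) or length >= max_size:
            chunks.append(data[start : i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

original = bytes(random.getrandbits(8) for _ in range(200_000))
edited = original[:50] + b"inserted bytes" + original[50:]

a, b = cdc_chunks(original), cdc_chunks(edited)
# Only the chunk(s) around the insertion point differ; the chunker
# resynchronizes and all later chunks are byte-identical.
shared = set(a) & set(b)
print(f"{len(shared)} of {len(a)} original chunks reappear unchanged")
```

With fixed-size chunking, the same 14-byte insertion would shift every subsequent chunk boundary and nothing after the edit would deduplicate.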

@kszucs changed the title from "feat: use content defined chunking in io.parquet.ParquetDatasetReader" to "feat: use content defined chunking" on May 29, 2025

@kszucs kszucs force-pushed the cdc branch 2 times, most recently from 960db25 to ef901ea Compare June 8, 2025 15:06
@kszucs
Member Author

kszucs commented Jun 16, 2025

Need to set DEFAULT_MAX_BATCH_SIZE = 1024 * 1024

@kszucs
Member Author

kszucs commented Jul 24, 2025

We should consider enabling page indexes by default when writing parquet files to enable page pruning readers like the next dataset viewer huggingface/dataset-viewer#3199

@kszucs kszucs marked this pull request as ready for review July 25, 2025 10:59
@@ -183,7 +183,9 @@

 # Batch size constants. For more info, see:
 # https://github.com/apache/arrow/blob/master/docs/source/cpp/arrays.rst#size-limitations-and-recommendations
-DEFAULT_MAX_BATCH_SIZE = 1000
+DEFAULT_MAX_BATCH_SIZE = 1024 * 1024
Member Author

@kszucs kszucs Jul 25, 2025

This is Arrow's default row group size. If we choose too small a row group size, we don't benefit as much from CDC chunking.

@lhoestq
Member

lhoestq commented Aug 13, 2025

Need to set DEFAULT_MAX_BATCH_SIZE = 1024 * 1024

maybe we'll need to auto-tweak the row group size to aim for a [30MB-300MB] interval, or we can end up with multiple GBs row groups
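A minimal sketch of the auto-tweak suggested here, assuming the writer can estimate an average row width in bytes. The function name, the byte-size estimate, and the exact clamping strategy are all hypothetical, not from this PR:

```python
# Hypothetical sketch: clamp the row group size (in rows) so each row group
# lands roughly in the [30 MB, 300 MB] interval mentioned above. Names and
# defaults are illustrative, not the PR's code.
MIN_ROW_GROUP_BYTES = 30 * 1024**2
MAX_ROW_GROUP_BYTES = 300 * 1024**2
DEFAULT_MAX_BATCH_SIZE = 1024 * 1024  # rows, the new default in this PR

def tweak_row_group_size(estimated_row_bytes: float,
                         requested_rows: int = DEFAULT_MAX_BATCH_SIZE) -> int:
    """Return a row count whose estimated byte size stays in the target interval."""
    size = requested_rows * estimated_row_bytes
    if size > MAX_ROW_GROUP_BYTES:
        return max(1, int(MAX_ROW_GROUP_BYTES / estimated_row_bytes))
    if size < MIN_ROW_GROUP_BYTES:
        return int(MIN_ROW_GROUP_BYTES / estimated_row_bytes)
    return requested_rows

# Wide rows (e.g. 4 KB of text per row) would give 4 GB row groups at
# 1024 * 1024 rows, so they get clamped down; narrow rows get bumped up.
print(tweak_row_group_size(4096))  # -> 76800 rows (~300 MB row groups)
print(tweak_row_group_size(16))    # -> 1966080 rows (~30 MB row groups)
```

The estimate could come from the size of batches already written, refining the row count as the file grows.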

@severo
Collaborator

severo commented Aug 13, 2025

maybe we'll need to auto-tweak the row group size to aim for a [30MB-300MB] interval, or we can end up with multiple GBs row groups

We should consider enabling page indexes by default when writing parquet files to enable page pruning readers like the next dataset viewer huggingface/dataset-viewer#3199

Would it make sense to use the default row group size and expect that readers will rely on the page index to fetch only the required bits? Not sure if that exists in duckdb.

@lhoestq
Member

lhoestq commented Aug 13, 2025

would it make sense to use the default row group size, and expect the readers will rely on the pages index to fetch only the required bits? Not sure if it exists in duckdb.

Most frameworks read row group by row group; that's why we need them to be of reasonable size anyway.

@severo
Collaborator

severo commented Aug 13, 2025

We should consider enabling page indexes by default when writing parquet files to enable page pruning readers like the next dataset viewer huggingface/dataset-viewer#3199

where would the page indexes be stored? in the custom section in the Parquet file metadata? Is it standardized or ad hoc?

OK, I just RTFM:

write_page_index: bool, default False

Whether to write a page index in general for all columns. Writing statistics to the page index disables the old method of writing statistics to each data page header. The page index makes statistics-based filtering more efficient than the page header, as it gathers all the statistics for a Parquet file in a single place, avoiding scattered I/O. Note that the page index is not yet used on the read side by PyArrow.
