mhaseeb123
Member

@mhaseeb123 mhaseeb123 commented Oct 6, 2025

Description

Follow-up to #20086 and #19986.

This PR skips decompression of parquet data pages that are marked as pruned in the new experimental parquet reader. It also zeroes out the nesting size information (used to allocate output buffers) for pruned pages at the point where it is computed, instead of resetting it later, just before buffer allocation, as was done in #20086.
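As a rough illustration of the first change, the host-side sketch below selects which pages still need decompression given a page mask. It is not the libcudf implementation: `Page` and `pages_to_decompress` are hypothetical names, and the convention that an empty mask means "no pruning information, keep every page" is an assumption modeled on the `page_mask.size() and not page_mask[page_idx]` check that appears in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a page descriptor; not the cudf page type.
struct Page {
  std::size_t compressed_size;
  std::size_t uncompressed_size;
};

// Returns the indices of pages that still need decompression.
// Assumption: an empty mask means no pruning was requested, so all pages are kept.
// A mask entry of `false` marks the corresponding page as pruned.
std::vector<std::size_t> pages_to_decompress(std::vector<Page> const& pages,
                                             std::vector<bool> const& page_mask)
{
  assert(page_mask.empty() || page_mask.size() == pages.size());

  std::vector<std::size_t> selected;
  for (std::size_t i = 0; i < pages.size(); ++i) {
    bool const pruned = !page_mask.empty() && !page_mask[i];
    if (!pruned) { selected.push_back(i); }
  }
  return selected;
}
```

Only the selected indices would then be handed to the decompressor, so pruned pages cost neither decompression time nor scratch memory.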

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot commented Oct 6, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@github-actions github-actions bot added the libcudf label Oct 6, 2025
@mhaseeb123 mhaseeb123 added the feature request, 2 - In Progress, Performance, non-breaking, and cuIO labels Oct 6, 2025
@mhaseeb123 mhaseeb123 requested a review from nvdbaranec October 7, 2025 20:54
@mhaseeb123 mhaseeb123 added the 3 - Ready for Review label and removed the 2 - In Progress label Oct 7, 2025
@mhaseeb123 mhaseeb123 marked this pull request as ready for review October 8, 2025 21:37
@mhaseeb123 mhaseeb123 requested a review from a team as a code owner October 8, 2025 21:37
@mhaseeb123 mhaseeb123 requested a review from lamarrr October 8, 2025 21:37
return 0;
}

// If this page is pruned and has a list parent, set the batch size for this depth to 0 to
Member Author

Removing what we added in #20086, since page sizes (and batch sizes) are now handled in compute_page_sizes_kernel instead.
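To illustrate why the later reset becomes unnecessary, here is a minimal host-side sketch, not the actual kernel. `DepthSizes`, `compute_page_sizes`, and `buffer_size_at_depth` are hypothetical names, and the empty-mask convention is assumed. Once pruned pages are zeroed when sizes are computed, the step that sizes the output buffers can be a plain sum with no knowledge of the mask.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-page sizes, one entry per nesting depth.
using DepthSizes = std::vector<std::size_t>;

// Computes per-page sizes and zeroes every depth of a pruned page right away.
// Assumption: an empty mask means no pages are pruned.
std::vector<DepthSizes> compute_page_sizes(std::vector<DepthSizes> const& raw_sizes,
                                           std::vector<bool> const& page_mask)
{
  assert(page_mask.empty() || page_mask.size() == raw_sizes.size());

  std::vector<DepthSizes> out = raw_sizes;
  for (std::size_t p = 0; p < out.size(); ++p) {
    bool const pruned = !page_mask.empty() && !page_mask[p];
    if (pruned) { std::fill(out[p].begin(), out[p].end(), std::size_t{0}); }
  }
  return out;
}

// Output buffer size for one depth: a plain sum, with no page mask involved.
std::size_t buffer_size_at_depth(std::vector<DepthSizes> const& sizes, std::size_t depth)
{
  std::size_t total = 0;
  for (auto const& page : sizes) { total += page.at(depth); }
  return total;
}
```

With three pages of sizes {4, 8}, {2, 6}, {1, 3} and the middle page pruned, the sums at depths 0 and 1 come out to 5 and 11 instead of 7 and 17.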

thrust::make_counting_iterator<size_t>(key_start),
thrust::make_counting_iterator<size_t>(key_start + num_keys_this_iter),
size_input.begin(),
get_page_nesting_size{
Member Author

page_mask.data() is no longer needed here.

Member Author

@mhaseeb123 mhaseeb123 left a comment

Only two if blocks need special attention during review. The rest is trivial.

return;
}

if (page_mask.size() and not page_mask[page_idx]) {
Member Author

@nvdbaranec @pmattione-nvidia please review this if block

@mhaseeb123 mhaseeb123 added the 4 - Needs Review label and removed the 3 - Ready for Review label Oct 9, 2025
Contributor

@nvdbaranec nvdbaranec left a comment

Just one minor comment.

@mhaseeb123 mhaseeb123 requested a review from nvdbaranec October 10, 2025 21:11

Labels

4 - Needs Review, cuIO, feature request, libcudf, non-breaking, Performance

3 participants