feat: new settings fuse_parquet_read_batch_size #17682

Draft: wants to merge 11 commits into main


dantengsky (Member) commented Apr 1, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Introduce a new setting, fuse_parquet_read_batch_size, which controls the batch size used when deserializing Parquet data blocks of fuse tables.

The default value is 8192. In preliminary TPCH tests, this setting performed well.
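
To illustrate what a read batch size means in practice, here is a minimal sketch. It assumes the setting caps the number of rows emitted per deserialized batch, as is typical for Parquet readers; the PR text does not spell this out. The helper `batch_ranges` is hypothetical and is not part of Databend.

```python
from typing import Iterator, Tuple

DEFAULT_BATCH_SIZE = 8192  # default value stated in this PR


def batch_ranges(num_rows: int, batch_size: int = DEFAULT_BATCH_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) row ranges that cover num_rows rows.

    Hypothetical helper: it only models how a block of rows would be
    split into batches of at most batch_size rows.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, num_rows, batch_size):
        yield start, min(start + batch_size, num_rows)


# A 20,000-row block is split into two full batches and one partial batch.
print(list(batch_ranges(20_000)))
# [(0, 8192), (8192, 16384), (16384, 20000)]
```

Under this assumption, a smaller batch size yields more, smaller batches per block, while a larger one yields fewer, larger batches.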

TPCH SF 300 q1:

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from lineitem
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
  • Single Query Node
  • Disk cache enabled
  • Table lineitem fully cached (hot)

|          | round 1  | round 2  | round 3  |
|----------|----------|----------|----------|
| v1.2.711 | 37.965 s | 38.161 s | 37.933 s |
| this PR  | 23.938 s | 23.656 s | 23.845 s |
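
As a rough reading of these timings, the following sketch averages the three rounds per build and derives the speedup. The numbers are copied from the results above; averaging three rounds is an illustrative choice, not a method stated in the PR.

```python
from statistics import mean

# Wall-clock times in seconds, taken from the three rounds reported above.
baseline = [37.965, 38.161, 37.933]  # v1.2.711
this_pr = [23.938, 23.656, 23.845]   # this PR

baseline_mean = mean(baseline)
pr_mean = mean(this_pr)

speedup = baseline_mean / pr_mean        # how many times faster
reduction = 1 - pr_mean / baseline_mean  # fraction of runtime saved

print(f"baseline mean: {baseline_mean:.3f} s")
print(f"this PR mean:  {pr_mean:.3f} s")
print(f"speedup:       {speedup:.2f}x")
print(f"reduction:     {reduction:.1%}")
```

On these three rounds the query runs about 1.6x faster, which is roughly 37% less wall-clock time. This covers a single query on a hot cache, so it should not be read as a general result.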

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - use existing tests

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Apr 1, 2025
@dantengsky dantengsky added ci-benchmark Benchmark: run all test and removed pr-feature this PR introduces a new feature to the codebase labels Apr 1, 2025
@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Apr 1, 2025

github-actions bot (Contributor) commented Apr 1, 2025

Docker Image for PR

  • tag: pr-17682-a3bc292-1743495043

note: this image tag is only available for internal use.

@dantengsky dantengsky added ci-benchmark Benchmark: run all test and removed ci-benchmark Benchmark: run all test labels Apr 1, 2025

github-actions bot (Contributor) commented Apr 1, 2025

Docker Image for PR

  • tag: pr-17682-57178b6-1743500122

note: this image tag is only available for internal use.

@dantengsky dantengsky force-pushed the feat-fuse-parquet-batch-size branch from ea35831 to 4af7cf5 Compare April 1, 2025 10:20
@dantengsky dantengsky added ci-benchmark Benchmark: run all test and removed ci-benchmark Benchmark: run all test labels Apr 1, 2025

github-actions bot (Contributor) commented Apr 1, 2025

Docker Image for PR

  • tag: pr-17682-14bbf11-1743511848

note: this image tag is only available for internal use.

@dantengsky dantengsky added the ci-benchmark-cloud Benchmark: run only cloud tests for tpch/hits label Apr 2, 2025

github-actions bot (Contributor) commented Apr 2, 2025

Docker Image for PR

  • tag: pr-17682-f9c6476-1743569154

note: this image tag is only available for internal use.

@dantengsky dantengsky force-pushed the feat-fuse-parquet-batch-size branch from f3157fc to 920124e Compare April 2, 2025 05:07
@dantengsky dantengsky added ci-benchmark-cloud Benchmark: run only cloud tests for tpch/hits and removed ci-benchmark-cloud Benchmark: run only cloud tests for tpch/hits ci-benchmark Benchmark: run all test labels Apr 3, 2025

github-actions bot (Contributor) commented Apr 3, 2025

Docker Image for PR

  • tag: pr-17682-d7e16fa-1743657627

note: this image tag is only available for internal use.

@dantengsky dantengsky force-pushed the feat-fuse-parquet-batch-size branch from 920124e to 0d3a27c Compare April 21, 2025 11:43
Labels
ci-benchmark-cloud Benchmark: run only cloud tests for tpch/hits pr-feature this PR introduces a new feature to the codebase
1 participant