Skip to content

Perf: fast CursorValues compare for StringViewArray using inline_key_… #16630

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jul 4, 2025

Conversation

zhuqi-lucas
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas commented Jun 30, 2025

…fast

Which issue does this PR close?

The arrow-rs compare improvement has been merged, we can apply it to CursorValues compare for StringViewArray in datafusion.

Related arrow-rs changes:
apache/arrow-rs#7748

Rationale for this change

What changes are included in this PR?

Optimizing the CursorValues compare for StringViewArray using fast inlined fast key, and compare performance.

Are these changes tested?

Yes

--------------------
Benchmark sort_tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      main ┃ fast_sort_with_inlined_fast_key ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1108.19 ms │                       106.83 ms │     no change │
│ Q2101.67 ms │                       102.40 ms │     no change │
│ Q3665.25 ms │                       657.33 ms │     no change │
│ Q4129.60 ms │                       128.60 ms │     no change │
│ Q5260.81 ms │                       259.19 ms │     no change │
│ Q6277.54 ms │                       276.67 ms │     no change │
│ Q7457.42 ms │                       445.38 ms │     no change │
│ Q8306.32 ms │                       304.63 ms │     no change │
│ Q9319.58 ms │                       315.52 ms │     no change │
│ Q10466.82 ms │                       469.02 ms │     no change │
│ Q11253.35 ms │                       205.67 ms │ +1.23x faster │
└──────────────┴───────────┴─────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                              ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)3346.55ms │
│ Total Time (fast_sort_with_inlined_fast_key)3271.23ms │
│ Average Time (main)304.23ms │
│ Average Time (fast_sort_with_inlined_fast_key)297.38ms │
│ Queries Faster1 │
│ Queries Slower0 │
│ Queries with No Change10 │
│ Queries with Failure0 │
└────────────────────────────────────────────────┴───────────┘

Are there any user-facing changes?

No

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Jun 30, 2025
@zhuqi-lucas zhuqi-lucas force-pushed the fast_sort_with_inlined_fast_key branch from 2551c37 to 95b8d73 Compare July 1, 2025 06:50
@@ -493,13 +493,12 @@ impl<C: CursorValues> SortPreservingMergeStream<C> {
if self.enable_round_robin_tie_breaker && cmp_node == 1 {
match (&self.cursors[winner], &self.cursors[challenger]) {
(Some(ac), Some(bc)) => {
let ord = ac.cmp(bc);
if ord.is_eq() {
if ac == bc {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also change this to using eq instead of compare first, this should also optimize some cases.

@zhuqi-lucas
Copy link
Contributor Author

The latest PR shows almost 1.4 fast in my local MAC.

@alamb
Copy link
Contributor

alamb commented Jul 1, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fast_sort_with_inlined_fast_key (82ab9df) to 9bb309c diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zhuqi-lucas -- this looks wonderful

I kicked off some benchmark runs. I think we'll be ready to merge once the benchmarks are done and we add some tests.

Thank you again for pushing this along. BTW I have been using your work as a case study in how to do performance optimization https://x.com/andrewlamb1111/status/1940145869833085246 ❤️

@@ -288,6 +288,64 @@ impl CursorArray for StringViewArray {
}
}

/// Todo use arrow-rs side api after: <https://github.com/apache/arrow-rs/pull/7748> released
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

/// ⇒ key("bar") < key("bar\0")
/// ```
#[inline(always)]
pub fn inline_key_fast(raw: u128) -> u128 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I know this is only temporary until we use the code in apache/arrow-rs#7748 , I think we should add some unit tests to make sure we don't accidentally break it

Perhaps we can just copy/paste the test_inline_key_fast_various_lengths_and_lexical test from https://github.com/apache/arrow-rs/pull/7748/files#diff-160ecd8082d5d28081f01cdb08a898cb8f49b17149c7118bf96746ddaae24b4fR1168 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alamb for good point, addressed in latest PR!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alamb,

I will port the apache/arrow-rs#7875 fix, after it merged!

@alamb
Copy link
Contributor

alamb commented Jul 1, 2025

🤖: Benchmark completed

Details

Comparing HEAD and fast_sort_with_inlined_fast_key
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ fast_sort_with_inlined_fast_key ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  1959.45 ms │                      1874.85 ms │ no change │
│ QQuery 1     │   716.85 ms │                       682.99 ms │ no change │
│ QQuery 2     │  1379.59 ms │                      1358.63 ms │ no change │
│ QQuery 3     │   669.13 ms │                       662.60 ms │ no change │
│ QQuery 4     │  1384.88 ms │                      1359.05 ms │ no change │
│ QQuery 5     │ 15154.52 ms │                     15103.16 ms │ no change │
│ QQuery 6     │  2053.03 ms │                      2065.88 ms │ no change │
│ QQuery 7     │  1944.19 ms │                      1907.83 ms │ no change │
└──────────────┴─────────────┴─────────────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                              ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                              │ 25261.64ms │
│ Total Time (fast_sort_with_inlined_fast_key)   │ 25014.99ms │
│ Average Time (HEAD)                            │  3157.70ms │
│ Average Time (fast_sort_with_inlined_fast_key) │  3126.87ms │
│ Queries Faster                                 │          0 │
│ Queries Slower                                 │          0 │
│ Queries with No Change                         │          8 │
│ Queries with Failure                           │          0 │
└────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ fast_sort_with_inlined_fast_key ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.22 ms │                         2.26 ms │     no change │
│ QQuery 1     │    33.74 ms │                        33.58 ms │     no change │
│ QQuery 2     │    81.83 ms │                        81.25 ms │     no change │
│ QQuery 3     │    99.56 ms │                        98.29 ms │     no change │
│ QQuery 4     │   595.65 ms │                       591.31 ms │     no change │
│ QQuery 5     │   857.94 ms │                       862.35 ms │     no change │
│ QQuery 6     │     2.20 ms │                         2.22 ms │     no change │
│ QQuery 7     │    38.99 ms │                        39.03 ms │     no change │
│ QQuery 8     │   870.83 ms │                       865.10 ms │     no change │
│ QQuery 9     │  1191.06 ms │                      1189.45 ms │     no change │
│ QQuery 10    │   264.17 ms │                       257.69 ms │     no change │
│ QQuery 11    │   295.33 ms │                       283.15 ms │     no change │
│ QQuery 12    │   866.71 ms │                       882.61 ms │     no change │
│ QQuery 13    │  1251.38 ms │                      1157.31 ms │ +1.08x faster │
│ QQuery 14    │   806.11 ms │                       815.12 ms │     no change │
│ QQuery 15    │   789.04 ms │                       773.35 ms │     no change │
│ QQuery 16    │  1588.58 ms │                      1619.14 ms │     no change │
│ QQuery 17    │  1597.89 ms │                      1611.10 ms │     no change │
│ QQuery 18    │  2850.79 ms │                      2893.84 ms │     no change │
│ QQuery 19    │    86.69 ms │                        87.67 ms │     no change │
│ QQuery 20    │  1151.87 ms │                      1184.58 ms │     no change │
│ QQuery 21    │  1302.52 ms │                      1340.23 ms │     no change │
│ QQuery 22    │  2128.11 ms │                      2212.12 ms │     no change │
│ QQuery 23    │  7475.95 ms │                      7440.17 ms │     no change │
│ QQuery 24    │   446.12 ms │                       438.83 ms │     no change │
│ QQuery 25    │   300.09 ms │                       311.06 ms │     no change │
│ QQuery 26    │   444.03 ms │                       456.63 ms │     no change │
│ QQuery 27    │  1537.19 ms │                      1565.20 ms │     no change │
│ QQuery 28    │ 12926.19 ms │                     11838.29 ms │ +1.09x faster │
│ QQuery 29    │   539.78 ms │                       531.80 ms │     no change │
│ QQuery 30    │   777.35 ms │                       796.65 ms │     no change │
│ QQuery 31    │   810.08 ms │                       816.43 ms │     no change │
│ QQuery 32    │  2410.45 ms │                      2424.99 ms │     no change │
│ QQuery 33    │  3161.23 ms │                      3192.31 ms │     no change │
│ QQuery 34    │  3257.99 ms │                      3264.91 ms │     no change │
│ QQuery 35    │  1277.55 ms │                      1269.33 ms │     no change │
│ QQuery 36    │   123.77 ms │                       123.61 ms │     no change │
│ QQuery 37    │    51.71 ms │                        51.52 ms │     no change │
│ QQuery 38    │   118.14 ms │                       122.98 ms │     no change │
│ QQuery 39    │   196.57 ms │                       198.79 ms │     no change │
│ QQuery 40    │    40.52 ms │                        41.97 ms │     no change │
│ QQuery 41    │    38.77 ms │                        37.49 ms │     no change │
│ QQuery 42    │    32.74 ms │                        32.90 ms │     no change │
└──────────────┴─────────────┴─────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                              ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                              │ 54719.47ms │
│ Total Time (fast_sort_with_inlined_fast_key)   │ 53838.60ms │
│ Average Time (HEAD)                            │  1272.55ms │
│ Average Time (fast_sort_with_inlined_fast_key) │  1252.06ms │
│ Queries Faster                                 │          2 │
│ Queries Slower                                 │          0 │
│ Queries with No Change                         │         41 │
│ Queries with Failure                           │          0 │
└────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ fast_sort_with_inlined_fast_key ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1     │ 102.19 ms │                       101.42 ms │ no change │
│ QQuery 2     │  20.99 ms │                        21.17 ms │ no change │
│ QQuery 3     │  32.26 ms │                        32.88 ms │ no change │
│ QQuery 4     │  18.62 ms │                        18.60 ms │ no change │
│ QQuery 5     │  49.97 ms │                        48.76 ms │ no change │
│ QQuery 6     │  11.80 ms │                        11.95 ms │ no change │
│ QQuery 7     │  90.10 ms │                        90.28 ms │ no change │
│ QQuery 8     │  25.44 ms │                        25.14 ms │ no change │
│ QQuery 9     │  54.88 ms │                        54.05 ms │ no change │
│ QQuery 10    │  42.71 ms │                        43.23 ms │ no change │
│ QQuery 11    │  11.43 ms │                        11.46 ms │ no change │
│ QQuery 12    │  34.32 ms │                        34.39 ms │ no change │
│ QQuery 13    │  26.03 ms │                        26.22 ms │ no change │
│ QQuery 14    │   9.79 ms │                         9.82 ms │ no change │
│ QQuery 15    │  19.53 ms │                        19.32 ms │ no change │
│ QQuery 16    │  18.03 ms │                        18.51 ms │ no change │
│ QQuery 17    │  94.72 ms │                        96.35 ms │ no change │
│ QQuery 18    │ 195.27 ms │                       193.54 ms │ no change │
│ QQuery 19    │  24.74 ms │                        24.71 ms │ no change │
│ QQuery 20    │  32.22 ms │                        31.67 ms │ no change │
│ QQuery 21    │ 144.80 ms │                       144.06 ms │ no change │
│ QQuery 22    │  14.50 ms │                        14.74 ms │ no change │
└──────────────┴───────────┴─────────────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                              ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                              │ 1074.36ms │
│ Total Time (fast_sort_with_inlined_fast_key)   │ 1072.26ms │
│ Average Time (HEAD)                            │   48.83ms │
│ Average Time (fast_sort_with_inlined_fast_key) │   48.74ms │
│ Queries Faster                                 │         0 │
│ Queries Slower                                 │         0 │
│ Queries with No Change                         │        22 │
│ Queries with Failure                           │         0 │
└────────────────────────────────────────────────┴───────────┘

@alamb
Copy link
Contributor

alamb commented Jul 1, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fast_sort_with_inlined_fast_key (82ab9df) to 9bb309c diff using: sort_tpch
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Jul 1, 2025

🤖: Benchmark completed

Details

Comparing HEAD and fast_sort_with_inlined_fast_key
--------------------
Benchmark sort_tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ fast_sort_with_inlined_fast_key ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  325.73 ms │                       343.28 ms │  1.05x slower │
│ Q2           │  323.40 ms │                       296.73 ms │ +1.09x faster │
│ Q3           │ 1094.10 ms │                      1061.63 ms │     no change │
│ Q4           │  413.50 ms │                       422.10 ms │     no change │
│ Q5           │  396.88 ms │                       403.57 ms │     no change │
│ Q6           │  433.31 ms │                       437.50 ms │     no change │
│ Q7           │  792.53 ms │                       791.69 ms │     no change │
│ Q8           │  665.29 ms │                       684.66 ms │     no change │
│ Q9           │  700.98 ms │                       727.78 ms │     no change │
│ Q10          │ 1027.52 ms │                      1042.44 ms │     no change │
│ Q11          │  579.11 ms │                       549.83 ms │ +1.05x faster │
└──────────────┴────────────┴─────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                              ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                              │ 6752.35ms │
│ Total Time (fast_sort_with_inlined_fast_key)   │ 6761.22ms │
│ Average Time (HEAD)                            │  613.85ms │
│ Average Time (fast_sort_with_inlined_fast_key) │  614.66ms │
│ Queries Faster                                 │         2 │
│ Queries Slower                                 │         1 │
│ Queries with No Change                         │         8 │
│ Queries with Failure                           │         0 │
└────────────────────────────────────────────────┴───────────┘

…cas/arrow-datafusion into fast_sort_with_inlined_fast_key
@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Jul 2, 2025

Thanks @zhuqi-lucas -- this looks wonderful

I kicked off some benchmark runs. I think we'll be ready to merge once the benchmarks are done and we add some tests.

Thank you again for pushing this along. BTW I have been using your work as a case study in how to do performance optimization https://x.com/andrewlamb1111/status/1940145869833085246 ❤️

Thank you very much @alamb! I really appreciate your kind words and support.

I’ve also shared this work on my LinkedIn, it's been a fantastic experience joining the Apache DataFusion / Arrow Rust community and contributing to performance improvements.

Looking forward to doing more!

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Jul 2, 2025

🤖: Benchmark completed

Details

Comparing HEAD and fast_sort_with_inlined_fast_key
--------------------
Benchmark sort_tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ fast_sort_with_inlined_fast_key ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  325.73 ms │                       343.28 ms │  1.05x slower │
│ Q2           │  323.40 ms │                       296.73 ms │ +1.09x faster │
│ Q3           │ 1094.10 ms │                      1061.63 ms │     no change │
│ Q4           │  413.50 ms │                       422.10 ms │     no change │
│ Q5           │  396.88 ms │                       403.57 ms │     no change │
│ Q6           │  433.31 ms │                       437.50 ms │     no change │
│ Q7           │  792.53 ms │                       791.69 ms │     no change │
│ Q8           │  665.29 ms │                       684.66 ms │     no change │
│ Q9           │  700.98 ms │                       727.78 ms │     no change │
│ Q10          │ 1027.52 ms │                      1042.44 ms │     no change │
│ Q11          │  579.11 ms │                       549.83 ms │ +1.05x faster │
└──────────────┴────────────┴─────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                              ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                              │ 6752.35ms │
│ Total Time (fast_sort_with_inlined_fast_key)   │ 6761.22ms │
│ Average Time (HEAD)                            │  613.85ms │
│ Average Time (fast_sort_with_inlined_fast_key) │  614.66ms │
│ Queries Faster                                 │         2 │
│ Queries Slower                                 │         1 │
│ Queries with No Change                         │         8 │
│ Queries with Failure                           │         0 │
└────────────────────────────────────────────────┴───────────┘

The Q1 regression seems noise, because it does not have utf8view sort key, the l_linenumber is Integer:

        SELECT l_linenumber, l_partkey
        FROM lineitem
        ORDER BY l_linenumber
      

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @zhuqi-lucas

@alamb alamb merged commit 0185da6 into apache:main Jul 4, 2025
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-plan Changes to the physical-plan crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Continue optimizing the CursorValues compare for StringViewArray
3 participants