Skip to content

Fast parquet order inversion #17172

@crepererum

Description

@crepererum

Is your feature request related to a problem or challenge?

Often parquet files are written in time-ascending order, i.e. new values are appended at the end. However many users would like to know the last value(s) of a given context. Currently this means that DataFusion would scan the entire parquet file and just keep the relevant bit at the end. However that seems wasteful and I think we can do better.

Describe the solution you'd like

The solution here overlaps with things that we would need to implement in the arrow-rs repository, but I am using an umbrella ticket to illustrate the rough idea.

Part 1 -- Order Inversion

If a parquet file is order col1 ASC, col2 DESC, col3 ASC, then reading it "back to front" should result in data ordered col1 DESC, col2 ASC, col3 DESC, i.e. the sequence of expressions is the same, but the polarity (ascending or descending) of each expression is flipped. So the DataFusion planner should (after all of this is implemented) that it is better to inverse the order and potentially apply a top-k on that than reading an entire file and perform a simple Sort operation.

Part 2 -- Decode Parquet Pages in Reverse Order

Ideally we would be able to decode the parquet files backwards, i.e. "last value first". However this only works for a subset of encodings, while others are clearly order-dependent. But if we look at a parquet file, this order only matters on a page level and technically pages (of a column) can be decoded in reverse order. That is true for ALL encodings.

Part 3 -- Flip a Page

Reading the pages in reverse order however is still not enough. Imagine the following layout:

pages:
 - values:
   - A
   - B
   - C
 - values:
   - D
   - E

Decoding pages in reverse order would yield D, E, A, B C. That's not right and we need to reverse the order within a page. Now my hypothesis is that "flipping" a page would only require inverting the order of an arrow buffer and in contrast to the generic take kernel, this can be implemented in-place and without random access. This highly simplified code for an integer array shows the principle:

fn flip_u64_array(array: &mut [u64]) {
    // if you do this iterator instead of index-based,
    // you can even get rid of the bounds-checking
    for idx in 0..(array.len() / 2) {
        array.swap(idx, array.len() - idx);
    }
}

One can imagine how this even works for more complex types:

  • dictionary encoded data: flip the array that contains the per-row dict keys, but NOT the actual dict
  • nested data (like list): flip top-level array, but not the children (i.e. the lists themselves are NOT flipped)
  • run-offset encoded data: flip both the offset and the value array
  • string views: flip the offset array but NOT the values

A flip kernel would even be useful for DataFusion in general (outside the parquet context), i.e. when we know that we can use this optimization instead of a full sort.

Describe alternatives you've considered

-

Additional context

-

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions