Add metadata index for Parquet files

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
Related to #5855 and #5853.

One of the pain points of reading Parquet files is the all-or-nothing nature of the file metadata, which is stored as a Thrift-encoded blob in the file's footer. A traditional parser built from Thrift-generated code decodes the entire `FileMetaData` structure, which can be very costly for files with extremely large schemas. The new parsing code introduced recently (#5854) reduces this cost somewhat by skipping unwanted structures, but as currently implemented it still has to walk the Thrift framing of those structures even when it does not fully decode them.

**Describe the solution you'd like**
One solution to the above is to provide an index into the serialized metadata so that only the requested structures are parsed. A full implementation would be used together with row group selections or column projections, and would also help predicate processing (for instance, reading column chunk statistics only for columns that appear in the predicate). This will also need the options object detailed in #8643.

Such an index could be embedded in the `FileMetaData` in the manner described in the [Binary Protocol Extensions](https://github.com/apache/parquet-format/blob/master/BinaryProtocolExtensions.md) section of the Parquet specification.
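
To make the idea concrete, here is a rough sketch of how a reader might consult such an index. Everything in it is hypothetical: `MetadataIndex`, `ranges_for`, and `select_slices` are illustrative names, not existing APIs, and the layout (one byte range per column chunk per row group) is only an assumption about what the index could record. It does not reflect the Thrift encoding or any agreed-upon format.

```rust
use std::ops::Range;

/// Hypothetical index over the serialized footer (not an existing API).
/// `column_chunks[rg][col]` is assumed to hold the byte range of the
/// metadata for column `col` in row group `rg`.
#[derive(Debug, Default)]
pub struct MetadataIndex {
    column_chunks: Vec<Vec<Range<usize>>>,
}

impl MetadataIndex {
    pub fn new(column_chunks: Vec<Vec<Range<usize>>>) -> Self {
        Self { column_chunks }
    }

    /// Returns the byte ranges that would need decoding for the given
    /// row group selection and column projection. Row groups or columns
    /// that are not in the index are silently skipped.
    pub fn ranges_for(&self, row_groups: &[usize], columns: &[usize]) -> Vec<Range<usize>> {
        let mut out = Vec::new();
        for &rg in row_groups {
            if let Some(chunks) = self.column_chunks.get(rg) {
                for &col in columns {
                    if let Some(range) = chunks.get(col) {
                        out.push(range.clone());
                    }
                }
            }
        }
        out
    }
}

/// Slices the selected ranges out of the footer bytes. Ranges that fall
/// outside the buffer are dropped rather than causing a panic.
pub fn select_slices<'a>(footer: &'a [u8], ranges: &[Range<usize>]) -> Vec<&'a [u8]> {
    ranges.iter().filter_map(|r| footer.get(r.clone())).collect()
}
```

With an index like this, a reader projecting one column from one row group would hand only the matching slice to the Thrift decoder instead of walking the whole footer. Whether the index should store absolute offsets, lengths, or something more compact is an open design question.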

**Describe alternatives you've considered**


**Additional context**

Add metadata index for Parquet files #8713

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Add metadata index for Parquet files #8713

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions