"Loose" Schema Mode #4

@michaelmoss

Hi, this is a great, useful library.

I tested this library's compatibility with various 'schema evolution' scenarios: I changed the protobuf (added fields, renamed fields, changed optional->repeated, etc.), passed it to the ProtoParquetWriter, and tried to read it back using ProtoParquetRDD. I found that many 'legal' protobuf evolution rules, like field renames or type changes, were not compatible with this library.

Now, I don't want to conflate this with Parquet's own schema evolution rules or Spark's 'schema merging' capabilities (http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging), but overall it seems technically feasible to have a 'loose' mode where fields that exist in Parquet and have a compatible protobuf type hydrate the protobuf, instead of any schema mismatch making the process fail hard and fast.
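
To make the idea concrete, here is a rough sketch in plain Python. It is not this library's API: schemas are modelled as plain dicts, and `COMPATIBLE_TYPES` and `loose_hydrate` are hypothetical names made up for illustration. It also matches fields by name only, so it would not cover renames, which would need matching on field numbers instead.

```python
# Illustrative only: schemas are plain dicts of field name -> (type, is_repeated).
# COMPATIBLE_TYPES and loose_hydrate are hypothetical names, not library APIs.

# Pairs (type in file, type in target) accepted in addition to exact matches.
COMPATIBLE_TYPES = {("int32", "int64")}


def loose_hydrate(row, file_schema, target_schema):
    """Return (message, skipped): the fields that could be copied, and the rest."""
    message, skipped = {}, []
    for name, (target_type, target_repeated) in target_schema.items():
        if name not in file_schema or name not in row:
            skipped.append(name)  # e.g. a field added after the file was written
            continue
        file_type, file_repeated = file_schema[name]
        same_or_compatible = (
            file_type == target_type or (file_type, target_type) in COMPATIBLE_TYPES
        )
        if not same_or_compatible or (file_repeated and not target_repeated):
            skipped.append(name)  # incompatible: leave unset instead of raising
            continue
        value = row[name]
        if target_repeated and not file_repeated:
            value = [value]  # optional -> repeated: wrap the single value
        message[name] = value
    return message, skipped


file_schema = {
    "host": ("string", False),
    "port": ("int32", False),
    "tag": ("string", False),
}
target_schema = {
    "host": ("string", True),     # optional -> repeated
    "port": ("int64", False),     # widened integer type
    "tag": ("int32", False),      # incompatible type change
    "timeout": ("int32", False),  # new field, absent from the file
}
message, skipped = loose_hydrate(
    {"host": "a", "port": 80, "tag": "x"}, file_schema, target_schema
)
print(message)  # {'host': ['a'], 'port': 80}
print(skipped)  # ['tag', 'timeout']
```

Returning the skipped field names alongside the message would let a caller log what was dropped rather than lose it silently.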

Any thoughts on this?

Examples:
Adding new Enum:
Caused by: org.apache.parquet.io.InvalidRecordException: Illegal enum value

Adding new field:
org.apache.parquet.schema.IncompatibleSchemaModificationException: Cant find "timeout" Scheme mismatch
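
For these two cases, a loose mode could amount to a switch that leaves the value unset where the strict path raises. A minimal sketch, assuming hypothetical helpers `read_enum` and `read_field` and a stand-in `SchemaMismatch` exception (none of these are Parquet or library classes):

```python
class SchemaMismatch(Exception):
    """Stand-in for the exceptions quoted above; not a Parquet class."""


def read_enum(symbol, known_symbols, loose=False):
    """Return the enum symbol if the target protobuf knows it."""
    if symbol in known_symbols:
        return symbol
    if loose:
        return None  # unknown value: leave the field unset
    raise SchemaMismatch(f"Illegal enum value: {symbol}")


def read_field(row, name, loose=False):
    """Return the stored value of `name`, if the file has that field."""
    if name in row:
        return row[name]
    if loose:
        return None  # field missing from the file: leave it unset
    raise SchemaMismatch(f'Cannot find "{name}"')


row = {"status": "ARCHIVED"}       # written with a newer enum value
known_status = {"ACTIVE", "DELETED"}

print(read_enum(row["status"], known_status, loose=True))  # None
print(read_field(row, "timeout", loose=True))              # None

try:
    read_field(row, "timeout")     # strict mode still fails fast
except SchemaMismatch as err:
    print(err)                     # Cannot find "timeout"
```

Keeping strict as the default would preserve the current fail-fast behaviour for anyone who depends on it.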
