Initial E2E Implementation of DeltaCAT Native Catalog APIs #570
Conversation
…rted dataset types.
…idation on write.
…ractive transaction.
…ine which duplicate records to keep.
Thanks for putting together this initial set of DeltaCAT Catalog APIs. It's very exciting to see how many new features are introduced, especially the extended table management features like schema evolution, nested transactions, and inline COW compaction, which cover so many use cases.
I just did a high-level first pass focusing on the new features added; I'll look into all the exhaustive test cases in a more detailed review.
Read over all the features added/changes made!
Summary
This is a large PR. It may be easiest to review by working backwards from key tests like test_default_catalog_impl.py and test_deltacat_api.py.
It provides the first E2E working implementation of core DeltaCAT Catalog table creation, alteration, and data IO APIs, with proper transactions wrapping all operations. Among other things, it provides:
- A reader/writer support matrix in reader_compatibility_mapping.py (generated via the make type-mappings makefile target) covering all Arrow data types, supported dataset types (PyArrow, Pandas, Polars, NumPy, Daft, Ray Data), and supported content types with inline schema (Parquet, Avro, ORC, Feather). This allows us to quickly detect and short-circuit any write that would break a declared supported reader before persisting data or doing any computationally expensive work.
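For a rough sense of how this gate works, here is a minimal sketch; the mapping structure and names below (READER_COMPATIBILITY, check_write_compatibility) are illustrative placeholders rather than the actual contents of reader_compatibility_mapping.py:

```python
from typing import Dict, FrozenSet, Tuple

import pyarrow as pa

# Hypothetical shape: (arrow type string, content type) -> dataset types that
# can faithfully read the value back. The real mapping is generated by the
# `make type-mappings` target and covers every supported combination.
READER_COMPATIBILITY: Dict[Tuple[str, str], FrozenSet[str]] = {
    ("int64", "parquet"): frozenset({"pyarrow", "pandas", "polars", "numpy", "daft", "ray_data"}),
    ("timestamp[ns]", "avro"): frozenset({"pyarrow", "pandas", "polars", "daft"}),
}


def check_write_compatibility(
    schema: pa.Schema,
    content_type: str,
    declared_readers: FrozenSet[str],
) -> None:
    """Fail fast if a write would break a reader the table declares it supports."""
    for field in schema:
        supported = READER_COMPATIBILITY.get((str(field.type), content_type), frozenset())
        broken = declared_readers - supported
        if broken:
            raise ValueError(
                f"Writing column '{field.name}' ({field.type}) as {content_type} "
                f"would break declared readers: {sorted(broken)}"
            )


# A table declaring PyArrow + Polars support accepts this write up front.
schema = pa.schema([pa.field("event_time", pa.timestamp("ns"))])
check_write_compatibility(schema, "avro", frozenset({"pyarrow", "polars"}))
```

Because the check only consults the declared schema and content type, it can reject an incompatible write before any data is persisted or any expensive conversion work begins.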
Testing
test_default_catalog_impl.py validates correct behavior of the DeltaCAT Catalog APIs end-to-end, and test_deltacat_api.py runs more exhaustive storage-layer verifications. test_default_catalog_impl.py has grown very large and should be broken out across multiple test scripts in the future.
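For readers skimming those tests, here is a pytest-style sketch of the end-to-end shape they exercise (create a table, write a dataset, read it back); the deltacat entry points and arguments shown are assumptions for illustration, so treat test_default_catalog_impl.py itself as the source of truth for signatures:

```python
# Illustrative only: the entry points and keyword arguments below (dc.init,
# dc.create_table, dc.write_to_table, dc.read_table) are assumed for this
# sketch and may not match the real API exactly.
import pandas as pd
import pyarrow as pa

import deltacat as dc


def test_create_write_read_roundtrip():
    dc.init()  # assumed: initialize a default (local/test) catalog

    schema = pa.schema([("id", pa.int64()), ("name", pa.string())])
    dc.create_table("users", schema=schema)  # assumed signature

    df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
    dc.write_to_table(df, "users")  # assumed signature

    # Assumed; the return type depends on the dataset type requested.
    result = dc.read_table("users")
    assert len(result) == 2
```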
Regression Risk
There is a known backwards incompatibility with prior DeltaCAT locator metadata. Mitigation is provided via a helper script (backfill_locator_to_id_mappings.py) that reads all metadata files from a source catalog using the old locators and rewrites them using the new locators (no data file copies are required, but the data files will still be referenced from the old catalog location, and thus shouldn't be deleted).
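As a conceptual illustration of what that backfill pass does (not the actual script), here is a sketch; the directory layout, file pattern, and locator_to_id() helper are invented for illustration:

```python
# Conceptual sketch only: walk every metadata file in a source catalog,
# derive the new ID-based path from the legacy locator, and rewrite the
# metadata there. Data files are untouched, and the old metadata stays in
# place (still referenced by the old catalog location), so nothing is deleted.
import hashlib
import shutil
from pathlib import Path


def locator_to_id(locator: str) -> str:
    """Placeholder for the new locator -> ID mapping used by DeltaCAT."""
    return hashlib.sha1(locator.encode("utf-8")).hexdigest()


def backfill(source_catalog_root: str) -> None:
    root = Path(source_catalog_root)
    for old_metadata_file in root.rglob("*.json"):
        # The legacy locator is assumed to be encoded in the old on-disk path.
        legacy_locator = str(old_metadata_file.relative_to(root).parent)
        new_dir = root / locator_to_id(legacy_locator)
        new_dir.mkdir(parents=True, exist_ok=True)
        # Rewrite (copy) the metadata under the new ID-based path; the original
        # file is left alone so existing references keep resolving.
        shutil.copy2(old_metadata_file, new_dir / old_metadata_file.name)
```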
Checklist
Unit tests covering the changes have been added
E2E testing has been performed