Conversation

@pdames (Member) commented Aug 26, 2025

Summary

This is a large PR. It may be easiest to review by working backwards from key tests like test_default_catalog_impl.py and test_deltacat_api.py.

This PR delivers the first working end-to-end implementation of the core DeltaCAT Catalog table creation, alteration, and data I/O APIs, with proper transactions wrapping all operations. Among other things, it provides:

  1. Inline copy-on-write table compaction, plus table properties to control automated compaction (see the sketch after this list).
  2. Automatic and manual schema evolution, plus table properties to control schema evolution behavior.
  3. Support for writing/reading both schemaless tables and tables with schemas.
  4. Full cross-catalog, recursive metadata copy and backfill support (e.g., to easily backfill major revisions of the catalog metadata storage specification).
  5. Frontpage "overview"/"quickstart" documentation and more detailed Storage, Table, and Schema README doc pages.
  6. Multi-table/namespace/etc. transaction support, i.e., transactions that can operate over any number of objects within the bounds of a single catalog (see the second sketch below).
  7. A comprehensive, auto-generated reader/writer support matrix (built via the new make type-mappings makefile target) in reader_compatibility_mapping.py, covering all Arrow data types, supported dataset types (PyArrow, Pandas, Polars, NumPy, Daft, Ray Data), and supported content types with inline schemas (Parquet, Avro, ORC, Feather). This lets us quickly detect and short-circuit any write that would break a declared supported reader before persisting data or doing any computationally expensive work.
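
To make the feature list above concrete, here is a minimal end-to-end sketch in the spirit of test_default_catalog_impl.py. The function names follow the Catalog APIs described above, but the exact signatures and the table property keys are illustrative assumptions, not the PR's verbatim API:

```python
import deltacat as dc
import pandas as pd

# Initialize DeltaCAT against the default (local filesystem) catalog.
dc.init()

# Create a table whose properties opt in to automated copy-on-write
# compaction and automatic schema evolution. The property keys here are
# placeholders; see the Table README in this PR for the real names.
dc.create_table(
    "events",
    namespace="demo",
    table_properties={
        "compaction": "automatic",        # placeholder key
        "schema_evolution": "automatic",  # placeholder key
    },
)

# Append data. With automatic schema evolution enabled, a later write that
# adds a new column evolves the table schema instead of failing the write.
dc.write_to_table(
    pd.DataFrame({"id": [1, 2], "msg": ["a", "b"]}),
    "events",
    namespace="demo",
)

# Read the table back for verification.
df = dc.read_table("events", namespace="demo")
```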

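Similarly, a multi-object transaction (feature 6) might look roughly like the following; the dc.transaction() context-manager entry point is an assumed shape for illustration, and the PR's actual transaction API may differ:

```python
import deltacat as dc

# Hypothetical context-manager form of a multi-table transaction: every
# operation inside the block commits atomically within a single catalog,
# or none of them do.
with dc.transaction():
    dc.create_namespace("analytics")
    dc.create_table("clicks", namespace="analytics")
    dc.create_table("views", namespace="analytics")
```
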
Testing

test_default_catalog_impl.py validates correct end-to-end behavior of the DeltaCAT Catalog APIs, while test_deltacat_api.py runs more exhaustive storage-layer verifications. test_default_catalog_impl.py has grown very large and should be split across multiple test modules in the future.

Regression Risk

This PR is known to be backwards-incompatible with prior DeltaCAT locator metadata. As mitigation, a helper script (backfill_locator_to_id_mappings.py) reads all metadata files from a source catalog using the old locators and rewrites them using the new locators. No data files are copied, but the rewritten metadata still references data files at their old catalog location, so those files shouldn't be deleted.
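
The backfill might be invoked roughly as follows; the script name comes from this PR, but the flags below are placeholders, so check the script itself for its real interface:

```python
# Hypothetical invocation; the flag names are illustrative, not the script's
# documented CLI:
#
#   python backfill_locator_to_id_mappings.py \
#       --source-catalog-root /catalogs/old \
#       --dest-catalog-root   /catalogs/new
#
# No data files are copied: the rewritten metadata still points at data files
# in the old catalog location, so the old catalog's data must be kept in place.
```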

Checklist

  • Unit tests covering the changes have been added
    • If this is a bugfix, regression tests have been added
  • E2E testing has been performed

pdames added 30 commits July 29, 2025 10:59
@Zyiqin-Miranda (Member) left a comment

Thanks for putting together the initial set of DeltaCAT Catalog APIs! It's very exciting to see how many new features are introduced, especially the extended table management features like schema evolution, nested transactions, and inline COW compaction, which comprehensively cover so many use cases.
I've just done a high-level first pass focusing on the new features; I'll look into all the exhaustive test cases in a more detailed review.

@rnapark (Collaborator) left a comment

Read over all of the features added and changes made!
