Before publishing a pull request, ensure that the following validations pass:
make test
make test-integration
make lint
We use make to automate common development tasks. If you feel that any recurring development routine is missing from the current set of Makefile targets, please propose a new one!
make venv
Creates a virtual environment in the venv directory with all required development dependencies installed. You usually don't need to run this command directly, since it is invoked automatically by any other target that needs it.
make clean-venv
Removes all artifacts created by the venv Makefile target.
make build
Builds a redistributable wheel to the build directory. Stores intermediate artifacts in the dist directory.
make clean-build
Removes all non-virtual-environment artifacts created by the build Makefile target.
make rebuild
Runs clean-build followed by build.
make clean
Removes all artifacts created by the build and venv Makefile targets.
make deploy-s3
Builds and uploads a wheel to S3. See Build and Deploy an S3 Wheel.
make install
Installs all developer requirements from dev-requirements.txt in your virtual environment.
make lint
Runs the linter to ensure that code in your local workspace conforms to code-style guidelines.
make test
Runs all unit tests.
Note
To run an individual unit test where my_deltacat_test appears in the test file, class, or function/method name, run a command of the form:
python -m pytest -k "my_deltacat_test" -s -vv
The -s flag disables output capturing so that you can see stdout from print and other statements in real time, and -vv lets you see each test version and its input parameters to ease debugging.
make test-integration
Runs all integration tests.
make test-integration-rebuild
Rebuilds the integration test environment.
make benchmark-aws
Runs AWS benchmarks.
You can deploy and test your local DeltaCAT changes on any AWS environment that can run Ray applications (e.g. EC2, Glue for Ray, EKS, etc.).
Caution
Iceberg script execution on Glue for Ray is currently broken. DeltaCAT and PyIceberg v0.5+ depend on a version of pydantic that is incompatible with ray v2.4 used by Glue. See Ray Issue #37372 for additional details.
Use the Glue Runner at dev/deploy/aws/scripts/runner.py aws glue to configure your AWS account to run any Python script using AWS Glue for Ray. The Glue Runner can also build a wheel from the changes in your workspace and upload it to S3, so that the job runs against that wheel instead of the default PyPI DeltaCAT wheel.
- Install and configure the latest version of the AWS CLI:
- Create an AWS Glue IAM Role that can create and run jobs:
- Install and configure boto3:
- Print Usage Instructions and Exit
python runner.py aws glue -h
- Create a new Glue job and run your first script in us-east-1 (the default region)
python runner.py aws glue deltacat/examples/hello_world.py --glue-iam-role "AWSGlueServiceRole"
- Run an example in us-east-1 using the last job config and DeltaCAT deploy
python runner.py aws glue deltacat/examples/hello_world.py
- Run an example in us-east-1 using your local workspace copy of DeltaCAT
python runner.py aws glue deltacat/examples/hello_world.py --deploy-local-deltacat
Note
The deployed package is referenced by an S3 URL that expires in 7 days. After 7 days, you must deploy a new DeltaCAT package to avoid receiving a 403 error!
- Create a new job and run an example in us-west-2:
python runner.py aws glue deltacat/examples/hello_world.py \
--region us-west-2 \
--glue-iam-role "AWSGlueServiceRole"
- Pass arguments into an example script as environment variables:
python runner.py aws glue deltacat/examples/basic_logging.py \
--script-args '{"--var1":"Try that", "--var2":"DeltaCAT"}'
- Creates an S3 bucket at s3://deltacat-packages-{stage} if it doesn't already exist.
- [Optional] Builds a wheel containing your local workspace changes and uploads it to s3://deltacat-packages-{stage}/ if the --deploy-local-deltacat flag is set.
Important
{stage} is replaced with os.environ["USER"] unless you set the $DELTACAT_STAGE environment variable.
- Creates an S3 bucket at s3://deltacat-glue-scripts-{stage} if it doesn't already exist.
- Uploads the script to run to s3://deltacat-glue-scripts-$USER.
- Creates or updates the Glue Job deltacat-runner-{stage} to run this example.
- Runs the deltacat-runner-{stage} Glue Job with either the newly built DeltaCAT wheel or the last used wheel.
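The {stage} substitution used in the names above can be sketched as follows. This is a minimal illustration, not the Glue Runner's actual code: resolve_stage and resource_names are hypothetical helpers, and treating an empty $DELTACAT_STAGE as unset is an assumption.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_stage(env: Optional[Mapping[str, str]] = None) -> str:
    """Return $DELTACAT_STAGE if set, otherwise fall back to $USER.

    Hypothetical helper: an empty DELTACAT_STAGE is treated as unset,
    which is an assumption rather than documented behavior.
    """
    env = os.environ if env is None else env
    stage = env.get("DELTACAT_STAGE")
    if stage:
        return stage
    return env["USER"]  # raises KeyError if neither variable is set


def resource_names(stage: str) -> Dict[str, str]:
    """Fill {stage} into the resource name templates listed above."""
    return {
        "packages_bucket": f"s3://deltacat-packages-{stage}",
        "scripts_bucket": f"s3://deltacat-glue-scripts-{stage}",
        "glue_job": f"deltacat-runner-{stage}",
    }


if __name__ == "__main__":
    # Print the names that would apply to the current shell environment.
    for label, name in resource_names(resolve_stage()).items():
        print(f"{label}: {name}")
```

Passing the environment in as a mapping keeps the helper easy to test without modifying os.environ.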
If you'd like to run integration tests in any other custom environment, you can run a single command to package your local changes in a wheel, upload it to S3, then install it on your Ray cluster from a signed S3 URL.
Run make deploy-s3 to package your local workspace into a wheel and upload it to:
s3://deltacat-packages-{stage}/deltacat-{version}-{timestamp}-{python}-{abi}-{platform}.whl
If the deploy succeeds, you should see some text printed telling you how to install this wheel from a signed S3 URL:
to install run:
pip install deltacat @ `s3://deltacat-packages-{stage}/deltacat-{version}-{timestamp}-{python}-{abi}-{platform}.whl`
The variables in the above S3 URL will be replaced as follows:
stage: The runtime value of the $DELTACAT_STAGE environment variable if defined or the $USER environment variable if not.
version: The current DeltaCAT distribution version. See https://peps.python.org/pep-0491/.
timestamp: Second-precision epoch timestamp build tag. See https://peps.python.org/pep-0491/.
python: Language implementation and version tag (e.g. ‘py27’, ‘py2’, ‘py3’). See https://peps.python.org/pep-0491/.
abi: ABI tag (e.g. ‘cp33m’, ‘abi3’, ‘none’). See https://peps.python.org/pep-0491/.
platform: Platform tag (e.g. ‘linux_x86_64’, ‘any’). See https://peps.python.org/pep-0491/.
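To check which values ended up in an uploaded wheel's name, the file name can be split back into the fields listed above. The sketch below assumes the wheel file name layout described in the linked PEP, with the timestamp occupying the optional build tag slot. parse_wheel_name and the example file name are hypothetical and not part of DeltaCAT.

```python
import re
from typing import Dict

# {distribution}-{version}-{build tag}-{python tag}-{abi tag}-{platform tag}.whl
# The build tag must start with a digit; here it carries the epoch timestamp.
WHEEL_NAME = re.compile(
    r"^(?P<distribution>[^-]+)"
    r"-(?P<version>[^-]+)"
    r"-(?P<timestamp>\d[^-]*)"
    r"-(?P<python>[^-]+)"
    r"-(?P<abi>[^-]+)"
    r"-(?P<platform>[^-]+)\.whl$"
)


def parse_wheel_name(filename: str) -> Dict[str, str]:
    """Split a wheel file name that includes a build tag into its fields."""
    match = WHEEL_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a wheel file name with a build tag: {filename!r}")
    return match.groupdict()


if __name__ == "__main__":
    # Made-up example values, shown only to illustrate the field order.
    fields = parse_wheel_name("deltacat-1.0.0-1700000000-py3-none-any.whl")
    for key, value in fields.items():
        print(f"{key}: {value}")
```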
Use the $DELTACAT_STAGE environment variable to change the S3 bucket that your workspace wheel is uploaded to:
export DELTACAT_STAGE=dev
make deploy-s3
This uploads a wheel to:
s3://deltacat-packages-dev/deltacat-{version}-{timestamp}-{python}-{abi}-{platform}.whl
You can benchmark your DeltaCAT changes on AWS by running:
make benchmark-aws
Note
We recommend running benchmarks in an environment configured for high bandwidth access to cloud storage. For example, on an EC2 instance with enhanced networking support: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html.
Parquet Reads: Modify the SINGLE_COLUMN_BENCHMARKS and ALL_COLUMN_BENCHMARKS fixtures in deltacat/benchmarking/benchmark_parquet_reads.py to add more files and benchmark test cases.
DeltaCAT defines the signatures for its internal storage and catalog APIs in their respective interface.py files.
All functions in interface.py raise a NotImplementedError if invoked directly. Implementations of each are expected to conform to the function signatures defined in interface.py, with conformance validated through unit tests.
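One way such a conformance check could look is sketched below. All names here (get_value, in_memory_get_value, conforms) are hypothetical stand-ins and do not come from DeltaCAT's interface.py. Comparing inspect.signature results is just one possible validation strategy, not necessarily the one DeltaCAT's unit tests use.

```python
import inspect
from typing import Callable, Mapping, Optional


# Hypothetical interface declaration: it fixes the signature only and
# raises if invoked directly.
def get_value(
    store: Mapping[str, str], key: str, default: Optional[str] = None
) -> Optional[str]:
    raise NotImplementedError("get_value not implemented")


# Hypothetical stateless implementation with the same signature.
def in_memory_get_value(
    store: Mapping[str, str], key: str, default: Optional[str] = None
) -> Optional[str]:
    return store.get(key, default)


def conforms(interface_fn: Callable, impl_fn: Callable) -> bool:
    """Return True if both functions share parameter names, kinds,
    defaults, and annotations (including the return annotation)."""
    return inspect.signature(interface_fn) == inspect.signature(impl_fn)


if __name__ == "__main__":
    print(conforms(get_value, in_memory_get_value))
    print(in_memory_get_value({"greeting": "hello"}, "greeting"))
```

The implementation receives all of its inputs as arguments and keeps no state of its own, which is what makes plain functions sufficient here.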
Some reasons we made these "classless" interface signatures are:
- SERDE LIMITATIONS: In early DeltaCAT test cases, distributed Ray applications using equivalent storage and catalog classes produced oversized serialized payloads via Ray cloudpickle. This resulted in application stability issues and/or severe runtime performance penalties.
- STATELESSNESS: All catalog and storage implementations are meant to be stateless (e.g., to support wrapping them in stateless web services), but classes encourage attaching and tracking ephemeral state in class properties.
DeltaCAT's base metadata storage model (Metafile) and all child classes inherit from a standard Python Dict. Other storage models like SortKey inherit from other standard Python collections like Tuple.
There are a few reasons for this:
- SERDE: Dict and other Python collections support standardized serialization/deserialization via json, msgpack, pickle, and Ray cloudpickle dumps/loads functions. They also support standardized output to a wide variety of human-readable and/or pretty-printed string formats (e.g., via pprint) to simplify log message evaluation and debugging.
- EXTENSIBLE: Dict and other Python collections make it easy to add new properties to models over time, can store/retrieve any valid Python object, and make it easy to delineate between when model validation is required (e.g., before serializing to disk) and not (e.g., when a model is instantiated in-memory before knowing the final value of all its properties).
- PERFORMANT: We prioritize model performance over pure object-oriented design principles, and Python's base collections (e.g., Dict, List, Set, Tuple) avoid many performance penalties otherwise incurred by Python OOD extensions like abstract base classes (ABC).
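As a rough illustration of the SERDE and EXTENSIBLE points above, a Dict-based model can be passed straight to standard-library serializers and printers. ExampleModel below is a hypothetical class, not a real DeltaCAT model, and only json and pprint are exercised.

```python
import json
import pprint
from typing import Optional


class ExampleModel(dict):
    """Hypothetical Dict-based model used only for illustration."""

    @staticmethod
    def of(name: str, size: Optional[int] = None) -> "ExampleModel":
        model = ExampleModel()
        model["name"] = name
        model["size"] = size
        return model


model = ExampleModel.of("table_a", 42)

# Because the model is a dict, json.dumps needs no custom encoder.
as_json = json.dumps(model)

# Re-wrapping the decoded dict restores the model type.
restored = ExampleModel(json.loads(as_json))

# New properties can be added later without changing the class.
restored["owner"] = "alice"

if __name__ == "__main__":
    print(as_json)
    pprint.pprint(restored)
```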
Here are a few guiding principles to keep in mind when creating or modifying DeltaCAT's internal storage models:
- CORRECTNESS: The general best practice is "don't make it easy to create invalid models", not "try to make it impossible to create invalid models". To that end, use properties and setters to validate reads/writes of in-memory model state, and override the Metafile to_serializable/from_serializable methods to validate model correctness during reads/writes to/from disk.
  - How hard should it be to create an invalid model? A developer set on modifying a model's internal state will do so, regardless of guardrails put in place. However, the act of creating an invalid model should look obvious, not accidental (e.g., directly modifying the bytes of a persisted metadata file on disk, directly modifying an in-memory model's key/value pairs in its underlying Dict, etc.).
  - Should I interact with a model's base collection directly? If you're directly accessing key/value pairs of a model's underlying Dict, then it's assumed that you know what you're doing, have intentionally bypassed all property-based guardrails (e.g., for performance reasons), and will assume responsibility for leaving the model in a valid state. This isn't implicitly a bad thing, provided that you understand the trade-offs being made.
  - Should I make my model immutable to prevent accidental modification? DeltaCAT model performance, flexibility, and SerDe compatibility take priority over trying to create immutable models. Don't worry about trying to freeze your model via NamedTuple, frozendict, frozen pydantic ConfigDict, etc. Users that want to mutate the model will do so anyway. An immutable type can be copied into new immutable types with the desired changes applied, and forcing all nested objects to be immutable creates unnecessary limitations on the types of properties that can be modeled.
- DECORATORS: Models should follow the decorator design pattern. In other words, they should only extend their base collections and wrap their underlying methods, but shouldn't override base Python collection methods with different/unexpected behaviors.
- PROPERTIES: Telegraph read-only model properties by creating a @property decorator with no corresponding setter. Telegraph mutable model properties by creating a corresponding @property-name.setter decorator.
- PERSISTENCE: Models should be validated before being written to durable storage by their to_serializable method. They should be validated again when read from durable storage by their from_serializable method. Only model properties that are added to their model's base Python collection will be persisted post-serialization. In other words, durable (written-to-disk) model properties should have a corresponding key added to their underlying Dict, while ephemeral (in-memory-only) model properties should not.
- INTERFACES: Interface API declarations and abstract methods should simply raise a NotImplementedError in the base class where they're defined but not implemented.
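The CORRECTNESS, PROPERTIES, and PERSISTENCE principles above might combine as in the following sketch. ExampleTableModel and its fields are hypothetical. The to_serializable and from_serializable signatures shown here are assumptions for illustration and may differ from the real Metafile methods.

```python
from typing import Any, Dict, Optional


class ExampleTableModel(dict):
    """Hypothetical Dict-based model, not part of DeltaCAT."""

    @staticmethod
    def of(name: str, row_count: int = 0) -> "ExampleTableModel":
        model = ExampleTableModel()
        model["name"] = name  # durable: stored in the underlying dict
        model.row_count = row_count  # goes through the validating setter
        return model

    @property
    def name(self) -> Optional[str]:
        # Read-only: no setter is defined for this property.
        return self.get("name")

    @property
    def row_count(self) -> Optional[int]:
        return self.get("rowCount")

    @row_count.setter
    def row_count(self, value: int) -> None:
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"row_count must be a non-negative int, got {value!r}")
        self["rowCount"] = value

    @property
    def last_reader(self) -> Optional[str]:
        # Ephemeral: kept as an instance attribute, so it is never persisted.
        return getattr(self, "_last_reader", None)

    @last_reader.setter
    def last_reader(self, value: str) -> None:
        self._last_reader = value

    def validate(self) -> None:
        if not self.get("name"):
            raise ValueError("name is required")
        row_count = self.get("rowCount")
        if not isinstance(row_count, int) or row_count < 0:
            raise ValueError("rowCount must be a non-negative int")

    def to_serializable(self) -> Dict[str, Any]:
        # Validate before the model is written to durable storage.
        self.validate()
        return dict(self)

    @classmethod
    def from_serializable(cls, data: Dict[str, Any]) -> "ExampleTableModel":
        # Validate again when the model is read back.
        model = cls(data)
        model.validate()
        return model
```

Writing to the underlying dict directly (for example, model["rowCount"] = -5) still works and bypasses the setter, but the invalid state is caught the next time to_serializable runs.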
Some DeltaCAT compute functions interact with Ray cloudpickle differently than the typical Ray application. This allows us to improve compute stability and efficiency at the cost of managing our own distributed object garbage collection instead of relying on Ray's automatic distributed object reference counting and garbage collection. For example, see the comment at deltacat/compute/compactor/utils/primary_key_index.py for an explanation of our custom cloudpickle.dumps usage.