Lakehouse Testing

This personal project explores the core concepts of modern data lakehouse architectures, with a focus on Apache Iceberg and DuckLake.

We experiment with Iceberg using both Apache Spark and Polars, and, as you might expect, we use DuckDB to create our DuckLake catalog.

The main goal of this project is to compare different approaches to metadata management: Iceberg keeps manifests, manifest lists, and metadata files directly in object storage, whereas DuckLake relies on a dedicated database to manage all table metadata.
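To make the contrast concrete: an Iceberg table is defined by a metadata JSON file that lives in storage next to the data, and you can ask any pyiceberg-backed table where that file is. Here is a minimal sketch; the catalog name and table identifier are illustrative, not the project's exact names:

```python
from pyiceberg.catalog import load_catalog

# Iceberg keeps its state as files in storage: a metadata JSON file that
# points to manifest lists, which point to manifests, which point to data files.
catalog = load_catalog("default")              # illustrative catalog name
table = catalog.load_table("reviews.reviews")  # illustrative table identifier
print(table.metadata_location)                 # e.g. .../metadata/00002-<uuid>.metadata.json

# DuckLake has no such file chain: the equivalent state lives as rows in
# catalog tables inside a database (Postgres in this project), queried below.
```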

Requirements

  • Docker
  • make

Project Overview

There are three scripts in the scripts folder: ducklake.py, iceberg-polars.py, and iceberg-spark.py. Each script performs the same steps using its respective technology (a minimal sketch of the flow follows the list):

  • Reads the CSV file google_play_music_reviews.csv, a dataset downloaded from Kaggle containing user reviews from the Google Play Store for seven popular music streaming apps: Spotify, Apple Music, SoundCloud, TIDAL, Deezer, Shazam, and Google Play Music.
  • Writes the full dataset in either DuckLake or Iceberg format.
  • Reads the data back from DuckLake or Iceberg.
  • Aggregates by app, counting the number of reviews per app (see the group_by_app function in lib/sqls.py).
  • Deletes all Spotify records and counts again to confirm the records were deleted.
  • Inserts all Spotify records back and performs a final count to ensure the records were successfully reinserted.
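
For illustration, here is a hedged sketch of that flow using DuckDB's Python API with the ducklake extension. The table and column names (reviews, app) and the attach options are assumptions based on DuckLake's documented syntax, not the project's exact code; see scripts/ducklake.py for the real thing:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")

# Attach a DuckLake catalog backed by the project's Postgres container
con.execute("""
    ATTACH 'ducklake:postgres:dbname=ducklake host=127.0.0.1 user=postgres password=postgres'
    AS lake (DATA_PATH 'storage/ducklake/')
""")

# Write the full CSV dataset as a DuckLake table
con.execute("""
    CREATE TABLE lake.reviews AS
    SELECT * FROM read_csv_auto('google_play_music_reviews.csv')
""")

# Aggregate by app (the role played by group_by_app in lib/sqls.py)
count_by_app = "SELECT app, COUNT(*) AS n FROM lake.reviews GROUP BY app"
print(con.execute(count_by_app).fetchall())

# Keep a copy of the Spotify rows, delete them, and confirm the count dropped
con.execute("CREATE TEMP TABLE spotify AS SELECT * FROM lake.reviews WHERE app = 'Spotify'")
con.execute("DELETE FROM lake.reviews WHERE app = 'Spotify'")
print(con.execute(count_by_app).fetchall())

# Reinsert them and run the final count
con.execute("INSERT INTO lake.reviews SELECT * FROM spotify")
print(con.execute(count_by_app).fetchall())
```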

Note: To use Polars with Iceberg, we use boringcatalog, a lightweight Iceberg catalog that stores its state in a single JSON file.
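
Reading the table back from Polars then looks roughly like the following. This is a sketch that assumes boringcatalog is exposed through pyiceberg's catalog interface under the name "default" and that the table is registered as reviews.reviews; the identifiers are illustrative:

```python
import polars as pl
from pyiceberg.catalog import load_catalog

# Assumption: boringcatalog is configured as a pyiceberg catalog named
# "default", with its single JSON file holding the catalog state.
catalog = load_catalog("default")
table = catalog.load_table("reviews.reviews")  # illustrative identifier

# Polars scans Iceberg tables lazily, so the aggregation is planned before reading
counts = pl.scan_iceberg(table).group_by("app").len().collect()
print(counts)
```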

How to Run

First, run make run to start all the necessary Docker containers for the project: a Postgres database for DuckLake, a Spark master and worker for iceberg-spark, and a Python container to run the scripts.

Once the containers are up, run make dev to connect to the Python container. Inside the container, you can run:

  • make ducklake
  • make polars
  • make spark

to run the project with your chosen technology. You can append -demo (e.g., make polars-demo) to run a paused version that prompts you to press Enter between steps, so you can observe the data as it changes.

All files are written to the storage folder. After running all three commands, each command will have left its own artifacts there:

  • ducklake uses the postgres folder and creates the ducklake folder.
  • polars creates the catalog and reviews.db folders.
  • spark creates the iceberg folder.

You can connect a SQL workbench with the connection string postgresql://postgres:postgres@127.0.0.1/ducklake to explore the DuckLake metadata database.
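
If you would rather inspect it from Python, here is a minimal sketch using the psycopg2 driver against the same connection string. The ducklake_snapshot table comes from DuckLake's published metadata schema; verify the exact table and column names against the DuckLake version you run:

```python
import psycopg2

# Same connection string as above; Postgres is exposed on 127.0.0.1 by the containers
conn = psycopg2.connect("postgresql://postgres:postgres@127.0.0.1/ducklake")
with conn.cursor() as cur:
    # Each write (create, delete, insert) should show up as a new snapshot row
    cur.execute("SELECT snapshot_id, snapshot_time FROM ducklake_snapshot ORDER BY snapshot_id")
    for snapshot_id, snapshot_time in cur.fetchall():
        print(snapshot_id, snapshot_time)
conn.close()
```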

Lastly, make stop will stop all containers.
