This personal project explores the core concepts of modern data lakehouse architectures, with a focus on Apache Iceberg and Ducklake.
We experiment with Iceberg using both Apache Spark and Polars, and, as you might expect, we use DuckDB to create our Ducklake.
Our main goal when creating this project was to compare different approaches to metadata management. It's well known that Iceberg stores manifests, manifest lists, and metadata files directly in object storage, whereas Ducklake relies on a dedicated database to manage all table metadata.
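To make the contrast concrete, here is a rough sketch of a typical Iceberg table layout on object storage (file names are illustrative, not taken from this project):

```
warehouse/db/reviews/
├── data/
│   └── 00000-0-<uuid>.parquet     # data files
└── metadata/
    ├── v1.metadata.json           # table metadata file
    ├── snap-<id>.avro             # manifest list (one per snapshot)
    └── <uuid>-m0.avro             # manifest (lists data files + stats)
```

Under Ducklake, the same snapshot, schema, and file-level information lives in ordinary tables inside the metadata database (Postgres in this project), while storage holds only the data files.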
- Docker
- `make`
There are three scripts in the `scripts` folder: `ducklake.py`, `iceberg-polars.py`, and `iceberg-spark.py`. Each script performs the same steps, using its respective technology:
- Reads the CSV file `google_play_music_reviews.csv`, a dataset downloaded from Kaggle containing user reviews from the Google Play Store for seven popular music streaming apps: Spotify, Apple Music, SoundCloud, TIDAL, Deezer, Shazam, and Google Play Music.
- Writes the full dataset in either Ducklake or Iceberg format.
- Reads the data back from Ducklake or Iceberg.
- Aggregates by app to `COUNT` the number of occurrences (see the `group_by_app` function in `lib/sqls.py`).
- Deletes all `Spotify` records and counts again to confirm the records were deleted.
- Inserts all `Spotify` records back and performs a final count to ensure the records were successfully reinserted.
Note: To use `polars` with Iceberg, we use boringcatalog, a lightweight Iceberg catalog backed by a single JSON file.
First, run `make run` to start all the necessary Docker containers for the project: a Postgres database for Ducklake, a Spark master and worker for `iceberg-spark`, and a Python container to run the scripts.
Once the containers are up, run `make dev` to connect to the Python container. Inside the container, you can run:
```
make ducklake
make polars
make spark
```
to run the project with your chosen technology. You can append `-demo` (e.g., `make polars-demo`) to run a paused version that prompts you to press `Enter` to continue, allowing you to observe the data as it changes.
All files are written to the `storage` folder. After running all three commands, your `storage` folder should look like this:
Note:
- `ducklake` uses the `postgres` folder and creates the `ducklake` folder.
- `polars` creates the `catalog` and `reviews.db` folders.
- `spark` creates the `iceberg` folder.
You can use a SQL workbench with the connection string `postgresql://postgres:postgres@127.0.0.1/ducklake` to view and explore the `ducklake` metadata database.
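A few starting points for that exploration. The table names below follow the DuckLake specification's metadata schema and are not taken from this project, so verify them against the version you have running:

```sql
-- Ducklake keeps all table metadata in ordinary database tables.
SELECT * FROM ducklake_snapshot;   -- one row per snapshot (commit)
SELECT * FROM ducklake_table;      -- the tables registered in the lake
SELECT * FROM ducklake_data_file;  -- the Parquet files backing each table
```

Comparing these rows before and after the Spotify delete/reinsert steps shows how Ducklake records changes that Iceberg would express as new metadata and manifest files.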
Lastly, `make stop` will stop all containers.