Lakehouse Testing

This personal project explores the core concepts of modern data lakehouse architectures, with a focus on Apache Iceberg and DuckLake.

We experiment with Iceberg using both Apache Spark and Polars, and, as you might expect, we use DuckDB to create our DuckLake catalog.

The main goal of this project is to compare different approaches to metadata management: Iceberg keeps manifests, manifest lists, and metadata files directly in object storage, whereas DuckLake relies on a dedicated database to manage all table metadata.
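To make the contrast concrete: an Iceberg table is defined by a metadata JSON file that lives in storage next to the data, and you can ask any pyiceberg-backed table where that file is. Here is a minimal sketch; the catalog name and table identifier are illustrative, not the project's exact names:

```python
from pyiceberg.catalog import load_catalog

# Iceberg keeps its state as files in storage: a metadata JSON file that
# points to manifest lists, which point to manifests, which point to data files.
catalog = load_catalog("default")              # illustrative catalog name
table = catalog.load_table("reviews.reviews")  # illustrative table identifier
print(table.metadata_location)                 # e.g. .../metadata/00002-<uuid>.metadata.json

# DuckLake has no such file chain: the equivalent state lives as rows in
# catalog tables inside a database (Postgres in this project), queried below.
```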

Requirements

  • Docker
  • make

Project Overview

There are three scripts in the scripts folder: ducklake.py, iceberg-polars.py, and iceberg-spark.py. Each script performs the same steps using its respective technology (a minimal sketch of the flow follows the list):

  • Reads the CSV file google_play_music_reviews.csv, a dataset downloaded from Kaggle containing user reviews from the Google Play Store for seven popular music streaming apps: Spotify, Apple Music, SoundCloud, TIDAL, Deezer, Shazam, and Google Play Music.
  • Writes the full dataset in either DuckLake or Iceberg format.
  • Reads the data back from DuckLake or Iceberg.
  • Aggregates by app, counting the number of reviews per app (see the group_by_app function in lib/sqls.py).
  • Deletes all Spotify records and counts again to confirm the records were deleted.
  • Inserts all Spotify records back and performs a final count to ensure the records were successfully reinserted.
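
For illustration, here is a hedged sketch of that flow using DuckDB's Python API with the ducklake extension. The table and column names (reviews, app) and the attach options are assumptions based on DuckLake's documented syntax, not the project's exact code; see scripts/ducklake.py for the real thing:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")

# Attach a DuckLake catalog backed by the project's Postgres container
con.execute("""
    ATTACH 'ducklake:postgres:dbname=ducklake host=127.0.0.1 user=postgres password=postgres'
    AS lake (DATA_PATH 'storage/ducklake/')
""")

# Write the full CSV dataset as a DuckLake table
con.execute("""
    CREATE TABLE lake.reviews AS
    SELECT * FROM read_csv_auto('google_play_music_reviews.csv')
""")

# Aggregate by app (the role played by group_by_app in lib/sqls.py)
count_by_app = "SELECT app, COUNT(*) AS n FROM lake.reviews GROUP BY app"
print(con.execute(count_by_app).fetchall())

# Keep a copy of the Spotify rows, delete them, and confirm the count dropped
con.execute("CREATE TEMP TABLE spotify AS SELECT * FROM lake.reviews WHERE app = 'Spotify'")
con.execute("DELETE FROM lake.reviews WHERE app = 'Spotify'")
print(con.execute(count_by_app).fetchall())

# Reinsert them and run the final count
con.execute("INSERT INTO lake.reviews SELECT * FROM spotify")
print(con.execute(count_by_app).fetchall())
```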

Note: To use Polars with Iceberg, we use boringcatalog, a lightweight Iceberg catalog that stores its state in a single JSON file.
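
Reading the table back from Polars then looks roughly like the following. This is a sketch that assumes boringcatalog is exposed through pyiceberg's catalog interface under the name "default" and that the table is registered as reviews.reviews; the identifiers are illustrative:

```python
import polars as pl
from pyiceberg.catalog import load_catalog

# Assumption: boringcatalog is configured as a pyiceberg catalog named
# "default", with its single JSON file holding the catalog state.
catalog = load_catalog("default")
table = catalog.load_table("reviews.reviews")  # illustrative identifier

# Polars scans Iceberg tables lazily, so the aggregation is planned before reading
counts = pl.scan_iceberg(table).group_by("app").len().collect()
print(counts)
```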

How to Run

First, run make run to start all the necessary Docker containers for the project: a Postgres database for DuckLake, a Spark master and worker for iceberg-spark, and a Python container to run the scripts.

Once the containers are up, run make dev to connect to the Python container. Inside the container, you can run:

  • make ducklake
  • make polars
  • make spark

to run the project with your chosen technology. You can append -demo (e.g., make polars-demo) to run a paused version that prompts you to press Enter between steps, so you can observe the data as it changes.

All files are written to the storage folder. After running all three commands, each command will have left its own artifacts there:

  • ducklake uses the postgres folder and creates the ducklake folder.
  • polars creates the catalog and reviews.db folders.
  • spark creates the iceberg folder.

You can connect a SQL workbench with the connection string postgresql://postgres:postgres@127.0.0.1/ducklake to explore the DuckLake metadata database.
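
If you would rather inspect it from Python, here is a minimal sketch using the psycopg2 driver against the same connection string. The ducklake_snapshot table comes from DuckLake's published metadata schema; verify the exact table and column names against the DuckLake version you run:

```python
import psycopg2

# Same connection string as above; Postgres is exposed on 127.0.0.1 by the containers
conn = psycopg2.connect("postgresql://postgres:postgres@127.0.0.1/ducklake")
with conn.cursor() as cur:
    # Each write (create, delete, insert) should show up as a new snapshot row
    cur.execute("SELECT snapshot_id, snapshot_time FROM ducklake_snapshot ORDER BY snapshot_id")
    for snapshot_id, snapshot_time in cur.fetchall():
        print(snapshot_id, snapshot_time)
conn.close()
```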

Lastly, make stop will stop all containers.
