BauplanLabs/git_for_data

Git for Data

A collection of "Git-for-Data" snippets, models, and resources.

Overview

This repository collects code snippets, models, and resources related to "Git-for-Data": roughly, the idea that data assets should be treated the way code assets are treated in software development.

In particular, it explores the idea of applying Git-like version control concepts to data management, in the context of modern cloud lakehouses (e.g. the Bauplan paper on reproducible cloud pipelines) and open formats (e.g. Apache Iceberg).

Of course, "Git-for-Data" is not a new phrase, and a quick Google search turns up several existing projects. Beyond high-level similarities, however, they differ in scope, implementation, and intended usage. One of the main motivations for this project (and therefore for the content in this repository) is to give precise definitions to fuzzy concepts and to promote a more formal, shared understanding of the core primitives of a data management system built around "version control."

Setup

Depending on the project, you may need a few tools to run the code yourself. The basic dependencies are Bauplan and Alloy.

Install Alloy

Alloy ships as a self-contained executable: you can download it here. The code in this repo has been written and tested with Alloy 6.2.0.

To learn more about Alloy, you can check out the official book.

Get Bauplan

Get a free Bauplan sandbox account here: follow the instructions to create an API key, install the CLI / SDK in a Python virtual environment, and run the tutorial to get acquainted with the platform.

Content

Blog Series

Git for Data: Part 1

A very simple Alloy model that demonstrates the basic interplay between table versions ("snapshots") and lakehouse "commits", and how people can collaborate through "feature branches" that are merged back Git-style at the end.

The companion blog post (LINK TBC) discusses the difference in the commit history between a three-way merge and a fast-forward merge. You can reproduce the visual instances in the blog post by commenting / uncommenting standardMerge (you'll get this) and ffMerge (you'll get this) at the end of the gsd.als file.
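To make the difference between the two merge styles concrete, here is a minimal Python sketch. It is not the Alloy model in gsd.als and it does not use the Bauplan SDK: every name in it (Commit, Lakehouse, write_table, merge, demo) is hypothetical and chosen only for illustration. It assumes a commit is an immutable mapping from table names to snapshot ids, and a branch is a named pointer to a commit.

```python
from dataclasses import dataclass
from itertools import count

_ids = count(1)  # monotonically increasing commit ids (illustrative only)


@dataclass(frozen=True)
class Commit:
    id: int
    parents: tuple  # parent Commit objects: 0 (root), 1 (write) or 2 (merge)
    tables: tuple   # sorted (table_name, snapshot_id) pairs


def new_commit(parents, tables):
    return Commit(next(_ids), tuple(parents), tuple(sorted(tables.items())))


def ancestors(commit):
    """Return {id: Commit} for the commit itself and everything reachable from it."""
    seen, stack = {}, [commit]
    while stack:
        cur = stack.pop()
        if cur.id not in seen:
            seen[cur.id] = cur
            stack.extend(cur.parents)
    return seen


class Lakehouse:
    def __init__(self):
        self.branches = {"main": new_commit([], {})}

    def create_branch(self, name, from_branch="main"):
        # A new branch is just another pointer to the same commit.
        self.branches[name] = self.branches[from_branch]

    def write_table(self, branch, table, snapshot_id):
        head = self.branches[branch]
        tables = dict(head.tables)
        tables[table] = snapshot_id
        self.branches[branch] = new_commit([head], tables)

    def merge(self, source, target, fast_forward=True):
        src, dst = self.branches[source], self.branches[target]
        src_anc, dst_anc = ancestors(src), ancestors(dst)
        if fast_forward and dst.id in src_anc:
            # Fast-forward: the target has not moved since the branch point,
            # so its pointer simply advances. No new commit is created.
            self.branches[target] = src
            return src
        # Three-way merge: compare both heads against a common ancestor.
        # Simplification: the common ancestor with the highest id is the base.
        base = src_anc[max(src_anc.keys() & dst_anc.keys())]
        b, s, d = dict(base.tables), dict(src.tables), dict(dst.tables)
        merged = {}
        for name in s.keys() | d.keys():
            if s.get(name) == d.get(name) or s.get(name) == b.get(name):
                pick = d.get(name)   # source did not change it (or both agree)
            elif d.get(name) == b.get(name):
                pick = s.get(name)   # only the source changed it
            else:
                raise ValueError(f"conflict on table {name!r}")
            if pick is not None:
                merged[name] = pick
        result = new_commit([dst, src], merged)  # merge commit with two parents
        self.branches[target] = result
        return result


def demo(fast_forward):
    lh = Lakehouse()
    lh.write_table("main", "orders", "s1")
    lh.create_branch("feature")
    lh.write_table("feature", "orders", "s2")
    return lh, lh.merge("feature", "main", fast_forward=fast_forward)
```

Calling demo(True) leaves main pointing at the very commit produced on the feature branch, so the history stays linear. Calling demo(False) creates an extra commit with two parents, which keeps a record that a branch existed and was merged. Both end with the same table contents; only the shape of the commit history differs.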

The commit_api.py script in the bpln folder showcases how Bauplan currently works: it demonstrates Bauplan’s Python APIs for lakehouse management and data branching, as well as the auditability APIs for programmatically inspecting the commit history through typed Python objects. If you have uv installed, you can run the script with uv run commit_api.py --table_name my_alloy_table (make sure my_alloy_table does not already exist in your account).

Can you spot the differences between the implementation and the formal specification?

Git for Data: Part 2

TBC

Paper

TBC

Acknowledgements

Our interest in "Git-for-Data" started at the very beginning of Bauplan, given our focus on reproducible data pipelines. However, we would not have reached this level of maturity without our 2025 summer interns, Manuel Barros (CMU), Jinlang Wang (University of Wisconsin–Madison), and Weiming Sheng (Columbia), who did fantastic work exploring both the formal semantics and the Alloy implementation of these concepts.
