Git for Data

A collection of "Git-for-Data" snippets, models, and resources

Overview

This repository contains code snippets, models, and resources related to "Git-for-Data": roughly, the idea that we should treat data assets the way we treat code assets in software development.

In particular, it explores the idea of applying Git-like version control concepts to data management, in the context of modern cloud lakehouses (e.g. the Bauplan paper on reproducible cloud pipelines) and open formats (e.g. Apache Iceberg).

Of course, "Git-for-Data" per se is not a new phrase, and a quick Google search turns up a few existing projects. Beyond high-level similarities, however, they all differ in scope, implementation, and intended usage. One of the main motivations for this project (and therefore for the content in this repository) is to give fuzzy concepts a precise definition, and to promote a more formal, shared understanding of the core primitives of a data management system built around "version control."

Setup

If you want to run the code yourself, you may need a few tools, depending on the project. The basic dependencies are Bauplan and Alloy.

Install Alloy

Alloy ships as a self-contained executable: you can download it here. The code in this repo has been written and tested with Alloy 6.2.0.

To learn more about Alloy, you can check out the official book.

Get Bauplan

Get a free Bauplan sandbox account here: follow the instructions to create an API key, install the CLI / SDK in a Python virtual environment, and run the tutorial to get acquainted with the platform.

Content

Blog Series

Git for Data: Part 1

A very simple Alloy model that demonstrates the basic interplay between table versions ("snapshots") and lakehouse "commits", and how people can collaborate through "feature branches" that are merged Git-style at the end.

The companion blog post (LINK TBC) discusses the difference in the commit history between a three-way merge and a fast-forward merge. You can reproduce the visual instances in the blog post by commenting / uncommenting standardMerge (you'll get this) and ffMerge (you'll get this) at the end of the gsd.als file.
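To make the difference between the two merge styles concrete, here is a minimal Python sketch of a toy commit graph. It is not the Alloy model in gsd.als and not Bauplan's implementation: the Lakehouse class, its methods, and the conflict rule are all hypothetical names and simplifications introduced for illustration (for instance, table deletions are not modeled, and the merge base is simply the most recent common ancestor).

```python
from dataclasses import dataclass


@dataclass
class Commit:
    """A toy lakehouse commit: table name -> snapshot id, plus parent commit ids."""
    id: int
    tables: dict
    parents: tuple


class Lakehouse:
    """Hypothetical in-memory commit graph, for illustration only."""

    def __init__(self):
        self.commits = {0: Commit(0, {}, ())}  # empty root commit
        self.branches = {"main": 0}            # branch name -> head commit id

    def _add(self, tables, parents):
        cid = len(self.commits)  # ids grow monotonically with creation order
        self.commits[cid] = Commit(cid, tables, parents)
        return cid

    def create_branch(self, name, from_branch="main"):
        self.branches[name] = self.branches[from_branch]

    def write_table(self, branch, table, snapshot):
        """Writing a new snapshot of a table creates a new commit on the branch."""
        head = self.commits[self.branches[branch]]
        self.branches[branch] = self._add({**head.tables, table: snapshot}, (head.id,))

    def ancestors(self, cid):
        """All commit ids reachable from cid, including cid itself."""
        seen, stack = set(), [cid]
        while stack:
            c = stack.pop()
            if c not in seen:
                seen.add(c)
                stack.extend(self.commits[c].parents)
        return seen

    def merge(self, source, target):
        src, dst = self.branches[source], self.branches[target]
        if dst in self.ancestors(src):
            # Fast-forward: target has not moved since the fork,
            # so its pointer just advances and no new commit is created.
            self.branches[target] = src
            return "fast-forward"
        # Three-way merge: compare both heads against their common base.
        base = self.commits[max(self.ancestors(src) & self.ancestors(dst))]
        s, d = self.commits[src].tables, self.commits[dst].tables
        merged = dict(d)
        for table in set(s) | set(d):
            b_v, s_v, d_v = base.tables.get(table), s.get(table), d.get(table)
            if s_v != b_v and d_v != b_v and s_v != d_v:
                raise ValueError(f"conflict on table {table!r}")
            if s_v != b_v:
                merged[table] = s_v  # only the source changed this table
        # The merge commit has two parents: target head first, then source head.
        self.branches[target] = self._add(merged, (dst, src))
        return "three-way"


# Scenario 1: main does not move after the fork, so the merge fast-forwards.
ff = Lakehouse()
ff.write_table("main", "orders", "s1")
ff.create_branch("feature")
ff.write_table("feature", "orders", "s2")
ff_kind = ff.merge("feature", "main")

# Scenario 2: main gets its own commit after the fork, forcing a three-way merge.
tw = Lakehouse()
tw.write_table("main", "orders", "s1")
tw.create_branch("feature")
tw.write_table("feature", "orders", "s2")
tw.write_table("main", "users", "u1")
tw_kind = tw.merge("feature", "main")
```

In the first scenario the head of main ends up with a single parent and the history stays linear; in the second, the head of main is a new commit with two parents. That structural difference in the commit history is what the two visual instances illustrate.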

The commit_api.py script in the bpln folder shows how Bauplan currently works: it demonstrates Bauplan’s Python APIs for lakehouse management and data branching, together with the auditability APIs that let you investigate the commit history programmatically through typed Python objects. If you have uv installed, you can run the script with uv run commit_api.py --table_name my_alloy_table (make sure my_alloy_table does not already exist in your account).

Can you spot the differences between the implementation and the formal specification?

Git for Data: Part 2

TBC

Paper

TBC

Acknowledgements

Our interest in "Git-for-Data" started at the very beginning of Bauplan, given our focus on reproducible data pipelines. However, we would not have reached this level of maturity without our 2025 summer interns, Manuel Barros (CMU), Jinlang Wang (University of Wisconsin–Madison), and Weiming Sheng (Columbia), who did fantastic work exploring both the formal semantics and the Alloy implementation of these concepts.

