Databricks Bundle Example

This project is an example implementation of a Databricks Asset Bundle using a Databricks Free Edition workspace.

The project is configured using pyproject.toml (Python specifics) and databricks.yaml (Databricks Bundle specifics) and uses uv to manage the Python project and dependencies.

Repository Structure

Directory: Description

  • .github/workflows: CI/CD jobs to test and deploy the bundle
  • src/dab_project: Python project (used in the Databricks Workflow as a Python wheel task)
  • dbt: dbt project
    • Used in the Databricks Workflow as a dbt task
    • dbt models taken from https://github.com/dbt-labs/jaffle_shop_duckdb
  • resources: Resources such as Databricks Workflows or Databricks Volumes/Schemas
    • Python-based workflow: https://docs.databricks.com/aws/en/dev-tools/bundles/python
    • YAML-based workflow: https://docs.databricks.com/aws/en/dev-tools/bundles/resources#job
  • scripts: Python script to set up the groups, service principals and catalogs used in a Databricks (Free Edition) workspace
  • tests: Unit tests running on Databricks (via Databricks Connect) or locally
    • Used in the ci.yml jobs

Databricks Workspace

This example uses a Databricks Free Edition workspace (https://www.databricks.com/learn/free-edition) with all resources and identities managed in the workspace itself (no external connections or cloud identity management).

Setup

Groups and Service Principals are not necessary, but are used in this project to showcase handling permissions on resources such as catalogs or workflows.

  • Serverless environment: Version 4 which is similar to Databricks Runtime ~17.*
  • Catalogs: lake_dev, lake_test and lake_prod
  • Service principals (for CI/CD and Workflow runners)
    • sp_etl_dev (for dev and test) and sp_etl_prod (for prod)
    • Make sure the user that deploys workflows has the Service principal: User role on these service principals
  • Groups
    • group_etl group with ALL PRIVILEGES and group_reader with limited permissions on catalogs
    • These are mostly to test applying grants using Asset Bundle resources

The script scripts/setup_workspace.py sets up the (Free Edition) workspace as described above; see the Development section for details.

Development

Requirements

Setup environment

Sync entire uv environment with all optional dependency groups:

uv sync --all-extras

Note: we install Databricks Connect in a follow-up step

(Optional) Activate virtual environment

Bash:

source .venv/bin/activate

Windows:

.venv\Scripts\activate

Databricks Connect

Option 1: Install databricks-connect in the active environment. This requires authentication to be set up via the Databricks CLI.

uv pip uninstall pyspark
uv pip install databricks-connect==17.2.*

Option 2: Run with temporary dependency

uv run --with databricks-connect==17.2.* pytest

Note: The pinned version 17.2.* targets the Databricks serverless environment version 4.

See https://docs.databricks.com/aws/en/dev-tools/vscode-ext/ for using Databricks Connect with the Databricks extension for VS Code.

Unit-Tests

# in case databricks-connect is installed, --no-sync prevents reinstalling pyspark
uv run --no-sync pytest -v

Depending on whether Databricks Connect is enabled, the unit tests either use a Databricks cluster or start a local Spark session with Delta support.

  • On Databricks the unit-tests currently assume the catalog lake_dev exists.
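
The switch between Databricks Connect and local Spark could look roughly like the sketch below. This is not the project's actual test setup: the function names databricks_connect_available and create_spark_session are hypothetical, and the local branch assumes that pyspark, delta-spark and Java are installed.

```python
import importlib.util


def databricks_connect_available() -> bool:
    """Return True if the databricks.connect module can be imported."""
    try:
        return importlib.util.find_spec("databricks.connect") is not None
    except ModuleNotFoundError:
        # The parent package "databricks" is not installed at all.
        return False


def create_spark_session():
    """Create a Databricks Connect session if possible, else a local one."""
    if databricks_connect_available():
        from databricks.connect import DatabricksSession

        # Serverless compute; credentials come from the Databricks CLI config.
        return DatabricksSession.builder.serverless(True).getOrCreate()

    # Local fallback (assumption: pyspark and delta-spark are installed).
    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    builder = (
        SparkSession.builder.master("local[1]")
        .appName("unit-tests")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config(
            "spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        )
    )
    return configure_spark_with_delta_pip(builder).getOrCreate()
```

In a pytest setup, create_spark_session() would typically be wrapped in a session-scoped fixture so that all tests share one Spark session.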

Note: Local Spark requires Java. On Windows, Spark/Delta also requires Hadoop libraries and generally does not run well; use WSL instead.

Checks

# Linting
uv run ruff check --fix
# Formatting
uv run ruff format

Setup Databricks Workspace

The following script sets up a Databricks (Free Edition) workspace for this project with additional catalogs, groups and service principals. It uses both the Databricks SDK and Databricks Connect (Serverless).

# Authenticate to your Databricks workspace, if you have not done so already:
# databricks configure

uv run ./scripts/setup_workspace.py
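
As a rough idea of what such a setup involves, here is a minimal, idempotent sketch using the Databricks SDK for Python. It is not the content of scripts/setup_workspace.py: the helper names (missing, setup_identities, catalog_statements) are hypothetical, and creating catalogs via CREATE CATALOG statements is an assumption about how the Databricks Connect part might be used.

```python
# Names taken from the Setup section of this README.
GROUPS = ["group_etl", "group_reader"]
SERVICE_PRINCIPALS = ["sp_etl_dev", "sp_etl_prod"]
CATALOGS = ["lake_dev", "lake_test", "lake_prod"]


def missing(desired, existing):
    """Return the desired names that do not exist yet, keeping their order."""
    existing_names = set(existing)
    return [name for name in desired if name not in existing_names]


def setup_identities():
    """Create groups and service principals that are not present yet."""
    # Imported lazily so the helpers above work without the SDK installed.
    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()  # uses the authentication configured for the CLI

    existing_groups = [g.display_name for g in w.groups.list()]
    for name in missing(GROUPS, existing_groups):
        w.groups.create(display_name=name)

    existing_sps = [sp.display_name for sp in w.service_principals.list()]
    for name in missing(SERVICE_PRINCIPALS, existing_sps):
        w.service_principals.create(display_name=name)


def catalog_statements():
    """SQL statements that could be run through a Spark session (assumption)."""
    return [f"CREATE CATALOG IF NOT EXISTS {name}" for name in CATALOGS]
```

Checking what already exists before creating anything keeps the script safe to run repeatedly against the same workspace.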

Databricks CLI

  1. Authenticate to your Databricks workspace, if you have not done so already:

    $ databricks configure
    
  2. To deploy a development copy of this project, type:

    $ databricks bundle deploy --target dev
    
  3. Similarly, to deploy a production copy, type:

    $ databricks bundle deploy --target prod
    
  4. To deploy with custom variables, type:

    $ databricks bundle deploy --target dev --var "catalog_name=workspace"
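
When deployments are triggered from a Python script rather than typed by hand, the commands above can be assembled programmatically. The following is a small sketch under that assumption; build_deploy_command and deploy are hypothetical helper names, and only the --target and --var flags shown above are used.

```python
from __future__ import annotations

import subprocess


def build_deploy_command(target: str, variables: dict[str, str] | None = None) -> list[str]:
    """Build the argument list for `databricks bundle deploy`."""
    command = ["databricks", "bundle", "deploy", "--target", target]
    # Each bundle variable is passed as its own --var "name=value" pair.
    for name, value in (variables or {}).items():
        command += ["--var", f"{name}={value}"]
    return command


def deploy(target: str, variables: dict[str, str] | None = None) -> None:
    """Run the deployment; raises CalledProcessError if the CLI fails."""
    subprocess.run(build_deploy_command(target, variables), check=True)
```

For example, deploy("dev", {"catalog_name": "workspace"}) runs the same command as step 4. Passing the arguments as a list avoids shell quoting issues with variable values.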
    

FAQ

  • Service Principals

    For this example, the targets test and prod use a group and service principals.

    The group group_etl can manage the workflow; ideally, your user and the service principal are both members of it. This group should also have sufficient permissions on the catalogs used.

    Make sure the user that deploys the bundle has the Service principal: User role. Service principal: Manager alone is not enough.

  • dbt project

    The dbt project is based on https://github.com/dbt-labs/jaffle_shop_duckdb with the following changes:

    • Schema bronze, silver, gold
    • documented materialization use_materialization_v2
    • Primary, Foreign Key Constraints

TODO:

  • Streaming example
  • Logging
    • Logging to volume