
Commit 6d5f2a8

Add Lakeflow template (#2959)
## Changes

Add Lakeflow template based on the new Pipeline folder structure and leveraging the new `glob` and `root_path` properties.
1 parent f2396bb commit 6d5f2a8

File tree: 53 files changed, +1158 −3 lines
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
{
  "project_name": "my_lakeflow_pipelines",
  "default_catalog": "main",
  "personal_schemas": "yes",
  "language": "python"
}
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@

>>> [CLI] bundle init lakeflow-pipelines --config-file ./input.json --output-dir output

Welcome to the template for Lakeflow Declarative Pipelines!


Your new project has been created in the 'my_lakeflow_pipelines' directory!

Refer to the README.md file for "getting started" instructions!

>>> [CLI] bundle validate -t dev
Name: my_lakeflow_pipelines
Target: dev
Workspace:
  Host: [DATABRICKS_URL]
  User: [USERNAME]
  Path: /Workspace/Users/[USERNAME]/.bundle/my_lakeflow_pipelines/dev

Validation OK!

>>> [CLI] bundle validate -t prod
Name: my_lakeflow_pipelines
Target: prod
Workspace:
  Host: [DATABRICKS_URL]
  User: [USERNAME]
  Path: /Workspace/Users/[USERNAME]/.bundle/my_lakeflow_pipelines/prod

Validation OK!
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# Typings for Pylance in Visual Studio Code
# see https://github.com/microsoft/pyright/blob/main/docs/builtins.md
from databricks.sdk.runtime import *
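This stub re-exports the Databricks runtime globals so Pylance can resolve names like `spark` and `dbutils` while editing locally. As a rough sketch (assuming a Databricks Connect or Databricks runtime environment is configured; `samples.nyctaxi.trips` is just the built-in sample table), project code can then use those names with type information:

```python
# Hypothetical local script; `spark` is the same name the .pyi stub declares,
# provided at runtime by databricks.sdk.runtime (or Databricks Connect).
from databricks.sdk.runtime import spark

# samples.nyctaxi.trips is a built-in Databricks sample table.
trips = spark.read.table("samples.nyctaxi.trips")
trips.show(5)
```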
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
{
  "recommendations": [
    "databricks.databricks",
    "ms-python.vscode-pylance",
    "redhat.vscode-yaml"
  ]
}
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
{
  "python.analysis.stubPath": ".vscode",
  "databricks.python.envFile": "${workspaceFolder}/.env",
  "jupyter.interactiveWindow.cellMarker.codeRegex": "^# COMMAND ----------|^# Databricks notebook source|^(#\\s*%%|#\\s*\\<codecell\\>|#\\s*In\\[\\d*?\\]|#\\s*In\\[ \\])",
  "jupyter.interactiveWindow.cellMarker.default": "# COMMAND ----------",
  "python.testing.pytestArgs": [
    "."
  ],
  "python.testing.unittestEnabled": false,
  "python.testing.pytestEnabled": true,
  "python.analysis.extraPaths": ["resources/my_lakeflow_pipelines_pipeline"],
  "files.exclude": {
    "**/*.egg-info": true,
    "**/__pycache__": true,
    ".pytest_cache": true,
  },
  "[python]": {
    "editor.defaultFormatter": "ms-python.black-formatter",
    "editor.formatOnSave": true,
  },
}
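The pytest settings above enable test discovery from the project root in the VS Code test runner. A hypothetical, minimal test file (not necessarily part of the generated template) that this configuration would pick up:

```python
# tests/test_sample.py -- illustrative only; shows the kind of plain pytest
# test the "python.testing.pytestArgs": ["."] setting would discover.

def trip_is_short(distance_miles: float) -> bool:
    """Toy helper mirroring the kind of filter used in the sample pipeline."""
    return distance_miles < 10


def test_trip_is_short() -> None:
    assert trip_is_short(2.5)
    assert not trip_is_short(12.0)
```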
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
# my_lakeflow_pipelines

The 'my_lakeflow_pipelines' project was generated using the Lakeflow Pipelines template.

## Setup

1. Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/databricks-cli.html

2. Authenticate to your Databricks workspace, if you have not done so already:
   ```
   $ databricks auth login
   ```

3. Optionally, install developer tools such as the Databricks extension for Visual Studio Code from
   https://docs.databricks.com/dev-tools/vscode-ext.html, or the PyCharm plugin from
   https://www.databricks.com/blog/announcing-pycharm-integration-databricks.


## Deploying resources

1. To deploy a development copy of this project, type:
   ```
   $ databricks bundle deploy --target dev
   ```
   (Note that "dev" is the default target, so the `--target` parameter
   is optional here.)

2. Similarly, to deploy a production copy, type:
   ```
   $ databricks bundle deploy --target prod
   ```

3. Use the "summary" command to review everything that was deployed:
   ```
   $ databricks bundle summary
   ```

4. To run a job or pipeline, use the "run" command:
   ```
   $ databricks bundle run
   ```
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
# This is a Databricks asset bundle definition for my_lakeflow_pipelines.
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
bundle:
  name: my_lakeflow_pipelines
  uuid: [UUID]

include:
  - resources/*.yml
  - resources/*/*.yml

# Variable declarations. These variables are assigned in the dev/prod targets below.
variables:
  catalog:
    description: The catalog to use
  schema:
    description: The schema to use
  notifications:
    description: The email addresses to use for failure notifications

targets:
  dev:
    # The default target uses 'mode: development' to create a development copy.
    # - Deployed resources get prefixed with '[dev my_user_name]'
    # - Any job schedules and triggers are paused by default.
    # See also https://docs.databricks.com/dev-tools/bundles/deployment-modes.html.
    mode: development
    default: true
    workspace:
      host: [DATABRICKS_URL]
    variables:
      catalog: main
      schema: ${workspace.current_user.short_name}
      notifications: []

  prod:
    mode: production
    workspace:
      host: [DATABRICKS_URL]
      # We explicitly deploy to /Workspace/Users/[USERNAME] to make sure we only have a single copy.
      root_path: /Workspace/Users/[USERNAME]/.bundle/${bundle.name}/${bundle.target}
    permissions:
      - user_name: [USERNAME]
        level: CAN_MANAGE
    variables:
      catalog: main
      schema: default
      notifications: [[USERNAME]]
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
.databricks/
build/
dist/
__pycache__/
*.egg-info
.venv/
**/explorations/**
!**/explorations/README.md
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
# my_lakeflow_pipelines_pipeline

This folder defines all source code for the my_lakeflow_pipelines_pipeline pipeline:

- `explorations`: Ad-hoc notebooks used to explore the data processed by this pipeline.
- `transformations`: All dataset definitions and transformations.
- `utilities` (optional): Utility functions and Python modules used in this pipeline.
- `data_sources` (optional): View definitions describing the source data for this pipeline.

## Getting Started

To get started, go to the `transformations` folder -- most of the relevant source code lives there:

* By convention, every dataset under `transformations` is in a separate file.
* Take a look at the sample under `sample_trips_my_lakeflow_pipelines.py` to get familiar with the syntax
  (a sketch of such a file is shown after this section).
  Read more about the syntax at https://docs.databricks.com/dlt/python-ref.html.
* Use `Run file` to run and preview a single transformation.
* Use `Run pipeline` to run _all_ transformations in the entire pipeline.
* Use `+ Add` in the file browser to add a new dataset definition.
* Use `Schedule` to run the pipeline on a schedule!

For more tutorials and reference material, see https://docs.databricks.com/dlt.
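For orientation, here is a minimal sketch of what a transformation file such as `sample_trips_my_lakeflow_pipelines.py` could look like (the generated sample may differ; the source table is Databricks' built-in sample data):

```python
import dlt
from pyspark.sql import SparkSession, functions as F

# Inside a pipeline run `spark` is already defined; getActiveSession() keeps
# this sketch usable as a standalone file as well.
spark = SparkSession.getActiveSession()


@dlt.table(comment="Sample NYC taxi trips, limited to short rides.")
def sample_trips_my_lakeflow_pipelines():
    return (
        spark.read.table("samples.nyctaxi.trips")  # built-in sample dataset
        .filter(F.col("trip_distance") < 10)
    )
```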
@@ -0,0 +1,63 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "[UUID]",
     "showTitle": false,
     "tableResultSettingsMap": {},
     "title": ""
    }
   },
   "source": [
    "### Example Exploratory Notebook\n",
    "\n",
    "Use this notebook to explore the data generated by the pipeline in your preferred programming language.\n",
    "\n",
    "**Note**: This notebook is not executed as part of the pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "[UUID]",
     "showTitle": false,
     "tableResultSettingsMap": {},
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "# !!! Before performing any data analysis, make sure to run the pipeline to materialize the sample datasets. The tables referenced in this notebook depend on that step.\n",
    "\n",
    "display(spark.sql(\"SELECT * FROM main.[USERNAME].my_lakeflow_pipelines\"))"
   ]
  }
 ],
 "metadata": {
  "application/vnd.databricks.v1+notebook": {
   "computePreferences": null,
   "dashboards": [],
   "environmentMetadata": null,
   "inputWidgetPreferences": null,
   "language": "python",
   "notebookMetadata": {
    "pythonIndentUnit": 2
   },
   "notebookName": "sample_exploration",
   "widgets": {}
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
