
Commit 1ee4ede

Add acceptance tests
1 parent 2bcd816 commit 1ee4ede

28 files changed, +600 -0 lines changed
@@ -0,0 +1,3 @@
# Typings for Pylance in Visual Studio Code
# see https://github.com/microsoft/pyright/blob/main/docs/builtins.md
from databricks.sdk.runtime import *
@@ -0,0 +1,7 @@
{
  "recommendations": [
    "databricks.databricks",
    "ms-python.vscode-pylance",
    "redhat.vscode-yaml"
  ]
}
@@ -0,0 +1,21 @@
{
  "python.analysis.stubPath": ".vscode",
  "databricks.python.envFile": "${workspaceFolder}/.env",
  "jupyter.interactiveWindow.cellMarker.codeRegex": "^# COMMAND ----------|^# Databricks notebook source|^(#\\s*%%|#\\s*\\<codecell\\>|#\\s*In\\[\\d*?\\]|#\\s*In\\[ \\])",
  "jupyter.interactiveWindow.cellMarker.default": "# COMMAND ----------",
  "python.testing.pytestArgs": [
    "."
  ],
  "python.testing.unittestEnabled": false,
  "python.testing.pytestEnabled": true,
  "python.analysis.extraPaths": ["assets/etl_pipeline"],
  "files.exclude": {
    "**/*.egg-info": true,
    "**/__pycache__": true,
    ".pytest_cache": true,
  },
  "[python]": {
    "editor.defaultFormatter": "ms-python.black-formatter",
    "editor.formatOnSave": true,
  },
}
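The `jupyter.interactiveWindow.cellMarker.codeRegex` setting above lets VS Code's interactive window split a plain Databricks notebook source file into cells. For illustration only (this file is not part of the commit), a minimal sketch of a source file that the regex would recognize:

```
# Databricks notebook source
# Hypothetical example: each "# COMMAND ----------" line starts a new cell in VS Code.

print("first cell")

# COMMAND ----------

print("second cell")
```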
@@ -0,0 +1,41 @@
# my_lakeflow_pipelines

The 'my_lakeflow_pipelines' project was generated by using the Lakeflow template.

## Setup

1. Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/databricks-cli.html

2. Authenticate to your Databricks workspace, if you have not done so already:
   ```
   $ databricks auth login
   ```

3. Optionally, install developer tools such as the Databricks extension for Visual Studio Code from
   https://docs.databricks.com/dev-tools/vscode-ext.html, or the PyCharm plugin from
   https://www.databricks.com/blog/announcing-pycharm-integration-databricks.


## Deploying resources

1. To deploy a development copy of this project, type:
   ```
   $ databricks bundle deploy --target dev
   ```
   (Note that "dev" is the default target, so the `--target` parameter
   is optional here.)

2. Similarly, to deploy a production copy, type:
   ```
   $ databricks bundle deploy --target prod
   ```

3. Use the "summary" command to review everything that was deployed:
   ```
   $ databricks bundle summary
   ```

4. To run a job or pipeline, use the "run" command:
   ```
   $ databricks bundle run
   ```
@@ -0,0 +1,49 @@
# This is a Databricks asset bundle definition for my_lakeflow_pipelines.
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
bundle:
  name: my_lakeflow_pipelines
  uuid: [UUID]

include:
  - resources/*.yml
  - resources/*/*.yml

# Variable declarations. These variables are assigned in the dev/prod targets below.
variables:
  catalog:
    description: The catalog to use
  schema:
    description: The schema to use
  notifications:
    description: The email addresses to use for failure notifications

targets:
  dev:
    # The default target uses 'mode: development' to create a development copy.
    # - Deployed resources get prefixed with '[dev my_user_name]'
    # - Any job schedules and triggers are paused by default.
    # See also https://docs.databricks.com/dev-tools/bundles/deployment-modes.html.
    mode: development
    default: true
    workspace:
      host: [DATABRICKS_URL]
    variables:
      catalog: main
      schema: ${workspace.current_user.short_name}
      notifications: []

  prod:
    mode: production
    workspace:
      host: [DATABRICKS_URL]
      # We explicitly specify /Workspace/Users/[USERNAME] to make sure we only have a single copy.
      root_path: /Workspace/Users/[USERNAME]/.bundle/${bundle.name}/${bundle.target}
    permissions:
      - user_name: [USERNAME]
        level: CAN_MANAGE
    run_as:
      user_name: [USERNAME]
    variables:
      catalog: main
      schema: default
      notifications: [[USERNAME]]
@@ -0,0 +1,8 @@
.databricks/
build/
dist/
__pycache__/
*.egg-info
.venv/
**/explorations/**
**/!explorations/README.md
@@ -0,0 +1,22 @@
# my_lakeflow_pipelines_pipeline

This folder defines all source code for the my_lakeflow_pipelines_pipeline pipeline:

- `explorations`: Ad-hoc notebooks used to explore the data processed by this pipeline.
- `transformations`: All dataset definitions and transformations.
- `utilities`: Utility functions and Python modules used in this pipeline.
- `data_sources` (optional): View definitions describing the source data for this pipeline.

## Getting Started

To get started, go to the `transformations` folder -- most of the relevant source code lives there:

* By convention, every dataset under `transformations` is in a separate file.
* Take a look at the sample under "sample_trips_my_lakeflow_pipelines.py" to get familiar with the syntax.
  Read more about the syntax at https://docs.databricks.com/dlt/python-ref.html.
* Use `Run file` to run and preview a single transformation.
* Use `Run pipeline` to run _all_ transformations in the entire pipeline.
* Use `+ Add` in the file browser to add a new dataset definition.
* Use `Schedule` to run the pipeline on a schedule!

For more tutorials and reference material, see https://docs.databricks.com/dlt.
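Following the one-dataset-per-file convention described in this README, an additional transformation file could look like the sketch below. The dataset name, source table, and filter are illustrative assumptions, not part of this commit:

```
import dlt
from pyspark.sql.functions import col


# Hypothetical extra dataset definition under `transformations/`.
@dlt.table
def filtered_trips_example():
    # `spark` is provided by the pipeline runtime, as in the sample transformations.
    return (
        spark.read.table("samples.nyctaxi.trips")
        .where(col("trip_distance") > 0)
    )
```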
@@ -0,0 +1,63 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "[UUID]",
     "showTitle": false,
     "tableResultSettingsMap": {},
     "title": ""
    }
   },
   "source": [
    "### Example Exploratory Notebook\n",
    "\n",
    "Use this notebook to explore the data generated by the pipeline in your preferred programming language.\n",
    "\n",
    "**Note**: This notebook is not executed as part of the pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "[UUID]",
     "showTitle": false,
     "tableResultSettingsMap": {},
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "# !!! Before performing any data analysis, make sure to run the pipeline to materialize the sample datasets. The tables referenced in this notebook depend on that step.\n",
    "\n",
    "display(spark.sql(\"SELECT * FROM main.[USERNAME].my_lakeflow_pipelines\"))"
   ]
  }
 ],
 "metadata": {
  "application/vnd.databricks.v1+notebook": {
   "computePreferences": null,
   "dashboards": [],
   "environmentMetadata": null,
   "inputWidgetPreferences": null,
   "language": "python",
   "notebookMetadata": {
    "pythonIndentUnit": 2
   },
   "notebookName": "sample_exploration",
   "widgets": {}
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
@@ -0,0 +1,19 @@
# The job that triggers my_lakeflow_pipelines_pipeline.
resources:
  jobs:
    my_lakeflow_pipelines_job:
      name: my_lakeflow_pipelines_job

      trigger:
        # Run this job every day, exactly one day from the last run; see https://docs.databricks.com/api/workspace/jobs/create#trigger
        periodic:
          interval: 1
          unit: DAYS

      email_notifications:
        on_failure: ${var.notifications}

      tasks:
        - task_key: refresh_pipeline
          pipeline_task:
            pipeline_id: ${resources.pipelines.my_lakeflow_pipelines_pipeline.id}
@@ -0,0 +1,14 @@
resources:
  pipelines:
    my_lakeflow_pipelines_pipeline:
      name: my_lakeflow_pipelines_pipeline
      serverless: true
      continuous: false
      channel: "PREVIEW"
      photon: true
      catalog: ${var.catalog}
      schema: ${var.schema}
      root_path: "."
      libraries:
        - glob:
            include: transformations/**
@@ -0,0 +1,16 @@
import dlt
from pyspark.sql.functions import col
from utilities import utils


# This file defines a sample transformation.
# Edit the sample below or add new transformations
# using "+ Add" in the file browser.


@dlt.table
def sample_trips_my_lakeflow_pipelines():
    return (
        spark.read.table("samples.nyctaxi.trips")
        .withColumn("trip_distance_km", utils.distance_km(col("trip_distance")))
    )
@@ -0,0 +1,19 @@
import dlt
from pyspark.sql.functions import col, sum


# This file defines a sample transformation.
# Edit the sample below or add new transformations
# using "+ Add" in the file browser.


@dlt.table
def sample_zones_my_lakeflow_pipelines():
    # Read from the "sample_trips" table, then sum all the fares
    return (
        spark.read.table("sample_trips_my_lakeflow_pipelines")
        .groupBy(col("pickup_zip"))
        .agg(
            sum("fare_amount").alias("total_fare")
        )
    )
@@ -0,0 +1,8 @@
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType


@udf(returnType=FloatType())
def distance_km(distance_miles):
    """Convert distance from miles to kilometers (1 mile = 1.60934 km)."""
    return distance_miles * 1.60934
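Since the VS Code settings in this commit enable pytest, a unit test for this helper could look like the following sketch. The test function name, the local SparkSession setup, and the import path (assuming the pipeline root is on the Python path) are assumptions, not part of this commit:

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

from utilities import utils


def test_distance_km_converts_miles_to_km():
    # Hypothetical test: run the UDF through a local Spark session and
    # check the miles-to-kilometers conversion within float precision.
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([(1.0,), (10.0,)], ["trip_distance"])
    rows = df.select(utils.distance_km(col("trip_distance")).alias("km")).collect()
    assert abs(rows[0]["km"] - 1.60934) < 1e-4
    assert abs(rows[1]["km"] - 16.0934) < 1e-4
```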
@@ -0,0 +1,6 @@
{
  "project_name": "my_lakeflow_pipelines",
  "default_catalog": "main",
  "personal_schemas": "yes",
  "language": "sql"
}
@@ -0,0 +1,29 @@

>>> [CLI] bundle init lakeflow-pipelines --config-file ./input.json --output-dir output

Welcome to the template for Lakeflow Declarative Pipelines!


Your new project has been created in the 'my_lakeflow_pipelines' directory!

Refer to the README.md file for "getting started" instructions!

>>> [CLI] bundle validate -t dev
Name: my_lakeflow_pipelines
Target: dev
Workspace:
  Host: [DATABRICKS_URL]
  User: [USERNAME]
  Path: /Workspace/Users/[USERNAME]/.bundle/my_lakeflow_pipelines/dev

Validation OK!

>>> [CLI] bundle validate -t prod
Name: my_lakeflow_pipelines
Target: prod
Workspace:
  Host: [DATABRICKS_URL]
  User: [USERNAME]
  Path: /Workspace/Users/[USERNAME]/.bundle/my_lakeflow_pipelines/prod

Validation OK!
@@ -0,0 +1,3 @@
# Typings for Pylance in Visual Studio Code
# see https://github.com/microsoft/pyright/blob/main/docs/builtins.md
from databricks.sdk.runtime import *
@@ -0,0 +1,7 @@
{
  "recommendations": [
    "databricks.databricks",
    "ms-python.vscode-pylance",
    "redhat.vscode-yaml"
  ]
}
@@ -0,0 +1,21 @@
{
  "python.analysis.stubPath": ".vscode",
  "databricks.python.envFile": "${workspaceFolder}/.env",
  "jupyter.interactiveWindow.cellMarker.codeRegex": "^# COMMAND ----------|^# Databricks notebook source|^(#\\s*%%|#\\s*\\<codecell\\>|#\\s*In\\[\\d*?\\]|#\\s*In\\[ \\])",
  "jupyter.interactiveWindow.cellMarker.default": "# COMMAND ----------",
  "python.testing.pytestArgs": [
    "."
  ],
  "python.testing.unittestEnabled": false,
  "python.testing.pytestEnabled": true,
  "python.analysis.extraPaths": ["assets/etl_pipeline"],
  "files.exclude": {
    "**/*.egg-info": true,
    "**/__pycache__": true,
    ".pytest_cache": true,
  },
  "[python]": {
    "editor.defaultFormatter": "ms-python.black-formatter",
    "editor.formatOnSave": true,
  },
}
