
Commit 3e37836

Merge pull request #11 from frizzleqq/pytester
improve pytest usage
2 parents c9ad0fb + d51a7a2 commit 3e37836

8 files changed

Lines changed: 78 additions & 157 deletions


.github/workflows/ci.yml

Lines changed: 5 additions & 1 deletion
@@ -32,7 +32,7 @@ jobs:
           java-version: 17
           distribution: "zulu"
       - name: Install the project
-        run: uv sync --locked --extra dev_local
+        run: uv sync --locked --extra dev
       - name: Run code checks
         run: uv run ruff check
       - name: Check code formatting
@@ -65,6 +65,10 @@ jobs:
           version: 0.260.0
       - name: Install the project
         run: uv sync --locked --extra dev
+      - name: Install Databricks Connect
+        run: |
+          uv pip uninstall pyspark
+          uv pip install databricks-connect==16.3.5
       - name: Check Databricks CLI
         run: databricks current-user me
       - name: Run tests
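
Standalone `pyspark` is uninstalled first because `databricks-connect` ships its own copy of the `pyspark` modules and the two distributions cannot coexist in one environment. A minimal sketch of a post-install sanity check, assuming only the standard library (`importlib.metadata`); the function name is illustrative and not part of this repository:

```python
# Sketch: verify that databricks-connect, not standalone pyspark, is installed.
from importlib.metadata import PackageNotFoundError, version


def check_connect_environment() -> None:
    try:
        print("databricks-connect", version("databricks-connect"))
    except PackageNotFoundError:
        raise SystemExit("databricks-connect is not installed")
    try:
        # A standalone pyspark distribution conflicts with the modules bundled in databricks-connect.
        print("unexpected standalone pyspark", version("pyspark"))
    except PackageNotFoundError:
        print("no standalone pyspark (expected)")


if __name__ == "__main__":
    check_connect_environment()
```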

README.md

Lines changed: 28 additions & 22 deletions
@@ -2,20 +2,18 @@

 This project is an example implementation of a [Databricks Asset Bundle](https://docs.databricks.com/aws/en/dev-tools/bundles/) using a [Databricks Free Edition](https://www.databricks.com/learn/free-edition) workspace.

-The project ist configured using `pyproject.toml` (Python specifics) and `databricks.yaml` (Databricks Bundle specifics) and uses [uv](https://docs.astral.sh/uv/) to manage the Python project and dependencies.
+The project is configured using `pyproject.toml` (Python specifics) and `databricks.yaml` (Databricks Bundle specifics) and uses [uv](https://docs.astral.sh/uv/) to manage the Python project and dependencies.

-## Repo Overview
+## Repository Structure

-* `.github/workflows`: CI/CD jobs to test and dpeloy bundle
-* `dab_project`: Python project (Used in Databricks Workflow as Python-Wheel-Task)
-* `dbt`: [dbt](https://github.com/dbt-labs/dbt-core) project (Used in Databricks Workflow as dbt-Task)
-  * dbt-Models used from https://github.com/dbt-labs/jaffle_shop_duckdb
-* `resources`: Resources such as Databricks Workflows or Databricks Volumes/Schemas
-  * Python-based workflow: https://docs.databricks.com/aws/en/dev-tools/bundles/python
-  * YAML-based Workflow: https://docs.databricks.com/aws/en/dev-tools/bundles/resources#job
-* `scripts`: Python script to setup groups, service principals and catalogs used in a Databricks (Free Edition) workspace
-* `tests`: Unit-tests running on Databricks (via Connect) or locally
-  * Used in [ci.yml](.github/workflows/ci.yml) jobs
+| Directory | Description |
+|-----------|-------------|
+| `.github/workflows` | CI/CD jobs to test and deploy bundle |
+| `dab_project` | Python project (Used in Databricks Workflow as Python-Wheel-Task) |
+| `dbt` | [dbt](https://github.com/dbt-labs/dbt-core) project<br/>* Used in Databricks Workflow as dbt-Task<br/>* dbt-Models used from https://github.com/dbt-labs/jaffle_shop_duckdb |
+| `resources` | Resources such as Databricks Workflows or Databricks Volumes/Schemas<br/>* Python-based workflow: https://docs.databricks.com/aws/en/dev-tools/bundles/python<br/>* YAML-based Workflow: https://docs.databricks.com/aws/en/dev-tools/bundles/resources#job |
+| `scripts` | Python script to setup groups, service principals and catalogs used in a Databricks (Free Edition) workspace |
+| `tests` | Unit-tests running on Databricks (via Connect) or locally<br/>* Used in [ci.yml](.github/workflows/ci.yml) jobs |

 ## Databricks Workspace

@@ -52,7 +50,7 @@ Sync entire `uv` environment with dev dependencies:
 uv sync --extra dev
 ```

-> **Note:** `dev` uses Databricks Connect, while `dev_local` uses local Spark
+> **Note:** we install Databricks Connect in a follow-up step

 #### (Optional) Activate virtual environment

@@ -66,30 +64,38 @@ Windows:
 .venv\Scripts\activate
 ```

+### Databricks Connect
+
+Install `databricks-connect` in active environment. This requires authentication being set up via Databricks CLI.
+
+```bash
+uv pip uninstall pyspark
+uv pip install databricks-connect==16.3.5
+```
+> **Note:** For Databricks Runtime 16.3
+
+See https://docs.databricks.com/aws/en/dev-tools/vscode-ext/ for using Databricks Connect extension in VS Code.
+
 ### Unit-Tests

 ```bash
 uv run pytest -v
 ```

-Based on whether Databricks Connect (the `dev` default) is enabled or not the Unit-Tests try to use a Databricks Cluster or start a local Spark session with Delta support.
-* On Databricks the unit-tests currently assume the catalog `unit_tests` exists (not ideal).
+Based on whether Databricks Connect is enabled or not the Unit-Tests try to use a Databricks Cluster or start a local Spark session with Delta support.
+* On Databricks the unit-tests currently assume the catalog `lake_dev` exists.

 > **Note:** For local Spark Java is required. On Windows Spark/Delta requires HADOOP libraries and generally does not run well, opt for `wsl` instead.

 ### Checks

 ```bash
 # Linting
-ruff check --fix
+uv run ruff check --fix
 # Formatting
-ruff format
+uv run ruff format
 ```

-### Databricks Connect
-
-See https://docs.databricks.com/aws/en/dev-tools/vscode-ext/ for using Databricks Connect extension in VS Code.
-
 ### Setup Databricks Workspace

 The following script sets up a Databricks (Free Edition) Workspace for this project with additional catalogs, groups and service principals. It uses both Databricks-SDK and Databricks Connect (Serverless).
@@ -150,7 +156,7 @@ uv run ./scripts/setup_workspace.py
 The `dbt` project is based on https://github.com/dbt-labs/jaffle_shop_duckdb with following changes:

 * Schema bronze, silver, gold
-* document materialization `use_materialization_v2`
+* documented materialization `use_materialization_v2`
 * Primary, Foreign Key Constraints

 ## TODO:
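
The reworked README states that, when Databricks Connect is active, the tests assume a catalog named `lake_dev` already exists. A guard for that assumption could look like the sketch below; the `require_catalog` helper is hypothetical (not part of this commit), and the `catalog` column name follows the usual `SHOW CATALOGS` output on Databricks:

```python
import pytest


def require_catalog(spark, name: str = "lake_dev") -> None:
    """Hypothetical helper: skip Databricks-backed tests if the expected catalog is missing."""
    catalogs = [row.catalog for row in spark.sql("SHOW CATALOGS").collect()]
    if name not in catalogs:
        pytest.skip(f"Catalog '{name}' not found; create it or change catalog_name")
```

Called once from a session-scoped fixture, this would turn a missing catalog into a skipped test run instead of an opaque analysis error.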

pyproject.toml

Lines changed: 1 addition & 15 deletions
@@ -21,21 +21,7 @@ dependencies = [

 [project.optional-dependencies]
 dev = [
-    # Databricks Runtime (connect includes delta/pyspark)
-    "databricks-connect~=16.3.0",
-    "pydantic==2.8.2",
-    # dbt
-    "dbt-databricks~=1.10.0",
-    # Tooling
-    "databricks-bundles~=0.260.0",  # For Python-based Workflows
-    "mypy",  # Type hints
-    "pip",  # Databricks extension needs it
-    "pytest",  # Unit testing
-    "ruff",  # Linting/Formatting
-]
-# Is this really needed?
-dev_local = [
-    # Databricks Runtime (connect includes delta/pyspark)
+    # Runtime
     "delta-spark>=3.3.0, <4.0.0",
     "pydantic==2.8.2",
     "pyspark>=3.5.0, <4.0.0",

resources/constants.py

Lines changed: 2 additions & 1 deletion
@@ -21,7 +21,8 @@ class Variables:
 DEFAULT_ENVIRONMENT = JobEnvironment(
     environment_key="default",
     spec=Environment(
-        environment_version=Variables.serverless_environment_version, dependencies=["./dist/*.whl"]
+        environment_version=Variables.serverless_environment_version,
+        dependencies=["./dist/dab_project*.whl"],
     ),
 )
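
The wheel pattern is narrowed from `./dist/*.whl` to `./dist/dab_project*.whl`, presumably so the serverless job environment only picks up the project's own wheel even if other wheels land in `dist/`. A small illustration of the pattern semantics, using a hypothetical `dist/` listing and shell-style matching:

```python
import fnmatch

# Hypothetical dist/ contents after a build that also pulled in another wheel.
dist = ["dab_project-0.1.0-py3-none-any.whl", "some_dependency-2.0.0-py3-none-any.whl"]

print(fnmatch.filter(dist, "*.whl"))             # matches both wheels
print(fnmatch.filter(dist, "dab_project*.whl"))  # matches only the project wheel
```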

tests/conftest.py

Lines changed: 31 additions & 5 deletions
@@ -1,5 +1,6 @@
 import shutil
 import tempfile
+import uuid
 from pathlib import Path
 from typing import Generator, Optional

@@ -23,7 +24,7 @@ def spark() -> Generator[SparkSession, None, None]:
         yield spark
     else:
         # If databricks-connect is not installed, we use use local Spark session
-        warehouse_dir = tempfile.TemporaryDirectory().name
+        warehouse_dir = tempfile.mkdtemp()
         _builder = (
             SparkSession.builder.master("local[*]")
             .config("spark.hive.metastore.warehouse.dir", Path(warehouse_dir).as_uri())
@@ -46,13 +47,38 @@ def spark() -> Generator[SparkSession, None, None]:


 @pytest.fixture(scope="session")
-def catalog_name() -> Generator[Optional[str], None, None]:
+def catalog_name() -> Optional[str]:
     """Fixture to provide the catalog name for tests.

-    In Databricks, we use the "unit_tests" catalog.
+    In Databricks, we use the "lake_dev" catalog.
     Locally we run without a catalog, so we return None.
     """
     if DATABRICKS_CONNECT_AVAILABLE:
-        yield "unit_tests"
+        return "lake_dev"
     else:
-        yield None
+        return None
+
+
+@pytest.fixture(scope="module")
+def create_schema(spark, catalog_name, request) -> Generator[str, None, None]:
+    """Fixture to provide a schema for tests.
+
+    Creates a schema with a random name prefixed with the test module name and cleans it up after tests.
+    """
+    module_name = request.module.__name__.split(".")[-1]  # Get just the module name without path
+    schema_name = f"pytest_{module_name}_{uuid.uuid4().hex[:8]}"
+
+    if catalog_name is not None:
+        full_schema_name = f"{catalog_name}.{schema_name}"
+    else:
+        full_schema_name = schema_name
+
+    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {full_schema_name}")
+    yield schema_name
+    spark.sql(f"DROP SCHEMA IF EXISTS {full_schema_name} CASCADE")
+
+
+@pytest.fixture(scope="function")
+def table_name(request) -> str:
+    """Fixture to provide a table name based on the test function name."""
+    return request.node.name
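
The fixtures above branch on `DATABRICKS_CONNECT_AVAILABLE`, which is defined earlier in `conftest.py` and not shown in this hunk. A common way to derive such a flag is an import probe; the snippet below is a sketch of that pattern, not the committed code:

```python
# Sketch: probe for the Databricks Connect session class to decide which backend to use.
try:
    from databricks.connect import DatabricksSession  # noqa: F401

    DATABRICKS_CONNECT_AVAILABLE = True
except ImportError:
    DATABRICKS_CONNECT_AVAILABLE = False
```

Note that `create_schema` is module-scoped, so every test module gets one isolated `pytest_<module>_<uuid>` schema that is dropped again after the module's tests finish.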

tests/test_base_task.py

Lines changed: 5 additions & 5 deletions
@@ -21,18 +21,18 @@ def _perform_task(self, catalog_name: str) -> None:
     return Task.create_task_factory("TestTask")


-def test_etl_task_run(spark, catalog_name, request):
+def test_etl_task_run(spark, catalog_name, create_schema, table_name):
     task = generate_test_task(
-        schema_name=__name__,
-        table_name=f"table_{request.node.name}",
+        schema_name=create_schema,
+        table_name=table_name,
     )
     task.run(catalog_name)

     # Verify that the data was written to the Delta table
     delta_table = DeltaWorker(
         catalog_name=catalog_name,
-        schema_name=__name__,
-        table_name=f"table_{request.node.name}",
+        schema_name=create_schema,
+        table_name=table_name,
     )

     assert task.get_class_name() == "TestTask"

tests/test_delta.py

Lines changed: 3 additions & 3 deletions
@@ -13,7 +13,7 @@
 # spark.sql(f"DROP SCHEMA IF EXISTS {schema_name} CASCADE")


-def test_deltawriter_create_table_if_not_exists(spark, catalog_name, request):
+def test_deltawriter_create_table_if_not_exists(spark, catalog_name, create_schema, table_name):
     schema = T.StructType(
         [
             T.StructField("key", T.IntegerType()),
@@ -22,8 +22,8 @@ def test_deltawriter_create_table_if_not_exists(spark, catalog_name, request):
     )
     delta_writer = DeltaWorker(
         catalog_name=catalog_name,
-        schema_name=__name__,
-        table_name=f"table_{request.node.name}",
+        schema_name=create_schema,
+        table_name=table_name,
     )

     delta_writer.drop_table_if_exists()
