Commit 847f094
committed

SQLFrame: Add basic and all-types examples

SQLFrame [1] implements the PySpark [2] DataFrame API in order to enable
running transformation pipelines directly on database engines - no Spark
clusters or dependencies required.

[1] https://pypi.org/project/sqlframe/
[2] https://spark.apache.org/docs/latest/api/python/

1 parent 4efcd1c, commit 847f094
11 files changed: +543 −0 lines changed
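The idea named in the commit message, translating PySpark-style DataFrame calls into SQL that runs directly on the database engine, can be illustrated with a toy sketch. This is not SQLFrame's implementation, only a minimal mock of the API shape (`where`/`sort`/`limit` chaining that emits SQL text):

```python
# Toy illustration of the DataFrame-to-SQL translation idea behind SQLFrame.
# NOT SQLFrame's actual code; it only mimics the shape of the chained API.

class ToyDataFrame:
    def __init__(self, table):
        self.table = table
        self.filters = []    # accumulated WHERE conditions
        self.order = None    # ORDER BY expression
        self.n = None        # LIMIT value

    def where(self, cond):
        self.filters.append(cond)
        return self

    def sort(self, expr, desc=False):
        self.order = f"{expr} DESC" if desc else expr
        return self

    def limit(self, n):
        self.n = n
        return self

    def sql(self):
        # Render the recorded operations as a single SQL statement.
        parts = [f"SELECT * FROM {self.table}"]
        if self.filters:
            parts.append("WHERE " + " AND ".join(self.filters))
        if self.order:
            parts.append(f"ORDER BY {self.order}")
        if self.n is not None:
            parts.append(f"LIMIT {self.n}")
        return " ".join(parts)


query = (
    ToyDataFrame("sys.summits")
    .where("region ILIKE 'ortler%'")
    .sort("height", desc=True)
    .limit(3)
    .sql()
)
print(query)
# → SELECT * FROM sys.summits WHERE region ILIKE 'ortler%' ORDER BY height DESC LIMIT 3
```

SQLFrame does the real version of this via SQLGlot, generating dialect-aware SQL instead of naive string concatenation.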

.github/dependabot.yml (+5 lines)

```diff
@@ -33,6 +33,11 @@ updates:
     schedule:
       interval: "daily"

+  - directory: "/by-dataframe/sqlframe"
+    package-ecosystem: "pip"
+    schedule:
+      interval: "daily"
+
   - directory: "/by-language/csharp-npgsql"
     package-ecosystem: "nuget"
     schedule:
```
.github/workflows: SQLFrame workflow (new file, +74 lines)

```yaml
name: SQLFrame

on:
  pull_request:
    branches: ~
    paths:
    - 'dataframe-sqlframe.yml'
    - 'by-dataframe/sqlframe/**'
    - '/requirements.txt'
  push:
    branches: [ main ]
    paths:
    - 'dataframe-sqlframe.yml'
    - 'by-dataframe/sqlframe/**'
    - '/requirements.txt'

  # Allow job to be triggered manually.
  workflow_dispatch:

  # Run job each night after CrateDB nightly has been published.
  schedule:
    - cron: '0 3 * * *'

# Cancel in-progress jobs when pushing to the same branch.
concurrency:
  cancel-in-progress: true
  group: ${{ github.workflow }}-${{ github.ref }}

jobs:
  test:
    name: "
      Python: ${{ matrix.python-version }}
      CrateDB: ${{ matrix.cratedb-version }}
      on ${{ matrix.os }}"
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ 'ubuntu-latest' ]
        python-version: [ '3.9', '3.13' ]
        cratedb-version: [ 'nightly' ]

    services:
      cratedb:
        image: crate/crate:${{ matrix.cratedb-version }}
        ports:
          - 4200:4200
          - 5432:5432
        env:
          CRATE_HEAP_SIZE: 4g

    steps:

      - name: Acquire sources
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          architecture: x64
          cache: 'pip'
          cache-dependency-path: |
            requirements.txt
            by-dataframe/sqlframe/requirements.txt
            by-dataframe/sqlframe/requirements-test.txt

      - name: Install utilities
        run: |
          pip install -r requirements.txt

      - name: Validate by-dataframe/sqlframe
        run: |
          ngr test --accept-no-venv by-dataframe/sqlframe
```
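The CrateDB service container from the workflow above can be mirrored locally for running the tests outside CI. A sketch, assuming Docker is installed; image, ports, and heap size are taken from the workflow's service definition, with the `nightly` tag substituted for the matrix variable:

```shell
# Start a single-node CrateDB, mirroring the CI service definition:
# HTTP API on 4200, PostgreSQL wire protocol on 5432 (used by psycopg2/SQLFrame).
docker run --rm -it \
  -p 4200:4200 -p 5432:5432 \
  -e CRATE_HEAP_SIZE=4g \
  crate/crate:nightly
```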

by-dataframe/sqlframe/.gitignore (new file, +1 line)

```
*.csv
```
by-dataframe/sqlframe/README.md (new file, +77 lines)

# Verify the `sqlframe` library with CrateDB

Turning PySpark into a universal DataFrame API.

## About

This folder includes software integration tests for verifying
that the [SQLFrame] Python library works well together with [CrateDB].

SQLFrame implements the [PySpark] DataFrame API in order to enable running
transformation pipelines directly on database engines - no Spark clusters
or dependencies required.

## What's Inside

- `example_basic.py`: A few examples that read CrateDB's `sys.summits` table,
  plus an example that inquires about existing tables.

- `example_types.py`: An example that exercises all data types supported by
  CrateDB.

## Synopsis

```shell
pip install --upgrade sqlframe
```
```python
from psycopg2 import connect
from sqlframe import activate
from sqlframe.base.functions import col

# Define database connection parameters, suitable for CrateDB on localhost.
# For CrateDB Cloud, use `crate://<username>:<password>@<host>`.
conn = connect(
    dbname="crate",
    user="crate",
    password="",
    host="localhost",
    port="5432",
)
# Activate SQLFrame to run directly on CrateDB.
activate("postgres", conn=conn)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Invoke query.
df = spark.sql(
    spark.table("sys.summits")
    .where(col("region").ilike("ortler%"))
    .sort(col("height").desc())
    .limit(3)
)
print(df.sql())
df.show()
```

## Tests

Set up the sandbox and install packages.
```bash
pip install uv
uv venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt -r requirements-test.txt
```

Run the integration tests.
```bash
pytest
```


[CrateDB]: https://cratedb.com/database
[PySpark]: https://spark.apache.org/docs/latest/api/python/
[SQLFrame]: https://pypi.org/project/sqlframe/
by-dataframe/sqlframe/example_basic.py (new file, +98 lines)

```python
"""
Using `sqlframe` with CrateDB: Basic usage.

pip install --upgrade sqlframe

A few basic operations using the `sqlframe` library with CrateDB.

- https://pypi.org/project/sqlframe/
"""

from psycopg2 import connect
from sqlframe import activate
from sqlframe.base.functions import col

from patch import monkeypatch


def connect_spark():
    # Connect to database.
    conn = connect(
        dbname="crate",
        user="crate",
        password="",
        host="localhost",
        port="5432",
    )
    # Activate SQLFrame to run directly on CrateDB.
    activate("postgres", conn=conn)

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    return spark


def sqlframe_select_sys_summits():
    """
    Query CrateDB's built-in `sys.summits` table.
    """
    spark = connect_spark()
    df = spark.sql(
        spark.table("sys.summits")
        .where(col("region").ilike("ortler%"))
        .sort(col("height").desc())
        .limit(3)
    )
    print(df.sql())
    df.show()
    return df


def sqlframe_export_sys_summits_pandas():
    """
    Query CrateDB's built-in `sys.summits` table, returning a pandas dataframe.
    """
    spark = connect_spark()
    df = spark.sql(
        spark.table("sys.summits")
        .where(col("region").ilike("ortler%"))
        .sort(col("height").desc())
        .limit(3)
    ).toPandas()
    return df


def sqlframe_export_sys_summits_csv():
    """
    Query CrateDB's built-in `sys.summits` table, saving the output to CSV.
    """
    spark = connect_spark()
    df = spark.sql(
        spark.table("sys.summits")
        .where(col("region").ilike("ortler%"))
        .sort(col("height").desc())
        .limit(3)
    )
    df.write.csv("summits.csv", mode="overwrite")
    return df


def sqlframe_get_table_names():
    """
    Inquire table names of the system schema `sys`.
    """
    spark = connect_spark()
    tables = spark.catalog.listTables(dbName="sys")
    return tables


monkeypatch()


if __name__ == "__main__":
    print(sqlframe_select_sys_summits())
    print(sqlframe_export_sys_summits_pandas())
    print(sqlframe_export_sys_summits_csv())
    print(sqlframe_get_table_names())
```