Commit 4e456de

Add SQLAlchemy document loader
1 parent 4369777 commit 4e456de

11 files changed

+958
-0
lines changed

Lines changed: 237 additions & 0 deletions
@@ -0,0 +1,237 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# SQLAlchemy\n",
8+
"\n",
9+
"This notebook demonstrates how to load documents from an [SQLite] database,\n",
10+
"using the [SQLAlchemy] document loader.\n",
11+
"\n",
12+
"It loads the result of a database query with one document per row.\n",
13+
"\n",
14+
"[SQLAlchemy]: https://www.sqlalchemy.org/\n",
15+
"[SQLite]: https://sqlite.org/"
16+
]
17+
},
18+
{
19+
"cell_type": "markdown",
20+
"source": [
21+
"## Prerequisites"
22+
],
23+
"metadata": {
24+
"collapsed": false
25+
}
26+
},
27+
{
28+
"cell_type": "code",
29+
"execution_count": 33,
30+
"metadata": {
31+
"tags": []
32+
},
33+
"outputs": [],
34+
"source": [
35+
"#!pip install langchain termsql"
36+
]
37+
},
38+
{
39+
"cell_type": "markdown",
40+
"source": [
41+
"Provide the input data as an SQLite database."
42+
],
43+
"metadata": {
44+
"collapsed": false
45+
}
46+
},
47+
{
48+
"cell_type": "code",
49+
"execution_count": 34,
50+
"metadata": {
51+
"tags": []
52+
},
53+
"outputs": [
54+
{
55+
"name": "stdout",
56+
"output_type": "stream",
57+
"text": [
58+
"Overwriting example.csv\n"
59+
]
60+
}
61+
],
62+
"source": [
63+
"%%file example.csv\n",
64+
"Team,Payroll\n",
65+
"Nationals,81.34\n",
66+
"Reds,82.20"
67+
]
68+
},
69+
{
70+
"cell_type": "code",
71+
"execution_count": 35,
72+
"outputs": [
73+
{
74+
"name": "stdout",
75+
"output_type": "stream",
76+
"text": [
77+
"Nationals|81.34\r\n",
78+
"Reds|82.2\r\n"
79+
]
80+
}
81+
],
82+
"source": [
83+
"!termsql --infile=example.csv --head --delimiter=\",\" --outfile=example.sqlite --table=payroll"
84+
],
85+
"metadata": {
86+
"collapsed": false
87+
}
88+
},
89+
{
90+
"cell_type": "markdown",
91+
"source": [
92+
"## Usage"
93+
],
94+
"metadata": {
95+
"collapsed": false
96+
}
97+
},
98+
{
99+
"cell_type": "code",
100+
"execution_count": 36,
101+
"metadata": {
102+
"tags": []
103+
},
104+
"outputs": [],
105+
"source": [
106+
"from langchain.document_loaders.sqlalchemy import SQLAlchemyLoader\n",
107+
"from pprint import pprint\n",
108+
"\n",
109+
"loader = SQLAlchemyLoader(\n",
110+
" \"SELECT * FROM payroll\",\n",
111+
" url=\"sqlite:///example.sqlite\",\n",
112+
")\n",
113+
"documents = loader.load()"
114+
]
115+
},
116+
{
117+
"cell_type": "code",
118+
"execution_count": 37,
119+
"metadata": {
120+
"tags": []
121+
},
122+
"outputs": [
123+
{
124+
"name": "stdout",
125+
"output_type": "stream",
126+
"text": [
127+
"[Document(page_content='Team: Nationals\\nPayroll: 81.34', metadata={}),\n",
128+
" Document(page_content='Team: Reds\\nPayroll: 82.2', metadata={})]\n"
129+
]
130+
}
131+
],
132+
"source": [
133+
"pprint(documents)"
134+
]
135+
},
136+
{
137+
"cell_type": "markdown",
138+
"metadata": {},
139+
"source": [
140+
"## Specifying Which Columns are Content vs Metadata"
141+
]
142+
},
143+
{
144+
"cell_type": "code",
145+
"execution_count": 38,
146+
"metadata": {},
147+
"outputs": [],
148+
"source": [
149+
"loader = SQLAlchemyLoader(\n",
150+
" \"SELECT * FROM payroll\",\n",
151+
" url=\"sqlite:///example.sqlite\",\n",
152+
" page_content_columns=[\"Team\"],\n",
153+
" metadata_columns=[\"Payroll\"],\n",
154+
")\n",
155+
"documents = loader.load()"
156+
]
157+
},
158+
{
159+
"cell_type": "code",
160+
"execution_count": 39,
161+
"metadata": {},
162+
"outputs": [
163+
{
164+
"name": "stdout",
165+
"output_type": "stream",
166+
"text": [
167+
"[Document(page_content='Team: Nationals', metadata={'Payroll': 81.34}),\n",
168+
" Document(page_content='Team: Reds', metadata={'Payroll': 82.2})]\n"
169+
]
170+
}
171+
],
172+
"source": [
173+
"pprint(documents)"
174+
]
175+
},
176+
{
177+
"cell_type": "markdown",
178+
"metadata": {},
179+
"source": [
180+
"## Adding Source to Metadata"
181+
]
182+
},
183+
{
184+
"cell_type": "code",
185+
"execution_count": 40,
186+
"metadata": {},
187+
"outputs": [],
188+
"source": [
189+
"loader = SQLAlchemyLoader(\n",
190+
" \"SELECT * FROM payroll\",\n",
191+
" url=\"sqlite:///example.sqlite\",\n",
192+
" source_columns=[\"Team\"],\n",
193+
")\n",
194+
"documents = loader.load()"
195+
]
196+
},
197+
{
198+
"cell_type": "code",
199+
"execution_count": 41,
200+
"metadata": {},
201+
"outputs": [
202+
{
203+
"name": "stdout",
204+
"output_type": "stream",
205+
"text": [
206+
"[Document(page_content='Team: Nationals\\nPayroll: 81.34', metadata={'source': 'Nationals'}),\n",
207+
" Document(page_content='Team: Reds\\nPayroll: 82.2', metadata={'source': 'Reds'})]\n"
208+
]
209+
}
210+
],
211+
"source": [
212+
"pprint(documents)"
213+
]
214+
}
215+
],
216+
"metadata": {
217+
"kernelspec": {
218+
"display_name": "Python 3 (ipykernel)",
219+
"language": "python",
220+
"name": "python3"
221+
},
222+
"language_info": {
223+
"codemirror_mode": {
224+
"name": "ipython",
225+
"version": 3
226+
},
227+
"file_extension": ".py",
228+
"mimetype": "text/x-python",
229+
"name": "python",
230+
"nbconvert_exporter": "python",
231+
"pygments_lexer": "ipython3",
232+
"version": "3.10.6"
233+
}
234+
},
235+
"nbformat": 4,
236+
"nbformat_minor": 4
237+
}
Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
1+
# SQLAlchemy
2+
3+
4+
## About
5+
6+
The [SQLAlchemy] document loader loads records from any database supported
7+
by SQLAlchemy. See [SQLAlchemy dialects] for the supported databases and dialects.
8+
9+
You can query with plain SQL, or pass an SQLAlchemy `Select` statement
10+
object if you are using SQLAlchemy Core or ORM.
11+
12+
You can select which columns to place into the document content, which columns
13+
to place into its metadata, which columns to use as a `source` attribute
14+
in metadata, and whether to include the result row number and/or the SQL
15+
query expression in the metadata.
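
To make the column options concrete, here is a minimal sketch of how one result row could be mapped to a document. It is not the loader's implementation: the `Doc` class and `row_to_doc` helper are hypothetical, and joining several source columns with a comma is an assumption. It uses only the standard-library `sqlite3` module and the `payroll` data from the notebook.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text content plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def row_to_doc(row, page_content_columns=None, metadata_columns=(), source_columns=()):
    """Map one result row (a dict of column name to value) to a Doc."""
    # With no explicit selection, every column goes into the content.
    content_columns = page_content_columns or list(row)
    page_content = "\n".join(f"{name}: {row[name]}" for name in content_columns)
    metadata = {name: row[name] for name in metadata_columns}
    if source_columns:
        # Assumption: several source columns are joined with a comma.
        metadata["source"] = ",".join(str(row[name]) for name in source_columns)
    return Doc(page_content, metadata)


# Build the notebook's two-row payroll table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE payroll (Team TEXT, Payroll REAL)")
conn.executemany(
    "INSERT INTO payroll VALUES (?, ?)",
    [("Nationals", 81.34), ("Reds", 82.2)],
)

docs = [
    row_to_doc(dict(row), page_content_columns=["Team"], metadata_columns=["Payroll"])
    for row in conn.execute("SELECT * FROM payroll ORDER BY rowid")
]
print(docs[0])
```

Each row becomes one document whose content is a sequence of `column: value` lines, which matches the shape of the outputs shown in the examples.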
16+
17+
18+
## Example
19+
20+
This example uses PostgreSQL, and the `psycopg2` driver.
21+
22+
23+
### Prerequisites
24+
25+
```shell
26+
psql postgresql://postgres@localhost/ --command "CREATE DATABASE testdrive;"
27+
psql postgresql://postgres@localhost/testdrive < ./libs/langchain/tests/integration_tests/examples/mlb_teams_2012.sql
28+
```
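
If no PostgreSQL server is at hand, a comparable table can be sketched in SQLite with the standard library. This is an assumption-laden stand-in, not the contents of `mlb_teams_2012.sql`: the column types are guessed, and only the three rows that appear in the example output of this document are inserted. The file name `testdrive.sqlite` is hypothetical.

```python
import os
import sqlite3
import tempfile

# Rows copied from the example output in this document.
ROWS = [("Nationals", 81.34, 98), ("Reds", 82.2, 97), ("Yankees", 197.96, 95)]

# Write the database into a fresh temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "testdrive.sqlite")
conn = sqlite3.connect(db_path)
conn.execute(
    'CREATE TABLE mlb_teams_2012 '
    '("Team" TEXT, "Payroll (millions)" REAL, "Wins" INTEGER)'
)
conn.executemany("INSERT INTO mlb_teams_2012 VALUES (?, ?, ?)", ROWS)
conn.commit()

# Read the rows back to confirm the table is populated.
rows = conn.execute("SELECT * FROM mlb_teams_2012 ORDER BY rowid LIMIT 3").fetchall()
conn.close()
print(rows)
```

A database file like this can then be addressed with a `sqlite:///` URL, as the notebook does with `sqlite:///example.sqlite`.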
29+
30+
31+
### Basic loading
32+
33+
```python
34+
from langchain.document_loaders.sqlalchemy import SQLAlchemyLoader
35+
from pprint import pprint
36+
37+
38+
loader = SQLAlchemyLoader(
39+
query="SELECT * FROM mlb_teams_2012 LIMIT 3;",
40+
url="postgresql+psycopg2://postgres@localhost:5432/testdrive",
41+
)
42+
docs = loader.load()
43+
```
44+
45+
```python
46+
pprint(docs)
47+
```
48+
49+
<CodeOutputBlock lang="python">
50+
51+
```
52+
[Document(page_content='Team: Nationals\nPayroll (millions): 81.34\nWins: 98', metadata={}),
53+
Document(page_content='Team: Reds\nPayroll (millions): 82.2\nWins: 97', metadata={}),
54+
Document(page_content='Team: Yankees\nPayroll (millions): 197.96\nWins: 95', metadata={})]
55+
```
56+
57+
</CodeOutputBlock>
58+
59+
60+
## Enriching metadata
61+
62+
Use the `include_rownum_into_metadata` and `include_query_into_metadata` options to
63+
add each row's number and the SQL query to the `metadata` dictionary.
64+
65+
Having the `query` within metadata is useful when using documents loaded from
66+
database tables for chains that answer questions using their origin queries.
67+
68+
```python
69+
loader = SQLAlchemyLoader(
70+
query="SELECT * FROM mlb_teams_2012 LIMIT 3;",
71+
url="postgresql+psycopg2://postgres@localhost:5432/testdrive",
72+
include_rownum_into_metadata=True,
73+
include_query_into_metadata=True,
74+
)
75+
docs = loader.load()
76+
```
77+
78+
```python
79+
pprint(docs)
80+
```
81+
82+
<CodeOutputBlock lang="python">
83+
84+
```
85+
[Document(page_content='Team: Nationals\nPayroll (millions): 81.34\nWins: 98', metadata={'row': 0, 'query': 'SELECT * FROM mlb_teams_2012 LIMIT 3;'}),
86+
Document(page_content='Team: Reds\nPayroll (millions): 82.2\nWins: 97', metadata={'row': 1, 'query': 'SELECT * FROM mlb_teams_2012 LIMIT 3;'}),
87+
Document(page_content='Team: Yankees\nPayroll (millions): 197.96\nWins: 95', metadata={'row': 2, 'query': 'SELECT * FROM mlb_teams_2012 LIMIT 3;'})]
88+
```
89+
90+
</CodeOutputBlock>
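
The effect of these two options can be approximated with plain `sqlite3`. The sketch below is illustrative only and does not reproduce the loader's code; it assumes zero-based row numbers and the `row` and `query` keys seen in the output above, and it builds plain dicts instead of `Document` objects.

```python
import sqlite3

QUERY = "SELECT * FROM payroll ORDER BY rowid"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payroll (Team TEXT, Payroll REAL)")
conn.executemany(
    "INSERT INTO payroll VALUES (?, ?)",
    [("Nationals", 81.34), ("Reds", 82.2)],
)

cursor = conn.execute(QUERY)
# Column names come from the cursor description, in result order.
columns = [description[0] for description in cursor.description]

records = []
for rownum, row in enumerate(cursor):
    page_content = "\n".join(f"{name}: {value}" for name, value in zip(columns, row))
    # Attach the row number and the originating query to the metadata.
    records.append(
        {"page_content": page_content, "metadata": {"row": rownum, "query": QUERY}}
    )

for record in records:
    print(record)
```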
91+
92+
93+
## Customizing metadata
94+
95+
Use the `page_content_columns` and `metadata_columns` options to choose which columns
96+
go into the document's `page_content` and which go into its `metadata` dictionary. When
97+
`page_content_columns` is empty, all columns are used for the page content.
98+
99+
```python
100+
loader = SQLAlchemyLoader(
101+
query="SELECT * FROM mlb_teams_2012 LIMIT 3;",
102+
url="postgresql+psycopg2://postgres@localhost:5432/testdrive",
103+
page_content_columns=["Payroll (millions)", "Wins"],
104+
metadata_columns=["Team"],
105+
)
106+
docs = loader.load()
107+
```
108+
109+
```python
110+
pprint(docs)
111+
```
112+
113+
<CodeOutputBlock lang="python">
114+
115+
```
116+
[Document(page_content='Payroll (millions): 81.34\nWins: 98', metadata={'Team': 'Nationals'}),
117+
Document(page_content='Payroll (millions): 82.2\nWins: 97', metadata={'Team': 'Reds'}),
118+
Document(page_content='Payroll (millions): 197.96\nWins: 95', metadata={'Team': 'Yankees'})]
119+
```
120+
121+
</CodeOutputBlock>
122+
123+
124+
## Specify column(s) to identify the document source
125+
126+
Use the `source_columns` option to specify the columns to use as a "source" for the
127+
document created from each row. This is useful for identifying documents through
128+
their metadata. Typically, you may use the primary key column(s) for that purpose.
129+
130+
```python
131+
loader = SQLAlchemyLoader(
132+
query="SELECT * FROM mlb_teams_2012 LIMIT 3;",
133+
url="postgresql+psycopg2://postgres@localhost:5432/testdrive",
134+
source_columns=["Team"],
135+
)
136+
docs = loader.load()
137+
```
138+
139+
```python
140+
pprint(docs)
141+
```
142+
143+
<CodeOutputBlock lang="python">
144+
145+
```
146+
[Document(page_content='Team: Nationals\nPayroll (millions): 81.34\nWins: 98', metadata={'source': 'Nationals'}),
147+
Document(page_content='Team: Reds\nPayroll (millions): 82.2\nWins: 97', metadata={'source': 'Reds'}),
148+
Document(page_content='Team: Yankees\nPayroll (millions): 197.96\nWins: 95', metadata={'source': 'Yankees'})]
149+
```
150+
151+
</CodeOutputBlock>
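
One way to use the `source` value downstream is to index documents by it. The following sketch mirrors the example output above with plain dicts, so that it runs without LangChain; the `by_source` lookup is a hypothetical helper and not part of the loader.

```python
# Documents as plain dicts, mirroring the example output above.
docs = [
    {
        "page_content": "Team: Nationals\nPayroll (millions): 81.34\nWins: 98",
        "metadata": {"source": "Nationals"},
    },
    {
        "page_content": "Team: Reds\nPayroll (millions): 82.2\nWins: 97",
        "metadata": {"source": "Reds"},
    },
    {
        "page_content": "Team: Yankees\nPayroll (millions): 197.96\nWins: 95",
        "metadata": {"source": "Yankees"},
    },
]

# Build a lookup from source value to document.
by_source = {doc["metadata"]["source"]: doc for doc in docs}

print(by_source["Reds"]["page_content"])
```

This works cleanly only when the source value is unique per row, which is why primary key columns are a natural choice.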
152+
153+
154+
[SQLAlchemy]: https://www.sqlalchemy.org/
155+
[SQLAlchemy dialects]: https://docs.sqlalchemy.org/en/20/dialects/
