Commit ef21cde

CrateDB loader: Add document loader support
The implementation is based on the generic SQLAlchemy document loader.
1 parent 4e456de commit ef21cde

8 files changed: +488 -4 lines changed
docs/docs_skeleton/vercel.json

Lines changed: 4 additions & 0 deletions
@@ -1448,6 +1448,10 @@
       "source": "/docs/modules/data_connection/document_loaders/integrations/copypaste",
       "destination": "/docs/integrations/document_loaders/copypaste"
     },
+    {
+      "source": "/docs/modules/data_connection/document_loaders/integrations/cratedb",
+      "destination": "/docs/integrations/document_loaders/cratedb"
+    },
     {
       "source": "/en/latest/modules/indexes/document_loaders/examples/csv.html",
       "destination": "/docs/integrations/document_loaders/csv"
Lines changed: 232 additions & 0 deletions
@@ -0,0 +1,232 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# CrateDB\n",
+    "\n",
+    "This notebook demonstrates how to load documents from a [CrateDB] database,\n",
+    "using the [SQLAlchemy] document loader.\n",
+    "\n",
+    "It loads the result of a database query with one document per row.\n",
+    "\n",
+    "[CrateDB]: https://github.com/crate/crate\n",
+    "[SQLAlchemy]: https://www.sqlalchemy.org/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Prerequisites"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "#!pip install langchain crash 'crate[sqlalchemy]'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Populate database."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 41,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\u001B[32mCONNECT OK\r\n",
+      "\u001B[0m\u001B[32mPSQL OK, 1 row affected (0.001 sec)\r\n",
+      "\u001B[0m\u001B[32mDELETE OK, 30 rows affected (0.008 sec)\r\n",
+      "\u001B[0m\u001B[32mINSERT OK, 30 rows affected (0.011 sec)\r\n",
+      "\u001B[0m\u001B[0m\u001B[32mCONNECT OK\r\n",
+      "\u001B[0m\u001B[32mREFRESH OK, 1 row affected (0.001 sec)\r\n",
+      "\u001B[0m\u001B[0m"
+     ]
+    }
+   ],
+   "source": [
+    "!crash < ./example_data/mlb_teams_2012.sql\n",
+    "!crash --command \"REFRESH TABLE mlb_teams_2012;\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Usage"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 42,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from langchain.document_loaders.cratedb import CrateDBLoader\n",
+    "from pprint import pprint\n",
+    "\n",
+    "CONNECTION_STRING = \"crate://crate@localhost/\"\n",
+    "\n",
+    "loader = CrateDBLoader(\n",
+    "    'SELECT * FROM mlb_teams_2012 ORDER BY \"Team\" LIMIT 5;',\n",
+    "    url=CONNECTION_STRING,\n",
+    ")\n",
+    "documents = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 43,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[Document(page_content='Team: Angels\\nPayroll (millions): 154.49\\nWins: 89', metadata={}),\n",
+      " Document(page_content='Team: Astros\\nPayroll (millions): 60.65\\nWins: 55', metadata={}),\n",
+      " Document(page_content='Team: Athletics\\nPayroll (millions): 55.37\\nWins: 94', metadata={}),\n",
+      " Document(page_content='Team: Blue Jays\\nPayroll (millions): 75.48\\nWins: 73', metadata={}),\n",
+      " Document(page_content='Team: Braves\\nPayroll (millions): 83.31\\nWins: 94', metadata={})]\n"
+     ]
+    }
+   ],
+   "source": [
+    "pprint(documents)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Specifying Which Columns are Content vs Metadata"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = CrateDBLoader(\n",
+    "    'SELECT * FROM mlb_teams_2012 ORDER BY \"Team\" LIMIT 5;',\n",
+    "    url=CONNECTION_STRING,\n",
+    "    page_content_columns=[\"Team\"],\n",
+    "    metadata_columns=[\"Payroll (millions)\"],\n",
+    ")\n",
+    "documents = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[Document(page_content='Team: Angels', metadata={'Payroll (millions)': 154.49}),\n",
+      " Document(page_content='Team: Astros', metadata={'Payroll (millions)': 60.65}),\n",
+      " Document(page_content='Team: Athletics', metadata={'Payroll (millions)': 55.37}),\n",
+      " Document(page_content='Team: Blue Jays', metadata={'Payroll (millions)': 75.48}),\n",
+      " Document(page_content='Team: Braves', metadata={'Payroll (millions)': 83.31})]\n"
+     ]
+    }
+   ],
+   "source": [
+    "pprint(documents)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Adding Source to Metadata"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 46,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = CrateDBLoader(\n",
+    "    'SELECT * FROM mlb_teams_2012 ORDER BY \"Team\" LIMIT 5;',\n",
+    "    url=CONNECTION_STRING,\n",
+    "    source_columns=[\"Team\"],\n",
+    ")\n",
+    "documents = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 47,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[Document(page_content='Team: Angels\\nPayroll (millions): 154.49\\nWins: 89', metadata={'source': 'Angels'}),\n",
+      " Document(page_content='Team: Astros\\nPayroll (millions): 60.65\\nWins: 55', metadata={'source': 'Astros'}),\n",
+      " Document(page_content='Team: Athletics\\nPayroll (millions): 55.37\\nWins: 94', metadata={'source': 'Athletics'}),\n",
+      " Document(page_content='Team: Blue Jays\\nPayroll (millions): 75.48\\nWins: 73', metadata={'source': 'Blue Jays'}),\n",
+      " Document(page_content='Team: Braves\\nPayroll (millions): 83.31\\nWins: 94', metadata={'source': 'Braves'})]\n"
+     ]
+    }
+   ],
+   "source": [
+    "pprint(documents)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
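
The three notebook variants above differ only in how each result row is split between `page_content` and `metadata`. The sketch below imitates that mapping with the standard-library `sqlite3` module, so it runs without CrateDB or LangChain. The helpers `rows_to_documents` and `query_documents`, the plain-dict document shape, and the `-` separator for multiple source columns are hypothetical stand-ins, not the SQLAlchemy loader's actual implementation; only the three parameter names and the sample rows come from the notebook.

```python
import sqlite3


def rows_to_documents(cursor, page_content_columns=None,
                      metadata_columns=None, source_columns=None):
    """Hypothetical stand-in: turn each result row into one document dict."""
    names = [col[0] for col in cursor.description]
    # Without an explicit selection, every column goes into the page content.
    content_columns = page_content_columns or names
    documents = []
    for row in cursor:
        record = dict(zip(names, row))
        page_content = "\n".join(f"{k}: {record[k]}" for k in content_columns)
        metadata = {k: record[k] for k in (metadata_columns or [])}
        if source_columns:
            # Assumption: several source columns are joined with "-".
            metadata["source"] = "-".join(str(record[k]) for k in source_columns)
        documents.append({"page_content": page_content, "metadata": metadata})
    return documents


def query_documents(**kwargs):
    """Load three sample rows into an in-memory table and map the first two."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            'CREATE TABLE mlb_teams_2012 '
            '("Team" TEXT, "Payroll (millions)" REAL, "Wins" INTEGER)'
        )
        conn.executemany(
            "INSERT INTO mlb_teams_2012 VALUES (?, ?, ?)",
            [("Braves", 83.31, 94), ("Angels", 154.49, 89), ("Astros", 60.65, 55)],
        )
        cursor = conn.execute(
            'SELECT * FROM mlb_teams_2012 ORDER BY "Team" LIMIT 2'
        )
        return rows_to_documents(cursor, **kwargs)
    finally:
        conn.close()


if __name__ == "__main__":
    from pprint import pprint

    pprint(query_documents())
    pprint(query_documents(page_content_columns=["Team"],
                           metadata_columns=["Payroll (millions)"]))
    pprint(query_documents(source_columns=["Team"]))
```

Keeping the mapping in one small function makes the default visible: when no content columns are named, every column is rendered as a `name: value` line, which matches the first notebook output.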
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+-- psql postgresql://postgres@localhost < ./libs/langchain/tests/integration_tests/examples/mlb_teams_2012.sql
+-- crash < ./libs/langchain/tests/integration_tests/examples/mlb_teams_2012.sql
+
+CREATE TABLE IF NOT EXISTS mlb_teams_2012 ("Team" VARCHAR, "Payroll (millions)" FLOAT, "Wins" BIGINT);
+DELETE FROM mlb_teams_2012;
+INSERT INTO mlb_teams_2012
+    ("Team", "Payroll (millions)", "Wins")
+VALUES
+    ('Nationals', 81.34, 98),
+    ('Reds', 82.20, 97),
+    ('Yankees', 197.96, 95),
+    ('Giants', 117.62, 94),
+    ('Braves', 83.31, 94),
+    ('Athletics', 55.37, 94),
+    ('Rangers', 120.51, 93),
+    ('Orioles', 81.43, 93),
+    ('Rays', 64.17, 90),
+    ('Angels', 154.49, 89),
+    ('Tigers', 132.30, 88),
+    ('Cardinals', 110.30, 88),
+    ('Dodgers', 95.14, 86),
+    ('White Sox', 96.92, 85),
+    ('Brewers', 97.65, 83),
+    ('Phillies', 174.54, 81),
+    ('Diamondbacks', 74.28, 81),
+    ('Pirates', 63.43, 79),
+    ('Padres', 55.24, 76),
+    ('Mariners', 81.97, 75),
+    ('Mets', 93.35, 74),
+    ('Blue Jays', 75.48, 73),
+    ('Royals', 60.91, 72),
+    ('Marlins', 118.07, 69),
+    ('Red Sox', 173.18, 69),
+    ('Indians', 78.43, 68),
+    ('Twins', 94.08, 66),
+    ('Rockies', 78.06, 64),
+    ('Cubs', 88.19, 61),
+    ('Astros', 60.65, 55)
+;

docs/extras/integrations/providers/cratedb.mdx

Lines changed: 41 additions & 4 deletions
@@ -38,7 +38,8 @@ data, and query it using SQL.
 
 ## Features
 
-The CrateDB adapter supports the Vector Store subsystem of LangChain.
+The CrateDB adapter supports the _Vector Store_ and _Document Loader_
+subsystems of LangChain.
 
 ### Vector Store
 
@@ -51,6 +52,11 @@ Supports:
 - Approximate nearest neighbor search.
 - Euclidean distance.
 
+### Document Loader
+
+`CrateDBLoader` loads documents from a database table, using either an SQL
+query expression or an SQLAlchemy selectable instance.
+
 
 ## Installation and Setup
 
@@ -74,16 +80,16 @@ docker run --rm -it --name=cratedb --publish=4200:4200 --publish=5432:5432 \
 ### Install Client
 
 ```bash
-pip install 'crate[sqlalchemy]' 'langchain[openai]'
+pip install 'crate[sqlalchemy]' 'langchain[openai]' 'crash'
 ```
 
 
-## Usage
+## Usage » Vector Store
 
 For a more detailed walkthrough of the `CrateDBVectorSearch` wrapper, there is also
 a corresponding [Jupyter notebook](/docs/extras/integrations/vectorstores/cratedb.html).
 
-### Acquire text file
+### Provide input data
 The example uses the canonical `state_of_the_union.txt`.
 ```shell
 wget https://raw.githubusercontent.com/langchain-ai/langchain/v0.0.291/docs/extras/modules/state_of_the_union.txt
@@ -120,6 +126,37 @@ if __name__ == "__main__":
 ```
 
 
+## Usage » Document Loader
+
+For a more detailed walkthrough of the `CrateDBLoader`, there is also a corresponding
+[Jupyter notebook](/docs/extras/integrations/document_loaders/cratedb.html).
+
+
+### Provide input data
+```shell
+wget https://raw.githubusercontent.com/langchain-ai/langchain/main/docs/extras/integrations/document_loaders/example_data/mlb_teams_2012.sql
+crash < ./mlb_teams_2012.sql
+crash --command "REFRESH TABLE mlb_teams_2012;"
+```
+
+### Load documents by SQL query
+```python
+from langchain.document_loaders.cratedb import CrateDBLoader
+from pprint import pprint
+
+def main():
+    loader = CrateDBLoader(
+        'SELECT * FROM mlb_teams_2012 ORDER BY "Team" LIMIT 5;',
+        url="crate://crate@localhost/",
+    )
+    documents = loader.load()
+    pprint(documents)
+
+if __name__ == "__main__":
+    main()
+```
+
+
 [CrateDB]: https://github.com/crate/crate
 [CrateDB Cloud]: https://crate.io/product
 [CrateDB Cloud Console]: https://console.cratedb.cloud/
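
Both the notebook and the provider page pass the connection string `crate://crate@localhost/` to the loader. As a rough aid for reading such SQLAlchemy-style URLs, the sketch below splits one into its parts using only the standard library. The helper `describe_connection` is a hypothetical name introduced for illustration, and the sketch says nothing about how the CrateDB dialect itself interprets the URL.

```python
from urllib.parse import urlsplit


def describe_connection(url):
    """Hypothetical helper: split a SQLAlchemy-style URL into its parts."""
    parts = urlsplit(url)
    return {
        "dialect": parts.scheme,     # "crate" selects the CrateDB dialect
        "username": parts.username,
        "password": parts.password,  # None when the URL carries no password
        "host": parts.hostname,
        "port": parts.port,          # None when the URL names no port
    }


if __name__ == "__main__":
    print(describe_connection("crate://crate@localhost/"))
```

The URL used in the commit names neither a password nor a port, so both come back as `None` and whatever defaults the client library applies take effect.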
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+from langchain.document_loaders.sqlalchemy import SQLAlchemyLoader
+
+
+class CrateDBLoader(SQLAlchemyLoader):
+    pass
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+version: "3"
+
+services:
+  postgresql:
+    image: crate/crate:nightly
+    environment:
+      - CRATE_HEAP_SIZE=4g
+    ports:
+      - "4200:4200"
+      - "5432:5432"
+    command: |
+      crate -Cdiscovery.type=single-node
+    healthcheck:
+      test:
+        [
+          "CMD-SHELL",
+          "curl --silent --fail http://localhost:4200/ || exit 1",
+        ]
+      interval: 5s
+      retries: 60
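
The health check in this Compose file polls the HTTP endpoint on port 4200 every 5 seconds, up to 60 times. A test script that starts the container itself may want a similar wait on the client side before loading data. The sketch below is one way to do that with the standard library; `http_probe` and `wait_until_healthy` are hypothetical helpers, and the defaults merely mirror the values from the Compose file rather than reproducing Docker's own health-check semantics.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_probe(url="http://localhost:4200/", timeout=2.0):
    """Return True if the endpoint answers with a successful HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeouts, and HTTP error statuses all count
        # as "not healthy yet".
        return False


def wait_until_healthy(probe=http_probe, interval=5.0, retries=60,
                       sleep=time.sleep):
    """Call probe() until it succeeds; return the number of attempts used."""
    for attempt in range(1, retries + 1):
        if probe():
            return attempt
        if attempt < retries:
            sleep(interval)
    raise TimeoutError(f"service not healthy after {retries} attempts")


if __name__ == "__main__":
    attempts = wait_until_healthy()
    print(f"healthy after {attempts} attempt(s)")
```

Passing `probe` and `sleep` as parameters keeps the waiting logic testable without a running database.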
