Commit 9ebc97e

shyam-cb and bilgeyucel authored
Add Couchbase integration (#259)
* Add Couchbase integration
* examples updated
* Apply suggestions from code review

Co-authored-by: Bilge Yücel <bilge.yucel@deepset.ai>

* docs update

---------

Co-authored-by: Bilge Yücel <bilge.yucel@deepset.ai>
1 parent 6720ca1 commit 9ebc97e

2 files changed

+327
-0
lines changed

Lines changed: 326 additions & 0 deletions
@@ -0,0 +1,326 @@
---
layout: integration
name: Couchbase
description: Use the Couchbase database with Haystack
authors:
  - name: Couchbase
    socials:
      github: Couchbase-Ecosystem
pypi: https://pypi.org/project/couchbase-haystack/
repo: https://github.com/Couchbase-Ecosystem/couchbase-haystack
type: Document Store
report_issue: https://github.com/Couchbase-Ecosystem/couchbase-haystack/issues
logo: /logos/couchbase.svg
version: Haystack 2.0
toc: true
---

**Table of Contents**

- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
- [License](#license)

## Overview

An integration of the [Couchbase](https://www.couchbase.com) NoSQL database with [Haystack v2.0](https://docs.haystack.deepset.ai/docs/intro)
by [deepset](https://www.deepset.ai). Couchbase's [Vector Search index](https://docs.couchbase.com/server/current/vector-search/vector-search.html)
is used for indexing document embeddings and for dense retrieval.

The library allows using Couchbase as a [DocumentStore](https://docs.haystack.deepset.ai/docs/document-store), and implements the required [Protocol](https://docs.haystack.deepset.ai/docs/document-store#documentstore-protocol) methods. You can start working with the implementation by importing it from the `couchbase_haystack` package:

```python
from couchbase_haystack import CouchbaseDocumentStore
```

In addition to the `CouchbaseDocumentStore`, the library includes the following Haystack components which can be used in a pipeline:

- `CouchbaseEmbeddingRetriever` - a typical [retriever component](https://docs.haystack.deepset.ai/docs/retrievers) that can be used to query the vector search index and find related Documents. The component uses `CouchbaseDocumentStore` to query embeddings.

The `couchbase-haystack` library uses the Couchbase [Python SDK](https://docs.couchbase.com/python-sdk/current/hello-world/start-using-sdk.html).

`CouchbaseDocumentStore` stores Documents as JSON documents in Couchbase. Embeddings are stored as part of the document, with indexing and querying of vector embeddings managed by Couchbase's dedicated [Vector Search Index](https://docs.couchbase.com/server/current/vector-search/vector-search.html).

```text
                                   +-----------------------------+
                                   |     Couchbase Database      |
                                   +-----------------------------+
                                   |                             |
                                   |      +----------------+     |
                                   |      |  Data service  |     |
                write_documents    |      +----------------+     |
          +------------------------+----->|   properties   |     |
          |                        |      |                |     |
+---------+--------------+         |      |   embedding    |     |
|                        |         |      +--------+-------+     |
| CouchbaseDocumentStore |         |               |             |
|                        |         |               |index        |
+---------+--------------+         |               |             |
          |                        |      +--------+--------+    |
          |                        |      | Search service  |    |
          |                        |      +-----------------+    |
          +----------------------->|      |       FTS       |    |
               query_embeddings    |      |  Vector Index   |    |
                                   |      | (for embedding) |    |
                                   |      +-----------------+    |
                                   |                             |
                                   +-----------------------------+
```

In the above diagram:

- `Data service` supports storing, setting, and retrieving documents by key. This is where the documents are stored as key-value pairs.
- `properties` are Document [attributes](https://docs.haystack.deepset.ai/docs/data-classes#document) stored as part of the Document.
- `embedding` is also a property of the Document (shown separately in the diagram for clarity); it is a vector of type `LIST[FLOAT]`.
- `Search service` is where indexes purpose-built for Full Text Search (FTS) and Vector Search are created. The Search service allows for efficient querying
  and retrieval based on both text content and vector embeddings.
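
To make "retrieval based on vector embeddings" concrete, here is a minimal brute-force sketch of what a dense retrieval query computes conceptually: score each stored embedding against the query embedding and keep the best matches. It is an illustration only. The function names and the tiny 3-dimensional vectors are hypothetical, and the Search service uses a dedicated vector index rather than a linear scan like this.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def top_k(query: List[float], docs: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored embedding against the query and keep the k best."""
    scored = [(doc_id, dot(query, emb)) for doc_id, emb in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical 3-dimensional embeddings keyed by document id.
docs = {
    "doc-1": [1.0, 0.0, 0.0],
    "doc-2": [0.0, 1.0, 0.0],
    "doc-3": [0.5, 0.5, 0.0],
}
results = top_k([1.0, 0.0, 0.0], docs, k=2)
print(results)
```

The query vector is closest to `doc-1`, followed by `doc-3`, so those two are returned in that order.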

`CouchbaseDocumentStore` requires the vector index to be created manually, either via the SDK or the UI. Before writing documents, make sure they are embedded by one of the provided [embedders](https://docs.haystack.deepset.ai/docs/embedders). For example, [SentenceTransformersDocumentEmbedder](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder) can be used in an indexing pipeline to calculate document embeddings before writing them to Couchbase.
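
Since the index has to exist before documents are written, it can help to see roughly what such a definition looks like. The sketch below builds a Search index definition as a plain Python dictionary and prints it as JSON, which could then be pasted into the Web Console or sent through the SDK. The helper name `vector_index_definition` is hypothetical, and the field names (`dims`, `similarity`, `doc_config`, and so on) are an assumption based on the Couchbase Server 7.6 index JSON format; verify them against the Couchbase documentation, or against a definition exported from the UI, before relying on them.

```python
import json
from typing import Any, Dict


def vector_index_definition(
    bucket: str,
    scope: str,
    collection: str,
    index_name: str,
    dims: int = 384,
    similarity: str = "dot_product",
) -> Dict[str, Any]:
    """Build an (assumed) Search index definition with one vector field named `embedding`."""
    return {
        "type": "fulltext-index",
        "name": index_name,
        "sourceType": "gocbcore",
        "sourceName": bucket,
        "params": {
            "doc_config": {"mode": "scope.collection.type_field", "type_field": "type"},
            "mapping": {
                "default_mapping": {"dynamic": True, "enabled": False},
                "types": {
                    f"{scope}.{collection}": {
                        "dynamic": True,
                        "enabled": True,
                        "properties": {
                            "embedding": {
                                "enabled": True,
                                "dynamic": False,
                                "fields": [
                                    {
                                        "name": "embedding",
                                        "type": "vector",
                                        "dims": dims,  # must match the embedding model
                                        "similarity": similarity,
                                        "index": True,
                                    }
                                ],
                            }
                        },
                    }
                },
            },
        },
    }


definition = vector_index_definition(
    bucket="haystack_bucket_name",
    scope="haystack_scope_name",
    collection="haystack_collection_name",
    index_name="vector_search_index",
)
print(json.dumps(definition, indent=2))
```

The names passed in mirror the placeholder bucket, scope, collection, and index names used in the examples on this page.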

## Installation

`couchbase-haystack` can be installed like any other Python library, using pip:

```bash
pip install --upgrade pip # optional
pip install sentence-transformers # required to run the pipeline examples below
pip install couchbase-haystack
```

## Usage

### Running Couchbase

You will need a running instance of Couchbase to use the components from this package. There are several options available:

- [Docker](https://docs.couchbase.com/server/current/getting-started/do-a-quick-install.html)
- [Couchbase Cloud](https://www.couchbase.com/products/capella) - a fully managed cloud service
- [Couchbase Server](https://www.couchbase.com/downloads) - installable on various operating systems

The simplest way to start the database locally is with a Docker container:

```bash
docker run \
    --restart always \
    --publish=8091-8096:8091-8096 --publish=11210:11210 \
    --env COUCHBASE_ADMINISTRATOR_USERNAME=admin \
    --env COUCHBASE_ADMINISTRATOR_PASSWORD=passw0rd \
    couchbase:enterprise-7.6.2
```

In this example, the container is started using Couchbase Server version `7.6.2`. The `COUCHBASE_ADMINISTRATOR_USERNAME` and `COUCHBASE_ADMINISTRATOR_PASSWORD` environment variables set the default credentials for authentication.

> **Note:**
> Assuming you have a Docker container running, navigate to <http://localhost:8091> to open the Couchbase Web Console and explore your data.

### Document Store

Once you have the package installed and the database running, you can start using `CouchbaseDocumentStore` like any other document store that supports embeddings.

```python
from haystack.utils.auth import Secret
from couchbase_haystack import CouchbaseDocumentStore, CouchbasePasswordAuthenticator

document_store = CouchbaseDocumentStore(
    cluster_connection_string=Secret.from_env_var("CB_CONNECTION_STRING"),
    authenticator=CouchbasePasswordAuthenticator(
        username=Secret.from_env_var("CB_USERNAME"),
        password=Secret.from_env_var("CB_PASSWORD"),
    ),
    bucket="haystack_bucket_name",
    scope="haystack_scope_name",
    collection="haystack_collection_name",
    vector_search_index="vector_search_index",
)
```
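
The store above reads its connection details from the `CB_CONNECTION_STRING`, `CB_USERNAME`, and `CB_PASSWORD` environment variables, so they need to be set before the store connects. A minimal sketch for local experiments follows. The values are assumptions matching the Docker command shown earlier (`admin` / `passw0rd`, with `couchbase://localhost` as the connection string); in any shared or production setup, export real values in the shell or use a secrets manager instead of hard-coding them.

```python
import os

# Assumed values for the local Docker container started earlier; adjust for your cluster.
os.environ.setdefault("CB_CONNECTION_STRING", "couchbase://localhost")
os.environ.setdefault("CB_USERNAME", "admin")
os.environ.setdefault("CB_PASSWORD", "passw0rd")

# Fail early with a clear message if any variable is still missing or empty.
required = ("CB_CONNECTION_STRING", "CB_USERNAME", "CB_PASSWORD")
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```

`setdefault` leaves any value already exported in your shell untouched, so the snippet is safe to keep at the top of a local script.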

Assuming there is a list of documents available and a running Couchbase database, you can write (index) them in Couchbase, e.g.:

```python
from haystack import Document

documents = [Document(content="Alice has been living in New York City for the past 5 years.")]

document_store.write_documents(documents)
```

If you intend to compute embeddings before writing documents, use the following code:

```python
from haystack import Document

# import one of the available document embedders
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

documents = [Document(content="Alice has been living in New York City for the past 5 years.")]

document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_embedder.warm_up()  # will download the model during first run
documents_with_embeddings = document_embedder.run(documents)

document_store.write_documents(documents_with_embeddings.get("documents"))
```

Make sure the embedding model produces vectors of the same size as the dimension configured on the Couchbase Vector Search index, e.g. a dimension of 384 matches the `sentence-transformers/all-MiniLM-L6-v2` model.
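
A dimension mismatch is easier to catch before writing than to debug afterwards. The following is a small, hypothetical pre-flight check (not part of `couchbase-haystack`) that validates a batch of embeddings against the dimension you configured on the index; `INDEX_DIMS` is an assumed constant that you would set to your own index's value.

```python
from typing import List, Optional, Sequence

INDEX_DIMS = 384  # assumed dimension configured on the vector search index


def check_embedding_dims(
    embeddings: Sequence[Optional[List[float]]], expected_dims: int = INDEX_DIMS
) -> None:
    """Raise ValueError if any embedding is missing or has the wrong length."""
    for position, embedding in enumerate(embeddings):
        if embedding is None:
            raise ValueError(f"document at position {position} has no embedding")
        if len(embedding) != expected_dims:
            raise ValueError(
                f"document at position {position} has {len(embedding)} dimensions, "
                f"expected {expected_dims}"
            )


# Passes silently: both vectors have 384 dimensions.
check_embedding_dims([[0.0] * 384, [0.1] * 384])
```

With Haystack Documents, the input could be built as `[doc.embedding for doc in documents_with_embeddings["documents"]]` right before calling `write_documents`.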

> **Note**
> Most of the time you will be using [Haystack Pipelines](https://docs.haystack.deepset.ai/docs/pipelines) to build both indexing and querying RAG scenarios.

It is important to understand how Haystack Documents are stored in Couchbase after you call `write_documents`.

```python
from random import random

from haystack import Document

sample_embedding = [random() for _ in range(384)]  # a random embedding is used here to keep the example short
document = Document(
    content="Alice has been living in New York City for the past 5 years.",
    embedding=sample_embedding,
    meta={"num_of_years": 5},
)
document.to_dict()
```

The above code converts a Document to a dictionary and renders the following output:

```bash
>>> output:
{
    "id": "11c255ad10bff4286781f596a5afd9ab093ed056d41bca4120c849058e52f24d",
    "content": "Alice has been living in New York City for the past 5 years.",
    "dataframe": None,
    "blob": None,
    "score": None,
    "embedding": [0.025010755222666936, 0.27502931836911926, 0.22321073814882275, ...], # vector of size 384
    "num_of_years": 5,
}
```

The data from the dictionary will be used to create a document in Couchbase after you write the document with `document_store.write_documents([document])`. You could then query it with SQL++, e.g. `SELECT d.* FROM haystack_bucket_name.haystack_scope_name.haystack_collection_name AS d` (this assumes the Query service is running and a suitable index, such as a primary index, exists on the collection). Below is the resulting JSON document in Couchbase:

```js
{
    "id": "11c255ad10bff4286781f596a5afd9ab093ed056d41bca4120c849058e52f24d",
    "embedding": [0.6394268274307251, 0.02501075528562069, 0.27502933144569397, ...], // vector of size 384
    "content": "Alice has been living in New York City for the past 5 years.",
    "meta": {
        "num_of_years": 5
    }
}
```

The full list of parameters accepted by `CouchbaseDocumentStore` can be found in the
[API documentation](https://couchbase-ecosystem.github.io/couchbase-haystack/reference/couchbase_document_store).
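
The two listings above differ in shape: `to_dict()` returns metadata flattened at the top level, while the stored JSON document nests it under `meta` and omits the empty fields. The sketch below reproduces that difference with plain dictionaries. It is inferred from the two listings only and is not the library's actual serialization code; `CORE_FIELDS` and `to_stored_shape` are hypothetical names.

```python
from typing import Any, Dict

# Top-level fields kept as-is in this sketch; everything else is treated as metadata.
CORE_FIELDS = {"id", "content", "dataframe", "blob", "score", "embedding"}


def to_stored_shape(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Drop empty core fields and nest the remaining keys under `meta`."""
    stored = {
        key: value
        for key, value in flat.items()
        if key in CORE_FIELDS and value is not None
    }
    meta = {key: value for key, value in flat.items() if key not in CORE_FIELDS}
    if meta:
        stored["meta"] = meta
    return stored


flat = {
    "id": "doc-1",
    "content": "Alice has been living in New York City for the past 5 years.",
    "dataframe": None,
    "blob": None,
    "score": None,
    "embedding": [0.1, 0.2],
    "num_of_years": 5,
}
stored = to_stored_shape(flat)
print(stored)
```

Knowing that metadata lives under `meta` in the stored document matters when you write queries or index mappings against fields such as `num_of_years`.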

### Indexing documents

With Haystack you can use the [DocumentWriter](https://docs.haystack.deepset.ai/docs/documentwriter) component to write Documents into a Document Store. In the example below we construct a pipeline to write documents to Couchbase using `CouchbaseDocumentStore`:

```python
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.utils.auth import Secret
from couchbase_haystack import CouchbaseDocumentStore, CouchbasePasswordAuthenticator

documents = [Document(content="This is document 1"), Document(content="This is document 2")]

document_store = CouchbaseDocumentStore(
    cluster_connection_string=Secret.from_env_var("CB_CONNECTION_STRING"),
    authenticator=CouchbasePasswordAuthenticator(
        username=Secret.from_env_var("CB_USERNAME"),
        password=Secret.from_env_var("CB_PASSWORD"),
    ),
    bucket="haystack_bucket_name",
    scope="haystack_scope_name",
    collection="haystack_collection_name",
    vector_search_index="vector_search_index",
)
embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store=document_store)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=embedder, name="embedder")
indexing_pipeline.add_component(instance=document_writer, name="writer")

indexing_pipeline.connect("embedder", "writer")
indexing_pipeline.run({"embedder": {"documents": documents}})
```

```bash
>>> output:
{'writer': {'documents_written': 2}}
```

### Retrieving documents

The `CouchbaseEmbeddingRetriever` component can be used to retrieve documents from Couchbase by querying the vector index with an embedded query. Below is a pipeline which finds documents using a query embedding:

```python
from typing import List

from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.utils.auth import Secret
from couchbase_haystack.document_store import CouchbaseDocumentStore, CouchbasePasswordAuthenticator
from couchbase_haystack.component.retriever import CouchbaseEmbeddingRetriever

document_store = CouchbaseDocumentStore(
    cluster_connection_string=Secret.from_env_var("CB_CONNECTION_STRING"),
    authenticator=CouchbasePasswordAuthenticator(
        username=Secret.from_env_var("CB_USERNAME"),
        password=Secret.from_env_var("CB_PASSWORD"),
    ),
    bucket="haystack_bucket_name",
    scope="haystack_scope_name",
    collection="haystack_collection_name",
    vector_search_index="vector_search_index",
)

documents = [
    Document(content="Alice has been living in New York City for the past 5 years.", meta={"num_of_years": 5, "city": "New York"}),
    Document(content="John moved to Los Angeles 2 years ago and loves the sunny weather.", meta={"num_of_years": 2, "city": "Los Angeles"}),
]

# The same model is used for both query and Document embeddings
model_name = "sentence-transformers/all-MiniLM-L6-v2"

document_embedder = SentenceTransformersDocumentEmbedder(model=model_name)
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)

document_store.write_documents(documents_with_embeddings.get("documents"))

print("Number of documents written: ", document_store.count_documents())

pipeline = Pipeline()
pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model=model_name))
pipeline.add_component("retriever", CouchbaseEmbeddingRetriever(document_store=document_store))
pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

result = pipeline.run(
    data={
        "text_embedder": {"text": "What cities do people live in?"},
        "retriever": {
            "top_k": 5
        },
    }
)

documents: List[Document] = result["retriever"]["documents"]
```

```bash
>>> output:
[Document(id=3e35fa03aff6e3c45e6560f58adc4fde3c436c111a8809c30133b5cb492e8694, content: 'Alice has been living in New York City for the past 5 years.', meta: {'num_of_years': 5, 'city': 'New York'}, score: 0.36796408891677856, embedding: vector of size 384),
 Document(id=ca4d7d7d7ff6c13b950a88580ab134b2dc15b48a47b8f571a46b354b5344e5fa, content: 'John moved to Los Angeles 2 years ago and loves the sunny weather.', meta: {'num_of_years': 2, 'city': 'Los Angeles'}, score: 0.3126790523529053, embedding: vector of size 384)]
```

### More examples

You can find more examples in the implementation [repository](https://github.com/Couchbase-Ecosystem/couchbase-haystack/tree/main/examples):

- [indexing_pipeline.py](https://github.com/Couchbase-Ecosystem/couchbase-haystack/tree/main/examples/indexing_pipeline.py) - Indexing text files (documents) from a remote HTTP location.
- [rag_pipeline.py](https://github.com/Couchbase-Ecosystem/couchbase-haystack/tree/main/examples/rag_pipeline.py) - Generative question answering RAG pipeline using `CouchbaseEmbeddingRetriever` to fetch documents from the Couchbase document store and answer questions using [HuggingFaceAPIGenerator](https://docs.haystack.deepset.ai/docs/huggingfacetgigenerator).

## License

`couchbase-haystack` is distributed under the terms of the [MIT](https://spdx.org/licenses/MIT.html) license.

logos/couchbase.svg

Lines changed: 1 addition & 0 deletions