| 1 | +--- |
| 2 | +layout: integration |
| 3 | +name: Couchbase |
| 4 | +description: Use the Couchbase database with Haystack |
| 5 | +authors: |
| 6 | + - name: Couchbase |
| 7 | + socials: |
| 8 | + github: Couchbase-Ecosystem |
| 9 | +pypi: https://pypi.org/project/couchbase-haystack/ |
| 10 | +repo: https://github.com/Couchbase-Ecosystem/couchbase-haystack |
| 11 | +type: Document Store |
| 12 | +report_issue: https://github.com/Couchbase-Ecosystem/couchbase-haystack/issues |
| 13 | +logo: /logos/couchbase.svg |
| 14 | +version: Haystack 2.0 |
| 15 | +toc: true |
| 16 | +--- |
| 17 | + |
| 18 | +**Table of Contents** |
| 19 | + |
| 20 | +- [Overview](#overview) |
| 21 | +- [Installation](#installation) |
| 22 | +- [Usage](#usage) |
| 23 | +- [License](#license) |
| 24 | + |
| 25 | +## Overview |
| 26 | + |
| 27 | +An integration of the [Couchbase](https://www.couchbase.com) NoSQL database with [Haystack v2.0](https://docs.haystack.deepset.ai/docs/intro) |
| 28 | +by [deepset](https://www.deepset.ai). Couchbase's [Vector Search index](https://docs.couchbase.com/server/current/vector-search/vector-search.html) |
| 29 | +is used to index document embeddings and to perform dense retrieval. |
| 30 | + |
| 31 | +The library lets you use Couchbase as a [DocumentStore](https://docs.haystack.deepset.ai/docs/document-store) and implements the required [Protocol](https://docs.haystack.deepset.ai/docs/document-store#documentstore-protocol) methods. To get started, import it from the `couchbase_haystack` package: |
| 32 | + |
| 33 | +```python |
| 34 | +from couchbase_haystack import CouchbaseDocumentStore |
| 35 | +``` |
| 36 | + |
| 37 | +In addition to `CouchbaseDocumentStore`, the library includes the following Haystack component, which can be used in a pipeline: |
| 38 | + |
| 39 | +- `CouchbaseEmbeddingRetriever` is a typical [retriever component](https://docs.haystack.deepset.ai/docs/retrievers) that queries the vector search index to find related Documents. It uses `CouchbaseDocumentStore` to query embeddings. |
| 40 | + |
| 41 | +The `couchbase-haystack` library uses the Couchbase [Python SDK](https://docs.couchbase.com/python-sdk/current/hello-world/start-using-sdk.html). |
| 42 | + |
| 43 | +`CouchbaseDocumentStore` stores Documents as JSON documents in Couchbase. Embeddings are stored as part of the document, with indexing and querying of vector embeddings managed by Couchbase's dedicated [Vector Search Index](https://docs.couchbase.com/server/current/vector-search/vector-search.html). |
| 44 | + |
| 45 | +```text |
| 46 | + +-----------------------------+ |
| 47 | + | Couchbase Database | |
| 48 | + +-----------------------------+ |
| 49 | + | | |
| 50 | + | +----------------+ | |
| 51 | + | | Data service | | |
| 52 | + write_documents | +----------------+ | |
| 53 | + +------------------------+----->| properties | | |
| 54 | + | | | | | |
| 55 | ++---------+--------------+ | | embedding | | |
| 56 | +| | | +--------+-------+ | |
| 57 | +| CouchbaseDocumentStore | | | | |
| 58 | +| | | |index | |
| 59 | ++---------+--------------+ | | | |
| 60 | + | | +--------+--------+ | |
| 61 | + | | | Search service | | |
| 62 | + | | +-----------------+ | |
| 63 | + +----------------------->| | FTS | | |
| 64 | + query_embeddings | | Vector Index | | |
| 65 | + | | (for embedding) | | |
| 66 | + | +-----------------+ | |
| 67 | + | | |
| 68 | + +-----------------------------+ |
| 69 | +``` |
| 70 | + |
| 71 | +In the above diagram: |
| 72 | + |
| 73 | +- `Data service` supports storing, setting, and retrieving documents by key. This is where the documents themselves are kept, as key-value data. |
| 74 | +- `properties` are Document [attributes](https://docs.haystack.deepset.ai/docs/data-classes#document) stored as part of the Document. |
| 75 | +- `embedding` is also a property of the Document (shown separately in the diagram for clarity). It is a vector stored as a JSON array of floats. |
| 76 | +- `Search service` is where indexes purpose-built for Full Text Search and Vector Search are created. It allows efficient querying |
| 77 | +and retrieval based on both text content and vector embeddings. |
| 78 | + |
| 79 | +`CouchbaseDocumentStore` requires the vector index to be created manually, either through the SDK or the UI. Before writing documents, make sure they have been embedded by one of the provided [embedders](https://docs.haystack.deepset.ai/docs/embedders). For example, [SentenceTransformersDocumentEmbedder](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder) can be used in an indexing pipeline to calculate document embeddings before writing them to Couchbase. |
| 80 | + |
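Because the vector index has to exist before documents are queried, it can help to see roughly what such an index definition contains. The sketch below builds a Search index definition as a plain Python dictionary with a single vector field named `embedding`. The helper name `build_vector_index_definition` is hypothetical, and the JSON layout is an assumption modelled on the definitions the Couchbase Web Console exports; compare it against an index exported from your own cluster before relying on it.

```python
import json


def build_vector_index_definition(
    index_name: str,
    bucket: str,
    scope: str,
    collection: str,
    dims: int,
    similarity: str = "dot_product",
) -> dict:
    """Build a Search index definition with one vector field called "embedding".

    The key names are an assumption based on exported Couchbase index JSON.
    """
    return {
        "type": "fulltext-index",
        "name": index_name,
        "sourceType": "gocbcore",
        "sourceName": bucket,
        "params": {
            "doc_config": {"mode": "scope.collection.type_field", "type_field": "type"},
            "mapping": {
                # Only the collection listed under "types" is indexed.
                "default_mapping": {"dynamic": True, "enabled": False},
                "types": {
                    f"{scope}.{collection}": {
                        "dynamic": False,
                        "enabled": True,
                        "properties": {
                            "embedding": {
                                "enabled": True,
                                "dynamic": False,
                                "fields": [
                                    {
                                        "name": "embedding",
                                        "type": "vector",
                                        "dims": dims,  # must match the embedding model
                                        "similarity": similarity,
                                        "index": True,
                                    }
                                ],
                            }
                        },
                    }
                },
            },
        },
    }


definition = build_vector_index_definition(
    index_name="vector_search_index",
    bucket="haystack_bucket_name",
    scope="haystack_scope_name",
    collection="haystack_collection_name",
    dims=384,
)
print(json.dumps(definition, indent=2))
```

The printed JSON could then be pasted into the index import dialog of the Web Console or passed to whichever SDK call you use to create Search indexes.
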
| 81 | +## Installation |
| 82 | + |
| 83 | +`couchbase-haystack` can be installed like any other Python library, using pip: |
| 84 | + |
| 85 | +```bash |
| 86 | +pip install --upgrade pip # optional |
| 87 | +pip install sentence-transformers # required in order to run pipeline examples given below |
| 88 | +pip install couchbase-haystack |
| 89 | +``` |
| 90 | + |
| 91 | +## Usage |
| 92 | + |
| 93 | +### Running Couchbase |
| 94 | + |
| 95 | +You will need a running instance of Couchbase to use the components from this package. There are several options available: |
| 96 | + |
| 97 | +- [Docker](https://docs.couchbase.com/server/current/getting-started/do-a-quick-install.html) |
| 98 | +- [Couchbase Cloud](https://www.couchbase.com/products/capella) - a fully managed cloud service |
| 99 | +- [Couchbase Server](https://www.couchbase.com/downloads) - installable on various operating systems |
| 100 | + |
| 101 | +The simplest way to start the database locally is with a Docker container: |
| 102 | + |
| 103 | +```bash |
| 104 | +docker run \ |
| 105 | + --restart always \ |
| 106 | + --publish=8091-8096:8091-8096 --publish=11210:11210 \ |
| 107 | + --env COUCHBASE_ADMINISTRATOR_USERNAME=admin \ |
| 108 | + --env COUCHBASE_ADMINISTRATOR_PASSWORD=passw0rd \ |
| 109 | + couchbase:enterprise-7.6.2 |
| 110 | +``` |
| 111 | + |
| 112 | +In this example, the container is started using Couchbase Server version `7.6.2`. The `COUCHBASE_ADMINISTRATOR_USERNAME` and `COUCHBASE_ADMINISTRATOR_PASSWORD` environment variables set the default credentials for authentication. |
| 113 | + |
| 114 | +> **Note:** |
| 115 | +> Assuming you have a Docker container running, navigate to <http://localhost:8091> to open the Couchbase Web Console and explore your data. |
| 116 | + |
| 117 | +### Document Store |
| 118 | + |
| 119 | +Once you have the package installed and the database running, you can start using `CouchbaseDocumentStore` like any other document store that supports embeddings. |
| 120 | + |
| 121 | +```python |
| 122 | +from haystack.utils.auth import Secret |
| 123 | +from couchbase_haystack import CouchbaseDocumentStore, CouchbasePasswordAuthenticator |
| 124 | + |
| 125 | +document_store = CouchbaseDocumentStore( |
| 126 | + cluster_connection_string=Secret.from_env_var("CB_CONNECTION_STRING"), |
| 127 | + authenticator=CouchbasePasswordAuthenticator( |
| 128 | + username=Secret.from_env_var("CB_USERNAME"), |
| 129 | + password=Secret.from_env_var("CB_PASSWORD") |
| 130 | + ), |
| 131 | + bucket = "haystack_bucket_name", |
| 132 | + scope="haystack_scope_name", |
| 133 | + collection="haystack_collection_name", |
| 134 | + vector_search_index = "vector_search_index" |
| 135 | +) |
| 136 | +``` |
| 137 | + |
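The constructor above reads its connection settings from the environment variables `CB_CONNECTION_STRING`, `CB_USERNAME`, and `CB_PASSWORD`. A minimal sketch of setting them for a local run follows. The values are assumptions that match the Docker example earlier on this page (`couchbase://localhost` as the connection string, plus the administrator credentials); outside of local experiments you would export them in your shell or a secrets manager instead.

```python
import os

# Assumed values for the local Docker container started earlier.
# setdefault() leaves any value that is already exported untouched.
os.environ.setdefault("CB_CONNECTION_STRING", "couchbase://localhost")
os.environ.setdefault("CB_USERNAME", "admin")
os.environ.setdefault("CB_PASSWORD", "passw0rd")

# Fail early with a clear message if something is still missing or empty.
required = ("CB_CONNECTION_STRING", "CB_USERNAME", "CB_PASSWORD")
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```

Run this before constructing `CouchbaseDocumentStore` so that `Secret.from_env_var` can resolve each value.
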
| 138 | +Assuming a list of documents is available and a Couchbase database is running, you can write (index) them in Couchbase, e.g.: |
| 139 | + |
| 140 | +```python |
| 141 | +from haystack import Document |
| 142 | + |
| 143 | +documents = [Document(content="Alice has been living in New York City for the past 5 years.")] |
| 144 | + |
| 145 | +document_store.write_documents(documents) |
| 146 | +``` |
| 147 | + |
| 148 | +If you want to compute embeddings before writing documents, use the following code: |
| 149 | + |
| 150 | +```python |
| 151 | +from haystack import Document |
| 152 | + |
| 153 | +# import one of the available document embedders |
| 154 | +from haystack.components.embedders import SentenceTransformersDocumentEmbedder |
| 155 | + |
| 156 | +documents = [Document(content="Alice has been living in New York City for the past 5 years.")] |
| 157 | + |
| 158 | +document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2") |
| 159 | +document_embedder.warm_up() # will download the model during first run |
| 160 | +documents_with_embeddings = document_embedder.run(documents) |
| 161 | + |
| 162 | +document_store.write_documents(documents_with_embeddings.get("documents")) |
| 163 | +``` |
| 164 | + |
| 165 | +Make sure the embedding model produces vectors of the same size as the dimension configured on the Couchbase vector search index. For example, an index configured with 384 dimensions matches the "sentence-transformers/all-MiniLM-L6-v2" model. |
| 166 | + |
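A dimension mismatch is easier to catch before writing than after a query fails. The following sketch checks a list of embeddings against an expected size. The helper `find_dimension_mismatches` and the constant `EXPECTED_DIM` are hypothetical names used for illustration and are not part of `couchbase-haystack`.

```python
from typing import List, Optional, Sequence


def find_dimension_mismatches(
    embeddings: Sequence[Optional[List[float]]], expected_dim: int
) -> List[int]:
    """Return the positions of embeddings that are missing or have the wrong length."""
    return [
        position
        for position, embedding in enumerate(embeddings)
        if embedding is None or len(embedding) != expected_dim
    ]


EXPECTED_DIM = 384  # assumed to be the dimension configured on the vector index

# Stand-in data: one valid vector, one that is too short, one that is missing.
embeddings = [[0.0] * 384, [0.0] * 383, None]

bad_positions = find_dimension_mismatches(embeddings, EXPECTED_DIM)
print(bad_positions)  # → [1, 2]
```

With real data, the list could be built as `[doc.embedding for doc in documents_with_embeddings["documents"]]` before calling `write_documents`.
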
| 167 | +> **Note** |
| 168 | +> Most of the time you will be using [Haystack Pipelines](https://docs.haystack.deepset.ai/docs/pipelines) to build both indexing and querying RAG scenarios. |
| 169 | + |
| 170 | +It is important to understand how Haystack Documents are stored in Couchbase after you call `write_documents`. |
| 171 | + |
| 172 | +```python |
| 173 | +from random import random |
| 174 | + |
| 175 | +sample_embedding = [random() for _ in range(384)] # using fake/random embedding for brevity here to simplify example |
| 176 | +document = Document( |
| 177 | + content="Alice has been living in New York City for the past 5 years.", embedding=sample_embedding, meta={"num_of_years": 5} |
| 178 | +) |
| 179 | +document.to_dict() |
| 180 | +``` |
| 181 | + |
| 182 | +The above code converts a Document to a dictionary and will render the following output: |
| 183 | + |
| 184 | +```bash |
| 185 | +>>> output: |
| 186 | +{ |
| 187 | + "id": "11c255ad10bff4286781f596a5afd9ab093ed056d41bca4120c849058e52f24d", |
| 188 | + "content": "Alice has been living in New York City for the past 5 years.", |
| 189 | + "dataframe": None, |
| 190 | + "blob": None, |
| 191 | + "score": None, |
| 192 | + "embedding": [0.025010755222666936, 0.27502931836911926, 0.22321073814882275, ...], # vector of size 384 |
| 193 | + "num_of_years": 5, |
| 194 | +} |
| 195 | +``` |
| 196 | + |
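| 197 | +The data from the dictionary will be used to create a document in Couchbase after you write the document with `document_store.write_documents([document])`. You can inspect it with a SQL++ query (assuming the Query service is available and the collection can be scanned), e.g. `SELECT * FROM haystack_bucket_name.haystack_scope_name.haystack_collection_name LIMIT 1`. Below is the resulting JSON document in Couchbase: |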
| 198 | + |
| 199 | +```js |
| 200 | +{ |
| 201 | + "id": "11c255ad10bff4286781f596a5afd9ab093ed056d41bca4120c849058e52f24d", |
| 202 | + "embedding": [0.6394268274307251, 0.02501075528562069,0.27502933144569397, ...], // vector of size 384 |
| 203 | + "content": "Alice has been living in New York City for the past 5 years.", |
| 204 | + "meta": { |
| 205 | + "num_of_years": 5 |
| 206 | + } |
| 207 | +} |
| 208 | +``` |
| 209 | + |
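Comparing the two shapes above, the flat dictionary keeps `num_of_years` at the top level, while the stored JSON nests it under `meta` and omits fields that are `None`. The sketch below reproduces that reshaping for illustration only. The helper `nest_meta` is hypothetical and is not the library's actual implementation.

```python
from typing import Any, Dict

# Top-level fields of a Haystack Document, as shown in the to_dict() output above.
DOCUMENT_FIELDS = {"id", "content", "dataframe", "blob", "score", "embedding"}


def nest_meta(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Keep non-empty Document fields at the top level and move the rest under "meta"."""
    stored = {
        key: value
        for key, value in flat.items()
        if key in DOCUMENT_FIELDS and value is not None
    }
    stored["meta"] = {
        key: value for key, value in flat.items() if key not in DOCUMENT_FIELDS
    }
    return stored


flat_document = {
    "id": "abc",
    "content": "Alice has been living in New York City for the past 5 years.",
    "dataframe": None,
    "blob": None,
    "score": None,
    "embedding": [0.1, 0.2],  # shortened for readability
    "num_of_years": 5,
}

stored_document = nest_meta(flat_document)
print(stored_document)
```

The result has the same layout as the JSON document shown above: `id`, `content`, and `embedding` at the top level, with `num_of_years` inside `meta`.
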
| 210 | +The full list of parameters accepted by `CouchbaseDocumentStore` can be found in the |
| 211 | +[API documentation](https://couchbase-ecosystem.github.io/couchbase-haystack/reference/couchbase_document_store). |
| 212 | + |
| 213 | +### Indexing documents |
| 214 | + |
| 215 | +With Haystack you can use the [DocumentWriter](https://docs.haystack.deepset.ai/docs/documentwriter) component to write Documents into a Document Store. In the example below we construct a pipeline that writes documents to Couchbase using `CouchbaseDocumentStore`: |
| 216 | + |
| 217 | +```python |
| 218 | +from haystack import Document |
| 219 | +from haystack.components.embedders import SentenceTransformersDocumentEmbedder |
| 220 | +from haystack.components.writers import DocumentWriter |
| 221 | +from haystack import Pipeline |
| 222 | +from haystack.utils.auth import Secret |
| 223 | +from couchbase_haystack import CouchbaseDocumentStore, CouchbasePasswordAuthenticator |
| 224 | + |
| 225 | +documents = [Document(content="This is document 1"), Document(content="This is document 2")] |
| 226 | + |
| 227 | +document_store = CouchbaseDocumentStore( |
| 228 | + cluster_connection_string=Secret.from_env_var("CB_CONNECTION_STRING"), |
| 229 | + authenticator=CouchbasePasswordAuthenticator( |
| 230 | + username=Secret.from_env_var("CB_USERNAME"), |
| 231 | + password=Secret.from_env_var("CB_PASSWORD") |
| 232 | + ), |
| 233 | + bucket = "haystack_bucket_name", |
| 234 | + scope="haystack_scope_name", |
| 235 | + collection="haystack_collection_name", |
| 236 | + vector_search_index = "vector_search_index" |
| 237 | +) |
| 238 | +embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2") |
| 239 | +document_writer = DocumentWriter(document_store=document_store) |
| 240 | + |
| 241 | +indexing_pipeline = Pipeline() |
| 242 | +indexing_pipeline.add_component(instance=embedder, name="embedder") |
| 243 | +indexing_pipeline.add_component(instance=document_writer, name="writer") |
| 244 | + |
| 245 | +indexing_pipeline.connect("embedder", "writer") |
| 246 | +indexing_pipeline.run({"embedder": {"documents": documents}}) |
| 247 | +``` |
| 248 | + |
| 249 | +```bash |
| 250 | +>>> output: |
| 251 | +{'writer': {'documents_written': 2}} |
| 252 | +``` |
| 253 | + |
| 254 | +### Retrieving documents |
| 255 | + |
| 256 | +The `CouchbaseEmbeddingRetriever` component can be used to retrieve documents from Couchbase by querying the vector index with an embedded query. Below is a pipeline that finds documents using a query embedding: |
| 257 | + |
| 258 | +```python |
| 259 | +from typing import List |
| 260 | + |
| 261 | +from haystack import Document, Pipeline |
| 262 | +from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder |
| 263 | +from haystack.utils.auth import Secret |
| 264 | +from couchbase_haystack.document_store import CouchbaseDocumentStore, CouchbasePasswordAuthenticator |
| 265 | +from couchbase_haystack.component.retriever import CouchbaseEmbeddingRetriever |
| 266 | + |
| 267 | +document_store = CouchbaseDocumentStore( |
| 268 | + cluster_connection_string=Secret.from_env_var("CB_CONNECTION_STRING"), |
| 269 | + authenticator=CouchbasePasswordAuthenticator( |
| 270 | + username=Secret.from_env_var("CB_USERNAME"), |
| 271 | + password=Secret.from_env_var("CB_PASSWORD") |
| 272 | + ), |
| 273 | + bucket = "haystack_bucket_name", |
| 274 | + scope="haystack_scope_name", |
| 275 | + collection="haystack_collection_name", |
| 276 | + vector_search_index = "vector_search_index" |
| 277 | +) |
| 278 | + |
| 279 | +documents = [ |
| 280 | + Document(content="Alice has been living in New York City for the past 5 years.", meta={"num_of_years": 5, "city": "New York"}), |
| 281 | + Document(content="John moved to Los Angeles 2 years ago and loves the sunny weather.", meta={"num_of_years": 2, "city": "Los Angeles"}), |
| 282 | +] |
| 283 | + |
| 284 | +# Same model is used for both query and Document embeddings |
| 285 | +model_name = "sentence-transformers/all-MiniLM-L6-v2" |
| 286 | + |
| 287 | +document_embedder = SentenceTransformersDocumentEmbedder(model=model_name) |
| 288 | +document_embedder.warm_up() |
| 289 | +documents_with_embeddings = document_embedder.run(documents) |
| 290 | + |
| 291 | +document_store.write_documents(documents_with_embeddings.get("documents")) |
| 292 | + |
| 293 | +print("Number of documents written: ", document_store.count_documents()) |
| 294 | + |
| 295 | +pipeline = Pipeline() |
| 296 | +pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model=model_name)) |
| 297 | +pipeline.add_component("retriever", CouchbaseEmbeddingRetriever(document_store=document_store)) |
| 298 | +pipeline.connect("text_embedder.embedding", "retriever.query_embedding") |
| 299 | + |
| 300 | +result = pipeline.run( |
| 301 | + data={ |
| 302 | + "text_embedder": {"text": "What cities do people live in?"}, |
| 303 | + "retriever": { |
| 304 | + "top_k": 5 |
| 305 | + }, |
| 306 | + } |
| 307 | +) |
| 308 | + |
| 309 | +documents: List[Document] = result["retriever"]["documents"] |
| 310 | +``` |
| 311 | + |
| 312 | +```bash |
| 313 | +>>> output: |
| 314 | +[Document(id=3e35fa03aff6e3c45e6560f58adc4fde3c436c111a8809c30133b5cb492e8694, content: 'Alice has been living in New York City for the past 5 years.', meta: {'num_of_years': 5, 'city': 'New York'}, score: 0.36796408891677856, embedding: vector of size 384), Document(id=ca4d7d7d7ff6c13b950a88580ab134b2dc15b48a47b8f571a46b354b5344e5fa, content: 'John moved to Los Angeles 2 years ago and loves the sunny weather.', meta: {'num_of_years': 2, 'city': 'Los Angeles'}, score: 0.3126790523529053, embedding: vector of size 384)] |
| 315 | +``` |
| 316 | + |
| 317 | +### More examples |
| 318 | + |
| 319 | +You can find more examples in the implementation [repository](https://github.com/Couchbase-Ecosystem/couchbase-haystack/tree/main/examples): |
| 320 | + |
| 321 | +- [indexing_pipeline.py](https://github.com/Couchbase-Ecosystem/couchbase-haystack/tree/main/examples/indexing_pipeline.py) - Indexing text files (documents) from a remote HTTP location. |
| 322 | +- [rag_pipeline.py](https://github.com/Couchbase-Ecosystem/couchbase-haystack/tree/main/examples/rag_pipeline.py) - Generative question answering RAG pipeline that uses `CouchbaseEmbeddingRetriever` to fetch documents from the Couchbase document store and answers questions using [HuggingFaceAPIGenerator](https://docs.haystack.deepset.ai/docs/huggingfacetgigenerator). |
| 323 | + |
| 324 | +## License |
| 325 | + |
| 326 | +`couchbase-haystack` is distributed under the terms of the [MIT](https://spdx.org/licenses/MIT.html) license. |