|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "id": "8e19358e-22e8-406c-ae17-d916db889313", |
| 6 | + "metadata": {}, |
| 7 | + "source": [ |
| 8 | + "<div id=\"singlestore-header\" style=\"display: flex; background-color: rgba(210, 255, 153, 0.25); padding: 5px;\">\n", |
| 9 | + " <div id=\"icon-image\" style=\"width: 90px; height: 90px;\">\n", |
| 10 | + " <img width=\"100%\" height=\"100%\" src=\"https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/header-icons/chart-network.png\" />\n", |
| 11 | + " </div>\n", |
| 12 | + " <div id=\"text\" style=\"padding: 5px; margin-left: 10px;\">\n", |
| 13 | + " <div id=\"badge\" style=\"display: inline-block; background-color: rgba(0, 0, 0, 0.15); border-radius: 4px; padding: 4px 8px; align-items: center; margin-top: 6px; margin-bottom: -2px; font-size: 80%\">SingleStore Notebooks</div>\n", |
| 14 | + " <h1 style=\"font-weight: 500; margin: 8px 0 0 4px;\">Semantic Search with Hugging Face Models and Datasets</h1>\n", |
| 15 | + " </div>\n", |
| 16 | + "</div>" |
| 17 | + ] |
| 18 | + }, |
| 19 | + { |
| 20 | + "cell_type": "markdown", |
| 21 | + "id": "9bebf253-7913-4d7a-8ebc-f10463803baa", |
| 22 | + "metadata": {}, |
| 23 | + "source": [ |
| 24 | + "In this notebook, we will demonstrate an example of conducting semantic search on SingleStoreDB with SQL! Unlike traditional keyword-based search methods, semantic search algorithms take into account the relationships between words and their meanings, enabling them to deliver more accurate and relevant results \u2013 even when search terms are vague or ambiguous. \n", |
| 25 | + "\n", |
| 26 | + "SingleStoreDB\u2019s built-in parallelization and Intel SIMD-based vector processing take care of the heavy lifting involved in processing vector data. This allows you to run your ML algorithms right in your database extremely efficiently with just two lines of SQL!\n", |
| 27 | + "\n", |
| 28 | + "\n", |
| 29 | + "In this example, we use Hugging Face to create embeddings for our dataset and run a semantic search using the `dot_product` vector matching function!" |
| 30 | + ] |
| 31 | + }, |
| 32 | + { |
| 33 | + "cell_type": "markdown", |
| 34 | + "id": "358d1eb0-a0dd-423d-86ea-0d131abe4169", |
| 35 | + "metadata": {}, |
| 36 | + "source": [ |
| 37 | + "## 1. Create a workspace in your workspace group\n", |
| 38 | + "\n", |
| 39 | + "S-00 is sufficient.\n", |
| 40 | + "\n", |
| 41 | + "## 2. Create a database named `semantic_search`" |
| 42 | + ] |
| 43 | + }, |
| 44 | + { |
| 45 | + "cell_type": "code", |
| 46 | + "execution_count": null, |
| 47 | + "id": "af5e02fb-e15b-4c85-ac69-a40dd974cd88", |
| 48 | + "metadata": {}, |
| 49 | + "outputs": [], |
| 50 | + "source": [ |
| 51 | + "%%sql\n", |
| 52 | + "DROP DATABASE IF EXISTS semantic_search;\n", |
| 53 | + "\n", |
| 54 | + "CREATE DATABASE semantic_search;" |
| 55 | + ] |
| 56 | + }, |
| 57 | + { |
| 58 | + "cell_type": "markdown", |
| 59 | + "id": "284f2bdc-a428-4a55-9f1f-fce623914b34", |
| 60 | + "metadata": {}, |
| 61 | + "source": [ |
| 62 | + "<div class=\"alert alert-block alert-warning\">\n", |
| 63 | + " <b class=\"fa fa-solid fa-exclamation-circle\"></b>\n", |
| 64 | + " <div>\n", |
| 65 | + " <p><b>Action Required</b></p>\n", |
| 66 | + " <p>Make sure to select the <tt>semantic_search</tt> database from the drop-down menu at the top of this notebook.\n", |
| 67 | + " It updates the <tt>connection_url</tt> which is used by the <tt>%%sql</tt> magic command and SQLAlchemy to make connections to the selected database.</p>\n", |
| 68 | + " </div>\n", |
| 69 | + "</div>" |
| 70 | + ] |
| 71 | + }, |
| 72 | + { |
| 73 | + "cell_type": "markdown", |
| 74 | + "id": "8124ab1c-7f17-47bc-9f8a-c7bd5a33a426", |
| 75 | + "metadata": {}, |
| 76 | + "source": [ |
| 77 | + "## 3. Install and import required libraries\n", |
| 78 | + "\n", |
| 79 | + "We will use an embedding model from Hugging Face with the Sentence Transformers library. We will be analyzing reviews that visitors left for Disneyland. This dataset is available on Hugging Face, and to use it, we will need the `datasets` library." |
| 80 | + ] |
| 81 | + }, |
| 82 | + { |
| 83 | + "cell_type": "code", |
| 84 | + "execution_count": null, |
| 85 | + "id": "af6146b2-a044-4dd8-b020-e3d8c1f91aba", |
| 86 | + "metadata": {}, |
| 87 | + "outputs": [], |
| 88 | + "source": [ |
| 89 | + "!pip3 install -U sentence-transformers torch tensorflow datasets --quiet\n", |
| 90 | + "\n", |
| 91 | + "import json\n", |
| 92 | + "\n", |
| 93 | + "import ibis\n", |
| 94 | + "import numpy as np\n", |
| 95 | + "import pandas as pd\n", |
| 96 | + "import singlestoredb as s2\n", |
| 97 | + "import torch\n", |
| 98 | + "\n", |
| 99 | + "from datasets import load_dataset\n", |
| 100 | + "from transformers import AutoTokenizer, AutoModel" |
| 101 | + ] |
| 102 | + }, |
| 103 | + { |
| 104 | + "cell_type": "markdown", |
| 105 | + "id": "f80d23bc-7e98-4ac8-b2a0-7a737e4010e5", |
| 106 | + "metadata": {}, |
| 107 | + "source": [ |
| 108 | + "## 4. Load Sentence Transformer library and create a function called `get_embedding()`\n", |
| 109 | + "\n", |
| 110 | + "To vectorize and embed the reviews that customers of Disneyland left, we will use the `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` model. We will create a function called `get_embedding()` that calls this model and returns the vectorized version of the sentence." |
| 111 | + ] |
| 112 | + }, |
| 113 | + { |
| 114 | + "cell_type": "code", |
| 115 | + "execution_count": null, |
| 116 | + "id": "a463c0fd-c747-4605-a728-c22a2fa17cfb", |
| 117 | + "metadata": {}, |
| 118 | + "outputs": [], |
| 119 | + "source": [ |
| 120 | + "# Load Sentence Transformers model\n", |
| 121 | + "model_name = \"sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2\"\n", |
| 122 | + "\n", |
| 123 | + "model = AutoModel.from_pretrained(model_name)\n", |
| 124 | + "tokenizer = AutoTokenizer.from_pretrained(model_name)" |
| 125 | + ] |
| 126 | + }, |
| 127 | + { |
| 128 | + "cell_type": "code", |
| 129 | + "execution_count": null, |
| 130 | + "id": "f2e31300-1e6a-425c-bcf7-3708ce9e40d0", |
| 131 | + "metadata": {}, |
| 132 | + "outputs": [], |
| 133 | + "source": [ |
| 134 | + "# Define a function to get embeddings\n", |
| 135 | + "def get_embedding(sentence):\n", |
| 136 | + " inputs = tokenizer(sentence, padding=True, truncation=True, return_tensors=\"pt\")\n", |
| 137 | + " with torch.no_grad():\n", |
| 138 | + " embedding = model(**inputs).last_hidden_state.mean(dim=1).squeeze().tolist()\n", |
| 139 | + " return embedding" |
| 140 | + ] |
| 141 | + }, |
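The pooling step inside `get_embedding()` can be illustrated without downloading the model: mean pooling averages the per-token vectors in `last_hidden_state` into a single fixed-size sentence vector. A minimal sketch with a made-up 3-token, 4-dimensional hidden state (real MiniLM-L12-v2 output would be `num_tokens x 384`):

```python
# Mock "last_hidden_state" for one sentence: 3 tokens x 4 hidden dims.
# These numbers are illustrative, not real model output.
last_hidden_state = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0, 6.0],
    [5.0, 6.0, 7.0, 8.0],
]

# Mean pooling: average over the token axis, mirroring
# model(**inputs).last_hidden_state.mean(dim=1) in get_embedding().
num_tokens = len(last_hidden_state)
embedding = [sum(col) / num_tokens for col in zip(*last_hidden_state)]
print(embedding)  # [3.0, 4.0, 5.0, 6.0]
```

However many tokens the sentence has, the result always has the model's hidden dimension, which is why every review maps to a vector of the same length.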
| 142 | + { |
| 143 | + "cell_type": "markdown", |
| 144 | + "id": "17fb3aad-e3a8-4a2a-985c-64f0c94431b8", |
| 145 | + "metadata": {}, |
| 146 | + "source": [ |
| 147 | + "## 5. Load the dataset on Disneyland reviews from Hugging Face into a `DataFrame`\n", |
| 148 | + "\n", |
| 149 | + "We will do some light cleanup and sample 100 random reviews from the \"train\" split of disneyland_reviews." |
| 150 | + ] |
| 151 | + }, |
| 152 | + { |
| 153 | + "cell_type": "code", |
| 154 | + "execution_count": null, |
| 155 | + "id": "e3af3810-0ce5-432b-a879-4eaa16524d38", |
| 156 | + "metadata": {}, |
| 157 | + "outputs": [], |
| 158 | + "source": [ |
| 159 | + "# Load the dataset into a pandas DataFrame\n", |
| 160 | + "dataset = load_dataset(\"dariadaria/disneyland_reviews\")\n", |
| 161 | + "dataframe = dataset[\"train\"].to_pandas()\n", |
| 162 | + "\n", |
| 163 | + "sample_size = 100 # Adjust the desired sample size\n", |
| 164 | + "random_sample = dataframe.sample(n=sample_size)\n", |
| 165 | + "random_sample['Review_Text'] = random_sample['Review_Text'].astype(str)\n", |
| 166 | + "random_sample['Review_Text'] = random_sample['Review_Text'].str.replace(\"'\", \"\").str.replace('\"', '')" |
| 167 | + ] |
| 168 | + }, |
| 169 | + { |
| 170 | + "cell_type": "markdown", |
| 171 | + "id": "8188ccb2-d5cf-48b5-8c9f-8b3858c18ae7", |
| 172 | + "metadata": {}, |
| 173 | + "source": [ |
| 174 | + "## 6. Insert data into SingleStoreDB\n", |
| 175 | + "\n", |
| 176 | + "You can seamlessly bring this data to your SingleStoreDB table directly from your `DataFrame`. This is the magic of the Ibis library. This process is extremely performant and happens in the engine. SingleStore \u2665\ufe0f Python.\n", |
| 177 | + "\n", |
| 178 | + "We will bring this data into a table called `reviews`. Notice how you don't have to write any SQL for this\u00a0\u2013\u00a0we will infer the schema from your `DataFrame` and, under the hood, configure how to bring it into our database." |
| 179 | + ] |
| 180 | + }, |
| 181 | + { |
| 182 | + "cell_type": "code", |
| 183 | + "execution_count": null, |
| 184 | + "id": "419a690a-810c-4c80-b7ea-fd25cf1d5e80", |
| 185 | + "metadata": {}, |
| 186 | + "outputs": [], |
| 187 | + "source": [ |
| 188 | + "conn = ibis.singlestoredb.connect()\n", |
| 189 | + "reviews_tbl = conn.create_table('reviews', random_sample, force=True)\n", |
| 190 | + "conn.show.create_table('reviews')" |
| 191 | + ] |
| 192 | + }, |
| 193 | + { |
| 194 | + "cell_type": "markdown", |
| 195 | + "id": "db124797-a11c-4a97-9f58-b337c50014e3", |
| 196 | + "metadata": {}, |
| 197 | + "source": [ |
| 198 | + "## 7. Generate embeddings of the reviews left by customers and add them to your SingleStoreDB table\n", |
| 199 | + "\n", |
| 200 | + "We want to embed the entries in the `Review_Text` column and add the embeddings to the database. We will do this with SQL. Embeddings are stored as a `BLOB` type in SingleStoreDB." |
| 201 | + ] |
| 202 | + }, |
| 203 | + { |
| 204 | + "cell_type": "code", |
| 205 | + "execution_count": null, |
| 206 | + "id": "b6c511b8-173d-4975-a8a5-ea693f5fc3bc", |
| 207 | + "metadata": {}, |
| 208 | + "outputs": [], |
| 209 | + "source": [ |
| 210 | + "# Add a new column called embeddings to your reviews table\n", |
| 211 | + "%sql ALTER TABLE reviews ADD embeddings BLOB;" |
| 212 | + ] |
| 213 | + }, |
| 214 | + { |
| 215 | + "cell_type": "code", |
| 216 | + "execution_count": null, |
| 217 | + "id": "bce5a7cb-ad4f-4293-8bc3-9d09f76ae5e8", |
| 218 | + "metadata": {}, |
| 219 | + "outputs": [], |
| 220 | + "source": [ |
| 221 | + "reviews = %sql SELECT Review_Text FROM reviews;\n", |
| 222 | + "\n", |
| 223 | + "for i in reviews:\n", |
| 224 | + " review_embedding = json.dumps(get_embedding(i[0]))\n", |
| 225 | + " %sql UPDATE reviews SET embeddings = JSON_ARRAY_PACK('{{review_embedding}}') WHERE Review_Text='{{i[0]}}';" |
| 226 | + ] |
| 227 | + }, |
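To make the `BLOB` storage concrete: `JSON_ARRAY_PACK` turns the JSON array into a binary blob of packed little-endian 32-bit floats. A rough sketch of the equivalent packing in Python using the standard `struct` module (the three-element vector is made-up illustration data, far shorter than a real 384-dimensional embedding):

```python
import json
import struct

# A made-up three-dimensional "embedding" for illustration.
embedding = [0.25, -1.5, 3.0]
payload = json.dumps(embedding)  # what the notebook interpolates into the query

# JSON_ARRAY_PACK('[0.25, -1.5, 3.0]') stores the numbers as packed
# little-endian 32-bit floats, 4 bytes per element.
blob = struct.pack(f'<{len(embedding)}f', *json.loads(payload))
print(len(blob))  # 12 bytes for a 3-element vector

# JSON_ARRAY_UNPACK reverses the packing; struct.unpack does the same here.
print(list(struct.unpack('<3f', blob)))  # [0.25, -1.5, 3.0]
```

This is why the column is declared as `BLOB` rather than `TEXT`: the database operates on the packed floats directly, without re-parsing JSON on every comparison.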
| 228 | + { |
| 229 | + "cell_type": "markdown", |
| 230 | + "id": "e34e62fb-7690-4a31-a874-ff7856d16cc7", |
| 231 | + "metadata": {}, |
| 232 | + "source": [ |
| 233 | + "## 8. Run the semantic search algorithm with just one line of SQL\n", |
| 234 | + "\n", |
| 235 | + "We will utilize SingleStoreDB's distributed architecture to efficiently compute the dot product of the input string's embedding (stored in `search_embedding`) with each entry in the database and return the top 5 reviews with the highest dot product score. When each vector is normalized to length 1, the dot product computes the cosine similarity between two vectors \u2013 an appropriate nearness metric. SingleStoreDB makes this extremely fast because it compiles queries to machine code and runs `dot_product` using SIMD instructions." |
| 236 | + ] |
| 237 | + }, |
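The ranking that `DOT_PRODUCT ... ORDER BY Score DESC` performs can be sketched in a few lines of plain Python. The query and document vectors below are made-up 2-D examples; for unit-length vectors the dot product equals the cosine similarity:

```python
import math

def dot(a, b):
    # Elementwise product summed up, same as SQL's DOT_PRODUCT.
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale a vector to unit length so dot product = cosine similarity.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Made-up 2-D "embeddings" for a query and three documents.
query = normalize([3.0, 4.0])
docs = {
    'doc_a': normalize([4.0, 3.0]),    # similar direction
    'doc_b': normalize([-3.0, -4.0]),  # opposite direction
    'doc_c': normalize([3.0, 4.0]),    # identical direction
}

# Rank by descending score, mirroring ORDER BY Score DESC LIMIT 5.
ranked = sorted(docs, key=lambda d: dot(query, docs[d]), reverse=True)
print(ranked)  # ['doc_c', 'doc_a', 'doc_b']
```

The SQL version does exactly this comparison, but across every row in parallel and over the packed 384-dimensional vectors.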
| 238 | + { |
| 239 | + "cell_type": "code", |
| 240 | + "execution_count": null, |
| 241 | + "id": "08bd6b1c-9731-4062-9b9a-a5e1a1d8efa3", |
| 242 | + "metadata": {}, |
| 243 | + "outputs": [], |
| 244 | + "source": [ |
| 245 | + "searchstring = input(\"Please enter a search string: \")\n", |
| 246 | + "\n", |
| 247 | + "search_embedding = json.dumps(get_embedding(searchstring)) \n", |
| 248 | + "\n", |
| 249 | + "results = %sql SELECT Review_Text, DOT_PRODUCT(embeddings, JSON_ARRAY_PACK('{{search_embedding}}')) AS Score FROM reviews ORDER BY Score DESC LIMIT 5;\n", |
| 250 | + "\n", |
| 251 | + "for i, res in enumerate(results):\n", |
| 252 | + " print(f'{i + 1}: {res[0]} Score: {res[1]}')" |
| 253 | + ] |
| 254 | + }, |
| 255 | + { |
| 256 | + "cell_type": "markdown", |
| 257 | + "id": "0383939d-7fd3-434d-a27b-952eeed40e5f", |
| 258 | + "metadata": {}, |
| 259 | + "source": [ |
| 260 | + "## 9. Clean up" |
| 261 | + ] |
| 262 | + }, |
| 263 | + { |
| 264 | + "cell_type": "code", |
| 265 | + "execution_count": null, |
| 266 | + "id": "0e91592f-4856-4cab-b15e-23585f551ab3", |
| 267 | + "metadata": {}, |
| 268 | + "outputs": [], |
| 269 | + "source": [ |
| 270 | + "%%sql\n", |
| 271 | + "DROP DATABASE semantic_search;" |
| 272 | + ] |
| 273 | + }, |
| 274 | + { |
| 275 | + "cell_type": "markdown", |
| 276 | + "id": "a6829f66-b37e-493d-9631-6da519140485", |
| 277 | + "metadata": {}, |
| 278 | + "source": [ |
| 279 | + "<div id=\"singlestore-footer\" style=\"background-color: rgba(194, 193, 199, 0.25); height:2px; margin-bottom:10px\"></div>\n", |
| 280 | + "<div><img src=\"https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/singlestore-logo-grey.png\" style=\"padding: 0px; margin: 0px; height: 24px\"/></div>" |
| 281 | + ] |
| 282 | + } |
| 283 | + ], |
| 284 | + "metadata": { |
| 285 | + "kernelspec": { |
| 286 | + "display_name": "Python 3 (ipykernel)", |
| 287 | + "language": "python", |
| 288 | + "name": "python3" |
| 289 | + }, |
| 290 | + "language_info": { |
| 291 | + "codemirror_mode": { |
| 292 | + "name": "ipython", |
| 293 | + "version": 3 |
| 294 | + }, |
| 295 | + "file_extension": ".py", |
| 296 | + "mimetype": "text/x-python", |
| 297 | + "name": "python", |
| 298 | + "nbconvert_exporter": "python", |
| 299 | + "pygments_lexer": "ipython3", |
| 300 | + "version": "3.11.3" |
| 301 | + } |
| 302 | + }, |
| 303 | + "nbformat": 4, |
| 304 | + "nbformat_minor": 5 |
| 305 | +} |