
Commit 5c53f51

Merge pull request #81 from pablomarin/main
Merge from source
2 parents 5eca1b0 + ddacbc1 · commit 5c53f51

19 files changed: +23101 -1097 lines changed

01-Load-Data-ACogSearch.ipynb

Lines changed: 31 additions & 56 deletions
@@ -19,14 +19,13 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "# Load and Enrich multiple file types Azure AI Search\n",
+ "# Load and Enrich multiple file types with Azure AI Search\n",
  "\n",
- "In this Jupyter Notebook, we create and run enrichment steps to unlock searchable content in the specified Azure blob. It performs operations over mixed content in Azure Storage, such as images and application files, using a skillset that analyzes and extracts text information that becomes searchable in Azure Cognitive Search. \n",
- "The reference sample can be found at [Tutorial: Use Python and AI to generate searchable content from Azure blobs](https://docs.microsoft.com/azure/search/cognitive-search-tutorial-blob-python).\n",
+ "In this Jupyter Notebook, we create and run enrichment steps to unlock searchable content in the specified Azure blob. It performs operations over mixed content in Azure Storage, such as images and application files, using a skillset that analyzes and extracts text information that becomes searchable in Azure AI Search. \n",
  "\n",
  "In this demo we are going to be using a private (so we can mimic a private data lake scenario) Blob Storage container that has all the dialogues of each episode of the TV Series show: FRIENDS. (3.1k text files).\n",
  "\n",
- "Although only TXT files are used here, this can be done at a much larger scale and Azure Cognitive Search supports a range of other file formats including: PDF, Microsoft Office (DOCX/DOC, XSLX/XLS, PPTX/PPT, MSG), HTML, XML, ZIP, and plain text files (including JSON).\n",
+ "Although only TXT files are used here, this can be done at a much larger scale and Azure AI Search supports a range of other file formats including: PDF, Microsoft Office (DOCX/DOC, XSLX/XLS, PPTX/PPT, MSG), HTML, XML, ZIP, and plain text files (including JSON).\n",
  "Azure Search support the following sources: [Data Sources Gallery](https://learn.microsoft.com/EN-US/AZURE/search/search-data-sources-gallery)\n",
  "\n",
  "This notebook creates the following objects on your search service:\n",
@@ -52,7 +51,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 1,
+ "execution_count": 6,
  "metadata": {
  "tags": []
  },
@@ -70,7 +69,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 2,
+ "execution_count": 7,
  "metadata": {
  "tags": []
  },
@@ -85,7 +84,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 3,
+ "execution_count": 8,
  "metadata": {
  "tags": []
  },
@@ -105,7 +104,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 4,
+ "execution_count": 9,
  "metadata": {
  "tags": []
  },
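Note: the next hunk clears the recorded outputs of the %%time cell that extracts ./data/friends_transcripts.zip and uploads the ~3.1k transcript files to the Blob container. For context, a minimal sketch of that extract-and-upload pattern, assuming the azure-storage-blob SDK and tqdm; the container name and variable names here are illustrative, not necessarily the notebook's own:

    # Sketch only: unzip locally, then upload every file to Azure Blob Storage.
    import os
    import zipfile
    from azure.storage.blob import ContainerClient
    from tqdm import tqdm

    ZIP_PATH = "./data/friends_transcripts.zip"
    EXTRACT_DIR = "./data/temp_extract"
    CONTAINER_NAME = "friends"  # illustrative; the notebook defines BLOB_CONTAINER_NAME elsewhere

    with zipfile.ZipFile(ZIP_PATH) as zf:
        zf.extractall(EXTRACT_DIR)

    container = ContainerClient.from_connection_string(
        os.environ["BLOB_CONNECTION_STRING"], container_name=CONTAINER_NAME
    )

    files = [os.path.join(root, f) for root, _, names in os.walk(EXTRACT_DIR) for f in names]
    for path in tqdm(files, desc="Uploading Files"):
        blob_name = os.path.relpath(path, EXTRACT_DIR)  # keep the folder structure as the blob prefix
        with open(path, "rb") as data:
            container.upload_blob(name=blob_name, data=data, overwrite=True)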
@@ -127,36 +126,11 @@
  },
  {
  "cell_type": "code",
- "execution_count": 5,
+ "execution_count": null,
  "metadata": {
  "tags": []
  },
- "outputs": [
-  {
-   "name": "stdout",
-   "output_type": "stream",
-   "text": [
-    "Extracting ./data/friends_transcripts.zip ... \n",
-    "Extracted ./data/friends_transcripts.zip to ./data/temp_extract\n"
-   ]
-  },
-  {
-   "name": "stderr",
-   "output_type": "stream",
-   "text": [
-    "Uploading Files: 100%|██████████████████████████████████████████| 3107/3107 [08:53<00:00, 5.83it/s]\n"
-   ]
-  },
-  {
-   "name": "stdout",
-   "output_type": "stream",
-   "text": [
-    "Temp Folder: ./data/temp_extract removed\n",
-    "CPU times: user 31.6 s, sys: 5.19 s, total: 36.8 s\n",
-    "Wall time: 11min\n"
-   ]
-  }
- ],
+ "outputs": [],
  "source": [
  "%%time\n",
  "\n",
@@ -175,12 +149,12 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## Create Data Source (Blob container with the Arxiv CS pdfs)"
+ "## Create Data Source (Blob container with the Friends txt files)"
  ]
  },
  {
  "cell_type": "code",
- "execution_count": 6,
+ "execution_count": 10,
  "metadata": {
  "tags": []
  },
@@ -189,7 +163,7 @@
  "name": "stdout",
  "output_type": "stream",
  "text": [
- "201\n",
+ "204\n",
  "True\n"
  ]
  }
@@ -205,7 +179,9 @@
  " \"connectionString\": os.environ['BLOB_CONNECTION_STRING']\n",
  " },\n",
  " \"dataDeletionDetectionPolicy\" : {\n",
- " \"@odata.type\" :\"#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy\" # this makes sure that if the item is deleted from the source, it will be deleted from the index\n",
+ " \"@odata.type\" :\"#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy\", # this makes sure that if the item is deleted from the source, it will be marked deleted in the index\n",
+ " \"softDeleteColumnName\": \"isDeleted\",\n",
+ " \"softDeleteMarkerValue\": \"yes\"\n",
  " },\n",
  " \"container\": {\n",
  " \"name\": BLOB_CONTAINER_NAME\n",
@@ -233,7 +209,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 8,
+ "execution_count": 11,
  "metadata": {
  "tags": []
  },
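Note: the data source hunk above replaces the native blob soft-delete policy with a SoftDeleteColumnDeletionDetectionPolicy keyed on an isDeleted marker with value "yes". A hedged sketch of what the full create-or-update call looks like; the endpoint/key variable names, data source name, and api-version are assumptions rather than the notebook's exact values:

    # Sketch only: create or update the blob data source with the soft-delete column policy.
    import json
    import os
    import requests

    datasource_payload = {
        "name": "friends-datasource",  # illustrative name
        "type": "azureblob",
        "credentials": {"connectionString": os.environ["BLOB_CONNECTION_STRING"]},
        "container": {"name": "friends"},  # BLOB_CONTAINER_NAME in the notebook
        "dataDeletionDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
            "softDeleteColumnName": "isDeleted",   # metadata key the indexer watches
            "softDeleteMarkerValue": "yes"         # value that marks a document as deleted
        },
    }

    r = requests.put(
        os.environ["AZURE_SEARCH_ENDPOINT"] + "/datasources/friends-datasource",
        headers={"Content-Type": "application/json", "api-key": os.environ["AZURE_SEARCH_KEY"]},
        params={"api-version": "2024-07-01"},  # assumed GA api-version
        data=json.dumps(datasource_payload),
    )
    print(r.status_code)  # 201 on create, 204 on update, matching the cell output above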
@@ -254,7 +230,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "In Azure AI Search, a search index is your searchable content, available to the search engine for indexing, full text search, vector search, hybrid search, and filtered queries. An index is defined by a schema and saved to the search service, with data import following as a second step. This content exists within your search service, apart from your primary data stores, which is necessary for the millisecond response times expected in modern search applications. Except for indexer-driven indexing scenarios, the search service never connects to or queries your source data.\n",
+ "In Azure AI Search, a search index is your searchable content, available to the search engine for indexing, full text search, agentic search, vector search, hybrid search, and filtered queries. An index is defined by a schema and saved to the search service, with data import following as a second step. This content exists within your search service, apart from your primary data stores, which is necessary for the millisecond response times expected in modern search applications. Except for indexer-driven indexing scenarios, the search service never connects to or queries your source data.\n",
  "\n",
  "Reference:\n",
  "\n",
@@ -285,7 +261,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 9,
+ "execution_count": 12,
  "metadata": {
  "tags": []
  },
@@ -294,7 +270,7 @@
  "name": "stdout",
  "output_type": "stream",
  "text": [
- "201\n",
+ "204\n",
  "True\n"
  ]
  }
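Note: the next hunk adds an isDeleted field to the index schema, mirroring the softDeleteColumnName configured on the data source. For blob data sources that marker is read from blob metadata, so a document is soft-deleted by setting that metadata key on the blob rather than deleting the blob outright. A small sketch, with an illustrative container and blob name:

    # Sketch only: flag a blob as soft-deleted so the next indexer run removes its document.
    import os
    from azure.storage.blob import BlobClient

    blob = BlobClient.from_connection_string(
        os.environ["BLOB_CONNECTION_STRING"],
        container_name="friends",                  # illustrative
        blob_name="s01/e01/friends_s01_e01.txt",   # illustrative
    )
    # Note: set_blob_metadata replaces existing metadata, so merge first if other keys matter.
    blob.set_blob_metadata({"isDeleted": "yes"})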
@@ -389,6 +365,7 @@
  " {\"name\": \"title\", \"type\": \"Edm.String\", \"searchable\": \"true\", \"retrievable\": \"true\", \"facetable\": \"false\", \"filterable\": \"true\", \"sortable\": \"false\"},\n",
  " {\"name\": \"name\", \"type\": \"Edm.String\", \"searchable\": \"true\", \"retrievable\": \"true\", \"sortable\": \"false\", \"filterable\": \"false\", \"facetable\": \"false\"},\n",
  " {\"name\": \"location\", \"type\": \"Edm.String\", \"searchable\": \"true\", \"retrievable\": \"true\", \"sortable\": \"false\", \"filterable\": \"false\", \"facetable\": \"false\"}, \n",
+ " {\"name\": \"isDeleted\", \"type\": \"Edm.String\", \"searchable\": \"true\", \"retrievable\": \"true\", \"sortable\": \"false\", \"filterable\": \"false\", \"facetable\": \"false\"}, \n",
  " {\"name\": \"chunk\",\"type\": \"Edm.String\", \"searchable\": \"true\", \"retrievable\": \"true\", \"sortable\": \"false\", \"filterable\": \"false\", \"facetable\": \"false\"},\n",
  " {\n",
  " \"name\": \"chunkVector\",\n",
@@ -468,7 +445,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 11,
+ "execution_count": 13,
  "metadata": {
  "tags": []
  },
@@ -477,7 +454,7 @@
  "name": "stdout",
  "output_type": "stream",
  "text": [
- "201\n",
+ "204\n",
  "True\n"
  ]
  }
@@ -623,7 +600,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 12,
+ "execution_count": 14,
  "metadata": {
  "tags": []
  },
@@ -643,12 +620,12 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "The three components you have created thus far (data source, skillset, index) are inputs to an indexer. Creating the indexer on Azure Cognitive Search is the event that puts the entire pipeline into motion."
+ "The three components you have created thus far (data source, skillset, index) are inputs to an indexer. Creating the indexer on Azure AI Search is the event that puts the entire pipeline into motion."
  ]
  },
  {
  "cell_type": "code",
- "execution_count": 13,
+ "execution_count": 10,
  "metadata": {
  "tags": []
  },
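Note: as the paragraph above says, the indexer is what ties the data source, skillset, and index together and kicks off the pipeline. A minimal sketch of that create-or-update call; the object names, schedule, and api-version are illustrative assumptions, and the notebook's own cell defines the real payload:

    # Sketch only: create the indexer that connects data source, skillset, and index.
    import json
    import os
    import requests

    indexer_payload = {
        "name": "friends-indexer",                 # illustrative names throughout
        "dataSourceName": "friends-datasource",
        "skillsetName": "friends-skillset",
        "targetIndexName": "friends-index",
        "schedule": {"interval": "PT30M"},         # optional: re-run every 30 minutes
        "parameters": {
            "configuration": {"dataToExtract": "contentAndMetadata", "parsingMode": "default"}
        },
    }

    r = requests.put(
        os.environ["AZURE_SEARCH_ENDPOINT"] + "/indexers/friends-indexer",
        headers={"Content-Type": "application/json", "api-key": os.environ["AZURE_SEARCH_KEY"]},
        params={"api-version": "2024-07-01"},
        data=json.dumps(indexer_payload),
    )
    print(r.status_code)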
@@ -705,7 +682,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 14,
+ "execution_count": 11,
  "metadata": {
  "tags": []
  },
@@ -724,7 +701,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 17,
+ "execution_count": 13,
  "metadata": {
  "tags": []
  },
@@ -759,9 +736,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "**When the indexer finishes running we will have all 994 documents indexed in your Search Engine!.**\n",
- "\n",
- "**Note:** Noticed that it only index 1 document (the zip file) but the AI Search service did the work of uncompressing it and indexing each individual doc**"
+ "**When the indexer finishes running we will have all documents indexed in your Search Engine!.**"
  ]
  },
  {
@@ -793,9 +768,9 @@
  ],
  "metadata": {
  "kernelspec": {
- "display_name": "GPTSearch3 (Python 3.12)",
+ "display_name": "RAGAgents (Python 3.12)",
  "language": "python",
- "name": "gptsearch3"
+ "name": "ragagents"
  },
  "language_info": {
  "codemirror_mode": {
@@ -807,7 +782,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.12.8"
+ "version": "3.12.11"
  },
  "vscode": {
  "interpreter": {

02-LoadCSVOneToMany-ACogSearch.ipynb

Lines changed: 15 additions & 17 deletions
@@ -98,23 +98,21 @@
  "name": "stderr",
  "output_type": "stream",
  "text": [
- "Uploading Files: 100%|████████████████████████████████████████████████| 1/1 [00:06<00:00, 6.23s/it]"
+ "Uploading Files: 100%|████████████████████████████████████████████| 743/743 [06:04<00:00, 2.04it/s]\n"
  ]
  },
  {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Temp Folder: ./data/temp_extract removed\n",
- "CPU times: user 767 ms, sys: 305 ms, total: 1.07 s\n",
- "Wall time: 7.74 s\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "\n"
+ "ename": "OSError",
+ "evalue": "[Errno 39] Directory not empty: './data/temp_extract/s01/e11'",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
+ "\u001b[31mOSError\u001b[39m Traceback (most recent call last)",
+ "\u001b[36mFile \u001b[39m\u001b[32m<timed exec>:14\u001b[39m\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/anaconda/envs/RAGAgents/lib/python3.12/shutil.py:759\u001b[39m, in \u001b[36mrmtree\u001b[39m\u001b[34m(path, ignore_errors, onerror, onexc, dir_fd)\u001b[39m\n\u001b[32m 757\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m 758\u001b[39m \u001b[38;5;28;01mwhile\u001b[39;00m stack:\n\u001b[32m--> \u001b[39m\u001b[32m759\u001b[39m _rmtree_safe_fd(stack, onexc)\n\u001b[32m 760\u001b[39m \u001b[38;5;28;01mfinally\u001b[39;00m:\n\u001b[32m 761\u001b[39m \u001b[38;5;66;03m# Close any file descriptors still on the stack.\u001b[39;00m\n\u001b[32m 762\u001b[39m \u001b[38;5;28;01mwhile\u001b[39;00m stack:\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/anaconda/envs/RAGAgents/lib/python3.12/shutil.py:703\u001b[39m, in \u001b[36m_rmtree_safe_fd\u001b[39m\u001b[34m(stack, onexc)\u001b[39m\n\u001b[32m 701\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mOSError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n\u001b[32m 702\u001b[39m err.filename = path\n\u001b[32m--> \u001b[39m\u001b[32m703\u001b[39m onexc(func, path, err)\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/anaconda/envs/RAGAgents/lib/python3.12/shutil.py:662\u001b[39m, in \u001b[36m_rmtree_safe_fd\u001b[39m\u001b[34m(stack, onexc)\u001b[39m\n\u001b[32m 660\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m\n\u001b[32m 661\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m func \u001b[38;5;129;01mis\u001b[39;00m os.rmdir:\n\u001b[32m--> \u001b[39m\u001b[32m662\u001b[39m os.rmdir(name, dir_fd=dirfd)\n\u001b[32m 663\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m\n\u001b[32m 665\u001b[39m \u001b[38;5;66;03m# Note: To guard against symlink races, we use the standard\u001b[39;00m\n\u001b[32m 666\u001b[39m \u001b[38;5;66;03m# lstat()/open()/fstat() trick.\u001b[39;00m\n",
+ "\u001b[31mOSError\u001b[39m: [Errno 39] Directory not empty: './data/temp_extract/s01/e11'"
  ]
  }
  ],
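Note: the new error output above comes from shutil.rmtree failing with "Directory not empty" while cleaning up ./data/temp_extract, which usually means something was still writing to or scanning the tree at removal time. A hedged sketch of a more forgiving cleanup, should the notebook want to retry instead of failing the cell:

    # Sketch only: retry the temp-folder removal, falling back to ignore_errors.
    import shutil
    import time

    def remove_tree(path: str, retries: int = 3, delay: float = 1.0) -> None:
        for attempt in range(retries):
            try:
                shutil.rmtree(path)
                return
            except OSError:
                if attempt == retries - 1:
                    shutil.rmtree(path, ignore_errors=True)  # last resort: ignore leftovers
                    return
                time.sleep(delay)  # give the filesystem a moment, then retry

    remove_tree("./data/temp_extract")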
@@ -698,9 +696,9 @@
  ],
  "metadata": {
  "kernelspec": {
- "display_name": "GPTSearch3 (Python 3.12)",
+ "display_name": "RAGAgents (Python 3.12)",
  "language": "python",
- "name": "gptsearch3"
+ "name": "ragagents"
  },
  "language_info": {
  "codemirror_mode": {
@@ -712,7 +710,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.12.8"
+ "version": "3.12.11"
  }
  },
  "nbformat": 4,
