
Commit 5c53f51

Merge pull request #81 from pablomarin/main
Merge from source
2 parents 5eca1b0 + ddacbc1 · commit 5c53f51

19 files changed: +23101 -1097 lines changed

01-Load-Data-ACogSearch.ipynb

Lines changed: 31 additions & 56 deletions
@@ -19,14 +19,13 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "# Load and Enrich multiple file types Azure AI Search\n",
+ "# Load and Enrich multiple file types with Azure AI Search\n",
  "\n",
- "In this Jupyter Notebook, we create and run enrichment steps to unlock searchable content in the specified Azure blob. It performs operations over mixed content in Azure Storage, such as images and application files, using a skillset that analyzes and extracts text information that becomes searchable in Azure Cognitive Search. \n",
- "The reference sample can be found at [Tutorial: Use Python and AI to generate searchable content from Azure blobs](https://docs.microsoft.com/azure/search/cognitive-search-tutorial-blob-python).\n",
+ "In this Jupyter Notebook, we create and run enrichment steps to unlock searchable content in the specified Azure blob. It performs operations over mixed content in Azure Storage, such as images and application files, using a skillset that analyzes and extracts text information that becomes searchable in Azure AI Search. \n",
  "\n",
  "In this demo we are going to be using a private (so we can mimic a private data lake scenario) Blob Storage container that has all the dialogues of each episode of the TV Series show: FRIENDS. (3.1k text files).\n",
  "\n",
- "Although only TXT files are used here, this can be done at a much larger scale and Azure Cognitive Search supports a range of other file formats including: PDF, Microsoft Office (DOCX/DOC, XSLX/XLS, PPTX/PPT, MSG), HTML, XML, ZIP, and plain text files (including JSON).\n",
+ "Although only TXT files are used here, this can be done at a much larger scale and Azure AI Search supports a range of other file formats including: PDF, Microsoft Office (DOCX/DOC, XSLX/XLS, PPTX/PPT, MSG), HTML, XML, ZIP, and plain text files (including JSON).\n",
  "Azure Search support the following sources: [Data Sources Gallery](https://learn.microsoft.com/EN-US/AZURE/search/search-data-sources-gallery)\n",
  "\n",
  "This notebook creates the following objects on your search service:\n",
@@ -52,7 +51,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 1,
+ "execution_count": 6,
  "metadata": {
  "tags": []
  },
@@ -70,7 +69,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 2,
+ "execution_count": 7,
  "metadata": {
  "tags": []
  },
@@ -85,7 +84,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 3,
+ "execution_count": 8,
  "metadata": {
  "tags": []
  },
@@ -105,7 +104,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 4,
+ "execution_count": 9,
  "metadata": {
  "tags": []
  },
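Note: the next hunk clears the recorded outputs of the %%time cell that extracts ./data/friends_transcripts.zip and uploads the ~3.1k transcript files to the Blob container. For context, a minimal sketch of that extract-and-upload pattern, assuming the azure-storage-blob SDK and tqdm; the container name and variable names here are illustrative, not necessarily the notebook's own:

    # Sketch only: unzip locally, then upload every file to Azure Blob Storage.
    import os
    import zipfile
    from azure.storage.blob import ContainerClient
    from tqdm import tqdm

    ZIP_PATH = "./data/friends_transcripts.zip"
    EXTRACT_DIR = "./data/temp_extract"
    CONTAINER_NAME = "friends"  # illustrative; the notebook defines BLOB_CONTAINER_NAME elsewhere

    with zipfile.ZipFile(ZIP_PATH) as zf:
        zf.extractall(EXTRACT_DIR)

    container = ContainerClient.from_connection_string(
        os.environ["BLOB_CONNECTION_STRING"], container_name=CONTAINER_NAME
    )

    files = [os.path.join(root, f) for root, _, names in os.walk(EXTRACT_DIR) for f in names]
    for path in tqdm(files, desc="Uploading Files"):
        blob_name = os.path.relpath(path, EXTRACT_DIR)  # keep the folder structure as the blob prefix
        with open(path, "rb") as data:
            container.upload_blob(name=blob_name, data=data, overwrite=True)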
@@ -127,36 +126,11 @@
  },
  {
  "cell_type": "code",
- "execution_count": 5,
+ "execution_count": null,
  "metadata": {
  "tags": []
  },
- "outputs": [
-  {
-   "name": "stdout",
-   "output_type": "stream",
-   "text": [
-    "Extracting ./data/friends_transcripts.zip ... \n",
-    "Extracted ./data/friends_transcripts.zip to ./data/temp_extract\n"
-   ]
-  },
-  {
-   "name": "stderr",
-   "output_type": "stream",
-   "text": [
-    "Uploading Files: 100%|██████████████████████████████████████████| 3107/3107 [08:53<00:00, 5.83it/s]\n"
-   ]
-  },
-  {
-   "name": "stdout",
-   "output_type": "stream",
-   "text": [
-    "Temp Folder: ./data/temp_extract removed\n",
-    "CPU times: user 31.6 s, sys: 5.19 s, total: 36.8 s\n",
-    "Wall time: 11min\n"
-   ]
-  }
- ],
+ "outputs": [],
  "source": [
  "%%time\n",
  "\n",
@@ -175,12 +149,12 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## Create Data Source (Blob container with the Arxiv CS pdfs)"
+ "## Create Data Source (Blob container with the Friends txt files)"
  ]
  },
  {
  "cell_type": "code",
- "execution_count": 6,
+ "execution_count": 10,
  "metadata": {
  "tags": []
  },
@@ -189,7 +163,7 @@
  "name": "stdout",
  "output_type": "stream",
  "text": [
- "201\n",
+ "204\n",
  "True\n"
  ]
  }
@@ -205,7 +179,9 @@
  " \"connectionString\": os.environ['BLOB_CONNECTION_STRING']\n",
  " },\n",
  " \"dataDeletionDetectionPolicy\" : {\n",
- " \"@odata.type\" :\"#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy\" # this makes sure that if the item is deleted from the source, it will be deleted from the index\n",
+ " \"@odata.type\" :\"#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy\", # this makes sure that if the item is deleted from the source, it will be marked deleted in the index\n",
+ " \"softDeleteColumnName\": \"isDeleted\",\n",
+ " \"softDeleteMarkerValue\": \"yes\"\n",
  " },\n",
  " \"container\": {\n",
  " \"name\": BLOB_CONTAINER_NAME\n",
@@ -233,7 +209,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 8,
+ "execution_count": 11,
  "metadata": {
  "tags": []
  },
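Note: the data source hunk above replaces the native blob soft-delete policy with a SoftDeleteColumnDeletionDetectionPolicy keyed on an isDeleted marker with value "yes". A hedged sketch of what the full create-or-update call looks like; the endpoint/key variable names, data source name, and api-version are assumptions rather than the notebook's exact values:

    # Sketch only: create or update the blob data source with the soft-delete column policy.
    import json
    import os
    import requests

    datasource_payload = {
        "name": "friends-datasource",  # illustrative name
        "type": "azureblob",
        "credentials": {"connectionString": os.environ["BLOB_CONNECTION_STRING"]},
        "container": {"name": "friends"},  # BLOB_CONTAINER_NAME in the notebook
        "dataDeletionDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
            "softDeleteColumnName": "isDeleted",   # metadata key the indexer watches
            "softDeleteMarkerValue": "yes"         # value that marks a document as deleted
        },
    }

    r = requests.put(
        os.environ["AZURE_SEARCH_ENDPOINT"] + "/datasources/friends-datasource",
        headers={"Content-Type": "application/json", "api-key": os.environ["AZURE_SEARCH_KEY"]},
        params={"api-version": "2024-07-01"},  # assumed GA api-version
        data=json.dumps(datasource_payload),
    )
    print(r.status_code)  # 201 on create, 204 on update, matching the cell output above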
@@ -254,7 +230,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "In Azure AI Search, a search index is your searchable content, available to the search engine for indexing, full text search, vector search, hybrid search, and filtered queries. An index is defined by a schema and saved to the search service, with data import following as a second step. This content exists within your search service, apart from your primary data stores, which is necessary for the millisecond response times expected in modern search applications. Except for indexer-driven indexing scenarios, the search service never connects to or queries your source data.\n",
+ "In Azure AI Search, a search index is your searchable content, available to the search engine for indexing, full text search, agentic search, vector search, hybrid search, and filtered queries. An index is defined by a schema and saved to the search service, with data import following as a second step. This content exists within your search service, apart from your primary data stores, which is necessary for the millisecond response times expected in modern search applications. Except for indexer-driven indexing scenarios, the search service never connects to or queries your source data.\n",
  "\n",
  "Reference:\n",
  "\n",
@@ -285,7 +261,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 9,
+ "execution_count": 12,
  "metadata": {
  "tags": []
  },
@@ -294,7 +270,7 @@
  "name": "stdout",
  "output_type": "stream",
  "text": [
- "201\n",
+ "204\n",
  "True\n"
  ]
  }
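Note: the next hunk adds an isDeleted field to the index schema, mirroring the softDeleteColumnName configured on the data source. For blob data sources that marker is read from blob metadata, so a document is soft-deleted by setting that metadata key on the blob rather than deleting the blob outright. A small sketch, with an illustrative container and blob name:

    # Sketch only: flag a blob as soft-deleted so the next indexer run removes its document.
    import os
    from azure.storage.blob import BlobClient

    blob = BlobClient.from_connection_string(
        os.environ["BLOB_CONNECTION_STRING"],
        container_name="friends",                  # illustrative
        blob_name="s01/e01/friends_s01_e01.txt",   # illustrative
    )
    # Note: set_blob_metadata replaces existing metadata, so merge first if other keys matter.
    blob.set_blob_metadata({"isDeleted": "yes"})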
@@ -389,6 +365,7 @@
  " {\"name\": \"title\", \"type\": \"Edm.String\", \"searchable\": \"true\", \"retrievable\": \"true\", \"facetable\": \"false\", \"filterable\": \"true\", \"sortable\": \"false\"},\n",
  " {\"name\": \"name\", \"type\": \"Edm.String\", \"searchable\": \"true\", \"retrievable\": \"true\", \"sortable\": \"false\", \"filterable\": \"false\", \"facetable\": \"false\"},\n",
  " {\"name\": \"location\", \"type\": \"Edm.String\", \"searchable\": \"true\", \"retrievable\": \"true\", \"sortable\": \"false\", \"filterable\": \"false\", \"facetable\": \"false\"}, \n",
+ " {\"name\": \"isDeleted\", \"type\": \"Edm.String\", \"searchable\": \"true\", \"retrievable\": \"true\", \"sortable\": \"false\", \"filterable\": \"false\", \"facetable\": \"false\"}, \n",
  " {\"name\": \"chunk\",\"type\": \"Edm.String\", \"searchable\": \"true\", \"retrievable\": \"true\", \"sortable\": \"false\", \"filterable\": \"false\", \"facetable\": \"false\"},\n",
  " {\n",
  " \"name\": \"chunkVector\",\n",
@@ -468,7 +445,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 11,
+ "execution_count": 13,
  "metadata": {
  "tags": []
  },
@@ -477,7 +454,7 @@
  "name": "stdout",
  "output_type": "stream",
  "text": [
- "201\n",
+ "204\n",
  "True\n"
  ]
  }
@@ -623,7 +600,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 12,
+ "execution_count": 14,
  "metadata": {
  "tags": []
  },
@@ -643,12 +620,12 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "The three components you have created thus far (data source, skillset, index) are inputs to an indexer. Creating the indexer on Azure Cognitive Search is the event that puts the entire pipeline into motion."
+ "The three components you have created thus far (data source, skillset, index) are inputs to an indexer. Creating the indexer on Azure AI Search is the event that puts the entire pipeline into motion."
  ]
  },
  {
  "cell_type": "code",
- "execution_count": 13,
+ "execution_count": 10,
  "metadata": {
  "tags": []
  },
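Note: as the paragraph above says, the indexer is what ties the data source, skillset, and index together and kicks off the pipeline. A minimal sketch of that create-or-update call; the object names, schedule, and api-version are illustrative assumptions, and the notebook's own cell defines the real payload:

    # Sketch only: create the indexer that connects data source, skillset, and index.
    import json
    import os
    import requests

    indexer_payload = {
        "name": "friends-indexer",                 # illustrative names throughout
        "dataSourceName": "friends-datasource",
        "skillsetName": "friends-skillset",
        "targetIndexName": "friends-index",
        "schedule": {"interval": "PT30M"},         # optional: re-run every 30 minutes
        "parameters": {
            "configuration": {"dataToExtract": "contentAndMetadata", "parsingMode": "default"}
        },
    }

    r = requests.put(
        os.environ["AZURE_SEARCH_ENDPOINT"] + "/indexers/friends-indexer",
        headers={"Content-Type": "application/json", "api-key": os.environ["AZURE_SEARCH_KEY"]},
        params={"api-version": "2024-07-01"},
        data=json.dumps(indexer_payload),
    )
    print(r.status_code)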
@@ -705,7 +682,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 14,
+ "execution_count": 11,
  "metadata": {
  "tags": []
  },
@@ -724,7 +701,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 17,
+ "execution_count": 13,
  "metadata": {
  "tags": []
  },
@@ -759,9 +736,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "**When the indexer finishes running we will have all 994 documents indexed in your Search Engine!.**\n",
- "\n",
- "**Note:** Noticed that it only index 1 document (the zip file) but the AI Search service did the work of uncompressing it and indexing each individual doc**"
+ "**When the indexer finishes running we will have all documents indexed in your Search Engine!.**"
  ]
  },
  {
@@ -793,9 +768,9 @@
  ],
  "metadata": {
  "kernelspec": {
- "display_name": "GPTSearch3 (Python 3.12)",
+ "display_name": "RAGAgents (Python 3.12)",
  "language": "python",
- "name": "gptsearch3"
+ "name": "ragagents"
  },
  "language_info": {
  "codemirror_mode": {
@@ -807,7 +782,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.12.8"
+ "version": "3.12.11"
  },
  "vscode": {
  "interpreter": {

02-LoadCSVOneToMany-ACogSearch.ipynb

Lines changed: 15 additions & 17 deletions
@@ -98,23 +98,21 @@
  "name": "stderr",
  "output_type": "stream",
  "text": [
- "Uploading Files: 100%|████████████████████████████████████████████████| 1/1 [00:06<00:00, 6.23s/it]"
+ "Uploading Files: 100%|████████████████████████████████████████████| 743/743 [06:04<00:00, 2.04it/s]\n"
  ]
  },
  {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Temp Folder: ./data/temp_extract removed\n",
- "CPU times: user 767 ms, sys: 305 ms, total: 1.07 s\n",
- "Wall time: 7.74 s\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "\n"
+ "ename": "OSError",
+ "evalue": "[Errno 39] Directory not empty: './data/temp_extract/s01/e11'",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
+ "\u001b[31mOSError\u001b[39m Traceback (most recent call last)",
+ "\u001b[36mFile \u001b[39m\u001b[32m<timed exec>:14\u001b[39m\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/anaconda/envs/RAGAgents/lib/python3.12/shutil.py:759\u001b[39m, in \u001b[36mrmtree\u001b[39m\u001b[34m(path, ignore_errors, onerror, onexc, dir_fd)\u001b[39m\n\u001b[32m 757\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m 758\u001b[39m \u001b[38;5;28;01mwhile\u001b[39;00m stack:\n\u001b[32m--> \u001b[39m\u001b[32m759\u001b[39m _rmtree_safe_fd(stack, onexc)\n\u001b[32m 760\u001b[39m \u001b[38;5;28;01mfinally\u001b[39;00m:\n\u001b[32m 761\u001b[39m \u001b[38;5;66;03m# Close any file descriptors still on the stack.\u001b[39;00m\n\u001b[32m 762\u001b[39m \u001b[38;5;28;01mwhile\u001b[39;00m stack:\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/anaconda/envs/RAGAgents/lib/python3.12/shutil.py:703\u001b[39m, in \u001b[36m_rmtree_safe_fd\u001b[39m\u001b[34m(stack, onexc)\u001b[39m\n\u001b[32m 701\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mOSError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n\u001b[32m 702\u001b[39m err.filename = path\n\u001b[32m--> \u001b[39m\u001b[32m703\u001b[39m onexc(func, path, err)\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/anaconda/envs/RAGAgents/lib/python3.12/shutil.py:662\u001b[39m, in \u001b[36m_rmtree_safe_fd\u001b[39m\u001b[34m(stack, onexc)\u001b[39m\n\u001b[32m 660\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m\n\u001b[32m 661\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m func \u001b[38;5;129;01mis\u001b[39;00m os.rmdir:\n\u001b[32m--> \u001b[39m\u001b[32m662\u001b[39m os.rmdir(name, dir_fd=dirfd)\n\u001b[32m 663\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m\n\u001b[32m 665\u001b[39m \u001b[38;5;66;03m# Note: To guard against symlink races, we use the standard\u001b[39;00m\n\u001b[32m 666\u001b[39m \u001b[38;5;66;03m# lstat()/open()/fstat() trick.\u001b[39;00m\n",
+ "\u001b[31mOSError\u001b[39m: [Errno 39] Directory not empty: './data/temp_extract/s01/e11'"
  ]
  }
  ],
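Note: the new error output above comes from shutil.rmtree failing with "Directory not empty" while cleaning up ./data/temp_extract, which usually means something was still writing to or scanning the tree at removal time. A hedged sketch of a more forgiving cleanup, should the notebook want to retry instead of failing the cell:

    # Sketch only: retry the temp-folder removal, falling back to ignore_errors.
    import shutil
    import time

    def remove_tree(path: str, retries: int = 3, delay: float = 1.0) -> None:
        for attempt in range(retries):
            try:
                shutil.rmtree(path)
                return
            except OSError:
                if attempt == retries - 1:
                    shutil.rmtree(path, ignore_errors=True)  # last resort: ignore leftovers
                    return
                time.sleep(delay)  # give the filesystem a moment, then retry

    remove_tree("./data/temp_extract")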
@@ -698,9 +696,9 @@
  ],
  "metadata": {
  "kernelspec": {
- "display_name": "GPTSearch3 (Python 3.12)",
+ "display_name": "RAGAgents (Python 3.12)",
  "language": "python",
- "name": "gptsearch3"
+ "name": "ragagents"
  },
  "language_info": {
  "codemirror_mode": {
@@ -712,7 +710,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.12.8"
+ "version": "3.12.11"
  }
  },
  "nbformat": 4,
