adi_function_app/README.md
Lines changed: 29 additions & 13 deletions
@@ -24,13 +24,21 @@ Once the Markdown is obtained, several steps are carried out:
24
24
25
25
1. **Extraction of images / charts**. The figures identified are extracted from the original document and passed to a multi-modal model (GPT-4o in this case) for analysis. We obtain a description and summary of the chart / image to infer the meaning of the figure. This allows us to index and perform RAG analysis on information that is only visible in a chart and not explicitly mentioned in the surrounding text. The information is added back into the original chart.
26
26
27
-
2. **Cleaning of Markdown**. The final markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk e.g. non-relevant images.
27
+
2. **Chunking**. The obtained content is chunked according to the chosen chunking strategy. This function app supports two chunking methods, **page wise** and **semantic chunking**. Page wise chunking is performed natively by Azure Document Intelligence. For semantic chunking, we include a custom chunker that splits the text with the following strategy:
28
28
29
-
Page wise analysis in ADI is used to avoid splitting tables / figures across multiple chunks, when the chunking is performed.
29
+
- Splits text into sentences.
30
+
- Groups sentences if they are table or figure related to avoid splitting them in context.
31
+
- Semantically groups sentences if the similarity is above the threshold, starting from the start of the text.
32
+
- Semantically groups sentences if the similarity is above the threshold, starting from the end of the text.
33
+
- Removes empty chunks.
30
34
31
-
The properties returned from the ADI Custom Skill are then used to perform the following skills:
35
+
This chunking method aims to improve on page wise chunking whilst still keeping similar sentences together. When tested, it performed better than plain page wise chunking without splitting up related context.
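A minimal sketch of the sentence splitting, forward grouping and empty-chunk removal steps listed above. It is not the function app's implementation: the bag-of-words `cosine_similarity` is a stand-in for whatever similarity model the chunker uses, the function names are hypothetical, and the table / figure grouping and the backward pass are left out for brevity.

```python
import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (assumed stand-in for an embedding model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def group_forward(sentences: list[str], similarity_threshold: float) -> list[str]:
    """Walk from the start, merging a sentence into the current chunk
    when its similarity to that chunk is above the threshold."""
    chunks: list[str] = []
    for sentence in sentences:
        if chunks and cosine_similarity(chunks[-1], sentence) > similarity_threshold:
            chunks[-1] = f"{chunks[-1]} {sentence}"
        else:
            chunks.append(sentence)
    # Drop any chunk that ended up empty.
    return [c for c in chunks if c.strip()]


sentences = split_sentences("Cats purr loudly. Cats purr softly. Stocks fell sharply.")
chunks = group_forward(sentences, similarity_threshold=0.5)
```

Here the two sentences about cats share most of their words and are merged, while the unrelated third sentence starts a new chunk.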
32
36
33
-
- Pre-vectorisation cleaning. This stage is important as we extract the section information in this step from the headers in the document. Additionally, we remove any Markdown tags or characters that would cause an embedding error.
37
+
3. **Cleaning of Markdown**. The final markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk e.g. non-relevant images.
38
+
39
+
The properties returned from the ADI Custom Skill and Chunking are then used to perform the following skills:
40
+
41
+
- Markup cleaning. This stage is important as we extract the section information in this step from the headers in the document. Additionally, we remove any Markdown tags or characters that would cause an embedding error.
34
42
- Keyphrase extraction
35
43
- Vectorisation
36
44
@@ -49,18 +57,24 @@ The Figure 4 content has been interpreted and added into the extracted chunk to
49
57
50
58
## Provided Notebooks & Utilities
51
59
52
-
- `./ai_search_with_adi_function_app` provides a pre-built Python function app that communicates with Azure Document Intelligence, Azure OpenAI etc. to perform the Markdown conversion, extraction of figures, figure understanding and corresponding cleaning of Markdown.
60
+
- `./function_app` provides a pre-built Python function app that communicates with Azure Document Intelligence, Azure OpenAI etc. to perform the Markdown conversion, extraction of figures, figure understanding and corresponding cleaning of Markdown.
53
61
- `./rag_with_ai_search.ipynb` provides an example of how to utilise the AI Search plugin to query the index.
54
62
55
63
## Deploying AI Search Setup
56
64
57
65
To deploy the pre-built index and associated indexer / skillset setup, see instructions in `./deploy_ai_search/README.md`.
58
66
59
-
## ADI Custom Skill
67
+
## Custom Skills
68
+
69
+
Deploy the associated function app and the required resources. To use with an index, either use the utility to configure an indexer in the provided form, or integrate the skill with your skillset pipeline.
70
+
71
+
### ADI Custom Skill
60
72
61
-
Deploy the associated function app and required resources. You can then experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_ai_search` HTTP endpoint.
73
+
You can then experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_ai_search` HTTP endpoint. The `chunk_by_page` header controls the chunking technique *(page wise or not)*.
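As a rough illustration, a request to this endpoint could be built as below. This is a sketch under assumptions: the base URL is the usual local Azure Functions address, the `source` field inside `data` is a hypothetical input name, and `build_adi_request` is an illustrative helper rather than part of the function app. Only the `values` / `recordId` / `data` envelope follows the standard AI Search custom skill format. The request is built but not sent.

```python
import json
import urllib.request


def build_adi_request(
    base_url: str, source_url: str, chunk_by_page: bool
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the ADI custom skill."""
    payload = {
        "values": [
            {
                "recordId": "0",
                # Hypothetical input field; check the skill's expected inputs.
                "data": {"source": source_url},
            }
        ]
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/adi_2_ai_search",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Header that switches between page wise and non page wise chunking.
            "chunk_by_page": str(chunk_by_page),
        },
        method="POST",
    )


# Assumed local development address and an example document URL.
request = build_adi_request(
    "http://localhost:7071/api", "https://example.com/doc.pdf", chunk_by_page=True
)
```

Passing `request` to `urllib.request.urlopen` would send it to a running function app.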
62
74
63
-
To use with an index, either use the utility to configure an indexer in the provided form, or integrate the skill with your skillset pipeline.
75
+
### Semantic Chunker Skill
76
+
77
+
You can then test the chunking by sending a request in the AI Search JSON format to the `/semantic_text_chunker/` HTTP endpoint. The headers control the chunking parameters *(num_surrounding_sentences, similarity_threshold, max_chunk_tokens, min_chunk_tokens)*.
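One way to keep these four parameters together on the client side is a small container that turns them into header values. This is only a sketch: `ChunkerParams` is a hypothetical helper, the default values are illustrative and not the skill's real defaults, and the min / max check is an assumed sanity rule.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkerParams:
    """Illustrative values only; the skill's real defaults are not shown here."""

    num_surrounding_sentences: int = 1
    similarity_threshold: float = 0.8
    max_chunk_tokens: int = 200
    min_chunk_tokens: int = 50

    def __post_init__(self) -> None:
        # Assumed sanity check: the minimum size cannot exceed the maximum.
        if self.min_chunk_tokens > self.max_chunk_tokens:
            raise ValueError("min_chunk_tokens must not exceed max_chunk_tokens")

    def to_headers(self) -> dict[str, str]:
        """HTTP header values are strings, so stringify each parameter."""
        return {name: str(value) for name, value in asdict(self).items()}


headers = ChunkerParams(similarity_threshold=0.7).to_headers()
```

The resulting dictionary can be merged into the headers of the HTTP request sent to the chunker endpoint.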
64
78
65
79
### Deployment Steps
66
80
@@ -72,11 +86,15 @@ To use with an index, either use the utility to configure a indexer in the provi
72
86
73
87
#### function_app.py
74
88
75
-
`./indexer/ai_search_with_adi_function_app.py` contains the HTTP entrypoints for the ADI skill and the other provided utility skills.
89
+
`./indexer/function_app.py` contains the HTTP entrypoints for the ADI skill and the other provided utility skills.
90
+
91
+
#### semantic_text_chunker.py
76
92
77
-
#### adi_2_aisearch
93
+
`./semantic_text_chunker.py` contains the code to chunk the text semantically, whilst grouping similar sentences.
78
94
79
-
`./indexer/adi_2_aisearch.py` contains the methods for content extraction with ADI. The key methods are:
95
+
#### adi_2_ai_search.py
96
+
97
+
`./indexer/adi_2_ai_search.py` contains the methods for content extraction with ADI. The key methods are:
80
98
81
99
##### analyse_document
82
100
@@ -183,8 +201,6 @@ If `chunk_by_page` header is `False`:
183
201
}
184
202
```
185
203
186
-
**Page wise analysis in ADI is recommended to avoid splitting tables / figures across multiple chunks, when the chunking is performed.**
187
-
188
204
## Other Provided Custom Skills
189
205
190
206
Because of an AI Search product limitation (AI Search cannot connect to AI Services behind Private Endpoints), we provide a Custom Key Phrase Extraction Skill that will work within a Private Endpoint environment.