
Commit 721851f

Improve the READMEs and add documentation for ADI Skill (#7)
* Progress on readmes
* Add comprehensive documentation for ADI
* Fix json
1 parent 02bc2d9 commit 721851f

File tree

5 files changed: +216 -3 lines changed

README.md

Lines changed: 12 additions & 2 deletions

@@ -1,12 +1,22 @@
# Text2SQL and Image Processing in AI Search

This repo provides sample code for improving RAG applications with rich data sources, including SQL Warehouses and documents analysed with Azure Document Intelligence.

The plugins and skills provided in this repository are intended to be adapted and added to your new or existing RAG application to improve response quality.

## Components

- `./text2sql` contains a Multi-Shot implementation for Text2SQL generation and querying, which can be used to answer questions backed by a database as a knowledge base.
- `./ai_search_with_adi` contains code for linking Azure Document Intelligence with AI Search to process complex documents with charts and images, and uses multi-modal models (gpt4o) to interpret and understand these.
The above components have been successfully used on production RAG projects to increase the quality of responses. The code provided in this repo is a sample of the implementation and should be adjusted before being used in production.

## High Level Implementation

The following diagram shows a workflow for how the Text2SQL and AI Search plugins would be incorporated into a RAG application. Using the plugins available, alongside the Function Calling capabilities of LLMs, the LLM can do Chain of Thought reasoning to determine the steps needed to answer the question. This allows the LLM to recognise intent and therefore pick appropriate data sources based on the intent of the question, or a combination of both.

![High level workflow for a plugin driven RAG application](./images/Plugin%20Based%20RAG%20Flow.png "High Level Workflow")

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a

ai_search_with_adi/README.md

Lines changed: 196 additions & 0 deletions

@@ -0,0 +1,196 @@
# AI Search Indexing with Azure Document Intelligence

This portion of the repo contains code for linking Azure Document Intelligence with AI Search to process complex documents with charts and images, and uses multi-modal models (gpt4o) to interpret and understand these.

The implementation is in Python, although it can easily be adapted for C# or another language. The code is designed to run in an Azure Function App inside the tenant.

**This approach makes use of Azure Document Intelligence v4.0, which is still in preview.**

## High Level Workflow

A common way to perform document indexing is to either extract the text content or use [optical character recognition](https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-ocr) to gather the text content before indexing. Whilst this works well for simple files that contain mainly text-based information, the response quality diminishes significantly when the documents contain mainly charts and images, such as a PowerPoint presentation.
To solve this issue and to ensure that good quality information is extracted from the document, an indexer using [Azure Document Intelligence (ADI)](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview?view=doc-intel-4.0.0) is developed with [Custom Skills](https://learn.microsoft.com/en-us/azure/search/cognitive-search-custom-skill-web-api):

![High level workflow for indexing with Azure Document Intelligence based skills](./images/Indexing%20vs%20Indexing%20with%20ADI.png "Indexing with Azure Document Intelligence Approach")

Instead of using OCR to extract the contents of the document, ADI v4.0 is used to analyse the layout of the document and convert it to a Markdown format. The Markdown format brings benefits such as:

- Table layout
- Section and header extraction with Markdown headings
- Figure and image extraction

Once the Markdown is obtained, several steps are carried out:

1. **Extraction of images / charts.** The figures identified are extracted from the original document and passed to a multi-modal model (gpt4o in this case) for analysis. We obtain a description and summary of the chart / image to infer the meaning of the figure. This allows us to index, and perform RAG analysis on, the information that is visually obtainable from a chart, without it being explicitly mentioned in the surrounding text. The extracted information is added back into the original Markdown content.

2. **Extraction of sections and headers.** The sections and headers are extracted from the document and additionally returned to the indexer under a separate field. This allows us to store them as a separate field in the index and therefore surface the most relevant chunks.

3. **Cleaning of Markdown.** The final Markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk, e.g. non-relevant images.

Page wise analysis in ADI is used to avoid splitting tables / figures across multiple chunks when the chunking is performed.
The properties returned from the ADI Custom Skill are then used to perform the following skills:

- Pre-vectorisation cleaning
- Keyphrase extraction
- Vectorisation
## Provided Notebooks & Utilities

- `./ai_search.py`, `./deployment.py` provide an easy Python based utility for deploying an index, indexer and corresponding skillset for AI Search.
- `./function_apps/indexer` provides a pre-built Python function app that communicates with Azure Document Intelligence, Azure OpenAI etc. to perform the Markdown conversion, extraction of figures, figure understanding and corresponding cleaning of Markdown.
- `./rag_with_ai_search.ipynb` provides an example of how to utilise the AI Search plugin to query the index.
## ADI Custom Skill

Deploy the associated function app and required resources. You can then experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_ai_search` HTTP endpoint.

To use with an index, either use the utility to configure an indexer in the provided form, or integrate the skill with your skillset pipeline.
### function_app.py

`./function_apps/indexer/function_app.py` contains the HTTP entrypoints for the ADI skill and the other provided utility skills.

### adi_2_aisearch

`./function_apps/indexer/adi_2_aisearch.py` contains the methods for content extraction with ADI. The key methods are:

#### analyse_document

This method takes the passed file, uploads it to ADI and retrieves the Markdown format.

#### process_figures_from_extracted_content

This method takes the detected figures and crops them out of the page to save them as images. It uses `understand_image_with_vlm` to communicate with Azure OpenAI to understand the meaning of the extracted figure.

`update_figure_description` is used to update the original Markdown content with the description and meaning of the figure.
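As a rough, hypothetical sketch of what this splicing step can look like (the placeholder format and function signature below are assumptions for illustration, not the repo's actual code):

```python
def update_figure_description(markdown: str, figure_id: str, description: str) -> str:
    """Illustrative sketch only: splice a VLM-generated description into the
    Markdown at a per-figure placeholder. Assumes figures are marked with an
    HTML comment like `<!-- figure: fig1 -->`; the repo's real marker format
    and signature may differ."""
    placeholder = f"<!-- figure: {figure_id} -->"
    replacement = f"{placeholder}\n> Figure description: {description}"
    return markdown.replace(placeholder, replacement)

page = "## Results\n\n<!-- figure: fig1 -->\n\nMore text."
updated = update_figure_description(page, "fig1", "Bar chart showing Q3 revenue by region.")
```

The figure description ends up inline next to the figure's position, so later chunking keeps the visual information with its surrounding text.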
#### clean_adi_markdown

This method performs the final cleaning of the Markdown contents. In this method, the section headings and page numbers are extracted for the content to be returned to the indexer.
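As a simplified stand-in for the heading-extraction part of this step (not the repo's actual implementation), ATX-style Markdown headings can be collected with a regular expression:

```python
import re

def extract_section_headings(markdown: str) -> list[str]:
    """Return the text of all Markdown ATX headings ('# Title' .. '###### Title')."""
    return [
        m.group(2).strip()
        for m in re.finditer(r"^(#{1,6})\s+(.+)$", markdown, re.MULTILINE)
    ]

md = "# Report\nIntro text.\n## Findings\n- item\n### Detail\n"
sections = extract_section_headings(md)
# sections == ["Report", "Findings", "Detail"]
```

The resulting list is what would be returned to the indexer as the separate `sections` field.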
### Input Format

The ADI Skill conforms to the [Azure AI Search Custom Skill Input Format](https://learn.microsoft.com/en-gb/azure/search/cognitive-search-custom-skill-web-api?WT.mc_id=Portal-Microsoft_Azure_Search#sample-input-json-structure). AI Search will automatically build this format if you use the utility file provided in this repo to build your indexer and skillset.

```json
{
    "values": [
        {
            "recordId": "0",
            "data": {
                "source": "<FULL URI TO BLOB>"
            }
        },
        {
            "recordId": "1",
            "data": {
                "source": "<FULL URI TO BLOB>"
            }
        }
    ]
}
```
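For experimentation, a payload in this format can be built and sent with a few lines of Python. This is only a sketch: the full endpoint URL (including any `/api` route prefix) and the authentication headers depend on how your Function App is deployed.

```python
import json

def build_skill_payload(blob_uris: list[str]) -> dict:
    """Build the AI Search custom skill input payload from a list of blob URIs."""
    return {
        "values": [
            {"recordId": str(i), "data": {"source": uri}}
            for i, uri in enumerate(blob_uris)
        ]
    }

payload = build_skill_payload(
    ["https://<storage-account>.blob.core.windows.net/docs/doc1.pdf"]
)
body = json.dumps(payload)

# To call the deployed skill (host name below is a placeholder; add a function
# key header if your app requires one), something like:
#
#   import requests
#   response = requests.post(
#       "https://<your-function-app>.azurewebsites.net/api/adi_2_ai_search",
#       data=body,
#       headers={"Content-Type": "application/json", "chunk_by_page": "True"},
#   )
```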
### Output Format

The ADI Skill conforms to the [Azure AI Search Custom Skill Output Format](https://learn.microsoft.com/en-gb/azure/search/cognitive-search-custom-skill-web-api?WT.mc_id=Portal-Microsoft_Azure_Search#sample-output-json-structure).

If the `chunk_by_page` header is `True` (recommended):
```json
{
    "values": [
        {
            "recordId": "0",
            "data": {
                "extracted_content": [
                    {
                        "page_number": 1,
                        "sections": [
                            "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 1>"
                        ],
                        "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 1>"
                    },
                    {
                        "page_number": 2,
                        "sections": [
                            "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 2>"
                        ],
                        "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 2>"
                    }
                ]
            }
        },
        {
            "recordId": "1",
            "data": {
                "extracted_content": [
                    {
                        "page_number": 1,
                        "sections": [
                            "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 1>"
                        ],
                        "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 1>"
                    },
                    {
                        "page_number": 2,
                        "sections": [
                            "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 2>"
                        ],
                        "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 2>"
                    }
                ]
            }
        }
    ]
}
```
If the `chunk_by_page` header is `False`:
```json
{
    "values": [
        {
            "recordId": "0",
            "data": {
                "extracted_content": {
                    "sections": [
                        "<LIST OF DETECTED HEADINGS AND SECTIONS FOR THE ENTIRE DOCUMENT>"
                    ],
                    "content": "<CLEANED MARKDOWN CONTENT FOR THE ENTIRE DOCUMENT>"
                }
            }
        },
        {
            "recordId": "1",
            "data": {
                "extracted_content": {
                    "sections": [
                        "<LIST OF DETECTED HEADINGS AND SECTIONS FOR THE ENTIRE DOCUMENT>"
                    ],
                    "content": "<CLEANED MARKDOWN CONTENT FOR THE ENTIRE DOCUMENT>"
                }
            }
        }
    ]
}
```
**Page wise analysis in ADI is recommended to avoid splitting tables / figures across multiple chunks when the chunking is performed.**
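Downstream code consuming the page-wise output can flatten it into one chunk per page. A minimal sketch, assuming the `chunk_by_page = True` shape shown above:

```python
def flatten_page_chunks(skill_output: dict) -> list[dict]:
    """Flatten the page-wise skill output into one chunk dict per page."""
    chunks = []
    for record in skill_output["values"]:
        for page in record["data"]["extracted_content"]:
            chunks.append({
                "record_id": record["recordId"],
                "page_number": page["page_number"],
                "sections": page["sections"],
                "content": page["content"],
            })
    return chunks

sample = {
    "values": [
        {
            "recordId": "0",
            "data": {
                "extracted_content": [
                    {"page_number": 1, "sections": ["Intro"], "content": "# Intro ..."},
                    {"page_number": 2, "sections": ["Results"], "content": "## Results ..."},
                ]
            },
        }
    ]
}
chunks = flatten_page_chunks(sample)
```

Because each page is already a self-contained chunk, no further splitting of tables or figures is needed at this stage.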
## Production Considerations

Below are some of the considerations that should be made before using this custom skill in production:

- This approach makes use of Azure Document Intelligence v4.0, which is still in preview. Features may change before the GA release. ADI v4.0 preview is only available in select regions.
- Azure Document Intelligence output quality varies significantly by file type. In our testing, a PDF file will produce richer outputs in terms of figure detection etc. compared to a PPTX file.

## Possible Improvements

Below are some possible improvements that could be made to the vectorisation approach:

- Storing the extracted figures in blob storage for access later. This would allow the LLM to resurface the correct figure, or provide a link to the figure in the reference system to be displayed in the UI.
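A minimal sketch of this improvement, assuming the `azure-storage-blob` package; the container name and blob naming scheme below are illustrative choices, not part of the repo:

```python
def figure_blob_name(document_name: str, figure_id: str) -> str:
    """Derive a deterministic blob name so a figure can be linked from references.
    Naming scheme is an illustrative assumption."""
    return f"{document_name}/figures/{figure_id}.png"

def upload_figure(image_bytes: bytes, document_name: str, figure_id: str,
                  connection_string: str, container: str = "extracted-figures") -> str:
    """Upload a cropped figure to Azure Blob Storage and return its blob name.

    Requires the `azure-storage-blob` package (import deferred so the helper
    above stays usable without it).
    """
    from azure.storage.blob import BlobServiceClient

    name = figure_blob_name(document_name, figure_id)
    client = BlobServiceClient.from_connection_string(connection_string)
    client.get_blob_client(container=container, blob=name).upload_blob(
        image_bytes, overwrite=True
    )
    return name
```

The returned blob name could then be stored alongside the figure description in the index, so the UI can resolve it back to the stored image.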

text2sql/README.md

Lines changed: 8 additions & 1 deletion

@@ -10,7 +10,7 @@ The sample provided works with Azure SQL Server, although it has been easily ada
The following diagram shows a workflow for how the Text2SQL plugin would be incorporated into a RAG application. Using the plugins available, alongside the [Function Calling](https://platform.openai.com/docs/guides/function-calling) capabilities of LLMs, the LLM can do [Chain of Thought](https://learn.microsoft.com/en-us/dotnet/ai/conceptual/chain-of-thought-prompting) reasoning to determine the steps needed to answer the question. This allows the LLM to recognise intent and therefore pick appropriate data sources based on the intent of the question.

![High level workflow for a plugin driven RAG application](../images/Plugin%20Based%20RAG%20Flow.png "High Level Workflow")
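For the Function Calling step, the Text2SQL plugin would be surfaced to the LLM as a tool definition. A minimal sketch in the OpenAI tools format; the function name and parameter schema here are illustrative, not the repo's actual plugin contract:

```python
# Illustrative tool definition in the OpenAI function-calling ("tools") format.
text2sql_tool = {
    "type": "function",
    "function": {
        "name": "run_text2sql_query",  # hypothetical name, not from the repo
        "description": (
            "Generate and execute a SQL query against the knowledge-base "
            "database to answer a natural-language question."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "question": {
                    "type": "string",
                    "description": "The user's question to answer from the database.",
                }
            },
            "required": ["question"],
        },
    },
}
```

Passing such a definition in the `tools` list lets the model decide, per question, whether the SQL route or another data source is the appropriate step in its reasoning chain.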
## Why Text2SQL instead of indexing the database contents?

@@ -177,3 +177,10 @@ Below are some of the considerations that should be made before using this plugi
- Consider limiting the permissions of the identity or connection string to only allow access to certain tables or perform certain query types.
- If possible, run the queries under the identity of the end user so that any row or column level security is applied to the data.
- Consider data masking for sensitive columns that you do not wish to be exposed.

## Possible Improvements

Below are some possible improvements that could be made to the Text2SQL approach:

- Storing the entity names / definitions / selectors in a vector database and using a vector search to obtain the most relevant entities.
- Due to the small number of tokens that this approach uses, a vector search was not considered here, but if the number of tables is significantly larger, it may provide benefits in selecting the most appropriate tables (untested).
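The vector-search idea above could be sketched as follows; the tiny 3-dimensional embeddings and entity names are stand-ins for illustration, where real vectors would come from an embedding model:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_entities(question_embedding: list[float],
                 entity_embeddings: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the k entity names whose definition embeddings best match the question."""
    ranked = sorted(
        entity_embeddings,
        key=lambda name: cosine_similarity(question_embedding, entity_embeddings[name]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings for illustration only; real ones would be model-generated.
entities = {
    "SalesOrderHeader": [0.9, 0.1, 0.0],
    "Customer": [0.2, 0.9, 0.1],
    "ProductInventory": [0.1, 0.2, 0.9],
}
best = top_entities([0.8, 0.2, 0.1], entities, k=1)
# best == ["SalesOrderHeader"]
```

Only the top-ranked entities' definitions would then be placed in the prompt, keeping the token count low even with a large number of tables.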
