diff --git a/README.md b/README.md
index 835664b..3fbf3e6 100644
--- a/README.md
+++ b/README.md
@@ -1,12 +1,22 @@
 # Text2SQL and Image Processing in AI Search
 
-This repo provides sample code for improving RAG applications with rich data sources.
+This repo provides sample code for improving RAG applications with rich data sources, including SQL Warehouses and documents analysed with Azure Document Intelligence.
+
+The plugins and skills provided in this repository are intended to be adapted and added to your new or existing RAG application to improve response quality.
+
+## Components
 
 - `./text2sql` contains an Multi-Shot implementation for Text2SQL generation and querying which can be used to answer questions backed by a database as a knowledge base.
-- `./ai_search_with_adi` contains code for linking Azure Document Intelligence with AI Search to process complex documents with charts and images, and uses multi-modal models to interpret and understand these.
+- `./ai_search_with_adi` contains code for linking Azure Document Intelligence with AI Search to process complex documents with charts and images, and uses multi-modal models (GPT-4o) to interpret and understand these.
 
 The above components have been successfully used on production RAG projects to increase the quality of responses. The code provided in this repo is a sample of the implementation and should be adjusted before being used in production.
 
+## High Level Implementation
+
+The following diagram shows a workflow for how the Text2SQL and AI Search plugins would be incorporated into a RAG application. Using the plugins available, alongside the Function Calling capabilities of LLMs, the LLM can do Chain of Thought reasoning to determine the steps needed to answer the question. This allows the LLM to recognise intent and therefore pick the appropriate data source, or a combination of both, based on the intent of the question.
+
+![High level workflow for a plugin driven RAG application](./images/Plugin%20Based%20RAG%20Flow.png "High Level Workflow")
+
 ## Contributing
 
 This project welcomes contributions and suggestions. Most contributions require you to agree to a
diff --git a/ai_search_with_adi/README.md b/ai_search_with_adi/README.md
new file mode 100644
index 0000000..d43d1e7
--- /dev/null
+++ b/ai_search_with_adi/README.md
@@ -0,0 +1,196 @@
+# AI Search Indexing with Azure Document Intelligence
+
+This portion of the repo contains code for linking Azure Document Intelligence with AI Search to process complex documents with charts and images, and uses multi-modal models (GPT-4o) to interpret and understand these.
+
+The implementation is in Python, although it can easily be adapted for C# or another language. The code is designed to run in an Azure Function App inside the tenant.
+
+**This approach makes use of Azure Document Intelligence v4.0 which is still in preview.**
+
+## High Level Workflow
+
+A common way to perform document indexing is to either extract the text content or use [optical character recognition](https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-ocr) to gather the text content before indexing. Whilst this works well for simple files that contain mainly text-based information, the response quality diminishes significantly when the documents contain mainly charts and images, such as a PowerPoint presentation.
+
+To solve this issue and to ensure that good quality information is extracted from the document, an indexer using [Azure Document Intelligence (ADI)](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview?view=doc-intel-4.0.0) is developed with [Custom Skills](https://learn.microsoft.com/en-us/azure/search/cognitive-search-custom-skill-web-api):
+
+![High level workflow for indexing with Azure Document Intelligence based skills](./images/Indexing%20vs%20Indexing%20with%20ADI.png "Indexing with Azure Document Intelligence Approach")
+
+Instead of using OCR to extract the contents of the document, ADIv4 is used to analyse the layout of the document and convert it to a Markdown format. The Markdown format brings benefits such as:
+
+- Table layout
+- Section and header extraction with Markdown headings
+- Figure and image extraction
+
+Once the Markdown is obtained, several steps are carried out:
+
+1. **Extraction of images / charts**. The figures identified are extracted from the original document and passed to a multi-modal model (GPT-4o in this case) for analysis. We obtain a description and summary of the chart / image to infer the meaning of the figure. This allows us to index, and perform RAG analysis on, information that is visually obtainable from a chart without it being explicitly mentioned in the surrounding text. The description is added back into the original Markdown content.
+
+2. **Extraction of sections and headers**. The sections and headers are extracted from the document and also returned to the indexer. This allows us to store them as a separate field in the index and therefore surface the most relevant chunks.
+
+3. **Cleaning of Markdown**. The final Markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk, e.g. non-relevant images.
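
Steps 2 and 3 above can be illustrated with a minimal sketch. The helper names `extract_sections` and `clean_markdown`, the regular expressions, and the sample text are all hypothetical and are not taken from the function app in this repo. As a simplification, the cleaning step here drops every image reference rather than only non-relevant ones.

```python
import re

# Hypothetical helpers for illustration only; not the repo's implementation.
# Matches ATX-style headings such as "# Title" or "## Section".
HEADING_PATTERN = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)
# Matches Markdown image references such as "![alt](path)".
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def extract_sections(markdown: str) -> list[str]:
    """Return the text of every Markdown heading, in document order."""
    return [match.group(1) for match in HEADING_PATTERN.finditer(markdown)]


def clean_markdown(markdown: str) -> str:
    """Drop image references and collapse runs of blank lines."""
    without_images = IMAGE_PATTERN.sub("", markdown)
    return re.sub(r"\n{3,}", "\n\n", without_images).strip()


sample = "# Results\n\n![logo](logo.png)\n\n## Revenue\n\nRevenue grew in Q2."
sections = extract_sections(sample)
cleaned = clean_markdown(sample)
print(sections)
print(cleaned)
```

Keeping the headings in their own list mirrors the idea of storing sections as a separate index field, while the cleaned text is what would be chunked and vectorised.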
+
+Page-wise analysis in ADI is used to avoid splitting tables / figures across multiple chunks when chunking is performed.
+
+The properties returned from the ADI Custom Skill are then used to perform the following skills:
+
+- Pre-vectorisation cleaning
+- Keyphrase extraction
+- Vectorisation
+
+## Provided Notebooks \& Utilities
+
+- `./ai_search.py`, `./deployment.py` provide an easy Python-based utility for deploying an index, indexer and corresponding skillset for AI Search.
+- `./function_apps/indexer` provides a pre-built Python function app that communicates with Azure Document Intelligence, Azure OpenAI etc. to perform the Markdown conversion, extraction of figures, figure understanding and corresponding cleaning of Markdown.
+- `./rag_with_ai_search.ipynb` provides an example of how to utilise the AI Search plugin to query the index.
+
+## ADI Custom Skill
+
+Deploy the associated function app and required resources. You can then experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_ai_search` HTTP endpoint.
+
+To use with an index, either use the utility to configure an indexer in the provided form, or integrate the skill with your skillset pipeline.
+
+### function_app.py
+
+`./function_apps/indexer/function_app.py` contains the HTTP entrypoints for the ADI skill and the other provided utility skills.
+
+### adi_2_aisearch
+
+`./function_apps/indexer/adi_2_aisearch.py` contains the methods for content extraction with ADI. The key methods are:
+
+#### analyse_document
+
+This method takes the passed file, uploads it to ADI and retrieves the Markdown format.
+
+#### process_figures_from_extracted_content
+
+This method takes the detected figures and crops them out of the page to save them as images. It uses `understand_image_with_vlm` to communicate with Azure OpenAI to understand the meaning of the extracted figure.
+
+`update_figure_description` is used to update the original Markdown content with the description and meaning of the figure.
+
+#### clean_adi_markdown
+
+This method performs the final cleaning of the Markdown contents. It also extracts the section headings and page numbers so that they can be returned to the indexer alongside the content.
+
+### Input Format
+
+The ADI Skill conforms to the [Azure AI Search Custom Skill Input Format](https://learn.microsoft.com/en-gb/azure/search/cognitive-search-custom-skill-web-api?WT.mc_id=Portal-Microsoft_Azure_Search#sample-input-json-structure). AI Search will automatically build this format if you use the utility file provided in this repo to build your indexer and skillset.
+
+```json
+{
+    "values": [
+        {
+            "recordId": "0",
+            "data": {
+                "source": ""
+            }
+        },
+        {
+            "recordId": "1",
+            "data": {
+                "source": ""
+            }
+        }
+    ]
+}
+```
+
+### Output Format
+
+The ADI Skill conforms to the [Azure AI Search Custom Skill Output Format](https://learn.microsoft.com/en-gb/azure/search/cognitive-search-custom-skill-web-api?WT.mc_id=Portal-Microsoft_Azure_Search#sample-output-json-structure).
+
+If the `chunk_by_page` header is `True` (recommended):
+
+```json
+{
+    "values": [
+        {
+            "recordId": "0",
+            "data": {
+                "extracted_content": [
+                    {
+                        "page_number": 1,
+                        "sections": [
+                            ""
+                        ],
+                        "content": ""
+                    },
+                    {
+                        "page_number": 2,
+                        "sections": [
+                            ""
+                        ],
+                        "content": ""
+                    }
+                ]
+            }
+        },
+        {
+            "recordId": "1",
+            "data": {
+                "extracted_content": [
+                    {
+                        "page_number": 1,
+                        "sections": [
+                            ""
+                        ],
+                        "content": ""
+                    },
+                    {
+                        "page_number": 2,
+                        "sections": [
+                            ""
+                        ],
+                        "content": ""
+                    }
+                ]
+            }
+        }
+    ]
+}
+```
+
+If the `chunk_by_page` header is `False`:
+
+```json
+{
+    "values": [
+        {
+            "recordId": "0",
+            "data": {
+                "extracted_content": {
+                    "sections": [
+                        ""
+                    ],
+                    "content": ""
+                }
+            }
+        },
+        {
+            "recordId": "1",
+            "data": {
+                "extracted_content": {
+                    "sections": [
+                        ""
+                    ],
+                    "content": ""
+                }
+            }
+        }
+    ]
+}
+```
+
+**Page-wise analysis in ADI is recommended to avoid splitting tables / figures across multiple chunks when chunking is performed.**
+
+
+## Production Considerations
+
+Below are some of the considerations that should be made before using this custom skill in production:
+
+- This approach makes use of Azure Document Intelligence v4.0 which is still in preview. Features may change before the GA release. ADI v4.0 preview is only available in select regions.
+- Azure Document Intelligence output quality varies significantly by file type. In our testing, a PDF file produced richer outputs (e.g. in terms of figure detection) than a PPTX file.
+
+## Possible Improvements
+
+Below are some possible improvements that could be made to the vectorisation approach:
+
+- Storing the extracted figures in blob storage for access later. This would allow the LLM to resurface the correct figure, or provide a link to the figure in the reference system to be displayed in the UI.
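
The input and output formats described in the ADI README above can be sketched as a small mapping function. This is an illustration under stated assumptions, not the function app's code: `build_skill_response`, `fake_extract` and the `source` values are hypothetical, and the stand-in extractor returns the page-wise shape used when `chunk_by_page` is `True`.

```python
import json
from typing import Callable


def build_skill_response(request_body: dict, extract: Callable[[str], object]) -> dict:
    """Map each input record to an output record carrying the same recordId."""
    values = []
    for record in request_body.get("values", []):
        source = record["data"]["source"]
        values.append(
            {
                "recordId": record["recordId"],
                "data": {"extracted_content": extract(source)},
            }
        )
    return {"values": values}


def fake_extract(source: str) -> list:
    """Stand-in for the ADI analysis step; returns one page-wise chunk."""
    return [
        {
            "page_number": 1,
            "sections": ["Example heading"],
            "content": f"Markdown extracted from {source}",
        }
    ]


# Hypothetical request; the real "source" values are supplied by the indexer.
request_body = {
    "values": [
        {"recordId": "0", "data": {"source": "doc-a.pdf"}},
        {"recordId": "1", "data": {"source": "doc-b.pdf"}},
    ]
}

response = build_skill_response(request_body, fake_extract)
print(json.dumps(response, indent=4))
```

The point to note is that every output record echoes the `recordId` of its input record, which is how AI Search matches results back to documents.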
diff --git a/ai_search_with_adi/images/Indexing vs Indexing with ADI.png b/ai_search_with_adi/images/Indexing vs Indexing with ADI.png
new file mode 100644
index 0000000..c672e8b
Binary files /dev/null and b/ai_search_with_adi/images/Indexing vs Indexing with ADI.png differ
diff --git a/text2sql/images/Plugin Based RAG Flow.png b/images/Plugin Based RAG Flow.png
similarity index 100%
rename from text2sql/images/Plugin Based RAG Flow.png
rename to images/Plugin Based RAG Flow.png
diff --git a/text2sql/README.md b/text2sql/README.md
index 7760929..eed7a40 100644
--- a/text2sql/README.md
+++ b/text2sql/README.md
@@ -10,7 +10,7 @@ The sample provided works with Azure SQL Server, although it has been easily ada
 
 The following diagram shows a workflow for how the Text2SQL plugin would be incorporated into a RAG application. Using the plugins available, alongside the [Function Calling](https://platform.openai.com/docs/guides/function-calling) capabilities of LLMs, the LLM can do [Chain of Thought](https://learn.microsoft.com/en-us/dotnet/ai/conceptual/chain-of-thought-prompting) reasoning to determine the steps needed to answer the question. This allows the LLM to recognise intent and therefore pick appropriate data sources based on the intent of the question.
 
-![High level workflow for a plugin driven RAG application](./images/Plugin%20Based%20RAG%20Flow.png "High Level Workflow")
+![High level workflow for a plugin driven RAG application](../images/Plugin%20Based%20RAG%20Flow.png "High Level Workflow")
 
 ## Why Text2SQL instead of indexing the database contents?
 
@@ -177,3 +177,10 @@ Below are some of the considerations that should be made before using this plugi
 - Consider limiting the permissions of the identity or connection string to only allow access to certain tables or perform certain query types.
 - If possible, run the queries under the identity of the end user so that any row or column level security is applied to the data.
 - Consider data masking for sensitive columns that you do not wish to be exposed.
+
+## Possible Improvements
+
+Below are some possible improvements that could be made to the Text2SQL approach:
+
+- Storing the entity names / definitions / selectors in a vector database and using a vector search to obtain the most relevant entities.
+  - Due to the small number of tokens that the current approach uses, this was not considered, but if the number of tables is significantly larger, it may provide benefits in selecting the most appropriate tables (untested).
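
The entity-selection idea above could look roughly like the following sketch. It is a toy under loud assumptions: the entity names and definitions are invented, an in-memory dictionary stands in for a vector database, and a bag-of-words count vector stands in for a real embedding model.

```python
import math
import re
from collections import Counter

# Hypothetical entity definitions; a real system would load these from the
# data dictionary and store real embeddings in a vector database.
ENTITIES = {
    "SalesOrderHeader": "sales orders with order date, customer and total amount",
    "Product": "products with name, colour, size and list price",
    "Customer": "customers with name, email address and phone number",
}


def embed(text: str) -> Counter:
    """Toy embedding: lower-cased word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_entities(question: str, k: int = 2) -> list:
    """Rank entity names by similarity between question and definition."""
    query = embed(question)
    ranked = sorted(
        ENTITIES,
        key=lambda name: cosine(query, embed(ENTITIES[name])),
        reverse=True,
    )
    return ranked[:k]


best = top_entities("What is the list price of each product?", k=1)
print(best)
```

Only the top-ranked entity definitions would then be passed to the LLM, which is where the potential token saving for large schemas would come from.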