Add Semantic Chunking Code #58

Merged · 11 commits · Nov 25, 2024
2 changes: 1 addition & 1 deletion README.md
@@ -7,7 +7,7 @@ It is intended that the plugins and skills provided in this repository are adapted
## Components

- `./text_2_sql` contains three Multi-Shot implementations for Text2SQL generation and querying, which can be used to answer questions backed by a database as a knowledge base. A **prompt based** and a **vector based** approach are shown, both of which exhibit great performance in answering SQL queries. Additionally, a further iteration on the vector based approach is shown, which uses a **query cache** to further speed up generation. With these plugins, your RAG application can now access and pull data from any SQL table exposed to it to answer questions.
- `./adi_function_app` contains code for linking **Azure Document Intelligence** with AI Search to process complex documents with charts and images, and uses **multi-modal models (gpt4o)** to interpret and understand these. With this custom skill, the RAG application can **draw insights from complex charts** and images during the vector search.
- `./adi_function_app` contains code for linking **Azure Document Intelligence** with AI Search to process complex documents with charts and images, and uses **multi-modal models (gpt4o)** to interpret and understand these. With this custom skill, the RAG application can **draw insights from complex charts** and images during the vector search. This function app also contains a **Semantic Text Chunking** method that aims to intelligently group similar sentences, retaining figures and tables together, whilst separating out distinct sentences.
- `./deploy_ai_search` provides an easy Python based utility for deploying an index, indexer and corresponding skillset for AI Search and for Text2SQL.

The above components have been successfully used on production RAG projects to increase the quality of responses.
42 changes: 29 additions & 13 deletions adi_function_app/README.md
@@ -24,13 +24,21 @@ Once the Markdown is obtained, several steps are carried out:

1. **Extraction of images / charts**. The figures identified are extracted from the original document and passed to a multi-modal model (gpt4o in this case) for analysis. We obtain a description and summary of the chart / image to infer the meaning of the figure. This allows us to index and perform RAG analysis on the information that is visually obtainable from a chart, without it being explicitly mentioned in the surrounding text. The information is added back into the original chart's figure element. A minimal sketch of this figure-description call is shown after this list.

2. **Cleaning of Markdown**. The final markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk e.g. non-relevant images.
2. **Chunking**. The obtained content is chunked according to the chosen chunking strategy. This function app supports two chunking methods, **page wise** and **semantic chunking**. The page wise chunking is performed natively by Azure Document Intelligence. For semantic chunking, we include a custom chunker that splits the text with the following strategy:

Page wise analysis in ADI is used to avoid splitting tables / figures across multiple chunks, when the chunking is performed.
- Splits text into sentences.
- Groups sentences that are table or figure related, to avoid splitting them away from their context.
- Semantically groups sentences if the similarity is above the threshold, starting from the start of the text.
- Semantically groups sentences if the similarity is above the threshold, starting from the end of the text.
- Removes any chunks that end up empty.

The properties returned from the ADI Custom Skill are then used to perform the following skills:
This chunking method aims to improve on page wise chunking whilst still keeping similar sentences together. In testing, it showed strong performance improvements over straight page wise chunking, without splitting up relevant context. A simplified sketch of the grouping strategy is included under `semantic_text_chunker.py` below.

- Pre-vectorisation cleaning. This stage is important as we extract the section information in this step from the headers in the document. Additionally, we remove any Markdown tags or characters that would cause an embedding error.
3. **Cleaning of Markdown**. The final markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk e.g. non-relevant images.
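
As an illustration of step 1, the sketch below shows how an extracted figure image might be passed to a multi-modal model for a description, using the `openai` Python SDK against an Azure OpenAI deployment. The endpoint, deployment name and prompt wording are illustrative assumptions, not the exact implementation in `adi_2_ai_search.py`.

```python
import asyncio
import base64
import os

from openai import AsyncAzureOpenAI


async def describe_figure(image_bytes: bytes) -> str:
    """Ask a multi-modal model for a short description of a figure image.

    Assumes AZURE_OPENAI_ENDPOINT / AZURE_OPENAI_API_KEY are set and that a
    deployment named "gpt-4o" exists (illustrative names only).
    """
    client = AsyncAzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-06-01",
    )
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe and summarise this chart or image."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    with open("figure.png", "rb") as f:
        print(asyncio.run(describe_figure(f.read())))
```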

The properties returned from the ADI Custom Skill and Chunking are then used to perform the following skills:

- Markup cleaning. This stage is important as we extract the section information in this step from the headers in the document. Additionally, we remove any Markdown tags or characters that would cause an embedding error.
- Keyphrase extraction
- Vectorisation

@@ -49,18 +57,24 @@ The Figure 4 content has been interpreted and added into the extracted chunk to

## Provided Notebooks \& Utilities

- `./ai_search_with_adi_function_app` provides a pre-built Python function app that communicates with Azure Document Intelligence, Azure OpenAI etc to perform the Markdown conversion, extraction of figures, figure understanding and corresponding cleaning of Markdown.
- `./function_app` provides a pre-built Python function app that communicates with Azure Document Intelligence, Azure OpenAI etc to perform the Markdown conversion, extraction of figures, figure understanding and corresponding cleaning of Markdown.
- `./rag_with_ai_search.ipynb` provides an example of how to utilise the AI Search plugin to query the index.

## Deploying AI Search Setup

To deploy the pre-built index and associated indexer / skillset setup, see instructions in `./deploy_ai_search/README.md`.

## ADI Custom Skill
## Custom Skills

Deploy the associated function app and the resources. To use with an index, either use the utility to configure an indexer in the provided form, or integrate the skill with your skillset pipeline.

### ADI Custom Skill

Deploy the associated function app and required resources. You can then experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_ai_search` HTTP endpoint.
You can then experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_ai_search` HTTP endpoint. The `chunk_by_page` header controls the chunking technique *(page wise or not)*.

To use with an index, either use the utility to configure a indexer in the provided form, or integrate the skill with your skillset pipeline.
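
As a hedged example, the snippet below sketches such a request with the Python `requests` library. The function URL, key and the `source` field inside `data` are illustrative assumptions; only the outer `values` / `recordId` / `data` wrapper follows the standard AI Search custom skill format.

```python
import requests

# Illustrative values - replace with your deployed function app details.
FUNCTION_URL = "https://<function-app>.azurewebsites.net/api/adi_2_ai_search"
FUNCTION_KEY = "<function-key>"

payload = {
    "values": [
        {
            "recordId": "0",
            # The exact fields expected under "data" depend on the skill's
            # implementation; a document reference is assumed here.
            "data": {"source": "https://<storage>.blob.core.windows.net/docs/report.pdf"},
        }
    ]
}

response = requests.post(
    FUNCTION_URL,
    json=payload,
    headers={"x-functions-key": FUNCTION_KEY, "chunk_by_page": "True"},
    timeout=300,
)
response.raise_for_status()
print(response.json())
```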
### Semantic Chunker Skill

You can then test the chunking by sending an AI Search JSON format request to the `/semantic_text_chunker` HTTP endpoint. The headers control the different chunking parameters *(num_surrounding_sentences, similarity_threshold, max_chunk_tokens, min_chunk_tokens)*.
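
A similar hedged sketch for this endpoint is shown below; the chunking parameters are passed as request headers (as read in `function_app.py`), while the name of the text field inside `data` is an assumption.

```python
import requests

# Illustrative values - replace with your deployed function app details.
FUNCTION_URL = "https://<function-app>.azurewebsites.net/api/semantic_text_chunker"
FUNCTION_KEY = "<function-key>"

payload = {
    "values": [
        {
            "recordId": "0",
            # The field holding the text to chunk is assumed to be "content".
            "data": {"content": "Markdown output from the ADI skill to be chunked..."},
        }
    ]
}

headers = {
    "x-functions-key": FUNCTION_KEY,
    "num_surrounding_sentences": "1",
    "similarity_threshold": "0.8",
    "max_chunk_tokens": "500",
    "min_chunk_tokens": "50",
}

response = requests.post(FUNCTION_URL, json=payload, headers=headers, timeout=300)
response.raise_for_status()
print(response.json())
```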

### Deployment Steps

@@ -72,11 +86,15 @@ To use with an index, either use the utility to configure an indexer in the provi

#### function_app.py

`./indexer/ai_search_with_adi_function_app.py` contains the HTTP entrypoints for the ADI skill and the other provided utility skills.
`./indexer/function_app.py` contains the HTTP entrypoints for the ADI skill and the other provided utility skills.

#### semantic_text_chunker.py

#### adi_2_aisearch
`./semantic_text_chunker.py` contains the code to chunk the text semantically, whilst grouping similar sentences.
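
The simplified sketch below illustrates the grouping idea: split the text into sentences, embed them, and merge adjacent sentences whose similarity is above a threshold. It assumes a `sentence-transformers` embedding model and is not the repository's exact implementation, which also keeps figures and tables together and enforces chunk token limits.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed embedding model; the repository's chunker may use a different one.
model = SentenceTransformer("all-MiniLM-L6-v2")


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def semantic_chunk(sentences: list[str], similarity_threshold: float = 0.8) -> list[str]:
    """Greedily merge adjacent sentences whose embeddings are similar enough."""
    if not sentences:
        return []
    embeddings = model.encode(sentences)
    chunks = [sentences[0]]
    last_embedding = embeddings[0]
    for sentence, embedding in zip(sentences[1:], embeddings[1:]):
        if cosine_similarity(last_embedding, embedding) >= similarity_threshold:
            chunks[-1] = f"{chunks[-1]} {sentence}"
        else:
            chunks.append(sentence)
        last_embedding = embedding
    return chunks


print(
    semantic_chunk(
        [
            "The chart shows revenue growth.",
            "Revenue rose 10% year on year.",
            "Separately, the office moved to a new building.",
        ],
        similarity_threshold=0.5,
    )
)
```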

`./indexer/adi_2_aisearch.py` contains the methods for content extraction with ADI. The key methods are:
#### adi_2_ai_search.py

`./indexer/adi_2_ai_search.py` contains the methods for content extraction with ADI. The key methods are:

##### analyse_document

Expand Down Expand Up @@ -183,8 +201,6 @@ If `chunk_by_page` header is `False`:
}
```

**Page wise analysis in ADI is recommended to avoid splitting tables / figures across multiple chunks, when the chunking is performed.**

## Other Provided Custom Skills

Due to an AI Search product limitation whereby AI Search cannot connect to AI Services behind Private Endpoints, we provide a Custom Key Phrase Extraction Skill that will work within a Private Endpoint environment.
48 changes: 20 additions & 28 deletions adi_function_app/adi_2_ai_search.py
@@ -23,7 +23,6 @@

async def build_and_clean_markdown_for_response(
markdown_text: str,
figures: dict,
page_no: int = None,
remove_irrelevant_figures=False,
):
Expand All @@ -39,28 +38,33 @@ async def build_and_clean_markdown_for_response(
str: The cleaned Markdown text.
"""

output_dict = {}
comment_patterns = r"<!-- PageNumber=\"[^\"]*\" -->|<!-- PageHeader=\"[^\"]*\" -->|<!-- PageFooter=\"[^\"]*\" -->|<!-- PageBreak -->|<!-- Footnote=\"[^\"]*\" -->"
cleaned_text = re.sub(comment_patterns, "", markdown_text, flags=re.DOTALL)
# Pattern to match the comment start `<!--` and comment end `-->`
    # Matches the opening `<!--` and any following characters up to (but not including) the next `<`
comment_start_pattern = r"<!--[^<]*"
comment_end_pattern = r"(-->|\<)"

# Using re.sub to remove comments
cleaned_text = re.sub(
f"{comment_start_pattern}.*?{comment_end_pattern}", "", markdown_text
)

# Remove irrelevant figures
if remove_irrelevant_figures:
irrelevant_figure_pattern = r"<!-- FigureContent=\"Irrelevant Image\" -->\s*"
irrelevant_figure_pattern = r"<figure[^>]*>.*?Irrelevant Image.*?</figure>"
cleaned_text = re.sub(
irrelevant_figure_pattern, "", cleaned_text, flags=re.DOTALL
)

logging.info(f"Cleaned Text: {cleaned_text}")

output_dict["content"] = cleaned_text

output_dict["figures"] = figures

# add page number when chunk by page is enabled
if page_no is not None:
output_dict = {}
output_dict["content"] = cleaned_text
output_dict["pageNumber"] = page_no

return output_dict
return output_dict
else:
return cleaned_text


def update_figure_description(
@@ -323,23 +327,15 @@ async def process_figures_from_extracted_content(
)
)

figure_ids = [
figure_processing_data[0] for figure_processing_data in figure_processing_datas
]
logging.info("Running image understanding tasks")
figure_descriptions = await asyncio.gather(*figure_understanding_tasks)
logging.info("Finished image understanding tasks")
logging.info(f"Image Descriptions: {figure_descriptions}")

logging.info("Running image upload tasks")
figure_uris = await asyncio.gather(*figure_upload_tasks)
await asyncio.gather(*figure_upload_tasks)
logging.info("Finished image upload tasks")

figures = [
{"figureId": figure_id, "figureUri": figure_uri}
for figure_id, figure_uri in zip(figure_ids, figure_uris)
]

running_offset = 0
for figure_processing_data, figure_description in zip(
figure_processing_datas, figure_descriptions
@@ -355,7 +351,7 @@
)
running_offset += desc_offset

return markdown_content, figures
return markdown_content


def create_page_wise_content(result: AnalyzeResult) -> list:
@@ -586,8 +582,7 @@ async def process_adi_2_ai_search(record: dict, chunk_by_page: bool = False) ->
):
build_and_clean_markdown_for_response_tasks.append(
build_and_clean_markdown_for_response(
extracted_page_content[0],
extracted_page_content[1],
extracted_page_content,
page_number,
True,
)
@@ -609,10 +604,7 @@ async def process_adi_2_ai_search(record: dict, chunk_by_page: bool = False) ->
else:
markdown_content = result.content

(
extracted_content,
figures,
) = await process_figures_from_extracted_content(
extracted_content = await process_figures_from_extracted_content(
result,
operation_id,
container_and_blob,
@@ -622,7 +614,7 @@ async def process_adi_2_ai_search(record: dict, chunk_by_page: bool = False) ->
)

cleaned_result = await build_and_clean_markdown_for_response(
extracted_content, figures, remove_irrelevant_figures=True
extracted_content, remove_irrelevant_figures=True
)
except Exception as e:
logging.error(e)
67 changes: 62 additions & 5 deletions adi_function_app/function_app.py
@@ -6,8 +6,9 @@
import asyncio

from adi_2_ai_search import process_adi_2_ai_search
from pre_embedding_cleaner import process_pre_embedding_cleaner
from mark_up_cleaner import process_mark_up_cleaner
from key_phrase_extraction import process_key_phrase_extraction
from semantic_text_chunker import process_semantic_text_chunker, SemanticTextChunker

logging.basicConfig(level=logging.DEBUG)
app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)
@@ -50,8 +51,8 @@ async def adi_2_ai_search(req: func.HttpRequest) -> func.HttpResponse:
)


@app.route(route="pre_embedding_cleaner", methods=[func.HttpMethod.POST])
async def pre_embedding_cleaner(req: func.HttpRequest) -> func.HttpResponse:
@app.route(route="mark_up_cleaner", methods=[func.HttpMethod.POST])
async def mark_up_cleaner(req: func.HttpRequest) -> func.HttpResponse:
"""HTTP trigger for data cleanup function.

Args:
@@ -73,17 +74,73 @@ async def pre_embedding_cleaner(req: func.HttpRequest) -> func.HttpResponse:

record_tasks = []

for value in values:
record_tasks.append(asyncio.create_task(process_mark_up_cleaner(value)))

results = await asyncio.gather(*record_tasks)
logging.debug("Results: %s", results)
cleaned_tasks = {"values": results}

return func.HttpResponse(
json.dumps(cleaned_tasks), status_code=200, mimetype="application/json"
)


@app.route(route="semantic_text_chunker", methods=[func.HttpMethod.POST])
async def semantic_text_chunker(req: func.HttpRequest) -> func.HttpResponse:
"""HTTP trigger for text chunking function.

Args:
req (func.HttpRequest): The HTTP request object.

Returns:
func.HttpResponse: The HTTP response object."""
logging.info("Python HTTP trigger text chunking function processed a request.")

try:
req_body = req.get_json()
values = req_body.get("values")

semantic_text_chunker_config = req.headers

num_surrounding_sentences = semantic_text_chunker_config.get(
"num_surrounding_sentences", 1
)
similarity_threshold = semantic_text_chunker_config.get(
"similarity_threshold", 0.8
)
max_chunk_tokens = semantic_text_chunker_config.get("max_chunk_tokens", 500)
min_chunk_tokens = semantic_text_chunker_config.get("min_chunk_tokens", 50)

except ValueError:
return func.HttpResponse(
"Please valid Custom Skill Payload in the request body", status_code=400
)
else:
logging.debug("Input Values: %s", values)

record_tasks = []

semantic_text_chunker = SemanticTextChunker(
num_surrounding_sentences=num_surrounding_sentences,
similarity_threshold=similarity_threshold,
max_chunk_tokens=max_chunk_tokens,
min_chunk_tokens=min_chunk_tokens,
)

for value in values:
record_tasks.append(
asyncio.create_task(process_pre_embedding_cleaner(value))
asyncio.create_task(
process_semantic_text_chunker(value, semantic_text_chunker)
)
)

results = await asyncio.gather(*record_tasks)
logging.debug("Results: %s", results)
cleaned_tasks = {"values": results}

return func.HttpResponse(
json.dumps(cleaned_tasks), status_code=200, mimetype="application/json"
)

