Add Semantic Chunking Code #58

Merged · 11 commits · Nov 25, 2024
2 changes: 1 addition & 1 deletion README.md
@@ -7,7 +7,7 @@ It is intended that the plugins and skills provided in this repository are adapted
## Components

- `./text_2_sql` contains three Multi-Shot implementations for Text2SQL generation and querying, which can be used to answer questions backed by a database as a knowledge base. A **prompt based** and a **vector based** approach are shown, both of which exhibit great performance in answering SQL queries. Additionally, a further iteration on the vector based approach is shown, which uses a **query cache** to further speed up generation. With these plugins, your RAG application can now access and pull data from any SQL table exposed to it to answer questions.
- `./adi_function_app` contains code for linking **Azure Document Intelligence** with AI Search to process complex documents with charts and images, and uses **multi-modal models (gpt4o)** to interpret and understand these. With this custom skill, the RAG application can **draw insights from complex charts** and images during the vector search.
- `./adi_function_app` contains code for linking **Azure Document Intelligence** with AI Search to process complex documents with charts and images, and uses **multi-modal models (gpt4o)** to interpret and understand these. With this custom skill, the RAG application can **draw insights from complex charts** and images during the vector search. This function app also contains a **Semantic Text Chunking** method that aims to intelligently group similar sentences, retaining figures and tables together, whilst separating out distinct sentences.
- `./deploy_ai_search` provides an easy Python based utility for deploying an index, indexer and corresponding skillset for AI Search and for Text2SQL.

The above components have been successfully used on production RAG projects to increase the quality of responses.
42 changes: 29 additions & 13 deletions adi_function_app/README.md
@@ -24,13 +24,21 @@ Once the Markdown is obtained, several steps are carried out:

1. **Extraction of images / charts**. The figures identified are extracted from the original document and passed to a multi-modal model (gpt4o in this case) for analysis. We obtain a description and summary of the chart / image to infer the meaning of the figure. This allows us to index and perform RAG analysis on the information that is visually obtainable from a chart, without it being explicitly mentioned in the surrounding text. The information is added back into the original chart's figure element. A minimal sketch of this figure-description call is shown after this list.

2. **Cleaning of Markdown**. The final markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk e.g. non-relevant images.
2. **Chunking**. The obtained content is chunked according to the chosen chunking strategy. This function app supports two chunking methods, **page wise** and **semantic chunking**. The page wise chunking is performed natively by Azure Document Intelligence. For semantic chunking, we include a custom chunker that splits the text with the following strategy:

Page wise analysis in ADI is used to avoid splitting tables / figures across multiple chunks, when the chunking is performed.
- Splits text into sentences.
- Groups sentences that are table or figure related, to avoid splitting them away from their context.
- Semantically groups sentences if the similarity is above the threshold, starting from the start of the text.
- Semantically groups sentences if the similarity is above the threshold, starting from the end of the text.
- Removes any chunks that end up empty.

The properties returned from the ADI Custom Skill are then used to perform the following skills:
This chunking method aims to improve on page wise chunking whilst still keeping similar sentences together. In testing, it showed strong performance improvements over straight page wise chunking, without splitting up relevant context. A simplified sketch of the grouping strategy is included under `semantic_text_chunker.py` below.

- Pre-vectorisation cleaning. This stage is important as we extract the section information in this step from the headers in the document. Additionally, we remove any Markdown tags or characters that would cause an embedding error.
3. **Cleaning of Markdown**. The final markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk e.g. non-relevant images.
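
As an illustration of step 1, the sketch below shows how an extracted figure image might be passed to a multi-modal model for a description, using the `openai` Python SDK against an Azure OpenAI deployment. The endpoint, deployment name and prompt wording are illustrative assumptions, not the exact implementation in `adi_2_ai_search.py`.

```python
import asyncio
import base64
import os

from openai import AsyncAzureOpenAI


async def describe_figure(image_bytes: bytes) -> str:
    """Ask a multi-modal model for a short description of a figure image.

    Assumes AZURE_OPENAI_ENDPOINT / AZURE_OPENAI_API_KEY are set and that a
    deployment named "gpt-4o" exists (illustrative names only).
    """
    client = AsyncAzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-06-01",
    )
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe and summarise this chart or image."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    with open("figure.png", "rb") as f:
        print(asyncio.run(describe_figure(f.read())))
```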

The properties returned from the ADI Custom Skill and Chunking are then used to perform the following skills:

- Markup cleaning. This stage is important as we extract the section information in this step from the headers in the document. Additionally, we remove any Markdown tags or characters that would cause an embedding error.
- Keyphrase extraction
- Vectorisation

@@ -49,18 +57,24 @@ The Figure 4 content has been interpreted and added into the extracted chunk to

## Provided Notebooks \& Utilities

- `./ai_search_with_adi_function_app` provides a pre-built Python function app that communicates with Azure Document Intelligence, Azure OpenAI etc to perform the Markdown conversion, extraction of figures, figure understanding and corresponding cleaning of Markdown.
- `./function_app` provides a pre-built Python function app that communicates with Azure Document Intelligence, Azure OpenAI etc to perform the Markdown conversion, extraction of figures, figure understanding and corresponding cleaning of Markdown.
- `./rag_with_ai_search.ipynb` provides an example of how to utilise the AI Search plugin to query the index.

## Deploying AI Search Setup

To deploy the pre-built index and associated indexer / skillset setup, see instructions in `./deploy_ai_search/README.md`.

## ADI Custom Skill
## Custom Skills

Deploy the associated function app and the resources. To use with an index, either use the utility to configure an indexer in the provided form, or integrate the skill with your skillset pipeline.

### ADI Custom Skill

Deploy the associated function app and required resources. You can then experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_ai_search` HTTP endpoint.
You can then experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_ai_search` HTTP endpoint. The `chunk_by_page` header controls the chunking technique *(page wise or not)*.

To use with an index, either use the utility to configure a indexer in the provided form, or integrate the skill with your skillset pipeline.
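
As a hedged example, the snippet below sketches such a request with the Python `requests` library. The function URL, key and the `source` field inside `data` are illustrative assumptions; only the outer `values` / `recordId` / `data` wrapper follows the standard AI Search custom skill format.

```python
import requests

# Illustrative values - replace with your deployed function app details.
FUNCTION_URL = "https://<function-app>.azurewebsites.net/api/adi_2_ai_search"
FUNCTION_KEY = "<function-key>"

payload = {
    "values": [
        {
            "recordId": "0",
            # The exact fields expected under "data" depend on the skill's
            # implementation; a document reference is assumed here.
            "data": {"source": "https://<storage>.blob.core.windows.net/docs/report.pdf"},
        }
    ]
}

response = requests.post(
    FUNCTION_URL,
    json=payload,
    headers={"x-functions-key": FUNCTION_KEY, "chunk_by_page": "True"},
    timeout=300,
)
response.raise_for_status()
print(response.json())
```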
### Semantic Chunker Skill

You can then test the chunking by sending an AI Search JSON format request to the `/semantic_text_chunker` HTTP endpoint. The headers control the different chunking parameters *(num_surrounding_sentences, similarity_threshold, max_chunk_tokens, min_chunk_tokens)*.
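
A similar hedged sketch for this endpoint is shown below; the chunking parameters are passed as request headers (as read in `function_app.py`), while the name of the text field inside `data` is an assumption.

```python
import requests

# Illustrative values - replace with your deployed function app details.
FUNCTION_URL = "https://<function-app>.azurewebsites.net/api/semantic_text_chunker"
FUNCTION_KEY = "<function-key>"

payload = {
    "values": [
        {
            "recordId": "0",
            # The field holding the text to chunk is assumed to be "content".
            "data": {"content": "Markdown output from the ADI skill to be chunked..."},
        }
    ]
}

headers = {
    "x-functions-key": FUNCTION_KEY,
    "num_surrounding_sentences": "1",
    "similarity_threshold": "0.8",
    "max_chunk_tokens": "500",
    "min_chunk_tokens": "50",
}

response = requests.post(FUNCTION_URL, json=payload, headers=headers, timeout=300)
response.raise_for_status()
print(response.json())
```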

### Deployment Steps

@@ -72,11 +86,15 @@ To use with an index, either use the utility to configure an indexer in the provi

#### function_app.py

`./indexer/ai_search_with_adi_function_app.py` contains the HTTP entrypoints for the ADI skill and the other provided utility skills.
`./indexer/function_app.py` contains the HTTP entrypoints for the ADI skill and the other provided utility skills.

#### semantic_text_chunker.py

#### adi_2_aisearch
`./semantic_text_chunker.py` contains the code to chunk the text semantically, whilst grouping similar sentences.
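
The simplified sketch below illustrates the grouping idea: split the text into sentences, embed them, and merge adjacent sentences whose similarity is above a threshold. It assumes a `sentence-transformers` embedding model and is not the repository's exact implementation, which also keeps figures and tables together and enforces chunk token limits.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed embedding model; the repository's chunker may use a different one.
model = SentenceTransformer("all-MiniLM-L6-v2")


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def semantic_chunk(sentences: list[str], similarity_threshold: float = 0.8) -> list[str]:
    """Greedily merge adjacent sentences whose embeddings are similar enough."""
    if not sentences:
        return []
    embeddings = model.encode(sentences)
    chunks = [sentences[0]]
    last_embedding = embeddings[0]
    for sentence, embedding in zip(sentences[1:], embeddings[1:]):
        if cosine_similarity(last_embedding, embedding) >= similarity_threshold:
            chunks[-1] = f"{chunks[-1]} {sentence}"
        else:
            chunks.append(sentence)
        last_embedding = embedding
    return chunks


print(
    semantic_chunk(
        [
            "The chart shows revenue growth.",
            "Revenue rose 10% year on year.",
            "Separately, the office moved to a new building.",
        ],
        similarity_threshold=0.5,
    )
)
```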

`./indexer/adi_2_aisearch.py` contains the methods for content extraction with ADI. The key methods are:
#### adi_2_ai_search.py

`./indexer/adi_2_ai_search.py` contains the methods for content extraction with ADI. The key methods are:

##### analyse_document

Expand Down Expand Up @@ -183,8 +201,6 @@ If `chunk_by_page` header is `False`:
}
```

**Page wise analysis in ADI is recommended to avoid splitting tables / figures across multiple chunks, when the chunking is performed.**

## Other Provided Custom Skills

Due to an AI Search product limitation whereby AI Search cannot connect to AI Services behind Private Endpoints, we provide a Custom Key Phrase Extraction Skill that will work within a Private Endpoint environment.
48 changes: 20 additions & 28 deletions adi_function_app/adi_2_ai_search.py
@@ -23,7 +23,6 @@

async def build_and_clean_markdown_for_response(
markdown_text: str,
figures: dict,
page_no: int = None,
remove_irrelevant_figures=False,
):
Expand All @@ -39,28 +38,33 @@ async def build_and_clean_markdown_for_response(
str: The cleaned Markdown text.
"""

output_dict = {}
comment_patterns = r"<!-- PageNumber=\"[^\"]*\" -->|<!-- PageHeader=\"[^\"]*\" -->|<!-- PageFooter=\"[^\"]*\" -->|<!-- PageBreak -->|<!-- Footnote=\"[^\"]*\" -->"
cleaned_text = re.sub(comment_patterns, "", markdown_text, flags=re.DOTALL)
# Pattern to match the comment start `<!--` and comment end `-->`
    # Matches the opening `<!--` and any following characters up to (but not including) the next `<`
comment_start_pattern = r"<!--[^<]*"
comment_end_pattern = r"(-->|\<)"

# Using re.sub to remove comments
cleaned_text = re.sub(
f"{comment_start_pattern}.*?{comment_end_pattern}", "", markdown_text
)

# Remove irrelevant figures
if remove_irrelevant_figures:
irrelevant_figure_pattern = r"<!-- FigureContent=\"Irrelevant Image\" -->\s*"
irrelevant_figure_pattern = r"<figure[^>]*>.*?Irrelevant Image.*?</figure>"
cleaned_text = re.sub(
irrelevant_figure_pattern, "", cleaned_text, flags=re.DOTALL
)

logging.info(f"Cleaned Text: {cleaned_text}")

output_dict["content"] = cleaned_text

output_dict["figures"] = figures

# add page number when chunk by page is enabled
if page_no is not None:
output_dict = {}
output_dict["content"] = cleaned_text
output_dict["pageNumber"] = page_no

return output_dict
return output_dict
else:
return cleaned_text


def update_figure_description(
@@ -323,23 +327,15 @@ async def process_figures_from_extracted_content(
)
)

figure_ids = [
figure_processing_data[0] for figure_processing_data in figure_processing_datas
]
logging.info("Running image understanding tasks")
figure_descriptions = await asyncio.gather(*figure_understanding_tasks)
logging.info("Finished image understanding tasks")
logging.info(f"Image Descriptions: {figure_descriptions}")

logging.info("Running image upload tasks")
figure_uris = await asyncio.gather(*figure_upload_tasks)
await asyncio.gather(*figure_upload_tasks)
logging.info("Finished image upload tasks")

figures = [
{"figureId": figure_id, "figureUri": figure_uri}
for figure_id, figure_uri in zip(figure_ids, figure_uris)
]

running_offset = 0
for figure_processing_data, figure_description in zip(
figure_processing_datas, figure_descriptions
@@ -355,7 +351,7 @@
)
running_offset += desc_offset

return markdown_content, figures
return markdown_content


def create_page_wise_content(result: AnalyzeResult) -> list:
@@ -586,8 +582,7 @@ async def process_adi_2_ai_search(record: dict, chunk_by_page: bool = False) ->
):
build_and_clean_markdown_for_response_tasks.append(
build_and_clean_markdown_for_response(
extracted_page_content[0],
extracted_page_content[1],
extracted_page_content,
page_number,
True,
)
@@ -609,10 +604,7 @@ async def process_adi_2_ai_search(record: dict, chunk_by_page: bool = False) ->
else:
markdown_content = result.content

(
extracted_content,
figures,
) = await process_figures_from_extracted_content(
extracted_content = await process_figures_from_extracted_content(
result,
operation_id,
container_and_blob,
@@ -622,7 +614,7 @@ async def process_adi_2_ai_search(record: dict, chunk_by_page: bool = False) ->
)

cleaned_result = await build_and_clean_markdown_for_response(
extracted_content, figures, remove_irrelevant_figures=True
extracted_content, remove_irrelevant_figures=True
)
except Exception as e:
logging.error(e)
67 changes: 62 additions & 5 deletions adi_function_app/function_app.py
@@ -6,8 +6,9 @@
import asyncio

from adi_2_ai_search import process_adi_2_ai_search
from pre_embedding_cleaner import process_pre_embedding_cleaner
from mark_up_cleaner import process_mark_up_cleaner
from key_phrase_extraction import process_key_phrase_extraction
from semantic_text_chunker import process_semantic_text_chunker, SemanticTextChunker

logging.basicConfig(level=logging.DEBUG)
app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)
@@ -50,8 +51,8 @@ async def adi_2_ai_search(req: func.HttpRequest) -> func.HttpResponse:
)


@app.route(route="pre_embedding_cleaner", methods=[func.HttpMethod.POST])
async def pre_embedding_cleaner(req: func.HttpRequest) -> func.HttpResponse:
@app.route(route="mark_up_cleaner", methods=[func.HttpMethod.POST])
async def mark_up_cleaner(req: func.HttpRequest) -> func.HttpResponse:
"""HTTP trigger for data cleanup function.

Args:
@@ -73,17 +74,73 @@ async def pre_embedding_cleaner(req: func.HttpRequest) -> func.HttpResponse:

record_tasks = []

for value in values:
record_tasks.append(asyncio.create_task(process_mark_up_cleaner(value)))

results = await asyncio.gather(*record_tasks)
logging.debug("Results: %s", results)
cleaned_tasks = {"values": results}

return func.HttpResponse(
json.dumps(cleaned_tasks), status_code=200, mimetype="application/json"
)


@app.route(route="semantic_text_chunker", methods=[func.HttpMethod.POST])
async def semantic_text_chunker(req: func.HttpRequest) -> func.HttpResponse:
"""HTTP trigger for text chunking function.

Args:
req (func.HttpRequest): The HTTP request object.

Returns:
func.HttpResponse: The HTTP response object."""
logging.info("Python HTTP trigger text chunking function processed a request.")

try:
req_body = req.get_json()
values = req_body.get("values")

semantic_text_chunker_config = req.headers

num_surrounding_sentences = semantic_text_chunker_config.get(
"num_surrounding_sentences", 1
)
similarity_threshold = semantic_text_chunker_config.get(
"similarity_threshold", 0.8
)
max_chunk_tokens = semantic_text_chunker_config.get("max_chunk_tokens", 500)
min_chunk_tokens = semantic_text_chunker_config.get("min_chunk_tokens", 50)

except ValueError:
return func.HttpResponse(
"Please valid Custom Skill Payload in the request body", status_code=400
)
else:
logging.debug("Input Values: %s", values)

record_tasks = []

semantic_text_chunker = SemanticTextChunker(
num_surrounding_sentences=num_surrounding_sentences,
similarity_threshold=similarity_threshold,
max_chunk_tokens=max_chunk_tokens,
min_chunk_tokens=min_chunk_tokens,
)

for value in values:
record_tasks.append(
asyncio.create_task(process_pre_embedding_cleaner(value))
asyncio.create_task(
process_semantic_text_chunker(value, semantic_text_chunker)
)
)

results = await asyncio.gather(*record_tasks)
logging.debug("Results: %s", results)
cleaned_tasks = {"values": results}

return func.HttpResponse(
json.dumps(cleaned_tasks), status_code=200, mimetype="application/json"
)

