
Commit 9bd16dc

Add Semantic Chunking Code (#58)
1 parent b8e45f3 commit 9bd16dc

File tree

11 files changed: +741 −121 lines changed

README.md

Lines changed: 1 addition & 1 deletion

@@ -7,7 +7,7 @@ It is intended that the plugins and skills provided in this repository, are adapted
 ## Components
 
 - `./text_2_sql` contains three Multi-Shot implementations for Text2SQL generation and querying which can be used to answer questions backed by a database as a knowledge base. A **prompt based** and **vector based** approach are shown, both of which exhibit great performance in answering SQL queries. Additionally, a further iteration on the vector based approach is shown which uses a **query cache** to further speed up generation. With these plugins, your RAG application can now access and pull data from any SQL table exposed to it to answer questions.
-- `./adi_function_app` contains code for linking **Azure Document Intelligence** with AI Search to process complex documents with charts and images, and uses **multi-modal models (gpt4o)** to interpret and understand these. With this custom skill, the RAG application can **draw insights from complex charts** and images during the vector search.
+- `./adi_function_app` contains code for linking **Azure Document Intelligence** with AI Search to process complex documents with charts and images, and uses **multi-modal models (gpt4o)** to interpret and understand these. With this custom skill, the RAG application can **draw insights from complex charts** and images during the vector search. This function app also contains a **Semantic Text Chunking** method that aims to intelligently group similar sentences, retaining figures and tables together, whilst separating out distinct sentences.
 - `./deploy_ai_search` provides an easy Python based utility for deploying an index, indexer and corresponding skillset for AI Search and for Text2SQL.
 
 The above components have been successfully used on production RAG projects to increase the quality of responses.

adi_function_app/README.md

Lines changed: 29 additions & 13 deletions

@@ -24,13 +24,21 @@ Once the Markdown is obtained, several steps are carried out:
 
 1. **Extraction of images / charts**. The figures identified are extracted from the original document and passed to a multi-modal model (gpt4o in this case) for analysis. We obtain a description and summary of the chart / image to infer the meaning of the figure. This allows us to index and perform RAG analysis on the information that is visually obtainable from a chart, without it being explicitly mentioned in the surrounding text. The information is added back into the original chart.
 
-2. **Cleaning of Markdown**. The final markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk e.g. non-relevant images.
+2. **Chunking**. The obtained content is chunked according to the chosen chunking strategy. This function app supports two chunking methods, **page wise** and **semantic chunking**. The page wise chunking is performed natively by Azure Document Intelligence. For semantic chunking, we include a custom chunker that splits the text with the following strategy:
 
-Page wise analysis in ADI is used to avoid splitting tables / figures across multiple chunks, when the chunking is performed.
+- Splits text into sentences.
+- Groups sentences together if they are table or figure related, to avoid splitting them from their context.
+- Semantically groups sentences if the similarity is above the threshold, starting from the start of the text.
+- Semantically groups sentences if the similarity is above the threshold, starting from the end of the text.
+- Removes any empty chunks.
 
-The properties returned from the ADI Custom Skill are then used to perform the following skills:
+This chunking method aims to improve on page wise chunking, whilst still keeping similar sentences together. When tested, it shows clear performance improvements over straight page wise chunking, without splitting up the relevant context.
 
-- Pre-vectorisation cleaning. This stage is important as we extract the section information in this step from the headers in the document. Additionally, we remove any Markdown tags or characters that would cause an embedding error.
+3. **Cleaning of Markdown**. The final markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk e.g. non-relevant images.
+
+The properties returned from the ADI Custom Skill and Chunking are then used to perform the following skills:
+
+- Markup cleaning. This stage is important as we extract the section information in this step from the headers in the document. Additionally, we remove any Markdown tags or characters that would cause an embedding error.
 - Keyphrase extraction
 - Vectorisation
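
The strategy in the hunk above maps naturally onto a sentence-embedding loop. The following is a minimal sketch of the idea, assuming the `sentence-transformers` package, a simple regex sentence splitter, and whitespace word counts as a stand-in for tokens. It is illustrative only and is not the repository's `SemanticTextChunker`; for brevity it omits the second pass that works backwards from the end of the text.

```python
import re

import numpy as np
from sentence_transformers import SentenceTransformer

# Keep <figure> / <table> blocks intact as single "sentences".
BLOCK_PATTERN = r"(<figure>.*?</figure>|<table>.*?</table>)"


def semantic_chunk(
    text: str,
    similarity_threshold: float = 0.8,
    max_chunk_tokens: int = 500,
    min_chunk_tokens: int = 50,
) -> list[str]:
    """Group sentences into chunks by embedding similarity (illustrative sketch)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # 1. Split the text into sentences, treating figure / table blocks as one unit.
    sentences = []
    for part in re.split(BLOCK_PATTERN, text, flags=re.DOTALL):
        if re.match(BLOCK_PATTERN, part, flags=re.DOTALL):
            sentences.append(part)
        else:
            sentences.extend(s for s in re.split(r"(?<=[.!?])\s+", part) if s.strip())
    if not sentences:
        return []

    # 2. Embed every sentence once; normalised vectors make cosine similarity a dot product.
    embeddings = model.encode(sentences, normalize_embeddings=True)

    # 3. Forward pass: keep appending while adjacent similarity stays above the
    #    threshold and the chunk stays under the token budget (tokens ~ words here).
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(np.dot(embeddings[i - 1], embeddings[i]))
        current_tokens = sum(len(s.split()) for s in current)
        if similarity >= similarity_threshold and current_tokens < max_chunk_tokens:
            current.append(sentences[i])
        else:
            chunks.append(" ".join(current))
            current = [sentences[i]]
    chunks.append(" ".join(current))

    # 4. Fold undersized chunks into their predecessor and drop anything empty.
    merged = []
    for chunk in chunks:
        if merged and len(chunk.split()) < min_chunk_tokens:
            merged[-1] = merged[-1] + " " + chunk
        elif chunk.strip():
            merged.append(chunk)
    return merged
```

The deployed skill exposes the same kind of knobs (`num_surrounding_sentences`, `similarity_threshold`, `max_chunk_tokens`, `min_chunk_tokens`) as request headers, as the `function_app.py` diff further down shows.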

@@ -49,18 +57,24 @@ The Figure 4 content has been interpreted and added into the extracted chunk to
 
 ## Provided Notebooks \& Utilities
 
-- `./ai_search_with_adi_function_app` provides a pre-built Python function app that communicates with Azure Document Intelligence, Azure OpenAI etc to perform the Markdown conversion, extraction of figures, figure understanding and corresponding cleaning of Markdown.
+- `./function_app` provides a pre-built Python function app that communicates with Azure Document Intelligence, Azure OpenAI etc to perform the Markdown conversion, extraction of figures, figure understanding and corresponding cleaning of Markdown.
 - `./rag_with_ai_search.ipynb` provides an example of how to utilise the AI Search plugin to query the index.
 
 ## Deploying AI Search Setup
 
 To deploy the pre-built index and associated indexer / skillset setup, see instructions in `./deploy_ai_search/README.md`.
 
-## ADI Custom Skill
+## Custom Skills
+
+Deploy the associated function app and the resources. To use with an index, either use the utility to configure an indexer in the provided form, or integrate the skill with your skillset pipeline.
+
+### ADI Custom Skill
 
-Deploy the associated function app and required resources. You can then experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_ai_search` HTTP endpoint.
+You can then experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_ai_search` HTTP endpoint. The header controls the chunking technique *(page wise or not)*.
 
-To use with an index, either use the utility to configure a indexer in the provided form, or integrate the skill with your skillset pipeline.
+### Semantic Chunker Skill
+
+You can then test the chunking by sending an HTTP request in the AI Search JSON format to the `/semantic_text_chunker` HTTP endpoint. The headers control the chunking parameters *(num_surrounding_sentences, similarity_threshold, max_chunk_tokens, min_chunk_tokens)*.
 
 ### Deployment Steps

@@ -72,11 +86,15 @@ To use with an index, either use the utility to configure an indexer in the provided form, or integrate the skill with your skillset pipeline.
 
 #### function_app.py
 
-`./indexer/ai_search_with_adi_function_app.py` contains the HTTP entrypoints for the ADI skill and the other provided utility skills.
+`./indexer/function_app.py` contains the HTTP entrypoints for the ADI skill and the other provided utility skills.
+
+#### semantic_text_chunker.py
 
-#### adi_2_aisearch
+`./semantic_text_chunker.py` contains the code to chunk the text semantically, whilst grouping similar sentences.
 
-`./indexer/adi_2_aisearch.py` contains the methods for content extraction with ADI. The key methods are:
+#### adi_2_ai_search.py
+
+`./indexer/adi_2_ai_search.py` contains the methods for content extraction with ADI. The key methods are:
 
 ##### analyse_document

@@ -183,8 +201,6 @@ If `chunk_by_page` header is `False`:
 }
 ```
 
-**Page wise analysis in ADI is recommended to avoid splitting tables / figures across multiple chunks, when the chunking is performed.**
-
 ## Other Provided Custom Skills
 
 Due to an AI Search product limitation that AI Search cannot connect to AI Services behind Private Endpoints, we provide a Custom Key Phrase Extraction Skill that will work within a Private Endpoint environment.
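
Both custom skills accept the standard AI Search custom skill payload: a `values` array of records, each with a `recordId` and a `data` object, returned in the same shape. As a hedged illustration of exercising the semantic chunker skill, the snippet below uses the Python `requests` package; the URL, function key and the field name inside `data` are placeholders that depend on your deployment and skillset definition.

```python
import json

import requests

# Placeholder values for a deployed instance of this function app.
FUNCTION_APP = "https://<your-function-app>.azurewebsites.net/api"
HEADERS = {
    "x-functions-key": "<function-key>",
    # Chunking parameters are passed as request headers, as described above.
    "num_surrounding_sentences": "1",
    "similarity_threshold": "0.8",
    "max_chunk_tokens": "500",
    "min_chunk_tokens": "50",
}

payload = {
    "values": [
        {
            "recordId": "0",
            # "content" is an assumed field name; your skillset mapping decides the real one.
            "data": {"content": "Markdown produced by the ADI skill goes here..."},
        }
    ]
}

response = requests.post(
    f"{FUNCTION_APP}/semantic_text_chunker", headers=HEADERS, json=payload
)
print(json.dumps(response.json(), indent=2))  # one output record per input record
```

An equivalent request to `/adi_2_ai_search` would instead carry the `chunk_by_page` header described in the output examples above.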

adi_function_app/adi_2_ai_search.py

Lines changed: 20 additions & 28 deletions

@@ -23,7 +23,6 @@
 
 async def build_and_clean_markdown_for_response(
     markdown_text: str,
-    figures: dict,
     page_no: int = None,
     remove_irrelevant_figures=False,
 ):

@@ -39,28 +38,33 @@ async def build_and_clean_markdown_for_response(
         str: The cleaned Markdown text.
     """
 
-    output_dict = {}
-    comment_patterns = r"<!-- PageNumber=\"[^\"]*\" -->|<!-- PageHeader=\"[^\"]*\" -->|<!-- PageFooter=\"[^\"]*\" -->|<!-- PageBreak -->|<!-- Footnote=\"[^\"]*\" -->"
-    cleaned_text = re.sub(comment_patterns, "", markdown_text, flags=re.DOTALL)
+    # Pattern to match the comment start `<!--` and comment end `-->`
+    # Matches opening `<!--` up to the first occurrence of a non-hyphen character
+    comment_start_pattern = r"<!--[^<]*"
+    comment_end_pattern = r"(-->|\<)"
+
+    # Using re.sub to remove comments
+    cleaned_text = re.sub(
+        f"{comment_start_pattern}.*?{comment_end_pattern}", "", markdown_text
+    )
 
     # Remove irrelevant figures
     if remove_irrelevant_figures:
-        irrelevant_figure_pattern = r"<!-- FigureContent=\"Irrelevant Image\" -->\s*"
+        irrelevant_figure_pattern = r"<figure[^>]*>.*?Irrelevant Image.*?</figure>"
         cleaned_text = re.sub(
             irrelevant_figure_pattern, "", cleaned_text, flags=re.DOTALL
         )
 
     logging.info(f"Cleaned Text: {cleaned_text}")
 
-    output_dict["content"] = cleaned_text
-
-    output_dict["figures"] = figures
-
     # add page number when chunk by page is enabled
     if page_no is not None:
+        output_dict = {}
+        output_dict["content"] = cleaned_text
         output_dict["pageNumber"] = page_no
-
-    return output_dict
+        return output_dict
+    else:
+        return cleaned_text
 
 
 def update_figure_description(
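
To make the new cleaning patterns concrete, here is a small, self-contained illustration of the two `re.sub` passes above. The sample markup is made up; the exact comments and `<figure>` markup that ADI and the figure-description step emit may differ.

```python
import re

# Same patterns as build_and_clean_markdown_for_response above.
comment_start_pattern = r"<!--[^<]*"
comment_end_pattern = r"(-->|\<)"
irrelevant_figure_pattern = r"<figure[^>]*>.*?Irrelevant Image.*?</figure>"

sample = (
    "<figure>Irrelevant Image</figure>\n"
    "Page one text.\n"
    "<!-- PageBreak -->\n"
    "Page two text."
)

cleaned = re.sub(f"{comment_start_pattern}.*?{comment_end_pattern}", "", sample)
cleaned = re.sub(irrelevant_figure_pattern, "", cleaned, flags=re.DOTALL)
print(cleaned)  # the PageBreak comment and the irrelevant figure are both removed
```
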
@@ -323,23 +327,15 @@ async def process_figures_from_extracted_content(
             )
         )
 
-    figure_ids = [
-        figure_processing_data[0] for figure_processing_data in figure_processing_datas
-    ]
     logging.info("Running image understanding tasks")
     figure_descriptions = await asyncio.gather(*figure_understanding_tasks)
     logging.info("Finished image understanding tasks")
     logging.info(f"Image Descriptions: {figure_descriptions}")
 
     logging.info("Running image upload tasks")
-    figure_uris = await asyncio.gather(*figure_upload_tasks)
+    await asyncio.gather(*figure_upload_tasks)
     logging.info("Finished image upload tasks")
 
-    figures = [
-        {"figureId": figure_id, "figureUri": figure_uri}
-        for figure_id, figure_uri in zip(figure_ids, figure_uris)
-    ]
-
     running_offset = 0
     for figure_processing_data, figure_description in zip(
         figure_processing_datas, figure_descriptions

@@ -355,7 +351,7 @@ async def process_figures_from_extracted_content(
         )
         running_offset += desc_offset
 
-    return markdown_content, figures
+    return markdown_content
 
 
 def create_page_wise_content(result: AnalyzeResult) -> list:

@@ -586,8 +582,7 @@ async def process_adi_2_ai_search(record: dict, chunk_by_page: bool = False) ->
             ):
                 build_and_clean_markdown_for_response_tasks.append(
                     build_and_clean_markdown_for_response(
-                        extracted_page_content[0],
-                        extracted_page_content[1],
+                        extracted_page_content,
                         page_number,
                         True,
                     )

@@ -609,10 +604,7 @@ async def process_adi_2_ai_search(record: dict, chunk_by_page: bool = False) ->
         else:
             markdown_content = result.content
 
-            (
-                extracted_content,
-                figures,
-            ) = await process_figures_from_extracted_content(
+            (extracted_content) = await process_figures_from_extracted_content(
                 result,
                 operation_id,
                 container_and_blob,

@@ -622,7 +614,7 @@ async def process_adi_2_ai_search(record: dict, chunk_by_page: bool = False) ->
             )
 
             cleaned_result = await build_and_clean_markdown_for_response(
-                extracted_content, figures, remove_irrelevant_figures=True
+                extracted_content, remove_irrelevant_figures=True
             )
         except Exception as e:
             logging.error(e)

adi_function_app/function_app.py

Lines changed: 62 additions & 5 deletions

@@ -6,8 +6,9 @@
 import asyncio
 
 from adi_2_ai_search import process_adi_2_ai_search
-from pre_embedding_cleaner import process_pre_embedding_cleaner
+from adi_function_app.mark_up_cleaner import process_mark_up_cleaner
 from key_phrase_extraction import process_key_phrase_extraction
+from semantic_text_chunker import process_semantic_text_chunker, SemanticTextChunker
 
 logging.basicConfig(level=logging.DEBUG)
 app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)

@@ -50,8 +51,8 @@ async def adi_2_ai_search(req: func.HttpRequest) -> func.HttpResponse:
     )
 
 
-@app.route(route="pre_embedding_cleaner", methods=[func.HttpMethod.POST])
-async def pre_embedding_cleaner(req: func.HttpRequest) -> func.HttpResponse:
+@app.route(route="mark_up_cleaner", methods=[func.HttpMethod.POST])
+async def mark_up_cleaner(req: func.HttpRequest) -> func.HttpResponse:
     """HTTP trigger for data cleanup function.
 
     Args:

@@ -73,17 +74,73 @@ async def pre_embedding_cleaner(req: func.HttpRequest) -> func.HttpResponse:
 
         record_tasks = []
 
+        for value in values:
+            record_tasks.append(asyncio.create_task(process_mark_up_cleaner(value)))
+
+        results = await asyncio.gather(*record_tasks)
+        logging.debug("Results: %s", results)
+        cleaned_tasks = {"values": results}
+
+        return func.HttpResponse(
+            json.dumps(cleaned_tasks), status_code=200, mimetype="application/json"
+        )
+
+
+@app.route(route="semantic_text_chunker", methods=[func.HttpMethod.POST])
+async def semantic_text_chunker(req: func.HttpRequest) -> func.HttpResponse:
+    """HTTP trigger for text chunking function.
+
+    Args:
+        req (func.HttpRequest): The HTTP request object.
+
+    Returns:
+        func.HttpResponse: The HTTP response object."""
+    logging.info("Python HTTP trigger text chunking function processed a request.")
+
+    try:
+        req_body = req.get_json()
+        values = req_body.get("values")
+
+        semantic_text_chunker_config = req.headers
+
+        num_surrounding_sentences = semantic_text_chunker_config.get(
+            "num_surrounding_sentences", 1
+        )
+        similarity_threshold = semantic_text_chunker_config.get(
+            "similarity_threshold", 0.8
+        )
+        max_chunk_tokens = semantic_text_chunker_config.get("max_chunk_tokens", 500)
+        min_chunk_tokens = semantic_text_chunker_config.get("min_chunk_tokens", 50)
+
+    except ValueError:
+        return func.HttpResponse(
+            "Please provide a valid Custom Skill Payload in the request body", status_code=400
+        )
+    else:
+        logging.debug("Input Values: %s", values)
+
+        record_tasks = []
+
+        semantic_text_chunker = SemanticTextChunker(
+            num_surrounding_sentences=num_surrounding_sentences,
+            similarity_threshold=similarity_threshold,
+            max_chunk_tokens=max_chunk_tokens,
+            min_chunk_tokens=min_chunk_tokens,
+        )
+
         for value in values:
             record_tasks.append(
-                asyncio.create_task(process_pre_embedding_cleaner(value))
+                asyncio.create_task(
+                    process_semantic_text_chunker(value, semantic_text_chunker)
+                )
             )
 
         results = await asyncio.gather(*record_tasks)
         logging.debug("Results: %s", results)
         cleaned_tasks = {"values": results}
 
         return func.HttpResponse(
             json.dumps(cleaned_tasks), status_code=200, mimetype="application/json"
         )
 
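For context, each record handed to `process_semantic_text_chunker` follows the AI Search custom skill record shape: `recordId` and `data` on the way in, plus `errors` and `warnings` on the way out. The sketch below is hypothetical and only illustrates that contract; the real implementation lives in `semantic_text_chunker.py` (added in this commit) and its field names and chunker API may differ.

```python
async def process_semantic_text_chunker_sketch(record: dict, text_chunker) -> dict:
    """Hypothetical per-record wrapper showing the custom skill record contract."""
    try:
        # "content" is an assumed input field name; the skillset mapping decides the real one.
        chunks = await text_chunker.chunk(record["data"]["content"])  # hypothetical async API
        return {
            "recordId": record["recordId"],
            "data": {"chunks": chunks},
            "errors": None,
            "warnings": None,
        }
    except Exception as e:
        return {
            "recordId": record.get("recordId"),
            "data": None,
            "errors": [{"message": f"Failed to chunk record: {e}"}],
            "warnings": None,
        }
```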

0 commit comments
