Commit 9209d5e ("final updates")
1 parent 9858d37

File tree: 4 files changed, +24 −10 lines changed

deploy_ai_search_indexes/src/deploy_ai_search_indexes/image_processing.py
(2 additions & 3 deletions)

@@ -187,8 +187,6 @@ def get_skills(self) -> list:
             self.enable_page_by_chunking
         )
 
-        text_split_skill = self.get_semantic_chunker_skill(self.enable_page_by_chunking)
-
         mark_up_cleaner_skill = self.get_mark_up_cleaner_skill(
             self.enable_page_by_chunking
         )
@@ -212,11 +210,12 @@ def get_skills(self) -> list:
                 embedding_skill,
             ]
         else:
+            semantic_chunker_skill = self.get_semantic_chunker_skill()
             skills = [
                 layout_skill,
                 figure_skill,
                 merger_skill,
-                text_split_skill,
+                semantic_chunker_skill,
                 mark_up_cleaner_skill,
                 embedding_skill,
            ]
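The change above moves creation of the semantic chunker skill into the branch where it is actually used, since page-wise chunking is handled natively elsewhere. A minimal stand-alone sketch of that control flow (the class name and string placeholders are hypothetical; the real skills are Azure AI Search skill objects):

```python
# Sketch of the conditional skill-list construction this commit introduces.
# Strings stand in for the real skill objects so the example is runnable.
class IndexerSkillset:
    def __init__(self, enable_page_by_chunking: bool):
        self.enable_page_by_chunking = enable_page_by_chunking

    def get_semantic_chunker_skill(self):
        return "semantic_chunker_skill"

    def get_skills(self) -> list:
        layout_skill = "layout_skill"
        figure_skill = "figure_skill"
        merger_skill = "merger_skill"
        mark_up_cleaner_skill = "mark_up_cleaner_skill"
        embedding_skill = "embedding_skill"

        if self.enable_page_by_chunking:
            # Page-wise chunking is done by Document Intelligence itself,
            # so no chunker skill is added to the pipeline.
            skills = [layout_skill, figure_skill, merger_skill,
                      mark_up_cleaner_skill, embedding_skill]
        else:
            # Build the semantic chunker only when it will actually be used.
            semantic_chunker_skill = self.get_semantic_chunker_skill()
            skills = [layout_skill, figure_skill, merger_skill,
                      semantic_chunker_skill, mark_up_cleaner_skill,
                      embedding_skill]
        return skills
```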

image_processing/README.md
(14 additions & 1 deletion)

@@ -25,6 +25,7 @@ Instead of using OCR to extract the contents of the document, ADIv4 is used to a
 Once the Markdown is obtained, several steps are carried out:
 
 1. **Extraction of figures / charts**. The figures identified are extracted from the original document and passed to a multi-modal model (gpt-4o-mini in this case) for analysis. We obtain a description and summary of the chart / image to infer the meaning of the figure. This allows us to index and perform RAG analysis on the information that is visually obtainable from a chart, without it being explicitly mentioned in the surrounding text. The information is added back into the original chart.
+    - **The prompt aims to generate a description and summary of the chart so it can be retrieved later during search. It does not aim to summarise every part of the figure. At runtime, retrieve the figures for the given chunk from the index and pass them to the visual model for context.**
 
 2. **Chunking**. The obtained content is chunked depending on the chunking strategy. This function app supports two chunking methods, **page wise** and **semantic chunking**. Page wise chunking is performed natively by Azure Document Intelligence. For semantic chunking, we include a custom chunker that splits the text with the following strategy:
 
@@ -38,9 +39,21 @@ Once the Markdown is obtained, several steps are carried out:
 
 3. **Cleaning of Markdown**. The final markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk e.g. non-relevant figures.
 
+### AI Search Enrichment Steps
+
 > [!NOTE]
 >
-> For scalability, the above steps are performed across 5 differnet function app endpoints that are orchestrated by AI search.
+> For scalability, the above steps are performed across 5 different function app endpoints that are orchestrated by AI search.
+
+### Page Wise Chunking
+
+![AI Search Enrichment Steps & Flow for Page Wise Chunking](./images/Page%20Wise%20Chunking.png "Page Wise Chunking Enrichment Steps")
+
+### Semantic Chunking
+
+![AI Search Enrichment Steps & Flow for Semantic Chunking](./images/Semantic%20Chunking.png "Semantic Chunking Enrichment Steps")
+
+Here, the output from the layout step is considered a single block of text, and the custom semantic chunker is applied before vectorisation and projections. The custom chunker aims to retain figures and tables within the same chunk, and splits when the similarity between sentences falls below the threshold.
 
 ## Sample Output
 
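The threshold-based splitting the README describes can be sketched as follows. This is a hedged illustration, not the repo's implementation: the real chunker compares sentence embeddings, while here a simple word-overlap (Jaccard) score stands in so the example stays self-contained.

```python
# Sketch of semantic chunking: start a new chunk whenever adjacent
# sentences are less similar than `similarity_threshold`.
# Jaccard word overlap is a stand-in for embedding cosine similarity.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def semantic_chunk(sentences: list[str], similarity_threshold: float = 0.8) -> list[str]:
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if jaccard(prev, sent) < similarity_threshold:
            # Topic shift detected: close the current chunk.
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

A lower threshold merges more aggressively; a higher one produces many small chunks, which is why the endpoint exposes it as a tunable header.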

(binary image file changed: 156 KB; diff not rendered)

image_processing/src/image_processing/function_app.py
(8 additions & 6 deletions)

@@ -171,14 +171,16 @@ async def semantic_text_chunker(req: func.HttpRequest) -> func.HttpResponse:
 
         semantic_text_chunker_config = req.headers
 
-        num_surrounding_sentences = semantic_text_chunker_config.get(
-            "num_surrounding_sentences", 1
+        num_surrounding_sentences = int(
+            semantic_text_chunker_config.get("num_surrounding_sentences", 1)
         )
-        similarity_threshold = semantic_text_chunker_config.get(
-            "similarity_threshold", 0.8
+        similarity_threshold = float(
+            semantic_text_chunker_config.get("similarity_threshold", 0.8)
         )
-        max_chunk_tokens = semantic_text_chunker_config.get("max_chunk_tokens", 500)
-        min_chunk_tokens = semantic_text_chunker_config.get("min_chunk_tokens", 50)
+        max_chunk_tokens = int(
+            semantic_text_chunker_config.get("max_chunk_tokens", 500)
+        )
+        min_chunk_tokens = int(semantic_text_chunker_config.get("min_chunk_tokens", 50))
 
     except ValueError:
         return func.HttpResponse(
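The casts added above matter because HTTP header values arrive as strings, so without `int()`/`float()` the chunker would receive `"0.8"` and `"500"` as text. A small stand-alone sketch of the same parse-and-coerce pattern (a plain dict stands in for `req.headers`, and returning `{}` stands in for the HTTP error response the real endpoint sends from its `except ValueError` branch):

```python
# Coerce string header values into typed chunker config, mirroring the
# try/except in semantic_text_chunker. Defaults match the diff above.
def parse_chunker_config(headers: dict) -> dict:
    try:
        return {
            "num_surrounding_sentences": int(headers.get("num_surrounding_sentences", 1)),
            "similarity_threshold": float(headers.get("similarity_threshold", 0.8)),
            "max_chunk_tokens": int(headers.get("max_chunk_tokens", 500)),
            "min_chunk_tokens": int(headers.get("min_chunk_tokens", 50)),
        }
    except ValueError:
        # Non-numeric header value; the real endpoint returns an
        # error func.HttpResponse here.
        return {}
```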

0 commit comments