Commit 05fdc2c

Update ai search
1 parent fbe462c commit 05fdc2c

1 file changed

+29
-13
lines changed

adi_function_app/README.md

Lines changed: 29 additions & 13 deletions
@@ -24,13 +24,21 @@ Once the Markdown is obtained, several steps are carried out:
2424

2525
1. **Extraction of images / charts**. The figures identified are extracted from the original document and passed to a multi-modal model (gpt4o in this case) for analysis. We obtain a description and summary of the chart / image to infer the meaning of the figure. This allows us to index and perform RAG analysis on the information that is visually obtainable from a chart, without it being explicitly mentioned in the surrounding text. The information is added back into the original chart.
2626

27-
2. **Cleaning of Markdown**. The final markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk e.g. non-relevant images.
27+
2. **Chunking**. The obtained content is chunked according to the chosen chunking strategy. This function app supports two chunking methods, **page wise** and **semantic chunking**. Page wise chunking is performed natively by Azure Document Intelligence. For semantic chunking, we include a custom chunker that splits the text with the following strategy:
2828

29-
Page wise analysis in ADI is used to avoid splitting tables / figures across multiple chunks, when the chunking is performed.
29+
- Splits text into sentences.
30+
- Groups sentences if they are table or figure related to avoid splitting them in context.
31+
- Semantically groups sentences if the similarity is above the threshold, starting from the start of the text.
32+
- Semantically groups sentences if the similarity is above the threshold, starting from the end of the text.
33+
- Removes non-existent chunks.
3034

31-
The properties returned from the ADI Custom Skill are then used to perform the following skills:
35+
This chunking method aims to improve on page wise chunking whilst still keeping similar sentences together. In testing, it performed noticeably better than plain page wise chunking, without splitting up related context.
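The grouping idea can be sketched roughly as follows. This is a minimal illustration, not the code in `semantic_text_chunker.py`: it uses a word-count cosine similarity as a stand-in for whatever similarity measure the real chunker uses, makes a single forward pass only, and omits the table / figure grouping and the pass from the end of the text. The function names and the default threshold are hypothetical.

```python
import math
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(sentence):
    """Word counts for a sentence (a stand-in for a real embedding)."""
    counts = {}
    for word in re.findall(r"\w+", sentence.lower()):
        counts[word] = counts.get(word, 0) + 1
    return counts


def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(text, similarity_threshold=0.3):
    """Group each sentence with the previous one when similarity is above the threshold."""
    chunks = []
    for sentence in split_sentences(text):
        previous = chunks[-1][-1] if chunks else None
        if previous is not None and cosine(
            bag_of_words(previous), bag_of_words(sentence)
        ) > similarity_threshold:
            chunks[-1].append(sentence)  # similar enough: extend the current chunk
        else:
            chunks.append([sentence])  # otherwise start a new chunk
    # Drop anything that ended up empty.
    return [" ".join(chunk) for chunk in chunks if chunk]


if __name__ == "__main__":
    sample = "Cats chase mice. Cats chase birds. Stocks fell sharply today."
    for chunk in semantic_chunks(sample):
        print(chunk)
```

In this sketch the two sentences about cats share most of their words and are kept together, while the unrelated sentence starts a new chunk.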
3236

33-
- Pre-vectorisation cleaning. This stage is important as we extract the section information in this step from the headers in the document. Additionally, we remove any Markdown tags or characters that would cause an embedding error.
37+
3. **Cleaning of Markdown**. The final Markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk, e.g. non-relevant images.
38+
39+
The properties returned from the ADI Custom Skill and Chunking are then used to perform the following skills:
40+
41+
- Markup cleaning. This stage is important as we extract the section information in this step from the headers in the document. Additionally, we remove any Markdown tags or characters that would cause an embedding error.
3442
- Keyphrase extraction
3543
- Vectorisation
3644

@@ -49,18 +57,24 @@ The Figure 4 content has been interpreted and added into the extracted chunk to
4957

5058
## Provided Notebooks \& Utilities
5159

52-
- `./ai_search_with_adi_function_app` provides a pre-built Python function app that communicates with Azure Document Intelligence, Azure OpenAI etc to perform the Markdown conversion, extraction of figures, figure understanding and corresponding cleaning of Markdown.
60+
- `./function_app` provides a pre-built Python function app that communicates with Azure Document Intelligence, Azure OpenAI etc to perform the Markdown conversion, extraction of figures, figure understanding and corresponding cleaning of Markdown.
5361
- `./rag_with_ai_search.ipynb` provides an example of how to utilise the AI Search plugin to query the index.
5462

5563
## Deploying AI Search Setup
5664

5765
To deploy the pre-built index and associated indexer / skillset setup, see instructions in `./deploy_ai_search/README.md`.
5866

59-
## ADI Custom Skill
67+
## Custom Skills
68+
69+
Deploy the associated function app and the required resources. To use the skills with an index, either use the utility to configure an indexer in the provided form, or integrate the skill with your skillset pipeline.
70+
71+
### ADI Custom Skill
6072

61-
Deploy the associated function app and required resources. You can then experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_ai_search` HTTP endpoint.
73+
Once deployed, you can experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_ai_search` HTTP endpoint. The `chunk_by_page` header controls whether the chunking is page wise or not.
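As a rough illustration, a request to this endpoint could be assembled as below. The `values` / `recordId` / `data` envelope is the standard AI Search custom skill format and `chunk_by_page` is the header referenced later in this README. The base URL (a local Functions host), the `source` field name and the example document URL are assumptions for illustration only. The sketch builds the request without sending it.

```python
import json
import urllib.request


def build_skill_request(base_url, route, records, headers=None):
    """Wrap records in the AI Search custom skill envelope and build a POST request."""
    payload = {
        "values": [
            {"recordId": str(index), "data": data}
            for index, data in enumerate(records)
        ]
    }
    request_headers = {"Content-Type": "application/json"}
    request_headers.update(headers or {})
    return urllib.request.Request(
        url=base_url.rstrip("/") + route,
        data=json.dumps(payload).encode("utf-8"),
        headers=request_headers,
        method="POST",
    )


request = build_skill_request(
    "http://localhost:7071/api",  # assumption: function app running locally
    "/adi_2_ai_search",
    [{"source": "https://example.com/report.pdf"}],  # hypothetical field and URL
    headers={"chunk_by_page": "True"},
)

# To actually send it: urllib.request.urlopen(request)
print(request.full_url)
print(request.data.decode("utf-8"))
```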
6274

63-
To use with an index, either use the utility to configure a indexer in the provided form, or integrate the skill with your skillset pipeline.
75+
### Semantic Chunker Skill
76+
77+
You can then test the chunking by sending an HTTP request in the AI Search JSON format to the `/semantic_text_chunker/` HTTP endpoint. The headers control the chunking parameters *(num_surrounding_sentences, similarity_threshold, max_chunk_tokens, min_chunk_tokens)*.
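A hedged sketch of how the headers and body for such a call might be put together is shown below. The four header names come from the list above; the default values, the `content` field name and the helper name are hypothetical, and the real skill may expect different inputs.

```python
import json


def build_chunker_call(
    text,
    num_surrounding_sentences=1,  # hypothetical default
    similarity_threshold=0.8,  # hypothetical default
    max_chunk_tokens=500,  # hypothetical default
    min_chunk_tokens=100,  # hypothetical default
):
    """Return (headers, body) for a request to the semantic chunker skill."""
    if min_chunk_tokens > max_chunk_tokens:
        raise ValueError("min_chunk_tokens must not exceed max_chunk_tokens")

    # HTTP header values are strings, so every parameter is stringified.
    headers = {
        "Content-Type": "application/json",
        "num_surrounding_sentences": str(num_surrounding_sentences),
        "similarity_threshold": str(similarity_threshold),
        "max_chunk_tokens": str(max_chunk_tokens),
        "min_chunk_tokens": str(min_chunk_tokens),
    }
    # Standard AI Search custom skill envelope; "content" is an assumed field name.
    body = json.dumps({"values": [{"recordId": "0", "data": {"content": text}}]})
    return headers, body


headers, body = build_chunker_call("First sentence. Second sentence.")
print(headers)
print(body)
```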
6478

6579
### Deployment Steps
6680

@@ -72,11 +86,15 @@ To use with an index, either use the utility to configure a indexer in the provi
7286

7387
#### function_app.py
7488

75-
`./indexer/ai_search_with_adi_function_app.py` contains the HTTP entrypoints for the ADI skill and the other provided utility skills.
89+
`./indexer/function_app.py` contains the HTTP entrypoints for the ADI skill and the other provided utility skills.
90+
91+
#### semantic_text_chunker.py
7692

77-
#### adi_2_aisearch
93+
`./semantic_text_chunker.py` contains the code to chunk the text semantically, whilst grouping similar sentences.
7894

79-
`./indexer/adi_2_aisearch.py` contains the methods for content extraction with ADI. The key methods are:
95+
#### adi_2_ai_search.py
96+
97+
`./indexer/adi_2_ai_search.py` contains the methods for content extraction with ADI. The key methods are:
8098

8199
##### analyse_document
82100

@@ -183,8 +201,6 @@ If `chunk_by_page` header is `False`:
183201
}
184202
```
185203

186-
**Page wise analysis in ADI is recommended to avoid splitting tables / figures across multiple chunks, when the chunking is performed.**
187-
188204
## Other Provided Custom Skills
189205

190206
Due to an AI Search product limitation (AI Search cannot connect to AI Services behind Private Endpoints), we provide a Custom Key Phrase Extraction Skill that will work within a Private Endpoint environment.
