Commit 6049d7b

Update readme and model

1 parent a67a695

File tree

4 files changed: +242 -246 lines changed

image_processing/README.md

Lines changed: 5 additions & 5 deletions
@@ -1,10 +1,12 @@
-# AI Search Indexing with Azure Document Intelligence
+# Image Processing for RAG - AI Search Indexing with Azure Document Intelligence
 
 This portion of the repo contains code for linking Azure Document Intelligence with AI Search to process complex documents with charts and figures, and uses multi-modal models (gpt-4o-mini) to interpret and understand these.
 
 The implementation is in Python, although it can easily be adapted for C# or another language. The code is designed to run in an Azure Function App inside the tenant.
 
-**This approach makes use of Azure Document Intelligence v4.0, which is still in preview.**
+> [!NOTE]
+>
+> See `GETTING_STARTED.md` for a step-by-step guide on how to use the accelerator.
 
 ## High Level Workflow
 
@@ -37,6 +39,7 @@ Once the Markdown is obtained, several steps are carried out:
 3. **Cleaning of Markdown**. The final markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk, e.g. non-relevant figures.
 
 > [!NOTE]
+>
 > For scalability, the above steps are performed across 5 different function app endpoints that are orchestrated by AI Search.
 
 ## Sample Output
@@ -52,9 +55,6 @@ Using the [Phi-3 Technical Report: A Highly Capable Language Model Locally on Yo
 
 The Figure 4 content has been interpreted and added into the extracted chunk to enhance the context for a RAG application. This is particularly powerful for applications where the documents are heavily image- or chart-based.
 
-> [!NOTE]
-> See `GETTING_STARTED.md` for a step-by-step guide on how to use the accelerator.
-
 ## Provided Notebooks & Utilities
 
 - `./function_app` provides a pre-built Python function app that communicates with Azure Document Intelligence, Azure OpenAI etc. to perform the Markdown conversion, extraction of figures, figure understanding and corresponding cleaning of Markdown.
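The figure-understanding step described above (sending each extracted figure to a multi-modal model such as gpt-4o-mini) can be sketched as a multi-modal chat payload. The function name and prompt wording below are illustrative assumptions, not the repo's actual implementation:

```python
import base64


def build_figure_understanding_messages(image_bytes: bytes, surrounding_text: str) -> list:
    """Hypothetical sketch: build a chat payload asking a multi-modal model
    to describe a figure so the description can be inlined into the
    Markdown chunk for RAG retrieval."""
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    return [
        {
            "role": "system",
            "content": "Describe the chart or figure so the description can "
            "replace it inside a Markdown chunk for RAG retrieval.",
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"Surrounding context: {surrounding_text}"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        },
    ]


messages = build_figure_understanding_messages(b"\x89PNG", "Figure 4: benchmark results")
```

The returned message list would then be passed to a chat-completions call against the deployed multi-modal model; the model's reply replaces the figure in the cleaned Markdown.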

image_processing/src/image_processing/semantic_text_chunker.py

Lines changed: 10 additions & 14 deletions
@@ -15,17 +15,16 @@ def __init__(
         num_surrounding_sentences: int = 1,
         similarity_threshold: float = 0.8,
         max_chunk_tokens: int = 200,
-        min_chunk_tokens: int = 50,
-        distill_model=True,
+        min_chunk_tokens: int = 50
     ):
         self.num_surrounding_sentences = num_surrounding_sentences
         self.similarity_threshold = similarity_threshold
         self.max_chunk_tokens = max_chunk_tokens
         self.min_chunk_tokens = min_chunk_tokens
 
-        self.distill_model = distill_model
         model_name = "minishlab/M2V_base_output"
         self.distilled_model = StaticModel.from_pretrained(model_name)
+
         try:
             self._nlp_model = spacy.load("en_core_web_md")
         except IOError as e:
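The `min_chunk_tokens` / `max_chunk_tokens` bounds configured in this constructor drive the chunk-merging behaviour. As a rough illustration only (using whitespace word counts instead of the repo's tokenizer, and ignoring the semantic-similarity and table/figure logic), greedy merging under such bounds might look like:

```python
def merge_sentences(sentences: list[str], min_tokens: int = 50, max_tokens: int = 200) -> list[str]:
    """Illustrative stand-in: greedily merge sentences into chunks that stay
    under max_tokens, folding an undersized trailing chunk into the previous
    one. Uses whitespace word counts instead of a real tokenizer."""
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        tokens = len(sentence.split())
        if current and current_tokens + tokens > max_tokens:
            # Current chunk is full; flush it and start a new one.
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += tokens
    if current:
        # Avoid emitting a final chunk below the minimum size.
        if chunks and current_tokens < min_tokens:
            chunks[-1] += " " + " ".join(current)
        else:
            chunks.append(" ".join(current))
    return chunks
```

The real chunker additionally compares embedding similarity between neighbouring sentences (via `similarity_threshold`) and keeps tables and figures intact, which this sketch omits.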
@@ -267,7 +266,7 @@ def look_ahead_and_behind_sentences(
             next_sentence_is_table_or_figure,
         ) in enumerate(
             is_table_or_figure_map[
-                current_sentence_index : current_sentence_index
+                current_sentence_index: current_sentence_index
                 + surround_sentences_gap_to_test
             ]
         ):
@@ -301,7 +300,8 @@ def retrive_current_chunk_at_n(n):
         else:
             return current_chunk[n]
 
-        current_chunk_tokens = self.num_tokens_from_string(" ".join(current_chunk))
+        current_chunk_tokens = self.num_tokens_from_string(
+            " ".join(current_chunk))
 
         if len(current_chunk) >= 2 and current_chunk_tokens >= self.min_chunk_tokens:
             logging.info("Comparing chunks")
@@ -403,13 +403,13 @@ def retrieve_current_chunk():
                 new_is_table_or_figure_map.append(False)
                 if forwards_direction:
                     current_chunk = sentences[
-                        current_sentence_index : current_sentence_index
+                        current_sentence_index: current_sentence_index
                         + min_of_distance_to_next_figure_or_num_surrounding_sentences
                     ]
                 else:
                     current_chunk = sentences[
-                        current_sentence_index : current_sentence_index
-                        - min_of_distance_to_next_figure_or_num_surrounding_sentences : -1
+                        current_sentence_index: current_sentence_index
+                        - min_of_distance_to_next_figure_or_num_surrounding_sentences: -1
                     ]
                 index += min_of_distance_to_next_figure_or_num_surrounding_sentences
                 continue
@@ -446,12 +446,8 @@ def retrieve_current_chunk():
         return chunks, new_is_table_or_figure_map
 
     def sentence_similarity(self, text_1, text_2):
-        if self.distill_model:
-            vec1 = self.distilled_model.encode(text_1)
-            vec2 = self.distilled_model.encode(text_2)
-        else:
-            vec1 = self._nlp_model(text_1).vector
-            vec2 = self._nlp_model(text_2).vector
+        vec1 = self.distilled_model.encode(text_1)
+        vec2 = self.distilled_model.encode(text_2)
 
         dot_product = np.dot(vec1, vec2)
         magnitude = np.linalg.norm(vec1) * np.linalg.norm(vec2)
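After this commit, `sentence_similarity` always uses the model2vec embeddings (the spaCy fallback is removed), but the similarity computation itself is unchanged: plain cosine similarity, as in this standalone sketch of the same two lines:

```python
import numpy as np


def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors, computed exactly
    as in sentence_similarity: dot product over the product of magnitudes."""
    dot_product = np.dot(vec1, vec2)
    magnitude = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    return float(dot_product / magnitude)
```

Values near 1 mean the two sentences are semantically close; the chunker compares this score against `similarity_threshold` when deciding whether to merge.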

text_2_sql/README.md

Lines changed: 5 additions & 4 deletions
@@ -6,7 +6,7 @@ The sample provided works with Azure SQL Server, although it has been easily ada
 
 > [!NOTE]
 >
-> - Previous versions of this approach have now been moved to `previous_iterations/semantic_kernel`. These will not be updated.
+> See `GETTING_STARTED.md` for a step-by-step guide on how to use the accelerator.
 
 ## Why Text2SQL instead of indexing the database contents?
 
@@ -31,6 +31,10 @@ To solve these issues, a Multi-Shot approach is developed. Below is the iteratio
 
 ![Comparison between a common Text2SQL approach and a Multi-Shot Text2SQL approach.](./images/Text2SQL%20Approaches.png "Multi Shot SQL Approaches")
 
+> [!NOTE]
+>
+> - Previous versions of this approach have now been moved to `previous_iterations/semantic_kernel`. These will not be updated or maintained.
+
 Our approach has evolved as the system has matured into a multi-agent approach that brings improved reasoning, speed and instruction-following capabilities. With separation into agents, each agent can focus on one task only, providing a better overall flow and response quality.
 
 Using Auto-Function calling capabilities, the LLM is able to retrieve from the plugin the full schema information for the views / tables that it considers useful for answering the question. Once retrieved, the full SQL query can then be generated. The schemas for multiple views / tables can be retrieved to allow the LLM to perform joins and other complex queries.
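In OpenAI-style function calling, the schema-retrieval plugin described in the paragraph above would be exposed to the model as a tool definition along these lines. The name and parameters are illustrative assumptions, not the repo's actual plugin signature:

```python
# Hypothetical tool definition for schema retrieval: the LLM calls this with
# the views/tables it considers relevant, then generates SQL from the result.
get_entity_schemas_tool = {
    "type": "function",
    "function": {
        "name": "get_entity_schemas",
        "description": "Retrieve full column schemas for the views or tables "
        "most relevant to the user's question.",
        "parameters": {
            "type": "object",
            "properties": {
                "entities": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "View or table names to fetch schemas for.",
                }
            },
            "required": ["entities"],
        },
    },
}
```

Because `entities` is an array, the model can request several schemas in one call, which is what enables the joins and complex queries mentioned above.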
@@ -39,9 +43,6 @@ To improve the scalability and accuracy in SQL Query generation, the entity rela
 
 For the query cache enabled approach, AI Search is used as a vector-based cache, but any other cache that supports vector queries could be used, such as Redis.
 
-> [!NOTE]
-> See `GETTING_STARTED.md` for a step-by-step guide on how to use the accelerator.
-
 ### Full Logical Flow for Agentic Vector Based Approach
 
 The following diagram shows the logical flow within the multi-agent system. In an ideal scenario, the questions will follow the _Pre-Fetched Cache Results Path_, which leads to the quickest answer generation. In cases where the question is not known, the group chat selector will fall back to the other agents accordingly and generate the SQL query using the LLMs. The cache is then updated with the newly generated query and schemas.
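The vector cache lookup described above can be sketched as a minimal in-memory stand-in (assuming precomputed question embeddings; in practice AI Search or Redis would replace the linear scan with an indexed vector query):

```python
import numpy as np


def lookup_query_cache(question_vec: np.ndarray, cache: list, threshold: float = 0.9):
    """Illustrative sketch: return the cached SQL whose stored question
    embedding is most similar (cosine) to the incoming question, or None
    on a cache miss. `cache` is a list of (embedding, sql) pairs."""
    best_sql, best_score = None, threshold
    for cached_vec, cached_sql in cache:
        score = float(
            np.dot(question_vec, cached_vec)
            / (np.linalg.norm(question_vec) * np.linalg.norm(cached_vec))
        )
        if score >= best_score:
            best_sql, best_score = cached_sql, score
    return best_sql
```

A hit follows the pre-fetched cache path straight to answer generation; a miss falls through to the agents, and the newly generated query and schemas are written back to the cache.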
