Commit 6049d7b

Update readme and model

1 parent a67a695

File tree

4 files changed: +242 -246 lines changed

image_processing/README.md

Lines changed: 5 additions & 5 deletions
@@ -1,10 +1,12 @@
-# AI Search Indexing with Azure Document Intelligence
+# Image Processing for RAG - AI Search Indexing with Azure Document Intelligence
 
 This portion of the repo contains code for linking Azure Document Intelligence with AI Search to process complex documents with charts and figures, and uses multi-modal models (gpt-4o-mini) to interpret and understand these.
 
 The implementation is in Python, although it can easily be adapted for C# or another language. The code is designed to run in an Azure Function App inside the tenant.
 
-**This approach makes use of Azure Document Intelligence v4.0, which is still in preview.**
+> [!NOTE]
+>
+> See `GETTING_STARTED.md` for a step-by-step guide on how to use the accelerator.
 
 ## High Level Workflow
 
@@ -37,6 +39,7 @@ Once the Markdown is obtained, several steps are carried out:
 3. **Cleaning of Markdown**. The final markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk, e.g. non-relevant figures.
 
 > [!NOTE]
+>
 > For scalability, the above steps are performed across 5 different function app endpoints that are orchestrated by AI Search.
 
 ## Sample Output
@@ -52,9 +55,6 @@ Using the [Phi-3 Technical Report: A Highly Capable Language Model Locally on Yo
 
 The Figure 4 content has been interpreted and added into the extracted chunk to enhance the context for a RAG application. This is particularly powerful for applications where the documents are heavily image- or chart-based.
 
-> [!NOTE]
-> See `GETTING_STARTED.md` for a step-by-step guide on how to use the accelerator.
-
 ## Provided Notebooks & Utilities
 
 - `./function_app` provides a pre-built Python function app that communicates with Azure Document Intelligence, Azure OpenAI etc. to perform the Markdown conversion, extraction of figures, figure understanding and corresponding cleaning of Markdown.
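The figure-understanding step described above (sending each extracted figure to a multi-modal model such as gpt-4o-mini) can be sketched as a multi-modal chat payload. The function name and prompt wording below are illustrative assumptions, not the repo's actual implementation:

```python
import base64


def build_figure_understanding_messages(image_bytes: bytes, surrounding_text: str) -> list:
    """Hypothetical sketch: build a chat payload asking a multi-modal model
    to describe a figure so the description can be inlined into the
    Markdown chunk for RAG retrieval."""
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    return [
        {
            "role": "system",
            "content": "Describe the chart or figure so the description can "
            "replace it inside a Markdown chunk for RAG retrieval.",
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"Surrounding context: {surrounding_text}"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        },
    ]


messages = build_figure_understanding_messages(b"\x89PNG", "Figure 4: benchmark results")
```

The returned message list would then be passed to a chat-completions call against the deployed multi-modal model; the model's reply replaces the figure in the cleaned Markdown.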

image_processing/src/image_processing/semantic_text_chunker.py

Lines changed: 10 additions & 14 deletions
@@ -15,17 +15,16 @@ def __init__(
         num_surrounding_sentences: int = 1,
         similarity_threshold: float = 0.8,
         max_chunk_tokens: int = 200,
-        min_chunk_tokens: int = 50,
-        distill_model=True,
+        min_chunk_tokens: int = 50
     ):
         self.num_surrounding_sentences = num_surrounding_sentences
         self.similarity_threshold = similarity_threshold
         self.max_chunk_tokens = max_chunk_tokens
         self.min_chunk_tokens = min_chunk_tokens
 
-        self.distill_model = distill_model
         model_name = "minishlab/M2V_base_output"
         self.distilled_model = StaticModel.from_pretrained(model_name)
+
         try:
             self._nlp_model = spacy.load("en_core_web_md")
         except IOError as e:
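The `min_chunk_tokens` / `max_chunk_tokens` bounds configured in this constructor drive the chunk-merging behaviour. As a rough illustration only (using whitespace word counts instead of the repo's tokenizer, and ignoring the semantic-similarity and table/figure logic), greedy merging under such bounds might look like:

```python
def merge_sentences(sentences: list[str], min_tokens: int = 50, max_tokens: int = 200) -> list[str]:
    """Illustrative stand-in: greedily merge sentences into chunks that stay
    under max_tokens, folding an undersized trailing chunk into the previous
    one. Uses whitespace word counts instead of a real tokenizer."""
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        tokens = len(sentence.split())
        if current and current_tokens + tokens > max_tokens:
            # Current chunk is full; flush it and start a new one.
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += tokens
    if current:
        # Avoid emitting a final chunk below the minimum size.
        if chunks and current_tokens < min_tokens:
            chunks[-1] += " " + " ".join(current)
        else:
            chunks.append(" ".join(current))
    return chunks
```

The real chunker additionally compares embedding similarity between neighbouring sentences (via `similarity_threshold`) and keeps tables and figures intact, which this sketch omits.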
@@ -267,7 +266,7 @@ def look_ahead_and_behind_sentences(
             next_sentence_is_table_or_figure,
         ) in enumerate(
             is_table_or_figure_map[
-                current_sentence_index : current_sentence_index
+                current_sentence_index: current_sentence_index
                 + surround_sentences_gap_to_test
             ]
         ):
@@ -301,7 +300,8 @@ def retrive_current_chunk_at_n(n):
         else:
             return current_chunk[n]
 
-        current_chunk_tokens = self.num_tokens_from_string(" ".join(current_chunk))
+        current_chunk_tokens = self.num_tokens_from_string(
+            " ".join(current_chunk))
 
         if len(current_chunk) >= 2 and current_chunk_tokens >= self.min_chunk_tokens:
             logging.info("Comparing chunks")
@@ -403,13 +403,13 @@ def retrieve_current_chunk():
                 new_is_table_or_figure_map.append(False)
                 if forwards_direction:
                     current_chunk = sentences[
-                        current_sentence_index : current_sentence_index
+                        current_sentence_index: current_sentence_index
                         + min_of_distance_to_next_figure_or_num_surrounding_sentences
                     ]
                 else:
                     current_chunk = sentences[
-                        current_sentence_index : current_sentence_index
-                        - min_of_distance_to_next_figure_or_num_surrounding_sentences : -1
+                        current_sentence_index: current_sentence_index
+                        - min_of_distance_to_next_figure_or_num_surrounding_sentences: -1
                     ]
                 index += min_of_distance_to_next_figure_or_num_surrounding_sentences
                 continue
@@ -446,12 +446,8 @@ def retrieve_current_chunk():
         return chunks, new_is_table_or_figure_map
 
     def sentence_similarity(self, text_1, text_2):
-        if self.distill_model:
-            vec1 = self.distilled_model.encode(text_1)
-            vec2 = self.distilled_model.encode(text_2)
-        else:
-            vec1 = self._nlp_model(text_1).vector
-            vec2 = self._nlp_model(text_2).vector
+        vec1 = self.distilled_model.encode(text_1)
+        vec2 = self.distilled_model.encode(text_2)
 
         dot_product = np.dot(vec1, vec2)
         magnitude = np.linalg.norm(vec1) * np.linalg.norm(vec2)
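After this commit, `sentence_similarity` always uses the model2vec embeddings (the spaCy fallback is removed), but the similarity computation itself is unchanged: plain cosine similarity, as in this standalone sketch of the same two lines:

```python
import numpy as np


def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors, computed exactly
    as in sentence_similarity: dot product over the product of magnitudes."""
    dot_product = np.dot(vec1, vec2)
    magnitude = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    return float(dot_product / magnitude)
```

Values near 1 mean the two sentences are semantically close; the chunker compares this score against `similarity_threshold` when deciding whether to merge.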

text_2_sql/README.md

Lines changed: 5 additions & 4 deletions
@@ -6,7 +6,7 @@ The sample provided works with Azure SQL Server, although it has been easily ada
 
 > [!NOTE]
 >
-> - Previous versions of this approach have now been moved to `previous_iterations/semantic_kernel`. These will not be updated.
+> See `GETTING_STARTED.md` for a step-by-step guide on how to use the accelerator.
 
 ## Why Text2SQL instead of indexing the database contents?
 
@@ -31,6 +31,10 @@ To solve these issues, a Multi-Shot approach is developed. Below is the iteratio
 
 ![Comparison between a common Text2SQL approach and a Multi-Shot Text2SQL approach.](./images/Text2SQL%20Approaches.png "Multi Shot SQL Approaches")
 
+> [!NOTE]
+>
+> - Previous versions of this approach have now been moved to `previous_iterations/semantic_kernel`. These will not be updated or maintained.
+
 Our approach has evolved as the system has matured into a multi-agent approach that brings improved reasoning, speed and instruction-following capabilities. With separation into agents, each agent can focus on one task only, providing a better overall flow and response quality.
 
 Using Auto-Function calling capabilities, the LLM is able to retrieve from the plugin the full schema information for the views / tables that it considers useful for answering the question. Once retrieved, the full SQL query can then be generated. The schemas for multiple views / tables can be retrieved to allow the LLM to perform joins and other complex queries.
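In OpenAI-style function calling, the schema-retrieval plugin described in the paragraph above would be exposed to the model as a tool definition along these lines. The name and parameters are illustrative assumptions, not the repo's actual plugin signature:

```python
# Hypothetical tool definition for schema retrieval: the LLM calls this with
# the views/tables it considers relevant, then generates SQL from the result.
get_entity_schemas_tool = {
    "type": "function",
    "function": {
        "name": "get_entity_schemas",
        "description": "Retrieve full column schemas for the views or tables "
        "most relevant to the user's question.",
        "parameters": {
            "type": "object",
            "properties": {
                "entities": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "View or table names to fetch schemas for.",
                }
            },
            "required": ["entities"],
        },
    },
}
```

Because `entities` is an array, the model can request several schemas in one call, which is what enables the joins and complex queries mentioned above.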
@@ -39,9 +43,6 @@ To improve the scalability and accuracy in SQL Query generation, the entity rela
 
 For the query cache enabled approach, AI Search is used as a vector-based cache, but any other cache that supports vector queries could be used, such as Redis.
 
-> [!NOTE]
-> See `GETTING_STARTED.md` for a step-by-step guide on how to use the accelerator.
-
 ### Full Logical Flow for Agentic Vector Based Approach
 
 The following diagram shows the logical flow within the multi-agent system. In an ideal scenario, the questions will follow the _Pre-Fetched Cache Results Path_, which leads to the quickest answer generation. In cases where the question is not known, the group chat selector will fall back to the other agents accordingly and generate the SQL query using the LLMs. The cache is then updated with the newly generated query and schemas.
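The vector cache lookup described above can be sketched as a minimal in-memory stand-in (assuming precomputed question embeddings; in practice AI Search or Redis would replace the linear scan with an indexed vector query):

```python
import numpy as np


def lookup_query_cache(question_vec: np.ndarray, cache: list, threshold: float = 0.9):
    """Illustrative sketch: return the cached SQL whose stored question
    embedding is most similar (cosine) to the incoming question, or None
    on a cache miss. `cache` is a list of (embedding, sql) pairs."""
    best_sql, best_score = None, threshold
    for cached_vec, cached_sql in cache:
        score = float(
            np.dot(question_vec, cached_vec)
            / (np.linalg.norm(question_vec) * np.linalg.norm(cached_vec))
        )
        if score >= best_score:
            best_sql, best_score = cached_sql, score
    return best_sql
```

A hit follows the pre-fetched cache path straight to answer generation; a miss falls through to the agents, and the newly generated query and schemas are written back to the cache.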
