`text_2_sql/README.md` (40 additions, 15 deletions)
To solve these issues, a Multi-Shot approach is developed.
Three different iterations are presented and code provided for:
- **Iteration 2:** A brief description of each available entity is injected into the prompt. This limits the number of tokens used and avoids filling the prompt with confusing schema information.
- **Iteration 3:** Indexing the entity definitions in a vector database, such as AI Search, and querying it to retrieve the most relevant entities for the key terms from the query.
- **Iteration 4:** Keeping an index of commonly asked questions and which schema / SQL query they resolve to; this index is generated by the LLM when it encounters a question that has not been previously asked. Additionally, indexing the entity definitions in a vector database, such as AI Search _(same as Iteration 3)_. The question index is queried first to see if a similar SQL query can be obtained _(if there is a high probability of an exact SQL query match, the results can be pre-fetched)_. If not, the plugin falls back to the schema index, querying it to retrieve the most relevant entities for the key terms from the query.
All approaches limit the number of tokens used and avoid filling the prompt with confusing schema information.
For the query cache enabled approach, AI Search is used as a vector based cache.
### Full Logical Flow for Vector Based Approach
The following diagram shows the logical flow within the Vector Based plugin. In an ideal scenario, the questions will follow the **Pre-Fetched Cache Results Path**, which leads to the quickest answer generation. In cases where the question is not known, the plugin will fall back to the other paths accordingly.
As the query cache is shared between users (no data is stored in the cache), a new user can benefit from the pre-mapped question and schema resolution in the index.
**Database results were deliberately not stored within the cache. Storing them would have removed one of the key benefits of the Text2SQL plugin: the ability to get near real-time information inside a RAG application. Instead, the query is stored so that the most recent results can be obtained quickly. Additionally, this retains the ability to apply Row or Column Level Security.**
_(Diagram: full logical flow for the Vector Based approach)_
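A minimal sketch of this cache-first flow is shown below. The stub bodies and the confidence threshold are placeholders for illustration; the real plugin methods (`fetch_queries_from_cache()`, `run_sql_query()`) are described later in this README.

```python
# Illustrative sketch of the logical flow: cache lookup first, pre-fetched
# results on a confident match, schema-index fallback otherwise. The stub
# bodies and the threshold value are placeholders, not the repository code.
import asyncio

CACHE_SCORE_THRESHOLD = 0.95  # assumed confidence cut-off for pre-fetching

async def fetch_queries_from_cache(question: str) -> dict | None:
    return None  # placeholder: would query the AI Search query cache index

async def fetch_schemas_from_index(question: str) -> list[dict]:
    return []  # placeholder: would query the AI Search schema index

async def run_sql_query(query: str) -> str:
    return "[]"  # placeholder: would execute the SQL and dump rows to JSON

async def answer_question(question: str) -> str:
    cached = await fetch_queries_from_cache(question)
    if cached is not None and cached["score"] >= CACHE_SCORE_THRESHOLD:
        # Pre-Fetched Cache Results Path: re-run the stored SQL query so the
        # LLM can answer directly, without generating a new query.
        results = await run_sql_query(cached["sql_query"])
        return f"LLM answers from pre-fetched results: {results}"
    # Fallback paths: retrieve relevant schemas and generate fresh SQL.
    schemas = await fetch_schemas_from_index(question)
    return f"LLM generates SQL over {len(schemas)} candidate schemas"

print(asyncio.run(answer_question("Give me the total number of orders in 2008?")))
```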
### Comparison of Iterations
| | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 |
|---|---|---|---|---|
| | Consumes a significant number of tokens as number of entities increases. | As number of entities increases, token usage will grow but at a lesser rate than Iteration 1. | | AI Search adds additional cost to the solution. |
| | LLM struggled to differentiate which table to choose with the large amount of information passed. | | | |
### Complete Execution Time Comparison for Approaches
To compare the difference in complete execution time, the following questions were tested 25 times each for 4 different approaches.
Approaches:
- Prompt-Based Multi-Shot (Iteration 2)
- Vector-Based Multi-Shot (Iteration 3)
- Vector-Based Multi-Shot with Query Cache (Iteration 4)
- Vector-Based Multi-Shot with Pre-Run Query Cache (Iteration 4)
Questions:
- Give me the total number of orders in 2008?
- Which country had the highest number of orders in June 2008?
The graph below shows the response times for the experiment on a Known Question Set (i.e. the cache has already been populated with the query mapping by the LLM). gpt-4o was used as the completion LLM for this experiment. The response time is the complete execution time, including:
- Prompt Preparation
- Question Understanding
- Cache Index Requests _(if applicable)_
- SQL Query Execution
- Interpretation and generation of answer in the correct format
_(Graph: complete execution time per approach on the known question set)_
The vector-based cache approaches consistently outperform those that just use a Prompt-Based or Vector-Based approach by a significant margin. Given that it is highly likely the same Text2SQL questions will be repeated often, storing the question-SQL mapping leads to **significant performance increases** that are beneficial, despite the initial additional latency (between 1 and 2 seconds in testing) when a question is asked for the first time.
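For reference, a harness along these lines (in the spirit of `./time_comparison_script.py`, whose exact implementation may differ) can produce such per-approach timings; the approach callable here is a placeholder.

```python
# Minimal timing harness sketch: run each question repeatedly per approach
# and record the complete execution time. The approach callable is a
# placeholder standing in for the real plugin pipelines.
import statistics
import time

def time_approach(run, question: str, repeats: int = 25) -> float:
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run(question)  # complete execution: prompt prep through final answer
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

placeholder = lambda question: time.sleep(0.01)  # stands in for a real approach
print(f"Median: {time_approach(placeholder, 'Give me the total number of orders in 2008?'):.3f}s")
```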
## Sample Output
### What is the top performing product by quantity of units sold?
The top-performing product by quantity of units sold is the **Classic Vest, S**.
- `./rag_with_vector_based_text_2_sql_query_cache.ipynb` provides an example of how to utilise the Vector Based Text2SQL plugin, alongside the query cache, to query the database.
- `./rag_with_ai_search_and_text_2_sql.ipynb` provides an example of how to use the Text2SQL and an AISearch plugin in parallel to automatically retrieve data from the most relevant source to answer the query.
  - This setup is useful for a production application as the SQL Database is unlikely to be able to answer all the questions a user may ask.
- `./time_comparison_script.py` provides a utility script for performing time-based comparisons between the different approaches.
## Data Dictionary
The data dictionary is stored in `./data_dictionary/entities.json`.
A full data dictionary must be built for all the views / tables you wish to expose to the LLM. The metadata provided directly influences the accuracy of the Text2SQL component.
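For illustration only, an entry might be shaped as below. The **EntityName**, **Entity**, **Description** and **Columns/Name** fields mirror those referenced by the vector search configuration later in this README; the `Definition` field and all sample values are assumptions, not the repository's actual `entities.json` content.

```python
# Illustrative shape of a data dictionary entry, written as a Python literal.
# EntityName, Entity, Description and Columns/Name are the fields referenced
# by the search configuration below; Definition and all values are assumed.
sample_entity = {
    "EntityName": "Sales Order Detail",
    "Entity": "SalesLT.SalesOrderDetail",
    "Description": "Individual product line items that make up a sales order.",
    "Columns": [
        {"Name": "OrderQty", "Definition": "Quantity of the product ordered."},
        {"Name": "UnitPrice", "Definition": "Selling price of a single unit."},
    ],
}
```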
## Prompt Based SQL Plugin (Iteration 2)
This approach works well for a small number of entities (tested on up to 20 entities with hundreds of columns). It performed well in testing; with correct metadata, we achieved 100% accuracy on the test set.
Whilst a simple and high-performing approach, the downside of this approach is the increase in the number of tokens as the number of entities increases. Additionally, we found that the LLM started to get "confused" about which columns belong to which entities as the number of entities increased.
### prompt_based_sql_plugin.py
The **target_engine** is passed to the prompt, along with **engine_specific_rules**.
This method is called by the Semantic Kernel framework automatically, when instructed to do so by the LLM, to fetch the full schema definitions for a given entity. This returns a JSON string of the chosen entity which allows the LLM to understand the column definitions and their associated metadata. This can be called in parallel for multiple entities.
#### run_sql_query()
This method is called by the Semantic Kernel framework automatically, when instructed to do so by the LLM, to run a SQL query against the given database. It returns a JSON string containing a row-wise dump of the results returned. These results are then interpreted to answer the question.
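A minimal sketch of what such a method could look like, assuming a pyodbc connection and a `SQL_CONNECTION_STRING` environment variable; the class name and connection handling are illustrative, not the repository's exact implementation.

```python
# Sketch of a run_sql_query() plugin method. The pyodbc connection, the
# environment variable and the class name are illustrative assumptions.
import json
import os

import pyodbc
from semantic_kernel.functions import kernel_function

class PromptBasedSQLPlugin:
    @kernel_function(description="Runs an SQL query against the database.")
    def run_sql_query(self, query: str) -> str:
        # Execute the LLM-generated query and dump the rows to JSON so the
        # LLM can interpret the results when forming its answer.
        with pyodbc.connect(os.environ["SQL_CONNECTION_STRING"]) as connection:
            cursor = connection.cursor()
            cursor.execute(query)
            columns = [column[0] for column in cursor.description]
            rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
        return json.dumps(rows, default=str)
```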
223
+
207
224
## Vector Based SQL Plugin (Iterations 3 & 4)
This approach allows the system to scale without significantly increasing the number of tokens used within the system prompt. Indexing and running an AI Search instance consumes additional cost, compared to the prompt based approach.
The search text passed is vectorised against the entity level **Description** columns. A hybrid Semantic Reranking search is applied against the **EntityName**, **Entity**, **Columns/Name** fields.
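A sketch of how such a hybrid query could be issued with the `azure-search-documents` SDK follows. The endpoint and index names, the `DescriptionEmbedding` vector field, the semantic configuration name and the `embed()` helper are assumptions; only the searchable field names come from this README.

```python
# Sketch of the hybrid (vector + keyword) search with Semantic Reranking.
# Index/field/configuration names and the embed() helper are assumptions.
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

def embed(text: str) -> list[float]:
    raise NotImplementedError("placeholder: call your embedding model here")

schema_client = SearchClient(
    endpoint=os.environ["AI_SEARCH_ENDPOINT"],
    index_name="text2sql-schema-store",  # assumed index name
    credential=AzureKeyCredential(os.environ["AI_SEARCH_KEY"]),
)

results = schema_client.search(
    search_text="total orders per country",  # keyword part of the hybrid search
    vector_queries=[
        # Vector part: matched against the entity-level Description embeddings.
        VectorizedQuery(
            vector=embed("total orders per country"),
            k_nearest_neighbors=5,
            fields="DescriptionEmbedding",  # assumed vector field name
        )
    ],
    query_type="semantic",  # enables Semantic Reranking
    semantic_configuration_name="schema-semantic-config",  # assumed
    search_fields=["EntityName", "Entity", "Columns/Name"],
    top=3,
)
for doc in results:
    print(doc["Entity"], doc["@search.reranker_score"])
```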
#### fetch_queries_from_cache()
The vector based approach with query cache uses the `fetch_queries_from_cache()` method to fetch the most relevant previous query and inject it into the prompt before the initial LLM call. The use of Auto-Function Calling here is avoided to reduce the response time, as the cache index will always be used first.
If the score of the top result is higher than the defined threshold, the query will be executed against the target data source and the results included in the prompt. This allows us to prompt the LLM to evaluate whether it can use these results to answer the question, **without further SQL Query generation**, to speed up the process.
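Continuing the sketches above (reusing the assumed `embed()` and `run_sql_query()` helpers), the lookup and threshold check could look like the following; the cache index name, field names and the 0.95 cut-off are assumptions.

```python
# Sketch of the cache lookup and pre-fetch threshold check. The index and
# field names, the client setup and the 0.95 cut-off are assumptions.
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

PRE_FETCH_THRESHOLD = 0.95  # assumed score cut-off for pre-running the query

cache_client = SearchClient(
    endpoint=os.environ["AI_SEARCH_ENDPOINT"],
    index_name="text2sql-query-cache",  # assumed cache index name
    credential=AzureKeyCredential(os.environ["AI_SEARCH_KEY"]),
)

def fetch_queries_from_cache(question: str) -> dict | None:
    results = cache_client.search(
        search_text=question,
        vector_queries=[VectorizedQuery(
            vector=embed(question),  # embed() as sketched above
            k_nearest_neighbors=1,
            fields="QuestionEmbedding",  # assumed vector field
        )],
        top=1,
    )
    top = next(iter(results), None)
    if top is None:
        return None
    injection = {"question": top["Question"], "sql_query": top["SqlQuery"]}
    if top["@search.score"] >= PRE_FETCH_THRESHOLD:
        # Confident match: pre-run the cached SQL so the LLM can answer
        # from the results without generating a new query.
        injection["pre_fetched_results"] = run_sql_query(top["SqlQuery"])
    return injection
```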
#### run_sql_query()
This method is called by the Semantic Kernel framework automatically, when instructed to do so by the LLM, to run a SQL query against the given database. It returns a JSON string containing a row-wise dump of the results returned. These results are then interpreted to answer the question.
Additionally, if any of the cache functionality is enabled, this method will update the query cache index based on the SQL query run and the schemas used in execution.
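A sketch of that update step, reusing the cache client and `embed()` helper from the sketches above; the document shape and the use of `merge_or_upload_documents` are illustrative assumptions.

```python
# Sketch of the cache update after a successful run: upsert the question ->
# SQL query mapping plus the schemas used. Field names are assumptions.
import hashlib

def update_query_cache(question: str, sql_query: str, schemas: list[str]) -> None:
    cache_client.merge_or_upload_documents(documents=[{
        "Id": hashlib.sha256(question.encode()).hexdigest(),  # stable per-question key
        "Question": question,
        "QuestionEmbedding": embed(question),
        "SqlQuery": sql_query,
        "Schemas": schemas,
    }])
```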