diff --git a/image_processing/README.md b/image_processing/README.md
index b19a456..be740e4 100644
--- a/image_processing/README.md
+++ b/image_processing/README.md
@@ -39,7 +39,7 @@ Once the Markdown is obtained, several steps are carried out:
 3. **Cleaning of Markdown**. The final Markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk, e.g. non-relevant figures.
 
-### AI Search Enrichment Steps
+## AI Search Enrichment Steps
 
 > [!NOTE]
 >
 
@@ -55,15 +55,18 @@ Once the Markdown is obtained, several steps are carried out:
 Here, the output from the layout step is treated as a single block of text, and the custom semantic chunker is applied before vectorisation and projections. The custom chunker aims to retain figures and tables within the same chunks, and starts a new chunk when the similarity between adjacent sentences falls below the threshold.
 
+## Runtime Retrieval - RAG Flow
+
+Below is a sample flow of how the enriched index can be used within a RAG flow:
+
+![Image Based RAG Flow Example for Image Understanding](./images/Image%20Based%20RAG.png "Example Image Based RAG Flow")
+
 ## Sample Output
 
 Using the [Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone](https://arxiv.org/pdf/2404.14219) as an example, the following output can be obtained for page 7:
 
-```json
-{
-    "final_chunk_content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n<table><caption>Table 1: Comparison results on RepoQA benchmark.</caption><tr><th>Model</th><th>Ctx Size</th><th>Python</th><th>C++</th><th>Rust</th><th>Java</th><th>TypeScript</th><th>Average</th></tr><tr><td>gpt-4O-2024-05-13</td><td>128k</td><td>95</td><td>80</td><td>85</td><td>96</td><td>97</td><td>90.6</td></tr><tr><td>gemini-1.5-flash-latest</td><td>1000k</td><td>93</td><td>79</td><td>87</td><td>94</td><td>97</td><td>90</td></tr><tr><td>Phi-3.5-MoE</td><td>128k</td><td>89</td><td>74</td><td>81</td><td>88</td><td>95</td><td>85</td></tr><tr><td>Phi-3.5-Mini</td><td>128k</td><td>86</td><td>67</td><td>73</td><td>77</td><td>82</td><td>77</td></tr><tr><td>Llama-3.1-8B-Instruct</td><td>128k</td><td>80</td><td>65</td><td>73</td><td>76</td><td>63</td><td>71</td></tr><tr><td>Mixtral-8x7B-Instruct-v0.1</td><td>32k</td><td>66</td><td>65</td><td>64</td><td>71</td><td>74</td><td>68</td></tr><tr><td>Mixtral-8x22B-Instruct-v0.1</td><td>64k</td><td>60</td><td>67</td><td>74</td><td>83</td><td>55</td><td>67.8</td></tr></table>\n\n\nsuch as Arabic, Chinese, Russian, Ukrainian, and Vietnamese, with average MMLU-multilingual scores\nof 55.4 and 47.3, respectively. Due to its larger model capacity, phi-3.5-MoE achieves a significantly\nhigher average score of 69.9, outperforming phi-3.5-mini.\n\nMMLU(5-shot) MultiLingual\n\nPhi-3-mini\n\nPhi-3.5-mini\n\nPhi-3.5-MoE\n\n\n\n\n\n We evaluate the phi-3.5-mini and phi-3.5-MoE models on two long-context understanding tasks:\nRULER [HSK+24] and RepoQA [LTD+24]. As shown in Tables 1 and 2, both phi-3.5-MoE and phi-\n3.5-mini outperform other open-source models with larger sizes, such as Llama-3.1-8B, Mixtral-8x7B,\nand Mixtral-8x22B, on the RepoQA task, and achieve comparable performance to Llama-3.1-8B on\nthe RULER task. However, we observe a significant performance drop when testing the 128K context\nwindow on the RULER task. We suspect this is due to the lack of high-quality long-context data in\nmid-training, an issue we plan to address in the next version of the model release.\n\n In the table 3, we present a detailed evaluation of the phi-3.5-mini and phi-3.5-MoE models\ncompared with recent SoTA pretrained language models, such as GPT-4o-mini, Gemini-1.5 Flash, and\nopen-source models like Llama-3.1-8B and the Mistral models. The results show that phi-3.5-mini\nachieves performance comparable to much larger models like Mistral-Nemo-12B and Llama-3.1-8B, while\nphi-3.5-MoE significantly outperforms other open-source models, offers performance comparable to\nGemini-1.5 Flash, and achieves above 90% of the average performance of GPT-4o-mini across various\nlanguage benchmarks.\n\n\n\n\n",
-    "page_number": 7
-}
+```
+\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n<table><caption>Table 1: Comparison results on RepoQA benchmark.</caption><tr><th>Model</th><th>Ctx Size</th><th>Python</th><th>C++</th><th>Rust</th><th>Java</th><th>TypeScript</th><th>Average</th></tr><tr><td>gpt-4O-2024-05-13</td><td>128k</td><td>95</td><td>80</td><td>85</td><td>96</td><td>97</td><td>90.6</td></tr><tr><td>gemini-1.5-flash-latest</td><td>1000k</td><td>93</td><td>79</td><td>87</td><td>94</td><td>97</td><td>90</td></tr><tr><td>Phi-3.5-MoE</td><td>128k</td><td>89</td><td>74</td><td>81</td><td>88</td><td>95</td><td>85</td></tr><tr><td>Phi-3.5-Mini</td><td>128k</td><td>86</td><td>67</td><td>73</td><td>77</td><td>82</td><td>77</td></tr><tr><td>Llama-3.1-8B-Instruct</td><td>128k</td><td>80</td><td>65</td><td>73</td><td>76</td><td>63</td><td>71</td></tr><tr><td>Mixtral-8x7B-Instruct-v0.1</td><td>32k</td><td>66</td><td>65</td><td>64</td><td>71</td><td>74</td><td>68</td></tr><tr><td>Mixtral-8x22B-Instruct-v0.1</td><td>64k</td><td>60</td><td>67</td><td>74</td><td>83</td><td>55</td><td>67.8</td></tr></table>\n\n\nsuch as Arabic, Chinese, Russian, Ukrainian, and Vietnamese, with average MMLU-multilingual scores\nof 55.4 and 47.3, respectively. Due to its larger model capacity, phi-3.5-MoE achieves a significantly\nhigher average score of 69.9, outperforming phi-3.5-mini.\n\nMMLU(5-shot) MultiLingual\n\nPhi-3-mini\n\nPhi-3.5-mini\n\nPhi-3.5-MoE\n\n\n\n\n\n We evaluate the phi-3.5-mini and phi-3.5-MoE models on two long-context understanding tasks:\nRULER [HSK+24] and RepoQA [LTD+24]. As shown in Tables 1 and 2, both phi-3.5-MoE and phi-\n3.5-mini outperform other open-source models with larger sizes, such as Llama-3.1-8B, Mixtral-8x7B,\nand Mixtral-8x22B, on the RepoQA task, and achieve comparable performance to Llama-3.1-8B on\nthe RULER task. However, we observe a significant performance drop when testing the 128K context\nwindow on the RULER task. We suspect this is due to the lack of high-quality long-context data in\nmid-training, an issue we plan to address in the next version of the model release.\n\n In the table 3, we present a detailed evaluation of the phi-3.5-mini and phi-3.5-MoE models\ncompared with recent SoTA pretrained language models, such as GPT-4o-mini, Gemini-1.5 Flash, and\nopen-source models like Llama-3.1-8B and the Mistral models. The results show that phi-3.5-mini\nachieves performance comparable to much larger models like Mistral-Nemo-12B and Llama-3.1-8B, while\nphi-3.5-MoE significantly outperforms other open-source models, offers performance comparable to\nGemini-1.5 Flash, and achieves above 90% of the average performance of GPT-4o-mini across various\nlanguage benchmarks.\n\n\n\n\n
 ```
 
 The Figure 4 content has been interpreted and added into the extracted chunk to enhance the context for a RAG application. This is particularly powerful for applications where the documents are heavily image- or chart-based.
 
diff --git a/image_processing/images/Image Based RAG.png b/image_processing/images/Image Based RAG.png
new file mode 100644
index 0000000..9bf83f4
Binary files /dev/null and b/image_processing/images/Image Based RAG.png differ
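For intuition, the cleaning step referenced in the first hunk (step 3) can be pictured in a few lines of Python. The `clean_markdown` helper and its patterns below are illustrative assumptions for this note, not the repo's actual implementation:

```python
import re

def clean_markdown(markdown: str) -> str:
    """Hypothetical sketch of the Markdown cleaning step described above."""
    # Strip HTML comments left behind by the layout/extraction step.
    markdown = re.sub(r"<!--.*?-->", "", markdown, flags=re.DOTALL)
    # Drop figures flagged as non-relevant upstream (crudely matched here
    # by an "irrelevant" marker in the image alt text).
    markdown = re.sub(r"!\[irrelevant[^\]]*\]\([^)]*\)", "", markdown)
    # Collapse the blank-line runs that the removals leave behind.
    markdown = re.sub(r"\n{3,}", "\n\n", markdown)
    return markdown.strip()
```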
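The custom semantic chunker behaviour described in the second hunk (retain figures and tables in the same chunk, split when inter-sentence similarity falls below the threshold) can be outlined as follows. This is a minimal sketch: `embed` is a hypothetical sentence-embedding function and the 0.8 threshold is arbitrary, so the repo's chunker may differ in detail:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunk(sentences: list[str], embed, threshold: float = 0.8) -> list[str]:
    """Group sentences, starting a new chunk when similarity drops below threshold."""
    if not sentences:
        return []
    chunks: list[list[str]] = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        # Figures and tables stay with the text that references them,
        # so they never trigger a split on their own.
        is_figure_or_table = sentence.lstrip().startswith(("<figure", "<table"))
        if not is_figure_or_table and cosine(prev_vec, vec) < threshold:
            chunks.append([])
        chunks[-1].append(sentence)
        prev_vec = vec
    return [" ".join(chunk) for chunk in chunks]
```

Applied to the sample page above, this behaviour is what keeps Table 1 and the surrounding RepoQA discussion together in a single chunk rather than splitting at the table boundary.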
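At query time, the enriched index behaves like any other Azure AI Search index. Below is a minimal sketch of the retrieval leg of the RAG flow using the `azure-search-documents` SDK; the endpoint, key, index name, and the `chunk`/`chunk_vector` field names are placeholders, and the vector leg assumes the index was created with an integrated vectoriser:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

# Placeholder service details; substitute your own deployment.
client = SearchClient(
    endpoint="https://<search-service>.search.windows.net",
    index_name="<enriched-index>",
    credential=AzureKeyCredential("<api-key>"),
)

query = "How does Phi-3.5-MoE compare to Llama-3.1-8B on RepoQA?"
results = client.search(
    search_text=query,  # keyword leg of the hybrid query
    vector_queries=[
        # The query text is vectorised server-side and matched against
        # the (assumed) chunk embedding field.
        VectorizableTextQuery(text=query, k_nearest_neighbors=5, fields="chunk_vector")
    ],
    top=5,
)
context = "\n\n".join(result["chunk"] for result in results)
# `context` already includes the textual interpretation of any figures,
# so the downstream LLM prompt needs no separate image handling.
```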