Document Analysis and identification with RAG Vector Database and Mistral LLM

Overview

This project is a comprehensive document analysis system, designed to automate the processing and analysis of documents from acquisition to consumption. It integrates advanced machine learning and AI models like RAG (Retrieval Augmented Generation) Vector Databases and Mistral LLM to efficiently extract, match, enrich, and process document information. The pipeline covers:

Document Acquisition
Document OCR (Optical Character Recognition)
Document Preprocessing
Document Information and Title Extraction
Document Matching and Enrichment
Document Consumption and Uploading

Features

Document Acquisition: Automated gathering of documents from specified sources.
OCR Integration: Converts scanned images or PDFs into machine-readable text using Optical Character Recognition (OCR).
Preprocessing: Cleans and prepares documents for downstream analysis (removal of noise, handling of incomplete documents).
Information Extraction: Title and relevant document information extraction using RAG vector database and Mistral LLM.
Document Matching & Enrichment: Leverages AI models to match and enrich documents with metadata and other contextual information.
Consumption & Uploading: Final processed document output, ready for consumption, and automatically uploaded to a specified location.

Architecture

Document Acquisition:
- Connects to a variety of data sources (local, cloud storage, web scraping).
Document OCR:
- Uses popular OCR libraries (e.g., Tesseract) for converting images or scanned PDFs into editable text.
Preprocessing:
- Text cleaning, noise reduction, normalization, and segmentation.
Information Extraction:
- Uses RAG and Mistral LLM for extracting meaningful information like document title, summary, and other details.
Matching and Enrichment:
- Matches documents with existing records and enriches them with metadata using contextual AI models.
Document Uploading:
- Final step to upload or store the enriched documents in a specified repository or database.

Getting Started

Prerequisites

Python 3.x
Libraries: pandas, numpy, spacy, torch, transformers, tesseract, opencv, sentence-transformers
Access to a RAG-compatible Vector Database
Mistral LLM API Key

Installation

Clone the repository:

git clone https://github.com/username/Document-Analysis-Pipeline-LLM-RAG.git

Navigate to the project directory:
```
cd Document-Analysis-Pipeline-LLM-RAG
```
Install the required dependencies:
```
pip install -r requirements.txt
```

Usage

Configure Sources:
- Define the document sources for acquisition (local storage, cloud storage, etc.).

Run the Pipeline:

python main.py --source [path_to_documents] --output [output_path]

OCR & Preprocessing:
- The system will automatically run OCR on documents and clean the text.
Information Extraction:
- Extract document titles and key information using the integrated Mistral LLM and RAG models.
Enrichment & Matching:
- The processed documents will be enriched with additional metadata and matched to existing datasets if applicable.
Final Output:
- The final output is ready for upload or further analysis.

Configuration

Document Sources: Define paths to documents in config.json.
Vector Database: Set up the connection details for the RAG Vector Database in rag_config.yaml.
Mistral LLM: Provide API keys for Mistral LLM integration in ml_config.yaml.

Contributor

License

This project is licensed under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
Document_Classfication.py		Document_Classfication.py
Document_Extraction.py		Document_Extraction.py
Document_Image_Processing.py		Document_Image_Processing.py
Document_OCR.py		Document_OCR.py
Document_Text_Processing.py		Document_Text_Processing.py
Document_classification.zip		Document_classification.zip
README.md		README.md
__init__.py		__init__.py
test.py		test.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Document Analysis and identification with RAG Vector Database and Mistral LLM

Overview

Features

Architecture

Getting Started

Prerequisites

Installation

Usage

Configuration

Contributor

License

About

Uh oh!

Releases

Packages

Languages

AlinaBaber/Document-Analysis-Identification-with-RAG-Vector-Database-and-Mistral-LLM

Folders and files

Latest commit

History

Repository files navigation

Document Analysis and identification with RAG Vector Database and Mistral LLM

Overview

Features

Architecture

Getting Started

Prerequisites

Installation

Usage

Configuration

Contributor

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages