This project is a document analysis system that automates the processing of documents from acquisition to consumption. It combines Retrieval-Augmented Generation (RAG) over a vector database with the Mistral LLM to extract, match, enrich, and process document information. The pipeline covers:
- Document Acquisition
- Document OCR (Optical Character Recognition)
- Document Preprocessing
- Document Information and Title Extraction
- Document Matching and Enrichment
- Document Consumption and Uploading
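
The stages listed above run one after another, each consuming the previous stage's output. A minimal sketch of that chaining follows. The Document container, the run_pipeline helper, and the three placeholder stages are hypothetical names chosen for illustration; they are not taken from the project's code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical container passed from stage to stage."""
    source: str
    text: str = ""
    metadata: Dict[str, str] = field(default_factory=dict)


# A stage is any function that takes a Document and returns a Document.
Stage = Callable[[Document], Document]


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order, feeding its output to the next."""
    for stage in stages:
        doc = stage(doc)
    return doc


# Placeholder stages: real ones would call OCR, the vector database, and the LLM.
def fake_ocr(doc: Document) -> Document:
    doc.text = "  INVOICE 2024 \n total due: 100 "
    return doc


def preprocess(doc: Document) -> Document:
    # Collapse all whitespace runs into single spaces.
    doc.text = " ".join(doc.text.split())
    return doc


def extract_title(doc: Document) -> Document:
    # Stand-in for the RAG + LLM step: take the first two words as the title.
    doc.metadata["title"] = " ".join(doc.text.split()[:2])
    return doc


result = run_pipeline(Document(source="scan.png"), [fake_ocr, preprocess, extract_title])
print(result.metadata["title"])
```

Keeping every stage behind the same function signature makes it straightforward to swap a placeholder for a real implementation or to test a stage in isolation.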
Key features:
- Document Acquisition: Automated gathering of documents from specified sources.
- OCR Integration: Converts scanned images or PDFs into machine-readable text using Optical Character Recognition (OCR).
- Preprocessing: Cleans and prepares documents for downstream analysis (noise removal, handling of incomplete documents).
- Information Extraction: Extracts the title and other relevant document information using a RAG vector database and the Mistral LLM.
- Document Matching & Enrichment: Uses AI models to match documents and enrich them with metadata and other contextual information.
- Consumption & Uploading: Produces the final processed documents, ready for consumption, and automatically uploads them to a specified location.
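
One common way to match a document against existing records is nearest-neighbor search over embedding vectors, which is the operation a vector database performs at scale. The sketch below shows the idea with toy three-dimensional vectors. The record IDs, the vectors, and the 0.8 threshold are made up for illustration, and whether the project matches documents this way is an assumption.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(
    query: List[float],
    records: Dict[str, List[float]],
    threshold: float = 0.8,  # hypothetical cut-off, tune per dataset
) -> Optional[Tuple[str, float]]:
    """Return (record_id, score) for the most similar record, or None if below threshold."""
    best: Optional[Tuple[str, float]] = None
    for record_id, vector in records.items():
        score = cosine_similarity(query, vector)
        if best is None or score > best[1]:
            best = (record_id, score)
    if best is not None and best[1] >= threshold:
        return best
    return None


# Toy embeddings; real ones would come from an embedding model.
records = {
    "invoice-001": [1.0, 0.0, 0.0],
    "contract-007": [0.0, 1.0, 0.0],
}
match = best_match([0.9, 0.1, 0.0], records)
print(match)
```

Returning None below the threshold lets the caller treat the document as new instead of forcing a weak match.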
Pipeline stages:
- Document Acquisition:
  - Connects to a variety of data sources (local storage, cloud storage, web scraping).
- Document OCR:
  - Uses an OCR engine (e.g., Tesseract) to convert images or scanned PDFs into editable text.
- Preprocessing:
  - Text cleaning, noise reduction, normalization, and segmentation.
- Information Extraction:
  - Uses RAG and the Mistral LLM to extract meaningful information such as the document title, summary, and other details.
- Matching and Enrichment:
  - Matches documents with existing records and enriches them with metadata using contextual AI models.
- Document Uploading:
  - Final step: uploads or stores the enriched documents in a specified repository or database.
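
The preprocessing stage above (cleaning, noise reduction, normalization, segmentation) can be sketched with the standard library alone. This is an illustrative example, not the project's implementation: the function names clean_text and segment_paragraphs are hypothetical, and the specific rules (NFKC normalization, rejoining words hyphenated across line breaks, splitting on blank lines) are assumptions about what suits OCR output.

```python
import re
import unicodedata
from typing import List


def clean_text(raw: str) -> str:
    """Normalize OCR output: Unicode forms, whitespace, control characters, line breaks."""
    # NFKC folds ligatures and width variants into plain characters.
    text = unicodedata.normalize("NFKC", raw)
    # Collapse runs of whitespace other than newlines (spaces, tabs, form feeds).
    text = re.sub(r"[^\S\n]+", " ", text)
    # Drop remaining control and format characters, keeping newlines.
    text = "".join(
        ch for ch in text if ch == "\n" or unicodedata.category(ch)[0] != "C"
    )
    # Trim each line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Rejoin words hyphenated across a line break. Caveat: this also joins
    # genuine compounds such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Reduce three or more newlines to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def segment_paragraphs(text: str) -> List[str]:
    """Split cleaned text into paragraphs on blank lines."""
    blocks = re.split(r"\n\s*\n", text)
    return [" ".join(block.split()) for block in blocks if block.strip()]


raw = "Annual  Re-\nport\x0c 2023\n\n\n\nTotal\trevenue:  \ufb01ve million\n"
cleaned = clean_text(raw)
paragraphs = segment_paragraphs(cleaned)
print(paragraphs)
```

Whitespace is collapsed before control characters are removed because tabs and form feeds are themselves control characters; removing them first would fuse the words they separate.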
Requirements:
- Python 3.x
- Libraries: pandas, numpy, spacy, torch, transformers, tesseract, opencv, sentence-transformers
- Access to a RAG-compatible Vector Database
- Mistral LLM API Key
Installation:
- Clone the repository:
      git clone https://github.com/username/Document-Analysis-Pipeline-LLM-RAG.git
- Navigate to the project directory:
      cd Document-Analysis-Pipeline-LLM-RAG
- Install the required dependencies:
      pip install -r requirements.txt
Usage:
- Configure Sources:
  - Define the document sources for acquisition (local storage, cloud storage, etc.).
- Run the Pipeline:
      python main.py --source [path_to_documents] --output [output_path]
- OCR & Preprocessing:
  - The system automatically runs OCR on the documents and cleans the text.
- Information Extraction:
  - Document titles and key information are extracted using the integrated Mistral LLM with RAG.
- Enrichment & Matching:
  - The processed documents are enriched with additional metadata and matched to existing datasets where applicable.
- Final Output:
  - The final output is ready for upload or further analysis.
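
The main.py invocation in the steps above takes a --source and an --output argument. A minimal sketch of how such an entry point might parse them with argparse follows. Only the two flag names come from this README; treating both as required, converting them to Path objects, and the helper names build_parser and parse_args are assumptions.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Create the command-line parser for the two flags named in the README."""
    parser = argparse.ArgumentParser(description="Run the document analysis pipeline.")
    parser.add_argument(
        "--source", type=Path, required=True,
        help="File or directory containing the input documents.",
    )
    parser.add_argument(
        "--output", type=Path, required=True,
        help="Directory where processed documents are written.",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)


# Example with an explicit argument list instead of the real command line.
args = parse_args(["--source", "docs/incoming", "--output", "out"])
print(args.source, args.output)
```

Accepting an explicit argument list keeps the parser easy to exercise in tests without touching sys.argv.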
Configuration:
- Document Sources: Define paths to documents in config.json.
- Vector Database: Set up the connection details for the RAG Vector Database in rag_config.yaml.
- Mistral LLM: Provide API keys for Mistral LLM integration in ml_config.yaml.
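
As an illustration of the first configuration item, the sketch below reads document source paths from a JSON file. The schema is not documented here, so the "sources" key and the load_sources helper are hypothetical; adjust them to whatever config.json actually contains. The two YAML files would need a YAML parser such as PyYAML and are left out to keep the example standard-library only.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def load_sources(config_path: Path) -> List[Path]:
    """Read document source paths from a JSON config file.

    Assumes a top-level "sources" list of path strings (hypothetical schema).
    """
    with config_path.open(encoding="utf-8") as fh:
        config = json.load(fh)
    sources = config.get("sources")
    if not isinstance(sources, list) or not sources:
        raise ValueError(f"{config_path}: expected a non-empty 'sources' list")
    return [Path(s) for s in sources]


# Demo with a throwaway config file; a real run would read the project's config.json.
with tempfile.TemporaryDirectory() as tmp:
    demo_config = Path(tmp) / "config.json"
    demo_config.write_text(
        json.dumps({"sources": ["./docs/incoming", "./docs/archive"]}),
        encoding="utf-8",
    )
    sources = load_sources(demo_config)

print(sources)
```

Failing early with a clear message when the key is missing avoids a less obvious error later in the acquisition stage.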
This project is licensed under the MIT License.