This project implements a Retrieval-Augmented Generation (RAG) chatbot that intelligently answers questions from a PDF document containing both text and images. It uses local embedding models and the Groq API for fast, cost-free generation, with no paid APIs required.
- Extracts both text (using PyMuPDF) and image content from the PDF using OCR (Tesseract).
- Converts the PDF content into vector embeddings using `all-MiniLM-L6-v2`.
- Stores the embeddings in a FAISS vector database.
- Uses the Groq API (LLaMA3) for generating context-based answers.
- Runs locally in Visual Studio / VS Code using a Python virtual environment.
- Simple test script (`test.py`) to query the bot.
- Clean architecture for maintainability and extension.
- `.gitignore`: Files/folders ignored by Git
- `app.py`: Streamlit UI interface
- `CSR MODULES.pdf`: Input PDF file (text + image content)
- `images/`: Auto-extracted images from the PDF for OCR
- `ingest_pdf.py`: PDF parser, OCR, and embedding logic
- `rag_chatbot.py`: Core RAG chatbot pipeline
- `requirements.txt`: Python dependencies
- `test.py`: CLI script to interact with the chatbot
- `vectorstore/`: Local FAISS vector database
- `README.md`: You're reading it now!
- General RAG Workflow
- Implemented RAG Chatbot Architecture
1. Change into the cloned repo: `cd RAG-Based-Chatbot-for-Smart-Customer-Support-Documents`
2. Create and activate a virtual environment (to keep dependencies isolated from your existing Python installs), then install the requirements:
   - `python -m venv venv`
   - `.\venv\Scripts\activate` (for Windows)
   - `pip install -r requirements.txt`
3. Create a `.env` file and store your (free) Groq API key in the form: `GROQ_API_KEY=your_groq_api_key`
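At startup the app reads this key from the environment; projects like this typically do so via python-dotenv's `load_dotenv()`. As a rough illustration of what that loading amounts to, here is a minimal stand-in parser (`parse_env_line` and `load_env` are hypothetical helper names, not part of this repo):

```python
import os

def parse_env_line(line: str):
    """Parse one KEY=VALUE line from a .env file; return None for blanks/comments."""
    line = line.strip()
    if not line or line.startswith("#") or "=" not in line:
        return None
    key, _, value = line.partition("=")
    return key.strip(), value.strip()

def load_env(path: str = ".env") -> None:
    """Load every KEY=VALUE pair from a .env file into os.environ
    (a simplified stand-in for python-dotenv's load_dotenv())."""
    with open(path) as f:
        for raw in f:
            pair = parse_env_line(raw)
            if pair:
                os.environ.setdefault(*pair)
```

In the real project, `os.environ["GROQ_API_KEY"]` (or `os.getenv`) is then passed to the Groq client.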
4. You can either use the built-in PDF or upload your own company documents as a PDF and chat with the bot.
5. Run the following to process your PDF and store the vector embeddings: `python ingest_pdf.py`

   After a successful run, it will:
   - Extract the text.
   - Extract images and run OCR using Tesseract.
   - Embed the content using Sentence Transformers.
   - Store the embeddings in the FAISS DB (`vectorstore/`).
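Before embedding, the extracted text is typically split into overlapping chunks so that each vector covers a bounded span of context. The repo's actual splitter isn't shown here; this is a minimal character-based sketch in the spirit of LangChain's text splitters (`chunk_text` is a hypothetical helper name):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50):
    """Split text into overlapping character chunks for embedding.

    Each chunk is at most `chunk_size` characters, and consecutive chunks
    share `overlap` characters so sentences cut at a boundary still appear
    whole in at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Each chunk would then be embedded with the Sentence Transformers model and written to the FAISS index.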
6. Use the `test.py` script to ask questions: `python test.py`
7. After that completes successfully, launch the UI with: `streamlit run app.py`
Q: What are the key objectives of CSR?
A: [Answer from context]
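Under the hood, an answer like this is produced by stuffing the retrieved chunks and the question into a single prompt for the LLM. A minimal sketch of that prompt assembly (`build_prompt` is an illustrative name, not the repo's exact code):

```python
def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble retrieved chunks and the user question into one LLM prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting string is what gets sent to the Groq chat completion endpoint; grounding the model in retrieved context is what keeps answers specific to the uploaded document.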
- Embeddings: `all-MiniLM-L6-v2` from Sentence Transformers (free, local).
- LLM: `llama3-8b-8192` via the Groq API (free, fast inference).
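Retrieval works by comparing the embedded question against the stored chunk embeddings; FAISS does this efficiently at scale. As a plain-Python stand-in for that similarity search (function names are illustrative, not FAISS's API):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k stored chunks most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

The top-ranked chunks become the context passed to the LLM.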
## Key Learnings
- Used LangChain (0.3.27) with new modular packages such as `langchain-community`, `langchain-core`, and `langchain-huggingface`.
- Enabled FAISS with `allow_dangerous_deserialization=True` for local use.
- Set up OCR using pytesseract + pdf2image, with the Tesseract executable path configured explicitly.
- Dealt with deprecated imports and updated LangChain compatibility manually.
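On Windows, pytesseract needs to be told where the Tesseract binary lives. A typical configuration fragment (the path below is the default Windows install location and is an assumption; adjust it to your machine):

```python
import pytesseract

# Point pytesseract at the Tesseract executable (default Windows install
# path shown; change this if Tesseract is installed elsewhere).
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
```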
- Expand to real-time document search (external knowledge integration).
## Author

- Balahariharasudhan T
## License
- This project is for educational and research purposes only.