🧠 The Enigmatic Research of Dr. X — NLP Pipeline (Local LLMs)

This project is a full-featured NLP pipeline designed to analyze the mysterious research documents left behind by Dr. X, a fictional scientist who vanished under mysterious circumstances. The goal is to extract, summarize, understand, and translate his research using local, offline NLP tools — no internet or cloud APIs required.

🚀 Features

✅ Multi-format file ingestion (.pdf, .docx, .csv, .xlsx, .xls, .xlsm)
✅ Token-based chunking with metadata (filename, page, chunk number)
✅ Local vector search using ChromaDB
✅ RAG Q&A system powered by local LLaMA (via Ollama)
✅ Automatic translation of English answers to Arabic
✅ Local summarization of full documents
✅ ROUGE metric evaluation
✅ Performance logging (tokens/sec for all major components)
✅ Fully modular & offline

🧱 Architecture

├── file_reader.py # Extracts text & tables from all formats
├── chunker.py # Tokenizes and chunks text with cl100k_base
├── embedding_pipeline.py # Embeds chunks and stores in ChromaDB
├── rag_qa_system.py # Runs Q&A retrieval + local LLaMA generation
├── translation_utils.py # Translates answers to Arabic (offline)
├── summarizer.py # Summarizes files + evaluates with ROUGE
└── requirements.txt # All dependencies
📁 files/
└── All input files (.pdf, .docx, .csv, etc.)

🧠 Tech Stack

Component	Tool/Library
LLM (local)	`Ollama` (e.g. `llama2`, `tinyllama`)
Embedding	`sentence-transformers` (`MiniLM`)
Vector DB	`ChromaDB (PersistentClient)`
Translation	`argos-translate` (EN ➝ AR)
Summarization	`Falconsai/text_summarization`
Metrics	`tiktoken`, `rouge-score`, `time`

💡 How It Works

Extract text + tables from PDFs, Word, and Excel files.
Chunk the text based on tokens (cl100k_base).
Embed chunks using MiniLM and store in a local ChromaDB.
Ask Questions via a CLI — the system retrieves relevant chunks and generates an answer using LLaMA.
Translate the answer into Arabic.
Summarize full documents and measure summary quality with ROUGE.

🧪 Example: CLI Output

❓ Ask a question about Dr. X's documents:
> What was his last known research?

💬 English Answer:
Dr. X’s final study focused on zero-point energy manipulation using ancient resonance systems.

🗣️ Arabic Translation:
ركزت الدراسة الأخيرة للدكتور إكس على التلاعب بطاقة النقطة الصفرية باستخدام أنظمة الرنين القديمة.

📊 Performance Metrics

Task	Tokens	Time	TPS
Embedding	1,200	1.8 sec	~666 TPS
RAG Generation	620	1.2 sec	~516 TPS
Summarization	1,500	3.0 sec	~500 TPS

Supported Formats

✅ PDF (.pdf) \
✅ Word (.docx) \
✅ Excel (.xlsx, .xls, .xlsm) \
✅ CSV (.csv) \
✅ Multi-sheet support with pandas \

🛠️ Setup Instructions

Install Requirements

pip install -r requirements.txt

Setup Ollama

install Ollama: https://ollama.com/download

ollama pull tinyllama

Run Embedding

python embedding_pipeline.py

Ask Questions (RAG + Arabic)

python rag_qa_system.py

Summarize a Document

python summarizer.py

✅ Evaluation Criteria Coverage

✅ Executes correctly across all modules
✅ Efficient + logs tokens/sec
✅ Translates and summarizes with high fluency
✅ Handles all required file formats
✅ Uses appropriate local LLMs and vector DB
✅ Clean code, modular design, creative solution

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🧠 The Enigmatic Research of Dr. X — NLP Pipeline (Local LLMs)

🚀 Features

🧱 Architecture

🧠 Tech Stack

💡 How It Works

🧪 Example: CLI Output

📊 Performance Metrics

Supported Formats

🛠️ Setup Instructions

Install Requirements

Setup Ollama

Run Embedding

Ask Questions (RAG + Arabic)

Summarize a Document

✅ Evaluation Criteria Coverage

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
files		files
README.md		README.md
chunker.py		chunker.py
embedding_pipeline.py		embedding_pipeline.py
file_reader.py		file_reader.py
rag_qa_system.py		rag_qa_system.py
report.pdf		report.pdf
requirements.txt		requirements.txt
summarizer.py		summarizer.py
translation_utils.py		translation_utils.py

ksm26/dr-x-nlp-pipeline

Folders and files

Latest commit

History

Repository files navigation

🧠 The Enigmatic Research of Dr. X — NLP Pipeline (Local LLMs)

🚀 Features

🧱 Architecture

🧠 Tech Stack

💡 How It Works

🧪 Example: CLI Output

📊 Performance Metrics

Supported Formats

🛠️ Setup Instructions

Install Requirements

Setup Ollama

Run Embedding

Ask Questions (RAG + Arabic)

Summarize a Document

✅ Evaluation Criteria Coverage

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages