A smart application for full document extraction, reorganization, and Q&A interaction — powered by OpenAI and Streamlit.
- Upload Files: Supports
.pdf
,.txt
,.jpg
,.png
. - Smart Extraction:
- Attempts structured extraction via MarkItDown.
- Falls back to OCR (Tesseract) if necessary.
- Content Reorganization:
- Reorganizes messy extracted text into clean Markdown via OpenAI model (
gpt-4.1-mini
).
- Reorganizes messy extracted text into clean Markdown via OpenAI model (
- Interactive Q&A:
- Ask direct questions about the uploaded content.
- Receives precise and concise answers instantly.
- Download Reorganized Files:
- Save the cleaned-up Markdown as a
.txt
file.
- Save the cleaned-up Markdown as a
- User-Friendly Interface:
- Built with Streamlit for ease of use and beautiful layout.
- Sidebar contains author links and branding.
- PDF documents (
.pdf
) - Text files (
.txt
) - Image files (
.jpg
,.jpeg
,.png
,.bmp
,.tiff
)
- Upload a file.
- Start extraction and display the extracted text.
- Reorganize the text into structured Markdown.
- Ask questions about the content.
- Download the reorganized content.
Install the dependencies:
pip install streamlit openai tiktoken markitdown pytesseract pymupdf pillow
Note:
Ensure Tesseract OCR
is installed in your system for OCR extraction to work:
Ahmed Zeyad Tareq
- 🎓 Master's in Artificial Intelligence Engineering
- 📌 Data Scientist, AI Developer
- GitHub | LinkedIn | Kaggle
✨ Enjoy smart document interaction!