LLM-PDF-Parser is a FastAPI-based application that extracts text from PDFs and images and uses NuExtract LLM to extract specific fields based on a given JSON template.

📄 LLM-PDF-Parser

LLM-PDF-Parser is a FastAPI-based application that extracts text from PDFs and images, then uses an LLM served by Ollama (NuExtract in the setup below) to pull out specific fields according to a JSON template you provide. 🚀

✨ Features

  • 📝 Extract text from PDFs and images (JPG, PNG, JPEG) using PyMuPDF and EasyOCR.
  • 🤖 Leverage AI to extract structured data based on a provided JSON template.
  • FastAPI backend for quick and easy integration.
  • 🔥 Supports OCR when text extraction from PDFs is insufficient.
  • 🔄 Cross-Origin Resource Sharing (CORS) enabled for flexible frontend integration.
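The OCR fallback in the feature list can be pictured as a simple length check on the text embedded in the PDF. This is a minimal sketch and not the project's actual code: the function name `text_or_ocr`, the `run_ocr` callback, and the 50-character threshold are all assumptions chosen for illustration.

```python
from typing import Callable


def text_or_ocr(
    embedded_text: str,
    run_ocr: Callable[[], str],
    min_chars: int = 50,  # hypothetical threshold, not taken from the project
) -> str:
    """Return the embedded text if it looks sufficient, otherwise fall back to OCR.

    `run_ocr` is only called when needed, because OCR is the slow path.
    """
    cleaned = embedded_text.strip()
    if len(cleaned) >= min_chars:
        return cleaned
    return run_ocr().strip()
```

In a real pipeline, `embedded_text` would come from PyMuPDF and `run_ocr` would wrap an EasyOCR call; passing OCR as a callback keeps the decision logic easy to test on its own.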

🚀 Installation & Setup

1️⃣ Clone the Repository

git clone https://github.com/RiccardoTOTI/LLM-PDF-Extractor.git
cd LLM-PDF-Extractor

2️⃣ Install Dependencies

pip install -r requirements.txt

3️⃣ Set Up Environment Variables

Create a .env file and configure the following variables (or set them in your environment):

OLLAMA_URL=http://localhost:11434
OLLAMA_MODEL=iodose/nuextract-v1.5
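As a rough illustration of how these two variables might be consumed, the sketch below reads them from the process environment and falls back to the values shown above. The `Settings` class and `load_settings` function are hypothetical names, and the sketch assumes the variables are already exported; loading a .env file would need an extra step (for example a dotenv loader), which is not shown.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    ollama_url: str
    ollama_model: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read OLLAMA_URL and OLLAMA_MODEL, using the README values as defaults."""
    env = os.environ if env is None else env
    return Settings(
        # Strip a trailing slash so paths can be appended safely later.
        ollama_url=env.get("OLLAMA_URL", "http://localhost:11434").rstrip("/"),
        ollama_model=env.get("OLLAMA_MODEL", "iodose/nuextract-v1.5"),
    )
```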

4️⃣ Run the Application

uvicorn main:app --host 0.0.0.0 --port 8000 --reload

5️⃣ Start the Ollama Server

ollama run iodose/nuextract-v1.5

🐳 Docker Support

You can also run the application using Docker.

🏗 Build and Run with Docker

  1. Build the Docker image:
    docker build -t llm-pdf-parser .
  2. Run the container:
    docker run -p 8000:8000 llm-pdf-parser

🔄 Using Docker Compose

You can use Docker Compose to spin up both the application and Ollama:

  1. Run the services:
    docker-compose up -d
  2. Download the model inside the Ollama container, using the script in the tools directory:
    docker exec -it ollama .tools/download_model.sh iodose/nuextract-v1.5

🔥 API Usage

📥 Upload a File & Extract Data

Endpoint: POST /extract

Request:

curl -X 'POST' \
  'http://localhost:8000/extract' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@yourfile.pdf' \
  -F 'fields={"Patient":{"First Name":"","Last Name":"","Tax Code":"","Doctor":[]}}'
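The same request can be sent from Python. The sketch below uses only the standard library and builds the multipart body by hand, with the form field names `file` and `fields` taken from the curl command above. The helper names `build_multipart` and `extract` are hypothetical, and the sketch assumes the server is reachable at http://localhost:8000.

```python
import json
import mimetypes
import urllib.request
import uuid
from typing import Tuple


def build_multipart(file_name: str, file_bytes: bytes, template: dict) -> Tuple[bytes, str]:
    """Encode the upload and the JSON template as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    # Guess the MIME type from the extension, as curl does for common types.
    mime = mimetypes.guess_type(file_name)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = (
        f"\r\n--{boundary}\r\n"
        'Content-Disposition: form-data; name="fields"\r\n\r\n'
        f"{json.dumps(template)}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    return head + file_bytes + tail, f"multipart/form-data; boundary={boundary}"


def extract(path: str, template: dict, url: str = "http://localhost:8000/extract") -> dict:
    """POST a local file to /extract and return the decoded JSON response."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(path.split("/")[-1], fh.read(), template)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `extract("yourfile.pdf", {"Patient": {"First Name": "", "Last Name": "", "Tax Code": "", "Doctor": []}})` mirrors the curl example. A third-party HTTP client would make this shorter; the standard library is used here only to keep the sketch dependency-free.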

Response example (list-valued fields such as "Doctor" are supported):

{
  "extracted_data": {
    "Patient": {
      "First Name": "John",
      "Last Name": "Doe",
      "Tax Code": "ABC123XYZ",
      "Doctor": [
        "Dr. Smith",
        "Dr. Bean"
      ]
    }
  }
}
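Because the response is meant to mirror the template, a client may want to confirm that every requested key came back. The sketch below is one possible check, not something the project provides: `missing_fields` is a hypothetical helper that walks the template and reports dotted paths absent from the returned data.

```python
from typing import Any, List


def missing_fields(template: dict, data: Any, path: str = "") -> List[str]:
    """List template keys (as dotted paths) that are absent from `data`."""
    missing: List[str] = []
    for key, expected in template.items():
        here = f"{path}.{key}" if path else key
        if not isinstance(data, dict) or key not in data:
            missing.append(here)
        elif isinstance(expected, dict):
            # Recurse into nested objects such as "Patient".
            missing.extend(missing_fields(expected, data[key], here))
    return missing
```

Applied to the example above, `missing_fields(template, response["extracted_data"])` returns an empty list when all four Patient fields are present.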

🏗 How It Works

  1. You upload a PDF or image to the FastAPI endpoint.
  2. The app extracts text with PyMuPDF, or with EasyOCR for scanned documents and images.
  3. The extracted text is sent to the model running in Ollama, which structures it according to the provided JSON template.
  4. The structured data is returned as a JSON response. ✅
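Step 3 can be sketched as a single call to Ollama's `/api/generate` endpoint with streaming disabled. This is an assumption-laden illustration: the prompt wording, the helper names `build_generate_payload` and `call_ollama`, and the choice of `/api/generate` are not taken from the project, which may use a different endpoint or the model's own prompt format.

```python
import json
import urllib.request


def build_generate_payload(model: str, template: dict, text: str) -> dict:
    """Build a non-streaming /api/generate request body.

    The prompt layout below is hypothetical; the real app may format it differently.
    """
    prompt = (
        "Fill in the JSON template using only information from the text. "
        "Return JSON only.\n\n"
        f"Template:\n{json.dumps(template, indent=2)}\n\n"
        f"Text:\n{text}\n"
    )
    return {"model": model, "prompt": prompt, "stream": False}


def call_ollama(base_url: str, payload: dict) -> dict:
    """Send the payload to Ollama and parse the model output as JSON."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # With "stream": False, the generated text arrives in the "response" field.
    # json.loads raises ValueError if the model did not return valid JSON.
    return json.loads(body["response"])
```

With the environment values from the setup section, a call would look like `call_ollama("http://localhost:11434", build_generate_payload("iodose/nuextract-v1.5", template, text))`. Callers should be prepared for the model to return text that is not valid JSON.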

🛠 Technologies Used

  • Python 🐍
  • FastAPI
  • PyMuPDF 📄
  • EasyOCR 🔍
  • Ollama LLM 🤖
  • Uvicorn 🚀

🏆 Contributing

Contributions are welcome! Feel free to submit issues or open a pull request. 😊

📜 License

This project is licensed under the Apache 2.0 License.


💡 Have suggestions or need help? Open an issue or reach out! 🚀
