Building a PDF-driven RAG system with Weaviate

This repository contains materials for the online workshop "Building a PDF-driven RAG system with Weaviate".

You’ll learn how to:

- Extract and preprocess text and images from PDFs
- Chunk and embed document content
- Store and retrieve data using Weaviate
- Build Retrieval-Augmented Generation (RAG) pipelines that combine text and images
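
To give a feel for the chunking step, here is a minimal sketch of fixed-size chunking with overlap. It is only an illustration: the function name `chunk_text` and its `chunk_size` and `overlap` parameters are assumptions made for this example, and the notebooks may use a different strategy.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; not taken from the workshop notebooks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, as long as it is shorter than the overlap.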

The workshop is organized as a series of Jupyter notebooks.

The notebooks are numbered, so you can work through them in order.

Requirements: Python 3.10+, Weaviate, and Cohere and Anthropic API keys (for embeddings and LLMs).

Setup instructions

Set up your preferred Python environment
- For example, create and activate a virtual environment (optional but recommended), then install the dependencies:
```
python -m venv .venv
source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
pip install -r requirements.txt
```
- Or use uv, conda, or any other environment manager you prefer.
Set up the .env file
- Note: This step is only needed if ANTHROPIC_API_KEY and COHERE_API_KEY are not already set in your environment.
1. Copy .env.example to .env
2. Fill in ANTHROPIC_API_KEY and COHERE_API_KEY with your own key values.
3. In the live session, the instructor may provide temporary keys.
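
To confirm that both keys are visible before you start, a quick check along these lines may help. It is a sketch using only the standard library: the helper name `missing_keys` is an assumption made for this example, and it inspects the current process environment only, so values that exist solely in `.env` will show up only after something has loaded that file.

```python
import os

# The two variables named in the setup steps above.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "COHERE_API_KEY")


def missing_keys(env=None):
    """Return the required keys that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

Calling `missing_keys()` in a notebook cell returns an empty list when both keys are set, and otherwise lists the ones that still need a value.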
