A Retrieval-Augmented Generation platform for financial document intelligence. FinDoc RAG ingests SEC 10-K filings, chunks and embeds them into a vector store, and exposes a REST API that answers natural language questions with cited sources.
The system is composed of three independently deployable services connected by Apache Kafka and a shared PostgreSQL database with the pgvector extension:
- Ingestion Service -- Fetches 10-K filings from the SEC EDGAR full-text search API and publishes raw filing text to Kafka.
- Embedding Worker -- Consumes raw filings from Kafka, splits them into section-aware chunks, generates dense vector embeddings, and stores the results in pgvector. Chunking is hierarchical: each filing is split by section (Item 1A, Item 7, etc.), then by paragraph, then into 512-token windows with a 64-token overlap. Each chunk retains the section name, ticker, filing date, and a deterministic SHA-256 chunk ID.
- Query API -- Accepts natural language questions, embeds the query, retrieves candidate chunks by cosine distance, applies Maximal Marginal Relevance (MMR) reranking to balance relevance and diversity, constructs a grounded prompt, and returns an LLM-generated answer with source citations.
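The final chunking stage and the deterministic chunk ID described for the Embedding Worker can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the use of a pre-tokenized list, and the exact fields hashed into the ID are assumptions. Only the 512/64 window sizes and the use of SHA-256 come from the description above.

```python
import hashlib


def window_tokens(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` tokens that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # each window starts `step` tokens after the previous one
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return windows


def chunk_id(ticker, filing_date, section, index, text):
    """Hypothetical ID scheme: SHA-256 over the chunk's metadata and text.

    The same inputs always produce the same hex digest, so re-processing a
    filing yields identical IDs.
    """
    key = "\x1f".join([ticker, filing_date, section, str(index), text])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

With 1,000 tokens this produces three windows starting at tokens 0, 448, and 896; the last 64 tokens of each window are repeated at the start of the next.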
SEC EDGAR --> Ingestion --> Kafka --> Embedding Worker --> PostgreSQL + pgvector
                                                        ^
API Client <--> Query API <--> Ollama / OpenAI / Claude |
                     |                                  |
                     +----------------------------------+
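The MMR reranking step that the Query API applies to retrieved candidates can be illustrated with a small pure-Python sketch. The function names, the `lambda_mult` parameter, and its default are assumptions made for illustration; the project's own implementation and weighting are not shown in this README. Cosine distance, as used for retrieval, is one minus the cosine similarity computed here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, candidates, top_k=5, lambda_mult=0.5):
    """Greedy Maximal Marginal Relevance.

    candidates: list of (chunk_id, vector) pairs.
    lambda_mult: 1.0 ranks purely by relevance; lower values penalize
    candidates that resemble chunks already selected.
    Returns the selected chunk IDs in selection order.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < top_k:
        def score(item):
            _, vec = item
            relevance = cosine_similarity(query_vec, vec)
            # Similarity to the closest already-selected chunk (0.0 if none yet).
            redundancy = max(
                (cosine_similarity(vec, s_vec) for _, s_vec in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [cid for cid, _ in selected]
```

A low `lambda_mult` favors diverse sources across sections; a value of 1.0 reduces to plain similarity ranking.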
- Docker and Docker Compose (v2+)
- Python 3.12+ (for local development and running tests outside of Docker)
- GNU Make (optional, for convenience targets)
Hardware (Ollama / local LLM path only): make run pulls mistral:7b (~4 GB download) and requires at least 8 GB of free RAM at runtime (model weights + Docker overhead). On machines with less RAM the Ollama container will be OOM-killed. The remote-LLM path (make run-remote with LLM_BACKEND=claude or openai) has no GPU or RAM requirements beyond the base stack (~2 GB).
- Clone the repository and copy the environment template:
git clone https://github.com/drag0sd0g/FinDocDRAG.git
cd FinDocDRAG
cp .env.example .env
- Review .env and set EDGAR_USER_AGENT to a value that identifies you (the SEC requires a name and contact email):
EDGAR_USER_AGENT=YourName your.email@example.com
- Start the full stack:
# With Ollama (local LLM, no API key needed):
make run
# or: docker compose --profile local-llm up --build
# With a remote LLM (Claude or OpenAI, no GPU/RAM required):
export LLM_BACKEND=claude # or: export LLM_BACKEND=openai
export ANTHROPIC_API_KEY=sk-ant-… # or: export OPENAI_API_KEY=sk-…
make run-remote
# or: docker compose up --build
On the first run this will pull container images, build the three services, and run database migrations. The schema is applied automatically from db/migrations/001_initial_schema.sql (creates the pgvector extension, ingestion_log and document_chunks tables, and the HNSW vector index). To run migrations manually at any time: make migrate.
If using Ollama (make run), the mistral:7b model (~4 GB) is downloaded automatically on first start. Subsequent starts reuse cached layers and volumes.
- Verify the services are running:
curl http://localhost:8001/health # Ingestion Service
curl http://localhost:8002/health # Embedding Worker
curl http://localhost:8000/health # Query API
Trigger ingestion for specific tickers:
curl -X POST http://localhost:8001/v1/ingest \
-H "Content-Type: application/json" \
-d '{"tickers": ["AAPL", "MSFT"]}'
If no body is provided, the service reads from config/tickers.yml.
curl -X POST http://localhost:8000/v1/query \
-H "Content-Type: application/json" \
-H "X-API-Key: dev-key-1" \
-d '{
  "question": "What were Apple'\''s main risk factors related to supply chain in their 2024 10-K?",
  "ticker_filter": "AAPL",
  "top_k": 5
}'
top_k is optional (default 5, range 1–20). Include an X-Request-ID header to correlate requests across log entries; if omitted, one is generated automatically and returned as request_id in the response.
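For callers that prefer Python over curl, the same request can be assembled with the standard library. This is a sketch under assumptions: the helper name is hypothetical, the client-side `top_k` check simply mirrors the documented 1–20 range, and treating `ticker_filter` as omittable is a guess that the API schema may not share.

```python
import json
import urllib.request

QUERY_URL = "http://localhost:8000/v1/query"


def build_query_request(question, api_key, ticker_filter=None, top_k=5, request_id=None):
    """Build (but do not send) a POST request for the /v1/query endpoint."""
    if not 1 <= top_k <= 20:
        raise ValueError("top_k must be between 1 and 20")
    payload = {"question": question, "top_k": top_k}
    if ticker_filter is not None:  # assumption: the filter may be left out
        payload["ticker_filter"] = ticker_filter
    headers = {"Content-Type": "application/json", "X-API-Key": api_key}
    if request_id is not None:
        # Optional correlation ID; the service generates one when it is absent.
        headers["X-Request-ID"] = request_id
    return urllib.request.Request(
        QUERY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

To send it against a running stack, pass the result to `urllib.request.urlopen(...)` and decode the JSON body.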
The response shape:
{
  "answer": "Apple's 2024 10-K identifies supply chain concentration as a key risk...",
  "sources": [
    {
      "chunk_id": "a1b2c3d4e5f6...",
      "ticker": "AAPL",
      "filing_date": "2024-11-01",
      "section": "Item 1A - Risk Factors",
      "relevance_score": 0.87,
      "text_preview": "The Company's operations and performance depend significantly on..."
    }
  ],
  "model": "mistral:7b",
  "timing": {
    "embedding_ms": 12.4,
    "retrieval_ms": 38.1,
    "generation_ms": 4821.0,
    "total_ms": 4871.5
  },
  "degraded": false,
  "request_id": "550e8400-e29b-41d4-a716-446655440000"
}
When the LLM is unavailable, answer is null and degraded is true (HTTP 200; see Graceful Degradation).
curl http://localhost:8000/v1/documents \
  -H "X-API-Key: dev-key-1"
Supports optional query parameters ticker, limit, and offset.
FastAPI generates OpenAPI 3.0 documentation automatically. All three services expose it while running:
| Service | Swagger UI | ReDoc | JSON spec |
|---|---|---|---|
| Query API | http://localhost:8000/docs | http://localhost:8000/redoc | http://localhost:8000/openapi.json |
| Ingestion Service | http://localhost:8001/docs | http://localhost:8001/redoc | http://localhost:8001/openapi.json |
| Embedding Worker | http://localhost:8002/docs | http://localhost:8002/redoc | http://localhost:8002/openapi.json |
To export a static spec without running the stack:
cd services/query-api
PYTHONPATH=. python -c "import json; from src.main import app; print(json.dumps(app.openapi(), indent=2))"
findoc-rag/
  config/tickers.yml                Tickers to ingest (FR-2)
  db/migrations/                    PostgreSQL schema (pgvector)
  services/
    ingestion/                      SEC EDGAR fetcher + Kafka producer
    embedding-worker/               Chunker + sentence-transformers + pgvector writer
    query-api/                      RAG pipeline: retriever, prompt builder, LLM backends
  eval/                             Evaluation harness (ragas)
  helm/findoc-rag/                  Kubernetes Helm chart
  monitoring/                       Prometheus config + Grafana dashboards
  docs/
    technical-design-document.md    Full TDD (architecture, requirements, decisions)
    design-decisions.md             ADRs and trade-off analysis
    evaluation-results.md           RAG quality metrics
  docker-compose.yml                Local full-stack orchestration
  Makefile                          Development workflow targets
make setup # Creates venvs and installs dependencies for all services
make test # Runs pytest with coverage for all three services
make helm-test # Runs the Helm unit tests
Tests mock all external dependencies (Kafka, PostgreSQL, EDGAR, Ollama) and run without any infrastructure.
make lint # Runs ruff (linter) and mypy (type checker)
| Target | Description |
|---|---|
| make setup | Create virtual environments, install dependencies |
| make test | Run pytest for all services |
| make lint | Run ruff and mypy |
| make run | Start full stack including Ollama (--profile local-llm) |
| make run-remote | Start stack without Ollama (for LLM_BACKEND=claude or openai) |
| make stop | Stop Docker Compose stack |
| make clean | Stop stack and remove all volumes |
| make migrate | Run database migrations |
| make docker-build | Build Docker images only |
| make eval | Run the RAG evaluation harness |
| make helm-deploy | Deploy to Kubernetes via Helm |
| make helm-test | Run Helm unit tests (requires helm-unittest plugin) |
| make helm-teardown | Remove the Helm release |
The helm/findoc-rag/ chart deploys the full stack to any Kubernetes cluster. The five steps below cover a typical first-time deployment.
- kubectl configured against a running cluster (local: kind or minikube; cloud: GKE, EKS, AKS)
- Helm v3.10+
- The helm-unittest plugin (only required for make helm-test):
helm plugin install https://github.com/helm-unittest/helm-unittest
kubectl create namespace findoc-rag
Copy the defaults and change the values that must differ in production:
cp helm/findoc-rag/values.yaml helm/findoc-rag/values.prod.yaml
Key overrides for a production deployment:
# helm/findoc-rag/values.prod.yaml
postgresql:
  credentials:
    password: "<strong-password>" # never commit this; use --set or a sealed-secret
grafana:
  adminPassword: "<strong-password>"
ingress:
  enabled: true
  host: findoc-rag.example.com # your domain
  tls:
    enabled: true
    secretName: findoc-rag-tls # pre-created TLS secret
queryApi:
  apiKeys: "<key1>,<key2>"
  llmBackend: claude # or: openai
# PostgreSQL credentials
kubectl create secret generic findoc-prod-postgresql \
--namespace findoc-rag \
--from-literal=POSTGRES_DB=findocdrag \
--from-literal=POSTGRES_USER=findocdrag \
--from-literal=POSTGRES_PASSWORD="<strong-password>"
# LLM API keys (only the key you need)
kubectl create secret generic findoc-prod-api-keys \
--namespace findoc-rag \
--from-literal=ANTHROPIC_API_KEY="sk-ant-..."
helm upgrade --install findoc-rag helm/findoc-rag \
--namespace findoc-rag \
--values helm/findoc-rag/values.prod.yaml \
--set postgresql.credentials.password="<strong-password>" \
--wait
To deploy without Ollama (remote LLM only, saves ~12 GB RAM):
helm upgrade --install findoc-rag helm/findoc-rag \
--namespace findoc-rag \
--values helm/findoc-rag/values.prod.yaml \
--set queryApi.llmBackend=claude \
--set ollama.replicas=0 \
--wait
# Check all pods reach Running / Completed state
kubectl get pods -n findoc-rag
# Tail logs for any service
kubectl logs -n findoc-rag -l app.kubernetes.io/component=query-api -f
# Port-forward the Query API for a quick smoke test
kubectl port-forward -n findoc-rag svc/findoc-rag-query-api 8000:8000
curl http://localhost:8000/health
# Port-forward Grafana dashboards
kubectl port-forward -n findoc-rag svc/findoc-rag-grafana 3000:3000
# open http://localhost:3000 (admin / <adminPassword>)
make helm-teardown
# or: helm uninstall findoc-rag --namespace findoc-rag
All configuration is driven by environment variables. See .env.example for the full list with descriptions. Key variables:
LLM & model selection
| Variable | Default | Description |
|---|---|---|
| LLM_BACKEND | ollama | LLM provider: ollama, openai, or claude |
| OLLAMA_URL | http://ollama:11434 | Ollama server URL (used when LLM_BACKEND=ollama) |
| OLLAMA_MODEL | mistral:7b | Ollama model name |
| OPENAI_API_KEY | (empty) | Required when LLM_BACKEND=openai |
| OPENAI_MODEL | gpt-4o-mini | OpenAI model name |
| ANTHROPIC_API_KEY | (empty) | Required when LLM_BACKEND=claude |
| CLAUDE_MODEL | claude-opus-4-6 | Claude model name |
| EMBEDDING_MODEL | sentence-transformers/all-MiniLM-L6-v2 | Sentence-transformers model for embedding |
Security & API
| Variable | Default | Description |
|---|---|---|
| API_KEYS | dev-key-1,dev-key-2 | Comma-separated valid API keys for the Query API |
| RATE_LIMIT | 30/minute | Query API rate limit per API key (slowapi format, e.g. 60/minute) |
| CORS_ORIGINS | * | Comma-separated allowed CORS origins (e.g. https://myapp.example.com) |
Ingestion
| Variable | Default | Description |
|---|---|---|
| EDGAR_USER_AGENT | FinDocDRAG findocdrag@example.com | SEC-required identification string |
| EDGAR_RATE_LIMIT_RPS | 10 | Max EDGAR API requests per second (SEC limit is 10 r/s) |
| KAFKA_BOOTSTRAP_SERVERS | kafka:9092 | Kafka broker address used by the embedding worker |
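The EDGAR_RATE_LIMIT_RPS cap listed above can be enforced with a simple interval-based limiter. The sketch below is one possible approach, not the Ingestion Service's actual code: the class name and the injectable `clock` and `sleep` parameters are assumptions added so the behaviour can be exercised without real waiting.

```python
import time


class IntervalRateLimiter:
    """Spaces calls at least 1/rps seconds apart (single-threaded sketch)."""

    def __init__(self, rps=10, clock=time.monotonic, sleep=time.sleep):
        if rps <= 0:
            raise ValueError("rps must be positive")
        self._interval = 1.0 / rps
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may proceed

    def acquire(self):
        """Block until the next request is allowed, then reserve its slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._interval
```

Calling `acquire()` before each outbound EDGAR request keeps a single worker at or below the configured rate; multiple workers would need a shared limiter.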
Database
| Variable | Default | Description |
|---|---|---|
| POSTGRES_HOST | postgres | PostgreSQL hostname |
| POSTGRES_PORT | 5432 | PostgreSQL port |
| POSTGRES_DB | findocdrag | Database name |
| POSTGRES_USER | findocdrag | Database user |
| POSTGRES_PASSWORD | changeme | Database password (override in production) |
Observability
| Variable | Default | Description |
|---|---|---|
| LOG_LEVEL | INFO | Log verbosity for all services (DEBUG, INFO, WARNING) |
| QUERY_API_PORT | 8000 | HTTP port for the Query API |
The Query API supports three LLM backends, selectable via the LLM_BACKEND environment variable:
Ollama (default) -- Runs locally inside the Docker Compose stack. No API key required. The model is pulled automatically on first start. Start with make run (or docker compose --profile local-llm up). Suitable for development and self-hosted deployments. Requires ~6 GB RAM for the mistral:7b model weights (~8 GB total including Docker overhead).
OpenAI -- Calls the OpenAI chat completions API. Set LLM_BACKEND=openai and provide a valid OPENAI_API_KEY. Uses gpt-4o-mini by default. Start with make run-remote. Useful for higher-quality answers and evaluation comparisons.
Claude (Anthropic) -- Calls the Anthropic messages API. Set LLM_BACKEND=claude and provide a valid ANTHROPIC_API_KEY. Uses claude-opus-4-6 by default (override with CLAUDE_MODEL). Start with make run-remote. No local GPU or RAM requirements beyond the base stack.
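The backend selection rules described above (default backend, per-backend model variable, and which API key each remote backend requires) can be summarized in a small resolver. This is an illustrative sketch only: the function name and error messages are hypothetical, and the real Query API may validate its configuration differently.

```python
# Model variable and documented default for each backend.
MODEL_SETTINGS = {
    "ollama": ("OLLAMA_MODEL", "mistral:7b"),
    "openai": ("OPENAI_MODEL", "gpt-4o-mini"),
    "claude": ("CLAUDE_MODEL", "claude-opus-4-6"),
}
# API key variable that must be set for each remote backend.
REQUIRED_KEYS = {"openai": "OPENAI_API_KEY", "claude": "ANTHROPIC_API_KEY"}


def resolve_backend(env):
    """Return (backend, model) from an environment-like mapping.

    Raises ValueError for an unknown backend or a missing API key.
    """
    backend = env.get("LLM_BACKEND", "ollama").strip().lower()
    if backend not in MODEL_SETTINGS:
        raise ValueError(f"unsupported LLM_BACKEND: {backend!r}")
    key_var = REQUIRED_KEYS.get(backend)
    if key_var and not env.get(key_var):
        raise ValueError(f"{key_var} is required when LLM_BACKEND={backend}")
    model_var, default_model = MODEL_SETTINGS[backend]
    return backend, env.get(model_var) or default_model
```

Passing `os.environ` as `env` applies the same rules to the live process environment.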
The Query API requires an X-API-Key header. Valid keys are configured via the API_KEYS environment variable. If API_KEYS is empty or unset, authentication is disabled (development mode).
Rate limiting defaults to 30 requests per minute per API key and can be changed with the RATE_LIMIT environment variable.
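The API key rule, including the development-mode bypass when API_KEYS is empty or unset, can be expressed in a few lines. This is a sketch with hypothetical function names; the Query API's real check is not shown in this README and may differ.

```python
import hmac


def parse_api_keys(raw):
    """Split a comma-separated API_KEYS value into a set, ignoring blanks."""
    return {key.strip() for key in (raw or "").split(",") if key.strip()}


def is_authorized(presented_key, raw_api_keys):
    """Return True if the request may proceed.

    An empty or unset API_KEYS value disables authentication entirely.
    """
    valid = parse_api_keys(raw_api_keys)
    if not valid:
        return True  # development mode: no keys configured
    if presented_key is None:
        return False
    presented = presented_key.encode("utf-8")
    # Constant-time comparison against each configured key.
    return any(hmac.compare_digest(presented, key.encode("utf-8")) for key in valid)
```

Because an unset API_KEYS disables the check, a production deployment should always set it explicitly.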
If the LLM backend is unavailable (timeout, network error, or API failure), the Query API still returns HTTP 200 with the retrieved source chunks and relevance scores, but answer is null and degraded is true:
{
  "answer": null,
  "sources": [ ... ],
  "model": "mistral:7b",
  "timing": { ... },
  "degraded": true,
  "request_id": "..."
}
This allows downstream consumers to present relevant context to users even when generation is unavailable. The degraded field is the canonical signal: do not rely on answer being absent, as future versions may return a partial answer with degraded: true.
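A consumer that follows this guidance branches on degraded rather than on whether answer is present. The helper below is a hypothetical client-side sketch (the function name and the output wording are assumptions); it relies only on the response fields documented in this README.

```python
def render_response(body):
    """Format a Query API response dict for display, keyed on `degraded`."""
    lines = []
    if body.get("degraded"):
        # Canonical signal: generation failed or was only partial.
        lines.append("Answer generation is unavailable; showing retrieved passages only.")
    if body.get("answer"):
        # Still shown when present, so a partial answer is not dropped.
        lines.append(body["answer"])
    for src in body.get("sources", []):
        lines.append(
            f"[{src['ticker']} {src['filing_date']} {src['section']}] {src['text_preview']}"
        )
    return "\n".join(lines)
```

The sources are rendered in both the normal and the degraded case, so users always see the supporting passages.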
| Document | Description |
|---|---|
| Technical Design Document | Architecture, functional and non-functional requirements, data model, technology rationale, scalability, resilience, observability, security, and deployment. |
| Design Decisions | Architectural decision records with trade-off analysis. |
| Evaluation Results | RAG quality metrics (context precision, faithfulness, answer relevancy). |
This is a portfolio/demonstration project and is not intended for production multi-tenant use. See the Technical Design Document Section 2.4 for scope boundaries.