This repository was archived by the owner on Mar 25, 2026. It is now read-only.

feat(vectors): add pgvector embeddings, /vectors page, and bdp-embed Python CLI #1

Open
sebastianstupak wants to merge 41 commits into main from feature/vectors

Conversation

@sebastianstupak
Contributor

Summary

  • Database: 4 migrations adding pgvector extension, entry_embeddings (halfvec(512) + HNSW index), entry_projections (pre-computed 2D coords), and vector_projection_runs (pipeline tracking)
  • Python CLI (tools/bdp-embed): embed (OpenAI → entry_embeddings), project (landmark UMAP → entry_projections via MinIO), tiles (quadtree WizMap tiles → MinIO) subcommands
  • Rust backend: 4 CQRS query handlers — get_stats, semantic_search (Moka-cached embeddings), get_neighbors (KNN), get_tile (MinIO tile proxy); all registered at /api/v1/vectors
  • Frontend: /vectors page with regl-scatterplot WebGL canvas, tile-based loading, source-type legend, sidebar (neighbors), search bar (semantic), and header nav link

Test Plan

  • Run cargo xtask db migrate then cargo xtask sqlx prepare to generate SQLx metadata for new queries
  • Install bdp-embed: cd tools/bdp-embed && pip install -e .
  • Run bdp-embed embed --db-url $DATABASE_URL --openai-key $OPENAI_API_KEY to populate embeddings
  • Run bdp-embed project then bdp-embed tiles to generate projection + tiles
  • GET /api/v1/vectors/stats → returns JSON with run status and counts
  • GET /api/v1/vectors/search?q=ribosome → returns semantic search results
  • Navigate to /vectors → scatter plot renders with points colored by source type
  • Click a point → sidebar shows neighbors; type in search bar → flies to results

🤖 Generated with Claude Code

sebastianstupak and others added 30 commits March 21, 2026 18:07
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…olumes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Terraform IaC: Hetzner VPS (cx22), persistent primary IP, data volume (prevent_destroy), Storage Box for restic backups, Cloudflare DNS
- Cloud-init: Docker, Dokploy install, UFW, restic cron, /etc/dokploy symlink for LE cert persistence
- docker-compose: remove standalone traefik, add MinIO (replaces OVH S3), postgres bind mounts
- xtask infra: 16 commands (bootstrap, plan, apply, ssh, status, post-deploy, backup-*, logs, update, etc.)
- infrastructure/README.md updated for new setup
- .secrets.example: use plain key=value format (no TF_VAR_ prefix)
- infra.rs load_env_preamble: parse .secrets and export each key both
  as key=val (direct) and TF_VAR_key=val (for Terraform)
- infra.rs ssh_key_path: read lowercase ssh_key_path= key
- bootstrap: reference lowercase $ssh_key_path var
- Add .github/workflows/infrastructure.yml for Hetzner Terraform CI
  with plan/apply/destroy via GitHub Environment secrets (TF_VAR_*)
- Remove old OVH infrastructure.yml.disabled (superseded)

No .tfvars files — all Terraform vars via TF_VAR_* env vars.
GitHub CI stores secrets as TF_VAR_<key> in production environment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Full design for pgvector-based semantic embeddings across all BDP
bioinformatics registry entries, WizMap-style quadtree tile visualization
page using regl-scatterplot, and semantic search for MCP integration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ledge graph

Full architecture spec covering deck.gl v9 tile-based streaming renderer,
FlatBuffers binary protocol, Rust CQRS tile server with PostGIS spatial indexing,
offline Louvain+ForceAtlas2 layout pipeline, extensible entity/edge type registry
pre-seeded with all future bioinformatics domains, and 9-phase ingestion roadmap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- .secrets.example: all keys now CAPITALIZED_LIKE_THIS (standard .env)
- load_env_preamble: lowercases key before TF_VAR_ prefix so Terraform
  vars match (HCLOUD_TOKEN -> TF_VAR_hcloud_token)
- ssh_key_path: reads SSH_KEY_PATH (uppercase)
- bootstrap: prints SSH_PUBLIC_KEY= ready-to-paste line for existing keys
- Add `cargo xtask infra gen-secrets` — generates all random secrets
  (passwords + restic passphrase) and prints remaining manual steps
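
The key handling described above can be illustrated with a short sketch. The actual implementation lives in Rust (`infra.rs`); the Python below is only an approximation of the described behaviour, and details such as comment handling and whitespace trimming are assumptions rather than something the commit message states.

```python
def load_env_preamble(text: str) -> dict[str, str]:
    """Parse KEY=value lines and return both the direct and TF_VAR_ forms.

    Hypothetical Python rendering of the behaviour described in the commit;
    quoting and multi-line values are not handled here.
    """
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines, comments, and lines without "=" are skipped.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip()
        env[key] = value                      # direct form, e.g. HCLOUD_TOKEN
        env[f"TF_VAR_{key.lower()}"] = value  # Terraform form, e.g. TF_VAR_hcloud_token
    return env
```

Lowercasing only the part after the `TF_VAR_` prefix lets the `.secrets` file keep conventional uppercase keys while still matching lowercase Terraform variable names.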

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tors page

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ental writes

- Add db.py with psycopg async connection helper (get_conn context manager)
- Implement embed.py with embed command that:
  - Fetches unembedded registry entries from the database
  - Batches entries and calls OpenAI text-embedding-3-small API
  - Implements exponential backoff for rate limiting
  - Truncates text to 32k chars for safety
  - Writes vectors to entry_embeddings table with upsert
  - Shows progress with tqdm and user-friendly messages
- Update cli.py to import and register embed subcommand
- Supports DATABASE_URL and OPENAI_API_KEY from environment
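
A minimal sketch of the batching, truncation, and backoff steps listed above. It is not the code from `embed.py`: the `embed_fn` callable stands in for the OpenAI request, and the batch size, retry count, delays, and the broad `except Exception` are all assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, Iterator, Sequence

MAX_CHARS = 32_000  # truncation limit named in the commit message


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def with_backoff(call: Callable[[], list], *, retries: int = 5,
                 base_delay: float = 1.0, sleep=time.sleep) -> list:
    """Retry `call`, doubling the delay after each failure.

    Assumption: a real client would catch only its rate-limit error;
    catching Exception keeps this sketch library-independent.
    """
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            # 1s, 2s, 4s, ... plus a little jitter to avoid synchronized retries
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    return []  # only reached when retries == 0


def embed_all(texts: Sequence[str], embed_fn: Callable[[Sequence[str]], list],
              batch_size: int = 100, sleep=time.sleep) -> list:
    """Truncate each text, embed in batches, and collect the vectors in order."""
    truncated = [t[:MAX_CHARS] for t in texts]
    vectors: list = []
    for batch in batched(truncated, batch_size):
        vectors.extend(with_backoff(lambda: embed_fn(batch), sleep=sleep))
    return vectors
```

Passing `sleep` as a parameter keeps the retry logic testable without real waiting.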

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pload

Implement quadtree tile generation from 2D point projections with:
- Vectorized cell assignment for O(N) performance
- Adaptive downsampling (fewer points at lower zoom levels)
- Multi-level tile generation (zoom 0-14 by default)
- S3/MinIO upload with progress tracking
- Database status update on completion
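
The cell assignment and per-level downsampling can be sketched roughly as follows. This is an illustration, not the PR's implementation: it assumes points are normalized to the unit square, uses a plain loop instead of a vectorized array operation so it has no dependencies, and stands in a simple per-tile cap (`max_per_tile`) for the adaptive downsampling rule, which the commit message does not spell out. The key format follows the `vectors/tiles/{run_id}/{z}/{x}/{y}.json` layout mentioned elsewhere in this PR.

```python
from collections import defaultdict


def tile_key(run_id: str, z: int, x: int, y: int) -> str:
    """Object key for one tile, following the layout used by the tile route."""
    return f"vectors/tiles/{run_id}/{z}/{x}/{y}.json"


def assign_cells(points: list[tuple[float, float]], z: int) -> dict:
    """Map each point index to its (x, y) cell on the 2^z by 2^z grid.

    Assumes coordinates lie in [0, 1]; a single pass, so O(N) per zoom level.
    """
    n = 1 << z
    cells = defaultdict(list)
    for idx, (px, py) in enumerate(points):
        # Clamp so that a coordinate of exactly 1.0 falls in the last cell.
        cx = min(int(px * n), n - 1)
        cy = min(int(py * n), n - 1)
        cells[(cx, cy)].append(idx)
    return dict(cells)


def build_tiles(points: list[tuple[float, float]], run_id: str,
                max_zoom: int, max_per_tile: int) -> dict[str, list[int]]:
    """Build tiles for zoom 0..max_zoom, keeping at most max_per_tile points each.

    Low zoom levels have few, large cells, so the cap discards more points
    there; deeper levels progressively reveal the rest.
    """
    tiles: dict[str, list[int]] = {}
    for z in range(max_zoom + 1):
        for (x, y), idxs in assign_cells(points, z).items():
            tiles[tile_key(run_id, z, x, y)] = idxs[:max_per_tile]
    return tiles
```

With NumPy, the loop in `assign_cells` would typically collapse into one floor-and-clip over the whole coordinate array, which is presumably what "vectorized" refers to.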

Includes comprehensive unit tests for tile key generation, point filtering,
and quadtree building with progressive downsampling verification.

Uncomment tiles import in cli.py to register the new subcommand.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces the vectors feature module with:
- GetVectorStatsQuery / VectorStatsResponse following the CQRS pattern
  (implements Request<Result<…>> and crate::cqrs::middleware::Query)
- Live counts from registry_entries and entry_embeddings
- Most-recent row from vector_projection_runs
- queries/mod.rs with only get_stats (semantic_search, get_neighbors,
  get_tile added in Tasks 9-10)
- vectors/mod.rs with stub comment for routes (added in Task 10)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sebastianstupak and others added 6 commits March 22, 2026 03:08
Implements SemanticSearchQuery (OpenAI embed → pgvector KNN with moka
in-process cache) and GetNeighborsQuery (seed-vector KNN excluding self).
Both follow CQRS Query pattern via mediator trait impls.
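
The embed-then-KNN flow with a query cache might look like the sketch below. The real handlers are Rust and use moka; here a small LRU built on `OrderedDict` stands in for it. The column names (`entry_id`, `embedding`), the cosine-distance operator `<=>`, the cache capacity, and the key normalization are assumptions, not details confirmed by the PR.

```python
from collections import OrderedDict
from typing import Callable, Sequence


class EmbeddingCache:
    """Tiny LRU cache so repeated queries skip the embedding API call."""

    def __init__(self, embed_fn: Callable[[str], list], capacity: int = 1024):
        self._embed_fn = embed_fn   # hypothetical stand-in for the OpenAI call
        self._capacity = capacity
        self._items: OrderedDict = OrderedDict()

    def get(self, query: str) -> list:
        key = query.strip()  # assumption: surrounding whitespace is ignored
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        vec = self._embed_fn(key)
        self._items[key] = vec
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict least recently used
        return vec


def to_pgvector_literal(vec: Sequence[float]) -> str:
    """Render a vector in pgvector's text form, e.g. "[0.5,-1.0]"."""
    return "[" + ",".join(repr(float(v)) for v in vec) + "]"


# Assumed table shape; %s placeholders follow psycopg's parameter style.
KNN_SQL = """
SELECT entry_id
FROM entry_embeddings
ORDER BY embedding <=> %s::halfvec(512)
LIMIT %s
"""
```

A caller would pass `to_pgvector_literal(cache.get(q))` and the desired limit as the two query parameters. For the neighbors query, the seed entry's stored vector replaces the cached query embedding, and the seed row is excluded with an extra `WHERE` clause.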

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…handlers in mediator

- Create get_tile.rs query: fetches tile bytes from S3 at
  vectors/tiles/{run_id}/{z}/{x}/{y}.json using storage.download()
- Create routes.rs: mounts /stats, /search, /:entry_id/neighbors,
  /tiles/:run_id/:z/:x/:y with proper error mapping and cache headers
- Register 4 vector handlers in cqrs/mod.rs (get_stats, semantic_search,
  get_neighbors, get_tile)
- Add vectors module to features/mod.rs and mount at /vectors

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add VectorSidebar component showing point metadata and nearest neighbors,
VectorSearchBar with debounced semantic search and centroid fly-to, and
a /vectors nav link in the header following the existing icon+text pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…react

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sebastianstupak and others added 5 commits March 22, 2026 10:13
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ous notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…isabled harness

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>