Containerize and replace Solr with PostgreSQL #82

@kevinmarlis

Description

It would be great to containerize the ECCO‑obs‑pipeline and replace the existing Solr-based metadata/state store with a PostgreSQL database. This will simplify deployment, reduce external dependencies, and improve reproducibility of the pipeline across environments.

This is a longer-term goal and not something that will be actively worked on unless time allows or a greater need presents itself.

Goals:

  • Provide a reproducible, portable Docker container for the ECCO‑obs‑pipeline.
  • Replace Solr with a PostgreSQL database for metadata and state tracking.
  • Maintain current functionality for dataset harvesting, transformation, and aggregation.
  • Support both local development and production-scale Postgres deployment.

Benefits:

  • Eliminates dependency on a Solr server.
  • Simplifies installation and environment setup.
  • Makes the pipeline more portable and easier to deploy.
  • Provides robust transactional metadata/state tracking.
  • Easier CI/CD integration and versioning through Docker images.

Proposed Work Sketch / Steps:

  1. Inventory Solr Usage
  • Identify all Solr queries and updates in the pipeline.
  • Document which fields are queried, filtered, or updated.
  • Determine which queries are exact match, range, full-text, or faceted.
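While inventorying, it can help to note which filters translate mechanically to SQL. A hypothetical helper (the `solr_fq_to_sql` name, the `%s` placeholder style, and the Solr dynamic-field suffix convention are assumptions, not code from the pipeline) sketches how exact-match and range `fq` filters map to a WHERE clause:

```python
# Hypothetical sketch: translate simple Solr filter queries (exact match and
# range) into a parameterized SQL WHERE clause. Field names ending in Solr
# dynamic-field suffixes (_s, _dt, ...) are mapped to bare column names.
def solr_fq_to_sql(fq_list):
    clauses, params = [], []
    for fq in fq_list:
        field, _, value = fq.partition(":")
        col = field.rsplit("_", 1)[0]  # strip dynamic-field suffix
        if value.startswith("[") and value.endswith("]"):
            # range filter, e.g. date_dt:[2020-01-01 TO 2020-12-31]
            lo, hi = value[1:-1].split(" TO ")
            clauses.append(f"{col} BETWEEN %s AND %s")
            params.extend([lo, hi])
        else:
            # exact-match filter, e.g. dataset_s:MODIS
            clauses.append(f"{col} = %s")
            params.append(value)
    return " AND ".join(clauses), params
```

Full-text and faceted queries will not map this directly and should be flagged separately during the inventory.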
  2. Database Schema Design (PostgreSQL)
  • Define tables for granule metadata, dataset state, and pipeline tracking.
  • Include indexed columns for fields used in filters (e.g., dataset, timestamp, status).
  • Optional JSONB column for flexible metadata fields.
  • Implement bulk load script to migrate existing Solr documents into Postgres.
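A first cut at the schema might look like the following (all table and column names here are assumptions to be refined against the inventory step):

```sql
-- Granule metadata: indexed filter columns plus a JSONB catch-all.
CREATE TABLE granules (
    granule_id  TEXT PRIMARY KEY,
    dataset     TEXT NOT NULL,
    status      TEXT NOT NULL,
    modified    TIMESTAMPTZ NOT NULL DEFAULT now(),
    metadata    JSONB
);
CREATE INDEX granules_dataset_idx ON granules (dataset);
CREATE INDEX granules_status_idx  ON granules (status);

-- Per-dataset pipeline state (harvest/transform/aggregate progress).
CREATE TABLE dataset_state (
    dataset      TEXT PRIMARY KEY,
    last_harvest TIMESTAMPTZ,
    config       JSONB
);
```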
  3. Refactor Code
  • Replace solr_query calls with SQL queries.
  • Replace solr_update calls with inserts/updates (INSERT … ON CONFLICT).
  • Add configuration for database connection (via DATABASE_URL).
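The solr_update replacements could be routed through a small statement builder so every write becomes an idempotent upsert. A sketch (the `build_upsert` name and the `granules` table are assumptions; the driver shown in the usage comment is psycopg2):

```python
# Hypothetical sketch: build a parameterized Postgres upsert
# (INSERT ... ON CONFLICT ... DO UPDATE) from a dict of column values.
def build_upsert(table, key_cols, data):
    cols = list(data)
    placeholders = ", ".join(["%s"] * len(cols))
    updates = ", ".join(
        f"{c} = EXCLUDED.{c}" for c in cols if c not in key_cols
    )
    sql = (
        f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(key_cols)}) DO UPDATE SET {updates}"
    )
    return sql, [data[c] for c in cols]

# Usage with a DB-API cursor (e.g. psycopg2), assuming the schema above:
#   sql, params = build_upsert("granules", ["granule_id"],
#                              {"granule_id": "G1", "status": "harvested"})
#   cur.execute(sql, params)
```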
  4. Containerization
  • Create Dockerfile for the pipeline app (Python 3.10, system dependencies, Python packages).
  • Optional multi-stage build to reduce image size.
  • Use Docker Compose for local development: app + Postgres container.
  • Mount data volumes for database persistence and for input/output files.
  • Configure environment variables for credentials and dataset configs.
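A minimal Compose sketch for local development (service names, image tag, credentials, and mount paths are all placeholders):

```yaml
services:
  app:
    build: .
    environment:
      DATABASE_URL: postgresql://ecco:ecco@db:5432/ecco_pipeline
    volumes:
      - ./data:/data          # input/output files stay on the host
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: ecco
      POSTGRES_PASSWORD: ecco
      POSTGRES_DB: ecco_pipeline
    volumes:
      - pgdata:/var/lib/postgresql/data   # database persistence

volumes:
  pgdata:
```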
  5. Testing / Validation
  • Verify pipeline steps (harvest → transform → aggregate) work correctly with Postgres.
  • Compare outputs against current Solr-based pipeline.
  • Test parallel processing / multiple-step runs to ensure Postgres handles concurrent writes.
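Comparing outputs between the Solr-backed and Postgres-backed runs could use a small record-diff helper (the function name, the key field, and the flat-dict record shape are assumptions):

```python
# Hypothetical sketch: diff two result sets keyed by granule_id, reporting
# records missing from / extra in the Postgres run and records that changed.
def diff_records(solr_recs, pg_recs, key="granule_id"):
    a = {r[key]: r for r in solr_recs}
    b = {r[key]: r for r in pg_recs}
    return {
        "missing": sorted(set(a) - set(b)),
        "extra": sorted(set(b) - set(a)),
        "changed": sorted(k for k in set(a) & set(b) if a[k] != b[k]),
    }
```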
  6. Deployment / Documentation
  • Provide instructions for running the containerized pipeline on Linux.
  • Include environment variable / config template for database connection and dataset configs.
  • Update CI/CD pipeline to build and test Docker images.
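The config template could be a simple env file; the variable names below are assumptions, not the pipeline's actual settings:

```shell
# Hypothetical environment template; copy to .env and adjust per environment.
export DATABASE_URL="postgresql://ecco:ecco@localhost:5432/ecco_pipeline"
export PIPELINE_DATA_DIR="/data/ecco"            # host-mounted input/output
export DATASET_CONFIG_DIR="/etc/ecco/datasets"   # per-dataset config files
```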

Notes / Considerations:

  • PostgreSQL is recommended over SQLite for production due to concurrent writes, robustness, and scalability.
  • Existing Solr documents will need to be exported and imported into Postgres.
  • Large volumes of input/output data should remain on host-mounted volumes to avoid bloating the container image.
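The export/import step will also need to normalize exported Solr documents into plain rows. A sketch, assuming the usual dynamic-field suffix conventions (`_s`, `_dt`, etc.) and that Solr-internal fields are dropped:

```python
# Hypothetical sketch: convert one exported Solr document into a flat row
# dict suitable for a Postgres bulk load (e.g. via COPY or executemany).
def solr_doc_to_row(doc):
    drop = {"id", "_version_"}              # Solr-internal fields
    suffixes = ("_dt", "_s", "_i", "_f", "_b")
    row = {}
    for field, value in doc.items():
        if field in drop:
            continue
        for suf in suffixes:
            if field.endswith(suf):
                field = field[: -len(suf)]  # strip dynamic-field suffix
                break
        row[field] = value
    return row
```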
