Description
It would be great to containerize the ECCO‑obs‑pipeline and replace the existing Solr-based metadata/state store with a PostgreSQL database. This will simplify deployment, reduce external dependencies, and improve reproducibility of the pipeline across environments.
This is a longer-term goal and will not be actively worked on unless time allows or a greater need arises.
Goals:
- Provide a reproducible, portable Docker container for the ECCO‑obs‑pipeline.
- Replace Solr with a PostgreSQL database for metadata and state tracking.
- Maintain current functionality for dataset harvesting, transformation, and aggregation.
- Support both local development and production-scale Postgres deployment.
Benefits:
- Eliminates dependency on a Solr server.
- Simplifies installation and environment setup.
- Makes the pipeline more portable and easier to deploy.
- Provides robust transactional metadata/state tracking.
- Eases CI/CD integration and versioning through Docker images.
Proposed Work Sketch / Steps:
- Inventory Solr Usage
  - Identify all Solr queries and updates in the pipeline.
  - Document which fields are queried, filtered, or updated.
  - Determine which queries are exact match, range, full-text, or faceted.
- Database Schema Design (PostgreSQL)
  - Define tables for granule metadata, dataset state, and pipeline tracking.
  - Include indexed columns for fields used in filters (e.g., dataset, timestamp, status).
  - Optional JSONB column for flexible metadata fields.
  - Implement bulk load script to migrate existing Solr documents into Postgres.
- Refactor Code
  - Replace solr_query calls with SQL queries.
  - Replace solr_update calls with inserts/updates (INSERT … ON CONFLICT).
  - Add configuration for database connection (via DATABASE_URL).
- Containerization
  - Create Dockerfile for the pipeline app (Python 3.10, system dependencies, Python packages).
  - Optional multi-stage build to reduce image size.
  - Use Docker Compose for local development: app + Postgres container.
  - Mount data volumes for database persistence and for input/output files.
  - Configure environment variables for credentials and dataset configs.
- Testing / Validation
  - Verify pipeline steps (harvest → transform → aggregate) work correctly with Postgres.
  - Compare outputs against current Solr-based pipeline.
  - Test parallel processing / multiple-step runs to ensure Postgres handles concurrent writes.
- Deployment / Documentation
  - Provide instructions for running the containerized pipeline on Linux.
  - Include environment variable / config template for database connection and dataset configs.
  - Update CI/CD pipeline to build and test Docker images.
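
As a rough illustration of the schema and refactor steps above, the sketch below shows what an upsert-based replacement for a Solr update might look like. Everything here is an assumption made for illustration: the `granules` table, its columns, and the helper names are hypothetical and not taken from the current pipeline. To keep the snippet runnable without a database server, the standard-library `sqlite3` module stands in for PostgreSQL; SQLite (3.24 or newer) accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` form.

```python
import json
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS granules (
    dataset      TEXT NOT NULL,
    filename     TEXT NOT NULL,
    granule_date TEXT,
    status       TEXT NOT NULL,
    extra        TEXT,  -- flexible metadata; would be a JSONB column in PostgreSQL
    PRIMARY KEY (dataset, filename)
);
-- Index the columns used in filters.
CREATE INDEX IF NOT EXISTS idx_granules_status ON granules (dataset, status);
"""

# Insert a new granule, or update the existing row with the same key.
UPSERT = """
INSERT INTO granules (dataset, filename, granule_date, status, extra)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (dataset, filename) DO UPDATE SET
    granule_date = excluded.granule_date,
    status = excluded.status,
    extra = excluded.extra
"""


def connect(path=":memory:"):
    """Open a connection and create the (hypothetical) tables if missing."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def upsert_granule(conn, dataset, filename, granule_date, status, extra=None):
    """Stand-in for a Solr update call: one row per (dataset, filename)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            UPSERT,
            (dataset, filename, granule_date, status, json.dumps(extra or {})),
        )


def granules_with_status(conn, dataset, status):
    """Stand-in for an exact-match Solr query on dataset and status."""
    rows = conn.execute(
        "SELECT filename FROM granules "
        "WHERE dataset = ? AND status = ? ORDER BY filename",
        (dataset, status),
    )
    return [row[0] for row in rows]
```

With a PostgreSQL driver such as psycopg, the placeholders would be `%s` instead of `?`, and the connection would be opened from the `DATABASE_URL` setting rather than a file path.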
Notes / Considerations:
- PostgreSQL is recommended over SQLite for production because it handles concurrent writes better and is more robust and scalable.
- Existing Solr documents will need to be exported and imported into Postgres.
- Large volumes of input/output data should remain on host-mounted volumes to avoid bloating the container image.
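
For the export/import note above, a migration script would need to map each exported Solr document onto table columns. The following is only a sketch under stated assumptions: the Solr field names (`dataset_s`, `filename_s`, `date_dt`, `status_s`) and the target column names are hypothetical placeholders, and the real mapping would come from the Solr usage inventory. Fields without a dedicated column are kept in a JSON blob intended for the optional JSONB column.

```python
import json

# Hypothetical mapping from Solr field names to table column names.
CORE_FIELDS = {
    "dataset_s": "dataset",
    "filename_s": "filename",
    "date_dt": "granule_date",
    "status_s": "status",
}

# Solr's internal bookkeeping field, which has no meaning outside Solr.
SOLR_INTERNAL = {"_version_"}


def solr_doc_to_row(doc):
    """Convert one exported Solr document (a dict) into a row dict.

    Mapped fields become columns (None when absent); everything else,
    except Solr-internal fields, is serialized into the "extra" column.
    """
    row = {column: doc.get(field) for field, column in CORE_FIELDS.items()}
    extra = {
        key: value
        for key, value in doc.items()
        if key not in CORE_FIELDS and key not in SOLR_INTERNAL
    }
    row["extra"] = json.dumps(extra, sort_keys=True)
    return row
```

Rows produced this way could then be loaded in batches inside a single transaction, so that a failed import leaves the target table unchanged.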