Containerize and replace Solr with PostgreSQL

It would be great to containerize the ECCO‑obs‑pipeline and replace the existing Solr-based metadata/state store with a PostgreSQL database. This would simplify deployment, remove the dependency on a separately managed Solr server, and improve the reproducibility of the pipeline across environments.

_This is a longer-term goal and will not be actively worked on unless time allows or a greater need arises._

## Goals:
- Provide a reproducible, portable Docker container for the ECCO‑obs‑pipeline.
- Replace Solr with a PostgreSQL database for metadata and state tracking.
- Maintain current functionality for dataset harvesting, transformation, and aggregation.
- Support both local development and production-scale Postgres deployment.

## Benefits:
- Eliminates dependency on a Solr server.
- Simplifies installation and environment setup.
- Makes the pipeline more portable and easier to deploy.
- Provides robust transactional metadata/state tracking.
- Eases CI/CD integration and versioning through Docker images.

## Proposed Work Sketch / Steps:
1. Inventory Solr Usage

- Identify all Solr queries and updates in the pipeline.
- Document which fields are queried, filtered, or updated.
- Determine which queries are exact match, range, full-text, or faceted.
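One way to start the inventory is to scan the code base for calls to the Solr helpers. The sketch below assumes the helpers are named `solr_query` and `solr_update` (the names used in the refactoring step) and that they are called either directly or as module attributes; `find_solr_calls` and `scan_tree` are hypothetical names introduced here for illustration.

```python
import ast
from pathlib import Path

# Helper names taken from the "Refactor Code" step; adjust if the code base differs.
SOLR_FUNCS = {"solr_query", "solr_update"}


def find_solr_calls(source: str, filename: str = "<string>") -> list[tuple[str, int, str]]:
    """Return (filename, line number, function name) for each Solr helper call."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Covers both solr_query(...) and some_module.solr_query(...).
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name in SOLR_FUNCS:
            hits.append((filename, node.lineno, name))
    return sorted(hits)


def scan_tree(root: str) -> list[tuple[str, int, str]]:
    """Collect Solr helper calls from every .py file under root."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        hits.extend(find_solr_calls(path.read_text(encoding="utf-8"), str(path)))
    return hits
```

Running `scan_tree(".")` from the repository root would give a list of call sites to review by hand. Classifying each query as exact match, range, full-text, or faceted still requires reading the arguments at each site.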

2. Database Schema Design (PostgreSQL)

- Define tables for granule metadata, dataset state, and pipeline tracking.
- Include indexed columns for fields used in filters (e.g., dataset, timestamp, status).
- Optionally add a JSONB column for flexible metadata fields.
- Implement a bulk-load script to migrate existing Solr documents into Postgres.
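A minimal sketch of what the schema and the document-to-row mapping might look like. Every table, column, and Solr field name here (`granules`, `dataset_s`, `date_dt`, `status_s`, and so on) is an assumption for illustration; the real names should come out of the Solr inventory. Only `_version_` is a known Solr-internal field, which is dropped during migration.

```python
import json

# Hypothetical schema: typed, indexed columns for fields used in filters,
# plus a JSONB column for everything else.
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS granules (
    id          TEXT PRIMARY KEY,
    dataset     TEXT NOT NULL,
    granule_ts  TIMESTAMPTZ,
    status      TEXT,
    extra       JSONB NOT NULL DEFAULT '{}'::jsonb
);
CREATE INDEX IF NOT EXISTS granules_dataset_ts_idx ON granules (dataset, granule_ts);
CREATE INDEX IF NOT EXISTS granules_status_idx ON granules (status);
"""

# Assumed mapping from Solr field names to typed columns.
FIELD_TO_COLUMN = {
    "id": "id",
    "dataset_s": "dataset",
    "date_dt": "granule_ts",
    "status_s": "status",
}


def solr_doc_to_row(doc: dict) -> dict:
    """Split a Solr document into typed columns and a JSON string of leftovers."""
    row = {column: doc.get(field) for field, column in FIELD_TO_COLUMN.items()}
    extra = {
        key: value
        for key, value in doc.items()
        if key not in FIELD_TO_COLUMN and key != "_version_"
    }
    row["extra"] = json.dumps(extra, sort_keys=True)
    return row
```

Keeping unmapped fields in `extra` means the migration does not lose data even if the inventory misses a field; such fields can be promoted to real columns later.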

3. Refactor Code

- Replace solr_query calls with SQL queries.
- Replace solr_update calls with inserts/updates (INSERT … ON CONFLICT).
- Add configuration for database connection (via DATABASE_URL).
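As a rough illustration of the `solr_update` replacement, the helper below builds a parameterized `INSERT … ON CONFLICT` statement from a row dictionary. `build_upsert` is a hypothetical name, the `granules` table is an assumed example, and the `%s` placeholders assume a psycopg-style driver; the block itself only builds the SQL and does not connect to a database.

```python
def build_upsert(
    table: str, row: dict, conflict_cols: tuple[str, ...] = ("id",)
) -> tuple[str, list]:
    """Build an upsert statement and its parameter list for one row.

    Table and column names are interpolated directly into the SQL text, so
    they must come from trusted code, never from user input. Values are
    passed separately as parameters.
    """
    cols = list(row)
    placeholders = ", ".join(["%s"] * len(cols))
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in cols if c not in conflict_cols)
    # With nothing left to update, fall back to ignoring the duplicate.
    action = f"DO UPDATE SET {updates}" if updates else "DO NOTHING"
    sql = (
        f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(conflict_cols)}) {action}"
    )
    return sql, [row[c] for c in cols]
```

For example, `build_upsert("granules", {"id": "g1", "status": "transformed"})` yields a statement that inserts the row or, if `id` already exists, overwrites `status`. The resulting pair would be passed to the driver's `cursor.execute(sql, params)`.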

4. Containerization

- Create a Dockerfile for the pipeline app (Python 3.10, system dependencies, Python packages).
- Optionally use a multi-stage build to reduce image size.
- Use Docker Compose for local development: app + Postgres container.
- Mount data volumes for database persistence and for input/output files.
- Configure environment variables for credentials and dataset configs.
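On the application side, the container's environment variables could be read in one place at startup. The sketch below assumes `DATABASE_URL` as named in the refactoring step; `DATASET_CONFIG_DIR`, `OUTPUT_DIR`, and their default paths are hypothetical names chosen for illustration, as are `Settings` and `load_settings`.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Settings:
    database_url: str
    config_dir: Path
    output_dir: Path


def load_settings(env=None) -> Settings:
    """Read pipeline settings from environment variables, failing fast on a bad URL."""
    env = os.environ if env is None else env
    url = env.get("DATABASE_URL", "")
    parts = urlsplit(url)
    if parts.scheme not in {"postgresql", "postgres"} or not parts.hostname:
        raise ValueError(
            "DATABASE_URL must look like postgresql://user:password@host:5432/dbname"
        )
    return Settings(
        database_url=url,
        # Hypothetical defaults matching where volumes might be mounted.
        config_dir=Path(env.get("DATASET_CONFIG_DIR", "/app/conf")),
        output_dir=Path(env.get("OUTPUT_DIR", "/data/output")),
    )
```

Accepting an optional mapping instead of always reading `os.environ` keeps the loader easy to unit test, and the same code works whether the variables come from Docker Compose, a CI job, or a local shell.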

5. Testing / Validation

- Verify pipeline steps (harvest → transform → aggregate) work correctly with Postgres.
- Compare outputs against current Solr-based pipeline.
- Test parallel processing / multiple-step runs to ensure Postgres handles concurrent writes.
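The output comparison could be automated by checksumming the files produced by each run. This is a sketch only: `compare_outputs` and the report keys are hypothetical, and it assumes both runs write the same relative file layout. If the output files embed creation timestamps or similar run-specific metadata, a byte-level hash will flag them as different and a format-aware comparison would be needed instead.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large outputs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while block := handle.read(chunk_size):
            digest.update(block)
    return digest.hexdigest()


def checksum_tree(root) -> dict[str, str]:
    """Map each file's path relative to root onto its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_outputs(solr_dir, postgres_dir) -> dict[str, list[str]]:
    """Report files that are missing, extra, or different between two runs."""
    a, b = checksum_tree(solr_dir), checksum_tree(postgres_dir)
    return {
        "missing_in_postgres": sorted(a.keys() - b.keys()),
        "extra_in_postgres": sorted(b.keys() - a.keys()),
        "different": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

A run passes this check when all three lists are empty.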

6. Deployment / Documentation

- Provide instructions for running the containerized pipeline on Linux.
- Include environment variable / config template for database connection and dataset configs.
- Update CI/CD pipeline to build and test Docker images.

## Notes / Considerations:

- PostgreSQL is recommended over SQLite for production because it supports concurrent writers (SQLite allows only one writer at a time) and is more robust and scalable under parallel pipeline runs.
- Existing Solr documents will need to be exported and imported into Postgres.
- Large volumes of input/output data should remain on host-mounted volumes to avoid bloating the container image.
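For the export step, Solr's cursor-based paging (`cursorMark`) can stream every document out of a core without deep-paging penalties. The sketch below is one possible approach, not the pipeline's existing code: `export_docs` and `fetch_page` are hypothetical names, the select URL must be supplied by the caller, and it assumes the core's unique key field is `id`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_page(select_url: str, params: dict) -> dict:
    """Issue one GET request against a Solr /select handler and decode the JSON."""
    with urlopen(f"{select_url}?{urlencode(params)}") as response:
        return json.load(response)


def export_docs(select_url: str, fetch=fetch_page, rows: int = 500, unique_key: str = "id"):
    """Yield every document in the core using cursorMark paging.

    Cursor paging requires a sort that includes the unique key field.
    The export is complete when Solr returns the same cursor it was sent.
    """
    cursor = "*"
    while True:
        page = fetch(
            select_url,
            {
                "q": "*:*",
                "sort": f"{unique_key} asc",
                "rows": rows,
                "cursorMark": cursor,
                "wt": "json",
            },
        )
        yield from page["response"]["docs"]
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:
            return
        cursor = next_cursor
```

Because `fetch` is injectable, the paging logic can be tested without a running Solr server, and each yielded document can be fed straight into whatever bulk-load routine writes to Postgres.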
