🏠 Offseason Shelter for Science

A distributed data rescue system designed to preserve and manage climate data from data.gov (the US government's open data portal). This project creates a resilient infrastructure for storing and managing climate datasets that might be at risk of being lost or becoming inaccessible.

🎯 Project Overview

The "Offseason Shelter for Science" is a microservices-based platform that enables distributed data rescue operations. It provides intelligent prioritization, automated discovery, and resilient storage for climate datasets from government open data portals.

🌟 Key Features

  • πŸ” Automated Data Discovery: Scrapes and catalogs datasets from CKAN-based portals
  • ⚑ Intelligent Prioritization: Dynamic priority system for dataset rescue order
  • πŸ”„ Distributed Processing: Multi-node architecture for scalable data rescue
  • πŸ’Ύ Resilient Storage: PostgreSQL database with proper relationships and indexing
  • πŸ“Š Real-time Monitoring: API endpoints for system status and progress tracking
  • 🐳 Containerized Deployment: Docker-based microservices architecture

πŸ—οΈ Architecture

The project consists of four main microservices, each running in Docker containers:

πŸ“Š Rescue DB - Central Database Service

  • Purpose: Manages the catalog of datasets, resources, and assets
  • Technology: PostgreSQL + FastAPI + SQLAlchemy
  • Key Entities: Datasets, Resources, Assets, Organizations, AssetKinds
  • Port: 8000
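The entity list above can be sketched as a small SQLAlchemy model. This is an illustrative example, not the project's actual schema: the table names, columns, and the `Dataset`/`Resource` pairing are assumptions chosen to show how the catalog relationships and indexing might be wired up.

```python
# Hypothetical sketch of the Rescue DB entity relationships
# (illustrative only -- the real models live in rescue_db).
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Dataset(Base):
    __tablename__ = "datasets"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    resources = relationship("Resource", back_populates="dataset")

class Resource(Base):
    __tablename__ = "resources"
    id = Column(Integer, primary_key=True)
    url = Column(String, nullable=False)
    dataset_id = Column(Integer, ForeignKey("datasets.id"), index=True)
    dataset = relationship("Dataset", back_populates="resources")

# In-memory SQLite, just to exercise the schema end to end.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    ds = Dataset(name="noaa-sea-level",
                 resources=[Resource(url="https://example.gov/slr.csv")])
    session.add(ds)
    session.commit()
    stored = session.query(Resource).one()
    linked_name = stored.dataset.name   # navigate Resource -> Dataset
    stored_url = stored.url
```

The foreign-key index mirrors the "proper relationships and indexing" claim: lookups from a resource back to its parent dataset stay cheap as the catalog grows.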

πŸ•·οΈ DataGov Asset Collector - Data Ingestion Service

  • Purpose: Discovers and catalogs downloadable assets from data.gov
  • Technology: Scrapy + Redis + CKAN API
  • Features: Web scraping, metadata extraction, caching
  • Process: Queries CKAN API β†’ Scrapes dataset pages β†’ Extracts file metadata
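The first step of that pipeline can be sketched with standard CKAN conventions. The collector's exact queries aren't documented here; this helper just builds a `package_search` URL against data.gov's CKAN API root, using standard CKAN parameters (`q`, `rows`, `start`) for paging through results.

```python
# Sketch of querying a CKAN portal's package_search endpoint
# (parameters are standard CKAN; the collector's real queries may differ).
from urllib.parse import urlencode

CKAN_BASE = "https://catalog.data.gov/api/3/action"  # data.gov CKAN API root

def build_package_search(query: str, rows: int = 100, start: int = 0) -> str:
    """Return a package_search URL that pages through matching datasets."""
    params = urlencode({"q": query, "rows": rows, "start": start})
    return f"{CKAN_BASE}/package_search?{params}"

url = build_package_search("climate", rows=50)
```

Paging with `rows`/`start` lets the collector walk the full result set in batches instead of fetching everything in one request.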

πŸ“‘ Dispatcher - Task Distribution Service

  • Purpose: Coordinates allocation of data rescue tasks across nodes
  • Technology: FastAPI
  • Functionality: Task distribution, resource management
  • Port: 8001
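The dispatcher's core job can be illustrated with a simple allocation strategy. The real service exposes this over FastAPI and its actual policy isn't documented here; this sketch assumes a round-robin split of pending dataset IDs across available rescue nodes.

```python
# Hypothetical round-robin task distribution (the Dispatcher's real
# allocation policy may differ -- this only illustrates the idea).
from collections import defaultdict
from itertools import cycle

def assign_round_robin(dataset_ids, node_ids):
    """Map each node ID to the list of dataset IDs it should rescue."""
    assignments = defaultdict(list)
    nodes = cycle(node_ids)
    for ds in dataset_ids:
        assignments[next(nodes)].append(ds)
    return dict(assignments)

plan = assign_round_robin(["d1", "d2", "d3", "d4", "d5"], ["node-a", "node-b"])
```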

🎯 Priorizer - Intelligent Task Prioritization Service

  • Purpose: Manages dataset rescue priorities and resource allocation
  • Technology: FastAPI + APScheduler
  • Features: Dynamic priority scoring, automatic updates, resource matching
  • API: Allocation, release, priority updates, status monitoring
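Dynamic priority scoring could look like the sketch below. The weighting is entirely hypothetical (the Priorizer's real formula is internal): it combines dataset size with an at-risk flag so endangered data jumps the queue, and small datasets are preferred as quick wins.

```python
# Illustrative priority scoring -- the weights and inputs are assumptions,
# not the Priorizer's documented algorithm.
def priority_score(size_mb: float, at_risk: bool) -> float:
    """Higher score = rescue sooner; at-risk datasets jump the queue."""
    base = 1.0 / (1.0 + size_mb / 1000.0)  # prefer smaller, faster wins
    return base + (10.0 if at_risk else 0.0)

datasets = [
    {"id": "a", "size_mb": 50, "at_risk": False},
    {"id": "b", "size_mb": 5000, "at_risk": True},
    {"id": "c", "size_mb": 10, "at_risk": False},
]
queue = sorted(datasets,
               key=lambda d: priority_score(d["size_mb"], d["at_risk"]),
               reverse=True)
order = [d["id"] for d in queue]
```

In the real service, APScheduler would re-run a scoring pass like this on a schedule so priorities track changing conditions.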

πŸš€ Quick Start

Prerequisites

  • Docker and Docker Compose
  • uv (fast Python package and project manager)

Running the Services

Each service can be run independently using Docker Compose:

# Start the Rescue DB (database + API)
cd rescue_db
docker compose up

# Start the Dispatcher service
cd dispatcher
docker compose up

# Start the Priorizer service
cd priorizer
docker compose up

# Start the DataGov Asset Collector
cd datagov/asset
docker compose up

Development Mode

For development, you can run services locally:

# Rescue DB API
cd rescue_db
uv run fastapi dev rescue_api/main.py

# Dispatcher API
cd dispatcher
uv run fastapi dev api/dispatcher_service.py

# Priorizer API
cd priorizer
uv run fastapi dev main.py

πŸ“‘ API Endpoints

Rescue DB API

  • URL: http://localhost:8000/docs (FastAPI interactive docs)

Dispatcher API

  • URL: http://localhost:8001/docs (FastAPI interactive docs)

Priorizer API

  • URL: http://localhost:8002/docs (default)
  • Endpoints:
    • POST /allocate - Assign datasets to nodes
    • POST /release - Release completed datasets
    • POST /update-ckan - Update dataset priorities
    • GET /status - System status and statistics
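A client interaction with `POST /allocate` might look like the sketch below. The payload fields (`node_id`, `capacity_gb`) are assumptions, not the documented request schema; check the live OpenAPI docs at `/docs` for the real one.

```python
# Sketch of building a POST /allocate request for the Priorizer.
# Field names are hypothetical -- consult the service's /docs page.
import json
from urllib.request import Request

def allocate_request(node_id: str, capacity_gb: int,
                     base: str = "http://localhost:8002") -> Request:
    """Build (but don't send) a POST /allocate request."""
    body = json.dumps({"node_id": node_id, "capacity_gb": capacity_gb}).encode()
    return Request(f"{base}/allocate", data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")

req = allocate_request("node-a", 500)
```

Sending it is then just `urllib.request.urlopen(req)` (or the equivalent `requests.post`) once the service is up.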

πŸ”„ Data Flow

  1. πŸ” Discovery: DataGov collector queries CKAN API and scrapes dataset pages
  2. πŸ“ Cataloging: Metadata is extracted and stored in Rescue DB
  3. 🎯 Prioritization: Priorizer assigns and adjusts priority scores
  4. πŸ“¦ Distribution: Dispatcher allocates work to available nodes
  5. πŸ’Ύ Storage: Nodes download and store assigned datasets
  6. πŸ“Š Monitoring: System tracks progress and provides status updates

πŸ› οΈ Development

Database Migrations

cd rescue_db

# Apply all pending migrations
uv run alembic upgrade head

# Generate a new migration from model changes
uv run alembic revision --autogenerate -m "Description of changes"

Testing

# Run dispatcher tests
python test/dispatcher/test_dispatcher.py

# Run datagov tests
python test/datagov/ckan/test_resource.py

πŸ”§ Configuration

Each service has its own configuration:

  • Environment Variables: Copy .env.dist to .env and configure
  • Docker Compose: Service-specific configurations in each directory
  • Database: PostgreSQL with configurable credentials and database name
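As a reference point, a filled-in `.env` might look like the fragment below. The variable names are hypothetical typical values; the authoritative list is each service's own `.env.dist`.

```
# Hypothetical .env sketch -- copy .env.dist and check it for the real
# variable names; these are illustrative values only.
POSTGRES_USER=rescue
POSTGRES_PASSWORD=change-me
POSTGRES_DB=rescue_db
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
```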

πŸ“ Project Structure

offseason-shelter-for-science/
β”œβ”€β”€ πŸ“Š rescue_db/          # Central database and API
β”œβ”€β”€ πŸ•·οΈ datagov/            # Data collection and scraping
β”œβ”€β”€ πŸ“‘ dispatcher/         # Task distribution service
β”œβ”€β”€ 🎯 priorizer/          # Priority management service
└── πŸ§ͺ test/              # Test suites

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

πŸ“„ License

This project is open source and available under the MIT License.

πŸ†˜ Support

For questions or issues, please check the individual service README files or create an issue in the repository.


Built with ❀️ by Data For Science, for climate data preservation
