A distributed data rescue system designed to preserve and manage climate data from data.gov (the US government's open data portal). This project creates a resilient infrastructure for storing and managing climate datasets that might be at risk of being lost or becoming inaccessible.
The "Offseason Shelter for Science" is a microservices-based platform that enables distributed data rescue operations. It provides intelligent prioritization, automated discovery, and resilient storage for climate datasets from government open data portals.
- 🔍 Automated Data Discovery: Scrapes and catalogs datasets from CKAN-based portals
- ⚡ Intelligent Prioritization: Dynamic priority system for dataset rescue order
- 🌐 Distributed Processing: Multi-node architecture for scalable data rescue
- 💾 Resilient Storage: PostgreSQL database with proper relationships and indexing
- 📊 Real-time Monitoring: API endpoints for system status and progress tracking
- 🐳 Containerized Deployment: Docker-based microservices architecture
The project consists of four main microservices, each running in Docker containers:
### Rescue DB
- Purpose: Manages the catalog of datasets, resources, and assets
- Technology: PostgreSQL + FastAPI + SQLAlchemy
- Key Entities: Datasets, Resources, Assets, Organizations, AssetKinds
- Port: 8000
### DataGov Asset Collector
- Purpose: Discovers and catalogs downloadable assets from data.gov
- Technology: Scrapy + Redis + CKAN API
- Features: Web scraping, metadata extraction, caching
- Process: Queries CKAN API → Scrapes dataset pages → Extracts file metadata
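The extraction step can be illustrated with a small helper that walks one CKAN package record and keeps only resources that actually point at a file. The record layout (a `resources` list whose entries carry `url` and `format` fields) matches CKAN's API; the sample data itself is made up:

```python
def extract_assets(package: dict) -> list[dict]:
    """Collect downloadable file metadata from one CKAN package record."""
    assets = []
    for res in package.get("resources", []):
        url = res.get("url")
        if not url:
            continue  # resource has no downloadable link; skip it
        assets.append({
            "name": res.get("name", ""),
            "url": url,
            "format": (res.get("format") or "").lower(),
            "size": res.get("size"),
        })
    return assets

# A package shaped like a CKAN API record; the values are made up.
sample = {
    "name": "daily-surface-temperature",
    "resources": [
        {"name": "CSV export", "url": "https://example.org/temps.csv",
         "format": "CSV", "size": 1024},
        {"name": "Landing page only", "url": "", "format": "HTML"},
    ],
}
assets = extract_assets(sample)
```

The real collector adds Scrapy page scraping and Redis caching on top of this; the helper only shows the metadata-extraction core.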
### Dispatcher
- Purpose: Coordinates allocation of data rescue tasks across nodes
- Technology: FastAPI
- Functionality: Task distribution, resource management
- Port: 8001
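The dispatcher's internals are not shown here, but task distribution can be sketched as a round-robin assignment of datasets to registered nodes (a deliberate simplification of whatever policy the real service uses):

```python
from itertools import cycle

def allocate(datasets: list[str], nodes: list[str]) -> dict[str, list[str]]:
    """Assign dataset rescue tasks to nodes in round-robin order."""
    assignments: dict[str, list[str]] = {node: [] for node in nodes}
    for dataset, node in zip(datasets, cycle(nodes)):
        assignments[node].append(dataset)
    return assignments

plan = allocate(["ds-1", "ds-2", "ds-3"], ["node-a", "node-b"])
```

Round-robin keeps every node busy with roughly equal work; a production dispatcher would also weigh node capacity and dataset size.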
### Priorizer
- Purpose: Manages dataset rescue priorities and resource allocation
- Technology: FastAPI + APScheduler
- Features: Dynamic priority scoring, automatic updates, resource matching
- API: Allocation, release, priority updates, status monitoring
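A dynamic priority score might combine factors such as staleness and download cost. The weights and inputs below are purely illustrative; the real Priorizer's scoring model may differ entirely:

```python
from dataclasses import dataclass

@dataclass
class DatasetInfo:
    size_mb: float
    days_since_seen: int  # days since the portal last confirmed the dataset
    already_rescued: bool

def priority_score(info: DatasetInfo) -> float:
    """Higher score = rescue sooner. Weights here are invented for the demo."""
    if info.already_rescued:
        return 0.0  # nothing left to do
    staleness = min(info.days_since_seen / 30, 1.0)  # cap at one month
    smallness = 1.0 / (1.0 + info.size_mb / 1000)    # favor quick wins
    return round(0.7 * staleness + 0.3 * smallness, 3)

score = priority_score(DatasetInfo(size_mb=500, days_since_seen=15,
                                   already_rescued=False))
```

A scheduler such as APScheduler can then re-run this scoring periodically so priorities track the portal's current state.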
Each service can be run independently using Docker Compose:

```bash
# Start the Rescue DB (database + API)
cd rescue_db
docker compose up
```

```bash
# Start the Dispatcher service
cd dispatcher
docker compose up
```

```bash
# Start the Priorizer service
cd priorizer
docker compose up
```

```bash
# Start the DataGov Asset Collector
cd datagov/asset
docker compose up
```
For development, you can run services locally:

```bash
# Rescue DB API
cd rescue_db
uv run fastapi dev rescue_api/main.py
```

```bash
# Dispatcher API
cd dispatcher
uv run fastapi dev api/dispatcher_service.py
```

```bash
# Priorizer API
cd priorizer
uv run fastapi dev main.py
```
### Rescue DB API
- URL: http://localhost:8000/docs
- Purpose: Database management and dataset catalog

### Dispatcher API
- URL: http://localhost:8001/docs
- Purpose: Task distribution and coordination

### Priorizer API
- URL: http://localhost:8002/docs (default)
- Endpoints:
  - `POST /allocate`: Assign datasets to nodes
  - `POST /release`: Release completed datasets
  - `POST /update-ckan`: Update dataset priorities
  - `GET /status`: System status and statistics
1. 🔍 Discovery: DataGov collector queries CKAN API and scrapes dataset pages
2. 📋 Cataloging: Metadata is extracted and stored in Rescue DB
3. 🎯 Prioritization: Priorizer assigns and adjusts priority scores
4. 📦 Distribution: Dispatcher allocates work to available nodes
5. 💾 Storage: Nodes download and store assigned datasets
6. 📊 Monitoring: System tracks progress and provides status updates
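The flow above, compressed into a toy in-memory walk-through (every name and value is fabricated; the real services communicate over HTTP and a shared database):

```python
# Toy, in-memory walk-through of the data flow; all values are fabricated.
catalog = {}  # stands in for the Rescue DB

# 1-2. Discovery + cataloging
for item in [{"name": "sea-level", "url": "https://example.org/sea.csv"},
             {"name": "ice-extent", "url": "https://example.org/ice.csv"}]:
    catalog[item["name"]] = {"url": item["url"], "priority": 0.0, "node": None}

# 3. Prioritization (placeholder scores)
catalog["sea-level"]["priority"] = 0.9
catalog["ice-extent"]["priority"] = 0.4

# 4. Distribution: hand out work highest-priority first, round-robin over nodes
nodes = ["node-a", "node-b"]
ranked = sorted(catalog, key=lambda name: -catalog[name]["priority"])
for i, name in enumerate(ranked):
    catalog[name]["node"] = nodes[i % len(nodes)]

# 5-6. Each node would now download its assets; status reflects the allocation
status = {name: entry["node"] for name, entry in catalog.items()}
```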
Database schema changes are managed with Alembic:

```bash
cd rescue_db
uv run alembic upgrade head

# Create a new migration
uv run alembic revision --autogenerate -m "Description of changes"
```
Run the test scripts directly from the repository root:

```bash
# Run dispatcher tests
python test/dispatcher/test_dispatcher.py

# Run datagov tests
python test/datagov/ckan/test_resource.py
```
Each service has its own configuration:
- Environment Variables: Copy `.env.dist` to `.env` and configure
- Docker Compose: Service-specific configurations in each directory
- Database: PostgreSQL with configurable credentials and database name
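For example, a `.env` for the database service might hold Postgres credentials. The variable names below follow the official `postgres` Docker image and are only a guess; the authoritative list lives in each service's `.env.dist`:

```env
# Hypothetical values; check the service's .env.dist for the real variables
POSTGRES_USER=rescue
POSTGRES_PASSWORD=change-me
POSTGRES_DB=rescue_db
```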
```
offseason-shelter-for-science/
├── 📁 rescue_db/    # Central database and API
├── 🕷️ datagov/      # Data collection and scraping
├── 📡 dispatcher/   # Task distribution service
├── 🎯 priorizer/    # Priority management service
└── 🧪 test/         # Test suites
```
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
This project is open source and available under the MIT License.
For questions or issues, please check the individual service README files or create an issue in the repository.
Built with ❤️ by Data For Science, for climate data preservation