
Move dataset of interest from published location online to s3 #1

@danlewis85


Aim
Take an existing online data source and copy it into our S3 bucket. In doing this, we would like to:

  • retain the original file format, only moving the file.
  • use a generic approach that we can tailor to most or all data sources of interest.
  • store target file URLs in a config YAML.
  • capture file provenance and store it alongside the raw data file.
  • store captured files under a common naming convention.

General approach
A data getter function that makes a GET request for a file, returns the file as a binary object, and uploads it to a designated place in S3.

The data getter should not overwrite an existing file of the same name in the S3 bucket; it should raise an error instead.
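
A minimal sketch of such a getter, assuming `requests` and `boto3` are available; the bucket name is taken from the example paths later in this issue, and the function name `get_data` is illustrative. Note the `head_object` check is not atomic, so two concurrent captures could still race:

```python
import boto3
import requests
from botocore.exceptions import ClientError

BUCKET = "asf-mission-data-tool"  # bucket name taken from the example paths below

def get_data(file_url: str, canonical_name: str) -> str:
    """Fetch a file unchanged from file_url and upload it to the bronze layer in S3."""
    response = requests.get(file_url, timeout=60)
    response.raise_for_status()

    filename = file_url.rsplit("/", 1)[-1]  # retain the original name and extension
    key = f"bronze/{canonical_name}/{filename}"

    s3 = boto3.client("s3")
    # Refuse to overwrite: check for an existing object before uploading.
    # (Not atomic; a conditional put would close the race window.)
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
    except ClientError as err:
        if err.response["Error"]["Code"] != "404":
            raise
    else:
        raise FileExistsError(f"s3://{BUCKET}/{key} already exists")

    s3.put_object(Bucket=BUCKET, Key=key, Body=response.content)
    return key
```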

Naming
In keeping with a medallion architecture, these files will be labelled 'bronze'.

The dataset will be given a canonical name (i.e. a fixed, unique identifier by which we can identify any instance of the dataset) without date-time qualification, e.g. 'heat pump deployment statistics', not 'heat pump deployment statistics: September 2024'.

The stored file will retain the same name and extension as the original file, e.g. s3://asf-mission-data-tool/bronze/heat_pump_deployment_statistics/Heat_pump_deployment_quarterly_statistics_United_Kingdom_2024_Q3.xlsx
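
Putting the two rules together, the object key can be derived mechanically from the canonical name and the source URL. A sketch, assuming the canonical name is normalised to snake_case for the S3 prefix (the function name is illustrative and the source URL in the comment is a placeholder):

```python
def bronze_key(canonical_name: str, file_url: str) -> str:
    """Build the bronze-layer S3 key: canonical-name prefix plus the original file name."""
    prefix = canonical_name.lower().replace(" ", "_")
    filename = file_url.rsplit("/", 1)[-1]  # original name and extension retained
    return f"bronze/{prefix}/{filename}"

# bronze_key("heat pump deployment statistics",
#            "https://example.org/Heat_pump_deployment_quarterly_statistics_United_Kingdom_2024_Q3.xlsx")
# -> "bronze/heat_pump_deployment_statistics/Heat_pump_deployment_quarterly_statistics_United_Kingdom_2024_Q3.xlsx"
```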

Provenance
It may make sense to implement provenance as a decorator for the main data getter function; that way it can be maintained independently of the data getter code. Provenance will be captured as a 'sidecar' file in a structured format (e.g. YAML or TOML, as in the example below).
The key things we want to capture are:

  • original URL of the file
  • (possibly) collection URL of the dataset
  • date-time of capture
  • user id (possibly the user's git id?)

The sidecar file will have the same name as the data file plus an extension, e.g. s3://asf-mission-data-tool/bronze/heat_pump_deployment_statistics/Heat_pump_deployment_quarterly_statistics_United_Kingdom_2024_Q3.xlsx.toml
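
One way to sketch the decorator, assuming the getter above returns the S3 key it wrote; YAML is used for the sidecar here (a TOML writer such as tomli-w could be swapped in if we settle on the .toml extension), and the git user name stands in for a user id:

```python
import functools
import getpass
import subprocess
from datetime import datetime, timezone

import boto3
import yaml

BUCKET = "asf-mission-data-tool"  # same bucket as the getter sketch

def _git_user():
    """Best-effort user id from git config; None if git is unavailable."""
    try:
        return subprocess.check_output(["git", "config", "user.name"], text=True).strip()
    except (OSError, subprocess.CalledProcessError):
        return None

def capture_provenance(getter):
    """Wrap a data getter so every capture also writes a provenance sidecar to S3."""
    @functools.wraps(getter)
    def wrapper(file_url, canonical_name, collection_url=None, **kwargs):
        key = getter(file_url, canonical_name, **kwargs)
        provenance = {
            "file_url": file_url,
            "collection_url": collection_url,  # may be None / NA
            "capture_datetime": datetime.now(timezone.utc).isoformat(),
            "user_id": _git_user() or getpass.getuser(),
        }
        boto3.client("s3").put_object(
            Bucket=BUCKET,
            Key=f"{key}.yaml",  # sidecar: data file name plus an extension
            Body=yaml.safe_dump(provenance).encode("utf-8"),
        )
        return key
    return wrapper
```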

Usage
A data pipeline will be created for each dataset, in which the first instruction is a call to get the latest data. The information needed to do this is held in a config file.
The config file (YAML) will have a structure along these lines:

```yaml
---
dataset:
    canonical_name:
        file_url: http://etc
        collection_url: http://etc or NA
    canonical_name:
        file_url: http://etc
        collection_url: http://etc or NA
...
```
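
For illustration, the first pipeline instruction might then look like the following, assuming a hypothetical datasets.yaml path and the `get_data` / `capture_provenance` sketches above:

```python
import yaml

# The getter sketched above, wrapped so provenance is captured on every call.
get_latest = capture_provenance(get_data)

def fetch_raw(canonical_name: str, config_path: str = "datasets.yaml") -> str:
    """First pipeline instruction: fetch the latest raw file for one dataset."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    entry = config["dataset"][canonical_name]
    return get_latest(
        entry["file_url"],
        canonical_name,
        collection_url=entry.get("collection_url"),
    )
```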

We will likely expand the dataset config to capture standard within-dataset information (e.g. worksheet names, cell ranges, etc.) at the ETL stage.
