Description
Aim
Take an existing online data source and put it in our s3 bucket. In doing this, we would like to:
- retain original file format - only moving the file.
- use a generic approach that we can tailor to most/all data sources of interest.
- store target file urls in a config yaml.
- capture file provenance and store alongside raw data file.
- store captured files under a common naming convention.
General approach
A data getter function that makes a GET request for a file, receives the file as a binary object, and uploads it to a designated place in s3.
The data getter should not overwrite an existing file of the same name in the s3 bucket; instead it should throw an error.
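A minimal sketch of what this could look like, assuming the bucket name `asf-mission-data-tool` and the function/parameter names shown here (none of which are fixed yet):

```python
import requests
import boto3
from botocore.exceptions import ClientError


def get_data(file_url: str, s3_key: str, bucket: str = "asf-mission-data-tool") -> None:
    """Download a file and upload it, unchanged, to s3. Refuses to overwrite."""
    s3 = boto3.client("s3")

    # Fail if an object with this key already exists rather than overwriting it.
    try:
        s3.head_object(Bucket=bucket, Key=s3_key)
        raise FileExistsError(f"s3://{bucket}/{s3_key} already exists")
    except ClientError as err:
        if err.response["Error"]["Code"] != "404":
            raise  # a genuine s3 error, not just "object not found"

    # Fetch the file as bytes, preserving the original format.
    response = requests.get(file_url, timeout=60)
    response.raise_for_status()

    # Upload the raw bytes to the designated place in the bucket.
    s3.put_object(Bucket=bucket, Key=s3_key, Body=response.content)
```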
Naming
Adopting a medallion architecture, these files will be labelled 'bronze'.
The dataset will be given a canonical name (i.e. a fixed, unique identifier by which we can identify any instance of the dataset) without date-time qualification, e.g. 'heat pump deployment statistics' rather than 'heat pump deployment statistics: September 2024'.
The stored file will retain the same name and extension as the original file. e.g. s3://asf-mission-data-tool/bronze/heat_pump_deployment_statistics/Heat_pump_deployment_quarterly_statistics_United_Kingdom_2024_Q3.xlsx
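As a sketch of the key construction, assuming canonical names are lower-cased and underscore-separated (the helper name `bronze_key` is illustrative):

```python
from urllib.parse import urlparse
from pathlib import PurePosixPath


def bronze_key(canonical_name: str, file_url: str) -> str:
    """Build the s3 key: bronze/<canonical_name>/<original file name>."""
    dataset_dir = canonical_name.strip().lower().replace(" ", "_")
    original_filename = PurePosixPath(urlparse(file_url).path).name
    return f"bronze/{dataset_dir}/{original_filename}"


# bronze_key("heat pump deployment statistics", "https://.../Heat_pump_deployment_quarterly_statistics_United_Kingdom_2024_Q3.xlsx")
# -> "bronze/heat_pump_deployment_statistics/Heat_pump_deployment_quarterly_statistics_United_Kingdom_2024_Q3.xlsx"
```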
Provenance
It may make sense to implement provenance as a decorator for the main data getter function, so that it can be maintained independently of the data getter code. Provenance will be captured as a 'sidecar' file in a structured format (e.g. yaml or toml).
The key things we want to capture are:
- original url of file
- (possibly) collection url of dataset
- date time of capture
- user id (possibly the user's git id?)
The sidecar file will have the same name as the data file plus an extension e.g. s3://asf-mission-data-tool/bronze/heat_pump_deployment_statistics/Heat_pump_deployment_quarterly_statistics_United_Kingdom_2024_Q3.xlsx.toml
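A sketch of the decorator approach, assuming the `get_data` signature from the earlier sketch and a yaml sidecar; the function names and provenance field names are illustrative, not decided:

```python
import functools
import getpass
from datetime import datetime, timezone

import boto3
import yaml


def capture_provenance(func):
    """Wrap a data getter so a provenance sidecar is written after upload."""

    @functools.wraps(func)
    def wrapper(file_url, s3_key, bucket="asf-mission-data-tool", collection_url=None, **kwargs):
        # Run the wrapped data getter first; if it raises, no sidecar is written.
        func(file_url, s3_key, bucket=bucket, **kwargs)

        provenance = {
            "original_url": file_url,
            "collection_url": collection_url,
            "captured_at": datetime.now(timezone.utc).isoformat(),
            "captured_by": getpass.getuser(),  # placeholder; could be the user's git id
        }

        # Sidecar object: same key as the data file plus an extension.
        boto3.client("s3").put_object(
            Bucket=bucket,
            Key=f"{s3_key}.yaml",
            Body=yaml.safe_dump(provenance).encode("utf-8"),
        )

    return wrapper
```

Applied as `@capture_provenance` on top of the data getter, the sidecar is produced without touching the getter's own code.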
Usage
A data pipeline will be created for each dataset, in which the first instruction is a call to get the latest data. The relevant info to do this is held in a config file.
The config file (yaml) will have a suitable structure, like:
```yaml
---
dataset:
  canonical_name:
    file_url: http://etc
    collection_url: http://etc or NA
  canonical_name:
    file_url: http://etc
    collection_url: http://etc or NA
  ...
```
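A sketch of the first pipeline step, assuming the config above is saved as `datasets.yaml` and the `get_data` / `bronze_key` helpers sketched earlier exist (file name and key names are illustrative):

```python
import yaml

with open("datasets.yaml") as f:
    config = yaml.safe_load(f)

# Fetch the latest copy of each configured dataset into the bronze layer.
for canonical_name, source in config["dataset"].items():
    key = bronze_key(canonical_name, source["file_url"])
    get_data(source["file_url"], key)
```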
We will likely expand the dataset config to capture standard within-dataset information (e.g. worksheet names, cell ranges, etc.) at the ETL stage.