Description
Aim
Take an existing online data source and put it in our s3 bucket. In doing this, we would like to:
- retain original file format - only moving the file.
- use a generic approach that we can tailor to most/all data sources of interest.
- store target file urls in a config yaml.
- capture file provenance and store alongside raw data file.
- store captured files under a common naming convention.
General approach
A data getter function that makes a GET request for a file, receives the file as a binary object, and uploads it to a designated place in s3.
The data getter should not overwrite an existing file of the same name in the s3 bucket; instead it should throw an error.
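A minimal sketch of what this could look like, assuming the bucket name `asf-mission-data-tool` and the function/parameter names shown here (none of which are fixed yet):

```python
import requests
import boto3
from botocore.exceptions import ClientError


def get_data(file_url: str, s3_key: str, bucket: str = "asf-mission-data-tool") -> None:
    """Download a file and upload it, unchanged, to s3. Refuses to overwrite."""
    s3 = boto3.client("s3")

    # Fail if an object with this key already exists rather than overwriting it.
    try:
        s3.head_object(Bucket=bucket, Key=s3_key)
        raise FileExistsError(f"s3://{bucket}/{s3_key} already exists")
    except ClientError as err:
        if err.response["Error"]["Code"] != "404":
            raise  # a genuine s3 error, not just "object not found"

    # Fetch the file as bytes, preserving the original format.
    response = requests.get(file_url, timeout=60)
    response.raise_for_status()

    # Upload the raw bytes to the designated place in the bucket.
    s3.put_object(Bucket=bucket, Key=s3_key, Body=response.content)
```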
Naming
Adopting a medallion architecture, these files will be labelled 'bronze'.
The dataset will be given a canonical name (i.e. a fixed, unique identifier by which we can identify any instance of the dataset) without date-time qualification, e.g. 'heat pump deployment statistics' rather than 'heat pump deployment statistics: September 2024'.
The stored file will retain the same name and extension as the original file. e.g. s3://asf-mission-data-tool/bronze/heat_pump_deployment_statistics/Heat_pump_deployment_quarterly_statistics_United_Kingdom_2024_Q3.xlsx
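As a sketch of the key construction, assuming canonical names are lower-cased and underscore-separated (the helper name `bronze_key` is illustrative):

```python
from urllib.parse import urlparse
from pathlib import PurePosixPath


def bronze_key(canonical_name: str, file_url: str) -> str:
    """Build the s3 key: bronze/<canonical_name>/<original file name>."""
    dataset_dir = canonical_name.strip().lower().replace(" ", "_")
    original_filename = PurePosixPath(urlparse(file_url).path).name
    return f"bronze/{dataset_dir}/{original_filename}"


# bronze_key("heat pump deployment statistics", "https://.../Heat_pump_deployment_quarterly_statistics_United_Kingdom_2024_Q3.xlsx")
# -> "bronze/heat_pump_deployment_statistics/Heat_pump_deployment_quarterly_statistics_United_Kingdom_2024_Q3.xlsx"
```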
Provenance
It may make sense to implement provenance as a decorator for the main data getter function, so that it can be maintained independently of the data getter code. Provenance will be captured as a 'sidecar' file in a structured format (e.g. yaml or toml).
The key things we want to capture are:
- original url of file
- (possibly) collection url of dataset
- date time of capture
- user id (possibly the user's git id?)
The sidecar file will have the same name as the data file plus an extension e.g. s3://asf-mission-data-tool/bronze/heat_pump_deployment_statistics/Heat_pump_deployment_quarterly_statistics_United_Kingdom_2024_Q3.xlsx.toml
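A sketch of the decorator approach, assuming the `get_data` signature from the earlier sketch and a yaml sidecar; the function names and provenance field names are illustrative, not decided:

```python
import functools
import getpass
from datetime import datetime, timezone

import boto3
import yaml


def capture_provenance(func):
    """Wrap a data getter so a provenance sidecar is written after upload."""

    @functools.wraps(func)
    def wrapper(file_url, s3_key, bucket="asf-mission-data-tool", collection_url=None, **kwargs):
        # Run the wrapped data getter first; if it raises, no sidecar is written.
        func(file_url, s3_key, bucket=bucket, **kwargs)

        provenance = {
            "original_url": file_url,
            "collection_url": collection_url,
            "captured_at": datetime.now(timezone.utc).isoformat(),
            "captured_by": getpass.getuser(),  # placeholder; could be the user's git id
        }

        # Sidecar object: same key as the data file plus an extension.
        boto3.client("s3").put_object(
            Bucket=bucket,
            Key=f"{s3_key}.yaml",
            Body=yaml.safe_dump(provenance).encode("utf-8"),
        )

    return wrapper
```

Applied as `@capture_provenance` on top of the data getter, the sidecar is produced without touching the getter's own code.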
Usage
A data pipeline will be created for each dataset, in which the first instruction is a call to get the latest data. The relevant info to do this is held in a config file.
The config file (yaml) will have a suitable structure, like:
```yaml
---
dataset:
  canonical_name:
    file_url: http://etc
    collection_url: http://etc or NA
  canonical_name:
    file_url: http://etc
    collection_url: http://etc or NA
  ...
```
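A sketch of the first pipeline step, assuming the config above is saved as `datasets.yaml` and the `get_data` / `bronze_key` helpers sketched earlier exist (file name and key names are illustrative):

```python
import yaml

with open("datasets.yaml") as f:
    config = yaml.safe_load(f)

# Fetch the latest copy of each configured dataset into the bronze layer.
for canonical_name, source in config["dataset"].items():
    key = bronze_key(canonical_name, source["file_url"])
    get_data(source["file_url"], key)
```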
We will likely expand the dataset config to capture standard within-dataset information (e.g. worksheet names, cell ranges, etc.) at the ETL stage.