1 move dataset of interest from published location online to s3 #3
base: dev
Conversation
```python
latest_version = max(
    versions, key=lambda x: datetime.strptime(x["release_date"], "%Y-%m-%d")
)
latest_version["file_bronze"] = s3_file_path
```
Suggested change:
```diff
- latest_version["file_bronze"] = s3_file_path
+ latest_version[new_field_name] = s3_file_path
```
While developing the ETL flow to Silver, I realised that this entire function could be modified to be more generic, taking an argument for any new field name to append to config. Just flagging it here as a change we could implement before merging this branch.
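For what it's worth, a minimal sketch of the generalised version might look like this. `new_field_name` comes from the suggestion above; the config path and the yaml read/write are assumptions standing in for however the config is actually loaded and saved:

```python
from datetime import datetime

import yaml

CONFIG_PATH = "config/base.yaml"  # assumed location of the config file


def append_field_to_latest_version(dataset_name, s3_file_path, new_field_name):
    """Hypothetical generalisation of append_file_bronze_to_latest_version:
    new_field_name replaces the hard-coded "file_bronze" key."""
    with open(CONFIG_PATH) as f:
        config = yaml.safe_load(f)
    versions = config[dataset_name]["versions"]
    # Same latest-version selection as in the diff above.
    latest_version = max(
        versions, key=lambda x: datetime.strptime(x["release_date"], "%Y-%m-%d")
    )
    latest_version[new_field_name] = s3_file_path
    with open(CONFIG_PATH, "w") as f:
        # sort_keys=True matches the alphabetical key sorting mentioned below.
        yaml.safe_dump(config, f, sort_keys=True)
```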
```python
dataset_name = "uk_territorial_greenhouse_gas_emissions_statistics"

latest_version = get_latest_version(dataset_name)
```
Suggested change:
```diff
- latest_version = get_latest_version(dataset_name)
+ latest_version = get_latest_version(dataset_name, filter="final")
```
Also wanted to note that the most recent dataset is provisional for 2023. If we want the most recent "final" version (which has different files and tables), then we can pass this filter argument.
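For illustration, a rough sketch of how such a filter might be applied inside `get_latest_version` before the latest release is selected. The yaml loading, the config path, and the `status` key are assumptions; the `max` over `release_date` mirrors the diff above:

```python
from datetime import datetime

import yaml


def get_latest_version(dataset_name, filter=None):
    # Load dataset metadata from the config file (path is an assumption).
    with open("config/base.yaml") as f:
        config = yaml.safe_load(f)
    versions = config[dataset_name]["versions"]
    if filter is not None:
        # Hypothetical filtering step: keep only versions whose status
        # matches the filter, e.g. "final" rather than "provisional".
        versions = [v for v in versions if v.get("status") == filter]
    # Return the version with the most recent release_date.
    return max(
        versions, key=lambda x: datetime.strptime(x["release_date"], "%Y-%m-%d")
    )
```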
Description
Addresses issue #1
Data getter functions
This PR introduces data getter functions that extract a file from its original source and upload it to the designated s3 bucket.
The following functions have been introduced in the `data_getters.py` module (a rough sketch of the first two follows the list):

- `save_to_s3_bronze(dataset_name, target_url)`: Makes a request for the target file, checks whether it already exists in the s3 bucket (prompting the user to confirm overwrite if it does) and uploads it to s3. It returns the s3 path to which the main file was uploaded.
- `_save_provenance_to_toml(func)`: Decorator for the main function `save_to_s3_bronze` that captures provenance metadata and uploads it in a .toml alongside the main file to s3. This decorator also checks whether a sidecar .toml for the main file already exists in the bucket, prompting for a separate user input to confirm overwrite. As discussed offline, I have left these overwrite prompts separate but would welcome any suggestions on how to combine them.
- `get_latest_version(dataset_name, filter)`: Returns information about the latest version of a dataset from the `base.yaml` config file. For a given dataset, it identifies the latest version using the `release_date` field and returns all information about that version as a dictionary. Its primary use is to retrieve the `file_url` of the latest dataset version.
- `append_file_bronze_to_latest_version(dataset_name, s3_file_path, filter)`: Updates the `base.yaml` config file, adding a `file_bronze` field to the latest version of the dataset.
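To make the described behaviour concrete, here is a minimal sketch of the download/upload function and its provenance decorator. This is not the actual implementation: boto3 and requests, the bucket name, the key layout and the provenance fields are all assumptions.

```python
import boto3
import requests
import toml
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-bronze-bucket"  # assumed bucket name


def _object_exists(key):
    """Return True if an object with this key already exists in the bucket."""
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
    except ClientError:
        return False


def _save_provenance_to_toml(func):
    """Decorator: capture provenance metadata and upload it as a sidecar .toml."""

    def wrapper(dataset_name, target_url):
        s3_file_path = func(dataset_name, target_url)
        key = s3_file_path.split(f"s3://{BUCKET}/")[-1] + ".toml"
        # Separate overwrite prompt for the sidecar .toml, as described above.
        if _object_exists(key) and input(
            f"{key} exists. Overwrite? [y/N] "
        ).lower() != "y":
            return s3_file_path
        provenance = {"dataset": dataset_name, "source_url": target_url}
        s3.put_object(Bucket=BUCKET, Key=key, Body=toml.dumps(provenance))
        return s3_file_path

    return wrapper


@_save_provenance_to_toml
def save_to_s3_bronze(dataset_name, target_url):
    """Download the target file and upload it to the bronze s3 bucket."""
    response = requests.get(target_url)
    response.raise_for_status()
    key = f"bronze/{dataset_name}/{target_url.rsplit('/', 1)[-1]}"
    # Prompt before overwriting an existing main file.
    if _object_exists(key) and input(
        f"{key} exists. Overwrite? [y/N] "
    ).lower() != "y":
        raise SystemExit("Upload aborted.")
    s3.put_object(Bucket=BUCKET, Key=key, Body=response.content)
    return f"s3://{BUCKET}/{key}"
```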
Dataset information in config file
Information on each dataset is stored in the `base.yaml` config file. The general structure is illustrated below (Note: I've enabled automatic sorting of keys in alphabetical order in the function `append_file_bronze_to_latest_version()`):
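As an illustration only, a dataset entry might look something like the following. The field names (`versions`, `release_date`, `file_url`, `file_bronze`) come from this PR, while the values and nesting details are placeholders:

```yaml
uk_territorial_greenhouse_gas_emissions_statistics:
  versions:
    - file_bronze:   # appended by append_file_bronze_to_latest_version
        - s3://my-bronze-bucket/bronze/.../emissions_2022.xlsx
      file_url:
        - https://example.gov.uk/emissions_2022.xlsx   # placeholder URL
      release_date: "2024-02-08"   # hypothetical date
```

Note that the keys within each version appear in alphabetical order, reflecting the automatic sorting mentioned above.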
Lists are used for:

- the versions of a dataset (`versions`)
- the files within each version (`file_url` and `file_bronze`). The order of files across these two fields is currently ensured in a semi-manual way, as the `file_bronze` list is populated by looping through each item in the `file_url` list.

Extraction stage scripts for each shortlisted dataset
In the pipeline folder, there are child folders for 12 datasets, each containing a `get_[dataset]_to_bronze.py` script. Running this script for a dataset retrieves the latest version, moves the file(s) to s3 and appends the config file with the s3 destination path. All scripts follow a similar structure, except for the Public Attitudes Tracking Survey dataset, where the `filter` argument needs to be used to upload the latest version of each season (summer, spring, winter).
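For example, an individual `get_[dataset]_to_bronze.py` script presumably looks something like this sketch; the function names are from this PR, but the exact wiring is an assumption based on the descriptions above:

```python
from data_getters import (
    append_file_bronze_to_latest_version,
    get_latest_version,
    save_to_s3_bronze,
)

dataset_name = "uk_territorial_greenhouse_gas_emissions_statistics"

if __name__ == "__main__":
    # Look up the latest version of the dataset in the base.yaml config.
    latest_version = get_latest_version(dataset_name)
    # file_url is a list, so loop over each file in the latest version.
    for target_url in latest_version["file_url"]:
        # Download the file and upload it to the bronze s3 bucket.
        s3_file_path = save_to_s3_bronze(dataset_name, target_url)
        # Record the s3 destination path back in the config file.
        append_file_bronze_to_latest_version(dataset_name, s3_file_path)
```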
Instructions for Reviewer
For this PR, please review the data getters module and its usage, checking for any bugs, missing documentation and opportunities for improvement.
Please also check that running the `get_[dataset]_to_bronze.py` scripts produces the expected output in terms of what is uploaded to the s3 bucket, and that user input is requested in the correct circumstances to overwrite or abort an upload.

Checklist:
- I have refactored my code out from `notebooks/`
- I have run `pre-commit` and addressed any issues not automatically fixed
- I have merged any new changes from `dev`
- Appropriate information has been added to the `README`s