@eglucas eglucas commented Jan 29, 2025


Description

Addresses issue #1

Data getter functions
This PR introduces a data getter function that extracts a file from its original source and uploads it to the designated s3 bucket.

The following functions have been introduced in the data_getters.py module:

  • save_to_s3_bronze(dataset_name, target_url): This function requests the target file, checks whether it already exists in the s3 bucket (prompting the user to confirm overwrite if it does), and uploads it to s3. It returns the s3 path to which the main file was uploaded.

  • _save_provenance_to_toml(func): Decorator for the main function save_to_s3_bronze that captures provenance metadata and uploads it as a .toml to s3 alongside the main file. The decorator also checks whether a sidecar .toml for the main file already exists in the bucket, prompting for a separate user input to confirm overwrite. As discussed offline, I have left these overwrite prompts separate but would welcome suggestions on how to combine them. (A sketch of the decorator pattern follows this list.)

  • get_latest_version(dataset_name, filter): This function returns information about the latest version of a dataset from the base.yaml config file. For a given dataset, it identifies the latest version using the release_date field and returns all information about that version as a dictionary. Its primary use is to retrieve the file_url of the latest dataset version.

  • append_file_bronze_to_latest_version(dataset_name, s3_file_path, filter): Updates the base.yaml config file, adding a file_bronze field to the latest version of the dataset.
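
To make the decorator pattern concrete, here is a minimal sketch of how _save_provenance_to_toml could wrap save_to_s3_bronze; the metadata fields and upload step are illustrative assumptions, not the exact implementation:

import functools
from datetime import datetime, timezone

def _save_provenance_to_toml(func):
    """Capture provenance for a bronze upload and save it as a sidecar .toml."""

    @functools.wraps(func)
    def wrapper(dataset_name, target_url, *args, **kwargs):
        s3_path = func(dataset_name, target_url, *args, **kwargs)
        # Illustrative provenance fields; the real module may capture more
        provenance = {
            "dataset_name": dataset_name,
            "source_url": target_url,
            "s3_path": s3_path,
            "retrieved_at": datetime.now(timezone.utc).isoformat(),
        }
        # Here the dict would be serialised to .toml and uploaded alongside
        # the main file, after its own overwrite check
        return s3_path

    return wrapper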

Dataset information in config file
Information on each dataset is stored in the base.yaml config file. The general structure is as follows (note: I've enabled automatic alphabetical sorting of keys in append_file_bronze_to_latest_version()):

dataset:
  canonical_name_of_dataset:
    collection_url: https://...
    versions:
    - file_bronze:
        - s3://.../file_1
        - s3://.../file_2
      file_url:
        - https://.../file_1
        - https://.../file_2
      page_url: https://...
      release_date: YYYY-MM-DD
    - file_bronze:
        - s3://.../file_1
        - s3://.../file_2
      file_url:
        - https://.../file_1
        - https://.../file_2
      page_url: https://...
      release_date: YYYY-MM-DD  

Lists are used for:

  • Multiple versions of a dataset (under versions)
  • Multiple files in a dataset (under file_url and file_bronze). The order of files across these two fields is currently kept consistent in a semi-manual way, as the file_bronze list is populated by looping through each item in the file_url list (see the script sketch in the next section).

Extraction stage scripts for each shortlisted dataset
In the pipeline folder, there are child folders for 12 datasets, each containing a get_[dataset]_to_bronze.py script. Running this script for a dataset looks up the latest version, moves the file(s) to s3, and appends the s3 destination path(s) to the config file.
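
As a rough sketch of that shared structure (the import path and dataset name are placeholders, not the exact repo layout):

from data_getters import (
    append_file_bronze_to_latest_version,
    get_latest_version,
    save_to_s3_bronze,
)

dataset_name = "example_dataset"  # placeholder canonical name from base.yaml

latest_version = get_latest_version(dataset_name)

# Upload each source file to the bronze s3 bucket, preserving file_url order
s3_file_path = [
    save_to_s3_bronze(dataset_name, url) for url in latest_version["file_url"]
]

# Record the s3 destination paths in base.yaml under file_bronze
# (filter argument omitted here for brevity)
append_file_bronze_to_latest_version(dataset_name, s3_file_path)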

All follow a similar structure, except for the Public Attitudes Tracking Survey dataset, where the filter argument is needed to upload the latest version of each season (summer, spring, winter), along the lines of the loop below.
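
For that dataset, the season values are passed as the filter (the canonical dataset name here is illustrative):

for season in ("summer", "spring", "winter"):
    latest_version = get_latest_version("public_attitudes_tracking_survey", filter=season)
    # ...then upload and append to config as in the sketch above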

Instructions for Reviewer

For this PR, please review the data getters module and its usage, checking for any bugs, missing documentation, and opportunities for improvement.

Please also check that running the get_[dataset]_to_bronze.py scripts produces the expected output in terms of what is uploaded to the s3 bucket, and that user input to overwrite/abort the upload is requested in the correct circumstances.

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@eglucas eglucas requested a review from danlewis85 January 29, 2025 11:35
@eglucas eglucas self-assigned this Jan 29, 2025
@eglucas eglucas linked an issue Jan 29, 2025 that may be closed by this pull request

Review comment on append_file_bronze_to_latest_version in data_getters.py:
latest_version = max(
versions, key=lambda x: datetime.strptime(x["release_date"], "%Y-%m-%d")
)
latest_version["file_bronze"] = s3_file_path
@eglucas eglucas (Author) commented Feb 3, 2025
Suggested change:
- latest_version["file_bronze"] = s3_file_path
+ latest_version[new_field_name] = s3_file_path

While developing the ETL flow to Silver, I realised that this entire function could be modified to be more generic, taking an argument for any new field name to append to config. Just flagging it here as a change we could implement before merging this branch.
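
A possible generic form, building on the suggested change (the function name and the _get_versions helper are hypothetical):

from datetime import datetime

def append_field_to_latest_version(dataset_name, value, new_field_name, filter=None):
    versions = _get_versions(dataset_name, filter)  # hypothetical config lookup
    latest_version = max(
        versions, key=lambda x: datetime.strptime(x["release_date"], "%Y-%m-%d")
    )
    latest_version[new_field_name] = value

Second review comment, on the uk_territorial_greenhouse_gas_emissions_statistics script: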


dataset_name = "uk_territorial_greenhouse_gas_emissions_statistics"

latest_version = get_latest_version(dataset_name)
@eglucas eglucas (Author) commented:

Suggested change:
- latest_version = get_latest_version(dataset_name)
+ latest_version = get_latest_version(dataset_name, filter="final")

Also wanted to note that the most recent version of the dataset is provisional for 2023. If we want the most recent "final" version (which has different files and tables), we can pass this filter argument.

Successfully merging this pull request may close these issues:

  • Move dataset of interest from published location online to s3