@eglucas eglucas commented Jan 29, 2025


Description

Addresses issue #1

Data getter functions
This PR introduces a data getter function that extracts a file from its original source and uploads it to the designated s3 bucket.

The following functions have been introduced in the data_getters.py module:

  • save_to_s3_bronze(dataset_name, target_url): This function requests the target file, checks whether it already exists in the s3 bucket (prompting the user to confirm overwrite if it does), and uploads it to s3. It returns the s3 path to which the main file was uploaded.

  • _save_provenance_to_toml(func): Decorator for the main function save_to_s3_bronze that captures provenance metadata and uploads it as a .toml to s3 alongside the main file. The decorator also checks whether a sidecar .toml for the main file already exists in the bucket, prompting for a separate user input to confirm overwrite. As discussed offline, I have left these overwrite prompts separate but would welcome suggestions on how to combine them. (A sketch of the decorator pattern follows this list.)

  • get_latest_version(dataset_name, filter): This function returns information about the latest version of a dataset from the base.yaml config file. For a given dataset, it identifies the latest version using the release_date field and returns all information about that version as a dictionary. Its primary use is to retrieve the file_url of the latest dataset version.

  • append_file_bronze_to_latest_version(dataset_name, s3_file_path, filter): Updates the base.yaml config file, adding a file_bronze field to the latest version of the dataset.
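
To make the decorator pattern concrete, here is a minimal sketch of how _save_provenance_to_toml could wrap save_to_s3_bronze; the metadata fields and upload step are illustrative assumptions, not the exact implementation:

import functools
from datetime import datetime, timezone

def _save_provenance_to_toml(func):
    """Capture provenance for a bronze upload and save it as a sidecar .toml."""

    @functools.wraps(func)
    def wrapper(dataset_name, target_url, *args, **kwargs):
        s3_path = func(dataset_name, target_url, *args, **kwargs)
        # Illustrative provenance fields; the real module may capture more
        provenance = {
            "dataset_name": dataset_name,
            "source_url": target_url,
            "s3_path": s3_path,
            "retrieved_at": datetime.now(timezone.utc).isoformat(),
        }
        # Here the dict would be serialised to .toml and uploaded alongside
        # the main file, after its own overwrite check
        return s3_path

    return wrapper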

Dataset information in config file
Information on each dataset is stored in the base.yaml config file. The general structure is as follows (note: I've enabled automatic alphabetical sorting of keys in append_file_bronze_to_latest_version()):

dataset:
  canonical_name_of_dataset:
    collection_url: https://...
    versions:
    - file_bronze:
        - s3://.../file_1
        - s3://.../file_2
      file_url:
        - https://.../file_1
        - https://.../file_2
      page_url: https://...
      release_date: YYYY-MM-DD
    - file_bronze:
        - s3://.../file_1
        - s3://.../file_2
      file_url:
        - https://.../file_1
        - https://.../file_2
      page_url: https://...
      release_date: YYYY-MM-DD  

Lists are used for:

  • Multiple versions of a dataset (under versions)
  • Multiple files in a dataset (under file_url and file_bronze). The order of files across these two fields is currently kept consistent in a semi-manual way, as the file_bronze list is populated by looping through each item in the file_url list (see the script sketch in the next section).

Extraction stage scripts for each shortlisted dataset
In the pipeline folder, there are child folders for 12 datasets, each containing a get_[dataset]_to_bronze.py script. Running this script for a dataset looks up the latest version, moves the file(s) to s3, and appends the s3 destination path(s) to the config file.
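
As a rough sketch of that shared structure (the import path and dataset name are placeholders, not the exact repo layout):

from data_getters import (
    append_file_bronze_to_latest_version,
    get_latest_version,
    save_to_s3_bronze,
)

dataset_name = "example_dataset"  # placeholder canonical name from base.yaml

latest_version = get_latest_version(dataset_name)

# Upload each source file to the bronze s3 bucket, preserving file_url order
s3_file_path = [
    save_to_s3_bronze(dataset_name, url) for url in latest_version["file_url"]
]

# Record the s3 destination paths in base.yaml under file_bronze
# (filter argument omitted here for brevity)
append_file_bronze_to_latest_version(dataset_name, s3_file_path)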

All follow a similar structure, except for the Public Attitudes Tracking Survey dataset, where the filter argument is needed to upload the latest version of each season (summer, spring, winter), along the lines of the loop below.
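
For that dataset, the season values are passed as the filter (the canonical dataset name here is illustrative):

for season in ("summer", "spring", "winter"):
    latest_version = get_latest_version("public_attitudes_tracking_survey", filter=season)
    # ...then upload and append to config as in the sketch above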

Instructions for Reviewer

For this PR, please review the data getters module and its usage, checking for any bugs, missing documentation, and opportunities for improvement.

Please also check that running the get_[dataset]_to_bronze.py scripts produces the expected output in terms of what is uploaded to the s3 bucket, and that user input to overwrite/abort the upload is requested in the correct circumstances.

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@eglucas eglucas requested a review from danlewis85 January 29, 2025 11:35
@eglucas eglucas self-assigned this Jan 29, 2025
@eglucas eglucas linked an issue Jan 29, 2025 that may be closed by this pull request

Review comment on append_file_bronze_to_latest_version in data_getters.py:
latest_version = max(
versions, key=lambda x: datetime.strptime(x["release_date"], "%Y-%m-%d")
)
latest_version["file_bronze"] = s3_file_path
@eglucas eglucas (Author) commented Feb 3, 2025
Suggested change:
- latest_version["file_bronze"] = s3_file_path
+ latest_version[new_field_name] = s3_file_path

While developing the ETL flow to Silver, I realised that this entire function could be modified to be more generic, taking an argument for any new field name to append to config. Just flagging it here as a change we could implement before merging this branch.
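
A possible generic form, building on the suggested change (the function name and the _get_versions helper are hypothetical):

from datetime import datetime

def append_field_to_latest_version(dataset_name, value, new_field_name, filter=None):
    versions = _get_versions(dataset_name, filter)  # hypothetical config lookup
    latest_version = max(
        versions, key=lambda x: datetime.strptime(x["release_date"], "%Y-%m-%d")
    )
    latest_version[new_field_name] = value

Second review comment, on the uk_territorial_greenhouse_gas_emissions_statistics script: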


dataset_name = "uk_territorial_greenhouse_gas_emissions_statistics"

latest_version = get_latest_version(dataset_name)
@eglucas eglucas (Author) commented:

Suggested change:
- latest_version = get_latest_version(dataset_name)
+ latest_version = get_latest_version(dataset_name, filter="final")

Also wanted to note that the most recent version of the dataset is provisional for 2023. If we want the most recent "final" version (which has different files and tables), we can pass this filter argument.

Successfully merging this pull request may close these issues:

  • Move dataset of interest from published location online to s3