Skip to content

Conversation

crispy-wonton
Copy link
Collaborator

@crispy-wonton crispy-wonton commented Jun 11, 2025

Fixes #152

Description

Refactor garden size pipeline into metaflow.

New files:

  • /pipeline/flows/run_calculate_garden_size_flow.py - new flow to estimate garden size for individual properties. Refactored version of /pipeline/run_scripts/run_calculate_garden_size.py
  • /utils/parallel_utils.py - new utils file with new function to assist parallelisation.
  • MANIFEST.in - required to import asf_heat_pump_suitability as package in batch machine

Updated files:

  • setup.cfg ; setup.py - changes required to import asf_heat_pump_suitability as package in batch machine

Instructions for Reviewer

I have set up the script to run on a sample of 3 building footprint + land registry file pairs and save it to S3. This is simply to test the flow of the whole pipeline, but the results should still make sense as this is just a subset of gardens. It should be relatively quick to run (~30-60 mins - if it's taking much longer, please kill the run and let me know). Please could you test that the flow runs all the way through successfully.

export METAFLOW_USER=yournamewithnospacesorpunctuation
python asf_heat_pump_suitability/pipeline/flows/run_calculate_garden_size_flow.py --datastore=s3 run --epc s3://asf-daps/lakehouse/processed/epc/old/deduplicated/processed_dedupl-0.parquet --year 2023 --quarter 4 --nations ews --max-workers 1

After the test run is complete, check that there are no flows running with the following line of code:
python asf_heat_pump_suitability/pipeline/flows/run_calculate_garden_size_flow.py --datastore=s3 batch list --my-runs

Please pay special attention to ...

  • Please do test that the creation of the sample works for you.
  • the logic of the refactored flow steps where it has changed from the original script. I've indicated the parts of the script that don't need a thorough review because they are unchanged (except to refactor into metaflow syntax). Please still do review these steps for the new syntax but the logic has been reviewed. And please do review the general overall flow and the documentation.

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review


quarter = Parameter(
name="quarter",
help="EPC data quarter",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
help="EPC data quarter",
help="EPC data quarter, 1-4",

required=True,
default="ews",
)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likely want to run a debug flag here where we are able to run just through the sample of 3 building footprint + land registry file pairs

@helloaidank
Copy link
Collaborator

Thanks for this Roisín, looks great!

High level comments, I suggested potentially:

  • one memory optimisation change around using a generator instead of a list
  • had a question around how land parcels correspond to building footprint
  • suggested including a debugging flag (much like the sampling flags from the other metaflow PRs).

I was able to run over the script, it took about an hour which seems fine! I quickly checked the head of the resulting parquet file, garden sizes seemed to check out, which is good.

Let me know if anything is unclear!

Aidan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parallelise garden size run script
2 participants