
Conversation

@crispy-wonton commented Feb 11, 2025

Fixes #123

Description

  • Update England building conservation area dataset with Feb 2025 data. NB: this will change our results, so we need to rerun the ADD FEATURES script and the CALCULATE SUITABILITY script. I can do this once this PR has been reviewed.
  • Add deduplication of geometries in England and Wales building conservation areas (this shouldn't change our results). A minimal sketch of one possible dedup approach follows this list.
  • Add analysis of Historic England building conservation area dataset to check completeness (described in issue and notebook)
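For illustration, here is a minimal sketch of one way geometries can be deduplicated; the file path, frame name, and exact method are hypothetical, and the PR's actual implementation may differ.

import geopandas as gpd

# Hypothetical input: conservation area polygons for England and Wales
cons_areas_gdf = gpd.read_file("england_wales_conservation_areas.gpkg")

# Drop rows whose geometry is an exact duplicate (compared via WKB encoding)
cons_areas_gdf = cons_areas_gdf[~cons_areas_gdf.geometry.to_wkb().duplicated()]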

Instructions for Reviewer

In order to test the code in this PR, run the following line in a terminal:
jupytext --to notebook asf_heat_pump_suitability/analysis/protected_areas/20250207_missing_protected_areas.py

Then you can run the notebook.

Please pay special attention to ...

  • Whether the analysis is done correctly, especially how the joins are made
  • The results of the analysis. Ultimately, this analysis was intended to help us decide whether we can fill nulls of in_protected_area with False (see the one-line sketch after this list). I think the analysis shows we could, although we would have to accept that some conservation areas may be missing for some LAs. Let me know what you think.
  • Is there anything further you would be interested in me exploring? We could go down the route of improving the deduplication, but I didn't want to get too far into that before sharing these initial results.
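For context, the null-filling itself would be a one-liner in polars; this sketch uses a toy frame, and the real column lives in the suitability pipeline.

import polars as pl

# Toy frame with a nullable in_protected_area column
df = pl.DataFrame({"in_protected_area": [True, None, False]})

# Treat missing protected-area flags as False
df = df.with_columns(pl.col("in_protected_area").fill_null(False))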

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@caldwellst left a comment:

Hey Roisin, nice stuff!

I left some comments in the notebook. Overall the analysis looks okay; I think one of the joins needs fixing, and a bit more commenting would have helped me follow along. I've put questions where I had issues.

However, overall, the results seem to indicate we shouldn't be dropping these. I thought it would be good to go through and check the two most "off" local authorities.

Stratford-on-Avon is just missing loads of data in the Historic England dataset, as you note in your analysis. I looked at their website and just can't seem to find any of the conservation areas they list when searching by name. Oxford Canal is, weirdly, in the HE data but isn't on their site, while the other area we have in the dataset, Whichford, is in there. Seems there's not much we can do about this missingness issue?

I also checked out Derbyshire Dales. This is a case where we have too many conservation areas compared to the number Derbyshire Dales reports. After some manual investigation, it turns out these extras are actually conservation areas within the Peak District National Park!

Could this be a common driver of differences? LAs may not be responsible for management of conservation areas that fall under a national park designation, but that isn't reflected in the HE/Wales data?

On duplicated names, I think there are a few different options that you outline! I spot-checked a few, and a lot seem to be contiguous areas stored as separate shapes. However, there are some that are not next to each other. Not sure if you'd be able to fully validate these. These are probably cases where it's difficult to do anything automatically, and we wouldn't necessarily want to without more confidence. The problem case would be ones like I saw in my GitHub notifications, where an entire area is classed as a conservation area.

You could check that there are no conservation area polygons fully contained within another. You could do a spatial join where you expect each area to contain only itself, and flag the others to check. Those would be ones where we maybe check manually and, if we agree, remove automatically, so as not to overestimate conservation area coverage through some wild setup! Maybe @lizgzil already checked for this in #130?
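A minimal sketch of that containment check, assuming a GeoDataFrame cons_areas_gdf of conservation area polygons with a hypothetical unique area_id column:

import geopandas as gpd

# Self-join: find polygons that sit fully within another polygon
contained = gpd.sjoin(cons_areas_gdf, cons_areas_gdf, predicate="within")

# Every polygon is trivially within itself, so drop the self-matches;
# whatever remains is a candidate for manual review and possible removal
flagged = contained[contained["area_id_left"] != contained["area_id_right"]]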

counts_df = pl.read_csv(
    "s3://asf-heat-pump-suitability/evaluation/building_conservation_area_datasets/building_conservation_area_counts_sample.csv"
)
cons_areas_df = cons_areas_df.join(counts_df, how="inner", on="LAD23NM")
@caldwellst:
This is dropping 3 rows where we have no conservation areas: Wakefield, Wokingham, and Leeds. Not sure if that's intentional, but if not, we should do an outer join and fill in in_conservation_area_ew with 0, I think?
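Something like this, perhaps; a sketch against the snippet above, noting that recent polars spells the outer join how="full":

import polars as pl

# Keep LADs even when they have no matching conservation areas...
cons_areas_df = cons_areas_df.join(
    counts_df, how="full", on="LAD23NM", coalesce=True
)

# ...and treat their missing counts as zero
cons_areas_df = cons_areas_df.with_columns(
    pl.col("in_conservation_area_ew").fill_null(0)
)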

@caldwellst:
These seem to be lost in the initial filtering on the overlay? Or at least some of them are.

@crispy-wonton (author):
Great spot, thank you! Yes, you are correct: they are lost in the filtering step. When I remove the filtering, I can see they are all retained. Between them they get matched to 11 conservation areas, with all but 1 match being <1% of the conservation area. The other is 2.96%.

@caldwellst:
Ah I see!

# Load geospatial boundaries of local authorities
council_bounds = get_datasets.load_gdf_ons_council_bounds()

# %%
@caldwellst:
What's the logic of the following section? We lose about 7% of our data here. Is it just that we don't trust a conservation area if it falls outside the council bounds? Would be good to add justification here.

@crispy-wonton (author):
Sorry, you're right about the documentation, there isn't enough! I will go back and add more.

Because the analysis is focussing on conservation area data availability for Local Authority Districts (councils), the steps below are meant to join the conservation areas to their local authorities. I tried out a couple of different join methods and this one seemed to produce the best results when comparing to our count data.

E.g. sjoin with the intersects predicate seemed too permissive: there was a high number of LADs with too many conservation areas. sjoin with the contains predicate (as in, LAD contains conservation area) seemed too restrictive: there was a high number of LADs with not enough conservation areas.

As a result, I chose this methodology: I calculated what percentage of each conservation area fell within the LAD and kept matches where at least 10% of the conservation area was found in the LAD. This was to avoid matching small intersections with conservation areas from neighbouring LADs, because my assumption was that these wouldn't be included in the LAD's conservation area count. A rough sketch of the step is below.
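For reference, a sketch reusing the names from the snippets quoted in this thread, and assuming both layers share a projected CRS:

import geopandas as gpd

# Record each conservation area's full size before intersecting
full_cons_areas_gdf["cons_area_size_m2"] = full_cons_areas_gdf["geometry"].area

# Split conservation areas along local authority (LAD) boundaries
overlaid = gpd.overlay(full_cons_areas_gdf, council_bounds, how="intersection")

# Keep a match only where at least 10% of the conservation area lies in the LAD
overlaid["pct_in_lad"] = 100 * overlaid.geometry.area / overlaid["cons_area_size_m2"]
matched = overlaid[overlaid["pct_in_lad"] >= 10]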

However, this is definitely an assumption so happy to change the methodology. Did you have another idea of how it could be improved?

Also, when you say we lose 7% of our data, which data do you mean?

@caldwellst:
I see now! I was slightly misunderstanding what was happening, but this is just for checking against the current setup. The approach makes sense. You could consider just taking the max for each conservation area to maintain the original row count, but after your filtering I think there are only 20 or so rows duplicated across LADs, so not a huge deal for just this check. I was getting confused between this approach for the analysis and how it might be applied across LSOAs, but I see now what was happening.
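That max-based variant could look something like this, assuming the overlaid frame and pct_in_lad column from the sketch above, plus a hypothetical unique area_id per conservation area:

# One row per conservation area: keep only the LAD with the largest overlap
best_match = overlaid.loc[overlaid.groupby("area_id")["pct_in_lad"].idxmax()]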


# %%
full_cons_areas_gdf["cons_area_size_m2"] = full_cons_areas_gdf["geometry"].area
full_cons_areas_gdf = gpd.overlay(
@caldwellst:
Also, I just noticed that because we are doing the intersection, we are dropping over 400 rows from the original data. A lot of this is in places like Anglesey, where I can see Beaumaris and the Menai Bridge, but also some villages in England as I spot-check. I guess there's not too much to do, since we are validating against the boundaries we have, but it's interesting as it could be another way we are losing data relative to the boundaries we do have.
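If it helps to quantify this, a hypothetical anti-join against the overlay output would list exactly which areas the intersection drops (again assuming a unique area_id column):

# Conservation areas that intersect no LAD boundary at all
dropped = full_cons_areas_gdf[
    ~full_cons_areas_gdf["area_id"].isin(overlaid["area_id"])
]
print(f"{len(dropped)} conservation areas fall outside the LAD boundaries")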
