
Conversation

@crispy-wonton commented Feb 11, 2025

Fixes #123

Description

  • Update England building conservation area dataset with Feb 2025 data. NB: this will change our results, so we need to rerun the ADD FEATURES script and the CALCULATE SUITABILITY script. I can do this once this PR has been reviewed.
  • Add deduplication of geometries in England and Wales building conservation areas (this shouldn't change our results). A minimal sketch of one possible dedup approach follows this list.
  • Add analysis of Historic England building conservation area dataset to check completeness (described in issue and notebook)
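For illustration, here is a minimal sketch of one way geometries can be deduplicated; the file path, frame name, and exact method are hypothetical, and the PR's actual implementation may differ.

import geopandas as gpd

# Hypothetical input: conservation area polygons for England and Wales
cons_areas_gdf = gpd.read_file("england_wales_conservation_areas.gpkg")

# Drop rows whose geometry is an exact duplicate (compared via WKB encoding)
cons_areas_gdf = cons_areas_gdf[~cons_areas_gdf.geometry.to_wkb().duplicated()]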

Instructions for Reviewer

In order to test the code in this PR, run the following line in a terminal:
jupytext --to notebook asf_heat_pump_suitability/analysis/protected_areas/20250207_missing_protected_areas.py

Then you can run the notebook.

Please pay special attention to ...

  • Whether the analysis is done correctly, especially how the joins are made
  • The results of the analysis. Ultimately, this analysis was intended to help us decide whether we can fill nulls of in_protected_area with False (see the one-line sketch after this list). I think the analysis shows we could, although we would have to accept that some conservation areas may be missing for some LAs. Let me know what you think.
  • Is there anything further you would be interested in me exploring? We could go down the route of improving the deduplication, but I didn't want to get too far into that before sharing these initial results.
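For context, the null-filling itself would be a one-liner in polars; this sketch uses a toy frame, and the real column lives in the suitability pipeline.

import polars as pl

# Toy frame with a nullable in_protected_area column
df = pl.DataFrame({"in_protected_area": [True, None, False]})

# Treat missing protected-area flags as False
df = df.with_columns(pl.col("in_protected_area").fill_null(False))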

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@caldwellst left a comment:

Hey Roisin, nice stuff!

I left some comments in the notebook. Overall the analysis looks okay; I think one of the joins needs fixing, and a bit more commenting would have helped me follow along. I've put questions where I had issues.

However, overall, the results seem to indicate we shouldn't be dropping these. I thought it would be good to go through and check the two most "off" local authorities.

Stratford-on-Avon is just missing loads of data in the Historic England dataset, as you note in your analysis. I looked at their website and just can't seem to find any of the conservation areas they list when searching by name. Oxford Canal is, weirdly, in the HE data but isn't on their site, while the other area we have in the dataset, Whichford, is in there. Seems there's not much we can do about this missingness issue?

I also checked out Derbyshire Dales. This is a case where we have too many conservation areas compared to the number Derbyshire Dales reports. After some manual investigation, it turns out these extras are actually conservation areas within the Peak District National Park!

Could this be a common driver of differences? LAs may not be responsible for management of conservation areas that fall under a national park designation, but that isn't reflected in the HE/Wales data?

On duplicated names, I think there are a few different options that you outline! I spot-checked a few, and a lot seem to be contiguous areas stored as separate shapes. However, there are some that are not next to each other. Not sure if you'd be able to fully validate these. These are probably cases where it's difficult to do anything automatically, and we wouldn't necessarily want to without more confidence. The problem case would be ones like I saw in my GitHub notifications, where an entire area is classed as a conservation area.

You could check that there are no conservation area polygons fully contained within another. You could do a spatial join where you expect each area to contain only itself, and flag the others to check. Those would be ones where we maybe check manually and, if we agree, remove automatically, so as not to overestimate conservation area coverage through some wild setup! Maybe @lizgzil already checked for this in #130?
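A minimal sketch of that containment check, assuming a GeoDataFrame cons_areas_gdf of conservation area polygons with a hypothetical unique area_id column:

import geopandas as gpd

# Self-join: find polygons that sit fully within another polygon
contained = gpd.sjoin(cons_areas_gdf, cons_areas_gdf, predicate="within")

# Every polygon is trivially within itself, so drop the self-matches;
# whatever remains is a candidate for manual review and possible removal
flagged = contained[contained["area_id_left"] != contained["area_id_right"]]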

counts_df = pl.read_csv(
    "s3://asf-heat-pump-suitability/evaluation/building_conservation_area_datasets/building_conservation_area_counts_sample.csv"
)
cons_areas_df = cons_areas_df.join(counts_df, how="inner", on="LAD23NM")
@caldwellst:
This is dropping 3 rows where we have no conservation areas: Wakefield, Wokingham, and Leeds. Not sure if that's intentional, but if not, we should do an outer join and fill in in_conservation_area_ew with 0, I think?
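Something like this, perhaps; a sketch against the snippet above, noting that recent polars spells the outer join how="full":

import polars as pl

# Keep LADs even when they have no matching conservation areas...
cons_areas_df = cons_areas_df.join(
    counts_df, how="full", on="LAD23NM", coalesce=True
)

# ...and treat their missing counts as zero
cons_areas_df = cons_areas_df.with_columns(
    pl.col("in_conservation_area_ew").fill_null(0)
)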

@caldwellst:
These seem to be lost in the initial filtering on the overlay? Or at least some of them are.

@crispy-wonton (author):
Great spot, thank you! Yes, you are correct: they are lost in the filtering step. When I remove the filtering, I can see they are all retained. Between them they get matched to 11 conservation areas, with all but 1 match being <1% of the conservation area. The other is 2.96%.

@caldwellst:
Ah I see!

# Load geospatial boundaries of local authorities
council_bounds = get_datasets.load_gdf_ons_council_bounds()

# %%
@caldwellst:
What's the logic of the following section? We lose about 7% of our data here. Is it just that we don't trust a conservation area if it falls outside the council bounds? Would be good to add justification here.

@crispy-wonton (author):
Sorry, you're right about the documentation, there isn't enough! I will go back and add more.

Because the analysis is focussing on conservation area data availability for Local Authority Districts (councils), the steps below are meant to join the conservation areas to their local authorities. I tried out a couple of different join methods and this one seemed to produce the best results when comparing to our count data.

E.g. sjoin with the intersects predicate seemed too permissive: there was a high number of LADs with too many conservation areas. sjoin with the contains predicate (as in, LAD contains conservation area) seemed too restrictive: there was a high number of LADs with not enough conservation areas.

As a result, I chose this methodology: I calculated what percentage of each conservation area fell within the LAD and kept matches where at least 10% of the conservation area was found in the LAD. This was to avoid matching small intersections with conservation areas from neighbouring LADs, because my assumption was that these wouldn't be included in the LAD's conservation area count. A rough sketch of the step is below.
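For reference, a sketch reusing the names from the snippets quoted in this thread, and assuming both layers share a projected CRS:

import geopandas as gpd

# Record each conservation area's full size before intersecting
full_cons_areas_gdf["cons_area_size_m2"] = full_cons_areas_gdf["geometry"].area

# Split conservation areas along local authority (LAD) boundaries
overlaid = gpd.overlay(full_cons_areas_gdf, council_bounds, how="intersection")

# Keep a match only where at least 10% of the conservation area lies in the LAD
overlaid["pct_in_lad"] = 100 * overlaid.geometry.area / overlaid["cons_area_size_m2"]
matched = overlaid[overlaid["pct_in_lad"] >= 10]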

However, this is definitely an assumption so happy to change the methodology. Did you have another idea of how it could be improved?

Also, when you say we lose 7% of our data, which data do you mean?

@caldwellst:
I see now! I was slightly misunderstanding what was happening, but this is just for checking against the current setup. The approach makes sense. You could consider just taking the max for each conservation area to maintain the original row count, but after your filtering I think there are only 20 or so rows duplicated across LADs, so not a huge deal for just this check. I was getting confused between this approach for the analysis and how it might be applied across LSOAs, but I see now what was happening.
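That max-based variant could look something like this, assuming the overlaid frame and pct_in_lad column from the sketch above, plus a hypothetical unique area_id per conservation area:

# One row per conservation area: keep only the LAD with the largest overlap
best_match = overlaid.loc[overlaid.groupby("area_id")["pct_in_lad"].idxmax()]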


# %%
full_cons_areas_gdf["cons_area_size_m2"] = full_cons_areas_gdf["geometry"].area
full_cons_areas_gdf = gpd.overlay(
@caldwellst:
Also, I just noticed that because we are doing the intersection, we are dropping over 400 rows from the original data. A lot of this is in places like Anglesey, where I can see Beaumaris and the Menai Bridge, but also some villages in England as I spot-check. I guess there's not too much to do, since we are validating against the boundaries we have, but it's interesting as it could be another way we are losing data relative to the boundaries we do have.
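If it helps to quantify this, a hypothetical anti-join against the overlay output would list exactly which areas the intersection drops (again assuming a unique area_id column):

# Conservation areas that intersect no LAD boundary at all
dropped = full_cons_areas_gdf[
    ~full_cons_areas_gdf["area_id"].isin(overlaid["area_id"])
]
print(f"{len(dropped)} conservation areas fall outside the LAD boundaries")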
