123 check building cons area data #124
base: dev
Conversation
…england building conservation areas dataset
Hey Roisin, nice stuff!
I left some comments in the notebook. Overall the analysis is okay; I think one of the joins needs fixing, and a bit more commenting would have helped me follow along. I've put questions where I ran into issues.
However, overall, the results seem to indicate we shouldn't be dropping. I thought it would be good to go through and check the two most "off" local authorities.
Stratford-on-Avon is just missing loads of data in the Historic England dataset, as you note in your analysis. I looked at their website and can't seem to find any of the conservation areas they list when searching by name. Oxford Canal is, weirdly, in the HE data but isn't on their site, while the other area we have in the dataset, Whichford, is there. It seems there's not much we can do about this missingness?
However, I also checked out Derbyshire Dales. This is where we have too many entries compared to the number of conservation areas Derbyshire Dales reports. After some manual investigation, it turns out these extras are actually conservation areas within the Peak District National Park!
Could this be a common driver of differences? LAs may not be responsible for managing conservation areas that fall under a national park designation, but that isn't reflected in the HE/Wales data?
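One way to test this hypothesis would be a spatial join against national park boundaries. A minimal sketch with invented toy geometries and column names; the real check would load the actual park boundary data:

```python
import geopandas as gpd
from shapely.geometry import box

# Invented geometries: one conservation area inside a park, one outside
cons_areas = gpd.GeoDataFrame(
    {"name": ["in_park", "elsewhere"]},
    geometry=[box(2, 2, 3, 3), box(8, 8, 9, 9)],
    crs="EPSG:27700",
)
parks = gpd.GeoDataFrame(
    {"park": ["Peak District"]},
    geometry=[box(0, 0, 5, 5)],
    crs="EPSG:27700",
)

# Flag conservation areas that fall wholly within a national park boundary
in_park_idx = gpd.sjoin(cons_areas, parks, predicate="within").index
cons_areas["in_national_park"] = cons_areas.index.isin(in_park_idx)
```

Counting the flagged rows per LAD could then show whether the "too many" LADs are all ones overlapping a park.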
On duplicated names, I think there are a few different options, which you outline! I spot-checked a few, and a lot seem to be contiguous areas stored as separate shapes. However, some are not next to each other, and I'm not sure you'd be able to fully validate these. They are probably cases where it's difficult to do anything automatically, and we wouldn't necessarily want to without more confidence. The problem cases would be ones where an entire area is classed as a conservation area.
You could check that no conservation area polygon is fully contained in another. You could do a spatial join where you expect each area to only contain itself, and flag the others to check. Those would be ones where we manually check and, if we agree, automatically remove, so as not to overestimate conservation area through some wild setup! Maybe @lizgzil already checked for this in #130?
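The containment check could look something like this. A rough sketch with toy geometries (all names invented), using a self-join with the contains predicate:

```python
import geopandas as gpd
from shapely.geometry import box

# Toy polygons: "inner" sits wholly inside "big", "separate" is elsewhere
areas = gpd.GeoDataFrame(
    {"name": ["big", "inner", "separate"]},
    geometry=[box(0, 0, 10, 10), box(2, 2, 4, 4), box(20, 20, 25, 25)],
    crs="EPSG:27700",
)

# Spatial self-join: every polygon contains itself, so any pair with
# differing indices flags a polygon nested inside another one
matches = gpd.sjoin(areas, areas, predicate="contains")
flagged = matches[matches.index != matches["index_right"]]
```

Here `flagged` would hold the ("big", "inner") pair, i.e. the candidates for manual review and possible removal.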
counts_df = pl.read_csv(
    "s3://asf-heat-pump-suitability/evaluation/building_conservation_area_datasets/building_conservation_area_counts_sample.csv"
)
cons_areas_df = cons_areas_df.join(counts_df, how="inner", on="LAD23NM")
This is dropping 3 rows where we have no conservation areas: Wakefield, Wokingham, and Leeds. Not sure if intentional, but if not, we should do an outer join and fill in in_conservation_area_ew with 0, I think?
These seem to be lost in the initial filtering on overlay? Or at least some of them.
Great spot, thank you! Yes, you are correct: they are lost in the filtering step. When I remove the filtering, I can see they are all retained. Between them they get matched to 11 conservation areas, with all but one match covering <1% of the conservation area. The other is 2.96%.
Ah I see!
# Load geospatial boundaries of local authorities
council_bounds = get_datasets.load_gdf_ons_council_bounds()

# %%
What's the logic of this following section? We lose about 7% of our data here. Is it just that we don't trust a conservation area if it falls outside the council bounds? It would be good to add justification here.
Sorry, you're right about the documentation, there isn't enough! I will go back and add more.
Because the analysis is focussing on conservation area data availability for Local Authority Districts (councils), the steps below are meant to join the conservation areas to their local authorities. I tried out a couple of different join methods and this one seemed to produce the best results when comparing to our count data.
E.g. sjoin with the intersects predicate seemed to be too permissive: there was a high number of LADs with too many conservation areas. sjoin with the contains predicate (as in, LAD contains conservation area) seemed to be too restrictive: there was a high number of LADs with not enough conservation areas.
As a result, I chose this methodology where I calculated what percentage of each conservation area fell within the LAD and then kept those where at least 10% of the conservation area was found in the LAD. This was to avoid small intersections with neighbouring LAD conservation areas being matched because my assumption was that they wouldn't be included in the conservation area count by the LAD.
However, this is definitely an assumption so happy to change the methodology. Did you have another idea of how it could be improved?
Also, when you say we lose 7% of our data, which data do you mean?
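The percentage-overlap method described above can be sketched roughly as follows, with invented toy geometries and column names chosen to echo the notebook's (the real frames come from the loaded datasets):

```python
import geopandas as gpd
from shapely.geometry import box

lads = gpd.GeoDataFrame(
    {"LAD23NM": ["A", "B"]},
    geometry=[box(0, 0, 10, 10), box(10, 0, 20, 10)],
    crs="EPSG:27700",
)
cons = gpd.GeoDataFrame(
    {"cons_name": ["mostly_A", "sliver"]},
    geometry=[box(1, 1, 5, 5), box(9.95, 1, 10.5, 2)],
    crs="EPSG:27700",
)

# Record each area's full size, split it across LADs with an intersection
# overlay, then keep only pieces covering at least 10% of the original area
cons["cons_area_size_m2"] = cons.geometry.area
pieces = gpd.overlay(cons, lads, how="intersection")
pieces["pct_in_lad"] = pieces.geometry.area / pieces["cons_area_size_m2"] * 100
matched = pieces[pieces["pct_in_lad"] >= 10]
```

In this toy example, "sliver" pokes slightly into LAD A but is assigned only to B, which is exactly the small-intersection behaviour the 10% threshold is meant to enforce.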
I see now! I was slightly misunderstanding what was happening; this is just for checking against the current setup. The approach makes sense. You could consider just taking the max for each to maintain the original row count, but after your filtration I think there are only around 20 duplicated rows, so it's not a huge deal for this check. I was confusing this checking approach with the analysis itself and how it might be applied across LSOAs, but I see now what was happening.
# %%
full_cons_areas_gdf["cons_area_size_m2"] = full_cons_areas_gdf["geometry"].area
full_cons_areas_gdf = gpd.overlay(
Also, I just noticed that because we are doing the intersection, we are dropping over 400 rows from the original data. A lot of this is in places like Anglesey, where I can see Beaumaris and the Menai Bridge, but also some villages in England as I spot-checked. I guess there's not too much to do, since we are validating against the boundaries we have, but it's interesting, as it could be another way we are losing data relative to boundaries that we do have.
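To see which rows vanish, one option is an anti-join of the original frame against the overlay result. A rough sketch with invented geometries and names:

```python
import geopandas as gpd
from shapely.geometry import box

council_bounds = gpd.GeoDataFrame(
    {"LAD23NM": ["A"]}, geometry=[box(0, 0, 10, 10)], crs="EPSG:27700"
)
cons_areas = gpd.GeoDataFrame(
    {"name": ["inside", "offshore"]},
    geometry=[box(1, 1, 2, 2), box(50, 50, 51, 51)],
    crs="EPSG:27700",
)

# An intersection overlay silently drops areas with no overlap at all;
# an anti-join against the result lists exactly what was lost
clipped = gpd.overlay(cons_areas, council_bounds, how="intersection")
dropped = cons_areas[~cons_areas["name"].isin(clipped["name"])]
```

Printing `dropped` (or mapping it) would make the 400-odd lost areas easy to eyeball.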
Fixes #123
Description
Instructions for Reviewer
In order to test the code in this PR you need to ...
In terminal, run the following line:
jupytext --to notebook asf_heat_pump_suitability/analysis/protected_areas/20250207_missing_protected_areas.py
Then you can run the notebook.
Please pay special attention to ... in_protected_area with False. I think this analysis shows we could, although we would have to accept that there may be some missing conservation areas for some LAs. Let me know what you think.

Checklist:
notebooks/
pre-commit and addressed any issues not automatically fixed
dev
READMEs