Add notebook to create full features dataset for Plymouth #175
base: dev
Conversation
@crispy-wonton amazing work! There's so much knowledge of handling gdfs, polars, and UK open datasets! I learnt loads from reviewing this PR. And so much work in this PR. Thank you.
I've left questions where I need clarifications, but haven't found any bugs or problems with the logic.
We have since discussed improvements on the modelling side, so I didn't leave any comments on that part of the code. Feel free to add your newest code and I'm happy to review again and think of improvements on the model/data used.
# ## Load data

# %%
fiona.listlayers("s3://asf-heat-pump-suitability/source_data/opmplc_gb.gpkg")
what does opmplc_gb stand for?
sorry - should have added a note! it's the Open Map Local dataset for GB, and the layers are all the different geospatial datasets contained in the file :)
listed_buildings_gdf = gpd.read_file(
    "../spatial_clustering/data/National_Heritage_List_for_England_NHLE_v02_VIEW_-464524051049198649/Listed_Building_points.shp"
)
needs to be updated later to read from S3
Thanks!
cons_areas_gdf = gpd.read_file(
    "../spatial_clustering/data/conservation-area (1).geojson"
)
needs to be updated later to read from S3
"Secondary Education", | ||
"Higher or University Education", | ||
"Primary Education", | ||
"Post Office", |
I think this one will lead to some false negatives. It's very common to have flats above post offices. But if there's no other way to remove the false positives, then it's better to have some false negatives. Just worth noting that that's the case.
I guess there are some UPRNs we could keep where they appear as non-null UPRNs in EPC - since those UPRNs will only be domestic properties.
Yeah, I was thinking this too when I wrote this list. I think in a city it's also reasonably common to have residential properties above certain types of education centres, places of worship, and perhaps sports and leisure centres. The methodology to identify residential properties generally could do with some improvement.
I think your second point is a great one - I need to revisit the UPRN to EPC join so that I retain any EPC UPRNs that might have been filtered out here.
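For illustration, a minimal sketch of that idea (the `keep_epc_uprns` helper and the `filtered_uprns`/`epc_df` names are hypothetical, not from this PR):

import polars as pl

def keep_epc_uprns(filtered_uprns: list[int], epc_df: pl.DataFrame) -> set[int]:
    """Union the filtered residential UPRNs with any non-null UPRN seen in EPC,
    since UPRNs appearing in EPC should be domestic properties."""
    epc_uprns = set(epc_df.drop_nulls(subset="UPRN")["UPRN"].to_list())
    return set(filtered_uprns) | epc_uprns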
)

print("\nLoading off gas postcodes...")
off_gas_list = off_gas.process_off_gas_data()
Are these used as postcodes that contain some off-gas properties, or with the assumption that all properties in those postcodes are off gas? If the latter, I think that's probably likely in rural areas but not necessarily otherwise. Am I wrong?
From my understanding, it means the whole postcode doesn't have a gas connection. See Xoserve's documentation of the data - I've lifted the below from it. I think this means we can assume no properties in the postcode have a gas supply.
The Off Gas Postcode Register is a list of postcodes where Xoserve holds no record of an active gas connection by either large or small gas transporters.
interesting, my bad then. Glad this is the case!
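For illustration, a minimal sketch of applying the register downstream (the `postcode` column name and the `flag_off_gas` helper are assumptions, not from this PR):

import polars as pl

def flag_off_gas(features_df: pl.DataFrame, off_gas_list: list[str]) -> pl.DataFrame:
    """Mark every property whose whole postcode appears in the off-gas register."""
    return features_df.with_columns(
        pl.col("postcode").is_in(off_gas_list).alias("off_gas")
    )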
I think `pyarrow` needs adding to the requirements. I needed to do `pip install pyarrow` to get the off gas data to load.
print("Loading age band data...") | ||
age_bands_df = pl.read_csv( | ||
"s3://asf-heat-pump-suitability/exploration/spatial_clustering_plymouth/2021Census_age_bands_OA_plymouth.csv", | ||
skip_rows=4, |
I didn't open the original datasets without `skip_rows`, so I haven't checked if this is correct. For age bands `skip_rows=4`, while `skip_rows=6` for all other datasets. Hopefully this is correct, but wanted to flag in case it was a typo.
I think it's correct - I checked all the dfs manually before creating these. I remember there were different lengths of headers across the different files.
)

# Join census datasets together
census_df = oa_tenure_df
Suggested change:
-census_df = oa_tenure_df
+census_df = oa_tenure_df.clone()
This is basically just for renaming purposes, as `oa_tenure_df` isn't used again. Also, I don't think deep copying dfs is that important in polars, as there aren't many (if any?) in-place operations.
)
.join(census_df, how="left", left_on="oa21", right_on="oa_code")
.join(
    features_df.select(["UPRN", "in_cons_area", "in_listed_building"]),
Suggested change:
-features_df.select(["UPRN", "in_cons_area", "in_listed_building"]),
+features_df.select(["UPRN", "in_conservation_area", "in_listed_building"]),

if you decide to change the above
max_prob=pl.concat_list(
    "owner-occupied", "rental (private)", "rental (social)"
).list.max(),
Obviously not for now, but for when this is refactored: I think this part requires a bit of documentation, as it's not immediately obvious what is happening.
Good point!!
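For when it's refactored, a possible documented version of the same pattern (a sketch; `df` stands in for the joined frame above):

import polars as pl

# concat_list collects the three tenure probabilities into a per-row list, and
# .list.max() takes the row-wise maximum - i.e. the probability of the most
# likely tenure for each UPRN.
df = df.with_columns(
    max_prob=pl.concat_list(
        "owner-occupied", "rental (private)", "rental (social)"
    ).list.max()
)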
Thanks so much for all this @crispy-wonton !!
The code worked well and it was great to see such elegant polars mastery.
My version of scikit-learn (1.7.1) meant I needed to change the code, but I see now you wrote in the description that you might need 1.3 anyway!
I looked over it all since I needed to run it all, but in my review I focused on the tenure model and the extra code from adding the IMD deciles. I think it looks great - no big changes at all - mostly notes.
cons_areas_gdf = gpd.read_file(
    "s3://asf-heat-pump-suitability/exploration/spatial_clustering_plymouth/conservation-area (1).geojson"
)
is this the new dataset (#167)?
fiona.listlayers("s3://asf-heat-pump-suitability/source_data/opmplc_gb.gpkg")

# %%
print("LOADING DATASETS TO GET RESIDENTIAL UPRNS FOR PLYMOUTH...")
I can see some of the data you read from `exploration/spatial_clustering_plymouth` is particular to that part of the country (I guess the `SX`-prefixed ones), but are some of them for the whole UK (e.g. listed buildings and conservation zones)? Nothing to change - but I'm curious about your logic to include data in this `exploration/spatial_clustering_plymouth` folder vs the `source_data` folder? Perhaps just for speed during exploration?
uprns_in_buildings = os_openmap_buildings_plymouth_gdf.sjoin(
    os_uprn_plymouth_gdf, how="inner", predicate="contains"
)["UPRN"].tolist()
When I was plotting, I think there were some cases where a UPRN coordinate is just outside the building polygon (see the example on the left of slide 9 here). So when improvements are made, I guess we need to see if we can account for these cases.
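One possible way to account for those cases later - a sketch with an illustrative 5 m tolerance, not the PR's method (assumes a metric CRS such as British National Grid):

# Buffer the building footprints slightly so UPRNs sitting just outside a
# polygon are still captured by the "contains" join.
buffered_gdf = os_openmap_buildings_plymouth_gdf.copy()
buffered_gdf["geometry"] = buffered_gdf.geometry.buffer(5)  # 5 m tolerance
uprns_in_buildings = buffered_gdf.sjoin(
    os_uprn_plymouth_gdf, how="inner", predicate="contains"
)["UPRN"].tolist()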
coords = np.array(nn_tenure_df.geometry.map(lambda p: [p.x, p.y]).tolist())

# Find all neighbours within 100m radius of each UPRN
knn = NearestNeighbors(radius=100, algorithm="kd_tree").fit(coords)
At some point it'd be good to play around with this threshold to see what difference it makes to the results.
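A quick sketch for experimenting with the threshold (the radii are illustrative assumptions; `coords` is the array built above):

import numpy as np
from sklearn.neighbors import NearestNeighbors

for radius in (50, 100, 200):
    nn = NearestNeighbors(radius=radius, algorithm="kd_tree").fit(coords)
    _, neighbour_idx = nn.radius_neighbors(coords)
    counts = np.array([len(idx) - 1 for idx in neighbour_idx])  # exclude self
    print(f"radius={radius}m: mean neighbours={counts.mean():.1f}, "
          f"isolated UPRNs={(counts == 0).mean():.1%}")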
y_pred_arr = np.copy(y_pred)
for k, v in tenure_mapping.items():
    y_pred_arr[y_pred == k] = v
Not sure why, but this wasn't working as expected for me - `y_pred_arr` wasn't changed to numbers. `y_pred_arr = [tenure_mapping.get(v) for v in y_pred]` worked though.
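Possibly a dtype issue (a guess, not verified): if `y_pred` holds strings, `np.copy(y_pred)` keeps the string dtype, so numbers assigned into `y_pred_arr` are cast back to strings. Building a fresh numeric array sidesteps that:

import numpy as np

# Map class labels to numbers in a new array with an explicit numeric dtype.
y_pred_arr = np.array([tenure_mapping[v] for v in y_pred], dtype=float)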
# One hot encode property type data
onehotenc_df = pd.DataFrame(
    enc.transform(
        pd_features_df[["use_property_type"]].rename(
            columns={"use_property_type": "property_type"}
        )
    ).toarray()
)
onehotenc_df.columns = list(col.lower().replace(" ", "_") for col in enc.categories_[0])
It would be good to add all your processing steps to a function so you are sure you cleaned your training data in the same way as you cleaned the data for prediction. It looks consistent to me now though, so I don't think there are bugs.
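A minimal sketch of that idea (the `encode_property_type` name is an assumption; `enc` is the already-fitted OneHotEncoder from the snippet above):

import pandas as pd

def encode_property_type(df: pd.DataFrame, enc) -> pd.DataFrame:
    """Apply the fitted encoder identically to training and prediction data."""
    renamed = df[["use_property_type"]].rename(
        columns={"use_property_type": "property_type"}
    )
    encoded = pd.DataFrame(enc.transform(renamed).toarray(), index=df.index)
    encoded.columns = [c.lower().replace(" ", "_") for c in enc.categories_[0]]
    return encoded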
# %%
# Cluster UPRNs on distance only
model = HDBSCAN(
    min_cluster_size=5,
Not sure on this - should it be 1? It's fine for now though. Will need to think it through - but just noting!
model = HDBSCAN(
    min_cluster_size=5,
    cluster_selection_epsilon=20,
    algorithm="balltree",
FYI my version of HDBSCAN required this to be `algorithm="ball_tree"`. I have scikit-learn==1.7.1.
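A version-tolerant sketch, assuming the spelling changed between scikit-learn releases as described above (`coords` is the UPRN coordinate array):

from sklearn.cluster import HDBSCAN

params = dict(min_cluster_size=5, cluster_selection_epsilon=20)
try:
    # Newer scikit-learn (e.g. 1.7.1) expects "ball_tree".
    model = HDBSCAN(algorithm="ball_tree", **params).fit(coords)
except ValueError:
    # Older releases (e.g. 1.3) used the "balltree" spelling instead.
    model = HDBSCAN(algorithm="balltree", **params).fit(coords)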
Progress towards fixing #171
NOTE: to run the notebook you may need to `pip install scikit-learn==1.3`.
Description
Add a new notebook to start creating a full dataset of features for Plymouth for the new Phase 3 suitability and feasibility calculations. The final dataset should have one row per UPRN for all residential UPRNs in Plymouth, with a complete set of features relevant to feasibility scoring and tech categorisation. The main purpose of this notebook is to start collating the relevant datasets and to explore some methodologies that could be used to impute missing data points. The notebook does not generate the full final dataset - it adds (and imputes, where required) the following features:
Additional features will be added and imputed in ongoing work, but I wanted to get this out so you can start sense-checking the methodology.
The purposes of this exploratory work are to:
Note: I fully expect us to iterate on these methodologies to improve them over time as we develop the pipeline. I want to prioritise getting this data good enough for a demo at the moment, so please focus on reviewing from that perspective - but also please do leave any comments/suggestions for improvements that will take more time and exploration, with the knowledge that we will make deeper improvements after the Plymouth workshop.
Instructions for Reviewer
In order to test the code in this PR you need to ...
convert the script to a notebook using jupytext:
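For example, for the script under analysis/exploratory/create_full_dataset/ in this PR:

jupytext --to notebook create_full_dataset_plymouth.py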
Please pay special attention to ...
Checklist:
- [ ] I have refactored my code out of `notebooks/`
- [ ] I have run `pre-commit` and addressed any issues not automatically fixed
- [ ] I have merged any new changes from `dev`
- [ ] I have updated the `README`s