
Conversation

@crispy-wonton (Collaborator) commented Aug 21, 2025

Progress towards fixing #171

NOTE: to run the notebook you may need to pip install scikit-learn==1.3

Description

Add a new notebook to start creating the full dataset of features for Plymouth for the new Phase 3 suitability and feasibility calculations. The final dataset should contain one row per residential UPRN in Plymouth, with a complete set of features relevant to feasibility scoring and tech categorisation. The main purpose of this notebook is to start collating the relevant datasets and explore some methodologies that could be used to impute missing data points. The notebook does not generate the full final dataset - it adds (and imputes, where required) the following features:

  • in listed building
  • in building conservation area
  • in HN zone
  • property type
  • tenure
  • off gas status

Additional features will be added and imputed in ongoing work, but I wanted to get this out so you can start sense-checking the methodology.

The purposes of this exploratory work are to:

  1. Identify methodologies for imputing missing feature data
  2. Sketch out the flow of what the pipeline will eventually look like
  3. Create a minimum viable example dataset for demonstration purposes in Plymouth

Note: I fully expect us to iterate on these methodologies to improve them over time as we develop the pipeline. Right now I want to prioritise getting this data good enough for a demo, so please focus on reviewing from that perspective - but do leave any comments/suggestions for improvements that will take more time and exploration, with the knowledge that we will make deeper improvements after the Plymouth workshop.

Instructions for Reviewer

In order to test the code in this PR you need to convert the script to a notebook using jupytext:

pip install jupytext
jupytext --to notebook asf_heat_pump_suitability/analysis/exploratory/create_full_dataset/create_full_dataset_plymouth.py

Please pay special attention to ...

  • The methodologies for imputation and the evaluation/validation of those methodologies. The main one to pay close attention to is the method to impute missing TENURE data. As you will see, I haven't done any hyperparameter tuning or detailed evaluation; I wanted to sense-check the general method and datasets before going too far into model tuning and evaluation. If you think a multi-class classifier model is suitable, I'd be keen to hear your ideas for how best to evaluate it beyond average precision and accuracy scores (see the evaluation sketch after this list). I'm very open to other ideas for models/methods.
  • Please let me know if you have other ideas for improving the imputation methods of the other features as well.
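
For context on the evaluation question above, a minimal sketch of multi-class evaluation beyond accuracy and average precision - assuming y_test/y_pred, a fitted clf, and X/y from the notebook's train/test split (class names are illustrative):

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

labels = ["owner-occupied", "rental (private)", "rental (social)"]
# Per-class precision/recall/F1 shows which tenure classes the model struggles with
print(classification_report(y_test, y_pred, labels=labels))
# The confusion matrix shows which classes get mistaken for each other
print(confusion_matrix(y_test, y_pred, labels=labels))
# Macro-F1 weights each class equally, which matters with imbalanced tenure classes
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print(scores.mean(), scores.std())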

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@sofiapinto left a comment

@crispy-wonton amazing work! There's so much knowledge of handling gdfs, polars, and UK open datasets! I learnt loads from reviewing this PR. And so much work in this PR - thank you.

I've left questions where I need clarifications, but haven't found any bugs or problems with the logic.

We have since discussed improvements on the modelling side, so I didn't leave any comments on that part of the code. Feel free to add your newest code and I'm happy to review again and think of improvements on the model/data used.

# ## Load data

# %%
fiona.listlayers("s3://asf-heat-pump-suitability/source_data/opmplc_gb.gpkg")


what does opmplc_gb stand for?

@crispy-wonton (author) replied:

sorry - should have added a note! It's the Open Map Local dataset for GB, and the layers are all the different geospatial datasets contained in the file :)

Comment on lines 77 to 79
listed_buildings_gdf = gpd.read_file(
"../spatial_clustering/data/National_Heritage_List_for_England_NHLE_v02_VIEW_-464524051049198649/Listed_Building_points.shp"
)


needs to be updated later to read from S3

@crispy-wonton (author) replied:

Thanks!
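
For reference, a sketch of the S3 variant, mirroring the s3:// reads used elsewhere in the notebook. The exact key below is hypothetical, and zipping the shapefile keeps the .shp/.dbf/.shx sidecar files together for the remote read:

import geopandas as gpd

# Hypothetical S3 key - update once the dataset is uploaded
listed_buildings_gdf = gpd.read_file(
    "s3://asf-heat-pump-suitability/source_data/Listed_Building_points.zip"
)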

Comment on lines 82 to 84
cons_areas_gdf = gpd.read_file(
"../spatial_clustering/data/conservation-area (1).geojson"
)


needs to be updated later to read from S3

"Secondary Education",
"Higher or University Education",
"Primary Education",
"Post Office",


I think this one will lead to some false negatives. It's very common to have flats above post offices. But if there's no other way to remove the false positives, then it's better to have some false negatives - just worth noting that that's the case.

I guess that there are some UPRNs we could keep, in case they are non-null UPRNs in EPC - since those UPRNs will be only domestic properties.

@crispy-wonton (author) replied:

Yeah, I was thinking this too when I wrote this list. I think in a city it's also reasonably common to have residential properties above certain types of education centres, places of worship, and perhaps sports and leisure centres. The methodology for identifying residential properties generally could do with some improvement.

Your second point is a great one - I think I need to revisit the UPRN-to-EPC join so that I retain any EPC UPRNs that might have been filtered out here. Something like the sketch below, perhaps.
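
A rough sketch of that rescue step in polars - uprns_df with a boolean "residential" flag and epc_df with a "UPRN" column are assumptions about the working dataframes:

import polars as pl

# Keep UPRNs flagged non-residential by the OS classification if they
# appear in EPC, since EPC UPRNs are domestic by definition
rescued = uprns_df.filter(
    ~pl.col("residential") & pl.col("UPRN").is_in(epc_df["UPRN"])
)
residential_uprns_df = pl.concat([uprns_df.filter(pl.col("residential")), rescued])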

)

print("\nLoading off gas postcodes...")
off_gas_list = off_gas.process_off_gas_data()


Are these used as postcodes with properties off gas or with the assumption that all properties in those postcodes are off gas? If the second one, I think that's probably likely in rural areas but not necessarily otherwise. Am I wrong?

@crispy-wonton (author) replied:

From my understanding, it's that the whole postcode doesn't have a gas connection - see Xoserve's documentation of the data; I've lifted the below from it. I think this means we can assume no properties in the postcode have a gas supply.

The Off Gas Postcode Register is a list of postcodes where Xoserve holds no record of an active gas connection by either large or small gas transporters.

@sofiapinto replied:

interesting, my bad then. Glad this is the case!
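
Given that reading, the flag can be a straight postcode membership test - a sketch assuming a working uprns_df with a "postcode" column (off_gas_list as loaded above):

import polars as pl

# Every property in an off-gas postcode is assumed to have no gas supply
uprns_df = uprns_df.with_columns(
    pl.col("postcode").is_in(off_gas_list).alias("off_gas")
)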


print("Loading age band data...")
age_bands_df = pl.read_csv(
"s3://asf-heat-pump-suitability/exploration/spatial_clustering_plymouth/2021Census_age_bands_OA_plymouth.csv",
skip_rows=4,


I didn't open the original datasets without skip_rows, so I haven't checked if this is correct. For age bands skip_rows=4 is used, while skip_rows=6 is used for all other datasets. Hopefully this is correct, but wanted to flag it in case it was a typo.

@crispy-wonton (author) replied:

I think it's correct - I checked all the dfs manually before creating these. I remember there were different header lengths across the different files.

)

# Join census datasets together
census_df = oa_tenure_df


Suggested change
census_df = oa_tenure_df
census_df = oa_tenure_df.clone()

@crispy-wonton (author) replied:

this is just for renaming purposes basically, as oa_tenure_df isn't used again. Also, I don't think deep copying dfs is that important in polars as there aren't many (if any?) in-place operations.
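
A small sketch illustrating the point - polars operations return new frames rather than mutating in place, so a plain rebind is safe here:

import polars as pl

df = pl.DataFrame({"oa_code": ["E00012345"], "tenure": ["owner-occupied"]})
alias = df  # plain rebind: both names point to the same DataFrame

# rename returns a new frame; the original is untouched even without .clone()
renamed = alias.rename({"tenure": "tenure_2021"})
print(df.columns)      # ['oa_code', 'tenure']
print(renamed.columns) # ['oa_code', 'tenure_2021']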

)
.join(census_df, how="left", left_on="oa21", right_on="oa_code")
.join(
features_df.select(["UPRN", "in_cons_area", "in_listed_building"]),


Suggested change
features_df.select(["UPRN", "in_cons_area", "in_listed_building"]),
features_df.select(["UPRN", "in_conservation_area", "in_listed_building"]),

if you decide to change above

Comment on lines +1090 to +1092
max_prob=pl.concat_list(
"owner-occupied", "rental (private)", "rental (social)"
).list.max(),


Obviously not for now, but for when this is refactored: I think this part requires a bit of documentation - it's not immediately obvious what is happening.

@crispy-wonton (author) replied:

Good point!!
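
For when it's refactored, a commented restatement of the expression (probs_df standing in for the frame holding the three class-probability columns):

import polars as pl

# Each column holds the classifier's probability for that tenure class.
# concat_list builds a per-row list [p_owner, p_private, p_social] and
# .list.max() takes the row-wise maximum, i.e. the confidence of the
# predicted (most likely) class.
probs_df = probs_df.with_columns(
    max_prob=pl.concat_list(
        "owner-occupied", "rental (private)", "rental (social)"
    ).list.max()
)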

@lizgzil left a comment

Thanks so much for all this @crispy-wonton !!

The code worked well and was great to see such elegant polars mastery.

My version of scikit-learn (1.7.1) meant I needed to change the code, but I see now you wrote in the description that 1.3 might be needed anyway!

I looked over it all since I needed to run it all, but in my review I focused on the tenure model and the extra code for adding the IMD deciles. I think it looks great - no big changes at all, mostly notes.

Comment on lines +91 to +93
cons_areas_gdf = gpd.read_file(
"s3://asf-heat-pump-suitability/exploration/spatial_clustering_plymouth/conservation-area (1).geojson"
)

is this the new dataset (#167)?

fiona.listlayers("s3://asf-heat-pump-suitability/source_data/opmplc_gb.gpkg")

# %%
print("LOADING DATASETS TO GET RESIDENTIAL UPRNS FOR PLYMOUTH...")

I can see some of the data you read from exploration/spatial_clustering_plymouth is particular to that part of the country (I guess the SX-prefixed ones), but are some of them for the whole UK (e.g. listed buildings and conservation zones)? Nothing to change - but I'm curious about your logic for including data in this exploration/spatial_clustering_plymouth folder vs the source_data folder. Perhaps just for speed during exploration?

)

print("\nLoading off gas postcodes...")
off_gas_list = off_gas.process_off_gas_data()

I think pyarrow needs adding to the requirements. I needed to do pip install pyarrow to get off gas data to load.

Comment on lines +158 to +160
uprns_in_buildings = os_openmap_buildings_plymouth_gdf.sjoin(
os_uprn_plymouth_gdf, how="inner", predicate="contains"
)["UPRN"].tolist()

When I was plotting, I think there were some cases where a UPRN coordinate is just outside the building polygon (see the example on the left of slide 9 here). So when improvements are made, I guess we need to see if we can account for these cases - one idea sketched below.
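
One possible way to handle those cases - a nearest join with a small distance cap instead of strict containment. The 5 m tolerance is an assumption to tune, and the coordinates need to be in a projected CRS such as British National Grid:

# Catch UPRN points that fall just outside their building polygon
uprns_near_buildings = os_uprn_plymouth_gdf.sjoin_nearest(
    os_openmap_buildings_plymouth_gdf, how="inner", max_distance=5
)["UPRN"].tolist()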

coords = np.array(nn_tenure_df.geometry.map(lambda p: [p.x, p.y]).tolist())

# Find all neighbours within 100m radius of each UPRN
knn = NearestNeighbors(radius=100, algorithm="kd_tree").fit(coords)

At some point it'd be good to play around with this threshold to see what difference it makes to the results - e.g. with a sweep like the one below.
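
A quick sweep to see how neighbour counts respond to the radius - the candidate values are illustrative, and coords is the array built above:

import numpy as np
from sklearn.neighbors import NearestNeighbors

for radius in [25, 50, 100, 200]:
    knn = NearestNeighbors(radius=radius, algorithm="kd_tree").fit(coords)
    _, indices = knn.radius_neighbors(coords)
    counts = [len(idx) - 1 for idx in indices]  # exclude the point itself
    print(radius, np.mean(counts), np.median(counts))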

Comment on lines +1272 to +1274
y_pred_arr = np.copy(y_pred)
for k, v in tenure_mapping.items():
y_pred_arr[y_pred == k] = v

Not sure why, but this wasn't working as expected for me - y_pred_arr wasn't changed to numbers.

y_pred_arr = [tenure_mapping.get(v) for v in y_pred] worked though.
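
A likely explanation: np.copy preserves y_pred's string dtype, so the integers assigned into the copy are cast back to strings. A minimal sketch with hypothetical labels:

import numpy as np

tenure_mapping = {"owner-occupied": 0, "rental (private)": 1, "rental (social)": 2}
y_pred = np.array(["owner-occupied", "rental (social)"])

y_pred_arr = np.copy(y_pred)  # inherits a string dtype such as '<U16'
for k, v in tenure_mapping.items():
    y_pred_arr[y_pred == k] = v
print(y_pred_arr.dtype)  # still a string dtype, so values are "0", "2", ...

# A dtype-safe alternative alongside the list comprehension above
y_pred_num = np.vectorize(tenure_mapping.get)(y_pred)
print(y_pred_num)  # [0 2] as integers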

Comment on lines +1482 to +1490
# One hot encode property type data
onehotenc_df = pd.DataFrame(
enc.transform(
pd_features_df[["use_property_type"]].rename(
columns={"use_property_type": "property_type"}
)
).toarray()
)
onehotenc_df.columns = list(col.lower().replace(" ", "_") for col in enc.categories_[0])

would be good to add all your processing steps to a function so you can be sure you cleaned your training data in the same way as you cleaned the data for prediction. It looks consistent to me now though, so I don't think there are bugs. Something like the sketch below, for example.
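
A sketch reusing the names from the snippet above, where enc is the fitted OneHotEncoder:

import pandas as pd

def encode_property_type(df: pd.DataFrame, enc) -> pd.DataFrame:
    """One hot encode property type with the fitted encoder so training
    and prediction share exactly the same cleaning steps."""
    onehot = pd.DataFrame(
        enc.transform(
            df[["use_property_type"]].rename(
                columns={"use_property_type": "property_type"}
            )
        ).toarray(),
        index=df.index,
    )
    onehot.columns = [col.lower().replace(" ", "_") for col in enc.categories_[0]]
    return onehot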

# %%
# Cluster UPRNs on distance only
model = HDBSCAN(
min_cluster_size=5,

Not sure on this - should it be 1? It's fine for now; will need to think it through, but just noting!

model = HDBSCAN(
min_cluster_size=5,
cluster_selection_epsilon=20,
algorithm="balltree",

FYI my version of HDBSCAN required this to be `algorithm="ball_tree"`. I have scikit-learn==1.7.1. (I believe these option names were renamed in newer scikit-learn versions.)
