
Conversation

@crispy-wonton (Collaborator) commented Aug 21, 2025

Progress towards fixing #171

NOTE: to run the notebook you may need to pip install scikit-learn==1.3

Description

Add a new notebook to start creating the full dataset of features for Plymouth for the new Phase 3 suitability and feasibility calculations. The final dataset should contain one row per residential UPRN in Plymouth, with a complete set of features relevant to feasibility scoring and tech categorisation. The main purpose of this notebook is to start collating the relevant datasets and explore some methodologies that could be used to impute missing data points. The notebook does not generate the full final dataset - it adds (and imputes, where required) the following features:

  • in listed building
  • in building conservation area
  • in HN zone
  • property type
  • tenure
  • off gas status

Additional features will be added and imputed in ongoing work, but I wanted to get this out so you can start sense-checking the methodology.

The purposes of this exploratory work are to:

  1. Identify methodologies for imputing missing feature data
  2. Sketch out the flow of what the pipeline will eventually look like
  3. Create a minimum viable example dataset for demonstration purposes in Plymouth

Note: I fully expect us to iterate on these methodologies to improve them over time as we develop the pipeline. Right now I want to prioritise getting this data good enough for a demo, so please focus on reviewing from that perspective - but do leave any comments/suggestions for improvements that will take more time and exploration, with the knowledge that we will make deeper improvements after the Plymouth workshop.

Instructions for Reviewer

In order to test the code in this PR you need to convert the script to a notebook using jupytext:

pip install jupytext
jupytext --to notebook asf_heat_pump_suitability/analysis/exploratory/create_full_dataset/create_full_dataset_plymouth.py

Please pay special attention to ...

  • The methodologies for imputation and the evaluation/validation of those methodologies. The main one to pay close attention to is the method to impute missing TENURE data. As you will see, I haven't done any hyperparameter tuning or detailed evaluation; I wanted to sense-check the general method and datasets before going too far into model tuning and evaluation. If you think a multi-class classifier model is suitable, I'd be keen to hear your ideas for how best to evaluate it beyond average precision and accuracy scores (see the evaluation sketch after this list). I'm very open to other ideas for models/methods.
  • Please let me know if you have other ideas for improving the imputation methods of the other features as well.
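
For context on the evaluation question above, a minimal sketch of multi-class evaluation beyond accuracy and average precision - assuming y_test/y_pred, a fitted clf, and X/y from the notebook's train/test split (class names are illustrative):

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

labels = ["owner-occupied", "rental (private)", "rental (social)"]
# Per-class precision/recall/F1 shows which tenure classes the model struggles with
print(classification_report(y_test, y_pred, labels=labels))
# The confusion matrix shows which classes get mistaken for each other
print(confusion_matrix(y_test, y_pred, labels=labels))
# Macro-F1 weights each class equally, which matters with imbalanced tenure classes
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print(scores.mean(), scores.std())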

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@sofiapinto left a comment

@crispy-wonton amazing work! There's so much knowledge of handling gdfs, polars, and UK open datasets! I learnt loads from reviewing this PR. And so much work in this PR - thank you.

I've left questions where I need clarifications, but haven't found any bugs or problems with the logic.

We have since discussed improvements on the modelling side, so I didn't leave any comments on that part of the code. Feel free to add your newest code and I'm happy to review again and think of improvements on the model/data used.

# ## Load data

# %%
fiona.listlayers("s3://asf-heat-pump-suitability/source_data/opmplc_gb.gpkg")


what does opmplc_gb stand for?

@crispy-wonton (author) replied:

sorry - should have added a note! It's the Open Map Local dataset for GB, and the layers are all the different geospatial datasets contained in the file :)

Comment on lines 77 to 79
listed_buildings_gdf = gpd.read_file(
"../spatial_clustering/data/National_Heritage_List_for_England_NHLE_v02_VIEW_-464524051049198649/Listed_Building_points.shp"
)


needs to be updated later to read from S3

@crispy-wonton (author) replied:

Thanks!
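
For reference, a sketch of the S3 variant, mirroring the s3:// reads used elsewhere in the notebook. The exact key below is hypothetical, and zipping the shapefile keeps the .shp/.dbf/.shx sidecar files together for the remote read:

import geopandas as gpd

# Hypothetical S3 key - update once the dataset is uploaded
listed_buildings_gdf = gpd.read_file(
    "s3://asf-heat-pump-suitability/source_data/Listed_Building_points.zip"
)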

Comment on lines 82 to 84
cons_areas_gdf = gpd.read_file(
"../spatial_clustering/data/conservation-area (1).geojson"
)


needs to be updated later to read from S3

"Secondary Education",
"Higher or University Education",
"Primary Education",
"Post Office",


I think this one will lead to some false negatives. It's very common to have flats above post offices. But if there's no other way to remove the false positives, then it's better to have some false negatives - just worth noting that that's the case.

I guess that there are some UPRNs we could keep, in case they are non-null UPRNs in EPC - since those UPRNs will be only domestic properties.

@crispy-wonton (author) replied:

Yeah, I was thinking this too when I wrote this list. I think in a city it's also reasonably common to have residential properties above certain types of education centres, places of worship, and perhaps sports and leisure centres. The methodology for identifying residential properties generally could do with some improvement.

Your second point is a great one - I think I need to revisit the UPRN-to-EPC join so that I retain any EPC UPRNs that might have been filtered out here. Something like the sketch below, perhaps.
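
A rough sketch of that rescue step in polars - uprns_df with a boolean "residential" flag and epc_df with a "UPRN" column are assumptions about the working dataframes:

import polars as pl

# Keep UPRNs flagged non-residential by the OS classification if they
# appear in EPC, since EPC UPRNs are domestic by definition
rescued = uprns_df.filter(
    ~pl.col("residential") & pl.col("UPRN").is_in(epc_df["UPRN"])
)
residential_uprns_df = pl.concat([uprns_df.filter(pl.col("residential")), rescued])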

)

print("\nLoading off gas postcodes...")
off_gas_list = off_gas.process_off_gas_data()


Are these used as postcodes with properties off gas or with the assumption that all properties in those postcodes are off gas? If the second one, I think that's probably likely in rural areas but not necessarily otherwise. Am I wrong?

@crispy-wonton (author) replied:

From my understanding, it's that the whole postcode doesn't have a gas connection - see Xoserve's documentation of the data; I've lifted the below from it. I think this means we can assume no properties in the postcode have a gas supply.

The Off Gas Postcode Register is a list of postcodes where Xoserve holds no record of an active gas connection by either large or small gas transporters.

@sofiapinto replied:

interesting, my bad then. Glad this is the case!
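
Given that reading, the flag can be a straight postcode membership test - a sketch assuming a working uprns_df with a "postcode" column (off_gas_list as loaded above):

import polars as pl

# Every property in an off-gas postcode is assumed to have no gas supply
uprns_df = uprns_df.with_columns(
    pl.col("postcode").is_in(off_gas_list).alias("off_gas")
)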


print("Loading age band data...")
age_bands_df = pl.read_csv(
"s3://asf-heat-pump-suitability/exploration/spatial_clustering_plymouth/2021Census_age_bands_OA_plymouth.csv",
skip_rows=4,


I didn't open the original datasets without skip_rows, so I haven't checked if this is correct. For age bands skip_rows=4 is used, while skip_rows=6 is used for all other datasets. Hopefully this is correct, but wanted to flag it in case it was a typo.

@crispy-wonton (author) replied:

I think it's correct - I checked all the dfs manually before creating these. I remember there were different header lengths across the different files.

)

# Join census datasets together
census_df = oa_tenure_df


Suggested change
census_df = oa_tenure_df
census_df = oa_tenure_df.clone()

@crispy-wonton (author) replied:

this is just for renaming purposes basically, as oa_tenure_df isn't used again. Also, I don't think deep copying dfs is that important in polars as there aren't many (if any?) in-place operations.
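
A small sketch illustrating the point - polars operations return new frames rather than mutating in place, so a plain rebind is safe here:

import polars as pl

df = pl.DataFrame({"oa_code": ["E00012345"], "tenure": ["owner-occupied"]})
alias = df  # plain rebind: both names point to the same DataFrame

# rename returns a new frame; the original is untouched even without .clone()
renamed = alias.rename({"tenure": "tenure_2021"})
print(df.columns)      # ['oa_code', 'tenure']
print(renamed.columns) # ['oa_code', 'tenure_2021']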

)
.join(census_df, how="left", left_on="oa21", right_on="oa_code")
.join(
features_df.select(["UPRN", "in_cons_area", "in_listed_building"]),


Suggested change
features_df.select(["UPRN", "in_cons_area", "in_listed_building"]),
features_df.select(["UPRN", "in_conservation_area", "in_listed_building"]),

if you decide to change above

Comment on lines +1090 to +1092
max_prob=pl.concat_list(
"owner-occupied", "rental (private)", "rental (social)"
).list.max(),


Obviously not for now, but for when this is refactored: I think this part requires a bit of documentation - it's not immediately obvious what is happening.

@crispy-wonton (author) replied:

Good point!!
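
For when it's refactored, a commented restatement of the expression (probs_df standing in for the frame holding the three class-probability columns):

import polars as pl

# Each column holds the classifier's probability for that tenure class.
# concat_list builds a per-row list [p_owner, p_private, p_social] and
# .list.max() takes the row-wise maximum, i.e. the confidence of the
# predicted (most likely) class.
probs_df = probs_df.with_columns(
    max_prob=pl.concat_list(
        "owner-occupied", "rental (private)", "rental (social)"
    ).list.max()
)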

@lizgzil left a comment

Thanks so much for all this @crispy-wonton !!

The code worked well and was great to see such elegant polars mastery.

My version of scikit-learn (1.7.1) meant I needed to change the code, but I see now you wrote in the description that 1.3 might be needed anyway!

I looked over it all since I needed to run it all, but in my review I focused on the tenure model and the extra code for adding the IMD deciles. I think it looks great - no big changes at all, mostly notes.

Comment on lines +91 to +93
cons_areas_gdf = gpd.read_file(
"s3://asf-heat-pump-suitability/exploration/spatial_clustering_plymouth/conservation-area (1).geojson"
)

is this the new dataset (#167)?

fiona.listlayers("s3://asf-heat-pump-suitability/source_data/opmplc_gb.gpkg")

# %%
print("LOADING DATASETS TO GET RESIDENTIAL UPRNS FOR PLYMOUTH...")

I can see some of the data you read from exploration/spatial_clustering_plymouth is particular to that part of the country (I guess the SX-prefixed ones), but are some of them for the whole UK (e.g. listed buildings and conservation zones)? Nothing to change - but I'm curious about your logic for including data in this exploration/spatial_clustering_plymouth folder vs the source_data folder. Perhaps just for speed during exploration?

)

print("\nLoading off gas postcodes...")
off_gas_list = off_gas.process_off_gas_data()

I think pyarrow needs adding to the requirements. I needed to do pip install pyarrow to get off gas data to load.

Comment on lines +158 to +160
uprns_in_buildings = os_openmap_buildings_plymouth_gdf.sjoin(
os_uprn_plymouth_gdf, how="inner", predicate="contains"
)["UPRN"].tolist()

When I was plotting, I think there were some cases where a UPRN coordinate is just outside the building polygon (see the example on the left of slide 9 here). So when improvements are made, I guess we need to see if we can account for these cases - one idea sketched below.
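
One possible way to handle those cases - a nearest join with a small distance cap instead of strict containment. The 5 m tolerance is an assumption to tune, and the coordinates need to be in a projected CRS such as British National Grid:

# Catch UPRN points that fall just outside their building polygon
uprns_near_buildings = os_uprn_plymouth_gdf.sjoin_nearest(
    os_openmap_buildings_plymouth_gdf, how="inner", max_distance=5
)["UPRN"].tolist()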

coords = np.array(nn_tenure_df.geometry.map(lambda p: [p.x, p.y]).tolist())

# Find all neighbours within 100m radius of each UPRN
knn = NearestNeighbors(radius=100, algorithm="kd_tree").fit(coords)

At some point it'd be good to play around with this threshold to see what difference it makes to the results - e.g. with a sweep like the one below.
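
A quick sweep to see how neighbour counts respond to the radius - the candidate values are illustrative, and coords is the array built above:

import numpy as np
from sklearn.neighbors import NearestNeighbors

for radius in [25, 50, 100, 200]:
    knn = NearestNeighbors(radius=radius, algorithm="kd_tree").fit(coords)
    _, indices = knn.radius_neighbors(coords)
    counts = [len(idx) - 1 for idx in indices]  # exclude the point itself
    print(radius, np.mean(counts), np.median(counts))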

Comment on lines +1272 to +1274
y_pred_arr = np.copy(y_pred)
for k, v in tenure_mapping.items():
y_pred_arr[y_pred == k] = v

Not sure why, but this wasn't working as expected for me - y_pred_arr wasn't changed to numbers.

y_pred_arr = [tenure_mapping.get(v) for v in y_pred] worked though.
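
A likely explanation: np.copy preserves y_pred's string dtype, so the integers assigned into the copy are cast back to strings. A minimal sketch with hypothetical labels:

import numpy as np

tenure_mapping = {"owner-occupied": 0, "rental (private)": 1, "rental (social)": 2}
y_pred = np.array(["owner-occupied", "rental (social)"])

y_pred_arr = np.copy(y_pred)  # inherits a string dtype such as '<U16'
for k, v in tenure_mapping.items():
    y_pred_arr[y_pred == k] = v
print(y_pred_arr.dtype)  # still a string dtype, so values are "0", "2", ...

# A dtype-safe alternative alongside the list comprehension above
y_pred_num = np.vectorize(tenure_mapping.get)(y_pred)
print(y_pred_num)  # [0 2] as integers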

Comment on lines +1482 to +1490
# One hot encode property type data
onehotenc_df = pd.DataFrame(
enc.transform(
pd_features_df[["use_property_type"]].rename(
columns={"use_property_type": "property_type"}
)
).toarray()
)
onehotenc_df.columns = list(col.lower().replace(" ", "_") for col in enc.categories_[0])

would be good to add all your processing steps to a function so you can be sure you cleaned your training data in the same way as you cleaned the data for prediction. It looks consistent to me now though, so I don't think there are bugs. Something like the sketch below, for example.
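
A sketch reusing the names from the snippet above, where enc is the fitted OneHotEncoder:

import pandas as pd

def encode_property_type(df: pd.DataFrame, enc) -> pd.DataFrame:
    """One hot encode property type with the fitted encoder so training
    and prediction share exactly the same cleaning steps."""
    onehot = pd.DataFrame(
        enc.transform(
            df[["use_property_type"]].rename(
                columns={"use_property_type": "property_type"}
            )
        ).toarray(),
        index=df.index,
    )
    onehot.columns = [col.lower().replace(" ", "_") for col in enc.categories_[0]]
    return onehot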

# %%
# Cluster UPRNs on distance only
model = HDBSCAN(
min_cluster_size=5,

Not sure on this - should it be 1? It's fine for now; will need to think it through, but just noting!

model = HDBSCAN(
min_cluster_size=5,
cluster_selection_epsilon=20,
algorithm="balltree",

FYI my version of HDBSCAN required this to be `algorithm="ball_tree"`. I have scikit-learn==1.7.1. (I believe these option names were renamed in newer scikit-learn versions.)
