Conversation
lizgzil left a comment
This looks really good! I ran it and it worked (I commented out the save bit, though). Just some suggestions for style, but I doubt they will save much computing time. I may have spotted one bug on line 71 of utils, hence the requested changes (a comment below it gives a fix).
> 3. **Generate indicator score by weighting company buffer counts per EPC home,** where the minimum buffer count is weighted more heavily than the maximum buffer count. If a given EPC home falls within many companies' minimum buffers, the home gets a high proximity indicator score; the opposite is true if the home only falls within many maximum buffers.
>
> <img width="814" alt="indicator_visual" src="https://user-images.githubusercontent.com/46863334/166961845-b5faebef-6e8e-4338-a2fc-817f01669d91.png">
```python
latlongs = installation_info[["longitude", "latitude"]].values
if len(latlongs) > 1:
    central_point = calculate_point_midpoint(latlongs[0], latlongs[1])
elif company_name in list(set(cleaned_installer_companies.installer_name)):
    central_point = tuple(
        cleaned_installer_companies[
            cleaned_installer_companies["installer_name"] == company_name
        ][["longitude", "latitude"]].values[0]
    )

dists = [
    distance(central_point, install_latlong) for install_latlong in latlongs
]
```
I think you can utilise a bit more of pandas here. Also I think you need a final `else`, even if it's just an error message.

```python
if len(installation_info) > 1:
    central_point = installation_info[["longitude", "latitude"]].mean()
elif company_name in set(cleaned_installer_companies.installer_name):
    central_point = cleaned_installer_companies[
        cleaned_installer_companies["installer_name"] == company_name
    ][["longitude", "latitude"]].values[0]
else:
    print("some sort of message")

def func_name(long, lat, central_point):
    return distance(tuple(central_point), (long, lat))

dists = installation_info.apply(
    lambda x: func_name(x["longitude"], x["latitude"], central_point), axis=1
).tolist()
```
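A self-contained sketch of this suggestion, with a toy Euclidean helper standing in for the project's geodesic `distance` function (assumed to take two coordinate tuples) and made-up coordinates:

```python
import pandas as pd

# Toy stand-in for the project's `distance` helper; the real one is
# presumably geodesic and returns a geopy-style Distance object.
def toy_distance(p1, p2):
    return ((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2) ** 0.5

# Made-up installation coordinates for one company
installation_info = pd.DataFrame(
    {"longitude": [0.0, 2.0, 4.0], "latitude": [1.0, 3.0, 5.0]}
)

# .mean() on the two columns gives a Series holding the centroid of
# *all* installations, not just a midpoint of the first pair
central_point = installation_info[["longitude", "latitude"]].mean()

dists = installation_info.apply(
    lambda x: toy_distance(tuple(central_point), (x["longitude"], x["latitude"])),
    axis=1,
).tolist()
```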
```python
for company_name, installation_info in clean_installations.groupby(
    "installer_name"
):
    latlongs = installation_info[["longitude", "latitude"]].values
```

Warning! You use `latlongs` as the variable name but the order is long, lat. Just thought I'd flag it in case this means something needs to be switched around.
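To illustrate why the ordering matters: geopy-style distance helpers expect (latitude, longitude) tuples, so arrays built from [longitude, latitude] columns would quietly give wrong distances if the project's `distance` follows that convention. A minimal haversine stand-in with made-up UK coordinates:

```python
from math import radians, sin, cos, asin, sqrt

# Minimal haversine (km); like geopy, it expects (lat, long) tuples
def haversine_km(p1, p2):
    lat1, lon1, lat2, lon2 = map(radians, (*p1, *p2))
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371 * asin(sqrt(a))

london = (51.5, -0.1)       # (lat, long) -- correct order
manchester = (53.5, -2.2)

correct = haversine_km(london, manchester)
# Tuples built from [longitude, latitude] columns arrive reversed,
# which silently produces a different number
swapped = haversine_km(london[::-1], manchester[::-1])
```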
```python
central_point = tuple(
    cleaned_installer_companies[
        cleaned_installer_companies["installer_name"] == company_name
    ][["longitude", "latitude"]].values[0]
)
```

You are taking the first row with this name, but are there any cases where there might be multiple companies with the same name?
```python
latlongs = installation_info[["longitude", "latitude"]].values
if len(latlongs) > 1:
    central_point = calculate_point_midpoint(latlongs[0], latlongs[1])
```

I don't think this is the centroid of all the lat longs, but rather the midpoint of just the first and second?
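To make the difference concrete (using a plain coordinate average as a simplified stand-in for a proper geodesic centroid, with made-up coordinates):

```python
import numpy as np

# Made-up coordinates for three installations of one company
latlongs = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]])

# What the current code computes: the midpoint of only the first two rows
pairwise_midpoint = (latlongs[0] + latlongs[1]) / 2   # ignores the third point

# The centroid of *all* the points, which is presumably what was intended
centroid = latlongs.mean(axis=0)
```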
```python
comp_buffers = [
    generate_geodesic_point_buffer(
        central_point[1], central_point[0], dist.km
    )
    for dist in (min(dists), dist_25, dist_50, dist_75, max(dists))
]
company_install_shapes[company_name] = {
    "min_buffer": comp_buffers[0],
    "25_buffer": comp_buffers[1],
    "50_buffer": comp_buffers[2],
    "75_buffer": comp_buffers[3],
    "max_buffer": comp_buffers[4],
}
```

Perhaps slightly nicer:

```python
company_install_shapes[company_name] = {
    buffer_name: generate_geodesic_point_buffer(central_point[1], central_point[0], dist.km)
    for buffer_name, dist in [("min_buffer", min(dists)), ("25_buffer", dist_25), ....]
}
```
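For reference, the completed comprehension might look like the sketch below. The stub function, the `Dist` class, and all values are made up for illustration: the real `generate_geodesic_point_buffer` returns a polygon, and `dists` holds geopy-style distance objects (comparable, with a `.km` attribute).

```python
# Stub: the real function returns a geodesic polygon buffer
def generate_geodesic_point_buffer(lat, long, km):
    return ("buffer", lat, long, km)

# Stand-in for geopy Distance objects: comparable, with a .km attribute
class Dist:
    def __init__(self, km):
        self.km = km
    def __lt__(self, other):
        return self.km < other.km

central_point = (-0.1, 51.5)          # (long, lat), hence the [1], [0] indexing
dists = [Dist(7.0), Dist(1.0), Dist(3.0)]
dist_25, dist_50, dist_75 = Dist(2.0), Dist(4.0), Dist(6.0)

company_install_shapes = {}
company_install_shapes["Some Installer Ltd"] = {
    buffer_name: generate_geodesic_point_buffer(
        central_point[1], central_point[0], dist.km
    )
    for buffer_name, dist in [
        ("min_buffer", min(dists)),
        ("25_buffer", dist_25),
        ("50_buffer", dist_50),
        ("75_buffer", dist_75),
        ("max_buffer", max(dists)),
    ]
}
```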
```python
"lost ",
len(self.epc_data) - len(self.epc_geo_data),
" records when geocoding.",
```

I noticed len(self.epc_geo_data) was 9999. Made me wonder if records are lost because of some sort of upper bound to the length? (No idea about this really, but 9999 just felt like a curious number.)
```python
in_buffer = [shape.contains(epc_geo) for shape in company_shapes.values()]
if any(in_buffer) is not False:
    smallest_buffer = list(company_shapes.keys())[in_buffer.index(True)]
    buffer_distribution.append(smallest_buffer)
```

I think you can do:

```python
for buffer_name, shape in company_shapes.items():
    if shape.contains(epc_geo):
        buffer_distribution.append(buffer_name)
        break
```
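The early-exit version relies on company_shapes being ordered smallest buffer first (dicts preserve insertion order), so the first hit is the smallest containing buffer and the remaining, larger buffers are never tested. A self-contained sketch with a toy circle standing in for the real polygon buffers:

```python
# Toy circular "shape" with the same contains() interface as the real buffers
class Circle:
    def __init__(self, radius):
        self.radius = radius
    def contains(self, point):
        x, y = point
        return x * x + y * y <= self.radius * self.radius

# Buffers ordered smallest -> largest, as company_shapes is built
company_shapes = {
    "min_buffer": Circle(1),
    "50_buffer": Circle(5),
    "max_buffer": Circle(10),
}

buffer_distribution = []
for epc_geo in [(0.0, 0.5), (0.0, 3.0), (20.0, 0.0)]:
    # Break on the first (i.e. smallest) buffer containing the home;
    # homes outside every buffer append nothing
    for buffer_name, shape in company_shapes.items():
        if shape.contains(epc_geo):
            buffer_distribution.append(buffer_name)
            break
```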
```python
"""
weighted_installer_closeness_score = 0
for buffer_dist, buffer_count in buffer_distb.items():
    if "min_" in buffer_dist:
```

To cut down on ink, before the `if` you could set `buffer_count / buffer_distb["max_buffer"]` to a variable and then use it in the rest of the calculations.
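A sketch of that refactor. The weights here are entirely made up for illustration; the real weighting scheme lives in the utils function.

```python
# Hypothetical buffer weights -- the real values are defined in
# installer_proximity_indicator_utils.py, not here
WEIGHTS = {"min_buffer": 1.0, "25_buffer": 0.75, "50_buffer": 0.5,
           "75_buffer": 0.25, "max_buffer": 0.0}

def weighted_score(buffer_distb):
    score = 0.0
    max_count = buffer_distb["max_buffer"]
    for buffer_dist, buffer_count in buffer_distb.items():
        # Compute the ratio once instead of repeating it in every branch
        ratio = buffer_count / max_count
        score += WEIGHTS[buffer_dist] * ratio
    return score
```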
lizgzil left a comment
Sorry for such a long delay in looking at this PR again 😳

I couldn't get it to run since I don't have asf-core-data/outputs/MCS/cleaned_geocoded_mcs_installations.csv. I've made a suggested change since I think this is correct, but see what you think.

```python
if len(installation_info) > 1:
    central_point = installation_info[["longitude", "latitude"]].mean()
central_point = cleaned_installer_companies[["longitude", "latitude"]].median()
```

Suggested change:

```diff
- central_point = cleaned_installer_companies[["longitude", "latitude"]].median()
+ central_point = cleaned_installer_companies[cleaned_installer_companies["installer_name"] == company_name][["longitude", "latitude"]].median()
```

Thanks for this!!! I've got to re-jog my memory with this one but I'll wrap this up soon :)
Hey @lizgzil,
Thanks for reviewing my indicator flow - really appreciate it.
The main two scripts to review are:
- installer_proximity_indicator_flow.py
- installer_proximity_indicator_utils.py

To run the flow in the activated conda environment: `python installer_proximity_indicator_flow.py run`. This should generate an indicator score for a sample of 10k EPC records.

Let me know if any of my calculations in the utils file seem a bit funny, or if the functions calculate_buffer_distribution and calculate_weighted_installer_proximity could be optimised for speed, because it takes a bit of time for the step that uses these functions to finish running in the flow.
Cheers!