Supervised ML Classifiers for Predicting Water Wells Condition

1. Project Overview

This project explores the use of supervised machine learning classifiers to predict the condition of water wells in Tanzania. By leveraging a ternary classification approach, the goal is to distinguish between wells that are functional, non-functional, or functional but in need of repair. The workflow encompasses data preprocessing, model building, evaluation, and actionable recommendations to support sustainable water resource management.

2. Business Understanding

Access to clean and reliable water is a critical challenge in Tanzania. Many wells fall into disrepair or become non-functional, impacting communities' health and livelihoods. Predicting the condition of water wells enables stakeholders to prioritize maintenance, allocate resources efficiently, and ensure long-term water access. This project addresses the question: Can supervised ML classification models trained on available data predict the operational status of a water well in Tanzania?

3. Data Preprocessing

The data preprocessing pipeline comprises the following steps, executed sequentially to prevent data leakage:

1. Define the exogenous variables (features, X) and the endogenous variable (target, y).
2. Perform the train-test split.
3. Drop redundant and irrelevant columns in X_train.
4. Handle missing values in X_train.
5. Engineer features on X_train.
6. Check for multicollinearity among the numerical features in X_train.
7. Normalize the numerical features in X_train.
8. One-hot encode the categorical features in X_train.
9. Label-encode the target variable in y_train.
10. Address class imbalance in y_train.
11. Preprocess the test set (X_test and y_test): steps 3 to 9.
12. Preprocess the evaluation data (testdata.csv): steps 3 to 8.
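The leakage-prevention idea behind this ordering can be sketched in a few lines: split first, learn every preprocessing parameter from the training rows only, then reuse those parameters on the test rows. The sketch below is an illustration under stated assumptions, not the project's notebook code. The toy rows, the column names (`pump_age_years`, `water_source`) and the helper names are hypothetical, and min-max scaling stands in for whichever normalization the notebook uses.

```python
import random

# Hypothetical toy rows: (pump_age_years, water_source, status).
ROWS = [
    (1.0, "spring", "functional"),
    (3.0, "river", "functional"),
    (4.0, "spring", "functional"),
    (6.0, "borehole", "functional"),
    (8.0, "river", "functional needs repair"),
    (9.0, "spring", "functional needs repair"),
    (11.0, "borehole", "functional needs repair"),
    (12.0, "river", "functional needs repair"),
    (14.0, "borehole", "non functional"),
    (17.0, "river", "non functional"),
    (20.0, "spring", "non functional"),
    (25.0, "borehole", "non functional"),
]


def split_rows(rows, test_fraction=0.25, seed=42):
    """Shuffle and split BEFORE any statistic is computed."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_rows(ROWS)

# "Fit" step: every parameter below is learned from the training rows only.
ages_train = [age for age, _, _ in train]
age_min, age_max = min(ages_train), max(ages_train)
source_categories = sorted({source for _, source, _ in train})
label_ids = {label: i for i, label in enumerate(sorted({s for _, _, s in train}))}


def transform(rows):
    """Apply the train-fitted parameters to any split (train or test)."""
    span = (age_max - age_min) or 1.0
    X = [
        [(age - age_min) / span]
        + [1 if source == category else 0 for category in source_categories]
        for age, source, _ in rows
    ]
    y = [label_ids[status] for _, _, status in rows]
    return X, y


X_train, y_train = transform(train)
X_test, y_test = transform(test)
```

Because the minimum, maximum, and category list come from the training rows alone, scaled test values may fall outside [0, 1], and a category unseen in training would encode as all zeros. scikit-learn transformers follow the same pattern through their separate `fit` and `transform` methods.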

4. Modelling

Multiple supervised ML classifiers are built, trained on a balanced training set, tuned, and compared to determine the best-fit model for predicting the ternary target variable.

Decision Tree Classifier (baseline): The baseline non-parametric model captures non-linear relationships and feature interactions. Hyperparameter tuning is implemented using GridSearchCV to optimize max_depth, min_samples_split, and min_samples_leaf.
Gradient Boosting Classifier: The ensemble boosting model combines multiple weak learners sequentially, with each learner correcting the errors of its predecessors, to improve performance. Hyperparameter tuning is performed using GridSearchCV to optimize n_estimators, learning_rate, max_depth, subsample, and max_features.
Random Forest Classifier: The ensemble bagging model builds multiple independent decision trees in parallel and aggregates their predictions to improve performance. Hyperparameter tuning is implemented via GridSearchCV to optimize n_estimators, max_depth, min_samples_split, and max_features.
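As an illustration of the tuning step, the sketch below runs GridSearchCV over the three Decision Tree hyperparameters named above. It is a minimal example under assumptions: the data is synthetic (generated with `make_classification`, not the wells dataset), and the grid values, the `f1_macro` scoring choice, and the 3-fold cross-validation are placeholders rather than the settings used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the preprocessed wells features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Assumed grid values; only the parameter names come from the text above.
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",  # assumption: macro-averaged F1 across the three classes
    cv=3,
)
search.fit(X_train, y_train)  # cross-validation uses the training split only

best_params = search.best_params_
test_score = search.score(X_test, y_test)  # macro F1 of the refit best model
```

The same pattern applies to the two ensemble models by swapping the estimator and the grid keys. Ensemble grids grow quickly, so a smaller grid or fewer folds keeps the search time manageable.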

5. Model Evaluation

Confusion Matrices

Gradient Boosting Classifier:

Random Forest Classifier:

ROC Curves

Gradient Boosting Classifier:

Random Forest Classifier:

Performance Metrics

The table below reports the predictive performance of the three classifiers on the train-set and the test-set, based on F1-score and ROC-AUC.

Model	F1-score (Train-set)	F1-score (Test-set)	ROC-AUC (Train-set)	ROC-AUC (Test-set)
Decision Tree Classifier (tuned)	0.998	0.756	1.0	0.756
Gradient Boosting Classifier (tuned)	0.876	0.777	0.971	0.889
Random Forest Classifier (tuned)	0.972	0.791	0.999	0.900
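For a ternary target, a single F1-score has to be aggregated over the three classes. The source does not state which averaging scheme was used, so the sketch below assumes macro averaging (the unweighted mean of per-class F1) purely for illustration. The confusion-matrix counts are hypothetical and do not come from the project's results.

```python
# Hypothetical 3x3 confusion matrix: rows are true classes, columns are
# predicted classes, in the order
# [functional, functional needs repair, non functional].
confusion = [
    [80, 5, 15],
    [10, 20, 10],
    [20, 5, 75],
]


def per_class_f1(matrix):
    """Return one F1 value per class from a square confusion matrix."""
    n = len(matrix)
    scores = []
    for k in range(n):
        tp = matrix[k][k]
        fp = sum(matrix[i][k] for i in range(n)) - tp  # predicted k, truly other
        fn = sum(matrix[k]) - tp                       # truly k, predicted other
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return scores


f1_by_class = per_class_f1(confusion)
macro_f1 = sum(f1_by_class) / len(f1_by_class)
```

Macro averaging weights the minority class (here, wells needing repair) as heavily as the majority classes, which is one reason it is often preferred on imbalanced problems. A weighted average would instead scale each class by its support.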

The table below reports the prediction accuracy of the three classifiers on the testdata.csv dataset.

Model	Prediction Accuracy Percentage
Decision Tree Classifier	67.54%
Gradient Boosting Classifier	71.06%
Random Forest Classifier	70.77%

Selected Model for Deployment: Although the Random Forest Classifier outperforms the other models on the test-set across the three performance metrics (accuracy, F1-score, and ROC-AUC), the Gradient Boosting Classifier is selected for deployment. It shows the smallest gap between train-set and test-set performance (F1-score: 0.876 vs. 0.777; ROC-AUC: 0.971 vs. 0.889), which indicates the least overfitting of the three models. Its generalizability to this ternary classification problem is further supported by its higher prediction accuracy (71.06%) on the testdata.csv dataset.
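The train-test gap argument can be checked directly against the reported metrics. The short sketch below recomputes the gaps from the values in the performance table. The dictionary layout and the function name are illustrative choices, and only the numbers come from the table.

```python
# (train-set, test-set) scores copied from the performance table.
reported = {
    "Decision Tree Classifier": {"f1": (0.998, 0.756), "roc_auc": (1.0, 0.756)},
    "Gradient Boosting Classifier": {"f1": (0.876, 0.777), "roc_auc": (0.971, 0.889)},
    "Random Forest Classifier": {"f1": (0.972, 0.791), "roc_auc": (0.999, 0.900)},
}


def generalization_gap(train_score, test_score):
    """Train-set score minus test-set score; smaller suggests less overfitting."""
    return round(train_score - test_score, 3)


gaps = {
    model: {metric: generalization_gap(*scores) for metric, scores in metrics.items()}
    for model, metrics in reported.items()
}

# Model with the smallest gap, per metric.
smallest_gap = {
    metric: min(gaps, key=lambda model: gaps[model][metric])
    for metric in ("f1", "roc_auc")
}

for model, model_gaps in gaps.items():
    print(f"{model}: F1 gap = {model_gaps['f1']}, ROC-AUC gap = {model_gaps['roc_auc']}")
```

On both metrics the Gradient Boosting Classifier has the smallest gap (0.099 for F1-score and 0.082 for ROC-AUC), compared with 0.181 and 0.099 for the Random Forest Classifier and 0.242 and 0.244 for the Decision Tree Classifier.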

Top 10 Important Features

6. Conclusion

The analysis demonstrates that tuned supervised ML models can predict the condition of water wells from appropriately preprocessed features. In particular, the tuned Gradient Boosting Classifier generalizes well (test-set F1-score of 0.777 and ROC-AUC of 0.889), and stakeholders in the Tanzanian water sector can use it to anticipate well failures and plan interventions proactively.

7. Business Recommendations

Prioritize Maintenance: Model predictions can be used to identify wells at risk and allocate maintenance resources efficiently.
Data Collection: Continuously improve the quality of the training dataset by collecting more data on the most important features and ensuring that variable labels and values are recorded accurately.
Stakeholder Engagement: Insights from this project should be shared with local authorities, NGOs involved in public service, and Tanzanian Government representatives to support data-driven decision-making.

8. Next Steps

Model Deployment: Integrate the hyperparameter-tuned Gradient Boosting Classifier into a user-friendly dashboard for real-time predictions, and use the model's predictions to pilot targeted interventions.
Feature Expansion: Incorporate additional features (such as weather data and a water well's usage patterns) to enhance model accuracy.
Continuous Monitoring: Formulate and implement frameworks for updating the model's training dataset with the latest data to improve the classifier's performance.

9. Repository Structure

├── data
├── images
├── .gitignore
├── README.md
├── index.ipynb
├── notebook.pdf
└── presentation.pdf
