Tanzanian Waterpoints: Ternary classification with three ML models

Seong Minh Jang
12 min read · Mar 19, 2022

In this article I walk through how I built three classification models to predict the operational status of waterpoints in Tanzania.

Data

The dataset used in this project is freely available and, at the time of writing, is still part of an open competition (Source). It contains information about waterpoints spread throughout Tanzania.

The target variable, status_group, indicates the operating status of a given waterpoint, essentially whether it is working or needs work. It takes three fairly self-explanatory values:

  • functional
  • non functional
  • functional needs repair

In addition to status, there were 39 features describing each waterpoint, such as where it is located, how much water it holds, and its altitude. Descriptions of each variable can be found here.

I. EDA

The goal of the EDA was to better understand how different features relate to status and, in doing so, decide which features to engineer, keep, and drop before creating any models.

As one can plainly see from the variables above, there are several groups of variables containing similar information about a waterpoint. Understanding which variables had an outsized impact on status was also important.

Labels Analysis

Looking at status, it was clear that the status groups were not equally represented in the data: roughly 7% of the waterpoints belong to the 'functional needs repair' category. Oversampling is used below when fitting the models to account for this.

Figure 1: Number of waterpoints belonging to each category in "status".

Missing Values

Some features could be removed simply because of their high number of missing values.

Figure 2: Visualization of missing values in the dataset grouped by variable.

Geographic Data

Here you can see where all of the waterpoints are located.

  • red: functional
  • yellow: non functional
  • blue: functional needs repair

Though there is a lot of overlap between functional and non-functional waterpoints, there do seem to be particular areas where waterpoints that need repair (blue) are clustered.

Figure 3: Waterpoints plotted on a map based on latitude and longitude and colored by “status.”

EDA was then done on the smaller number of continuous variables before being conducted on the many categorical ones.

After extensive EDA on the different features, 26 columns were dropped. Some features, such as id, were simply irrelevant to status_group. Others had too many missing values. Many categorical features were highly correlated and described the same information (e.g. waterpoint_type and waterpoint_type_group).

Insights

  • Certain installers seem to have a higher incidence of non-functioning waterpoints. In the visualization below, you can see that of the waterpoints installed by 'RWE', only 25% are functional (compared to the 54% overall rate above).
Number of waterpoints installed by 'RWE' grouped by status.
  • Dry waterpoints were almost all non functional. In the visualization below, you can see that of the waterpoints considered 'dry', 97% are non functional (compared to the 38% average).

II. Feature Engineering

In addition to dropping columns, several features were engineered.

construction_year

construction_year included a high number of missing values, labeled as '0' in the data. The median age of a waterpoint also seemed to differ depending on the status group it belonged to:

median construction year for each status group

Due to the high number of unique values, and to limit the number of columns, construction_year was broken into four categories (a binning sketch follows the list):

  • unknown: year = 0
  • old: 0 < year <= 1994
  • mid: 1994 < year < 2003
  • new: year >= 2003
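
As a rough illustration, here is how that binning might look in pandas. This is a minimal sketch: the column and category names come from the post, but the exact implementation is my assumption.

import numpy as np
import pandas as pd

def bin_construction_year(df: pd.DataFrame) -> pd.Series:
    """Bin construction_year into the four categories above.
    A value of 0 marks a missing year in the raw data."""
    year = df['construction_year']
    conditions = [
        year == 0,                      # unknown
        (year > 0) & (year <= 1994),    # old
        (year > 1994) & (year < 2003),  # mid
        year >= 2003,                   # new
    ]
    labels = ['unknown', 'old', 'mid', 'new']
    return pd.Series(np.select(conditions, labels), index=df.index)

# hypothetical usage:
# df['construction_category'] = bin_construction_year(df)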

latitude / longitude

Second, rather than using raw latitude and longitude values, these were grouped into clusters and combined into one variable using KMeans clustering. As the figure below shows, the scores climb quickly until they reach 4 clusters and then level off, so 4 was chosen as the number of clusters.

Figure 4: Elbow curve showing optimal number of clusters for KMeans algorithm.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# score KMeans for k = 1..9 on the coordinate pairs to find the elbow
# (fixed: the elbow should be computed on both coordinates, matching the
# clustering below, not on latitude alone)
coords = df[['latitude', 'longitude']]
K_clusters = range(1, 10)
score = [KMeans(n_clusters=k).fit(coords).score(coords) for k in K_clusters]

# visualize
plt.plot(K_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()

# create the cluster column, using 4 clusters from the elbow above
kmeans = KMeans(n_clusters=4, init='k-means++')
df['cluster_label'] = kmeans.fit_predict(coords)

Dropping columns, adding the engineered features so far, and then one-hot encoding the categorical variables resulted in 109 columns at this stage.

funder/installer

Since it was clear from the EDA that certain installers and funders had a relationship with status_group, I created two boolean variables indicating whether a waterpoint was installed or funded by those parties.
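
A hypothetical sketch of those flags is below. The party lists are illustrative assumptions (only 'RWE' is named in the post), not the exact lists used.

# flag waterpoints installed/funded by parties that showed unusually low
# functionality rates in the EDA; the lists here are illustrative
risky_installers = ['RWE']            # assumption: installers flagged in EDA
risky_funders = ['SomeFunder']        # assumption: hypothetical placeholder

df['risky_installer'] = df['installer'].isin(risky_installers)
df['risky_funder'] = df['funder'].isin(risky_funders)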

III. Model Prototyping

After dropping the relevant columns, engineering features, and one-hot encoding the categorical variables, I was left with 115 columns. Next, I split the data and trained various classification models, as sketched below.
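
A minimal sketch of the encoding and split, assuming the cleaned DataFrame df from above; the test size is my assumption:

import pandas as pd
from sklearn.model_selection import train_test_split

# one-hot encode the remaining categorical columns
X = pd.get_dummies(df.drop(columns=['status_group']))
y = df['status_group']

# stratify on the target so the imbalanced classes keep their proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)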

These are the three types of models I worked with:

  • Logistic Regression
  • Random Forests
  • XGBoost

I first built a logistic regression model as a starting point, and then worked with the two tree-based ensemble methods. For each type of model, I first fit a baseline version and then tuned hyperparameters in an attempt to achieve optimal results.

A note on metrics

In order to compare and optimize models, several classification metrics were considered.

Within the target variable status_group, the 'non functional' and 'functional needs repair' classes seemed most important, in that order, since identifying these waterpoints should lead to improvements being made faster.

If ‘in need’ waterpoints are those that need repair or are non-functional, incorrectly identifying waterpoints as ‘in need’ should be preferable to incorrectly identifying waterpoints as functional (missing an ‘in need’ waterpoint).

From the perspective of the ‘in need’ waterpoints, I was looking to:

  • Minimize false negatives (missing an ‘in need’ waterpoint). False negatives happen when the model chooses ‘functional’ when the waterpoint was actually ‘non functional’ or ‘functional needs repair’.
  • This would come at the expense of an increase in false positives (saying a waterpoint is ‘in need’ when it isn’t). False positives happen when the model chooses ‘non functional’ or ‘functional needs repair’ when the waterpoint is actually functional.
  • In other words, I was looking to maximize recall scores for the minority classes. To do this, I used an oversampling technique called SMOTE:

SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling technique commonly used to deal with class imbalance (as seen in figure 1 above). In this case, the minority classes ('functional needs repair' and 'non functional') are arguably more important. SMOTE adds synthetic data points to the minority classes so that these classes are less likely to be ignored by machine learning models.
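
For intuition, here is what SMOTE does to the class counts when run on its own. This is just a sketch; in the models below it is applied inside a pipeline so it only touches the training folds.

from collections import Counter
from imblearn.over_sampling import SMOTE

print(Counter(y_train))  # imbalanced: 'functional needs repair' is rare

X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(Counter(y_res))    # all three classes now equally represented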

Baseline Models

I fit a baseline model of each type without any scaling or tuning. Because none of these is a distance-based algorithm, it wasn't deemed necessary to normalize the data before training the models.

  • Baseline Logistic Regression Model Cross Validation Score: 73.53%
  • Baseline Random Forests Model Cross Validation Score: 78.32%
  • Baseline XGBoost Model Cross Validation Score: 78.50%

Logistic Regression Tuning

To tune the logistic regression model, I used GridSearchCV, and for creating a pipeline I used the imblearn.pipeline package. Although a logistic regression model does not require scaling, I added StandardScaler so that coefficient weights could be compared after fitting. I also set up one pipeline with SMOTE and one without to compare the results. Using a pipeline allows SMOTE to be applied within every fold during cross-validation.

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler

# assumption: the cross-validation splitter used throughout (not shown in the post)
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# set params for search
logreg_grid_params = {'classifier__max_iter': [100, 1000, 2000],
                      'classifier__multi_class': ['multinomial'],
                      'classifier__C': [1, 1e16],
                      'classifier__class_weight': [None, 'balanced'],
                      'classifier__solver': ['lbfgs', 'sag', 'saga', 'newton-cg']}

# set up pipeline with SMOTE; doing this in a pipeline means SMOTE is
# applied within every fold
pipeline_smote = imbpipeline(steps=[('smote', SMOTE(random_state=42)),
                                    ('scaler', StandardScaler()),
                                    ('classifier', LogisticRegression())])

# set up pipeline without SMOTE for comparison
pipeline = imbpipeline(steps=[('scaler', StandardScaler()),
                              ('classifier', LogisticRegression())])

# set up the grid search objects
lr_grid_smote = GridSearchCV(estimator=pipeline_smote,
                             param_grid=logreg_grid_params,
                             cv=kf, verbose=1)
lr_grid_no_smote = GridSearchCV(estimator=pipeline,  # fixed: was pipeline_smote
                                param_grid=logreg_grid_params,
                                cv=kf, verbose=1)

# fit the grid searches
lr_grid_smote = lr_grid_smote.fit(X_train, y_train)
lr_grid_no_smote = lr_grid_no_smote.fit(X_train, y_train)

SMOTE Results Comparison

Here is a side-by-side comparison of results with and without SMOTE. Although overall testing accuracy is much lower for the model with SMOTE, this model also has far fewer false negatives (it misses the minority classes less), more true positives for the minority classes (correctly predicting 'in need' waterpoints), and a much higher recall score for the most underrepresented group, 'functional needs repair' (.61 vs .04).

Side by side comparison of classification reports and metrics for the logistic regression model with and without using SMOTE oversampling.

As the results show, overall accuracy suffered when adding oversampling. However, recall scores for the minority class (“functional needs repair”) improved. This makes sense — SMOTE adds synthetic data points to this class to address class imbalance, and in doing so, the model should predict that class more frequently.

In this case, false positives greatly increased, which meant that waterpoints were labeled as needing work ('non functional' or 'functional needs repair') when they were actually doing just fine ('functional'). However, false negatives decreased significantly, which meant the model fitted with SMOTE less frequently mislabeled 'in need' waterpoints as functional. This is what we want to work towards here, as we do not want to overlook waterpoints that require servicing. Still, there is clearly room for improvement. Enter ensemble methods!

Random Forests Tuning

To find the best Random Forests model, I conducted hyperparameter tuning using two methods: RandomizedSearchCV and GridSearchCV.

Baseline Random Forests Model Cross Validation Score: 74.97%

RandomizedSearchCV

Because a grid search is exhaustive, I conducted a randomized search first to narrow the search space a bit. Using the code below, I ran a Randomized Search, which fits models using random combinations drawn from the specified hyperparameter ranges. The result below is from the strongest model found.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# define the hyperparameter ranges
estimators = np.arange(100, 700, 50)
max_depth = np.arange(10, 110, 10)
criterion = ['gini', 'entropy']
min_samples_leaf = np.arange(1, 5, 1)
min_samples_split = np.arange(2, 10, 1)  # fixed: sklearn requires values >= 2
random_grid = dict(classifier__n_estimators=estimators,
                   classifier__max_depth=max_depth,
                   classifier__min_samples_leaf=min_samples_leaf,
                   classifier__min_samples_split=min_samples_split,
                   classifier__criterion=criterion)

pipeline_smote = imbpipeline(steps=[('smote', SMOTE(random_state=42)),
                                    ('classifier', RandomForestClassifier())])

# set up the randomized search object and fit
rf_random_grid = RandomizedSearchCV(estimator=pipeline_smote,
                                    param_distributions=random_grid,
                                    n_iter=100, cv=kf, random_state=123)
rf_random_grid = rf_random_grid.fit(X_train, y_train)
  • Top RandomizedSearchCV Model Cross Validation Score: 76.21%

GridSearchCV

Using the best parameters from the randomized search as a starting point, I applied further tuning with a grid search, which tries every combination of the specified hyperparameter ranges. A sketch of this follow-up search is below.
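
The narrowed ranges here are hypothetical placeholders around the randomized search's best values, not the exact grid from the notebook; the pipeline and splitter are reused from above.

# hypothetical narrowed grid around the randomized search's best values
rf_grid_params = {
    'classifier__n_estimators': [250, 300, 350],
    'classifier__max_depth': [40, 50, 60],
    'classifier__min_samples_split': [2, 3],
}
rf_grid = GridSearchCV(estimator=pipeline_smote,
                       param_grid=rf_grid_params,
                       cv=kf, verbose=1)
rf_grid = rf_grid.fit(X_train, y_train)
print(rf_grid.best_params_, rf_grid.best_score_)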

  • Top GridSearchCV Model Cross Validation Score: 77.63%

A report on the final Random Forests model indicates significant improvements over the logistic regression models in overall accuracy.

Classification Report and metrics for the final random forests model.

XGBoost Tuning

Repeating the methodology used for tuning the Random Forests models, I conducted hyperparameter tuning for XGBoost using two methods: RandomizedSearchCV and GridSearchCV. A sketch of the setup follows.

Baseline Model Cross Validation Score: 78.50%
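
This is a minimal sketch mirroring the Random Forests search above; the parameter ranges and n_iter are illustrative assumptions, not the post's exact settings.

from xgboost import XGBClassifier

# illustrative ranges, mirroring the Random Forests search above
xgb_random_params = {
    'classifier__n_estimators': np.arange(100, 700, 100),
    'classifier__max_depth': np.arange(3, 12, 2),
    'classifier__learning_rate': [0.01, 0.05, 0.1, 0.3],
    'classifier__subsample': [0.7, 0.85, 1.0],
}
# note: newer xgboost versions require integer-encoded class labels in y_train
xgb_pipeline_smote = imbpipeline(steps=[('smote', SMOTE(random_state=42)),
                                        ('classifier', XGBClassifier())])
xgb_random_grid = RandomizedSearchCV(estimator=xgb_pipeline_smote,
                                     param_distributions=xgb_random_params,
                                     n_iter=50, cv=kf, random_state=123)
xgb_random_grid = xgb_random_grid.fit(X_train, y_train)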

  • Top GridSearchCV Model Cross Validation Score: 77.63%

Classification Report and metrics for the final XGBoost model.

IV. Model Comparison

Judging by the cross validation scores, it was clear that Random Forests and XGBoost were the stronger algorithms.

I was most interested in recall scores for the minority classes. Below are graphs comparing each model’s recall scores for waterpoints that need repairs or are not functional.

Looking at the ‘functional needs repair’ class, you can see the effects of the SMOTE oversampling technique on each of the baseline models. An increase in recall here corresponds to finding more of the minority class (‘functional needs repair’). The model says ‘yes’ more frequently to this class, at the expense of more false positives (incorrectly classifying a waterpoint as this class).

Cross validation scores for the Random Forests and XGBoost models were similar, but there were some differences in results.

The final step here was to plot all three final models onto a radar chart for visualization:

The logistic regression model kept recall scores high for the 'needs repair' class, but was a poor performer otherwise and wasn't chosen as a result.

The winner here was the XGBoost model, although the Random Forests model was a close second. Though the XGBoost model was weaker on training time (it took longer to fit), it was stronger on cross-validation score and on the 'non functional' class, which is arguably the most important.

V. Feature Selection

As a final step, I took my 'winner' XGBoost model and optimized its feature set.

Starting again from the original data, rather than dropping columns based on EDA, I only dropped columns with significant missing values, as well as categorical columns with too many unique values.

The result was keeping 15 more columns than I used above. After one-hot encoding and a train-test split, I was left with 344 columns in my training data.

From here, I performed recursive feature elimination using sklearn's RFECV. This starts with all 344 features and then, removing one feature at a time, finds the optimal number of features to use.

from sklearn.feature_selection import RFECV

min_features_to_select = 1  # minimum number of features to consider

# recursive feature elimination: drop one feature per step, scoring the
# best XGBoost classifier found above with 3-fold cross-validation
xgb_rfe = xgb_grid_smote.best_estimator_.named_steps['classifier']
rfecv = RFECV(estimator=xgb_rfe,
              step=1,
              cv=3,
              scoring='accuracy',
              min_features_to_select=min_features_to_select,
              verbose=1)
rfecv.fit(X_train, y_train)
print(f'Optimal number of features: {rfecv.n_features_}')

# keep only the selected feature columns
X_train_rfe = X_train.loc[:, rfecv.support_]
X_test_rfe = X_test.loc[:, rfecv.support_]

After conducting the search, the optimal number of features chosen was 297.

Plot of number of features against model score. Optimal number of features was 297.

Plugging these features back in, the results showed a slight improvement in cross-validation score (78.39% vs 77.63%). Recall for the 'needs repair' class improved by 1% as well.

Although the results were slightly stronger, the resulting number of optimal features was significantly higher than the original model (297 vs. 115). The increase in dimensions means longer fitting and prediction times, which should be taken into consideration.

VI. Insights

Feature Importances

Using XGBoost, you can easily plot and view the importance weights of each feature; for example:
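
A sketch using xgboost's built-in plotting helper, assuming the fitted classifier from the pipeline above:

import matplotlib.pyplot as plt
from xgboost import plot_importance

# pull the fitted classifier out of the winning pipeline and plot the top
# features by weight (how often each feature is used in a split)
xgb_final = xgb_grid_smote.best_estimator_.named_steps['classifier']
plot_importance(xgb_final, max_num_features=15, importance_type='weight')
plt.tight_layout()
plt.show()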

As one can see, whether a waterpoint is considered dry is important, which corresponds to the EDA above: isolating dry waterpoints made it clear that most of them (specifically 97%) are non functional.

Predictions

After feeding the test data into the chosen XGBoost model, these were the regions predicted to have the most non functional and repair-needing waterpoints:

Regions predicted to have the most in need (non functional + needing repair) waterpoints.
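
A hypothetical sketch of this aggregation; test_regions is an assumed variable holding the (unencoded) region of each test-set row, which is not shown in the post.

# predict on the test set and count 'in need' waterpoints per region
preds = xgb_grid_smote.best_estimator_.predict(X_test)
results = pd.DataFrame({'region': test_regions, 'predicted': preds})
in_need = results[results['predicted'].isin(['non functional',
                                             'functional needs repair'])]
print(in_need['region'].value_counts().head(10))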

Plotting the waterpoints with their predicted status on a map revealed that a lot of the in-need waterpoints were expected to be near Lake Victoria.

Map of waterpoints around Lake Victoria with predicted status group. Yellow waterpoints are predicted to be non-functional, while blue waterpoints represent those needing repair.

Older waterpoints are expected to be in more dire need. Below is a count plot of predicted waterpoints grouped by age; 'old' waterpoints are those built in or before 1994.

Predicted counts of non functional waterpoints grouped by age.

Consistent with the EDA above, dry waterpoints and those installed by RWE are also predicted to have a higher incidence of issues.

VII. Conclusions

So there you have it! A walkthrough of all the steps to land on a final classification model, from EDA and preprocessing through model evaluation, comparison, and feature selection. A full notebook can be found here (Source).

It should come as no surprise that many features can be used to predict the status of waterpoints in Tanzania. The quantity of water, where the waterpoint is located, who installed it, and when it was built are all important predictors.

Next Steps

  • Other machine learning algorithms: KNN, Naive Bayes, and Support Vector Machines are missing from the above trials. Due to the size of the data, training time should be considered.
  • More feature selection techniques: Due to the larger number of features present in the dataset, utilizing other methods to refine the feature list could improve model performance and efficiency.
  • Metrics: As mentioned above, the final models in this analysis were chosen by optimizing for recall of the minority classes. Other metrics, like F1 score and precision, could be prioritized in future studies. In addition, other resampling techniques could be tried to see whether they improve results.
  • More feature engineering: Conducting more EDA and experimenting with other ways of creating new features could yield more positive outcomes.
  • Further investigation: Digging deeper into some of the insights. For example, it seems like waterpoints around Lake Victoria have more issues. Why? What makes waterpoints installed by certain parties more likely to have issues?
