Save water! Save life! Proposed methods

At this point in the project, I am still working on data preparation for exploration and subsequent analysis. Below I therefore give brief guidelines on how I will perform the data analysis. The sub-chapter on analysis in particular is not final; it only outlines the approach I intend to consider. What follows is a draft of the proposed methods.


The factors associated with operating conditions of water points in Tanzania


The data were obtained from Taarifa and the Tanzanian Ministry of Water via DrivenData’s competition “Pump it Up: Data Mining the Water Table”. The data describe water points such as wells, hand pumps, motor pumps and tube wells, from which water is drawn for consumption. The objective of the study is to predict which pumps are functional, which need some repairs and which are non-functional. The dataset is already split into a training subset of N = 59,400 observations (80%) and a testing subset of 14,850 observations (20%). The records are dated from 2002-10-14 to 2013-12-03.
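As a quick sanity check on the stated split, the proportions can be recomputed from the two row counts given above. This is only an arithmetic sketch; the variable names are my own and nothing is read from the actual files.

```python
# Row counts as stated in the text above (not read from the data files).
train_rows = 59_400
test_rows = 14_850

total_rows = train_rows + test_rows
train_share = train_rows / total_rows
test_share = test_rows / total_rows

print(f"{total_rows} rows: {train_share:.0%} training, {test_share:.0%} testing")
```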


The operating condition of a water point is a categorical response variable that takes one of three values: “functional”, “functional, needs repair” or “non-functional”.

Predictors included can be grouped as:

  1. Location: GPS coordinates and altitude as well as village, region and district
  2. Quantity: Amount of water available in the water point
  3. Quality: Quality of the water provided by the water point
  4. Administrative: Funder, installer, management and payment details
  5. Characteristics: Year of construction, extraction type, source and water point type


The Code Book

I now define only the predictors that I have chosen for subsequent data analysis; they are shown in Table 1.

Table 1. Explanatory (predictor) variables

| Variable name | Variable type | Levels | Description |
| --- | --- | --- | --- |
| Id | Discrete | 0 | Water point identification |
| Population | Discrete | 0 | Population around the well |
| Permit | Categorical | 2 (True, False) | If the water point is permitted |
| Construction year | Discrete | 0 | Year the water point was built |
| Extraction type class | Categorical | 7 (gravity, hand-pump, motor-pump, other, rope pump, submersible, wind-powered) | The kind of extraction tool the water point uses to draw water out |
| Payment type | Categorical | 7 (annually, monthly, never pay, on failure, other, per bucket, unknown) | What the water costs |
| Quality group | Categorical | 6 (colored, fluoride, good, milky, salty, unknown) | The quality of the water |
| Quantity group | Categorical | 5 (dry, enough, insufficient, seasonal, unknown) | The quantity of water left in the water point |
| Water point type group | Categorical | 5 (cattle trough, communal standpipe, dam, hand-pump, improved spring) | The type of water point |
| Water quality | Categorical | 8 (colored, fluoride, fluoride abandoned, milky, salty, salty abandoned, soft, unknown) | The quality of the water |
| Source | Categorical | 10 (dam, hand dtw, lake, machine dbh, other, rainwater harvesting, river, shallow well, spring, unknown) | The source of the water |
| Source type | Categorical | 7 (borehole, dam, other, rainwater harvesting, river/lake, shallow well, spring) | The source of the water |
| Source class | Categorical | 3 (groundwater, surface, unknown) | The source of the water |

I chose the status group as the response (dependent) variable. As Table 1 shows, the majority of the explanatory variables are categorical.


The distributions for the predictors and the operating condition of a water-point were evaluated by examining frequency tables for categorical variables and calculating the mean, standard deviation and minimum and maximum values for quantitative variables.
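The descriptive step above can be sketched in a few lines of Python using only the standard library. The records are invented toy values, and the field names `quantity_group` and `population` are assumptions modelled on Table 1, not the actual column names in the files; the helpers `frequency_table` and `describe` are hypothetical names of my own.

```python
from collections import Counter
from statistics import mean, stdev

# Invented toy records for illustration only; not taken from the real dataset.
records = [
    {"quantity_group": "enough", "population": 120},
    {"quantity_group": "dry", "population": 0},
    {"quantity_group": "enough", "population": 250},
    {"quantity_group": "insufficient", "population": 80},
    {"quantity_group": "seasonal", "population": 50},
]


def frequency_table(rows, column):
    """Return {level: (count, share)} for a categorical column."""
    counts = Counter(row[column] for row in rows)
    n = sum(counts.values())
    return {level: (count, count / n) for level, count in counts.items()}


def describe(rows, column):
    """Return mean, sample standard deviation, minimum and maximum of a numeric column."""
    values = [row[column] for row in rows]
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "min": min(values),
        "max": max(values),
    }


print(frequency_table(records, "quantity_group"))
print(describe(records, "population"))
```

On the real data the same two summaries would be run once per categorical and once per quantitative predictor.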

Scatter plots and box plots were also examined, and Pearson correlation and analysis of variance (ANOVA) were used to test bivariate associations between individual predictors and the operating condition of a water point (the response variable). During the initial data exploration phase, 46,094 missing values were found in the dataset.
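To make the ANOVA step concrete, here is a minimal sketch of the one-way ANOVA F statistic, which compares a quantitative predictor across the three status groups. The function name `one_way_anova_f` and the numbers are my own illustration, not results from the water point data.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA; `groups` is a list of lists of numbers."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Invented values of one numeric predictor, one list per status group.
toy_groups = [
    [1, 2, 3],  # functional
    [2, 3, 4],  # functional, needs repair
    [5, 6, 7],  # non-functional
]
f_stat = one_way_anova_f(toy_groups)
print(f"F = {f_stat:.2f}")
```

A large F value suggests that the predictor's mean differs between status groups; the corresponding p-value comes from the F distribution with k - 1 and n - k degrees of freedom, which this sketch does not compute.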

