In a previous article, I discussed the topic in detail, so I will only revisit it briefly here. I want to find out which predictors are highly correlated with each other with respect to the operating conditions of water points in Tanzania.

I begin the data exploration phase by loading the dataset into a dataframe. Next, I check the data types of the variables. A majority of the predictors that I have chosen for this study are categorical in nature.
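As a minimal sketch of this step, the snippet below loads a tiny stand-in table and checks its data types. The rows are invented for illustration, the `population` column is an assumed example of a numeric variable, and the name `sub1` simply mirrors the dataframe used later in this article.

```python
import io
import pandas as pd

# Toy CSV standing in for the real file; these rows are invented for illustration.
csv_text = """extraction_type_class,source_type,population
gravity,spring,120
handpump,shallow well,35
gravity,river/lake,250
submersible,borehole,80
"""

# In practice this would be pd.read_csv("<path to the dataset>")
sub1 = pd.read_csv(io.StringIO(csv_text))

# Text columns load with a non-numeric (object/string) dtype, numbers as int64/float64
print(sub1.dtypes)

# Split the column names by kind so each group gets the right summary later
numeric_cols = sub1.select_dtypes(include="number").columns.tolist()
categorical_cols = sub1.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols, numeric_cols)
```

Splitting the columns this way makes it straightforward to call `describe()` on the numeric ones and `value_counts()` on the categorical ones.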

Next, I explore the data by checking its distribution: summary statistics from describe() for continuous variables and frequency counts from value_counts() for categorical variables, as given below.

    sub1.describe()  # for continuous variables

For a categorical variable, use the *value_counts()* function:

    sub1['extraction_type_class'].value_counts()
    gravity         26780
    handpump        16456
    other            6430
    submersible      6179
    motorpump        2987
    rope pump         451
    wind-powered      117
    Name: extraction_type_class, dtype: int64

#### Data Preprocessing

Since most of my predictors are categorical, my first task is to convert them into numeric values. This means that if a variable like “Sex” has two values, “Male” and “Female”, then after encoding its values are coded as 1 and 0. I chose to use the sklearn library to encode the categorical variables, as given below.

    from sklearn.preprocessing import LabelEncoder

    var_mod = ['extraction_type_class', 'payment_type', 'quality_group', 'quantity_group',
               'waterpoint_type_group', 'water_quality', 'source_type']
    le = LabelEncoder()
    # fit_transform refits the encoder on each column and replaces its labels with integer codes
    for i in var_mod:
        sub1[i] = le.fit_transform(sub1[i])
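To see which code each level receives, here is a small sketch on a toy “Sex” column (the values are illustrative, not taken from the dataset). It relies on the `classes_` attribute and the `inverse_transform()` method of `LabelEncoder`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column; the values are illustrative, not taken from the dataset
df = pd.DataFrame({"Sex": ["Male", "Female", "Female", "Male"]})

le = LabelEncoder()
df["Sex_code"] = le.fit_transform(df["Sex"])

# classes_ holds the sorted original labels; the position of each label is its code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = [str(label) for label in le.inverse_transform(df["Sex_code"])]
print(restored)
```

Note that when a single encoder is reused in a loop over several columns, each `fit_transform()` call overwrites `classes_`, so only the mapping of the last column remains available. Keeping one encoder per column (for example in a dictionary) is one way to retain every mapping.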

#### Data Visualization

With the relevant predictors converted to numerical levels, I then use the seaborn Python library to plot them and look for any associations. I plot a histogram to inspect whether the water quality depends on the water source type. I show this in fig 1, and the code is given as

    t2 = pd.crosstab(sub2['source_type'], sub2['payment_type'])
    t2.plot(kind='hist', stacked=True, grid=False, legend=True,
            title="Water quality and types of payment")

Fig 1: Histogram of water quality and water payment type

I then show the distribution of observations within the categories using a boxplot. A boxplot is a kind of plot that shows the three quartile values of the distribution along with extreme values. The “whiskers” extend to points that lie within 1.5 IQRs of the lower and upper quartile, and observations that fall outside this range are displayed independently. Importantly, this means that each value in the boxplot corresponds to an actual observation in the data. I show this in fig 2, and the code is just one line:

    sns.boxplot(x="source_type", y="payment_type", data=sub1)

Fig 2: Boxplot showing the statistical distribution of the variable
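To make the whisker rule concrete, the following sketch computes the quartiles, the 1.5 IQR limits, and the outlying points by hand with NumPy. The numbers are invented for illustration, and the variable names are my own.

```python
import numpy as np

# Illustrative values only; 40 is placed far from the rest on purpose
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])

# The three quartile values drawn as the box and its middle line
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Limits that sit 1.5 IQRs beyond the lower and upper quartiles
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the limits
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the limits is drawn as an individual point
outliers = values[(values < lower_limit) | (values > upper_limit)]
print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Because the whiskers stop at actual observations rather than at the computed limits, every mark on the plot corresponds to a real data point.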

Often, rather than showing the distribution within each category, you might want to show the central tendency of the values. Bar plots do this: each bar shows an estimate of central tendency (the mean, by default in seaborn) for one category. As an example, I show a barplot of the variables ‘extraction type class’ and ‘source type’ in fig 3.

Fig 3: Barplot showing the statistical distribution of the value
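As a rough sketch of what the bars represent, the snippet below computes the per-category mean with pandas, which is the quantity seaborn's barplot() draws by default. The data is invented, and which variable went on which axis in fig 3 is an assumption on my part.

```python
import pandas as pd

# Toy data: the integers stand in for label-encoded 'extraction_type_class' values
toy = pd.DataFrame({
    "source_type": ["spring", "spring", "borehole", "borehole", "borehole"],
    "extraction_type_class": [0, 0, 1, 4, 4],
})

# Mean of the y variable within each x category; these are the bar heights that
# sns.barplot(x="source_type", y="extraction_type_class", data=toy) would show by default
bar_heights = toy.groupby("source_type")["extraction_type_class"].mean()
print(bar_heights)
```

Keep in mind that the mean of label-encoded codes is only meaningful if the codes carry a real ordering, which is generally not the case for nominal categories.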

A special case for the bar plot is when you want to show the number of observations in each category rather than computing a statistic for a second variable. This is similar to a histogram over a categorical, rather than quantitative, variable. In seaborn, it’s easy to do so with the countplot() function. I show this in fig 4, where I use the function to inspect the different types of water quality.

Fig 4: Type of water quality
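A call along the following lines produces such a plot. This is a minimal sketch on invented toy data rather than the exact code behind fig 4, and the category labels are illustrative.

```python
import pandas as pd
import seaborn as sns

# Toy data; the category labels are illustrative
toy = pd.DataFrame({"water_quality": ["soft", "soft", "salty", "soft", "unknown"]})

# One bar per category, with bar height equal to the number of rows in it
ax = sns.countplot(x="water_quality", data=toy)
ax.set_title("Type of water quality")

# The same numbers, without the plot
counts = toy["water_quality"].value_counts()
print(counts)
```

In a notebook the figure renders inline; in a script, a call to matplotlib's `plt.show()` would be needed to display it.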

The complete IPython notebook is available on my GitHub account here. Interested readers are encouraged to look there for a visual treat.

#### Learning points

The predictors that I have chosen for the study are all categorical in nature, so early in the data analysis stage I had to encode them as numeric values. To do this, I could either have written a custom function to encode the variables manually or used the built-in library function. I chose the latter because it was easier to implement.

#### Future To Do Work

One issue that has piqued my interest is the handling of levels within a category. I need to explore further the *LabelEncoder()* class from the scikit-learn library that I chose for the encoding; note that it performs label (integer) encoding rather than one-hot encoding. I leave this as future work.