In a bid to explore and detect variables of interest that were most strongly correlated with each other, I proceeded with the regression analysis (see here for the previous article). But there were challenges; One of the fundamental problem was that the variables were categorical and for regression analysis it is important that the variables should be continuous in nature. What I wanted to achieve was “dummy coding” for example, attribute Sex value is either male or female, so assign the attribute dummy code of ‘1’ or ‘2’ and then perform operations on these dummy codes such that they are not treated as continuous. This challenge baffled me for quite some time, until I stumbled upon this Kaggle post by Mark . See fig 1, showing how the data looked like in its raw form Fig 1: Raw data

## Data preprocessing

I used “one-hot encoding“, the label encoder from the python’s sklearn library and recoded the explanatory variables of interest to dummy codes as given

from sklearn import preprocessing le_enc = preprocessing.LabelEncoder() #to convert into numbers sub1.permit = le_enc.fit_transform(sub1.permit) sub1.extraction_type_class=le_enc.fit_transform(sub1.extraction_type_class) sub1.payment_type=le_enc.fit_transform(sub1.payment_type) sub1.quality_group=le_enc.fit_transform(sub1.quality_group) sub1.quantity_group=le_enc.fit_transform(sub1.quantity_group) sub1.waterpoint_type_group=le_enc.fit_transform(sub1.waterpoint_type_group) sub1.water_quality=le_enc.fit_transform(sub1.water_quality) sub1.source_class=le_enc.fit_transform(sub1.source_class) sub1.status_group=le_enc.fit_transform(sub1.status_group)

See fig 2 which shows the transformed data.Fig 2: Transformed data

## Multivariate data analysis

Next, I check for the correlation between water quality and water quantity and apply the stats model formula app library’s ols regression model to the transformed data as given

print ("OLS regresssion model for the association between water pump condition status and quality of water in it") reg1=smf.ols('status_group~permit',data=sub1).fit() print (reg1.summary())

The result is shown in fig 3;Fig 3: OLS Regression analysis between water pump status and water permit

Status group is the response or dependent variable and permit is the independent variable.

The number of observations show the no. of observations that had valid data and thus were included in the analysis. The F-statistic is 66.57 and the p value is very small (Prob (F-statistic))= 3.44e-16 considerably less than our alpha level of 0.05 which tell us that we can reject the null hypothesis and conclude that permit is significantly associated with water pump status group.

The linear regression equation Y = b0 + b1X where X is the explanatory variable or the independent variable and Y is the response or the dependent variable.–(EQN 1)

Note: EQN 1 is significant because it can also help us in prediction of Y. Next, we look at the parameter estimates or the coeffecients or beta weights . Thus the coeffecient for permit is -0.0697 and the intercept is 0.8903.

Than the best fit line for permit is; status_group=0.89+0.06*permit — (EQN 2)

In the above example, lets say we are told that a country has 80% people with valid water permits than can we predict the status of the water pump device?

Yes, we plug the value of 80 in EQN 2 as given b0 = 0.89, b1 = 0.06 permit= 80

Than, y(hat) = 0.89+0.06*80 y(hat)= 5.69 or we can say that for 80% people with valid permits there will be approximately 6% water pumps that are functional

Also note the P>|t| value is very small for permit. It is 0.0 and that the R-squared value is 0.001.

We now know that this model accounts for 0.001% variability that we see in our response variable permit.

Next, I continue to add other variables of interest to this model and check to see if the model performance increases.

# Now, I continue to add the variables to this model to check for any loss of significance print ("OLS regresssion model for the association between status_group and other variables of interest") reg1=smf.ols('status_group~quantity_group+extraction_type_class+waterpoint_type_group',data=sub1).fit() print (reg1.summary())

and I notice that the model performance increases to 7.8% (see R-squared statistic in fig 4 below) after adding variables like water quantity_group, extraction_type_class and waterpoint_type_group.Fig 4: OLS Regression analysis between water pump status and some more variables

Finally, I add all the variables of interest to the model and check for the variability in the response variable;

print ("OLS regresssion model for the association between water pump status group and all variables of interest") reg1=smf.ols('status_group~extraction_type_class+payment_type+quality_group+quantity_group+waterpoint_type_group+source_class+permit+water_quality',data=sub1).fit() print (reg1.summary())

The model is shown in fig 5;Fig 5: OLS Regression analysis between water pump status and all variables of interest

.As we can see from the R-squared statistic (yellow highlight) the model performance increases from 7.8% to 9.4% when plugged in with all the variables of interest.

## Discussion

To summarise this analysis, the variables of interest significantly affected the data model. The model performance rose from the initial 0.001% to 7.8% to 9.4% Although not very efficient as such but it does provide credibility to the variables of interest. As a future work, I will work towards creating a prediction model that would be able to predict the status of the water pump. The complete IPython notebook is present on my github account here.

## Learning from this exercise

The one hot encoder scheme for coding categorical variables was a cool thing to learn. As a future work, I will explore how the same applies to R programming language.