# Save water! Save life! Multivariate data analysis report

In a bid to explore and detect the variables of interest that were most strongly correlated with each other, I proceeded with the regression analysis (see here for the previous article). But there were challenges. One fundamental problem was that the variables were categorical, and for regression analysis it is important that the variables…

# Save water! Save life! Exploratory data analysis & Visualization

In a previous article, I discussed the topic in detail, so I will revisit it only briefly here. I want to find the predictors that are highly correlated with each other with reference to the operating conditions of water points in Tanzania. Continuing further, I begin the data exploration phase by…

# Save water! Save life! Proposed methods

At this point in the project, I am still working on data preparation for exploration and subsequent analysis. Therefore, I provide below brief guidelines on how I will perform the data analysis, especially in the subchapter on Analysis, which is not final and only outlines the approach that I should consider. I now…

# Save water! Save life! The association between water point operational factors and water consumption

Draft title: Save water! Save life! The association between water point maintenance factors and water consumption. Introduction to the research question: In this research study, I would like to explore the factors that contribute to the depletion of a water point. A water point is an installation such as a well or a tap from…

# “Give me data and I promise you clusters”: The case of the k-means algorithm

Introduction: The title of this week’s essay is derived from the famous speech (“Give me blood and I promise you freedom!”) delivered by the Indian nationalist Subhash Chandra Bose in Burma on July 4th, 1944. An essay makes more sense if its title relates to its contents. Thus, after considerable debate on how to aptly title it,…

# To penalise or not to penalise: The curious case of automatic feature selection

What is Lasso Regression? The LASSO (Least Absolute Shrinkage and Selection Operator) is a shrinkage and selection method for linear regression. This method involves penalizing the absolute size of the regression coefficients. A good description for the layman is given in this SO post; to quote, “By penalizing (or equivalently constraining the sum of the absolute…
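The idea of penalizing the absolute size of the coefficients can be sketched in a few lines of plain Python. This is only an illustration, not the code used in the post: the function names, the toy data, and the choice of coordinate descent as the solver are all assumptions made for the example, and the data are assumed to be centred so that no intercept is needed.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1 / (2n)) * sum((y - Xb)^2) + alpha * sum(|b|).

    Hypothetical helper for illustration only. Assumes centred data
    (no intercept) and that no column of X is all zeros.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * coef[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            coef[j] = soft_threshold(rho / n, alpha) / (z / n)
    return coef


# Toy data (made up): y depends only on the first column.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.0, 2.0, -2.0, -2.0]

unpenalised = lasso_coordinate_descent(X, y, alpha=0.0)
penalised = lasso_coordinate_descent(X, y, alpha=0.5)
print(unpenalised, penalised)
```

With the penalty switched on, the informative coefficient is shrunk towards zero and the irrelevant one stays at exactly zero, which is the feature selection behaviour the title refers to.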

# A random forest approach to predicting breast cancer in working class women

What is a Random Forest? A random forest is an ensemble (group or combination) of trees that collectively vote for the most popular class (or feature) amongst them, cancelling out the noise. Ensemble learning, in the context of machine learning, refers to methods that generate many classifiers…
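The voting step described above can be illustrated with a small sketch. The three "trees" here are hand-written rules with made-up feature names (`x1`, `x2`) and class labels, purely as an assumption for the example; a real random forest would learn each tree from a bootstrap sample of the data with a random subset of features.

```python
from collections import Counter


# Hypothetical stand-ins for trained trees: each maps a sample to a class.
def tree_one(sample):
    return "yes" if sample["x1"] > 5 else "no"


def tree_two(sample):
    return "yes" if sample["x2"] > 2 else "no"


def tree_three(sample):
    return "yes" if sample["x1"] + sample["x2"] > 8 else "no"


def forest_predict(trees, sample):
    """Return the class that receives the most votes, plus the vote counts."""
    votes = Counter(tree(sample) for tree in trees)
    winner = votes.most_common(1)[0][0]
    return winner, dict(votes)


trees = [tree_one, tree_two, tree_three]
print(forest_predict(trees, {"x1": 4, "x2": 5}))
print(forest_predict(trees, {"x1": 7, "x2": 1}))
```

In the first call one tree disagrees with the other two and is outvoted, which is the sense in which the ensemble cancels out the noise of individual trees.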

# Supervised Machine Learning- Decision Tree Classification

In general, a decision tree is an inverted tree structure with a single root whose branches lead to various subtrees, which themselves may have sub-subtrees, until terminating in leaves. Unlike biological trees, a decision tree in computer science is upside down 🙂 Technically: a tree is a set of nodes and arcs, where each arc…
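The root, subtree, and leaf structure described above might be represented as follows. This is a minimal sketch under assumed names: the `Node` class, the features `x1` and `x2`, the thresholds, and the labels are all hypothetical, and the tree is written by hand rather than learned from data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Internal nodes carry a test (feature <= threshold); leaves carry a label."""
    feature: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None    # branch taken when the test is true
    right: Optional["Node"] = None   # branch taken when the test is false
    label: Optional[str] = None      # set only on leaves


def predict(node, sample):
    """Walk from the root down to a leaf and return its label."""
    while node.label is None:
        if sample[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label


def depth(node):
    """Number of arcs on the longest path from this node to a leaf."""
    if node.label is not None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))


# A hand-built tree: the root tests x1, and its right subtree tests x2.
root = Node(
    feature="x1",
    threshold=5.0,
    left=Node(label="no"),
    right=Node(
        feature="x2",
        threshold=2.0,
        left=Node(label="no"),
        right=Node(label="yes"),
    ),
)

print(predict(root, {"x1": 7.0, "x2": 3.0}), depth(root))
```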

# Big or small-let’s save them all: Logistic Regression analysis

For this study, I am using the gapminder code book and have chosen the following variables for data analysis. Explanatory (predictor or independent) variables: income per person and alcohol consumption, where income per person is the 2010 Gross Domestic Product per capita in constant 2000 US\$, and alcohol consumption is the 2008 alcohol consumption per…

# Big or small-let’s save them all: Uncovering the factors responsible- Multiple Regression Analysis

Multiple regression analysis is a tool that allows you to expand on your research question and conduct a more rigorous test of the association between your explanatory and response variables by adding quantitative and/or categorical explanatory variables to your linear regression model. I discuss this in detail in this post.
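As a rough illustration of what adding a second explanatory variable means, the sketch below fits a linear model with two predictors by solving the normal equations in plain Python. The function name `fit_ols` and the toy numbers are assumptions made for the example and are not taken from the post or the gapminder data; in practice a statistics library would also report standard errors and p-values.

```python
def fit_ols(rows, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bp*xp.

    Hypothetical helper for illustration: builds the normal equations
    (X'X) b = X'y and solves them by Gauss-Jordan elimination.
    Assumes the explanatory variables are not perfectly collinear.
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]  # prepend intercept
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(p)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting: move the largest remaining entry onto the diagonal
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Toy data (made up) generated from y = 1 + 2*x1 + 3*x2 with no noise.
rows = [(0, 1), (1, 0), (2, 1), (3, 0), (4, 2)]
y = [4.0, 3.0, 8.0, 7.0, 15.0]

intercept, b1, b2 = fit_ols(rows, y)
print(intercept, b1, b2)
```

Each slope is the estimated change in the response for a one-unit change in that explanatory variable while the other is held fixed, which is what makes the test of association more rigorous than a single-predictor model.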