# Predict Blood Donation -warmup

Continuing from my previous post, here I discuss the inferential and predictive analysis. A brief note on the dataset and the problem to solve: the dataset comes from the UCI Machine Learning Repository, and the task is to predict whether a donor donated blood in March 2007 (1 stands for donating blood; 0…

# Learning from data science competitions- baby steps

Of late, a considerable number of competition-winning machine learning enthusiasts have used XGBoost as their predictive analytics solution. It has taken precedence over established alternatives such as Random Forests (tree-based) and Neural Networks. The name XGBoost stands for eXtreme Gradient Boosting. The creators of this algorithm presented its implementation by winning the Kaggle Otto…

# Basic assumptions to be taken care of when building a predictive model

Before starting to build a predictive model in R, the following assumptions should be taken care of. Assumption 1: The linear regression model must be linear in its parameters, and its predictors must be numeric. If a predictor is non-numeric, such as a categorical variable, use one-hot encoding (Python) or dummy encoding (R) to convert it to numeric form. Assumption…
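To make the encoding step in Assumption 1 concrete, here is a minimal sketch in plain Python. The function name `one_hot_encode`, the `drop_first` flag, and the colour data are all hypothetical, chosen only for illustration; in practice a library routine would normally do this. With `drop_first=True` the first level becomes the reference category, which is the usual convention for dummy encoding in a regression with an intercept.

```python
def one_hot_encode(values, drop_first=False):
    """Convert a list of categorical values into 0/1 indicator rows.

    Returns (columns, rows): `columns` lists the levels that received an
    indicator column, and `rows` holds one list of 0/1 flags per input value.
    With drop_first=True the first level (in sorted order) is dropped and
    acts as the reference category, encoded as all zeros.
    """
    levels = sorted(set(values))
    columns = levels[1:] if drop_first else levels
    rows = [[1 if value == column else 0 for column in columns]
            for value in values]
    return columns, rows


# Hypothetical categorical predictor, for illustration only.
colours = ["red", "green", "blue", "green"]

full_columns, full_rows = one_hot_encode(colours)
dummy_columns, dummy_rows = one_hot_encode(colours, drop_first=True)
```

The full encoding keeps one column per level, while the dummy version keeps one fewer, so the indicator columns are not perfectly collinear with the intercept.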

# Data Transformations

A predictive model can fail for a number of reasons, such as inadequate data pre-processing, inadequate model validation, unjustified extrapolation, and over-fitting (Kuhn, 2013). Before we dive into data preprocessing, let me quickly define a few terms that I will use often. Predictors, independent variables, attributes, and descriptors are the different terms that are used as…

# Data Splitting

A few common steps in model building are: pre-processing the predictor data (the predictors being the independent variables), estimating the model parameters, selecting the predictors for the model, evaluating the model performance, and fine-tuning the class prediction rules. “One of the first decisions to make when modeling is to decide which samples will be used to…
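As an illustration of deciding which samples go where, the following is a minimal sketch of a simple random train/test split in plain Python. The function name `train_test_split`, the 25% test fraction, and the seed are assumptions made for this example and are not taken from the post; other splitting strategies exist and may suit a given dataset better.

```python
import random


def train_test_split(samples, test_fraction=0.25, seed=42):
    """Randomly split `samples` into (train, test) lists.

    A fixed seed makes the shuffle reproducible. At least one sample
    is always placed in the test set.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(samples)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # local RNG, no global state touched
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical sample identifiers, for illustration only.
rows = list(range(20))
train, test = train_test_split(rows, test_fraction=0.25, seed=7)
```

Every sample lands in exactly one of the two sets, so the test set can be held back for evaluating model performance.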

# Save water! Save life! Multivariate data analysis report

In a bid to explore and detect the variables of interest that were most strongly correlated with each other, I proceeded with the regression analysis (see here for the previous article). But there were challenges. One fundamental problem was that the variables were categorical, and for regression analysis it is important that the variables…

# Save water! Save life! Exploratory data analysis & Visualization

In a previous article, I discussed the topic in detail, so I will revisit it only briefly here. I want to find the predictors that are highly correlated with each other with respect to the operating conditions of water points in Tanzania. Continuing further, I begin the data exploration phase by…

# Save water! Save life! Proposed methods

At this point in the project, I am still working on data preparation for exploration and subsequent analysis. Therefore, I provide below brief guidelines on how I will perform the data analysis. The subchapter on Analysis in particular is not final; it only outlines the approach that I should consider. I now…

# Save water! Save life! The association between water point operational factors and water consumption

Draft title: Save water! Save life! The association between water point maintenance factors and water consumption. Introduction to the research question: In this research study, I would like to explore the factors that contribute to the depletion of a water point. A water point is an installation such as a well or a tap from…

# “Give me data and I promise you cluster’s”: The case of k-means algorithm

Introduction: The title of this week’s essay is derived from the famous speech (“Give me blood and I promise you freedom!”) delivered by the Indian nationalist Subhash Chandra Bose in Burma on July 4th, 1944. An essay makes more sense if its title relates to its contents. Thus, after considerable debate on how to aptly title it,…