
Learning from data science competitions – baby steps

Of late, a considerable number of winning machine learning enthusiasts have used XGBoost as their predictive analytics solution. This algorithm has taken precedence over traditional algorithms such as Random Forests and Neural Networks. The acronym XGBoost stands for the eXtreme Gradient Boosting package. The creators of this algorithm showcased its implementation by winning the Kaggle Otto…
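The post goes on to discuss the package itself; purely as a rough Python sketch (not the post’s own code), assuming the xgboost and scikit-learn packages are available and using synthetic data in place of the competition data, a boosted-tree classifier might be fit like this:

# Minimal XGBoost classification sketch -- illustrative only.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the actual competition data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Gradient-boosted trees with a few commonly tuned hyper-parameters.
model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))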


Data Transformations

A number of reasons can be attributed to a predictive model failing, such as: inadequate data pre-processing, inadequate model validation, unjustified extrapolation, and over-fitting (Kuhn, 2013). Before we dive into data pre-processing, let me quickly define a few terms that I will be using frequently. Predictor/Independent/Attributes/Descriptors – these are the different terms that are used as…
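As a rough illustration of the kind of pre-processing referred to above, here is a small Python sketch, assuming scikit-learn is available; centring and scaling the predictors is just one of many possible transformations and not necessarily the one the post describes:

# Centre and scale the predictor (independent) variables -- illustrative only.
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy predictor matrix: two predictors on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column now has mean 0 and unit variance
print(X_scaled)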

To propose, go down on one knee. Doing the splits is considered very poor form.

Data Splitting

A few common steps in data model building are: pre-processing the predictor data (predictor – independent variables), estimating the model parameters, selecting the predictors for the model, evaluating the model performance, and fine-tuning the class prediction rules. “One of the first decisions to make when modeling is to decide which samples will be used to…
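To make the splitting step concrete, here is a minimal Python sketch, assuming scikit-learn; the iris data and the 80/20 ratio are placeholder choices, not the post’s example:

# Hold out a test set before any model fitting or tuning -- illustrative only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)   # stratify keeps class proportions
print(len(X_train), "training samples,", len(X_test), "test samples")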


Save water! Save life!

Proposed methods

At this point in the project, I am still working on data preparation for exploration and subsequent analysis. Therefore, I provide below brief guidelines on how I will be performing the data analysis, especially the sub-chapter on Analysis, which is not final but only outlines the approach that I should consider. I now…


“Give me data and I promise you clusters”: The case of the k-means algorithm

Introduction The title of this week’s essay is derived from the famous speech (“Give me blood and I promise you freedom!”) by the Indian nationalist Subhash Chandra Bose, delivered in Burma on July 4th, 1944. An essay makes more sense if its title relates to its contents. Thus, after considerable debate on how to aptly title it,…
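One way to read the title’s promise is that k-means will report clusters whether or not any exist. A small Python sketch of that behaviour, assuming scikit-learn; the uniform random data is a deliberately structureless example, not one taken from the essay:

# k-means returns the requested number of clusters even on structureless data -- illustrative only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))            # uniform noise: no real cluster structure

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(np.bincount(km.labels_))            # three "clusters" are reported anyway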


To penalise or not to penalise: The curious case of automatic feature selection

What is Lasso Regression? The LASSO (Least Absolute Shrinkage and Selection Operator) is a shrinkage and selection method for linear regression. This method involves penalizing the absolute size of the regression coefficients. A good description for a layman’s understanding is given in this SO post; to quote, “By penalizing (or equivalently constraining the sum of the absolute…
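A minimal Python sketch of that penalized fit, assuming scikit-learn; the synthetic data and the alpha value are placeholders rather than the post’s example:

# Lasso penalizes the absolute size of the coefficients, shrinking many of them exactly to zero.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: only 3 of the 10 predictors actually carry signal.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # larger alpha means a stronger penalty
print(lasso.coef_)                   # uninformative predictors are shrunk toward (often exactly) zero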