# To penalise or not to penalise: The curious case of automatic feature selection

What is Lasso Regression? The LASSO (Least Absolute Shrinkage and Selection Operator) is a shrinkage and selection method for linear regression. It works by penalizing the absolute size of the regression coefficients. A good description for a layman's understanding is given in this SO post; to quote, “By penalizing (or equivalently constraining the sum of the absolute…
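As a quick illustration of that penalty in action, here is a minimal sketch using scikit-learn on synthetic data (the variables and settings are made up for demonstration, not taken from the post): only two of five features actually drive the response, and the L1 penalty shrinks the irrelevant coefficients to exactly zero.

```python
# Minimal lasso sketch: automatic feature selection via the L1 penalty.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # five candidate features
# Only the first two features matter; the rest are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)
# The penalty drives the three irrelevant coefficients to exactly zero,
# which is what makes lasso a feature-selection method.
print(model.coef_)
```

This is what distinguishes lasso from ridge regression: ridge only shrinks coefficients toward zero, while lasso can set them to exactly zero.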

# A random forest approach to predicting breast cancer in working class women

What is a Random Forest? A random forest is an ensemble (a group or combination) of trees that collectively vote for the most popular class (or feature) among them, cancelling out each other's noise. Ensemble learning in the context of machine learning refers to methods that generate many classifiers…
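The voting idea can be sketched in a few lines with scikit-learn (synthetic data, purely illustrative — not the breast cancer dataset from the post): each individual tree casts a vote, and the forest's prediction follows the majority.

```python
# Toy random forest: many trees vote, the majority class wins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # simple separable rule

forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Inspect the individual votes for one sample: each fitted tree predicts
# a class, and the ensemble aggregates them.
votes = np.array([tree.predict(X[:1])[0] for tree in forest.estimators_])
print(f"{int(votes.sum())} of {len(votes)} trees voted for class 1")
```

Because each tree is trained on a bootstrap sample with a random subset of features considered at each split, the trees' individual errors are largely uncorrelated, which is why averaging them reduces noise.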

# Supervised Machine Learning- Decision Tree Classification

In general, a decision tree is an inverted tree structure with a single root whose branches lead to various subtrees, which may themselves have sub-subtrees, until terminating in leaves. Unlike biological trees, a decision tree in computer science is upside down 🙂 Technically: a tree is a set of nodes and arcs, where each arc…
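You can see the root-to-leaf structure directly by printing a fitted tree. A tiny sketch with scikit-learn (a made-up four-point dataset, for illustration only): the class depends only on the second feature, so the tree learns a single split at the root with two leaves below it.

```python
# Minimal decision tree: one root split, two leaves.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [1, 0], [0, 1], [1, 1]]
y = [0, 0, 1, 1]   # class depends only on the second feature

tree = DecisionTreeClassifier().fit(X, y)
# export_text prints the inverted tree: root at the top, leaves below.
print(export_text(tree, feature_names=["f0", "f1"]))
```

The printed rules read top-down, matching the "upside down" description: the root node is the first condition tested, and each path of arcs ends at a leaf holding a class label.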

# Big or small-let’s save them all: Logistic Regression analysis

For this study, I am using the gapminder codebook and have chosen the following variables for data analysis. Explanatory (predictor/independent) variables: income per person and alcohol consumption, where income per person is the 2010 Gross Domestic Product per capita in constant 2000 US\$, and alcohol consumption is the 2008 alcohol consumption per…
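To show the general shape of such an analysis, here is a hedged sketch of a logistic regression with two quantitative predictors, using synthetic stand-ins for the income and alcohol variables (the names and numbers are invented; the post's actual analysis uses the gapminder data):

```python
# Sketch: logistic regression with two quantitative predictors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
income = rng.normal(size=500)    # synthetic stand-in for income per person
alcohol = rng.normal(size=500)   # synthetic stand-in for alcohol consumption

# Binary response whose log-odds depend on both predictors.
p = 1 / (1 + np.exp(-(1.5 * income + 0.8 * alcohol)))
y = rng.binomial(1, p)

X = np.column_stack([income, alcohol])
logit = LogisticRegression().fit(X, y)
odds_ratios = np.exp(logit.coef_[0])   # change in odds per unit change in X
print(odds_ratios)
```

Exponentiating the fitted coefficients gives odds ratios, the usual way logistic regression results are interpreted: a value above 1 means the predictor increases the odds of the outcome.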

# Big or small-let’s save them all: Uncovering the factors responsible- Multiple Regression Analysis

Multiple regression analysis is a tool that allows you to expand on your research question and conduct a more rigorous test of the association between your explanatory and response variables by adding additional quantitative and/or categorical explanatory variables to your linear regression model. I discuss this in detail in this post.

# Big or small-let’s save them all: How the data was collected

For this research study, I have chosen the “Gapminder Codebook”. It combines longitudinal survey data on respondents’ social, economic, psychological and physical well-being with contextual data on the family, neighbourhood, community, school, friendships, peer groups, and romantic relationships, providing unique opportunities to study how social environments and behaviours in adolescence are linked to health and achievement…

# Big or small: let’s save them all: Exploring Statistical Interactions

Statistical interaction describes a relationship between two variables that depends on, or is moderated by, a third variable. The effect of a moderating variable is often characterized statistically as an interaction: a third variable that affects the direction and/or strength of the relation between your explanatory, or x variable, and your…

# Big or small: let’s save them all: Pearson’s Correlation Coefficient

Pearson’s correlation (denoted by r) is the inferential test that will be used to examine the association between two quantitative variables. I previously discussed that a scatter plot is an appropriate way to visualize two quantitative variables when you want to examine the relationship between them. Now, let me first briefly review the scatter plot and…
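Computing r in Python is a one-liner with scipy; a quick sketch on synthetic data (illustrative, not the post's variables):

```python
# Sketch: Pearson's r for two quantitative variables.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)   # strong linear association

# r lies in [-1, 1]; the p-value tests the null hypothesis r == 0.
r, p_value = stats.pearsonr(x, y)
print(r, p_value)
```

Squaring r gives the coefficient of determination: the fraction of variability in one variable that can be accounted for by the other.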

# Big or small-let’s save them all: Chi-Square Test of Independence

Continuing the series, I will briefly state the variables: the dependent/response/outcome variable (Y) is breast cancer per 100th women, and the independent/explanatory variables (X) are alcohol consumption and female employment rate. I begin the analysis by making a copy of the original dataset as sub5. I then impute the missing values with the mean. Both of my…
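The chi-square test of independence itself runs on a contingency table of counts. A minimal sketch with scipy on a made-up 2x2 table (the counts are invented, not derived from the gapminder data):

```python
# Sketch: chi-square test of independence on a 2x2 contingency table.
from scipy import stats

# Rows: categories of one variable; columns: categories of the other.
observed = [[30, 10],
            [20, 40]]

# Returns the chi-square statistic, p-value, degrees of freedom, and the
# counts expected under the null hypothesis of independence.
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(chi2, p, dof)
```

Since both of the post's variables are quantitative, they must first be binned into categories before a contingency table like this can be built, which is why the excerpt talks about recoding the variables.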

# Big or small-let’s save them all: ANOVA & Tukey’s HSD Post Hoc Comparison Test

The objective of this post is to examine the differences in the mean response variable for each category of our explanatory variable. I have provided a detailed ANOVA and post hoc comparison test on the gapminder dataset using Tukey’s HSD in Python.
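The two-step procedure can be sketched as follows with scipy and statsmodels, on synthetic groups standing in for the categories of the explanatory variable (the group means are invented for illustration): ANOVA asks whether *any* group mean differs, and Tukey's HSD then identifies *which* pairs differ while controlling the family-wise error rate.

```python
# Sketch: one-way ANOVA followed by Tukey's HSD post hoc comparisons.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(6)
a = rng.normal(0.0, 1.0, 50)   # three synthetic groups with
b = rng.normal(0.5, 1.0, 50)   # different population means
c = rng.normal(2.0, 1.0, 50)

# Step 1: ANOVA — is at least one group mean different?
f_stat, p = stats.f_oneway(a, b, c)

# Step 2: Tukey's HSD — which specific pairs of groups differ?
values = np.concatenate([a, b, c])
groups = ["a"] * 50 + ["b"] * 50 + ["c"] * 50
print(pairwise_tukeyhsd(values, groups))
```

Running the pairwise comparisons only after a significant ANOVA, and using Tukey's adjustment, avoids the inflated Type I error that repeated unadjusted t-tests would produce.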