# To penalise or not to penalise: The curious case of automatic feature selection

### What is Lasso Regression?

The LASSO (Least Absolute Shrinkage and Selection Operator) is a shrinkage and selection method for linear regression that penalizes the absolute size of the regression coefficients. A good plain-language description is given in this SO post; to quote: "By penalizing (or equivalently constraining the sum of the absolute values of the estimates) you end up in a situation where some of the parameter estimates may be exactly zero. The larger the penalty applied, the further estimates are shrunk towards zero. This is convenient when we want some automatic feature/variable selection, or when dealing with highly correlated predictors, where standard regression will usually have regression coefficients that are 'too large'."
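To make the "exactly zero" behaviour concrete, here is a minimal sketch of the soft-thresholding rule. It assumes the special case of an orthonormal design, where the lasso estimate of each coefficient is simply the ordinary least squares estimate shrunk towards zero by the penalty. The function name `soft_threshold` and the example estimates are made up for illustration and are not taken from the analysis in this post.

```python
def soft_threshold(beta_ols: float, penalty: float) -> float:
    """Lasso estimate of one coefficient, assuming an orthonormal design."""
    # Shrink the magnitude by the penalty, but never below zero.
    magnitude = max(abs(beta_ols) - penalty, 0.0)
    if magnitude == 0.0:
        return 0.0  # the coefficient is dropped from the model
    return magnitude if beta_ols > 0 else -magnitude


# Hypothetical least squares estimates, for illustration only.
ols_estimates = [2.5, -0.8, 0.3]

for penalty in (0.0, 0.5, 1.0):
    shrunk = [soft_threshold(b, penalty) for b in ols_estimates]
    print(f"penalty={penalty}: {shrunk}")
```

As the penalty grows, the smaller estimates hit zero first, which is the automatic variable selection described in the quote.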

The main benefit of using LASSO is that it can extract the most relevant features or variables from a given dataset. This method works only for numerical data. There is a detailed discussion of the penalisation adopted by LASSO here, and this discussion here covers when to use various regularisation methods, including LASSO.

### Introduction

A lasso regression analysis was conducted to identify a subset of variables from a pool of 9 quantitative predictor variables that best predicted a binary response variable measuring the presence of high breast cancer cases in women. The data for the analysis is extracted from the GapMinder project. The GapMinder project collects country-level time series data on health, wealth and development. The data set for this analysis only has one year of data for 213 countries.

Response or Dependent variable

Breast cancer rate: new breast cancer cases per 100,000 females in 2002.

Explanatory variables

All explanatory variables were standardized to have a mean of zero and a standard deviation of one. The following explanatory variables were included in the data set:

• Alcohol Consumption – 2008 recorded and estimated average alcohol consumption, adult (15+) per capita as collected by the World Health Organization
• CO2 Emissions – Total amount of CO2 emission in metric tons from 1751 to 2006 as collected by CDIAC
• Female Employment Rate – Percentage of female population, age above 15, that has been employed during 2007 as collected by the International Labour Organization
• Internet Use Rate – 2010 Internet users per 100 people as collected by the World Bank
• Life Expectancy – 2011 life expectancy at birth (in years) as collected by various sources
• Polity Score – 2009 Democracy score as collected by the Polity IV Project
• Employment Rate – Percentage of total population, age above 15, that has been employed during 2009 as collected by the International Labour Organization
• Urbanization Rate – 2008 Urban population (% total population) as collected by the World Bank
• Income per person – 2010 Gross Domestic Product per capita in constant 2000 US\$
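The standardization mentioned above can be sketched as follows. This is only an illustration: the values in `urban_rate` are invented, and the use of the population standard deviation is an assumption, since the post does not say which convention was applied.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to have mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    return [(v - mu) / sigma for v in values]


# Invented urbanization rates, for illustration only.
urban_rate = [30.0, 55.0, 62.0, 81.0, 47.0]
z = standardize(urban_rate)
print(z)
```

Putting every predictor on the same scale matters for LASSO because the penalty is applied to the size of the coefficients, and that size depends on the units of each variable.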

### Data Analysis

Data were randomly split into a training set that included 70% of the observations (N=148) and a test set that included 30% of the observations (N=45). The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
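A minimal sketch of this workflow is shown below. It assumes scikit-learn, whose `LassoLarsCV` estimator fits a lasso model with the least angle regression algorithm and cross validation; the post does not state which library was used. The data are synthetic stand-ins for the 9 standardized predictors, not the GapMinder data, and the random seeds are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 9 predictors, only the first two matter.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 9))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.5 * rng.standard_normal(200)

# 70% training set, 30% test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

# Lasso fitted by least angle regression with 10-fold cross validation.
model = LassoLarsCV(cv=10).fit(X_train, y_train)

# mse_path_ holds one row per candidate penalty and one column per fold;
# averaging over folds gives the cross validation mean squared error.
cv_mse = model.mse_path_.mean(axis=1)

print("chosen penalty:", model.alpha_)
print("coefficients:", model.coef_)
print("test R^2:", model.score(X_test, y_test))
```

Plotting each column of `model.mse_path_` against `model.cv_alphas_` gives a per-fold error curve of the kind shown in Figure 1.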

Figure 1. Mean squared error on each fold

### Findings

Of the 9 predictor variables, 5 were retained in the selected model. Income per person was the variable most strongly associated with breast cancer in women. Urbanization rate, female employment rate, CO2 emissions and polity score were also positively associated with high breast cancer cases. These 5 variables accounted for 41 percent of the variance in the breast cancer response variable in the test data set.
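The "percent of variance accounted for" figure is usually the R-squared statistic, and the sketch below assumes that is what was reported here. The function name `r_squared` and the numbers passed to it are hypothetical and only show how the statistic is computed on held-out observations.

```python
def r_squared(observed, predicted):
    """Share of the variance in `observed` explained by `predicted`."""
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: what the model fails to explain.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: variation around the mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Invented test-set values, for illustration only.
observed = [20.0, 35.0, 50.0, 80.0]
predicted = [25.0, 30.0, 55.0, 70.0]
print(r_squared(observed, predicted))
```

A value of 0.41 on the test set would mean the model's predictions remove 41 percent of the squared error left by simply predicting the mean.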

Table 1. Regression Coefficients

| Variable | Regression Coefficient |
| --- | --- |
| incomeperperson | 0.110918 |
| urbanrate | 0.048246 |
| femaleemployrate | 0.39503 |
| co2emissions | 0.035295 |
| polityscore | 0.32616 |
| alcoconsumption | 0 |
| internetuserate | 0 |
| lifeexpectancy | 0 |
| employrate | 0 |

### Discussion

The previous week I examined the variables using a Random Forest approach. The explanatory variables with the highest relative importance scores were life expectancy, internet use rate and urbanization rate. Of these three, the LASSO regression retained only the urbanization rate (see Table 1); the coefficients for life expectancy and internet use rate were shrunk to zero. As noted in last week's post, these findings suggest that my previous work on the relationship between alcohol consumption in working-class women and high breast cancer cases may have been confounded by other variables.

The source code for this analysis, in IPython notebook format, is available in my GitHub repository here.