A random forest approach to predicting breast cancer in working class women

What is a Random Forest?

A random forest is an ensemble (group or combination) of trees that collectively vote for the most popular class (or feature) amongst them, cancelling out the noise.

Ensemble means group or combination. In the context of machine learning, ensemble learning refers to methods that generate many classifiers and aggregate their results. There are two well-known methods, namely ‘boosting’ (Schapire et al., 1998) and ‘bagging’ (Breiman, 1996) of classification trees. “In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors. In the end, a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees — each is independently constructed using a bootstrap sample of the data set. In the end, a simple majority vote is taken for prediction.” (Liaw & Wiener, 2002).

“Okay, okay, I understand (a bit complex at first glance as I read through), but you still have not answered why, in God’s name, Random Forest! (Emphasized angry tonality)”

What is random in the forest?

Leo Breiman (2001) proposed adding an extra layer of randomness to the bagging method: in a random forest, each node is split using the best among a subset of predictors randomly chosen at that node, whereas in standard decision trees, each node is split using the best split among all variables. This can noticeably improve classification or prediction accuracy. There is a very good explanation by Edwin Chen on Quora, see here
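As a rough illustration of that extra layer of randomness, here is a minimal sketch with scikit-learn's RandomForestClassifier. The synthetic data and the parameter values are assumptions made for the example, not taken from Breiman's paper. With max_features=None every split searches all predictors, which amounts to plain bagging of trees; with max_features="sqrt" every split searches only a random subset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 200 samples, 9 predictors (assumed values)
X, y = make_classification(n_samples=200, n_features=9, n_informative=4,
                           random_state=0)

# Bagging-style forest: every split searches all 9 predictors
bagged = RandomForestClassifier(n_estimators=25, max_features=None,
                                random_state=0).fit(X, y)

# Random forest: every split searches a random subset of sqrt(9) = 3 predictors
forest = RandomForestClassifier(n_estimators=25, max_features="sqrt",
                                random_state=0).fit(X, y)

# Number of predictors each individual tree tries per split
print(bagged.estimators_[0].max_features_)
print(forest.estimators_[0].max_features_)
```

Both forests still draw a bootstrap sample for every tree; only the number of candidate predictors per split differs.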

“Hmm, puts on his thinking cap…and says ‘go on…I’m listening’ (skeptical tonality)”

Breiman developed this ensemble classification and regression approach to minimize the prediction error generated by standard trees in classification. It’s worth noting that his paper has been cited more than 21,016 times as per Google Scholar. Now, things become interesting. Why? Because Breiman conducted his experiments on 13 small datasets derived from the UCI Machine Learning Repository, and his claim that random forests do not overfit the data is contentious, as discussed by Segal (2004), who showed that Breiman’s random forest method overfits both real-world and simulated datasets.

“Nice…controversies & scandals…smacks his lips! So are you saying this ensemble learning is no good? Are you nuts? (uncontrollable laughter) – Look at the 21,016 citations…Are you out of your freaking mind?”

What kind of machine learning problem is random forest best at solving?

Random Forest works well for classification problems where the dataset is categorical in nature, with the response variable typically being binary. Its performance takes a dive in regression problems where the dataset is continuous. Also, if the dataset is imbalanced (which real-world datasets often are), its performance is poor, as shown by Dudoit & Fridlyand (2003).

“Alright, I see the point now. Although, I still have lurking doubts but first show me what you got with your experiments. I have invested a fortune in you!”

Experiment setup 

In an earlier article, I have detailed the experimental setup at length. Readers are advised to see this post. Also, in this experiment, I will be converting the continuous data to categorical so as to test and validate the random forest classifier. I will present random forest regression as future work.

How do I clean and fill?

Good question! The dataset I’m using is from the gapminder study and has no categorical variables but has missing values. This particular python function requires floats for the input variables, so all strings need to be converted, and any missing data needs to be filled. In the remainder of this post, wherever required, I will abbreviate Random Forests as RF. This dataset has numerical continuous variable values. One important point about the RF classifier is that it expects the response variable to take discrete class labels like (1,0) or (1,2,3); in this experiment I recode the explanatory variables in the same way.

I begin the data analysis by loading the pertinent libraries in python as

# loading the libraries
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation has been removed)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier 

Next, I load the dataset in a data frame called gapdata as

 #Load the dataset
gapdata= pd.read_csv("gap.csv", low_memory=False) 

In the previous posts I have already established that this dataset has missing values, so to treat them I call the dropna() function and save the clean data in a new variable called data_clean as
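A minimal sketch of this step, using a tiny made-up data frame in place of gap.csv so that it runs on its own (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for gap.csv (values are illustrative only)
gapdata = pd.DataFrame({
    "breastcancerper100th": [23.5, np.nan, 46.9, 18.2],
    "femaleemployrate": [25.6, 42.1, np.nan, 58.7],
    "alcconsumption": [0.03, 7.29, 0.69, 10.17],
})

# Drop every row that has at least one missing value
data_clean = gapdata.dropna()
print(data_clean.shape)
```

Note that dropna() removes whole rows, so a country with a single missing value is lost entirely.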


Next, I coerce the explanatory and response variables to numeric with pd.to_numeric(), the replacement for the older convert_objects() method, as

# Data pre-processing tasks
# errors='coerce' turns any value that cannot be parsed as a number into NaN
data_clean['breastcancerper100th'] = pd.to_numeric(data_clean['breastcancerper100th'], errors='coerce')
data_clean['femaleemployrate'] = pd.to_numeric(data_clean['femaleemployrate'], errors='coerce')
data_clean['alcconsumption'] = pd.to_numeric(data_clean['alcconsumption'], errors='coerce')

So far, I have simply repeated the steps from previous posts on this topic. Now, I will recode the explanatory and response variables such that their values are discrete, as given:

#Create binary Breast Cancer Rate
def bin2cancer(row):
    if row['breastcancerper100th'] <= 20:
        return 0
    elif row['breastcancerper100th'] > 20:
        return 1

#Create binary Alcohol consumption
def bin2alcohol(row):
    if row['alcconsumption'] <= 5:
        return 0
    elif row['alcconsumption'] > 5:
        return 1

# create binary Female employee rate
def bin2femalemployee(row):
    if row['femaleemployrate'] <= 50:
        return 0
    elif row['femaleemployrate'] > 50:
        return 1

I will now attach these new variables to the dataset by using the apply function as shown

#Apply the new variables bin2alcohol, bin2femalemployee, bin2cancer to the gapminder dataset
data_clean['bin2femalemployee'] = data_clean.apply (lambda row: bin2femalemployee (row),axis=1)
data_clean['bin2alcohol'] = data_clean.apply (lambda row: bin2alcohol (row),axis=1)
data_clean['bin2cancer']=data_clean.apply(lambda row: bin2cancer(row),axis=1) 
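As an alternative sketch (not the code used in this experiment), the same thresholds can be applied to whole columns at once instead of row by row. One assumption to be aware of: a comparison treats a missing value as False, so this version would map missing values to 0 and should only be used after the missing rows have been dropped. The three rows are made up for illustration.

```python
import pandas as pd

# Made-up stand-in rows; the thresholds match the functions above
data_clean = pd.DataFrame({
    "breastcancerper100th": [18.2, 20.0, 46.9],
    "alcconsumption": [0.03, 5.0, 10.17],
    "femaleemployrate": [25.6, 50.0, 58.7],
})

# A comparison gives True/False; astype(int) turns that into 1/0
data_clean["bin2cancer"] = (data_clean["breastcancerper100th"] > 20).astype(int)
data_clean["bin2alcohol"] = (data_clean["alcconsumption"] > 5).astype(int)
data_clean["bin2femalemployee"] = (data_clean["femaleemployrate"] > 50).astype(int)

print(data_clean[["bin2cancer", "bin2alcohol", "bin2femalemployee"]])
```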

To check the dataset, I use the describe() function as,
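A minimal sketch of that call, on a made-up stand-in for the recoded columns:

```python
import pandas as pd

# Made-up stand-in for the recoded columns
data_clean = pd.DataFrame({
    "bin2femalemployee": [0, 1, 1, 0],
    "bin2alcohol": [0, 1, 0, 1],
    "bin2cancer": [1, 1, 0, 0],
})

# Count, mean, std, min, quartiles and max for every numeric column
summary = data_clean.describe()
print(summary)
```

For 0/1 columns the mean row is simply the share of rows coded as 1, which makes it a quick check on class balance.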


On executing this command, I can see the result shown in fig 1.
Since I want to predict the effect of alcohol consumption in female employees leading to breast cancer, my explanatory variables are bin2femalemployee and bin2alcohol (note that these are the recoded ones) and my response variable is bin2cancer. So, I will now set the predictor and target variables as,

# Assign predictor and target variable
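A minimal sketch of the assignment. The names predictors and targets are assumed names, and the tiny data frame is a made-up stand-in for the recoded dataset:

```python
import pandas as pd

# Made-up stand-in for the recoded data frame
data_clean = pd.DataFrame({
    "bin2femalemployee": [0, 1, 1, 0],
    "bin2alcohol": [0, 1, 0, 1],
    "bin2cancer": [1, 1, 0, 0],
})

# Explanatory variables: the two recoded columns (names assumed)
predictors = data_clean[["bin2alcohol", "bin2femalemployee"]]
# Response variable: the recoded breast cancer rate
targets = data_clean["bin2cancer"]

print(predictors.shape, targets.shape)
```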

Once the predictor and target variables have been assigned, I split the sample into 60% training and 40% testing sets using the train_test_split function from the sklearn library, where a test_size of 0.4 means that 40% of the rows go to the test set

 #Split into training and testing sets
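A minimal sketch of the split. The names pred_train, pred_test and tar_train are assumed (tar_test is the name used later in this post), random_state is added only to make the sketch repeatable, and the 165 random rows are a made-up stand-in chosen so that a 60/40 split leaves 99 training rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in data: 165 rows of random 0/1 values (an assumption)
rng = np.random.default_rng(0)
data_clean = pd.DataFrame(
    rng.integers(0, 2, size=(165, 3)),
    columns=["bin2alcohol", "bin2femalemployee", "bin2cancer"],
)
predictors = data_clean[["bin2alcohol", "bin2femalemployee"]]
targets = data_clean["bin2cancer"]

# test_size=.4 keeps 60% of the rows for training and 40% for testing
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, targets, test_size=.4, random_state=0)

# Shape is (rows, columns) for each piece
print(pred_train.shape)
print(pred_test.shape)
```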

Before I begin training, let me show the shape of the training data. Calling shape on the training predictors yields (99, 2), which means the training set has 99 rows and 2 columns. Alright, so now I build the model on the training data as

 #Build model on training data
from sklearn.ensemble import RandomForestClassifier
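A minimal sketch of the fit and predict steps on made-up binary data. The variable name classifier and the value n_estimators=25 are assumptions; predictions and tar_test are the names used in the accuracy call later in this post.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Made-up binary data standing in for the recoded gapminder columns
rng = np.random.default_rng(0)
predictors = rng.integers(0, 2, size=(165, 2))
targets = rng.integers(0, 2, size=165)
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, targets, test_size=.4, random_state=0)

# n_estimators is the number of trees in the forest; 25 is an assumed value
classifier = RandomForestClassifier(n_estimators=25, random_state=0)
classifier = classifier.fit(pred_train, tar_train)

# One predicted class (0 or 1) per test row
predictions = classifier.predict(pred_test)
print(predictions[:10])
```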

Once the model is built, I call the confusion matrix function from sklearn.metrics to check for correct and incorrect classifications. It yields the confusion matrix shown in fig 2.
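A minimal sketch of how such a matrix is produced with sklearn.metrics.confusion_matrix. The two label arrays are reconstructed stand-ins, built only to reproduce the four counts reported for this experiment; they are not the actual test rows. For binary labels scikit-learn puts the true classes on the rows and the predicted classes on the columns.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Reconstructed stand-ins: 47 TN, 7 FP, 4 FN, 8 TP (66 test rows in total)
tar_test = np.array([0] * 47 + [0] * 7 + [1] * 4 + [1] * 8)
predictions = np.array([0] * 47 + [1] * 7 + [0] * 4 + [1] * 8)

# Layout for binary labels: [[TN, FP], [FN, TP]]
matrix = confusion_matrix(tar_test, predictions)
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print(tn, fp, fn, tp)
```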

We see that the True Positives (TP) [equivalent term: hit or correct] are 8 cases and the True Negatives (TN) [equivalent term: correct rejection] are 47 cases on the diagonal. Similarly, the False Positives (FP) [equivalent terms: false alarm, Type I error] are 7 cases and the False Negatives (FN) [equivalent terms: miss, Type II error] are 4 cases. I then check the model accuracy

 sklearn.metrics.accuracy_score(tar_test, predictions)

and I get an accuracy of 83%, as shown in fig 3.
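That figure can be checked by hand from the four counts in the confusion matrix. The sketch below only does arithmetic on the numbers reported above; the sensitivity and specificity lines are shown for context and are not reported elsewhere in this post.

```python
# Counts read off the confusion matrix reported above
tp, tn, fp, fn = 8, 47, 7, 4
total = tp + tn + fp + fn  # 66 test rows

# Accuracy: share of all test rows that were classified correctly
accuracy = (tp + tn) / total
# Sensitivity: share of actual positives that were caught
sensitivity = tp / (tp + fn)
# Specificity: share of actual negatives that were correctly rejected
specificity = tn / (tn + fp)

print(round(accuracy, 2))  # 0.83
print(round(sensitivity, 2), round(specificity, 2))
```

Because the negative class dominates the test set, accuracy alone can look better than the classifier's ability to find positive cases, which is one more reason to read the 83% with some caution.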

Key Learning Outcomes

  • An accuracy score of 83% is far better than the 22% score obtained with the decision tree, but should I be happy about it? I’m skeptical because I had to convert the original continuous data to categorical; otherwise I was unable to conduct the experiment in python.
  • If I make the mistake of running the RF classifier on continuous data, the classifier throws an error, “ValueError: Can’t handle mix of continuous and multiclass”. To solve this error, see this stack overflow answer here.
  • As suggested there, the RF classifier will work either for binary values or for a fixed label set. A fixed label set can be defined by categorising the explanatory and response variables as shown above.
  • Therefore, for continuous data, use the RF regression algorithm. I leave this as future work.
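As a pointer for that future work, here is a minimal sketch of RF regression with scikit-learn's RandomForestRegressor. Everything in it is an assumption made for illustration: the data are synthetic, not the gapminder columns, and the parameter values are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Made-up continuous data: the target is a noisy function of two predictors
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(165, 2))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, size=165)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.4, random_state=0)

# A regressor accepts continuous targets, so no recoding is needed
regressor = RandomForestRegressor(n_estimators=25, random_state=0)
regressor.fit(X_train, y_train)

# score() reports R^2 on the held-out rows, not classification accuracy
r_squared = regressor.score(X_test, y_test)
print(r_squared)
```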

The IPython notebook is listed here


Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Dudoit, S., & Fridlyand, J. (2003). Classification in microarray experiments. Statistical analysis of gene expression microarray data, 1, 93-158.
Liaw, A., & Wiener, M. (2002). Classification and regression by random forest. R news, 2(3), 18-22.
Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5), 1651-1686.
Segal, M. R. (2004). Machine learning benchmarks and random forest regression. Center for Bioinformatics & Molecular Biostatistics.

