In general, a decision tree is an inverted tree structure with a single root whose branches lead to various subtrees, which themselves may have sub-subtrees, until they terminate in leaves. Unlike biological trees, a decision tree in computer science grows upside down 🙂 Technically, a tree is a set of nodes and arcs, where each arc “descends” from a node to that node's children.
Since I am not a biologist, I will explain it from a different perspective. For the remainder of this text, I will refer to Decision Trees as DTs. According to the sklearn documentation, “Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.” One of their advantages is that DTs can handle both categorical and numerical data types. It is also possible to validate a model using statistical tests, which makes it possible to account for the reliability of the model. However, there are drawbacks: small variations in the data can result in a completely different tree being generated, and the problem of learning an optimal decision tree is known to be NP-complete. Decision tree learners also create biased trees if some classes dominate, so it is recommended to balance the dataset before fitting the decision tree.
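Because class dominance biases the tree, it is worth checking the class shares of the target before fitting. A minimal sketch (the helper name and the toy series are mine, purely illustrative):

```python
import pandas as pd

def class_balance(target: pd.Series) -> pd.Series:
    """Return the share of each class in a target variable."""
    return target.value_counts(normalize=True)

# Hypothetical toy target where class 1 dominates:
toy = pd.Series([1, 1, 1, 0, 1, 1])
print(class_balance(toy))
```

If one class holds the large majority of the rows, rebalancing (e.g. resampling) before fitting is worth considering.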
Linear regression, as shown in the previous posts, generates a single linear equation that applies to the full sample. When the data contain a large number of explanatory variables that may interact with each other in complicated ways, building a global linear model may be difficult, if not foolish. In this post, I will review DTs, a data mining method that allows us to explore potentially complicated interactions within our data by creating segments or subgroups. Like linear regression, DTs are statistical models designed for what are known as supervised prediction problems. In supervised prediction problems, a set of explanatory variables (also known as predictors, inputs, or features) is used to predict the value of a response variable (also known as the outcome or target variable). When the response variable is categorical in nature, the model is called a classification tree. However, the dataset I have chosen is purely non-categorical, so if I do not convert the response variable to categorical before running a classification model, my prediction rate would not be correct.
I begin by loading the dataset, but first I import the necessary libraries for the analysis:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
I now load the dataset:
''' Data Engineering & Analysis '''
# Load the data and coerce the variables of interest to numeric
# (pd.to_numeric replaces the deprecated convert_objects)
gapdata = pd.read_csv("gap.csv", low_memory=False)
data_clean = gapdata.dropna().copy()
data_clean['breastcancerper100th'] = pd.to_numeric(data_clean['breastcancerper100th'], errors='coerce')
data_clean['femaleemployrate'] = pd.to_numeric(data_clean['femaleemployrate'], errors='coerce')
data_clean['alcconsumption'] = pd.to_numeric(data_clean['alcconsumption'], errors='coerce')
data_clean['incomeperperson'] = pd.to_numeric(data_clean['incomeperperson'], errors='coerce')
Now, the important step: I convert the response variable breastcancerper100th to binary, such that reported breast cancer cases less than or equal to 20 are coded as 0 and cases greater than 20 are coded as 1. I will convert the explanatory variables to categorical as well.
#Create binary Breast Cancer Rate
def bin2cancer(row):
    if row['breastcancerper100th'] <= 20:
        return 0
    elif row['breastcancerper100th'] > 20:
        return 1

#Create binary Alcohol consumption
def bin2alcohol(row):
    if row['alcconsumption'] <= 5:
        return 0
    elif row['alcconsumption'] > 5:
        return 1

# create binary Female employee rate
def bin2femalemployee(row):
    if row['femaleemployrate'] <= 50:
        return 0
    elif row['femaleemployrate'] > 50:
        return 1

#Apply the new binary variables to the gapminder dataset
data_clean['bin2femalemployee'] = data_clean.apply(lambda row: bin2femalemployee(row), axis=1)
data_clean['bin2alcohol'] = data_clean.apply(lambda row: bin2alcohol(row), axis=1)
data_clean['bin2cancer'] = data_clean.apply(lambda row: bin2cancer(row), axis=1)
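The three row-wise helpers above can also be written as one vectorized step with numpy.where, which is faster than apply on larger frames. A sketch with the same thresholds (the toy frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the three variables from the post:
df = pd.DataFrame({
    'breastcancerper100th': [10.0, 25.0, 30.0],
    'alcconsumption':       [2.0,  7.0,  5.0],
    'femaleemployrate':     [40.0, 55.0, 50.0],
})

# Same coding rules: values above the threshold get 1, otherwise 0
df['bin2cancer'] = np.where(df['breastcancerper100th'] > 20, 1, 0)
df['bin2alcohol'] = np.where(df['alcconsumption'] > 5, 1, 0)
df['bin2femalemployee'] = np.where(df['femaleemployrate'] > 50, 1, 0)

print(df[['bin2cancer', 'bin2alcohol', 'bin2femalemployee']])
```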
Next, I split the data into a training set (60%) and a testing set (40%);
""" Modeling and Prediction """ #Split into training and testing sets predictors=data_clean[['femaleemployrate','alcconsumption']] target=data_clean.bin2cancer print data_clean.dtypes print data_clean.describe() pred_train,pred_test,tar_train,tar_test=train_test_split(predictors,target,test_size=0.4)
The training sample has 127 observations (rows) with 2 explanatory variables, i.e. 60% of the data, while the test sample has 86 observations, 40% of the original sample, again with 2 explanatory variables (columns), as shown in fig 1.
Fig 1: Training data shape
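These shapes follow from how train_test_split rounds a fractional test_size; a self-contained sketch with a stand-in array of the same size (213 cleaned rows are assumed here to match 127 + 86):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the cleaned data: 213 rows, 2 predictors
X = np.zeros((213, 2))
y = np.zeros(213)

# test_size=0.4 -> the test set gets ceil(213 * 0.4) = 86 rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
print(X_train.shape, X_test.shape)  # (127, 2) (86, 2)
```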
Once the training and testing datasets have been created, I initialize the decision tree classifier from sklearn and build the model on the training data using the classifier.fit() function, as shown. It is this fit() function that builds our model;
#Build model on training data
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
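As noted earlier, small variations in the data can produce a completely different tree. Fixing random_state makes the fit reproducible, and limiting max_depth keeps the tree small; a sketch on synthetic data (the parameter values are illustrative, not tuned):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# Synthetic two-feature classification problem
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Reproducible, depth-limited tree
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
print(clf.get_depth())  # at most 2
```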
I then check the confusion matrix and the classification accuracy,
sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
Fig 2: Classification accuracy
I get a prediction accuracy rate of 22% (technically, it's awful), which means the DT model correctly classified only 22% of the test cases. Finally, I will now grow the decision tree;
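The accuracy can be read off the confusion matrix directly: the diagonal holds the correctly classified cases, so accuracy is the diagonal sum over the total. A sketch with hypothetical counts (the numbers below are mine, chosen only so the total matches the 86 test rows):

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = np.array([[10, 30],
               [37,  9]])

# Accuracy = correctly classified (diagonal) / all test cases
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 2))  # 0.22
```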
#Displaying the decision tree
from sklearn import tree
from io import StringIO
from IPython.display import Image
import pydotplus

out = StringIO()
tree.export_graphviz(classifier, out_file=out)
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())

# saving the DT to a file
with open('picture_out1.png', 'wb') as f:
    f.write(graph.create_png())
Interpreting the results:
Fig 3: Decision Tree
The resulting tree in fig 3 above starts with a split on X, our first explanatory variable, alcohol consumption. I recoded it as a binary variable such that alcohol consumption less than or equal to 5 litres is coded as 0 and above that as 1. The first split is at X <= 0.5, i.e. alcohol consumption below 5 litres: these observations move to the left side and include 52 of the 99 samples in the training set. From this node, another split is made on the female employee rate variable (femaleemployrate): among individuals with low alcohol consumption from the first split and a female employee rate less than or equal to 2.5 in the second split, 2 individuals do not have breast cancer while 21 of them do. To the right of this split (X > 0.5), in workplaces with high female employee rates, 3 individuals have breast cancer while 26 do not.
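Which variable drives the splits can also be checked numerically via the fitted tree's feature_importances_ attribute, without reading the plot. A sketch on synthetic data (the column names match the post; the data and the dependence on alcohol consumption are contrived for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    'femaleemployrate': rng.uniform(30, 80, 200),
    'alcconsumption':   rng.uniform(0, 15, 200),
})
# Contrived target that depends only on alcohol consumption:
y = (X['alcconsumption'] > 5).astype(int)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
for name, importance in zip(X.columns, clf.feature_importances_):
    print(name, round(importance, 3))
```

Here nearly all of the importance lands on alcconsumption, mirroring a tree whose first split is on that variable.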
- Classification trees work very well when the target variable is categorical in nature; quantitative target variables are handled by their counterpart, regression trees.
- Converting a continuous variable to a categorical one lowers the prediction accuracy.
- Dropping the missing values significantly alters the prediction accuracy.
- Data massaging (meaning proper treatment of missing values and outliers, and maintaining a uniform data distribution) plays a significant role.
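On the missing-value point above: an alternative to dropping rows is to impute them, which keeps the sample size intact. A minimal sketch using sklearn's SimpleImputer (the toy array and the median strategy are my illustrative choices):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with missing entries in both columns
X = np.array([[7.0,    np.nan],
              [3.0,    40.0],
              [np.nan, 60.0]])

# Replace each NaN with that column's median
imputer = SimpleImputer(strategy='median')
X_filled = imputer.fit_transform(X)
print(X_filled)
```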
The complete code is listed on my GitHub account.