Big or small, let's save them all: ANOVA & Tukey's HSD Post Hoc Comparison Test

The objective of this post is to examine the differences in the mean response variable across the categories of our explanatory variable. The dataset I have chosen does not have a categorical explanatory variable, so I will also show how to convert a quantitative variable into a categorical one in Python. The null hypothesis states that there is NO relationship between the explanatory and response variables; in other words, all group means are equal. The alternative hypothesis states that there IS a relationship between the explanatory and response variables; in other words, not all group means are equal.

In a previous post, I showed data visualizations in Python. In this post, I will discuss the inferential statistical tests that can be used in the analysis. But before that, let me revisit the data variables of interest:

  1. 'alcconsumption' - average alcohol consumption, adult (15+) per-capita consumption in litres of pure alcohol
  2. 'breastcancerper100TH' - number of new breast cancer cases per 100,000 female residents during the given year
  3. 'femaleemployrate' - percentage of the female population, age 15 and above, employed during the given year

Recall, for this study the various hypotheses were;

H0 (Null Hypothesis) = Breast cancer is not caused by alcohol consumption

H1 (Alternative Hypothesis) = Alcohol consumption causes breast cancer

H2 (Alternative Hypothesis) = Employed females are susceptible to an increased risk of breast cancer

The dependent/response/outcome variable (Y) is breast cancer per 100,000 women, and the independent/explanatory variables (X) are alcohol consumption and female employ rate.

1. The response variable, or dependent variable, is breast cancer per 100,000 women (Y)
2. Convert the quantitative variables to categorical in Python

In Table 1, I show how you can judiciously choose the inferential test to conduct.

Explanatory/Independent variable | Response/Dependent variable      | Inferential Test Choice
Categorical                      | Quantitative                     | ANOVA
Categorical                      | Categorical                      | Chi-Square
Quantitative                     | Categorical (with only 2 levels) | Chi-Square
Quantitative                     | Quantitative                     | Correlation Coefficient

Table 1: Choosing an Inferential test based on variable type

So, let's check if there is any relationship between breast cancer and alcohol consumption.

The dataset I have chosen has both the explanatory and the response variable as quantitative. But what if I want to use analysis of variance as the inferential test on this dataset? What do I do?
Answer: Convert the quantitative explanatory/independent variable into a categorical one. This can be achieved in Python by binning the variable (for example with pd.qcut, as shown below) and using .astype('category')
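As a minimal sketch of this conversion, here is a hypothetical quantitative column (the values and bin edges are illustrative, not from the gapminder data) binned with pd.cut and then marked as categorical:

```python
import pandas as pd

# Hypothetical example of a quantitative variable (litres of alcohol)
df = pd.DataFrame({'litres': [0.5, 3.2, 7.1, 12.4, 18.9, 22.0]})

# Bin the quantitative values into labelled intervals, then tell
# pandas to treat the result as a categorical dtype
df['litres_cat'] = pd.cut(df['litres'],
                          bins=[0, 5, 10, 15, 20, 25],
                          labels=['0-4', '5-9', '10-14', '15-19', '20-24'])
df['litres_cat'] = df['litres_cat'].astype('category')

print(df['litres_cat'].dtype)  # category
```

pd.cut splits on fixed bin edges, while pd.qcut (used later in this post) splits on quantiles so each bin holds roughly the same number of observations.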

Note: For statsmodels to work, I first have to install the patsy library into Python as given

pip install patsy

import pandas as pd

# Reading the data; low_memory=False reads the whole file in one pass
# so column dtypes are inferred consistently
data = pd.read_csv("gapminder.csv", low_memory=False)

# Convert the variables of interest to numeric
# (the old convert_objects method is deprecated; pd.to_numeric replaces it)
data['breastcancerper100th'] = pd.to_numeric(data['breastcancerper100th'], errors='coerce')
data['femaleemployrate'] = pd.to_numeric(data['femaleemployrate'], errors='coerce')
data['alcconsumption'] = pd.to_numeric(data['alcconsumption'], errors='coerce')

# Making a copy of the dataset as sub5
sub5 = data.copy()
print("\n\t### Before Missing value count: ", sub5.isnull().sum())

# Fill each column's missing values with that column's own mean
sub5['breastcancerper100th'] = sub5['breastcancerper100th'].fillna(sub5['breastcancerper100th'].mean())
sub5['femaleemployrate'] = sub5['femaleemployrate'].fillna(sub5['femaleemployrate'].mean())
sub5['alcconsumption'] = sub5['alcconsumption'].fillna(sub5['alcconsumption'].mean())

print("\n\t### After Missing value count: ", sub5.isnull().sum())
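The column-wise mean imputation above can be checked on a tiny made-up frame (the column names and values here are illustrative only):

```python
import pandas as pd
import numpy as np

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                    'b': [np.nan, 4.0, 6.0]})

# Fill each column with its own mean rather than one global value,
# so each variable keeps its own scale
toy['a'] = toy['a'].fillna(toy['a'].mean())
toy['b'] = toy['b'].fillna(toy['b'].mean())

print(toy.isnull().sum().sum())  # 0 missing values remain
```

Filling per column matters: calling fillna once on the whole frame with a single column's mean would push that mean into every other column's gaps.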

I will now create categories for the explanatory variable alcohol consumption, defining the bins as 0 litres, 1-4 litres, 5-9 litres, 10-14 litres, 15-19 litres and 20-24 litres. I code this in Python as follows:

sub5['alco'] = pd.qcut(sub5.alcconsumption, 6, labels=["0", "1-4", "5-9", "10-14", "15-19", "20-24"])
sub5['brst'] = pd.qcut(sub5.breastcancerper100th, 5, labels=["1-20", "21-40", "41-60", "61-80", "81-90"])
sub5['emply'] = pd.qcut(sub5.femaleemployrate, 4, labels=["30-39", "40-59", "60-79", "80-90"])
# Don't forget to tell Python to treat this variable as categorical
sub5['alco'] = sub5['alco'].astype('category')

print('\n\t describe alcohol consumption\n')
desc3 = sub5['alco'].describe()
print(desc3)

Now, to perform the ANOVA test. Note that by wrapping the variable in a capital C() in the formula below, I am telling Python that the variable "alco" is categorical in nature. I code it as

import statsmodels.formula.api as smf

model1 = smf.ols(formula='breastcancerper100th ~ C(alco)', data=sub5).fit()
print(model1.summary())
On executing this code, I get the statistical results given below. The F-statistic is calculated as F = (variation among sample means) / (variation within groups).
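This formula can be verified numerically on a small made-up example (the three groups below are illustrative, not taken from the gapminder data):

```python
import numpy as np

# Three hypothetical groups of observations
groups = [np.array([20.0, 22.0, 25.0]),
          np.array([30.0, 33.0, 35.0]),
          np.array([48.0, 50.0, 52.0])]

grand_mean = np.mean(np.concatenate(groups))
k = len(groups)                      # number of groups
n = sum(len(g) for g in groups)      # total number of observations

# Variation among sample means (between-group mean square)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

# Variation within groups (within-group mean square)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (n - k)

F = ms_between / ms_within
print(F)
```

The same value is what scipy.stats.f_oneway would return, and it is what statsmodels reports on the "F-statistic" line of the OLS summary.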

OLS Regression Results

Dep. Variable:    breastcancerper100th   R-squared:           0.338
Model:            OLS                    Adj. R-squared:      0.322
Method:           Least Squares          F-statistic:         21.16
Date:             Wed, 10 Feb 2016       Prob (F-statistic):  4.54e-17
Time:             11:31:13               Log-Likelihood:      -900.54
No. Observations: 213                    AIC:                 1813.
Df Residuals:     207                    BIC:                 1833.
Df Model:         5
Covariance Type:  nonrobust

                     coef   std err       t   P>|t|   [95.0% Conf. Int.]
Intercept         25.8390     2.805   9.211   0.000    20.308    31.370
C(alco)[T.1-4]     0.7356     3.995   0.184   0.854    -7.141     8.613
C(alco)[T.5-9]     2.5974     3.967   0.655   0.513    -5.224    10.419
C(alco)[T.10-14]  22.0557     3.995   5.520   0.000    14.179    29.933
C(alco)[T.15-19]  31.7415     3.995   7.944   0.000    23.864    39.619
C(alco)[T.20-24]  12.8044     3.967   3.228   0.001     4.983    20.626

Omnibus:          30.228     Durbin-Watson:      1.693
Prob(Omnibus):    0.000      Jarque-Bera (JB):   41.008
Skew:             0.884      Prob(JB):           1.25e-09
Kurtosis:         4.222      Cond. No.           6.81

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

From the above output, I see that the ANOVA F-statistic is 21.16 and the p-value, Prob (F-statistic): 4.54e-17, is less than the significance level of 0.05, so the null hypothesis is rejected in favor of the alternative hypothesis.

However, this is not the end, because the group means still have to be compared.

The explanatory variable here has more than 2 levels, or groups of categorization, and a significant ANOVA does not tell us which groups differ from the others. To test this difference, I conduct a post hoc comparison, choosing Tukey's Honestly Significant Difference (HSD) test. Conducting a post hoc test is pertinent to avoid a type 1 error, which occurs when the researcher incorrectly rejects a null hypothesis that is actually true. So, to conduct the post hoc paired comparisons in the context of my ANOVA, examining whether alcohol consumption by women is associated with an increased risk of breast cancer, I use the Tukey HSD (Honestly Significant Difference) test from the Python library statsmodels.stats.multicomp, as shown below.

In Python, to conduct the post hoc multiple comparison test, import the library

import statsmodels.stats.multicomp as multi

The Python code for this test is given below, where I pass the quantitative response variable sub5['breastcancerper100th'] followed by the categorical explanatory variable sub5['alco']:

# Conducting a post hoc comparison test to guard against a type 1 error
mc1 = multi.MultiComparison(sub5['breastcancerper100th'], sub5['alco'])
res1 = mc1.tukeyhsd()
print(res1.summary())
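The same call pattern can be checked end to end on synthetic data. This is a sketch with made-up groups (the group names 'low', 'mid' and 'high' and the numbers are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.stats.multicomp as multi

# Hypothetical toy data: a quantitative response and a 3-level group label
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'rate':  np.concatenate([rng.normal(20, 2, 30),
                             rng.normal(30, 2, 30),
                             rng.normal(45, 2, 30)]),
    'group': ['low'] * 30 + ['mid'] * 30 + ['high'] * 30,
})

# Pairwise Tukey HSD: quantitative response first, group labels second
mc = multi.MultiComparison(toy['rate'], toy['group'])
res = mc.tukeyhsd()
print(res.summary())
```

With group means this far apart relative to their spread, every pairwise comparison comes out with reject=True, mirroring how the table below is read.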

And the resultant output is;

Multiple Comparison of Means - Tukey HSD, FWER=0.05

group1  group2  meandiff    lower     upper   reject
  0      1-4      0.7356  -10.7577   12.229   False
  0      10-14   22.0557   10.5624   33.5491   True
  0      15-19   31.7415   20.2482   43.2349   True
  0      20-24   12.8044    1.3922   24.2165   True
  0      5-9      2.5974   -8.8148   14.0095  False
 1-4     10-14   21.3201    9.7461   32.8941   True
 1-4     15-19   31.0059   19.4319   42.5799   True
 1-4     20-24   12.0687    0.5754   23.5621   True
 1-4     5-9      1.8617   -9.6316   13.3551  False
10-14    15-19    9.6858   -1.8882   21.2598  False
10-14    20-24   -9.2513  -20.7447    2.242   False
10-14    5-9    -19.4583  -30.9517   -7.965    True
15-19    20-24  -18.9371  -30.4305   -7.4438   True
15-19    5-9    -29.1441  -40.6375  -17.6508   True
20-24    5-9    -10.207   -21.6191    1.2051  False

From the above Tukey post hoc comparison test, we see the comparisons between the different levels of the categorical explanatory variable "alco", and I make the following conclusions:

In the sixth row of the table, we see the comparison between the groups with alcohol consumption of 1-4 litres and 10-14 litres, along with the mean difference in breast cancer cases reported between these two groups. From this row we can safely reject the null hypothesis and accept the alternative hypothesis that increased alcohol consumption in women is associated with a higher risk of breast cancer.

The complete Python code is listed on my GitHub repository here.

Thank you for reading.