The objective of this post is to examine the differences in the mean of the response variable across the categories of the explanatory variable. The dataset I have chosen does not have any categorical variables, so I will also show in this post how to convert a quantitative variable into a categorical one. The null hypothesis states that there is NO relationship between the explanatory and response variables; in other words, ALL MEANS ARE EQUAL. The alternative hypothesis states that there is a relationship between the explanatory and response variables; in other words, NOT ALL MEANS ARE EQUAL.

In a previous post, I showed data visualizations in Python. In this post, I will discuss the various inferential statistical tests that can be used in the analysis. But before that, let me revisit the data variables of interest:

- ‘alcconsumption’ – average alcohol consumption; adult (15+) per-capita consumption in litres of pure alcohol
- ‘breastcancerper100TH’ – number of new cases of breast cancer per 100,000 female residents during the given year
- ‘femaleemployrate’ – percentage of the female population above age 15 that was employed during the given year

Recall, for this study the various hypotheses were;

H0 (Null Hypothesis): Breast cancer is not caused by alcohol consumption.

H1 (Alternative Hypothesis): Alcohol consumption causes breast cancer.

H2 (Alternative Hypothesis): Female employees are susceptible to an increased risk of breast cancer.

The dependent/response/outcome variable (Y) is breast cancer per 100,000 women, and the independent/explanatory variables (X) are alcohol consumption and female employ rate.

1. The response variable, or the dependent variable, is breast cancer per 100,000 women (Y).

2. Convert the response variable, breast cancer, to categorical in Python.

In Table 1, I show how you can judiciously choose the inferential test to conduct.

| Explanatory/Independent variable | Response/Dependent variable | Inferential test choice |
| --- | --- | --- |
| Categorical | Quantitative | ANOVA |
| Categorical | Categorical | Chi-Square |
| Quantitative | Categorical (with only 2 levels) | Chi-Square |
| Quantitative | Quantitative | Correlation Coefficient |

Table 1: Choosing an inferential test based on variable type

So, let’s check if there is any relationship between breast cancer and alcohol consumption.

The dataset I have chosen has both the explanatory and the response variable as quantitative. But what if I want to use analysis of variance as the inferential test on this dataset? What do I do?

Answer: Convert the quantitative explanatory/independent variable into a categorical one. This can be achieved in Python by binning the values and using .astype('category').
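As a minimal sketch of this idea (with a made-up toy series, not the gapminder data), quantitative values can be binned into labelled groups with pandas' `cut` and then marked as categorical:

```python
import pandas as pd

# Hypothetical per-capita alcohol consumption values (litres)
alc = pd.Series([0.5, 3.2, 7.8, 12.1, 18.4, 22.9])

# Bin the quantitative values into labelled intervals,
# then tell pandas to treat the result as categorical
groups = pd.cut(alc, bins=[0, 5, 10, 15, 20, 25],
                labels=["0-4", "5-9", "10-14", "15-19", "20-24"])
groups = groups.astype('category')

print(groups.dtype)   # category
print(list(groups))   # ['0-4', '0-4', '5-9', '10-14', '15-19', '20-24']
```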

**Note:** For statsmodels to work, I first have to install the patsy library into Python, as given:

```shell
pip install https://github.com/pydata/patsy/archive/master.zip
```

```python
import pandas as pd

# Read the data; low_memory=False lets pandas read the whole file
# before inferring column types
data = pd.read_csv("gapminder.csv", low_memory=False)
data['breastcancerper100th'] = pd.to_numeric(data['breastcancerper100th'], errors='coerce')
data['femaleemployrate'] = pd.to_numeric(data['femaleemployrate'], errors='coerce')
data['alcconsumption'] = pd.to_numeric(data['alcconsumption'], errors='coerce')

# Make a copy of the dataset as sub5
sub5 = data.copy()
print("\n\t### Before missing value count:\n", sub5.isnull().sum())

# Replace each column's missing values with that column's mean
sub5['breastcancerper100th'].fillna(sub5['breastcancerper100th'].mean(), inplace=True)
sub5['femaleemployrate'].fillna(sub5['femaleemployrate'].mean(), inplace=True)
sub5['alcconsumption'].fillna(sub5['alcconsumption'].mean(), inplace=True)
print("\n\t### After missing value count:\n", sub5.isnull().sum())
```

I will now create categories for the explanatory variable alcohol consumption and define them as alcohol consumption of 0 litres, 1-4 litres, 5-9 litres, 10-14 litres, 15-19 litres and 20-24 litres. I code this in Python as follows:

```python
sub5['alco'] = pd.qcut(sub5.alcconsumption, 6,
                       labels=["0", "1-4", "5-9", "10-14", "15-19", "20-24"])
sub5['brst'] = pd.qcut(sub5.breastcancerper100th, 5,
                       labels=["1-20", "21-40", "41-60", "61-80", "81-90"])
sub5['emply'] = pd.qcut(sub5.femaleemployrate, 4,
                        labels=["30-39", "40-59", "60-79", "80-90"])

# Don't forget to tell python to treat this variable as categorical
sub5['alco'] = sub5['alco'].astype('category')

print('\n\t describe alcohol consumption\n')
desc3 = sub5['alco'].describe()
print(desc3)
```

Now, to perform the ANOVA test. Note that by adding a capital C around the variable name in the code below, I am telling Python that the variable “alco” is categorical in nature. I code it as:

```python
import statsmodels.formula.api as smf

model1 = smf.ols(formula='breastcancerper100th ~ C(alco)', data=sub5)
results1 = model1.fit()
print(results1.summary())
```

On executing this code, I get the following statistical results. The F-statistic is calculated as *F = variation among sample means / variation within groups*.

```
                            OLS Regression Results
================================================================================
Dep. Variable:     breastcancerper100th   R-squared:            0.338
Model:                              OLS   Adj. R-squared:       0.322
Method:                   Least Squares   F-statistic:          21.16
Date:                  Wed, 10 Feb 2016   Prob (F-statistic):   4.54e-17
Time:                          11:31:13   Log-Likelihood:       -900.54
No. Observations:                   213   AIC:                  1813.
Df Residuals:                       207   BIC:                  1833.
Df Model:                             5
Covariance Type:              nonrobust
==================================================================================
                      coef   std err        t     P>|t|    [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept          25.8390     2.805    9.211     0.000     20.308    31.370
C(alco)[T.1-4]      0.7356     3.995    0.184     0.854     -7.141     8.613
C(alco)[T.5-9]      2.5974     3.967    0.655     0.513     -5.224    10.419
C(alco)[T.10-14]   22.0557     3.995    5.520     0.000     14.179    29.933
C(alco)[T.15-19]   31.7415     3.995    7.944     0.000     23.864    39.619
C(alco)[T.20-24]   12.8044     3.967    3.228     0.001      4.983    20.626
==============================================================================
Omnibus:                       30.228   Durbin-Watson:                   1.693
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               41.008
Skew:                           0.884   Prob(JB):                     1.25e-09
Kurtosis:                       4.222   Cond. No.                         6.81
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
```

From the above output, I see that the ANOVA F-statistic is 21.16 and that the p-value, Prob (F-statistic): 4.54e-17, is less than the significance level of 0.05, so the null hypothesis is rejected in favor of the alternative hypothesis.
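To make the F ratio defined above concrete, here is a small sketch on hypothetical toy samples (the group values are made up, not the gapminder data) that computes *F = variation among sample means / variation within groups* by hand and checks it against `scipy.stats.f_oneway`:

```python
import numpy as np
from scipy import stats

# Hypothetical toy samples: outcome values in three consumption groups
g1 = np.array([20.0, 22.0, 19.0, 21.0])
g2 = np.array([30.0, 28.0, 33.0, 29.0])
g3 = np.array([45.0, 44.0, 48.0, 43.0])
groups = [g1, g2, g3]

grand_mean = np.mean(np.concatenate(groups))
k = len(groups)                     # number of groups
n = sum(len(g) for g in groups)     # total observations

# Between-group (among sample means) and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
f_scipy, p_value = stats.f_oneway(g1, g2, g3)

print(f_manual, f_scipy)   # the two F values agree
```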

However, this is not the end, because the group means have to be compared too, which I show as follows.

In this case, the explanatory variable has more than 2 levels (groups of categorization), so a significant ANOVA does not tell us which groups are different from the others. To test this difference, I conduct a **Post-Hoc Comparison**; for this, I choose Tukey’s Honestly Significant Difference (HSD) test. Conducting a post-hoc test is pertinent so as to avoid inflating the type 1 error rate, the error a researcher makes by rejecting a null hypothesis that is actually true. So, to conduct the post-hoc paired comparisons in the context of my ANOVA examining the association between alcohol consumption by women and the risk of breast cancer, I use Tukey’s HSD test. To do this, I will first import the Python library statsmodels.stats.multicomp.
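A quick back-of-the-envelope calculation shows why unadjusted pairwise tests inflate the type 1 error rate. With 6 levels of "alco" there are 15 possible pairwise comparisons, and if each independently used a significance level of 0.05, the chance of at least one false positive would be far above 0.05 (the idealized independence assumption here is for illustration; Tukey's HSD keeps the familywise rate at 0.05):

```python
from math import comb

alpha = 0.05
levels = 6                    # levels of the 'alco' variable
pairs = comb(levels, 2)       # 15 pairwise comparisons

# Probability of at least one false positive across all pairs,
# if each test were run independently at alpha = 0.05
familywise_error = 1 - (1 - alpha) ** pairs
print(pairs, round(familywise_error, 3))   # 15 0.537
```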

In Python, to conduct the post-hoc multiple comparison test, import the library:

```python
import statsmodels.stats.multicomp as multi
```

The Python code for this test is given below, where I pass the quantitative response variable 'breastcancerper100th' followed by the categorical explanatory variable 'alco' in parentheses, as sub5['breastcancerper100th'], sub5['alco']:

```python
# Conducting a post-hoc comparison test to control the type 1 error rate
mc1 = multi.MultiComparison(sub5['breastcancerper100th'], sub5['alco'])
res1 = mc1.tukeyhsd()
print(res1.summary())
```

And the resultant output is;

```
Multiple Comparison of Means - Tukey HSD, FWER=0.05
===============================================
group1 group2  meandiff   lower     upper   reject
-----------------------------------------------
  0     1-4     0.7356  -10.7577   12.229   False
  0     10-14  22.0557   10.5624   33.5491  True
  0     15-19  31.7415   20.2482   43.2349  True
  0     20-24  12.8044    1.3922   24.2165  True
  0     5-9     2.5974   -8.8148   14.0095  False
 1-4    10-14  21.3201    9.7461   32.8941  True
 1-4    15-19  31.0059   19.4319   42.5799  True
 1-4    20-24  12.0687    0.5754   23.5621  True
 1-4    5-9     1.8617   -9.6316   13.3551  False
10-14   15-19   9.6858   -1.8882   21.2598  False
10-14   20-24  -9.2513  -20.7447    2.242   False
10-14   5-9   -19.4583  -30.9517   -7.965   True
15-19   20-24 -18.9371  -30.4305   -7.4438  True
15-19   5-9   -29.1441  -40.6375  -17.6508  True
20-24   5-9   -10.207   -21.6191    1.2051  False
-----------------------------------------------
```

From the above Tukey’s post-hoc comparison test, we see the comparisons between the different levels of the categorical explanatory variable “alco”, and I draw the following conclusions:

In the sixth row of the table, we see the comparison between the groups with alcohol consumption of 1-4 litres and 10-14 litres, along with the mean difference in breast cancer cases reported between these two groups. From this row we can safely reject the null hypothesis and accept the alternative hypothesis that increased alcohol consumption in women is associated with a higher risk of breast cancer.
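The reject column can also be read programmatically rather than by eye. As a small self-contained sketch (the three groups and their values are synthetic, not the gapminder data), the `tukeyhsd` result exposes `groupsunique`, `meandiffs` and a boolean `reject` array, one entry per pair:

```python
import itertools
import numpy as np
import statsmodels.stats.multicomp as multi

# Hypothetical toy data: two clearly separated groups and one close pair
rng = np.random.RandomState(0)
values = np.concatenate([rng.normal(20, 2, 30),
                         rng.normal(21, 2, 30),
                         rng.normal(40, 2, 30)])
labels = np.repeat(['low', 'mid', 'high'], 30)

res = multi.MultiComparison(values, labels).tukeyhsd()

# res.reject flags, pair by pair, whether the group means differ significantly
for (g1, g2), diff, rej in zip(itertools.combinations(res.groupsunique, 2),
                               res.meandiffs, res.reject):
    print(g1, g2, round(diff, 2), rej)
```

Groups are listed in sorted order, so the first two pairs compare 'high' against 'low' and 'mid'; both should be flagged as significant given the large separation.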

The complete python code is listed on my github repository here.

Thank you for reading.

Cheers,

Ashish