Big or small - let's save them all: Chi-Square Test of Independence

Continuing the series, I will briefly state the variables: the dependent/response/outcome variable (Y) is breast cancer cases per 100,000 women (breastcancerper100th), and the independent/explanatory variables (X) are alcohol consumption and female employment rate.

I begin the analysis by making a copy of the original dataset as sub5

 # Create a copy of the original dataset as sub5 by using the copy() method
 # (pandas was imported earlier as pd and the dataset loaded into `data`)
sub5 = data.copy()

I then impute the missing values by the mean

 # Since all three variables are continuous, I use the mean() of each column for missing value imputation
sub5['breastcancerper100th'].fillna(sub5['breastcancerper100th'].mean(), inplace=True)
sub5['femaleemployrate'].fillna(sub5['femaleemployrate'].mean(), inplace=True)
sub5['alcconsumption'].fillna(sub5['alcconsumption'].mean(), inplace=True)
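As a quick sanity check (this step is my own addition, not part of the original code), I can confirm that no missing values remain in the three variables after imputation:

 # Sanity check (my own addition): count remaining missing values per variable
print(sub5[['breastcancerper100th', 'femaleemployrate', 'alcconsumption']].isnull().sum())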

Both my dependent and independent variables are continuous in nature. Since the chi-square test works
for categorical variables, I have three choices:
a) Choose a new categorical variable and abandon one of the existing continuous variables, OR
b) Abandon both of the existing continuous variables and choose two new categorical variables, OR
c) Convert the existing continuous response variable to a categorical response variable

For this analysis, I will choose option c because I want to observe the consequences of the variable conversion. I do this in Python as follows:

 # Converting the response variable to an explicit categorical dtype
 # (note: this line should run after sub5['brst'] has been created by the binning step below)
sub5['brst'] = sub5['brst'].astype('category')

Now I categorize the quantitative variables using the pandas qcut function, which splits a variable into groups containing roughly equal numbers of observations (quantiles) and lets me attach my own labels. I split alcconsumption into six groups labelled "0", "1-4", "5-9", "10-14", "15-19" and "20-24", where "0" refers to little or no alcohol consumption, "1-4" to consumption of roughly 1 to 4 litres, and likewise for the higher groups.
Similarly, I split the breastcancerper100th variable into five groups labelled "1-20" up to "81-90", where "1-20" refers to roughly 1 to 20 reported cases, and likewise for the higher groups. Because qcut is quantile-based, these labels are names for the groups rather than exact equal-width intervals.

sub5['alco'] = pd.qcut(sub5.alcconsumption, 6, labels=["0", "1-4", "5-9", "10-14", "15-19", "20-24"])
sub5['brst'] = pd.qcut(sub5.breastcancerper100th, 5, labels=["1-20", "21-40", "41-60", "61-80", "81-90"])
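If genuine equal-width intervals of 4 litres and 20 cases were wanted instead of quantile groups, pd.cut with explicit bin edges would be the tool. Here is a minimal sketch of that alternative (the bin edges and the column names alco_cut and brst_cut are my own assumptions, not part of the original analysis):

 # Sketch only: equal-width binning with pd.cut (bin edges are assumed, not from the original post)
sub5['alco_cut'] = pd.cut(sub5.alcconsumption,
                          bins=[-1, 0, 4, 9, 14, 19, 24],
                          labels=["0", "1-4", "5-9", "10-14", "15-19", "20-24"])
sub5['brst_cut'] = pd.cut(sub5.breastcancerper100th,
                          bins=[0, 20, 40, 60, 80, 120],
                          labels=["1-20", "21-40", "41-60", "61-80", "81+"])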

Next, I request the contingency table of observed counts, which I call ct1, using the pandas crosstab function (the library was imported as pd). Within the parentheses I include my binned categorical response variable brst followed by the binned alcohol consumption variable alco.

 # Cross tabulating the variables
ct1 = pd.crosstab(sub5['brst'], sub5['alco'])
print(ct1)

And the output is as given

 Contingency Table
alco    0  1-4  5-9  10-14  15-19  20-24
brst
1-20   12   15   10      3      0      3
21-40  14   10   10      8      1      0
41-60   7    6   11      9      9     23
61-80   3    1    2      1      6      6
81-90   0    3    3     14     19      4

Now, I want to generate column percentages, which show the distribution of the breast cancer categories within each level of alcohol consumption (the explanatory/independent variable). The axis=0 argument tells Python to sum all the values in each column.

 # the axis=0 argument tells Python to sum all the values in each column
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)
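As a side note, more recent versions of pandas (0.18.1 and later) can produce these column percentages directly through the normalize argument of crosstab; a minimal equivalent would look like this:

 # Equivalent column percentages in a single call (pandas >= 0.18.1)
colpct_alt = pd.crosstab(sub5['brst'], sub5['alco'], normalize='columns')
print(colpct_alt)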

I get output like this:

 alco        0       1-4       5-9     10-14     15-19     20-24
brst
1-20   0.333333  0.428571  0.277778  0.085714  0.000000  0.083333
21-40  0.388889  0.285714  0.277778  0.228571  0.028571  0.000000
41-60  0.194444  0.171429  0.305556  0.257143  0.257143  0.638889
61-80  0.083333  0.028571  0.055556  0.028571  0.171429  0.166667
81-90  0.000000  0.085714  0.083333  0.400000  0.542857  0.111111

Finally, I request the chi-square calculation, which returns the chi-square value, the associated p value, the degrees of freedom, and a table of expected counts used in the calculation. I call this result cs1 and ask Python to print it.

 # Chi-Square test of independence
import scipy.stats
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)
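Since scipy.stats.chi2_contingency returns a tuple of (chi-square statistic, p value, degrees of freedom, expected counts), the result is easier to read if the tuple is unpacked and each piece is printed with a label, for example:

 # Unpack the chi2_contingency result for more readable output
chi2, p, dof, expected = cs1
print('chi-square statistic:', chi2)
print('p value:', p)
print('degrees of freedom:', dof)
print('expected counts:')
print(expected)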

And I get the following statistics:

 chi-square value, p value, expected counts
(112.15051654180174, 7.9488417504750683e-15, 20,
 array([[  7.26760563,   7.0657277 ,   7.26760563,   7.0657277 ,   7.0657277 ,   7.26760563],
        [  7.26760563,   7.0657277 ,   7.26760563,   7.0657277 ,   7.0657277 ,   7.26760563],
        [ 10.98591549,  10.68075117,  10.98591549,  10.68075117,  10.68075117,  10.98591549],
        [  3.21126761,   3.12206573,   3.21126761,   3.12206573,   3.12206573,   3.21126761],
        [  7.26760563,   7.0657277 ,   7.26760563,   7.0657277 ,   7.0657277 ,   7.26760563]]))

My results first include the table of counts of the response (dependent) variable by the explanatory (independent) variable. As you can see, the largest cell count is 23, which falls in the 41-60 breast cancer category for the highest alcohol consumption group (20-24 litres), while at the beginning of the table there are 12-14 observations in the lowest breast cancer categories for the lowest alcohol consumption group. This means that there are other reasons besides alcohol consumption that can also contribute to breast cancer. Interesting find.

Next, the table of column percentages makes these counts more meaningful by showing proportions rather than raw counts. About 64% of the observations in the highest alcohol consumption group (20-24 litres) fall in the 41-60 breast cancer category, while approximately 33% to 39% of the observations in the lowest alcohol consumption group fall in the lowest breast cancer categories, which again suggests that there are other reasons besides alcohol consumption contributing to breast cancer. These other reasons are not reflected in the data that I am analysing, but this is an interesting find.

Finally, looking at the chi-square results, the chi-square value is large (about 112) and the p value, shown in scientific notation, is very small (approximately 7.94e-15), which tells us that alcohol consumption by women and breast cancer are significantly associated.

Now, I plot this association on a bivariate bar chart as shown. (Note: I am unable to generate the bar chart in PyCharm because of the following warning message, “UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter. warnings.warn(self.msg_depr % (key, alt_key))”, for which I am still searching for a solution. Once I find the solution, I will update this post. If you know the solution, please post it in the comments section. Thanks.)
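In the meantime, the association can still be visualised without seaborn by plotting one row of the column-percentage table with pandas' built-in plotting. This is only a rough sketch of my own, not the seaborn chart I originally intended to produce:

import matplotlib.pyplot as plt

 # Sketch: proportion of observations in the highest breast cancer category ("81-90")
 # within each alcohol consumption group, drawn with pandas/matplotlib only
colpct.loc['81-90'].plot(kind='bar')
plt.xlabel('Alcohol consumption category (litres)')
plt.ylabel('Proportion in the 81-90 breast cancer category')
plt.show()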

Also, I know from the significant p value that I will reject the null hypothesis and accept the alternative hypothesis that alcohol consumption and breast cancer in women are associated. If my explanatory variable had only two levels, I could interpret the two corresponding column percentages and say which group has a significantly higher rate of breast cancer. But my explanatory variable has six categories, so I know that not all of the rates are equal, but I do not know which are different and which are not. When the explanatory variable has more than two levels, the chi-square statistic and p value do not provide insight into why the null hypothesis can be rejected; they do not tell us in what ways the rates of breast cancer cases are unequal across the frequency categories, and there are of course many ways for the rates to be unequal.

To determine which groups are different from the others, we need to perform post hoc comparisons between pairs of rates in a way that avoids excessive type 1 error, in other words avoids rejecting the null hypothesis when the null hypothesis is true. Having rejected the overall null hypothesis, we need to compare the breast cancer rates across each pair of the six alcohol consumption frequency categories; with six groups, this means 15 pairwise comparisons.

So, to appropriately protect against type 1 error in the context of the chi-square test, we use the post hoc approach known as the Bonferroni adjustment. The goal of the Bonferroni adjustment is to control the family-wise error rate, also known as the maximum overall type 1 error rate, so that we can evaluate which pairs of breast cancer rates differ from one another. Briefly, the process is to conduct each of the fifteen pairwise comparisons, but rather than evaluating significance at the p = 0.05 level we adjust the threshold to make it more difficult to reject the null hypothesis. The adjusted p value is calculated by dividing p = 0.05 by c, the number of comparisons that we plan to make. The complete code is listed on my github repository here
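To make the adjustment concrete: with c = 15 comparisons, the adjusted significance level is 0.05/15 ≈ 0.0033. Below is a sketch of how those pairwise post hoc tests could be coded (my own illustration built on the variables defined above, not the exact code from my repository):

import itertools
import scipy.stats

 # Bonferroni-adjusted significance level for 15 pairwise comparisons
adj_alpha = 0.05 / 15

 # Compare breast cancer categories across every pair of alcohol consumption groups
for grp1, grp2 in itertools.combinations(sub5['alco'].cat.categories, 2):
    pair = sub5[sub5['alco'].isin([grp1, grp2])]
    ct_pair = pd.crosstab(pair['brst'], pair['alco'])
    # keep only the two columns being compared and drop any empty rows
    ct_pair = ct_pair.loc[ct_pair.sum(axis=1) > 0, [grp1, grp2]]
    chi2, p, dof, expected = scipy.stats.chi2_contingency(ct_pair)
    verdict = 'significant' if p < adj_alpha else 'not significant'
    print(grp1, 'vs', grp2, '-> p =', p, '(' + verdict + ')')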