# Big or small: let’s save them all: Exploring Statistical Interactions

Statistical interaction describes a relationship between two variables that depends on, or is moderated by, a third variable. The effect of such a moderating variable is characterized statistically as an interaction: the third variable affects the direction and/or strength of the relation between your explanatory (x) variable and your response (y) variable. Recall that for this study I chose three variables of interest:

1. ‘alcconsumption’: alcohol consumption per adult (age 15+), in litres. Recorded and estimated average alcohol consumption per capita among adults (15+), in litres of pure alcohol.
2. ‘breastcancerper100th’: new breast cancer cases per 100,000 females. Number of new cases of breast cancer per 100,000 female residents during a given year.
3. ‘femaleemployrate’: female employees age 15+ (% of population). Percentage of the female population above age 15 that was employed during the given year.

So far I have discussed and analyzed the first two variables. Now I will use the third variable, ‘femaleemployrate’ (female employee rate), to help understand moderation, or statistical interaction. Note that in the previous posts I found an association between alcohol consumption and breast cancer in women. Now we will see whether the female employee rate is related to breast cancer and/or alcohol consumption.

In statistics, moderation occurs when the relationship between two variables depends on a third variable, which is then called the moderating variable or simply the moderator. So, does the female employee rate affect the direction or the strength of the relationship between alcohol consumption and breast cancer in women? The standard way of asking this question in the context of analysis of variance is to move to a two-way (two-factor) ANOVA, rather than the one-way (one-factor) ANOVA that I have been using. Instead, I am going to take a less standard approach that can be used consistently across each of the inferential tools, that is, ANOVA, Chi-Square, and Pearson correlation. In each of these contexts, we will be asking: is our explanatory variable associated with our response variable for each population subgroup, or each level, of our third variable? In other words, are alcohol consumption and breast cancer associated at each level of female employment?
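To make the "test within each level of the third variable" idea concrete, here is a minimal sketch on synthetic data (not the gapminder values). The column names `x`, `y` and `group` and the two levels `low` and `high` are made up for illustration; the data are generated so that `x` and `y` are related only in the `high` group, which is what a moderated relationship looks like.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data, NOT the gapminder values: x and y are related only in group "high"
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
group = np.where(np.arange(n) < n // 2, "low", "high")
noise = rng.normal(scale=0.5, size=n)
y = np.where(group == "high", 2.0 * x, 0.0) + noise

df = pd.DataFrame({"x": x, "y": y, "group": group})

# Run the same Pearson test separately inside each level of the third variable
results = {}
for level, sub in df.groupby("group"):
    r, p = stats.pearsonr(sub["x"], sub["y"])
    results[level] = (r, p)
    print(f"{level}: r = {r:.2f}, p = {p:.4f}")
```

If the correlation differs clearly in size or sign between the levels, the third variable is a candidate moderator; if it is about the same in every level, it is not.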

To accomplish this, I run the ANOVA on the third variable as a whole, because female employee rate is a continuous variable with no natural levels. I first load the data, convert the three variables to numeric, and make a working copy of the data frame:

```
import pandas as pd
import scipy.stats as ss
import statsmodels.formula.api as smf

data = pd.read_csv("gapminder.csv", low_memory=False)

# convert_objects() has been removed from pandas; pd.to_numeric replaces it
for col in ['breastcancerper100th', 'femaleemployrate', 'alcconsumption']:
    data[col] = pd.to_numeric(data[col], errors='coerce')

# Making a copy of the dataset as sub10
sub10 = data.copy()
```

Next, I impute the missing values with the mean, since all three variables are continuous.

```
# All three variables are continuous, so fill each column's gaps with that column's own mean.
# (A single scalar passed to DataFrame.fillna would fill every column with the same value.)
for col in ['breastcancerper100th', 'femaleemployrate', 'alcconsumption']:
    sub10[col] = sub10[col].fillna(sub10[col].mean())
```
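The code later in this post groups by two categorical columns, `alco` and `emply`, whose construction is not shown here. As a rough sketch of how such columns could be built, the example below bins two toy columns with `pd.cut`. The bin edges and the toy values are assumptions for illustration only; they are not the edges behind the figures in this post.

```python
import pandas as pd

# Toy values standing in for the real gapminder columns
toy = pd.DataFrame({
    "alcconsumption": [0.5, 4.2, 7.8, 12.1, 16.4],
    "femaleemployrate": [18.0, 34.5, 47.2, 55.0, 71.3],
})

# Hypothetical bin edges (assumption): litres of alcohol and % employed
alco_edges = [0, 5, 10, 15, 20, 25]
emply_edges = [0, 30, 40, 50, 60, 100]

# right=False makes each bin closed on the left, e.g. [30, 40)
toy["alco"] = pd.cut(toy["alcconsumption"], bins=alco_edges, right=False)
toy["emply"] = pd.cut(toy["femaleemployrate"], bins=emply_edges, right=False)
print(toy)
```

Once the continuous variables are binned this way, they can be used as categorical factors in `C(...)` terms, in `groupby`, and in `pd.crosstab`.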

Now, when I run the analysis of variance test as

```
print("\nAssociation between Alcohol Consumption and Female Employ Rate")
# 'emply' is assumed to be a categorical (binned) version of femaleemployrate
model1 = smf.ols(formula='alcconsumption ~ C(emply)', data=sub10)
results1 = model1.fit()
print(results1.summary())
```

you’ll see the results shown in Fig 1. The ANOVA table examining the relationship between alcohol consumption and female employee rate shows a small F value (F-statistic = 2.069) and a non-significant p value (Prob (F-statistic) = 0.105).

Fig 1: ANOVA table for alcohol consumption and female employee rate
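For readers wondering where the F value comes from, here is a small sketch on synthetic groups (not the gapminder data). It computes the one-way ANOVA F statistic by hand as the between-group mean square divided by the within-group mean square, and cross-checks it against `scipy.stats.f_oneway`. The group means, sizes, and the 0.05 threshold are assumptions chosen for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Three synthetic groups standing in for levels of a categorical explanatory variable
g1 = rng.normal(loc=5.0, scale=2.0, size=40)
g2 = rng.normal(loc=5.5, scale=2.0, size=40)
g3 = rng.normal(loc=9.0, scale=2.0, size=40)
groups = [g1, g2, g3]

# F by hand: variation between group means relative to variation within groups
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)
n_total = all_values.size
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Same test via scipy
f_stat, p_value = stats.f_oneway(g1, g2, g3)

alpha = 0.05  # assumed significance level for this sketch
print(f"F (by hand) = {f_manual:.3f}, F (scipy) = {f_stat:.3f}, p = {p_value:.4g}")
print("significant" if p_value < alpha else "not significant")
```

A small F means the group means differ little relative to the spread inside each group, which is why a small F goes together with a large, non-significant p value.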

I then examine the mean table by coding it as

```
# numeric_only=True skips text columns such as the country name
m1 = sub10.groupby('alco').mean(numeric_only=True)
print(" Mean Table for Alcohol consumption\n", m1)
```

Examining this table (fig 2), we see that in the group consuming 15-19 litres of alcohol, the average female employee rate is about 45% and the average number of breast cancer cases is about 57. The association between alcohol consumption and female employee rate is insignificant.

Fig 2: Mean table for Alcohol consumption

Similarly, when examining the means table

```
# numeric_only=True skips text columns such as the country name
m2 = sub10.groupby('emply').mean(numeric_only=True)
print("\n Mean Table for Female Employee Rate \n", m2)
```

as given in fig 3, we see that alcohol consumption is highest, at an average of 12 litres, in the group where 30-39% of women are employed, and that this group shows an average of 35 breast cancer cases per 100,000 females. From this I take the association between breast cancer and female employee rate to be insignificant.

Fig 3: Mean table for Female Employee Rate

Now let’s evaluate the third variable as a potential moderator in the context of the Chi-Square Test of Independence. I request a chi-square test of independence examining the association between alcohol consumption and female employee rate, which I code as

```
# contingency table of observed counts
ct1 = pd.crosstab(sub10['alco'], sub10['emply'])
print("\nContingency table of observed counts\n")
print(ct1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print("\nColumn percentages\n", colpct)

# Chi Square: returns the statistic, p value, degrees of freedom and expected counts
print("\n Chi Square value, p value, expected counts")
cs1 = ss.chi2_contingency(ct1)
print(cs1)
```

I get the following chi-square and p value results, as shown in fig 4.

Fig 4: Chi-Square Test of independence for alcohol consumption & female employee rate

Reading the highlighted chi-square value and its associated p value, I conclude that alcohol consumption and female employee rate are not related. Keep in mind that this conclusion only holds when the p value is larger than the significance level; a very small p value would instead indicate an association. So, as far as this dataset is concerned, there is no relationship between the female employee rate and alcohol consumption.
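Since the decision rule is easy to get backwards, here is a small sketch of how the output of `scipy.stats.chi2_contingency` can be unpacked and read. The 2 x 3 table of counts is made up for illustration, and the 0.05 threshold is an assumption, not a value taken from this study.

```python
import numpy as np
from scipy import stats

# Made-up 2 x 3 table of observed counts (rows: one variable, columns: the other)
observed = np.array([[30, 10, 5],
                     [10, 20, 25]])

# chi2_contingency returns the statistic, the p value,
# the degrees of freedom and the table of expected counts
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

alpha = 0.05  # assumed significance level for this sketch
print(f"chi-square = {chi2:.3f}, dof = {dof}, p = {p_value:.3g}")
print("expected counts under independence:\n", expected)

if p_value < alpha:
    print("Reject independence: the two variables are associated.")
else:
    print("Fail to reject independence: no evidence of an association.")
```

In this toy table the counts are clearly lopsided, so the chi-square value is large and the p value is small, which points to an association. "Not related" is the conclusion only when the p value stays above the threshold.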

I now continue to test the moderation in the context of Pearson’s correlation. Note that to calculate the Pearson correlation in Python it is imperative to drop the missing values first, otherwise it will not work. So I store the cleaned data in a new variable called sub10_clean:

```
# Removing the missing values otherwise the Pearson correlation will not work
sub10_clean = sub10.dropna()
```
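One thing worth knowing about `dropna()`: called without arguments, it drops a row if any column in that row is missing, including columns that play no part in the correlation. A gentler option is to pass `subset=` with only the columns being analyzed. The sketch below uses a toy frame; `unrelated_column` is a hypothetical extra column added just to show the difference.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "alcconsumption": [1.2, np.nan, 7.5, 3.3],
    "breastcancerper100th": [20.1, 35.0, 50.2, 28.4],
    "femaleemployrate": [40.0, 55.5, 61.0, 47.2],
    "unrelated_column": [np.nan, 1.0, np.nan, 2.0],  # hypothetical extra column
})

cols = ["alcconsumption", "breastcancerper100th", "femaleemployrate"]

dropped_any = toy.dropna()              # drops rows with a gap in ANY column
dropped_subset = toy.dropna(subset=cols)  # drops rows with a gap in the three variables only

print("rows left with dropna():", len(dropped_any))
print("rows left with dropna(subset=cols):", len(dropped_subset))
```

With real data that has many sparsely filled columns, restricting the subset can keep considerably more countries in the analysis.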

As in the previous post, I now test whether alcohol consumption and breast cancer are related, which I code in Python by calling the scipy.stats.pearsonr() function as shown

```
print("\nAssociation between alcohol consumption and breast cancer per 100th\n")
print(ss.pearsonr(sub10_clean['alcconsumption'], sub10_clean['breastcancerper100th']))
```

When I execute this code snippet, I get the result shown in fig 5.

Fig 5: Pearson Correlation Coefficient for alcohol consumption & breast cancer

As is evident from fig 5, I get a correlation coefficient of 0.21 and a very small p value of 0.001, which is below the significance level of 0.005. This indicates a statistically significant positive relationship between alcohol consumption and breast cancer in women, although with r = 0.21 the relationship itself is weak.
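A quick way to judge the strength of a correlation, as opposed to its significance, is to square the coefficient: r squared is the share of the variance in one variable that a straight-line relationship with the other accounts for. A minimal worked example using the coefficient reported above:

```python
r = 0.21  # correlation coefficient reported in fig 5

# r squared: fraction of variance accounted for by the linear relationship
r_squared = r ** 2

print(f"r^2 = {r_squared:.4f}")
print(f"about {r_squared * 100:.1f}% of the variance is accounted for")
```

So a correlation of 0.21 accounts for roughly 4% of the variability, which is why a very small p value and a weak relationship can go together.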

Now I will introduce the moderator variable to test the claim that working women have higher alcohol consumption and are more susceptible to breast cancer. To check it, I code the following in Python:

```
print("\n Association between alcohol consumption and female employee\n")
print(ss.pearsonr(sub10_clean['alcconsumption'], sub10_clean['femaleemployrate']))
print("\nAssociation between breast cancer and female employee\n")
print(ss.pearsonr(sub10_clean['breastcancerper100th'], sub10_clean['femaleemployrate']))
```

On executing this code snippet, I get the result shown in fig 6.

Fig 6: Pearson Correlation Coefficient for alcohol consumption, breast cancer & female employee rate

As is evident from fig 6, the p value for the correlation between alcohol consumption and female employee rate is 0.167, which is greater than the significance level of 0.005, and similarly the p value for breast cancer and female employee rate is 0.336, which is also greater than 0.005. So this dataset provides no evidence that the moderating variable ‘femaleemployrate’ is associated with either alcohol consumption or breast cancer cases. Note that this does not settle the question of causation, which I will explore in the next post. The complete code is listed on my GitHub repository here

To summarize, in this post I have shown how we can examine statistical interactions between variables of interest using ANOVA, the Chi-Square test, and Pearson correlation, which can lead us to interesting generalizations.