In previous posts I have been successful in statistically proving that there exists a relationship between alcohol consumption and breast cancer in women. I now want to determine if the presence of another variable can affect this established relationship. So I chose the variable female employee rate and stated an alternative hypothesis that female employees are susceptible to breast cancer. What did I find? Let’s examine it using python as given;
As usual, I begin by importing the pertinent libraries and then load the dataset in a python data frame called data.
# Importing the required libraries # Note %matplotlib inline works only for ipython notebook. It will not work for PyCharm. It is used to show the plot distributions #%matplotlib inline import pandas as pd import numpy as np import statsmodels.formula.api as smf import statsmodels.stats.multicomp as multi import seaborn import matplotlib.pyplot as plt # Reading the data where low_memory=False increases the program efficiency data= pd.read_csv("gapminder.csv", low_memory=False)
Next, I set the variables that I will be working with to numeric. When you execute this line of code, python interpreter will show some future warnings, you can just ignore them
# setting variables that you will be working with to numeric # FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric #data['breastcancerper100th']= pd.to_numeric(data['breastcancerper100th'],errors='ignore') data['breastcancerper100th']=data['breastcancerper100th'].convert_objects(convert_numeric=True) data['alcconsumption']= data['alcconsumption'].convert_objects(convert_numeric=True) #data['lifeexpectancy']=data['lifeexpectancy'].convert_objects(convert_numeric=True) data['femaleemployrate']=data['femaleemployrate'].convert_objects(convert_numeric=True)
After this, as a good programming and data analysis practice I make a copy of the original data frame as sub11.
# creating a copy of the dataframe sub11=data.copy()
I then check for the missing values in the data frame with the line of code
# Checking for missing values in the data frame print "\n\t Missing values count\n",sub11.isnull().sum()
You will see that there variables in the dataset. Now, one school of thought is to drop the missing values but my opinion is to fill the missing values with mean of the average values of the dataset. There is a very interesting and highly cited paper on this issue itself. You can check it here. So I code the missing with mean imputation in python as given;
# Since the data is all continuous variables therefore the use the mean() for missing value imputation sub11.fillna(sub11['breastcancerper100th'].mean(), inplace=True) sub11.fillna(sub11['femaleemployrate'].mean(), inplace=True) sub11.fillna(sub11['alcconsumption'].mean(), inplace=True) print "\nMissing value imputation done\n",sub11.isnull().sum()
Thereafter, I center the quantitative independent variables, ‘alcoholconsumption’ and ‘femaleemployrate’ by subtracting the mean of the variable with the actual variable itself
# Now centering the quantitative Independent Variable for regression analysis sub11['alcconsumption_c']=(sub11['alcconsumption'] - sub11['alcconsumption'].mean()) sub11['femaleemployrate_c']=(sub11['femaleemployrate'] - sub11['femaleemployrate'].mean())
I now check for the association between breast cancer and female employee rate using ordinary least square method of the stats model library in python as
# Now checking for the association between breast cancer and female employee rate reg11=smf.ols('breastcancerper100th~alcconsumption + femaleemployrate_c',data=sub11).fit() print reg11.summary()
Here is the output;
Examine the p-values (highlighted) and the parameter estimates for each predictor variable i.e. the explanatory variable alcohol consumption and our confounder female employee rate. As you can see from the highlighted that the p value for alcohol consumption is less than 0.05 which is statistically significant however, the p value for female employee rate is greater than 0.05 and is statistically insignificant. Also note the t value also known as the t-test which is positive for alcohol consumption indicating that alcohol consumption is associated with breast cancer in women but the t-value is negative for female employee rate which signifies that there is no relationship between female employee rate and breast cancer and alcohol consumption.Thus, we can conclude that female employee cannot be attributed to breast cancer and alcohol consumption because the p-value is insignificant and the t-value is negative however there is a positive correlation between alcohol consumption and breast cancer because the p-value is significant and the t-value is positive.
Next, I continue to add other variables to this multiple regression analysis model to check which for significance. But first, I convert it to numeric then impute the missing value with the mean thereafter I center it, perform missing value imputation and only then I add it in the regression model as given.
# setting variables that you will be working with to numeric data['lifeexpectancy']=data['lifeexpectancy'].convert_objects(convert_numeric=True) # Since the data is all continuous variables therefore using the mean() for missing value imputation sub11['lifeexpectancy_c']=sub11['lifeexpectancy'] - sub11['lifeexpectancy'].mean() sub11.fillna(sub11['lifeexpectancy'].mean(), inplace=True) # Now checking for multiple variables signicance in regression analysis reg12=smf.ols('breastcancerper100th~alcconsumption + femaleemployrate_c+lifeexpectancy_c',data=sub11).fit() print reg12.summary()
Here is the output. Fig 2
Examine the p-values (highlighted) and the parameter estimates for each predictor variable i.e. the explanatory variable alcohol consumption, female employee rate and our confounder life expectancy. As you can see from the highlighted that the p value for both alcohol consumption and life expectancy is less than 0.05 which is statistically significant Also note the t value also known as the t-test which is positive for alcohol consumption and life expectancy. Thus we can conclude that life expectancy, alcohol consumption are positively and significantly correlated with breast cancer whereas female employee rate is not.
I then check if urban rate has any contribution to breast cancer. I follow the above steps which I provide briefly as given
data['urbanrate']=data['urbanrate'].convert_objects(convert_numeric=True) sub11.fillna(sub11['urbanrate'].mean(), inplace=True) sub11['urbanrate_c']=sub11['urbanrate'] - sub11['urbanrate'].mean() # Now checking for multiple variables signicance in regression analysis reg13=smf.ols('breastcancerper100th~alcconsumption_c + femaleemployrate_c+lifeexpectancy_c+urbanrate_c',data=sub11).fit() print reg13.summary()
Here is the output.
Examine the p-values (highlighted) and the parameter estimates for each predictor variable i.e. the explanatory variable alcohol consumption, female employee rate and our confounder urban rate. As you can see from the highlighted that the p value for alcohol consumption, life expectancy and urban rate is less than 0.05 which is statistically significant Also note the t value is positive for alcohol consumption, life expectancy and urban rate. Thus we can conclude that life expectancy, alcohol consumption and urban rate are positively and significantly correlated with breasts cancer whereas female employee rate is not. Thus, I conclude that based on this multiple regression analysis that female employee does not relate to breast cancer but other confounding variables such as life expectancy and urban rate are positively and statistically significantly are correlated to breast cancer in women.
So far I have just interpreted the estimated regression coefficients however remember that our dataset is just a sample of the population. So the regression coefficients that we get from the analysis on our sample are only estimates of the true population parameters. That is why they are called parameter estimates. For example, if I were to draw another sample from the population the parameter estimates may not be the same as were from the original sample. This is due to the sampling variability. Meaning that the sample we draw is not likely to be exactly like that of the population. Thus to get a better understanding of how our sample estimates represent the population values we can look at confidence intervals. The confidence intervals tell us which values are plausible in the population. Typically we look at the 95% confidence intervals which tell us with certainity that we are 95% certain that the true population parameter falls somewhere between the lower and upper confidence limits that are estimated based on our sample parameter estimate. If I look at the confidence interval shown in Fig 3 which provides the estimated lower and upper limits for the 95% confidence interval for each parameter estimate. So for example, if we take a look at the parameter estimate (coef value in Fig 3) for alcohol consumption we see that it is 0.28 which means that on an average women with breast cancer have 0.28 alcohol consumption symptoms than women without breast cancer. This is the point estimate of our population parameter. If we look at the confidence interval though, we see that it ranges from 0.065 to 0.5 meaning that we are 95% certain that the true population parameter for the association between alcohol consumption and breast cancer in women falls somewhere between 0.065 and 0.5 that is in the population there is a 95% chance that women with breast cancer have anywhere between 0.065 to 0.5 more alcohol consumption symptoms than women without it.
Also note that the second alternative hypothesis, “female employees are susceptibe to increased risk of breast cancer could not be proved true as its p value is 0.400 (refer to Fig 3) which is not statistically significant. If we look at the confidence interval of this variable (refer to Fig 3) we see that it ranges from (negative) -0.097 to (positive) 0.243 which includes a value of Zero in that range. This means that we cannot rule out with 95% confidence the possibility of association between female employee rate and breast cancer symptoms and for adjusting the other variables in the model is Zero. Remember in linear regression when you have a insignificant p value than 95% confidence interval for the apriori estimate will include a value of Zeo meaning “No association”.
Next, I show the concave curvelinear relationship between alcohol consumption and breast cancer in fig 4. I plot this in python as
scat1 = seaborn.regplot(x="breastcancerper100th", y="alcconsumption", order=2, scatter=True, data=sub11) plt.xlabel('Breast Cancer per 100th woman') plt.ylabel ('Alcohol Consumption') plt.title ('Scatterplot for the association between Breast Cancer & Alcohol Consumption') #print scat1 plt.show()
and it will give me a plot like shown below in fig 4
Fig 4: Scatterplot
Next, I will be plotting a q-q (quantile-quantile) plot to evaluate the residuals from our regression model are normally distributed. These are very useful plots for assessing how closely a data set fits a particular distribution. For more details and very good explanation see here. To plot a qq plot in python I should first import the statsmodel.api library as shown in fig 5
import statsmodel.api as sm # Q-Q plot for normality fig1=sm.qqplot(reg15.resid,line='r') plt.show()
Fig 5: Q-Q Plot for the residuals
Shown in fig 5 is the qq plot. The qq plots generally follow a straight line but deviate at the lower & the upper ends. This indicates that the residulas do not follow a perfectly normal distribution. This could mean that the curvelinear association that we observed in our scatter plot may not be fully estimated by the quadratic alcoholconsumption term. There may be other explanatory variables that we might consider including in our model that could improve estimation of the observed curviliearity.
Finally, I will show a leverage plot to identify observations that have unusually large influence on the estimation of the predicted value of the response variable alcohol consumption or there are outliers or both. In statistics and in particular in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. I code this in python as
# Leverage plot fig3=sm.graphics.influence_plot(reg15,size=8) plt.show()
In statistics and in particular in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. On executing this code snippet, I get the leverage plot as shown in fig 6
Fig 6: Leverage plot
High-leverage points are those observations, if any, made at extreme or outlying values of the independent variables such that the lack of neighboring observations means that the fitted regression model will pass close to that particular observation. The leverage always takes on values between 0 (zero) and 1 (one). A point with 0 (zero) leverage has no effect on the regression model and outliers are observations with residuals greater than 2 or less than negative 2.
In fig 6, one of the first things that we can see that we have few outliers ie values greater than +2 and less than -2 (highlighted yellow). I have identified the outliers in other plots shown above but this plot shows that these outliers have small or close to zero leverage values meaning that although they are outliner observations they do not have an undue influence on the estimation of the regression model.
The complete python code is listed on my github repository here
If there be any questions or suggestions, please pen them in the comments section. i will be happy to answer them. Thanks for reading.