For this study, I am using the gapminder codebook and have chosen the following variables for data analysis:
Explanatory (predictor or independent) variables: income per person and alcohol consumption. Income per person is the 2010 Gross Domestic Product per capita in constant 2000 US$. Alcohol consumption is the 2008 recorded and estimated average alcohol consumption per adult (age 15+), in liters of pure alcohol.
Response (dependent) variable: breast cancer rate, the number of new breast cancer cases per 100,000 females in 2002.
Data management: All variables of interest are quantitative and continuous, so they were recoded into two categories (binary).
- Breast cancer rate less than or equal to 20 per 100,000 females was coded as low breast cancer rate (0) and as high breast cancer rate (1) when greater than 20 per 100,000 females.
- Income per person less than or equal to 5,000 US$ was coded as low income per person (0) and as high income per person (1) when greater than 5,000 US$.
- Alcohol consumption less than or equal to 5 liters was coded as low alcohol consumption (0) and as high alcohol consumption (1) when greater than 5 liters.
- Female employment rate less than or equal to 50 was coded as low female employment rate (0) and as high female employment rate (1) when greater than 50.
In Python, I recoded the variables as follows:
<pre>
# Create binary breast cancer rate
# (rows with a missing value match neither branch and stay missing)
def bin2cancer(row):
    if row['breastcancerper100th'] <= 20:
        return 0
    elif row['breastcancerper100th'] > 20:
        return 1

# Apply the new variable bin2cancer to the gapminder dataset
data['bin2cancer'] = data.apply(lambda row: bin2cancer(row), axis=1)

# Create binary income per person
def bin2income(row):
    if row['incomeperperson'] <= 5000:
        return 0
    elif row['incomeperperson'] > 5000:
        return 1

# Apply the new variable bin2income to the gapminder dataset
data['bin2income'] = data.apply(lambda row: bin2income(row), axis=1)

# Create binary alcohol consumption
def bin2alcohol(row):
    if row['alcconsumption'] <= 5:
        return 0
    elif row['alcconsumption'] > 5:
        return 1

# Apply the new variable bin2alcohol to the gapminder dataset
data['bin2alcohol'] = data.apply(lambda row: bin2alcohol(row), axis=1)

# Create binary female employment rate
def bin2femalemployee(row):
    if row['femaleemployrate'] <= 50:
        return 0
    elif row['femaleemployrate'] > 50:
        return 1

# Apply the new variable bin2femalemployee to the gapminder dataset
data['bin2femalemployee'] = data.apply(lambda row: bin2femalemployee(row), axis=1)
</pre>
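The same recoding can also be written without a row-wise `apply`. The sketch below is one possible alternative, not the code used for this analysis. The helper name `binarize` and the toy rows are hypothetical. It assumes that blank or non-numeric entries should be treated as missing and should stay missing after recoding.

```python
import pandas as pd


def binarize(series, cutoff):
    """Return 0.0 where value <= cutoff, 1.0 where value > cutoff, NaN where missing."""
    # Non-numeric entries (for example blank strings) become NaN
    values = pd.to_numeric(series, errors="coerce")
    # Comparison gives False for NaN, so mask those rows back to NaN afterwards
    return (values > cutoff).astype(float).where(values.notna())


# Hypothetical toy rows standing in for the gapminder data
data = pd.DataFrame({
    "breastcancerper100th": [10.5, 20.0, 35.2, None],
    "incomeperperson": ["1200.5", "5000", "8000.1", " "],
})

data["bin2cancer"] = binarize(data["breastcancerper100th"], 20)
data["bin2income"] = binarize(data["incomeperperson"], 5000)
print(data)
```

A value exactly at the cutoff (20 or 5000 here) falls in the low category, which matches the coding rules listed above.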
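Note: In an experiment, the independent variable is the presumed cause. The researcher manipulates it to determine its relationship to an observed phenomenon, called the dependent variable, which is the presumed effect. The gapminder data are observational, so here the terms only indicate which variables are treated as predictors and which as the outcome.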
Logistic Regression simplified
In statistics, logistic regression (also called logit regression or the logit model) is a regression[1] model where the dependent variable[2] is categorical[3]. Logistic regression measures the relationship between a categorical dependent (response) variable and one or more independent (explanatory or predictor) variables by estimating probabilities using the logistic function, which is the cumulative distribution function of the logistic distribution. Logistic regression is widely used in the medical sciences to determine whether a patient has a given disease based on their sex, weight, race, residence etc. Another example is whether voters would vote for contestant A or contestant B. Yet another is the probability of a student passing an exam given the number of hours invested in studying.
Fitting a logistic regression produces an intercept and a coefficient for each predictor, each with its standard error, z-value and p-value. The estimated intercept and coefficients are then put into the logistic equation to predict the probability of a given event.
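To make the logistic equation concrete, here is a minimal sketch for one binary predictor. The intercept and coefficient are hypothetical values chosen for illustration, not estimates from the gapminder data, and `predicted_probability` is an illustrative helper name.

```python
import math


def predicted_probability(intercept, coefficient, x):
    """Logistic equation: p = 1 / (1 + exp(-(intercept + coefficient * x)))."""
    linear_predictor = intercept + coefficient * x
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical estimates, for illustration only
b0 = -1.0           # intercept: log-odds of the event when x = 0
b1 = math.log(4.0)  # coefficient: log of the odds ratio for x = 1 versus x = 0

p_low = predicted_probability(b0, b1, 0)   # predicted probability when x = 0
p_high = predicted_probability(b0, b1, 1)  # predicted probability when x = 1
print(round(p_low, 3), round(p_high, 3))

# The ratio of the two odds recovers exp(b1)
odds_low = p_low / (1 - p_low)
odds_high = p_high / (1 - p_high)
print(round(odds_high / odds_low, 3))
```

The last line shows why exponentiated coefficients are reported as odds ratios: the ratio of the odds at the two levels of the predictor equals `exp(b1)`.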
- Logistic regression can be binomial, ordinal or multinomial. Binomial (binary) logistic regression deals with situations where the categorical dependent variable can have only two possible values (for example pass or fail, dead or alive, yes or no). Multinomial logistic regression deals with situations where the categorical dependent variable can have three or more possible values that are unordered (for example disease A vs disease B vs disease C vs disease D). Ordinal logistic regression deals with categorical dependent variables whose values are ordered.
- Unlike ordinary linear regression, which predicts a continuous outcome, binary logistic regression predicts a dependent variable that takes only the values 0 or 1.
- A binary outcome violates the assumptions of linear regression: the residuals cannot be normally distributed because the outcome takes only two values, and a fitted straight line can predict values below 0 or above 1.
- To turn the binary outcome into a continuous quantity that can take any real value (negative or positive), logistic regression works with the probability p of the event. It first takes the odds of the event, p / (1 − p), which are continuous but cannot be negative, and then takes the natural logarithm of the odds. The result is called the logit or log-odds, a continuous criterion that is a transformed version of the dependent variable. The exponentiated coefficient of a predictor is the ratio of the odds between its levels (the odds ratio).
- The logit transformation is therefore referred to as the link function in logistic regression. Although the dependent variable in logistic regression is binomial, the logit is the continuous criterion on which the linear regression is conducted[4].
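The steps from probability to odds to log-odds described in the list above can be sketched in a few lines. The function names are illustrative and the probabilities are arbitrary example values.

```python
import math


def odds(p):
    """Odds of an event with probability p: continuous and never negative."""
    return p / (1 - p)


def logit(p):
    """Log-odds of an event: can take any real value."""
    return math.log(odds(p))


def inverse_logit(z):
    """Map a log-odds value back to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))


for p in (0.1, 0.5, 0.9):
    z = logit(p)
    print(f"p={p:.1f}  odds={odds(p):.3f}  logit={z:+.3f}  back to p={inverse_logit(z):.1f}")
```

Probabilities below 0.5 give negative log-odds and probabilities above 0.5 give positive ones, so the transformed scale is unbounded in both directions.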
As stated above, both the response and the explanatory variables that I have chosen for this study are quantitative and continuous, but logistic regression requires a categorical response variable. So I can either choose another response variable that is categorical and restate the research hypothesis (in other words start from scratch, which I will not do for now) or recode the existing continuous response variable into a categorical one. I will lose a lot of information by recoding the variable, and there is a good discussion on this issue, but I will take the risk. Next, I use the logit function in the statsmodels formula API to do the analysis as follows:
<pre>
import numpy as np
import statsmodels.formula.api as smf

# Logistic regression of binary breast cancer rate on binary alcohol consumption
lreg1 = smf.logit(formula='bin2cancer ~ bin2alcohol', data=data).fit()
print(lreg1.summary())

# Odds ratios
print("Odds Ratios")
print(np.exp(lreg1.params))

# Odds ratios with 95% confidence intervals
print('Logistic regression with binary alcohol consumption')
print('Odds ratios with 95% confidence intervals')
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(np.exp(conf))
</pre>
The complete code is listed on my GitHub repository here. I will now discuss the results:
The bivariate analysis of the association between breast cancer rate and alcohol consumption shows that the odds of a high breast cancer rate were about 4 times greater for countries with higher alcohol consumption (OR = 4.15, 95% CI = 1.93–8.95, p-value < 0.001; underlined in red in Fig 1).
Fig 1: Bivariate association between breast cancer rate per 100,000 females & alcohol consumption
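With one binary predictor and a binary outcome, the odds ratio can also be read directly from a 2x2 table of counts, which is a useful check on the regression output. The sketch below uses hypothetical counts, not the gapminder counts, and a standard Wald-type interval on the log scale. The function name `odds_ratio_with_ci` is illustrative.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI from a 2x2 table.

    a: high exposure, outcome = 1    b: high exposure, outcome = 0
    c: low exposure,  outcome = 1    d: low exposure,  outcome = 0
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts of countries, for illustration only
or_, lower, upper = odds_ratio_with_ci(a=40, b=20, c=30, d=60)
print(f"OR = {or_:.2f}, 95% CI = {lower:.2f} to {upper:.2f}")
```

An interval that lies entirely above 1, as with these made-up counts, corresponds to a statistically significant positive association at the 5% level.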
After controlling for alcohol consumption, countries with higher income per person still have 1.28 times greater odds of a high breast cancer rate (OR = 1.28, 95% CI = 0.00–inf, p-value = 0.001), see Fig 2. A confidence interval that runs from 0.00 to infinity is uninformative, so this estimate should be read with caution.
Fig 2: Bivariate association between income per person & alcohol consumption
Countries with higher alcohol consumption also have about 4 times greater odds of a high breast cancer rate (OR = 4.15, 95% CI = 1.93–8.95, p-value < 0.001), adjusting for income, see Fig 1.
These results support the hypothesis of an association between alcohol consumption and breast cancer rate. There was no evidence that female employment rate confounds this relationship.
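For readers who want to reproduce this kind of adjusted analysis, the sketch below fits a crude and an adjusted model and compares the odds ratio for alcohol consumption. It runs on simulated data with made-up effect sizes, so its numbers are not the results reported above. The variable names mirror the recoded columns, and `odds_ratios` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated binary predictors (NOT the gapminder data)
rng = np.random.default_rng(0)
n = 2000
sim = pd.DataFrame({
    "bin2alcohol": rng.integers(0, 2, n),
    "bin2income": rng.integers(0, 2, n),
    "bin2femalemployee": rng.integers(0, 2, n),
})

# Simulated outcome: the assumed true log-odds depend on alcohol and income only
log_odds = -1.0 + 1.4 * sim["bin2alcohol"] + 0.8 * sim["bin2income"]
probability = 1 / (1 + np.exp(-log_odds.to_numpy()))
sim["bin2cancer"] = rng.binomial(1, probability)


def odds_ratios(formula, data):
    """Fit a logit model and return odds ratios with 95% confidence intervals."""
    model = smf.logit(formula=formula, data=data).fit(disp=0)
    table = model.conf_int()
    table.columns = ["Lower CI", "Upper CI"]
    table["OR"] = model.params
    return np.exp(table)


crude = odds_ratios("bin2cancer ~ bin2alcohol", sim)
adjusted = odds_ratios(
    "bin2cancer ~ bin2alcohol + bin2income + bin2femalemployee", sim
)
print(crude)
print(adjusted)
```

One common informal check for confounding is to compare the odds ratio for the main predictor before and after adding the other variables. If it changes little, as would be expected here because the simulated predictors are independent, there is little sign of confounding.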