Big or small: let’s save them all: Pearson’s Correlation Coefficient

Pearson’s correlation (denoted by r) is the inferential test used to examine the association between two quantitative variables. As I discussed previously, a scatter plot is an appropriate way to visualize two quantitative variables when you want to examine the relationship between them.

First, let me briefly review the scatter plot and how to interpret it. To create a scatter plot, each pair of values is plotted so that the value of the explanatory variable falls on the X (horizontal) axis and the value of the response variable falls on the Y (vertical) axis.
When describing the overall pattern of the relationship, we look at its direction, form and strength. The direction of the relationship can be positive, negative or neither. A positive, or increasing, relationship (r > 0) means that an increase in one variable is associated with an increase in the other. A negative, or decreasing, relationship (r < 0) means that an increase in one variable is associated with a decrease in the other.

Not all relationships can be classified as positive or negative. The form of a relationship is its general shape. When identifying the form, we try to find the simplest way to describe the shape of the scatter plot. There are many possible forms, and two are quite common. Relationships with a linear form are most simply described as points scattered about a line. Relationships with a curvilinear form are most simply described as points scattered about a curved line. By definition, the correlation coefficient measures a linear relationship between two quantitative variables, so for now I won’t be concerned with curvilinear or any other forms a scatter plot may take. The strength of the relationship is determined by how closely the data follow the form of the relationship.

Assessing the strength of a relationship just by looking at the scatter plot is quite problematic, though. We need a numerical measure to help us with that. The numerical measure of the strength of a linear relationship between two quantitative variables is called the correlation coefficient, and it is denoted by a lowercase r. The value of r ranges from -1 to +1. Not surprisingly, negative values of r indicate a negative direction for the linear relationship between the two variables, and positive values indicate a positive direction. Values close to 0, whether negative or positive, indicate a weak linear relationship. Values close to -1 or +1 indicate a strong linear relationship, negative or positive respectively.
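To make the definition concrete, here is a minimal sketch of how r can be computed by hand: the sum of the products of the deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name pearson_r and the toy numbers are my own illustrations and are not part of the Gapminder analysis.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # y rises with x along a straight line
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y falls as x rises along a straight line
print(r_up, r_down)
```

Points that lie exactly on a rising line give r at the top of its range, and points on a falling line give r at the bottom. Real data, such as the Gapminder variables, fall somewhere in between.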

The correlation (r) measures only the strength of a linear relationship between two variables. It ignores any other type of relationship, no matter how strong it is. So beware of interpreting an r close to zero as an indicator of a weak relationship; it indicates only a weak linear relationship. That is why it’s very important to always look at the data in the scatter plot and to interpret the scatter plot and the correlation together. As with the other inferential tools I’ve discussed, an associated p-value is also calculated for the correlation coefficient, and it’s interpreted as significant when it’s less than or equal to 0.05. Thus, the correlation coefficient and its p-value allow me to evaluate whether the relationship between two quantitative variables that I have observed in a sample holds for the larger population.

To demonstrate how to request a correlation coefficient in Python, I first create a new data frame (named sub6_clean in the code below) that drops all missing (N/A) values for each of the variables from the Gapminder data set. I do this because a correlation coefficient cannot be calculated in the presence of N/As.
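The idea of dropping rows with missing values can be sketched on a tiny made-up data frame. The frame toy and its numbers are hypothetical; only the column names are borrowed from the Gapminder variables used here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "alcconsumption": [1.0, np.nan, 3.0, 4.0, 5.0],
    "breastcancerper100th": [10.0, 20.0, np.nan, 40.0, 50.0],
})

# Keep only the rows where both variables are present
toy_clean = toy.dropna(subset=["alcconsumption", "breastcancerper100th"])
print(len(toy), len(toy_clean))
```

Only the complete pairs survive, so every remaining row contributes one point to the scatter plot and to the correlation.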


Next, I request a Pearson correlation measuring the association between alcohol consumption and breast cancer per 100th women. I use the pearsonr function from the scipy.stats module and pass it the pair of variables. Python then returns both the correlation coefficient and the associated p-value.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
# Read the data; low_memory=False parses the file in one pass to avoid mixed-type column guesses
data = pd.read_csv("gapminder.csv", low_memory=False)
# Convert the variables of interest to numeric; entries that cannot be parsed become NaN
data['breastcancerper100th'] = pd.to_numeric(data['breastcancerper100th'], errors='coerce')
data['femaleemployrate'] = pd.to_numeric(data['femaleemployrate'], errors='coerce')
data['alcconsumption'] = pd.to_numeric(data['alcconsumption'], errors='coerce')
# print("Showing missing data column-wise")
# print(data.isnull().sum())
# Copy the variables of interest and drop rows with missing values.
# This step was not shown in the original snippet, so the column subset is an assumption.
sub6_clean = data[['alcconsumption', 'breastcancerper100th', 'femaleemployrate']].copy().dropna()
print("\nAssociation between alcohol consumption and breast cancer per 100th\n")
print(scipy.stats.pearsonr(sub6_clean['alcconsumption'], sub6_clean['breastcancerper100th']))

Executing this code snippet gives the following output:

 Association between alcohol consumption and breast cancer per 100th

(0.48944334303601011, 2.53818007444985e-11) 

For the association between alcohol consumption and breast cancer, the correlation coefficient is approximately 0.49, with a very small p-value (about 2.5e-11). This tells us that the relationship is positive and statistically significant. The corresponding scatter plot is coded as

# Use a scatter plot with a fitted regression line to visualize the two quantitative variables
scat2 = sns.regplot(x='alcconsumption', y='breastcancerper100th', data=sub6_clean)
plt.xlabel('Alcohol consumption in liters')
plt.ylabel('Breast cancer per 100th person')
plt.title('Scatterplot for the Association between Alcohol Consumption and Breast Cancer per 100th person')
plt.show()

This gives me the following scatter plot (sct1).

Post hoc tests are not necessary when conducting a Pearson correlation. Post hoc tests are needed only when your research question includes a categorical explanatory variable with more than two levels. Because the explanatory variable in the context of the correlation coefficient is quantitative, there’s never a need to perform a post hoc test.

Another interesting and useful aspect of the correlation coefficient is that if we square it, that is, multiply it by itself, we get a value that also helps our understanding of the association between the two quantitative variables. This r squared is the fraction of the variability of one variable that can be predicted by the other. For example, when looking at the relationship between alcohol consumption and breast cancer, if I square the correlation coefficient of 0.49, I get 0.24. This could be interpreted the following way: if we know the alcohol consumption, we can predict 24% of the variability we will see in the rate of breast cancer cases reported per 100th women. Of course, that also means that 76% of the variability is unaccounted for. Again, correlation coefficients are commonly denoted with a lowercase r, and they’re squared to determine the amount of variability that can be predicted.
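To see why the squared correlation can be read as a fraction of variability, here is a minimal sketch on made-up numbers (not the Gapminder values). It fits a least-squares line predicting y from x and compares r squared with the share of the variability in y that the line accounts for. All variable names are my own.

```python
import math

# Made-up toy data, not the Gapminder values
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

# Pearson correlation coefficient
r = sxy / math.sqrt(sxx * syy)

# Least-squares line predicting y from x
slope = sxy / sxx
intercept = mean_y - slope * mean_x
residual_ss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Fraction of the variability in y that the line accounts for
explained = 1 - residual_ss / syy
print(round(r ** 2, 2), round(explained, 2))
```

The two printed values agree, which is the sense in which squaring r tells us how much of the variability in one variable can be predicted from the other with a straight line.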

The complete code for this analysis is hosted on my GitHub account here