Big or small-let’s save them all: Exploratory Data Analysis

In my previous post, I had discussed at length on the research question, the dataset, the variables and the various research hypothesis.
For the sake of brevity, I will restate the research question and the variables of study.

Research question: Can alcohol consumption increase the risk of breast cancer in working class women.
Variables to explore are:

  1. ‘alcconsumption’- average alcohol consumption, adult (15+) per capita consumption in litres pure alcohol
  2. ‘breastcancerper100TH’- Number of new cases of breast cancer in 100,000 female residents during the certain year
  3. ‘femaleemployrate’- Percentage of female population, age above 15, that has been employed during the given year

In this post, I present to the readers an exploratory data analysis of the gapminder dataset.

Although, for this course we are provided with the relevant dataset, however if you are not taking this course and are interested in the source of the data, then you can get it from here. In the List of indicators search box type “breast cancer, new cases per 100,000 women” to download the dataset.

I will be using python for Exploratory Data Analysis (EDA). I begin by importing the libraries pandas and numpy

# Importing the libraries
import pandas as pd
import numpy as np

I have already downloaded the dataset which is .csv (comma seperated value format) and will now load/read it in a variable called datausing pandas library as given

# Reading the data where low_memory=False increases the program efficiency
data = pd.read_csv('data/train.csv', low_memory=False)

To get a quick look at the number of rows and columns and the coulmn headers, you can do the following;

print (len(data)) # shows the number of rows, here 213 rows
print (len(data.columns))# shows the number of cols, here 4 columns# Print the column headers/headings
names=data.columns.values
print names

You will see the output as

213
4
213
['country' 'breastcancerper100th' 'femaleemployrate' 'alcconsumption']

Now, to see the frequency distribution of these four variables I use the value_counts() function to generate the frequency counts of the breast cancer dependence variables. Note, if you want to see the data with the missing values then choose the flag dropna=False as shown. For this dataset, majority of variable values have a frequency of 1.

print "\nAlcohol Consumption\nFrequency Distribution (in %)"
c1=data['alcconsumption'].value_counts(sort=False,dropna=False)
print c1
print "\nBreast Cancer per 100th"
c2=data['breastcancerper100th'].value_counts(sort=False)
print c2
print "\nFemale Employee Rate"
c3=data['femaleemployrate'].value_counts(sort=False)
print c3 

The output will be

Alcohol Consumption 5.25 1 9.75 1 0.50 1 9.50 1 9.60 1 

In the above output, values 5.25,9.75,0.50,5.05 are the alcohol consumption in litres and the value 0.004695 is the percentage count of the value. The flag sort=False means that values will not be sorted according to their frequencies. Similarly, I show the frequency distribution for the other two variables

Breast Cancer per 100th
23.5 2
70.5 1
31.5 1
62.5 1
19.5 6

and

Female Employee Rate
45.900002 2
55.500000 1
35.500000 1
40.500000 1
45.500000 1

I now subset the data to explore my research question in a bid to see if it requires any improvement or not. I want to see which countries are prone to greater risk of breast cancer among female employee where the average alcohol intake is 10L;

<span class="n">sub3</span><span class="o">=</span><span class="n">data</span><span class="p">[(</span><span class="n">data</span><span class="p">[</span><span class="s">'alcconsumption'</span><span class="p">]</span><span class="o">></span><span class="mi">10</span><span class="p">)</span><span class="o">&</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">'breastcancerper100th'</span><span class="p">]</span><span class="o">></span><span class="mi">70</span><span class="p">)</span><span class="o">&</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">'femaleemployrate'</span><span class="p">]</span><span class="o">></span><span class="mi">50</span><span class="p">)]</span>
<span class="k">print</span> <span class="n">sub3</span>

and the result is;

 country breastcancerper100th femaleemployrate alcconsumption
9      Australia 83.2 54.599998 10.21
32     Canada 84.3 58.900002 10.20
50     Denmark 88.7 58.099998 12.02
63     Finland 84.7 53.400002 13.10
90     Ireland 74.9 51.000000 14.92
185    Switzerland 81.7 57.000000 11.41
202    United Kingdom 87.2 53.099998 13.24

Interestingly, countries with stable economies like Australia, Canada, Denmark, Finland, Ireland, Switzerland & UK top the list of high breast cancer risk among working women class. These countries are liberal to women rights. Now, this can be an interesting question that will be explored later.

How about countries with very low female employee rates- how much is there contribution to alcohol consumption and breast cancer risk? (I set the threshold for high employee rate as greater than 40% and threshold for high alcohol consumption to be greater than 20 liters and breast cancer risk at less than 50%). And the winner is, Moldova a landlocked country in Eastern Europe. Here we can see that Moldova contributes to approximately 50% of new breast cancer cases reported per 100,000th female residents with a per capita alcohol consumption of 23%. So with a low female employee rate of 43% (as compared to the threshold of 40%) this country does have a significant amount of new breast cancer cases reported because of high alcohol consumption by the relatively less number of adult female residents. ((on a side note: “Heaven’s! Moldavian working class women drink a lot 🙂 ))

# Creating a subset of the data
<span class="c"># Creating a subset of the data</span>
<span class="n">sub1</span><span class="o">=</span><span class="n">data</span><span class="p">[(</span><span class="n">data</span><span class="p">[</span><span class="s">'femaleemployrate'</span><span class="p">]</span><span class="o">></span><span class="mi">40</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">'alcconsumption'</span><span class="p">]</span><span class="o">>=</span><span class="mi">20</span><span class="p">)</span><span class="o">&</span> <span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">'breastcancerper100th'</span><span class="p">]</span><span class="o"><</span><span class="mi">50</span><span class="p">)]</span>
<span class="c"># creating a copy of the subset. This copy will be used for subsequent analysis</span>
<span class="n">sub2</span><span class="o">=</span><span class="n">sub1</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>

the result is

 country breastcancerper100th femaleemployrate alcconsumption
126 Moldova        49.6       43.59        23.01

The complete python code is listed on my github account

This series will be continued….

Advertisements