Big or small-let’s save them all: Visualizing Data

I am revisiting the research question once again: “Can alcohol consumption increase the risk of breast cancer in working-class women?” The variables to explore are:

  1. ‘alcconsumption’ - average alcohol consumption; adult (15+) per capita consumption in litres of pure alcohol
  2. ‘breastcancerper100th’ - number of new breast cancer cases per 100,000 female residents during a given year
  3. ‘femaleemployrate’ - percentage of the female population, age 15+, that was employed during the given year

From the research question, the dependent (response, or outcome) variable is breast cancer cases per 100,000 women, and the independent (explanatory) variables are alcohol consumption and the female employment rate.

Let us now look at the measures of centre and spread of the aforementioned variables. This will help us better understand our quantitative variables. In Python, the count, mean, standard deviation, minimum, maximum and percentiles of a quantitative variable can be computed using the describe() function as shown below.

 # using the describe function to get the standard deviation and other descriptive statistics of our variables
desc1=data['breastcancerper100th'].describe()
desc2=data['femaleemployrate'].describe()
desc3=data['alcconsumption'].describe()
print "\nBreast Cancer per 100th person\n", desc1
print "\nfemale employ rate\n", desc2
print "\nAlcohol consumption in litres\n", desc3

And the result will be

Breast Cancer per 100th person
count    173.000000
mean      37.402890
std       22.697901
min        3.900000
25%       20.600000
50%       30.000000
75%       50.300000
max      101.100000 

So, on average, about 37 new breast cancer cases are reported per 100,000 women, with a standard deviation of ±22.

Similarly, I next find the mean and standard deviation of the variable ‘femaleemployrate’.

female employ rate
count    178.000000
mean      47.549438
std       14.625743
min       11.300000
25%       38.725000
50%       47.549999
75%       55.875000
max       83.300003 

I can say that, on average, about 47% of women (age 15+) were employed in a given year, with a standard deviation of ±15.

Finally, I find the mean and standard deviation of the variable ‘alcconsumption’, given as

Alcohol consumption in litres
count    187.000000
mean       6.689412
std        4.899617
min        0.030000
25%        2.625000
50%        5.920000
75%        9.925000
max       23.010000 

This can be interpreted as: among adults (15+), the average alcohol consumption is about 7 litres of pure alcohol per capita (rounding off), with a standard deviation of ±5 (rounding off).

Therefore, the overall picture is: about 47% (±15) of women are employed in a given year, the average alcohol consumption is about 7 litres (±5) per capita, and on average about 37 (±22) new breast cancer cases are reported per 100,000 female residents.

An alternative way to get the descriptive statistics for all variables at once is to call describe() on the dataframe, which in this case is called ‘data’:

data.describe()
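
If you want the summary restricted to just the three study variables, you can select those columns first; this is only a small variation on the same describe() call:

# describe only the three study variables in one call
print data[['breastcancerper100th', 'femaleemployrate', 'alcconsumption']].describe()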

I now move on to the univariate analysis of the individual variables.

# Now plotting the univariate quantitative variables using the distribution plot
import seaborn as sns
import matplotlib.pyplot as plt
# sub4 is the cleaned dataframe created in the data management post
sub5=sub4.copy()
sns.distplot(sub5['alcconsumption'].dropna(),kde=True)
plt.xlabel('Alcohol consumption in litres')
plt.title('Breast cancer in working class women')
plt.show()

'''Note: There is no need to call show() in an IPython notebook, since %matplotlib inline does the trick, but I am adding it here because %matplotlib inline does not work in an IDE like PyCharm, where plt.show() is required.'''

And the histogram is shown below.


Histogram 1: Alcohol consumption in litres

Notice that we have two peaks in Histogram 1, so it is a bimodal distribution, which means there are two distinct groups in the data. The two groups are evident from Histogram 1: the first (taller) peak is centred at around 5 litres of alcohol consumption, and the second, smaller peak sits at a much higher consumption level.
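
If you want to see where these peaks fall numerically rather than eyeballing the plot, a quick sketch with numpy's histogram function works (the bin count here is chosen arbitrarily):

# sketch: print the histogram counts to locate the peaks of 'alcconsumption'
import numpy as np
counts, edges = np.histogram(sub5['alcconsumption'].dropna(), bins=15)
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print "%5.1f - %5.1f litres: %s" % (lo, hi, '*' * int(c))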

sns.distplot(sub5['breastcancerper100th'].dropna(),kde=True)
plt.xlabel('Breast cancer per 100th women')
plt.title('Breast cancer in working class women')
plt.show() 

And the histogram is shown below.


Histogram 2: Breast cancer cases per 100,000 women

Similarly, Histogram 2 has two peaks, so it is also a bimodal distribution: the first group is centred at around 35 new breast cancer cases reported and the second at around 86 new cases per 100,000 women.

sns.distplot(sub5['femaleemployrate'].dropna(),kde=True)
plt.xlabel('Female employee rate')
plt.title('Breast cancer in working class women')
plt.show()

And the histogram is shown below.


Histogram 3: Female employment rate, age 15+ (in %), in a given year

Histogram 3 shows a unimodal distribution: there is a single group, with its peak at roughly 42% female employment.

Now that we have looked at the individual variables visually, I come back to the research question to see whether there is any relationship between the variables. Recall that for this study the hypotheses were:

H0 (Null Hypothesis) =   Breast cancer is not caused by alcohol consumption

H1 (Alternative Hypothesis) = Alcohol consumption causes breast cancer

H2 (Alternative Hypothesis) = Employed women are susceptible to an increased risk of breast cancer.

So, let’s check if there is any relationship between breast cancer and alcohol consumption.

Please note that since all the variables in this study are quantitative, I will be using scatter plots to visualize the relationships between them.

Note that a histogram is not a bar chart. Histograms show the distribution of a variable, while bar charts compare categories: histograms plot quantitative data, with the range of the data grouped into bins or intervals, whereas bar charts plot categorical data. For Dell Statistica you can take a look here for graphical data visualization; in Python it can be done using the matplotlib library as shown here, and there is a good SO question here.

  • When visualizing a categorical to categorical relationship we use a Bar Chart.
  • When visualizing a categorical to quantitative relationship we use a Bar Chart (of the group means; see the sketch after this list).
  • When visualizing a quantitative to quantitative relationship we use a Scatter Plot.
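
As a quick illustration of the second case, here is a sketch of a categorical-to-quantitative bar chart. It assumes the ‘alco’ consumption groups created with qcut in the data-management post are available on sub4:

# sketch: bar chart of mean breast cancer rate per alcohol-consumption group
sns.barplot(x='alco', y='breastcancerper100th', data=sub4, ci=None)
plt.xlabel('Alcohol consumption group')
plt.ylabel('Mean breast cancer cases per 100,000 women')
plt.show()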

Also, it is very important to bear in mind when plotting the association between two variables that the independent (explanatory) variable ‘X’ is plotted on the x-axis and the dependent (response) variable ‘Y’ is plotted on the y-axis.


So, to check whether the relationship exists or not, I code it in Python as follows:

# using a scatter plot to visualize two quantitative variables
# (a categorical variable would call for a bar chart instead)
scat1= sns.regplot(x='alcconsumption', y='breastcancerper100th', data=data)
plt.xlabel('Alcohol consumption in liters')
plt.ylabel('Breast cancer per 100th person')
plt.title('Scatterplot for the Association between Alcohol Consumption and Breast Cancer 100th person')

And the corresponding scatter plot is shown below.

Scatter Plot 1

From Scatter Plot 1, it is evident that there is a positive relationship between the two variables, which supports the alternative hypothesis (H1) that higher alcohol consumption goes together with more breast cancer cases in women, rather than the null hypothesis that there is no such association. The points are fairly densely clustered around the regression line, so the strength of the relationship is strong. In other words, the plot suggests a strong positive association between alcohol consumption and the number of breast cancer cases reported; whether it is statistically significant will be checked with a formal test later in the series.

Now, let us check whether the other alternative hypothesis (H2), that employed women are susceptible to an increased risk of breast cancer, holds. To examine this claim, I code it as

 scat2= sns.regplot(x='femaleemployrate', y='breastcancerper100th', data=data)
plt.xlabel('Female Employ Rate')
plt.ylabel('Breast cancer per 100th person')
plt.title('Scatterplot for the Association between Female Employ Rate and Breast Cancer per 100th Rate')

And the scatter plot is shown below.

Scatter Plot 2

Scatter Plot 2 shows a negative relationship between the two variables: as the female employment rate increases, the number of breast cancer cases reported tends to decrease. The strength of this relationship is weak, as the points are only sparsely scattered around the regression line. So, although there is some association, it is weak, and it seems safe to conclude that the female employment rate does not by itself contribute much to breast cancer in women.
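
To put a number on these visual impressions, the correlation coefficients can be computed directly. This is only a sketch; the formal hypothesis testing is left to the next post:

# sketch: quantify the strength of the two associations with Pearson's r
import scipy.stats as stats
clean = data[['alcconsumption', 'breastcancerper100th', 'femaleemployrate']].dropna()
r1, p1 = stats.pearsonr(clean['alcconsumption'], clean['breastcancerper100th'])
r2, p2 = stats.pearsonr(clean['femaleemployrate'], clean['breastcancerper100th'])
print "alcohol vs breast cancer:         r = %.2f, p = %.4f" % (r1, p1)
print "employment rate vs breast cancer: r = %.2f, p = %.4f" % (r2, p2)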

I now come to the conclusion of this analytical series. After performing descriptive and exploratory data analysis on the gapminder dataset using Python as a programming tool, I have found that higher alcohol consumption among women is associated with a higher number of breast cancer cases. I have also found that the correlation between breast cancer occurrence and the female employment rate is weak; perhaps there are other factors that could explain it.

Finally, to conclude, this exploratory data analysis series of posts has been very fruitful and immensely captivating for me. In the next post, I will discuss the statistical relationships between the variables and test the hypotheses in the context of Analysis of Variance (used when you have one quantitative variable and one categorical variable). And since the dataset that I chose does not have any categorical variable, I will also show how to categorize a quantitative variable.

I hope you enjoyed reading it as much as I enjoyed writing and presenting the results to you. The complete python code is listed on my GitHub account here

Thank you for reading.

Cheers

Ashish

Big or small-let’s save them all: Making Data Management Decisions

So far I have discussed the dataset and the research question, introduced the variables to analyze, and performed some exploratory data analysis in which I showed how to get a brief overview of the data using Python. Continuing further, I have now reached the stage where I must ‘dive into’ the dataset and make some strategic data management decisions. This stage cannot be taken lightly because it lays the foundation of the entire project; a misjudgment here can spell doom for the whole data analysis cycle.

The first step is to see whether the data is complete. By completeness, I mean checking the rows and the columns of the dataset for missing values or junk values. (Do note that these are really two questions: a) how to deal with missing values, and b) how to deal with junk values. In this post I will answer the first question only; the second will be covered in another post.)

To answer the first question, I use the following code to count the rows with missing values, and thereafter I use isnull().sum() to display the per-column count of missing values.

# Create a copy of the original dataset as sub4
sub4=data.copy()
print "Missing data rows count: ",sum([True for idx,row in data.iterrows() if any(row.isnull())])

I would see that there are 48 rows of missing data as shown

Missing data rows count: 48
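
As an aside, the same row count can be obtained more concisely with pandas, without iterating over the rows:

# equivalent one-liner: count rows that contain at least one missing value
print data.isnull().any(axis=1).sum()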

Now, suppose I want to see which columns have missing data. For that I use the isnull().sum() function as given:

print sub4.isnull().sum()

This line of code gives me the column-wise missing data count as shown:

country 0
breastcancerper100th 40
femaleemployrate 35
alcconsumption 26
dtype: int64

So now, how to deal with this missing data? There are some excellent papers that have addressed this issue; for the interested reader, I refer to two such examples here and here.
Dealing with Missing Values
So what do I do with a data set that has 3 continuous variables which, of course, as always is dirty? (Brief melodrama now: hands high in the air and I shout, “Don’t you have any mercy on me! When will you give me that perfect data set?” God laughs and tells his accountant, pointing at me: “Look at that earthly fool.. while all fellows at his age ask for wine, women and fun, he wants me to give him ‘clean data’, which even I don’t have.”) So how do I mop it clean? Do I remove the missing values? “Nah”, that would be apocalyptic in data science.. hmmm.. so what do I do? How about I code all the missing values as zero? NO! Never underestimate the zero. So what do I do?

One solution is to impute the missing values of a continuous variable with that variable’s mean. (Note: to impute missing categorical values, one can try imputing the mode, i.e. the most frequently occurring value.) Yeah.. that should do the trick. So I code it as given;

# fill each column's missing values with that column's own mean
sub4['breastcancerper100th'].fillna(sub4['breastcancerper100th'].mean(), inplace=True)
sub4['femaleemployrate'].fillna(sub4['femaleemployrate'].mean(), inplace=True)
sub4['alcconsumption'].fillna(sub4['alcconsumption'].mean(), inplace=True)
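
For completeness, here is a small sketch of the mode-imputation idea for a categorical variable. The values below are a toy example, since the gapminder extract used here has no categorical column with missing values:

# sketch: mode imputation for a hypothetical categorical variable
import pandas as pd
toy = pd.Series(['urban', 'rural', None, 'urban', None])
toy_filled = toy.fillna(toy.mode()[0])   # mode()[0] is the most frequent value
print toy_filled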

In all of these cases I have used the fillna() method of the pandas library (you can see the documentation here). Now I show the output before missing value imputation:

Missing data rows count: 48
country 0
breastcancerper100th 40
femaleemployrate 35
alcconsumption 26
dtype: int64 

and the output after the missing values were imputed using the fillna() function as

country                 0
breastcancerper100th    0
femaleemployrate        0
alcconsumption          0
dtype: int64

Continuing further, I now bin the quantitative variables into groups using the pandas qcut function, which splits a variable into (roughly) equal-sized quantile groups. I am doing this because it will later let me view a nice, elegant frequency distribution.

# bin the quantitative variables into quantile groups using the qcut function
# (note: the labels are just group names; with qcut they do not correspond exactly to value ranges)
sub4['alco']=pd.qcut(sub4.alcconsumption,6,labels=["0","1-4","5-9","10-14","15-19","20-24"])
sub4['brst']=pd.qcut(sub4.breastcancerper100th,5,labels=["1-20","21-40","41-60","61-80","81-90"])
sub4['emply']=pd.qcut(sub4.femaleemployrate,4,labels=["30-39","40-59","60-79","80-90"])
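
If the goal were truly customized, value-based splits that match the labels, pd.cut with explicit bin edges would be the tool instead of qcut. A minimal sketch, with bin edges chosen here purely for illustration:

# sketch: customized value-based splits with pd.cut (bin edges are illustrative)
alco_bins = [0, 1, 5, 10, 15, 20, 25]
sub4['alco_cut'] = pd.cut(sub4.alcconsumption, bins=alco_bins,
                          labels=["0-1", "1-4", "5-9", "10-14", "15-19", "20-24"])
print sub4['alco_cut'].value_counts(sort=False)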

Returning to the qcut groups: now that I have binned the continuous variables, I will show their frequency distributions so as to understand my data better.


fd1=sub4['alco'].value_counts(sort=False,dropna=False)
fd2=sub4['brst'].value_counts(sort=False,dropna=False)
fd3=sub4['emply'].value_counts(sort=False,dropna=False)

I will now print the frequency distribution for alcohol consumption as given

Alcohol Consumption
0 36
1-4 35
5-9 36
10-14 35
15-19 35
20-24 36
dtype: int64

then, the frequency distribution for breast cancer per 100th women as 

Breast Cancer per 100th
1-20 43
21-40 43
41-60 65
61-80 19
81-90 43
dtype: int64 

and finally the female employee rate as 

Female Employee Rate
30-39 73
40-59 34
60-79 53
80-90 53
dtype: int64 

Now, this looks better. To summarize the frequency distribution for alcohol consumption (litres of pure alcohol per adult, age 15+): there are 36 countries in the lowest consumption group, and the number of countries in the 5-9 litre group equals the number in the 20-24 litre group (36 each). Likewise, 73 countries fall in the lowest female-employment-rate group, and 43 countries report between 81 and 90 new breast cancer cases per 100,000 female residents.

Stay tuned, next time I will provide a visual interpretation of these findings and more.
Thank you for reading.

Big or small-let’s save them all: Exploratory Data Analysis

In my previous post, I discussed at length the research question, the dataset, the variables and the various research hypotheses.
For the sake of brevity, I will restate the research question and the variables of study.

Research question: Can alcohol consumption increase the risk of breast cancer in working-class women?
Variables to explore are:

  1. ‘alcconsumption’ - average alcohol consumption; adult (15+) per capita consumption in litres of pure alcohol
  2. ‘breastcancerper100th’ - number of new breast cancer cases per 100,000 female residents during a given year
  3. ‘femaleemployrate’ - percentage of the female population, age 15+, that was employed during the given year

In this post, I present to the readers an exploratory data analysis of the gapminder dataset.

Although for this course we are provided with the relevant dataset, if you are not taking the course and are interested in the source of the data, you can get it from here. In the List of indicators search box, type “breast cancer, new cases per 100,000 women” to download the dataset.

I will be using python for Exploratory Data Analysis (EDA). I begin by importing the libraries pandas and numpy

# Importing the libraries
import pandas as pd
import numpy as np

I have already downloaded the dataset, which is in .csv (comma-separated values) format, and will now load/read it into a variable called data using the pandas library as given:

# Reading the data (low_memory=False tells pandas to read the file in one pass, avoiding mixed-type guessing)
data= pd.read_csv("gapminder.csv", low_memory=False)

To get a quick look at the number of rows and columns and the column headers, you can do the following:

print (len(data)) # shows the number of rows, here 213 rows
print (len(data.columns)) # shows the number of cols, here 4 columns
# Print the column headers/headings
names=data.columns.values
print names

You will see the output as

213
4
['country' 'breastcancerper100th' 'femaleemployrate' 'alcconsumption']

Now, to see the frequency distribution of the three quantitative variables, I use the value_counts() function to generate frequency counts. Note: if you want to include missing values in the counts, pass the flag dropna=False as shown. For this dataset, the majority of values have a frequency of 1.

print "\nAlcohol Consumption\nFrequency Distribution (in %)"
c1=data['alcconsumption'].value_counts(sort=False,dropna=False)
print c1
print "\nBreast Cancer per 100th"
c2=data['breastcancerper100th'].value_counts(sort=False)
print c2
print "\nFemale Employee Rate"
c3=data['femaleemployrate'].value_counts(sort=False)
print c3 

The output will be

Alcohol Consumption
5.25    1
9.75    1
0.50    1
9.50    1
9.60    1

In the above output, the values 5.25, 9.75, 0.50, 9.50 and 9.60 are alcohol consumption in litres, and the second column is the frequency count of each value (a count of 1 corresponds to roughly 0.47%, or 0.004695, of the 213 rows). The flag sort=False means that values will not be sorted according to their frequencies. Similarly, I show the frequency distribution for the other two variables.

Breast Cancer per 100th
23.5 2
70.5 1
31.5 1
62.5 1
19.5 6

and

Female Employee Rate
45.900002 2
55.500000 1
35.500000 1
40.500000 1
45.500000 1

I now subset the data to explore my research question, in a bid to see whether it requires any improvement. I want to see which countries are prone to a greater risk of breast cancer among employed women, where average alcohol intake is above 10 litres, new breast cancer cases exceed 70 per 100,000 women, and the female employment rate is above 50%;

sub3=data[(data['alcconsumption']>10)&(data['breastcancerper100th']>70)&(data['femaleemployrate']>50)]
print sub3

and the result is;

 country breastcancerper100th femaleemployrate alcconsumption
9      Australia 83.2 54.599998 10.21
32     Canada 84.3 58.900002 10.20
50     Denmark 88.7 58.099998 12.02
63     Finland 84.7 53.400002 13.10
90     Ireland 74.9 51.000000 14.92
185    Switzerland 81.7 57.000000 11.41
202    United Kingdom 87.2 53.099998 13.24

Interestingly, countries with stable economies like Australia, Canada, Denmark, Finland, Ireland, Switzerland and the UK top the list of high breast cancer risk among working women. These countries are also relatively liberal regarding women's rights. This can be an interesting question to explore later.

How about countries with low female employment rates: how much do they contribute to alcohol consumption and breast cancer risk? (I set the threshold for the female employment rate at greater than 40%, the threshold for high alcohol consumption at 20 litres or more, and breast cancer risk at fewer than 50 new cases per 100,000 women.) And the winner is Moldova, a landlocked country in Eastern Europe. Here we can see that Moldova reports approximately 50 new breast cancer cases per 100,000 female residents, with a per capita alcohol consumption of about 23 litres. So, with a female employment rate of about 43% (just above the 40% threshold), this country does have a significant number of new breast cancer cases reported, together with very high alcohol consumption among its adult female residents. (On a side note: “Heavens! Moldovan working-class women drink a lot” :-) )

# Creating a subset of the data
sub1=data[(data['femaleemployrate']>40) & (data['alcconsumption']>=20)& (data['breastcancerper100th']<50)]
# creating a copy of the subset. This copy will be used for subsequent analysis
sub2=sub1.copy()
print "\nContries where Female Employee Rate is greater than 40 &" \
" Alcohol Consumption is greater than 20L & new breast cancer cases reported are less than 50\n"
print sub2 

the result is

 country breastcancerper100th femaleemployrate alcconsumption
126 Moldova        49.6       43.59        23.01

The complete python code is listed on my github account

This series will be continued….

Big or small, let’s save them all

This week I’m starting off with a new MOOC on Coursera – Data Management and Visualization by Wesleyan University. The course is part of a four-course specialization that will help me catch up with Python for data analysis, combining the capabilities that I already have with other statistical software with a great open-source programming language. This blog post is part of the assignment that I have to submit for week 1 of the course and is also the foundation of the whole course structure.

Introduction

Drinking alcoholic beverages like wine, beer, whisky and liquor among women is a commonplace sight at work-related parties or just for recreation. Several research studies in the past have shown that consumption of alcoholic beverages “increases a woman’s risk of hormone-receptor-positive breast cancer”. The National Toxicology Program of the US Department of Health and Human Services lists consumption of alcoholic beverages as a known human carcinogen. A meta-analysis of 53 epidemiological studies (which included a total of about 58,000 women with breast cancer) showed that women who drank more than 45 grams of alcohol per day (approximately three drinks) had 1.5 times the risk of developing breast cancer compared with non-drinkers (a modestly increased risk) (Hamajima et al, 2002).

When we drink alcohol (chemical name ethanol), it is converted into a toxic chemical called acetaldehyde. Acetaldehyde reduces blood levels of the vitamin folic acid, which plays an important role in copying and repairing DNA. Low levels of folic acid may lead to DNA being copied incorrectly when cell division occurs, and such errors in DNA replication may lead the cells to become cancerous (8).

Literature Review

Epidemiological studies have shown that alcohol consumption in women leads to an increased risk of breast cancer (Longnecker, 1994), (Smith-Warner et al, 1998), (Li et al, 2003), (Cotterchio et al, 2003), (Suzuki et al, 2005). In experimental studies conducted on human subjects, the consumption of alcohol has been shown to affect plasma estrogen levels, causing an increased risk of breast cancer in women. In a decade-long study conducted on 39,876 US female health professionals aged 45 years or older (Zhang et al, 2007), it was found that approximately 13.4% of the women drank at least 10 g/day of alcohol, and that breast cancer occurred in 1.84% of the women who consumed ≥10 g/day of alcohol.

Research Design

For this research study, I have chosen the “Gapminder Codebook”. Gapminder compiles country-level indicators on health, economy, education and the environment from a variety of international sources, providing opportunities to study how living conditions and behaviours in different countries are linked to health outcomes.

After studying the Gapminder Codebook, I have decided that I am particularly interested in determining if alcohol consumption can increase the risk of breast cancer.

I therefore, add to my codebook the following two variables

a. ‘alcconsumption’ – 2008 alcohol consumption per adult (age 15+), in litres: recorded and estimated average alcohol consumption, adult (15+) per capita, in litres of pure alcohol.

b. ‘breastcancerper100th’ – 2002 breast cancer, new cases per 100,000 females: number of new cases of breast cancer per 100,000 female residents during a given year.

I now need to find whether there are any other variables that directly or indirectly contribute to breast cancer risk. In a second review of the Gapminder Codebook, I came to think that working women may be prone to a greater risk of breast cancer. So the second association I would like to explore in this study is that between breast cancer and female employment. The third variable will therefore be ‘femaleemployrate’, given as;

c. ‘femaleemployrate’ – 2007 female employees age 15+ (% of population): percentage of the female population, age 15+, that was employed during the given year.

Now that I have chosen the three variables for this study, I derive the various hypotheses;

H0 (Null Hypothesis) =   Breast cancer is not caused by alcohol consumption

H1 (Alternative Hypothesis) = Alcohol consumption causes breast cancer

H2 (Alternative Hypothesis) = Employed women are susceptible to an increased risk of breast cancer.

Note: This post will be continued…

References

  1. Longnecker MP. Alcoholic beverage consumption in relation to risk of breast cancer: meta-analysis and review. Cancer Causes Control 1994;5:73–82.
  2. Smith-Warner SA, Spiegelman D, Yaun SS, et al. Alcohol and breast cancer in women: a pooled analysis of cohort studies. JAMA 1998;279:535–40.
  3. Li CI, Malone KE, Porter PL, et al. The relationship between alcohol use and risk of breast cancer by histology and hormone receptor status among women 65–79 years of age. Cancer Epidemiol Biomarkers Prev 2003;12:1061–6.
  4. Cotterchio M, Kreiger N, Theis B, et al. Hormonal factors and the risk of breast cancer according to estrogen- and progesterone-receptor subgroup. Cancer Epidemiol Biomarkers Prev 2003;12:1053–60.
  5. Suzuki R, Ye W, Rylander-Rudqvist T, et al. Alcohol and postmenopausal breast cancer risk defined by estrogen and progesterone receptor status: a prospective cohort study. J Natl Cancer Inst 2005;97:1601–8.
  6. Zhang, Shumin M., et al. “Alcohol consumption and breast cancer risk in the Women’s Health Study.” American Journal of Epidemiology 165.6 (2007): 667-676.
  7. Hamajima N, Hirose K, Tajima K, et al. Alcohol, tobacco and breast cancer–collaborative reanalysis of individual data from 53 epidemiological studies, including 58,515 women with breast cancer and 95,067 women without the disease. British Journal of Cancer 2002;87(11):1234-1245. [PubMed Abstract]
  8. Cancer Research UK. 2014. How alcohol causes cancer. [ONLINE] Available at: http://www.cancerresearchuk.org/about-cancer/causes-of-cancer/alcohol-and-cancer/how-alcohol-causes-cancer. [Accessed 03 January 16]

2015 in review

The WordPress.com stats helper monkeys prepared a 2015 annual report for this blog.

Here’s an excerpt:

The concert hall at the Sydney Opera House holds 2,700 people. This blog was viewed about 14,000 times in 2015. If it were a concert at Sydney Opera House, it would take about 5 sold-out performances for that many people to see it.

Click here to see the complete report.

Scenarios when data preprocessing is imperative with examples in Dell Statistica 12 & RapidMiner Studio 6.5

There are usually several data preprocessing steps required before applying any machine learning algorithms to data. These are dictated by the nature of the available data and of the algorithms. Below are listed a few common instances where data preprocessing is required. Recall that in this context, attributes are variables (columns in the data spreadsheet) and each row in a column is a data value. A target or label attribute is the dependent variable which is being predicted.

  1. Data Scaling: Some algorithms, such as k-nearest neighbors, are sensitive to the scale of the data. If you have one attribute whose range spans millions (of dollars, for example) while another attribute is in the tens, then the larger-scale attribute will dominate the outcome. In order to eliminate or minimize such bias, we must normalize the data. Normalization means transforming all the attributes to a common range; this is easily achieved by dividing each attribute by its largest value, for instance, which is called range normalization. Different software packages achieve this in different ways: in Dell Statistica 12 you can do this via ‘Standardisation’ in the Data tab, whereas in RapidMiner it can be done by Normalization (a Python sketch of this and some of the other steps follows this list).
  2. Data Transformation: Some algorithms do not work with certain attribute value types. For example, if the target or label variable is categorical, then you cannot use regression or generalized linear models to predict it; in this case you will need to transform the label attribute into a continuous variable. Conversely, if the label attribute is polynomial (multiple categories), you cannot use a classification algorithm such as logistic regression, which only works with binomial (yes/no, true/false) variables. In such situations, data type transformations are required. For Statistica see this example; in RapidMiner this is handled by the type conversion operators.
  3. The dataset is too large or imbalanced: The first of these scenarios is quite common; the entire space of big data analytics is dedicated to addressing these kinds of issues. However, big data infrastructure may not always be available and you may have to analyze data in memory. If you know that the data quality is high, you may be better off simply sampling the data to reduce the computational expense. Sampling means we only use a portion of the data set; the question then becomes which portion to use. The simplest solution is to randomly select rows from the dataset. However, this may create bias in the testing and training samples if the random sample selected has either too few or too many examples from one of the outcome classes. To avoid this you may want to stratify the sampling, which ensures that the sample chosen has exactly the same proportion of classes as the full dataset. A related problem is that of imbalanced datasets, which requires a complete discussion of its own; for the sake of brevity, one of the solutions for RapidMiner is given here and here.
  4. The data has missing values, because some attributes could not be measured or because of errors: Some algorithms, such as neural networks or support vector machines, cannot handle missing data. In such situations we have two options: replacing missing values by the mean, median, minimum, maximum or zero, or, better, imputing the missing values. Imputation treats the attribute which has missing values as an intermediate target variable and predicts the missing values using the examples for which values are not missing. Clearly, for either of these approaches to work, the proportion of missing values must be small. As a rule of thumb, attributes with more than a third of their values missing should be reconsidered for inclusion in the modeling process.
  5. Despite the big data analytics capabilities that are gaining ground, from an efficiency point of view it still makes sense to measure and weigh the influence of the different variables on the label attribute before building any models. The objective is to identify and rank the influence of each attribute on the label. It makes sense to remove attributes which have either a low influence on the target or a high correlation with other independent variables; very little prediction accuracy is lost by removing such attributes. In Statistica you can use correlation matrices in the Basic Statistics tab, whereas in RapidMiner you can refer to this blog post.
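
For readers who prefer Python to Statistica or RapidMiner, here is a minimal sketch of three of these steps (mean imputation, range normalization and stratified sampling). The dataframe and its 'income', 'age' and 'label' columns are hypothetical, and scikit-learn's train_test_split is used for the stratification:

# sketch: common preprocessing steps in Python (hypothetical toy data)
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'income': [25000, 54000, None, 120000, 39000, 75000],
                   'age':    [23,    41,    35,   52,     None,  47],
                   'label':  ['no',  'yes', 'no', 'yes',  'no',  'yes']})

# (4) mean imputation of missing numeric values
for col in ['income', 'age']:
    df[col] = df[col].fillna(df[col].mean())

# (1) range normalization: divide each attribute by its largest value
for col in ['income', 'age']:
    df[col] = df[col] / df[col].max()

# (3) stratified sampling: the split keeps the class proportions of 'label'
train, test = train_test_split(df, test_size=0.5, stratify=df['label'], random_state=0)
print train
print test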

In most real world data analytics situations we will encounter one or more of these scenarios. Poorly prepared or unprepared data will render the predictive models unreliable. While the time spent on any or all of these, may seem high, it is totally justified: experienced data miners understand and live the rubric that 80% of data mining is really data preparation.

Why use ANOVA over t-test?

The point of conducting an experiment is to find a significant effect of the stimuli being tested. To do this, various statistical tests are used; the two discussed in this post are the ANOVA and the t-test. In an experiment, the independent variable is the stimulus being manipulated and the dependent variable is the behaviour being measured. Statistical tests are carried out to confirm whether the behaviour observed is more than chance.

The t-test compares the means of two samples and is simple to conduct, but if there are more than two conditions in an experiment an ANOVA is required. The fact that the ANOVA can test more than one treatment is a major advantage over other statistical analyses such as the t-test; it opens up many testing capabilities, though it certainly doesn’t help with the mathematical headaches. It is important to know that in the analysis of variance an IV is called a factor, and the treatment conditions or groups in an experiment are called the levels of the factor. ANOVAs use an F-ratio as their test statistic, which is a ratio of variances, because a simple difference between sample means cannot be computed for more than two samples.
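
As a quick illustration, a one-way ANOVA F-ratio can be computed in Python with scipy; the three groups below are made-up data standing in for three treatment conditions:

# sketch: one-way ANOVA on three hypothetical treatment groups
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
group_a = rng.normal(10, 2, 30)   # condition A scores
group_b = rng.normal(12, 2, 30)   # condition B scores
group_c = rng.normal(11, 2, 30)   # condition C scores

f_ratio, p_value = stats.f_oneway(group_a, group_b, group_c)
print "F = %.2f, p = %.4f" % (f_ratio, p_value)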

T-tests are easier to conduct, so why not run a t-test for each possible comparison in the experiment? The answer is the Type I error: the more hypothesis tests you run, the more you risk making a Type I error, and the less power each individual test has. There is no disputing that the t-test changed statistics with its ability to find significance in a small sample, but, as previously mentioned, the ANOVA allows testing more than two means. ANOVAs are used a lot professionally when testing pharmaceuticals and therapies.

The ANOVA is an important test because it enables us to see, for example, how effective two different types of treatment are and how durable they are. Effectively, an ANOVA can tell us how well a treatment works, how long it lasts and how budget-friendly it will be. An example is intensive early behavioural intervention (EIBI) for autistic children, which lasts a long time, involves many hours, has impressive results, but costs a lot of money. The ANOVA is able to tell us whether another therapy can achieve the same outcome in a shorter amount of time, therefore costing less and making treatment more accessible. Conducting this test would also help establish concurrent validity for the therapy against EIBI. The F-ratio tells the researcher how big a difference there is between the conditions and whether the effect is more than just chance. The ANOVA test assumes three things:

  • The population distribution must be normal.
  • The observations must be independent in each sample.
  • The populations the samples are selected from must have equal variance, a.k.a. homogeneity of variance.

These requirements are the same as for a paired and a repeated-measures t-test, and they are checked in the same way for the t-test and the ANOVA. The population is assumed to be normal anyway, independence of the samples is achieved through the design of the experiment, and if the variances are not equal then normally more data (participants) is needed in the experiment.
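
The homogeneity-of-variance assumption, in particular, can be checked before running the ANOVA; a small sketch using scipy's Levene test on the same made-up groups as above:

# sketch: Levene's test for equal variances across the three hypothetical groups
from scipy import stats
stat, p = stats.levene(group_a, group_b, group_c)
print "Levene statistic = %.3f, p = %.3f (a large p suggests equal variances)" % (stat, p)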

In conclusion, it is necessary to use the ANOVA when the design of a study has more than two conditions to compare. The t-test is simpler and less daunting, especially when you see that a 2x4x5 factorial ANOVA is needed, but the risk of committing a Type I error is not worth it. The time spent conducting an experiment, only to have it declared obsolete because the right statistical test wasn’t conducted, would be a waste of time and resources; statistical tests should be used correctly for this reason.