How to compile and execute weka source code on your computer- Part 1

Recently, I needed to understand how the Weka algorithms work internally. The question was: how to do it?

Step 1: Download the latest version of Weka from here

Step 2: Install the software

Step 3: Search for the “weka-src.jar” file on your computer. A typical location is “C:\Program Files\Weka-3-7\weka-src.jar”

Step 4: Now create a folder called “Data_Analysis” in your Documents folder or any other convenient location. For example C:\Users\UserName\Documents\Data_Analysis

Step 5: Copy the weka-src.jar file found in Step 3 and paste it into the Data_Analysis folder from Step 4

Step 6: Extract the contents of the jar file, using WinRAR or WinZip, into a temporary folder such as “temp” in your Documents folder. After extraction, the temp folder should contain folders such as lib, src, META-INF and test, along with a build.xml file.

Now, move over to NetBeans IDE

Step 7: Open the NetBeans IDE and create a new Java application project as follows. In the New Project dialog box that appears, choose

  1. Categories: General
  2. Projects: Java Application
  3. Click on next button

In the Name and Location dialog that appears

  1. Project Name: weka
  2. Project Location: C:\Users\UserName\Documents\Data_Analysis
  3. Project Folder: C:\Users\UserName\Documents\Data_Analysis\weka
  4. Put a tick mark on “Set as Main Project” and “Create Main Class”
  5. In the main class text box, enter “weka.gui.Main”
  6. Click on Finish button

Now comes the important step:

Step 8: In Step 6, we extracted the contents of the weka source jar. Browse to the location “C:\Users\UserName\Documents\temp\src\main\java\weka” and copy the folder “weka”. Next, paste this folder into “C:\Users\UserName\Documents\Data_Analysis\src\main\java\weka”. You will see a warning that a folder called weka already exists; click on Merge to continue.

Click on the “Yes to All” button if prompted to overwrite files.

Step 9: Now switch back to NetBeans, expand the weka project and its Source Packages node, and observe that all the extracted packages, such as classifiers, gui, clusterers and filters, have been pulled into it.

Step 10: Click Build Menu \ Build Main Project; the weka project will be compiled

Step 11: Click Run Menu \ Run Main Project; the weka project will be executed

In the sequel to this post, we will see how to add a new algorithm to Weka in the NetBeans environment.

How to split a data frame in R with over a million observations and more than 50 variables?

In a previous post, dated April 6th, 2015, I wrote about how to split a data frame into training and test datasets. Today I had to do it again, so I was following my own post when I stumbled into the following error:

“Error in `$<-.data.frame`(`*tmp*`, "spl", value = c(TRUE, FALSE, TRUE,  : replacement has 80 rows, data has 201813”

So what does it mean? (I find that listening to music is good for data cleaning, which is what I was doing at the time, playing my favorite songs.) Googling did not help much, until it occurred to me that the data frame had 80 variables (continuous and categorical) and more than 2 million observations. Hmmm, how do I fix it? How do I get my training and test datasets?

Solution

  1. Load the caTools package: >library(caTools)
  2. Set the seed to 123 so that others can reproduce your results; this can be any arbitrary number. >set.seed(123)
  3. Now, I am borrowing two paragraphs from my previous post as follows “Then, you want to make sure that your Iris data set is shuffled and that you have the same ratio between species in your training and test sets. You use the sample() function to take a sample with a size that is set as the number of rows of the Iris data set which is 150.
    Also you will create a new vector variable in the Iris dataset that will have the TRUE and FALSE values basis on which you will later split the dataset into training and test. You obtain this as following;

# sample.split() is part of the caTools package, so you need to install it first, if you have not done so, using install.packages("caTools"). Assuming in this case that it is already installed, we will now load it as >library(caTools)

  4. Create a new logical vector called x that will hold only TRUE and FALSE values, which we will later use to split the data frame. This can be done as follows

>x = sample.split(dataFrameName$targetColumn, SplitRatio = 0.7)

# SplitRatio=0.7 means split into 70% training data and 30% testing data. Note that sample.split() expects a single vector, typically your outcome column (targetColumn here is a placeholder for your own column name), not the whole data frame; passing the data frame itself is what produced the 80-row error above.

Cool, the major part of the problem has been solved.

  5. >train = dataFrameName[x, 1:80]

# where 1:80 means column index 1 through index 80; the data frame has 80 variables, which is why it is 1:80. Also, x is the logical vector created in step 4

  6. >test = dataFrameName[!x, 1:80]

# don’t forget the ! in !x. The ! symbol negates the logical vector x, so the remaining rows go into the test data frame
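Putting the steps together, here is a minimal runnable sketch. It uses R’s built-in iris data as a stand-in for the large data frame, and iris$Species as a stand-in for whichever outcome column you pass to sample.split():

```r
# Sketch of the split with caTools; iris stands in for the large data frame
# and Species for the outcome column passed to sample.split().
library(caTools)

set.seed(123)                                      # reproducible split
x <- sample.split(iris$Species, SplitRatio = 0.7)  # one TRUE/FALSE per row

train <- iris[x, ]    # ~70% of the rows
test  <- iris[!x, ]   # the remaining ~30%

nrow(train) + nrow(test) == nrow(iris)             # TRUE: every row used exactly once
```

Because x is kept as a plain logical vector rather than assigned as a new column, the “replacement has 80 rows” error cannot occur.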

At least this helped me.

Connect R to SQL Server 2014

For a long time, I had been traversing a long, winding road to extract data from SQL Server. Initially, I used MS BI Studio to connect SQL Server to Excel, export the data to .csv format, and then import that into an R data frame. This process was cumbersome. My recent involvement with R and its majestic features has me hooked. So, to cut the story short, in this post we will see how to connect R to SQL Server 2014 and manipulate SQL database tables as R data frames. Let’s get started.

Creating an ODBC DSN Data Source
The first time you connect R to your SQL Server instance you need to perform a few one-time set up tasks as follows;
1. Create an ODBC DSN data source
2. Install the necessary R ODBC package from CRAN
Step 1: Create a DSN
First we need to set up a user DSN data source pointing at our SQL Server using ODBC. The data source will be called from R using the package “RODBC”.
1. Open “Administrative Tools” and then “ODBC Data Sources (32 bit)”.
2. On the “User DSN” tab, press “Add”.
3. Select “SQL Server” in the provider list.
4. Give the data source a name and a description. Remember this name; you will need it later in the process. Then provide the name of the server where SQL Server is installed and press “Next”.
5. Select the way in which you will authenticate against the SQL Server. In this case we use the default settings; click “Next”.
6. You can now select a default database for the data source: put a check mark on “Change the default database to”, click the drop-down arrow beside it, and select the database you created in SQL Server. Press “Next” and then “Finish”.
7. On the last page, remember to press the “Test Data Source” button and ensure that the connection can be established.
8. The User DSN is now created and active.

Install and load RODBC

Step 2: Install the RODBC package

1. Support for SQL Server is not available in native R, so we have to install the RODBC package from CRAN. Open R and, in the console window, type install.packages("RODBC"), then press “Enter”.

2. The RODBC package is now downloaded and installed on your system. The next step is to load the package into R so that its functions can be used: in the console window, type library("RODBC").

3. The RODBC package is now loaded and ready.

Connect SQL Server 2014 with R

Step 3: Now it is time to connect to the SQL Server database from R and retrieve a table.

For this example, I opened the connection like this (the argument is the DSN name you created in Step 1):

>odbChannel = odbcConnect("YourDSNName")

and pressed “Enter”. Now, if you want to store a SQL Server database table in an R data frame, it is easy: use the sqlFetch() command as follows.

>dataframeName = sqlFetch(odbChannel, "YourTableName")

1. To get help on the RODBC package, use the RShowDoc command as follows:

>RShowDoc("RODBC", package = "RODBC")

and press “Enter”. This opens the PDF help for the RODBC package.

2. To see which tables are in the database, use the sqlTables command as follows:

>sqlTables(odbChannel, tableType = "TABLE")

3. To execute a SQL query and store the result in an R data frame, use the sqlQuery() function like this:

>dataFrameName = sqlQuery(odbChannel, "select * from tableName")
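Putting the pieces together, a complete session might look like the sketch below. “YourDSNName”, “YourTableName” and the query are placeholders for your own DSN, table and SQL; since it needs a live database, treat it as a template rather than something to run as-is:

```r
# Minimal RODBC session sketch; "YourDSNName" and "YourTableName" are
# placeholders for the DSN created in Step 1 and one of your own tables.
library(RODBC)

odbChannel <- odbcConnect("YourDSNName")              # open the DSN

tables <- sqlTables(odbChannel, tableType = "TABLE")  # list the user tables
whole  <- sqlFetch(odbChannel, "YourTableName")       # whole table -> data frame
result <- sqlQuery(odbChannel,
                   "select * from YourTableName")     # arbitrary SQL -> data frame

odbcClose(odbChannel)                                 # always close the channel
```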

I hope this tutorial helped you connect R with SQL Server 2014. I would like to hear your opinion of it.

Data Analysis with R Series- Part 1

In this post we will see the preliminary information that can be derived from the data using R. I have used the tobacco consumption (Male and Female) data from the World Health Organization Global Health Observatory Data Repository.

Step 1: Go to the above webpage and click on the download dataset button. Under Quick downloads menu, choose CSV format. Download the csv file into your R working directory

Step 2: Open the dataset in a spreadsheet software like MS Excel and remove the heading “Prevalence of current cigarette smoking”.

Step 3: Use the read.csv function to load the dataset and save it to a data frame called WHO like

WHO = read.csv("C:/My Documents/R project/data.csv")

The next step is to have a quick peek at the dataset, so use the following commands.

a. To see the first few rows of the data, use head(). The output should look like this:

> head (WHO)
Country      Year   Male Female
1 Albania    2008   42.5    4.2
2 Armenia    2005   60.6    1.7
3 Armenia    2000   64.7     NA
4 Azerbaijan 2006   49.7     NA
5 Bangladesh 2007   60.0     NA
6 Bangladesh 2004   27.8     NA

b. Now let us look at the structure of the data frame, using str(), as in str(WHO). You will see:

> str (WHO)
'data.frame':   77 obs. of  4 variables:
$ Country: Factor w/ 49 levels "Albania","Armenia",..: 1 2 2 3 4 4 5 5 6 7 ...
$ Year   : int  2008 2005 2000 2006 2007 2004 2006 2001 2008 2003 ...
$ Male   : num  42.5 60.6 64.7 49.7 60 27.8 NA NA NA NA ...
$ Female : num  4.2 1.7 NA NA NA NA 0.1 0 8.7 0.1 ...

This tells us that there are 77 observations of 4 variables, namely Country, Year, Male and Female. We also see that there are many NA values in this data frame. To check how many NA values there are in total, combine is.na() with sum():
> sum(is.na(WHO))
[1] 29

c. Okay, so the WHO data frame contains 29 missing values (29 missing cells, not 29 whole rows). Now let’s check which variables have missing values. Since there are four variables, you can either apply is.na() to each of them, like >is.na(WHO$Country), or use summary(), like >summary(WHO), which will tell you directly which variables have missing values:

> summary(WHO)
Country        Year           Male           Female
Cambodia: 3   Min.   :1999   Min.   : 5.40   Min.   : 0.000
Jordan  : 3   1st Qu.:2003   1st Qu.:14.72   1st Qu.: 0.400
Malawi  : 3   Median :2005   Median :20.95   Median : 1.200
Nepal   : 3   Mean   :2005   Mean   :27.42   Mean   : 3.445
Rwanda  : 3   3rd Qu.:2008   3rd Qu.:35.02   3rd Qu.: 5.100
Armenia : 2   Max.   :2011   Max.   :68.60   Max.   :27.700
(Other) :60                  NA's   :25      NA's   :4

So here we see that variable Male has 25 missing values out of 77 observations and variable Female has 4 missing values.
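A quicker way to get these per-variable counts, without reading them out of the summary() output, is to apply colSums() to is.na(). The sketch below uses R’s built-in airquality data set as a stand-in for WHO:

```r
# Count the missing values in each column of a data frame.
# airquality is a built-in data set, used here as a stand-in for WHO.
colSums(is.na(airquality))
#   Ozone Solar.R    Wind    Temp   Month     Day
#      37       7       0       0       0       0
```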
Also from this summary we can see that we have data from 1999 through 2011. Besides this, we also see that the minimum tobacco consumption amongst males is 5.4% and the maximum is 68.6%. Similarly, in females the maximum tobacco consumption is 27.7%.

d. Now, let’s check which country has the minimum tobacco consumption amongst males and females. This can be achieved with which.min(), like so:

> which.min(WHO$Male)
[1] 63

Here 63 is the row number. So, to check which country has the minimum tobacco consumption amongst males, you do the following:

> WHO$Country[63]
[1] Sao Tome and Principe

Similarly, we can check the same for females like
> which.min(WHO$Female)
[1] 8
> WHO$Country[8]
[1] Benin

So we see that the country Sao Tome and Principe in Africa has the smallest percentage of Male consuming tobacco products and country Benin in West Africa has the smallest percentage of Female consuming tobacco products.

e. Now, let’s check which country has the maximum percentage of tobacco consumption for males and females. We use which.max() for this:

> which.max(WHO$Male)
[1] 29
> WHO$Country[29]
[1] Indonesia
> which.max(WHO$Female)
[1] 68
> WHO$Country[68]
[1] Turkey

So we now know that Indonesia has the highest percentage of males consuming tobacco products and Turkey has the highest percentage of females consuming tobacco products.
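The two steps above, finding the row index and then looking up the country, can be collapsed into a single expression. A tiny stand-in data frame is used in this sketch, since WHO is not built into R (note that which.max() and which.min() silently skip NA values):

```r
# One-line lookup: index the Country column with which.max()/which.min().
# A tiny stand-in data frame is used here in place of WHO.
df <- data.frame(Country = c("Albania", "Armenia", "Benin"),
                 Male    = c(42.5, 60.6, 5.4))

df$Country[which.max(df$Male)]   # "Armenia": highest Male percentage
df$Country[which.min(df$Male)]   # "Benin": lowest Male percentage
```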

Thus we see that, with a few basic commands, taming the data becomes easy. What is required, however, is a careful study of the data at hand, from which you derive interesting questions and then seek to answer them using the data.

In the next post we shall look at some more interesting functions in R that can help us derive meaning out of the data. Till then stay tuned.

Datasets (free) for practising data mining

I will use this page to consolidate datasets that are free to download, so that I do not have to waste time searching the internet for relevant datasets. They are classified accordingly below.

Education Datasets (Free)
LinkedUP – will help in collecting, sharing and have an open access to educational data sets also Datasets from DataHub
Datasets for data mining at R-bloggers website also this one.
The Programme for International Student Assessment (PISA) is a triennial international survey which aims to evaluate education systems worldwide by testing the skills and knowledge of 15-year-old students.
Quandl
Data.gov, Data360, National Center for Education Statistics, Pew Research Centre

R dataset

1. Miscellaneous dataset in CSV format

Webpages pointing to free datasets

page1, KDNuggets, UCI Machine Learning Repository, R and Data Mining, Machine Learning data set repository,

Splitting a data frame into training and testing sets in R

Step 1: Loading data in R
a. Import the iris data set from UCI Machine Learning repository as

> iris= read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE)

Step 2: Inspecting the dataset in R
b. To inspect the first few rows of the dataset, use head() as follows:
> head(iris)
V1 V2 V3 V4 V5
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa

Step 3: Getting to know your dataset
a. To inspect the data types and factor levels of the dataset’s attributes, use str() as follows:
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ V1: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ V2: num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ V3: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ V4: num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ V5: Factor w/ 3 levels "Iris-setosa",..: 1 1 1 1 1 1 1 1 1 1 ...

b. To check the distribution of the data values in a particular attribute use the table() as
# table(dataframe$attribute)
> table(iris$V5)
Iris-setosa Iris-versicolor Iris-virginica
50 50 50

c. Use the summary() to obtain a detailed overview of your dataset
> summary(iris)
This will give you the minimum value, first quantile, median, mean, third quantile and maximum value of the data set Iris for numeric data types. For the class variable, the count of factors will be returned:
V1 V2 V3 V4 V5
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 Iris-setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 Iris-versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 Iris-virginica :50
Mean :5.843 Mean :3.054 Mean :3.759 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500

Step 4: What to do next?
After you have attained a good understanding of your data, you have to think about what your data set might teach you or what you think you can learn from your data. From there on, you can think about what kind of algorithms you would be able to apply to your data set in order to get the results that you think you
can obtain. For this tutorial, the Iris data set will be used for classification, which is an example of predictive modeling.

Step 5: Divide the dataset into training and test dataset
a. To make your training and test sets, you first set a seed. This is a number that initializes R’s random number generator.
The major advantage of setting a seed is that you can get the same sequence of random numbers whenever you supply the same seed in the random number generator.
Another major advantage is that it helps others to reproduce your results.

# To set a seed use the function set.seed()
> set.seed(123)

Then, you want to make sure that your Iris data set is shuffled and that you have the same ratio between species in your training and test sets. You use the sample() function to take a sample with a size that is set as the number of rows of the Iris data set which is 150.
Also you will create a new vector variable in the Iris dataset that will have the TRUE and FALSE values basis on which you will later split the dataset into training and test. You obtain this as following;

# sample.split() is part of the caTools package, so you need to install it first, if you have not done so, by using install.packages("caTools"). Assuming in this case that it is already installed, we will now load it as

>library(caTools)

> iris$spl=sample.split(iris$V5, SplitRatio=0.7)
# sample.split() expects a vector, typically the outcome column (here V5, the species), not the whole data frame. It creates a vector of TRUE and FALSE values; by setting SplitRatio to 0.7, you split the original 150-row Iris dataset into 70% training and 30% testing data while preserving the ratio between species.
# Assigning the result to iris$spl creates a new column in the Iris dataset.

Now, if you view the Iris dataset again, you will notice a new column added at the end; you can use View() or head() to see it. (The exact pattern of TRUE and FALSE values depends on the seed.)
> head(iris)
V1 V2 V3 V4 V5 spl
1 5.1 3.5 1.4 0.2 Iris-setosa TRUE
2 4.9 3.0 1.4 0.2 Iris-setosa TRUE
3 4.7 3.2 1.3 0.2 Iris-setosa TRUE
4 4.6 3.1 1.5 0.2 Iris-setosa FALSE
5 5.0 3.6 1.4 0.2 Iris-setosa FALSE
6 5.4 3.9 1.7 0.4 Iris-setosa TRUE

b. Now, you can split the dataset to training and testing as given
> train=subset(iris, iris$spl==TRUE) # where spl== TRUE means to add only those rows that have value true for spl in the training dataframe
> View(train) # you will see that this dataframe has all values where iris$spl==TRUE

Similarly, to create the testing dataset,
> test=subset(iris, iris$spl==FALSE) # where spl==FALSE means to add only those rows that have the value FALSE for spl to the test dataframe
> View(test) # you will see that this dataframe has all values where iris$spl==FALSE

Another easy, though cruder, method to split the dataset into training and test sets is as follows;

> train= iris[1:100, ] # this will put the first 100 rows into the training set

> test= iris[101:150, ] # this will put the remaining 50 rows into the test set

Be careful with this method: the rows are not shuffled, and the iris file is sorted by species, so the first 100 rows contain no Iris-virginica at all.
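A base-R alternative that still uses simple slicing is to shuffle the row indices first. This is a sketch using the built-in iris data:

```r
# Shuffle the row indices, then slice: a base-R alternative to sample.split().
set.seed(123)                     # reproducible shuffle
idx   <- sample(nrow(iris))       # random permutation of 1..150
train <- iris[idx[1:100], ]       # 100 shuffled rows for training
test  <- iris[idx[101:150], ]     # remaining 50 rows for testing
```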

Numerical data in R is examined by using the summary() whereas the categorical data is examined in R using the table().
The table output lists the categories of the nominal variable and a count of the number of values falling into that category.
To calculate the proportion of each nominal variable in R, use the prop.table() as follows
> propIris=table(iris$V5) # V5 column contains the three types of Iris flower
> prop.table(propIris) # This will give the proportions as decimals; to express them as percentages, multiply by 100, as follows
> prop.table(propIris)*100
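prop.table() is also a handy sanity check that a split has preserved the class ratio. A sketch on the built-in iris data (which uses the column name Species rather than V5):

```r
# Compare class proportions before and after a split, using built-in iris.
prop.table(table(iris$Species)) * 100   # one third (33.33%) per species

set.seed(123)
idx   <- sample(nrow(iris), 100)        # 100 random training rows
train <- iris[idx, ]
prop.table(table(train$Species)) * 100  # should stay close to one third each
```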

In the next post we will see how to normalize the dataset in R so that data mining yields better results.

P.S. if you want to know how to split the dataset in Weka, you can follow it on the Weka Wiki here

Stay Tuned.

25 Hottest skills ranked by LinkedIn for the year 2014


If your skills fit one of the categories below, there’s a good chance you either started a new job or gained the interest of a recruiter in the past year.
