Data Processing with Weka (Part II)

Today I will discuss data processing in Weka 3.6 (the steps are the same in version 3.7). This post is the second part of the series “Data pre-processing with Weka”. If you have not read my earlier post, please read that first.

Let’s assume that you have cleaned the data at hand and it’s now “noise free”. The next step is to process it. How do we go about doing this?

When we have a dataset, the most common question that comes to mind is: which attributes are most closely related to each other, so that a relationship between them can be defined? An attribute is synonymous with a column heading, and an instance is synonymous with a row (one record). Let’s take the soybean dataset that ships with Weka to answer this question.

Click on Edit, as shown in the picture, to see whether the data has missing values.

Missing values

As the screenshot shows, this soybean dataset is noisy: it has missing values. The same can be seen in the next screenshot, when the dataset is opened in Weka.

Missing values in Weka

So the first step is to clean it. You can either follow my previous post on data cleaning or clean the dataset manually. To clean it manually, first save the dataset in .CSV (comma-separated values) format, as shown below.

csv format
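If you would rather not eyeball the file, a few lines of Python can count the missing values per attribute in the exported CSV. This is only a sketch under my own assumptions: the function name count_missing and the toy rows are made up for illustration, and it assumes that missing values show up as ? or as blank cells.

```python
import csv

# "?" is the usual marker for a missing value; "" covers cells left blank.
MISSING_MARKERS = {"?", ""}

def count_missing(lines):
    """Return {attribute name: number of missing cells} for CSV text lines.

    `lines` can be an open file or any iterable of CSV-formatted strings.
    The first line is treated as the header row of attribute names.
    """
    reader = csv.reader(lines)
    header = next(reader)
    counts = {name: 0 for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            if cell.strip() in MISSING_MARKERS:
                counts[name] += 1
    return counts

# Toy rows standing in for the exported file (values are invented).
toy = [
    "date,leaves,class",
    "october,abnorm,rot",
    "?,norm,rot",
    "?,?,blight",
]
for name, n in count_missing(toy).items():
    print(name, n)

# For a real export you could pass an open file instead, for example:
#   with open("soybean.csv", newline="") as f:   # hypothetical file name
#       print(count_missing(f))
```

Any attribute with a non-zero count tells you which columns to look at when you filter in Excel.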

Then open it in Microsoft Excel, manually search for outliers in the data, such as the ? entries that mark missing values, and remove them.

How to manually delete outliers in MS Excel

In case you don’t know how to do this, here is a short tutorial.

Step 1: Open the dataset in Microsoft Excel

Step 2: Ensure that the column headings row is selected.

Column headings

Step 3: From the Editing group on the ribbon at the top of the screen, click on the drop-down beside “Sort & Filter”, as shown in the screenshot, and then click on “Filter”.

Filter position

You will see that each of your column headings now has a filter.

filter added

Step 4: Click on the filter of each column and check whether it has an outlier value such as a ? (which marks a missing value) or any other value that you think is incomplete, then delete the affected rows as shown in the figure.

Delete outliers

Note: Make sure that you delete the rows themselves rather than pressing the Delete key on the keyboard. The Delete key only clears the values, so your data will still contain blank cells. To delete the rows that contain outliers, first select the rows, then right-click on the selection and choose “Delete rows” from the menu.

Step 5: Save the file

Step 6: Now open this file in Weka. As shown in the figure below, there are no missing values left in the dataset. Save it in .ARFF format and you are done.

No missing values
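The Excel steps above can also be scripted. Here is a minimal Python sketch that does the same thing as Steps 1 to 5: it drops every row that contains a ? or a blank cell and writes the rest to a new CSV. The function names and file paths are my own placeholders, not anything provided by Weka or Excel.

```python
import csv

# Cells treated as missing: "?" plus blank cells left behind by a cleared value.
MISSING_MARKERS = {"?", ""}

def drop_incomplete_rows(lines):
    """Return (header, rows), keeping only rows that have no missing cells.

    `lines` can be an open file or any iterable of CSV-formatted strings.
    """
    reader = csv.reader(lines)
    header = next(reader)
    kept = [
        row for row in reader
        # `row` is empty for blank lines, which are skipped as well.
        if row and all(cell.strip() not in MISSING_MARKERS for cell in row)
    ]
    return header, kept

def clean_csv(src_path, dst_path):
    """Copy src_path to dst_path without incomplete rows; return rows kept."""
    with open(src_path, newline="") as src:
        header, rows = drop_incomplete_rows(src)
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)

# Example call (both file names are hypothetical):
#   kept = clean_csv("soybean.csv", "soybean_clean.csv")
```

Like the manual approach, this removes whole rows rather than just clearing values, so no blank cells are left behind. You can then open the cleaned CSV in Weka and save it as .ARFF, exactly as in Step 6.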

Now we are ready to answer the question that we asked at the beginning: how do we find the attributes that are related to each other and can constitute a relationship? Looking at the cleaned soybean dataset, we have 36 attributes and 562 instances.

Attribute and instance count
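If you want to double-check these counts outside Weka, a tiny Python sketch like the one below counts the columns (attributes) and data rows (instances) of a CSV file. The function name dataset_shape is my own, and it assumes the first line of the file is the header row.

```python
import csv

def dataset_shape(lines):
    """Return (number of attributes, number of instances) for CSV text lines.

    Attributes are the columns named in the header row; instances are the
    non-empty data rows that follow it.
    """
    reader = csv.reader(lines)
    header = next(reader)
    n_instances = sum(1 for row in reader if row)
    return len(header), n_instances

# Example call (the file name is hypothetical):
#   with open("soybean_clean.csv", newline="") as f:
#       attributes, instances = dataset_shape(f)
```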

Next, to find the relationship between the attributes, click on the “Select attributes” tab in the Weka Explorer. Under Attribute Evaluator, click on the “Choose” button and select “FilteredAttributeEval”. A dialog box will appear; click on Yes. Below that, make sure you choose “No class” from the drop-down menu, as shown in the figure.

Filter attribute

Click on Start to find the attributes related to each other. If you have followed the steps above correctly, you should see output similar to the one shown.

Attribute list
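To give a feel for what a ranked list of attributes means, here is a small Python sketch that scores nominal attributes by information gain against a chosen target attribute. Please treat it as an illustration only: the score Weka reports depends on the evaluator you configured, so I am assuming information gain here, and the function names and toy rows are invented. It is not Weka's code and will not reproduce Weka's numbers.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())

def information_gain(rows, attr, target):
    """H(target) - H(target | attr) for rows given as dicts."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row[target])
    conditional = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - conditional

def rank_attributes(rows, target):
    """Return [(attribute, score)] sorted from most to least informative."""
    attrs = [a for a in rows[0] if a != target]
    scores = {a: information_gain(rows, a, target) for a in attrs}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data (invented values): "leaves" fully determines "class",
# while "date" tells us nothing about it.
toy_rows = [
    {"leaves": "abnorm", "date": "june", "class": "rot"},
    {"leaves": "abnorm", "date": "july", "class": "rot"},
    {"leaves": "norm", "date": "june", "class": "healthy"},
    {"leaves": "norm", "date": "july", "class": "healthy"},
]
for name, score in rank_attributes(toy_rows, "class"):
    print(name, round(score, 3))
```

An attribute near the top of such a list carries a lot of information about the target, while one near the bottom carries very little.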

I hope you learnt something meaningful and that I was able to bring a smile to your face. If so, do leave a comment. See you soon; until then, take care.

