Today I will discuss data processing in Weka 3.6 (the steps are the same in version 3.7). This post is the second part of the series “Data pre-processing with Weka”. If you have not read my earlier post, please read that first.
Let’s continue, assuming that you have cleaned the data at hand and it’s now “noise free”. The next step is to process it. So how do we go about doing this?
Often, when we have a dataset, the first question that comes to mind is: which attributes are most closely related to each other, so that a relationship between them can be defined? An attribute corresponds to a column heading, and an instance corresponds to a row (one record). Let’s take the soybean dataset that ships with Weka to answer this question.
Click on Edit, as shown in the picture, to see whether the data has missing values.
As is evident in the screenshot here, this soybean dataset is noisy: it contains missing values, which Weka shows as ? marks. The same can be seen in this screenshot when the file is opened in Weka.
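If you prefer a quick check outside the Weka GUI, the same inspection can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions: the function name count_missing and the toy SAMPLE data are made up for illustration, and the parser is deliberately naive (it assumes unquoted attribute names and values without embedded commas or spaces, which will not hold for every ARFF file).

```python
from collections import Counter

def count_missing(arff_text):
    """Count '?' values per attribute in a simple ARFF file (naive parser)."""
    names = []
    missing = Counter()
    in_data = False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blank lines and comments
            continue
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                names.append(line.split()[1])  # second token is the name
            elif lower.startswith("@data"):
                in_data = True
            continue
        # Data row: one comma-separated value per attribute
        values = [v.strip() for v in line.split(",")]
        for name, value in zip(names, values):
            if value == "?":
                missing[name] += 1
    return {name: missing[name] for name in names}

# Hypothetical toy data, not the real soybean file
SAMPLE = """@relation toy
@attribute date {april,may}
@attribute hail {yes,no}
@attribute class {a,b}
@data
april,yes,a
?,?,b
may,?,a
"""

print(count_missing(SAMPLE))
```

For a real file you would read the text first, for example with open("soybean.arff").read(), where the path depends on your own Weka installation.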
So the first step is to clean it. You can either follow my previous post on data cleaning or clean the dataset manually. To clean it manually, first save the dataset in .CSV (comma-separated values) format. This is shown below.
Then open it in Microsoft Excel, search for missing or incomplete values in the data, and remove the rows that contain them.
How to manually delete rows with missing values in MS Excel
In case you don’t know how to do this, here is a short tutorial.
Step 1: Open the dataset in Microsoft Excel
Step 2: Ensure that the column headings row is selected.
Step 3: In the Editing group on the Home tab at the top of the screen, click on the drop-down beside “Sort & Filter”, as shown in the screenshot, and click on “Filter”.
You will see that each of your column headings now has a filter.
Step 4: Click on the filter of each column and check whether the column contains a missing value, shown as a ? mark, or any other value that you think is incomplete. Then delete the affected rows as shown in the figure.
Note: Please make sure that you delete the rows themselves rather than pressing the Delete key on the keyboard. The Delete key only clears the values, so your data will still contain blank cells. To delete the rows that contain missing values, first select the rows, then right-click on the selection and choose “Delete rows” from the menu.
Step 5: Save the file
Step 6: Now open this file in Weka. As shown in the figure below, there are no missing values left in the dataset. Save it in .ARFF format and you are done.
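If you would rather not do this by hand, the filter-and-delete steps above can be approximated with a short Python script. This is a minimal sketch, not part of Weka or Excel: the function name drop_incomplete_rows and the toy CSV_SAMPLE are hypothetical, and it assumes that a missing value is either a ? mark or an empty cell.

```python
import csv
import io

def drop_incomplete_rows(csv_text, missing_markers=("?", "")):
    """Return (cleaned CSV text, number of rows kept).

    A row is dropped when any of its cells is one of missing_markers.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)                      # first row holds column headings
    kept = [
        row for row in reader
        if row and not any(cell.strip() in missing_markers for cell in row)
    ]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(kept)
    return out.getvalue(), len(kept)

# Hypothetical toy data, not the real soybean file
CSV_SAMPLE = "date,hail,class\napril,yes,a\n?,?,b\nmay,,a\n"

cleaned, n_kept = drop_incomplete_rows(CSV_SAMPLE)
print(n_kept)
print(cleaned)
```

Just like deleting whole rows in Excel, this removes the complete record, so no blank cells are left behind. The cleaned text can then be written to a new .CSV file and opened in Weka as in Step 6.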
Now we are ready to answer the question we asked at the beginning: how do we find the attributes that are related to each other? Looking at the cleaned soybean dataset, we have 36 attributes and 562 instances.
Next, to find the relationships between the attributes, go to the “Select attributes” tab in the Weka Explorer. Under Attribute Evaluator, click on the “Choose” button and select the option “FilteredAttributeEval”. A dialog box will appear; click on Yes. Below that, make sure you choose “No class” from the drop-down menu, as shown in the figure.
Click on Start to find the attributes related to each other. If you have followed the steps above correctly, you should see an output similar to the one shown.
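To build some intuition for what “related attributes” can mean, here is a small Python sketch that scores every pair of nominal columns by their mutual information (a higher score means the two columns tell you more about each other). Please note that this is a different technique that I am using purely for illustration: it is not what Weka computes in the step above, and the names mutual_information and rank_attribute_pairs, as well as the toy data, are made up.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long nominal columns."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi

def rank_attribute_pairs(header, rows):
    """Return (score, attribute_a, attribute_b) tuples, strongest pair first."""
    columns = list(zip(*rows))                 # turn rows into columns
    scores = []
    for i, j in combinations(range(len(header)), 2):
        score = mutual_information(columns[i], columns[j])
        scores.append((score, header[i], header[j]))
    return sorted(scores, reverse=True)

# Hypothetical toy data: columns "a" and "b" always agree, "c" is unrelated
HEADER = ["a", "b", "c"]
ROWS = [("x", "x", "p"), ("x", "x", "q"), ("y", "y", "p"), ("y", "y", "q")]

for score, first, second in rank_attribute_pairs(HEADER, ROWS):
    print(first, second, round(score, 3))
```

In this toy example the pair a and b comes out on top because knowing one column fully determines the other, while the pairs involving c score zero.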
I hope you learnt something meaningful and that I was able to bring a smile to your face. If so, do leave a comment. See you soon; until then, take care.