Data Pre-processing with Weka (Part-1)

Please download and install Weka 3.7.11 from this URL

Some sample datasets for you to play with are present here or in Arff format

A Weka dataset needs to be in a specific format such as ARFF or CSV. How to convert a dataset to .arff format is explained in my previous post on clustering with Weka.
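As a quick reminder of what an .arff file looks like, here is a minimal Python sketch that builds ARFF text from rows held in memory. The function name `to_arff` and the sample `weather` data are my own illustrations, not part of Weka. It assumes attribute names and values contain no spaces or commas, since those would need quoting.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text from in-memory rows.

    attributes: list of (name, kind) pairs, where kind is the string
    "numeric" or a list of nominal labels.
    rows: list of lists; None marks a missing value.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attributes list their allowed labels in braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        # ARFF writes a missing value as a question mark.
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_arff(
        "weather",
        [("temp", "numeric"), ("outlook", ["sunny", "rainy"])],
        [[21, "sunny"], [None, "rainy"]],
    ))
```

Note that a missing value shows up as `?` in the data section, which is what Weka displays as a blank cell in its viewer.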

Step 1: Data Pre Processing or Cleaning

  1. Launch Weka and click on the Explorer button.
  2. Load a dataset. (Click on “Open File” and locate the data file.)
  3. Click on the Preprocess tab, then in the lower right-hand window click the drop-down arrow and choose “No Class”.
  4. Click on the “Edit” button. A new window opens showing the loaded data file. By looking at your dataset here you can find out whether it has missing values. Also note the attribute type shown in each column header. It will be either ‘nominal’ or ‘numeric’.

4.1 If your data has missing values, it’s best to clean it before you apply any mining algorithm to it. Look at Figure 1 below: the highlighted fields are blank, which means the data at hand is dirty and needs to be cleaned first.

Figure: 1

4.2 Data Cleaning: To clean the data, you apply “Filters” to it. Most often the problem is missing values, so the filter to apply is “ReplaceMissingWithUserConstant” (the right filter may vary according to your need; for more information please consult the resources). Click on the Choose button below Filters -> Unsupervised -> attribute -> ReplaceMissingWithUserConstant

Please refer below to Figure: 2 to know how to edit the filter values.

Figure: 2 


A good choice is to replace missing numeric values with a constant such as -1 or 0, as long as that constant does not already occur as a real value in the column, and to replace missing string values with NULL. Refer to Figure 3.

Figure: 3 

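To make the idea concrete, here is a small Python sketch of what replacing missing values with user-chosen constants amounts to. It is only an illustration of the concept, not Weka's implementation. The function name `replace_missing` and its parameters are hypothetical, and it assumes a missing cell is represented as `None`, an empty string or `?`.

```python
def replace_missing(rows, numeric_columns, numeric_constant=-1, nominal_constant="NULL"):
    """Return a copy of rows with missing cells replaced by constants.

    numeric_columns: set of column indexes holding numeric data;
    every other column is treated as nominal/string.
    """
    cleaned = []
    for row in rows:
        new_row = []
        for index, value in enumerate(row):
            # Assumption: None, "" and "?" all mean "missing".
            if value is None or value == "" or value == "?":
                if index in numeric_columns:
                    new_row.append(numeric_constant)
                else:
                    new_row.append(nominal_constant)
            else:
                new_row.append(value)
        cleaned.append(new_row)
    return cleaned


if __name__ == "__main__":
    data = [[1.5, "a"], [None, "?"], ["", "b"]]
    print(replace_missing(data, numeric_columns={0}))
```

A legitimate value of 0 is left alone here, which is why the check compares against the missing markers explicitly instead of testing for "falsy" values.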

It’s worthwhile to also know how to check the total number of data values or instances in your dataset.

Refer to Figure: 4.

Figure: 4

As you can see in Figure 4, the number of instances is 345446. I want you to know about this because later, when we apply clustering to this data, Weka is likely to crash with an “OutOfMemory” error on a dataset of this size.
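If you want to check the instance count outside the Weka GUI, a rough Python sketch like the one below can do it for ARFF text. The helper name `count_instances` is my own, and it assumes the plain (non-sparse) ARFF layout with one instance per line after the `@data` marker.

```python
def count_instances(arff_text):
    """Count data rows in ARFF text (one instance per line after @data)."""
    in_data = False
    count = 0
    for line in arff_text.splitlines():
        stripped = line.strip()
        if not in_data:
            # Everything before @data is header (relation and attributes).
            if stripped.lower() == "@data":
                in_data = True
            continue
        # Skip blank lines and % comment lines inside the data section.
        if stripped and not stripped.startswith("%"):
            count += 1
    return count


if __name__ == "__main__":
    sample = "@relation demo\n@attribute x numeric\n@data\n1\n2\n% comment\n\n3\n"
    print(count_instances(sample))
```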

The natural next question is how to partition or sample the dataset so that we have a smaller amount of data that Weka can process. For this we again use the Filter option.

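4.3 Sampling the Dataset: Click Filters -> unsupervised -> instance, and then you can choose any of the following options

  1. RemovePercentage – removes a given percentage of the dataset
  2. RemoveRange – removes a given range of instances from the dataset
  3. RemoveWithValues – removes instances according to the value of an attribute
  4. Resample – produces a random subsample of the dataset
  5. ReservoirSample – produces a random subsample using reservoir sampling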

To know about each of these, place your mouse cursor on their name and you will see a tool-tip that will explain them.
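For the first two options, the following Python sketch shows the kind of operation involved. It is an illustration under my own assumptions and not Weka's code: the function names are hypothetical, `remove_percentage` assumes the percentage is taken from the start of the dataset, and `remove_range` assumes 1-based, inclusive instance numbers.

```python
def remove_percentage(rows, percentage):
    """Drop the first `percentage` percent of rows and keep the rest."""
    cutoff = round(len(rows) * percentage / 100)
    return rows[cutoff:]


def remove_range(rows, first, last):
    """Drop rows numbered first..last (1-based, inclusive)."""
    return rows[:first - 1] + rows[last:]


if __name__ == "__main__":
    rows = list(range(1, 11))  # ten instances numbered 1 to 10
    print(remove_percentage(rows, 30))
    print(remove_range(rows, 2, 4))
```

Both helpers return a new list and leave the original rows untouched, so you can try different settings on the same data.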

For this dataset I’m using the ‘ReservoirSample’ filter. In my experiments I have found that Weka is unable to handle sample sizes equal to or greater than 999999, so when you are sampling your data I suggest choosing a sample size less than or equal to 9999. The default sample size is 100. Change it to 9999 as shown below in Figure 5, and then click on the Apply button to apply the filter to the dataset. Once the filter has been applied, look at the Instances value shown in Figure 6: the sample size is now 9999, compared to 345446 instances in the complete dataset.

Figure: 5

Figure: 6

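If you are curious how reservoir sampling works, here is a minimal Python sketch of the classic technique (often called Algorithm R). I am not claiming this is exactly what Weka does internally. The function name `reservoir_sample` and the `seed` parameter are my own choices for illustration.

```python
import random


def reservoir_sample(stream, sample_size, seed=1):
    """Pick sample_size items uniformly at random from a stream in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < sample_size:
            # Fill the reservoir with the first sample_size items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability sample_size / (index + 1).
            slot = rng.randint(0, index)
            if slot < sample_size:
                reservoir[slot] = item
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(345446), 9999)
    print(len(sample))
```

The appeal of this approach is that it never needs more memory than the sample itself, which is why it suits a dataset that is too large to process in full.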

If you now click on the “Edit” button at the top of the Explorer screen, you will see the cleaned dataset. All missing values have been replaced with your user-specified constants. Please see Figure 7 below. Congratulations! Step 1 of data pre-processing or cleaning has been completed.


Figure: 7

It’s always a good idea to save the cleaned dataset. To do so, click on the Save button as shown below in Figure: 8.


 Figure: 8

I hope you enjoyed reading and experimenting. The next part will be on data processing. Do leave your comments.
