Data Processing with Weka (Part II)

Today, I will discuss data pre-processing in Weka 3.6 (the steps are the same in version 3.7). This post is the second part in the series “Data pre-processing with Weka”. If you have not seen my earlier post, please read that first.

Continuing further, let’s assume that you have cleaned the data at hand and it’s now “noise free”. The next step is to process it. How do we go about doing this?

Often when we have a dataset, the most common question that comes to mind is: which attributes are most closely related to each other, such that a relationship between them could be defined? (An attribute is synonymous with a column heading, and an instance is synonymous with a row of records.) Let’s take the soybean dataset that comes free with Weka to answer this question.

Click on Edit as shown in the picture to see whether or not the data has missing values in it.

Missing values

As is evident in the screenshot here, this soybean dataset is noisy. The same can be seen in this screenshot when it’s opened in Weka.

Missing values in Weka

So the first step is to clean it. You can either follow my previous post on data cleaning, or manually clean the dataset yourself. If you want to clean the data manually, you will first have to save the dataset in .CSV (comma separated values) format. This is shown below.

csv format

Then open it in Microsoft Excel, manually search for outliers in the data, and remove them.

How to manually delete outliers in MS Excel

In case you don’t know how to do this, here is a short tutorial on the same.

Step 1: Open the dataset in Microsoft Excel

Step 2: Ensure that the column headings row is selected.

column headings

Step 3: In the Editing group on the Home ribbon, click on the drop-down beside “Sort & Filter” as shown in the screenshot and click on “Filter”.

filter position

You will see that each of your column headings now has a filter on it.

filter added

Step 4: Click on each filter and check whether any column has an outlier value, such as a ? mark or any other value that you think is incomplete; then delete it as shown in the figure.

delete outliers

Note: Please ensure that you are deleting the entire rows and not just pressing the Delete key on the keyboard, because that will only clear the values and your data will still have blank cells. To delete the rows that contain outliers, first select the rows, then right-click on the selection and choose “Delete Rows” from the context menu.

Step 5: Save the file

Step 6: Now open this file in Weka. As shown below in the figure, there are no missing values in the dataset. Save it in .ARFF format and you are done.

No missing values
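If you prefer to script the manual Excel filtering above, the same filter-and-delete pass can be sketched in Python with the standard library. This is just an illustration of the idea, and the file names are hypothetical:

```python
import csv

def drop_rows_with_missing(in_path, out_path, missing="?"):
    """Copy a CSV file, skipping any data row that contains the
    missing-value marker -- the scripted equivalent of filtering
    for '?' in Excel and deleting the matching rows."""
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        writer.writerow(next(reader))      # keep the header row intact
        for row in reader:
            if missing not in row:         # exact cell match, like the Excel filter
                writer.writerow(row)

# hypothetical usage:
# drop_rows_with_missing("soybean.csv", "soybean-clean.csv")
```

The cleaned CSV can then be opened in Weka and saved as .ARFF as described in Step 6.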

Now we are ready to answer the question that we asked in the beginning: how to find the attributes that are related to each other and can constitute a relationship. Looking at this soybean dataset, we have 36 attributes and 562 instances.

attribute & instances count

Next, to find the relationship between the attributes, click on “Select Attributes” in the Weka Explorer. Under Attribute Evaluator, click on the “Choose” button and select “FilteredAttributeEvaluator”; you will see a dialog box, so click on Yes. Below that, make sure you choose “No class” from the drop-down menu, as shown in the figure.

filter attribute

Click on Start to find attributes related to each other. If you have followed the aforementioned steps correctly, you should see an output similar to the one shown.

attribute list

Hope you learnt something meaningful and that I was able to bring a smile to you; if yes, do leave a comment. See you soon, and until then take care.

Java IO Exception when using Weka CSVLoader

Well, today I was trying to load a CSV file in Weka when I got the dreaded error message. I had seen this Java error message, “wrong number of values. Read 28, expected 18, read Token[EOL], line 25”, on an earlier occasion too, and that time, having no clue how to fix it, I ended up deleting most of the columns of the dataset. Perhaps a brief background first on what causes this problem. If you have very noisy data, especially if you are trying to consolidate a dataset from a data source, then typically you will be copying several columns and pasting them into an .xls or .csv file. Well, I was attempting to do something similar too.

Java Error

So when I fired up Weka and wanted to convert the .csv file to .arff format using the Weka Arff Viewer, I got the above error. To solve it, all you have to do is check for any trailing commas at the end of the text, any double quote sign within the text, or, simply put, any stray punctuation marks within your dataset. The cause of this error is shown in the picture.

java error cause

As you can see, this text in the data contains all manner of punctuation marks. So to resolve this Java error, remove any stray punctuation marks in your dataset and then try loading the data again in Weka. It will work.
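A quick way to locate such problem rows before loading the file into Weka is to compare each row’s field count against the header, which is exactly what the “Read 28, expected 18” message is complaining about. Here is a minimal Python sketch of that check (not part of Weka itself):

```python
import csv

def find_bad_lines(path, expected=None):
    """Return (line_number, field_count) for every row whose field count
    differs from the header's -- the usual cause of Weka CSVLoader's
    'wrong number of values' error, e.g. from unquoted commas in text."""
    bad = []
    with open(path, newline="") as f:
        for lineno, row in enumerate(csv.reader(f), start=1):
            if expected is None:
                expected = len(row)        # the header defines the expected count
            elif len(row) != expected:
                bad.append((lineno, len(row)))
    return bad
```

Note that `csv.reader` honours quoting, so a properly quoted file will pass this check; rows it flags are the ones Weka will choke on.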

Hope this helped you.


Fix: Arff file not recognised or Unable to load data in Weka

If you are a beginner to Weka, then one of the most common problems that you might face is that you are unable to load your dataset into Weka, because it gives you an error such as “Reason: premature end of file, read Token[EOF], line 1” or “Arff file not recognised” or some other similar error.
In one of my previous posts, I had provided the link to a blog that had answered this question. However, today, when I was helping out someone else with a similar problem, I realised that the most important step (Step 2 below) was missing from the instructions on that blog. Therefore, I decided to write this post for the benefit of all.
The solution to this error is given below.
Step 1: Open your datafile in Excel/Access or whatever its original format is
Step 2: Save the dataset file in CSV format. To save the file in .csv format, do the following steps:

Save as
Click on File —> Save As —> click on the drop-down menu beside ‘Save as type’ and change it to CSV. You will see a dialogue saying something like “Some features will be lost blah blah blah”… Just click on Yes.
Step 3: Now go to Weka and then click on Tools–> Arff Viewer 
Step 4: In the Arff Viewer window —> click on File —> click on Open —> go to the location where you saved the datafile in .CSV format, choose the datafile and then click on Open.

file opened
Step 5: Voila, your datafile will be open in the Arff Viewer. Now all you have to do is save it in ARFF format by changing its extension to .arff and choosing arff under ‘Save as type’.
Now, the reason why you were getting that Java error is that the instructions on the blog to which I had guided you did not state that you first have to convert the datafile, whether it is in Excel, Access or some other format, to .CSV (Comma Separated Values) format. Always remember: for Weka to open your data file, your dataset should first be converted into CSV format. Only then will Weka be able to load it in the ARFF Viewer and subsequently let you save it to .arff format.
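The ARFF format itself is just plain text, so the CSV-to-ARFF conversion the Arff Viewer performs can be approximated in a few lines of Python. This sketch is a simplification that declares every attribute as nominal (Weka’s viewer also infers numeric attributes), and the relation name is an assumption:

```python
import csv

def csv_to_arff(csv_path, arff_path, relation="dataset"):
    """Write a minimal ARFF file from a CSV: a @relation line, one
    @attribute line per column (declared nominal, with the values
    actually seen in the data), then the rows under @data."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    with open(arff_path, "w") as out:
        out.write("@relation %s\n\n" % relation)
        for i, name in enumerate(header):
            values = sorted({row[i] for row in data})
            out.write("@attribute %s {%s}\n" % (name, ",".join(values)))
        out.write("\n@data\n")
        for row in data:
            out.write(",".join(row) + "\n")
```

For real datasets the Arff Viewer route described above is the safer choice; this only shows why the conversion is mechanical once the file is valid CSV.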
Hope you found this useful.

Data Pre-processing with Weka (Part-1)

Please download and install Weka 3.7.11 from this URL

Some sample datasets for you to play with are present here or in Arff format

Weka needs datasets to be in a specific format such as ARFF or CSV. How to convert to .arff format was explained in my previous post on clustering with Weka.

Step 1: Data Pre Processing or Cleaning

  1. Launch Weka-> click on the tab Explorer
  2. Load a dataset. (Click on “Open File” & locate the datafile)
  3. Click on the Preprocess tab, then in the window at the lower right-hand side click on the drop-down arrow and choose “No Class”
  4. Click on the “Edit” button; a new window opens that shows you the loaded datafile. By looking at your dataset you can also find out whether there are missing values in it. Also note the attribute types in the column headers: each will be either ‘nominal’ or ‘numeric’.

4.1 If your data has missing values, it’s best to clean it first before you apply any form of mining algorithm to it. Please look below at Figure 1; the highlighted fields are blank, which means the data at hand is dirty and first needs to be cleaned.

missingValue

Figure: 1

4.2 Data Cleaning: To clean the data, you apply “Filters” to it. Generally the data will have missing values, so the filter to apply is “ReplaceMissingWithUserConstant” (the filter choice may vary according to your need; for more information, please consult the resources). Click on the Choose button below Filters -> unsupervised -> attribute -> ReplaceMissingWithUserConstant.

Please refer below to Figure: 2 to know how to edit the filter values.

Figure: 2 


A good choice for replacing missing numeric values is a constant such as -1 or 0, and for string values it could be NULL. Refer to Figure 3.

Figure: 3 
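The effect of ReplaceMissingWithUserConstant can be imitated outside Weka too. The following Python sketch (my own illustration, not Weka’s implementation) substitutes -1 in numeric columns and NULL in nominal ones, mirroring the constants suggested above:

```python
def replace_missing(rows, numeric_const="-1", nominal_const="NULL", missing="?"):
    """Mimic Weka's ReplaceMissingWithUserConstant on parsed CSV rows
    (header first): '?' becomes -1 in numeric columns, NULL in nominal ones."""
    header, data = rows[0], rows[1:]

    def is_numeric(col):
        # a column is numeric if every non-missing value parses as a float
        try:
            for r in data:
                if r[col] != missing:
                    float(r[col])
            return True
        except ValueError:
            return False

    numeric = [is_numeric(c) for c in range(len(header))]
    cleaned = [header]
    for r in data:
        cleaned.append([numeric_const if numeric[c] and v == missing
                        else nominal_const if v == missing
                        else v
                        for c, v in enumerate(r)])
    return cleaned
```

Inside Weka, of course, you just set the filter’s numeric and nominal replacement constants in the dialog shown in Figure 3.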


It’s worthwhile to also know how to check the total number of data values or instances in your dataset.

Refer to Figure: 4.

Figure: 4

checkTotalInstances

So, as you can see in Figure 4, the number of instances is 345446. The reason why I want you to know about this is that later, when we apply clustering to this data, Weka may crash with an “OutOfMemory” error.

It logically follows that we need to partition or sample the dataset so that we have a smaller dataset which Weka can process. For this we again use the Filter option.

4.3 Sampling the Dataset: Click Filters -> unsupervised -> instance, and then you can choose any of the following options:

  1. RemovePercentage – removes a given percentage from the dataset
  2. RemoveRange – removes a given range of instances from the dataset
  3. RemoveWithValues
  4. Resample
  5. ReservoirSample

To learn about each of these, hover your mouse cursor over its name and a tool-tip will appear explaining it.

For this dataset I’m using the filter ‘ReservoirSample’. In my experiments I have found that Weka is unable to handle sample sizes equal to or greater than 999999, so when you are sampling your data I suggest you choose a sample size less than or equal to 9999. The default value of the sample size is 100. Change it to 9999, as shown below in Figure 5, and then click on the Apply button to apply the filter to the dataset. Once the filter has been applied, if you look at the Instances value, also shown in Figure 6, you will see that the sample size is now 9999, compared to the previous complete instance count of 345446.

Figure: 5

Figure: 6
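ReservoirSample is based on the classic reservoir-sampling algorithm, which keeps a uniform random sample of k instances from a stream whose length need not be known in advance. Here is a minimal Python sketch of the idea (not Weka’s actual code):

```python
import random

def reservoir_sample(iterable, k, seed=1):
    """Keep a uniform random sample of k items from a stream of
    unknown length. Each item i >= k replaces a reservoir slot with
    probability k/(i+1), which keeps the sample uniform overall."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(iterable):
        if i < k:
            sample.append(item)            # fill the reservoir first
        else:
            j = rng.randint(0, i)          # replace with decreasing probability
            if j < k:
                sample[j] = item
    return sample
```

This single-pass behaviour is why the filter suits large datasets: it never needs all 345446 instances in memory at once, only the reservoir of 9999.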


If you now click on the “Edit” button at the top of the Explorer screen, you will see the dataset cleaned: all missing values have been replaced with your user-specified constants. Please see Figure 7 below. Congratulations! Step 1 of data pre-processing, or cleaning, is complete.


Figure: 7

It’s always a good idea to save the cleaned dataset. To do so, click on the Save button as shown below in Figure 8.


 Figure: 8

I hope you enjoyed reading and experimenting. Next part will be on Data processing. Do leave your comments.

Keywords explained for PSLC DataShop

PSLC Datashop

Knowledge Component

A knowledge component is a piece of information that can be used to accomplish tasks, perhaps along with other knowledge components. Knowledge component is a generalization of everyday terms like concept, principle, fact, or skill, and cognitive science terms like schema, production rule, misconception, or facet.

Each step in a problem requires the student to know something, a relevant concept or skill, to perform that step correctly. In DataShop, each step can be labelled with the hypothesized knowledge component needed to perform it.

Every knowledge component is associated with one or more steps, and conversely, in DataShop, one or more knowledge components can be associated with a step. This association is typically defined by the problem author, but researchers can provide alternative knowledge components and associations with steps, together known as a Knowledge Component Model.

Learning Curve

A learning curve is a line graph displaying opportunities across the x-axis, and a measure of student performance along the y-axis. As a learning curve visualizes student performance over time, it should reveal improvement in student performance as opportunity count (i.e., practice with a given knowledge component) increases.

Measures of student performance available in learning curves are Error Rate, Assistance Score, number of incorrect attempts or hints, Step Duration, Correct Step Duration, and Error Step Duration.
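As a rough illustration, an error-rate learning curve can be computed by grouping transactions by knowledge component and opportunity number. The record layout below is hypothetical, not DataShop’s actual export format:

```python
from collections import defaultdict

def error_rate_curve(transactions):
    """Aggregate (knowledge_component, opportunity, correct) records into
    an error-rate curve: (kc, opportunity) -> fraction of attempts that
    were incorrect. Plotting opportunity on x and this rate on y gives
    the learning curve described above."""
    totals = defaultdict(lambda: [0, 0])   # (kc, opp) -> [errors, attempts]
    for kc, opp, correct in transactions:
        totals[(kc, opp)][0] += 0 if correct else 1
        totals[(kc, opp)][1] += 1
    return {key: errs / n for key, (errs, n) in totals.items()}
```

If learning is occurring, the error rate for a knowledge component should fall as the opportunity count rises.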


Step

A step is an observable part of the solution to a problem. Because steps are observable, they are partly determined by the user interface available to the student for solving the problem.

In the example problem above, the correct steps for the first question are:

  • find the radius of the end of the can (a circle)
  • find the length of the square ABCD
  • find the area of the end of the can
  • find the area of the square ABCD
  • find the area of the left-over scrap

This whole collection of steps comprises the solution. The last step can be considered the ‘answer’, and the others as ‘intermediate’ steps.

Step List

The step list table describes all problems in a dataset. It provides a detailed listing of problem hierarchy (the unit and section divisions that contain the problem) and composition (the steps that make up a problem). Rows in the step list table are ordered alphabetically by problem hierarchy, problem name, and problem step.

It is not required, however, that a student complete a problem by performing only the correct steps—the student might request a hint from the tutor, or enter an incorrect value. How do we characterize the actions of a student who is working towards performing a step correctly? These actions are referred to as transactions and attempts.


A transaction is an interaction between the student and the tutoring system.

Most reports in DataShop support analysis by knowledge component model, while some currently support comparing values from two KC models simultaneously; see the predicted values on the error rate Learning Curve, for example. We plan to create new features in DataShop that support more direct knowledge component model comparison.

Clustering with Weka 3.6 part-1

1. Download and install Weka 3.6 from here
2. Follow this blog to convert your data file to ARFF format
3. Click on ‘open file’ and select the .arff file that you created in step 2. This will open the dataset in the Weka Preprocess window. Please look at the screenshot below.

Weka_screenshot1

4. Click on the “Cluster” tab at the top of the window.

5. Under Clusterer, click on the “Choose” button and, from the drop-down list, click on “SimpleKMeans”.

6. Click on the “Start” button. Your data will be clustered using the k-means algorithm.

Simple K-means

7. Finally, to see the clusters created by the algorithm you chose, click on the “Visualize” tab. To increase the size of the dots on the graph, play with the point-size slider at the bottom and click on Update each time to see the changes applied.

cluster visualization
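Under the hood, SimpleKMeans alternates two steps: assign each instance to its nearest centroid, then move each centroid to the mean of its cluster. A minimal pure-Python sketch of that loop is below; Weka’s implementation adds normalization, missing-value handling and smarter initialization, so treat this only as an illustration of the algorithm:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means over tuples of numbers: repeatedly assign each
    point to its nearest centroid (squared Euclidean distance), then
    recompute each centroid as the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick k distinct starting points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:                             # keep old centroid if cluster empty
                centroids[i] = tuple(sum(d) / len(cl) for d in zip(*cl))
    return centroids, clusters
```

Running it on two obvious groups of points separates them into two clusters, which is the structure the Visualize tab lets you inspect graphically.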

Hope you enjoyed this post. I will soon publish another one on using the PSLC DataShop. Till then, cheers.

So you want to be a researcher..huh? (Part II)

Today, I will discuss how to find a doctoral position at a foreign or home university. The charm of studying and conducting research on foreign soil is always appealing, so to speak, because as a solitary student one can easily slip into the illusion that a foreign PhD is an instant boost to one’s career. Let’s first clarify a few of the myths.

(P.S. This post is not for born geniuses; if you are one of those, close the browser page and move on. But if you are like me, hard-working to the core, read on as I recount my experiences to you.)

  • A foreign doctoral degree will give you a cutting edge in the job market? – Yes, it can, provided you have done your homework well. By homework, I mean that you are well experienced in terms of previous working experience. My definition of well experienced is a number equal to or greater than 8 years in the topic of your research. Folks, believe me, this is the most important bit. If you are not suitably experienced and you are straight out of a college or university, armed with a degree and a happy smile on your face, and you are thinking of enrolling in a PhD, don’t do it. Unless you are a born genius, please don’t even think about it. I have a reason for this. Look, a PhD is a loner’s life. As I mentioned in my previous article, only if you are in love with your topic of interest will you sail through easily. When you enrol in a doctoral programme, a common protocol is followed worldwide: you begin your journey by conducting a literature review of all the research papers published to date on your topic of interest. So, if you do not have any previous working experience, your mind will not be able to help you with the keywords to search for relevant texts/research papers in the various databases, and then you will be jinxed.
  • The foreign university will give me a Research Assistantship (R.A.) – Ensure, before enrolling in a PhD programme, that you will be able to secure an RA position as soon as you begin your programme of study, because if this isn’t clarified in the beginning and you arrive and don’t get it, again you are jinxed (believe me, I almost have to stop myself from using that F**** word, because this blog is my baby and I don’t feel like tarnishing it). Sometimes your supervisor suggests that they will give you an RA position, but for that you will have to prove your mettle. And that’s where your writing, interpersonal skills and everything else will come in handy. Generally, public universities have RA positions to offer a candidate, but to secure one you really have to burn the midnight oil. It will be easier for you to secure an RA position if you already have a few publications to your name; if not, then you must work hard and prove to your supervisor that you deserve it.
  • How to find a PhD position at a university? You begin with a reconnaissance survey. How do you do that? That’s the million-dollar question. The answer is simple: first find your topic, then create a list of all the universities worldwide that are working in your field of interest. Now sit back, put on your thinking hat and write a damn good proposal that introduces your topic. It could be roughly 6-8 pages long. Mine was 8 pages long. I spent more than 3 months just writing the proposal. My proposal focused more on the literature review and on what is new or novel in what I was suggesting. This novelty of idea comes only if you have invested a lot of time in conducting the literature review in a scientific way. In my next post, I will discuss how to write a proposal, including the literature review. After this, you write a formal cover letter in which you introduce yourself to the university. Next, browse the potential university’s web portal and search for relevant professors/lecturers who are working in your field of interest. Once you have identified them, send them your cover letter along with your intended research proposal. And then wait… wait… Do a follow-up once a month from the time of sending your email, just a small gentle reminder. And yes, the most important part: for heaven’s sake, create a professional email address and not a funky or jazzy one, because academicians are very strict about this. Do not be disheartened if you don’t receive a reply in the very first month, for that’s common.
  • Hooray! I have nailed it. My research proposal has been accepted; what to do next? Congratulations! Welcome to hell. Just joking. Well, this is something you should have done earlier, but it’s still not too late. Go back to the university website and check out the research profile of your intended supervisor. What you are looking for is their research career, their publication history and how many current and former research candidates they have supervised. This will give you a brief idea of whom you are dealing with and what the person is like. Always remember, your supervisor is merely a guide who is going to show you the way; therefore a healthy and conducive relationship with that person is a must. That does not mean that you should engage in the complicated process of boot-licking! I believe this is also a good time to ask your potential supervisor up-front about the sources of their funding, so that you can figure out whether you will be able to secure an RA position later on.

The aforementioned are a few common issues that almost every research student faces in the beginning. This list is not exhaustive, yet if you were able to find an answer to your question, do leave a comment here.

