Data Pre-processing with Weka (Part-1)

Please download and install Weka 3.7.11 from this URL

Some sample datasets for you to experiment with are available here, or in ARFF format.

A Weka dataset needs to be in a specific format such as ARFF or CSV. How to convert a file to .arff format is explained in my previous post on clustering with Weka.
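If you would rather script the conversion, the idea can be sketched in a few lines of plain Python. This is only an illustration, not Weka's own converter: the function name csv_to_arff, the default relation name, and the rule "a column is numeric if every non-empty value parses as a number, otherwise nominal" are my assumptions. It also assumes that column names and values contain no spaces, commas or quotes. Empty cells are written as ?, which is how ARFF marks a missing value.

```python
import csv
import io


def is_number(text):
    """Return True if the text can be read as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation="dataset"):
    """Convert CSV text (first row = column names) to ARFF text.

    Hypothetical helper for illustration; the type guessing is an assumption.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data if row[col] != ""]
        if values and all(is_number(v) for v in values):
            lines.append("@attribute " + name + " numeric")
        else:
            # Nominal attribute: list the distinct labels seen in the column.
            labels = sorted(set(values))
            lines.append("@attribute " + name + " {" + ",".join(labels) + "}")
    lines += ["", "@data"]
    for row in data:
        # An empty cell becomes "?", the ARFF marker for a missing value.
        lines.append(",".join(v if v != "" else "?" for v in row))
    return "\n".join(lines) + "\n"
```

Calling csv_to_arff on the text of a small CSV file returns a string that you can save with an .arff extension and open in the Explorer.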

Step 1: Data Pre Processing or Cleaning

  1. Launch Weka, then click on Explorer.
  2. Load a dataset. (Click on “Open File” and locate the data file.)
  3. Click on the Preprocess tab, then in the lower right-hand window click on the drop-down arrow and choose “No Class”.
  4. Click on “Edit”; a new window opens showing the loaded data file. By looking through your dataset you can find out whether it has missing values. Also note the attribute type in each column header. It will typically be either ‘nominal’ or ‘numeric’.

4.1 If your data has missing values, it’s best to clean it before you apply any mining algorithm to it. Look at Figure 1 below: the highlighted fields are blank, which means the data at hand is dirty and first needs to be cleaned.

Figure: 1

4.2 Data Cleaning: To clean the data, you apply “Filters” to it. Generally the data will have missing values, so the filter to apply is “ReplaceMissingWithUserConstant” (the filter choice may vary according to your need; for more information please consult the resources). Click on the Choose button below Filter -> unsupervised -> attribute -> ReplaceMissingWithUserConstant.

Please refer to Figure 2 below to see how to edit the filter values.

Figure: 2 


A common choice is to replace missing numeric values with a constant such as -1 or 0, and missing string values with NULL. Refer to Figure 3.

Figure: 3 
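To make the effect of the filter concrete, here is a rough Python sketch of the same idea. It is not Weka's implementation: the function name replace_missing, the row layout (a list of lists of strings) and the parameter names are assumptions for illustration. A cell counts as missing if it is empty or holds ?, the ARFF marker for a missing value.

```python
def replace_missing(rows, types, numeric_constant="-1", nominal_constant="NULL"):
    """Return a copy of rows with missing cells replaced by fixed constants.

    rows  : list of rows, each a list of strings
    types : one entry per column, either "numeric" or "nominal"
    (Hypothetical helper; names and layout are assumptions.)
    """
    cleaned = []
    for row in rows:
        new_row = []
        for value, kind in zip(row, types):
            if value in ("", "?"):
                # Pick the constant that matches the column type.
                value = numeric_constant if kind == "numeric" else nominal_constant
            new_row.append(value)
        cleaned.append(new_row)
    return cleaned
```

The original rows are left untouched, so you can compare the data before and after cleaning, much as you would by opening “Edit” before and after applying the filter.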


It’s worthwhile to also know how to check the total number of data values or instances in your dataset.

Refer to Figure: 4.

Figure: 4

As you can see in Figure 4, the number of instances is 345446. I want you to know this because later, when we apply clustering to this data, Weka will crash with an “OutOfMemory” error.

The logical next question is how to partition or sample the dataset so that we have a smaller amount of data that Weka can process. For this we again use the Filter option.

4.3 Sampling the Dataset: Click Filter -> unsupervised -> instance, and then choose any of the following options:

  1. RemovePercentage – removes a given percentage of the dataset
  2. RemoveRange – removes a given range of instances from the dataset
  3. RemoveWithValues
  4. Resample
  5. ReservoirSample

To learn about each of these, hover your mouse cursor over its name and a tool-tip will explain it.

For this dataset I’m using the ‘ReservoirSample’ filter. In my experiments I have found that Weka is unable to handle sizes equal to or greater than 999999. Therefore, when you are sampling your data, I suggest setting the sample size to a value less than or equal to 9999. The default sample size is 100. Change it to 9999 as shown below in Figure 5, and then click on the Apply button to apply the filter to the dataset. Once the filter has been applied, the Instances value (shown in Figure 6) will be 9999, compared to 345446 for the complete dataset.

Figure: 5

Figure: 6
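If you are curious how a reservoir sample can be drawn in a single pass without knowing the dataset size in advance, here is a minimal Python sketch of the classic reservoir sampling technique (often called Algorithm R). Treat it as an illustration of the technique rather than Weka's code; the function name reservoir_sample and the seed value are my own choices.

```python
import random


def reservoir_sample(items, sample_size, seed=1):
    """Draw up to sample_size items uniformly at random in one pass.

    Hypothetical helper for illustration; the seed is an arbitrary choice.
    """
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(items):
        if index < sample_size:
            # Fill the reservoir with the first sample_size items.
            reservoir.append(item)
        else:
            # Pick a slot among all items seen so far (inclusive range);
            # the new item replaces an old one only if the slot is in the reservoir.
            slot = rng.randint(0, index)
            if slot < sample_size:
                reservoir[slot] = item
    return reservoir
```

For example, reservoir_sample(range(345446), 9999) returns 9999 distinct values, which mirrors going from 345446 instances to 9999 in the Explorer. If the input has fewer items than the requested sample size, all of them are returned.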


If you now click on “Edit” at the top of the Explorer screen, you will see the cleaned dataset. All missing values have been replaced with your user-specified constants. Please see Figure 7 below. Congratulations! Step 1 of data pre-processing, or cleaning, has been completed.


Figure: 7

It’s always a good idea to save the cleaned dataset. To do so, click on the save button as shown below in Figure: 8.


 Figure: 8

I hope you enjoyed reading and experimenting. The next part will be on data processing. Do leave your comments.

Keywords explained for PSLC DataShop

PSLC Datashop

Knowledge Component

A knowledge component is a piece of information that can be used to accomplish tasks, perhaps along with other knowledge components. Knowledge component is a generalization of everyday terms like concept, principle, fact, or skill, and cognitive science terms like schema, production rule, misconception, or facet.

Each step in a problem requires the student to know something, a relevant concept or skill, to perform that step correctly. In DataShop, each step can be labelled with a hypothesized knowledge component needed.

Every knowledge component is associated with one or more steps. In DataShop, one or more knowledge components can be associated with a step. This association is typically originally defined by the problem author, but researchers can provide alternative knowledge components and associations with steps, also known as a Knowledge Component Model.

Learning Curve

A learning curve is a line graph displaying opportunities across the x-axis, and a measure of student performance along the y-axis. As a learning curve visualizes student performance over time, it should reveal improvement in student performance as opportunity count (i.e., practice with a given knowledge component) increases.

Measures of student performance available in learning curves are Error Rate, Assistance Score, number of incorrects or hints, Step Duration, Correct Step Duration, and Error Step Duration.
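To show how the points on an error-rate learning curve could be computed, here is a small Python sketch. It is not DataShop's code. It assumes you already have one record per student per step, in time order, in the hypothetical form (student, knowledge component, whether the first attempt was correct); the function name error_rate_curve is also my own.

```python
from collections import defaultdict


def error_rate_curve(attempts):
    """Return {opportunity number: error rate} for the given records.

    attempts: iterable of (student, knowledge_component, first_attempt_correct)
    tuples in time order. (Hypothetical record layout for illustration.)
    """
    opportunity_count = defaultdict(int)  # practice count per (student, KC)
    totals = defaultdict(int)             # records seen at each opportunity
    errors = defaultdict(int)             # incorrect records at each opportunity
    for student, kc, correct in attempts:
        opportunity_count[(student, kc)] += 1
        opportunity = opportunity_count[(student, kc)]
        totals[opportunity] += 1
        if not correct:
            errors[opportunity] += 1
    return {opp: errors[opp] / totals[opp] for opp in sorted(totals)}
```

Plotting the keys on the x-axis and the values on the y-axis gives the curve. If students are learning, the error rate should fall as the opportunity number rises.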


Step

A step is an observable part of the solution to a problem. Because steps are observable, they are partly determined by the user interface available to the student for solving the problem.

In the example problem above, the correct steps for the first question are:

  • find the radius of the end of the can (a circle)
  • find the length of the square ABCD
  • find the area of the end of the can
  • find the area of the square ABCD
  • find the area of the left-over scrap

This whole collection of steps comprises the solution. The last step can be considered the ‘answer’, and the others as ‘intermediate’ steps.

Step List

The step list table describes all problems in a dataset. It provides a detailed listing of problem hierarchy (the unit and section divisions that contain the problem) and composition (the steps that make up a problem). Rows in the step list table are ordered alphabetically by problem hierarchy, problem name, and problem step.

It is not required, however, that a student complete a problem by performing only the correct steps: the student might request a hint from the tutor, or enter an incorrect value. How do we characterize the actions of a student who is working towards performing a step correctly? These actions are referred to as transactions and attempts.


Transaction

A transaction is an interaction between the student and the tutoring system.

Most reports in DataShop support analysis by knowledge component model, while some currently support comparing values from two KC models simultaneously (see the predicted values on the error rate Learning Curve, for example). We plan to create new features in DataShop that support more direct knowledge component model comparison.

Clustering with Weka 3.6 part-1

1. Download and install Weka 3.6 from here
2. Follow this blog to convert your data file to ARFF format
3. Click on ‘open file’ and select the .arff file that you created in step 2. This will open the dataset in the Weka Preprocess window. Please look at the screenshot below.

Weka_screenshot1

4. Click on the “Cluster” tab at the top of the window.

5. Under Clusterer, click on the “Choose” button, and from the drop-down list click on “SimpleKMeans”.

6. Click on the “Start” button. Your data will be clustered using the k-means algorithm.

Simple K-means

7. Finally, to see the clusters created by the algorithm you chose, click on the “Visualize” tab. To increase the size of the dots on the graph, adjust the point size slider at the bottom and click on update each time to see the changes.

cluster visualization
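If you want to see what happens behind the “Start” button, the k-means idea can be sketched in plain Python. This is a bare-bones illustration and not Weka's code: the function names k_means and squared_distance, the seed, and the choice to start from randomly picked data points are my assumptions. The loop repeats two steps: assign every point to its nearest centroid, then move each centroid to the mean of its points.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=10):
    """Cluster points into k groups; returns (centroids, assignments).

    Hypothetical helper for illustration; seed and start-up rule are assumptions.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignments = None
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has settled
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments
```

With two well-separated groups such as (0, 0), (1, 1), (2, 2) and (100, 100), (101, 101), (102, 102) and k set to 2, the two groups end up in different clusters. Which cluster gets which number depends on the random starting points.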

Hope you enjoyed this post. I will soon publish another one on using the PSLC datashop. Till then, cheers.

So you want to be a researcher..huh? (Part II)

Today, I will discuss how to find a doctoral position at a foreign or home university. The charm of studying and conducting research on foreign soil is always alluring, so to speak, because a solitary student can easily slip into the illusion that a foreign PhD is an instant boost to a career. Let’s first clarify a few of the myths:

(P.S. This post is not for the born geniuses; if you are one of them, close the browser page and move on. But if you are like me, hard-working to the core, read on as I recount my experiences to you.)

  • A foreign doctoral degree will give you a cutting edge in the job market? Yes, it can, provided you have done your homework well. By homework I mean that you have solid previous working experience. My definition of well experienced is 8 or more years in the topic of your research. Folks, believe me, this is the most important bit. If you are not suitably experienced, and you are straight out of a college or university, armed with a degree and a happy smile on your face, and you are thinking of enrolling in a PhD, don’t do it. Unless you are a born genius, please don’t even think about it. I have a reason for this. Look, a PhD is a loner’s life. As I mentioned in my previous article, only if you are in love with your topic of interest will you sail through easily. When you enrol in a doctoral programme, a common protocol is followed worldwide: you begin your journey by conducting a literature review of all the research papers published to date on your topic of interest. So, if you do not have any previous working experience, your mind will not be able to supply the keywords to search for relevant texts and research papers in the various databases, and then you will be jinxed.
  • A foreign university will give me a Research Assistantship (RA)? Before enrolling in a PhD programme, make sure that you will be able to secure an RA position as soon as you begin your programme of study, because if this isn’t clarified at the beginning and you arrive and don’t get it, again you are jinxed (believe me, I almost have to stop myself from using that F**** word because this blog is my baby and I don’t feel like tarnishing it..). Sometimes your supervisor suggests that they will give you an RA position, but for that you will have to prove your mettle. That’s where your writing, interpersonal skills and everything else will come in handy. Generally, public universities have RA positions to offer a candidate, but to secure one you really have to burn the midnight oil. It will be easier to secure an RA position if you already have a few publications to your name; if not, you must work hard and prove to your supervisor that you deserve it.
  • How to find a PhD position at a university? You begin with a reconnaissance survey. How do you do that? That’s the million-dollar question. The answer is simple: first find your topic, then create a list of all the universities worldwide that are working in your field of interest. Now sit back, put on your thinking hat and write a damn good proposal that introduces your topic. It could be roughly 6-8 pages long. Mine was 8 pages long. I spent more than 3 months just writing the proposal. My proposal focused more on the literature review and on what is new or novel in what I’m suggesting. This novelty of idea comes only if you have invested a lot of time in conducting a literature review in a scientific way. In my next post, I will discuss how to write a proposal, including the literature review. After this, write a formal cover letter in which you introduce yourself to the university. Next, browse the web portal of each potential university and search for relevant professors and lecturers who are working in your field of interest. Once you have identified them, send them your cover letter along with your intended research proposal. And then wait….wait…. Follow up with them once a month from the time of sending your email, with just a small gentle reminder. And yes, the most important part: for heaven’s sake create a professional email address and not a funky or jazzy one, because academicians are very strict about it. Do not be disheartened if you don’t receive a reply in the very first month, for that’s common.
  • Hooray! I have nailed it. My research proposal has been accepted, what to do next? Congratulations! Welcome to hell. Just joking. Well, this is something you should have done earlier, but it’s still not too late. Go back to the university website and check out the research profile of your intended supervisor. What you are looking for is their research career, their publication history and how many current and former research candidates they have supervised. This will give you a brief idea of whom you are dealing with and what the person is like. Always remember, your supervisor is merely a guide who is going to show you the way, therefore a healthy and conducive relationship with that person is a must. That does not mean that you should engage in the complicated process of boot-licking…! I believe this is a good time to ask your potential supervisor up-front about their sources of funding, so that you can figure out whether you will be able to secure an RA position later on.

The aforementioned were a few common issues that almost every research student faces in the beginning. The list is not exhaustive, yet if you were able to find an answer to your question, do leave your comment here.

So you want to be a researcher..huh? (Part 1)

So you want to be a researcher? Good and Bad….

Wondering why? Then read on..

This is the first of a series of posts. Remember, the road travelled by a researcher is typically long, winding and lonely, therefore you have got to keep your music system (read sanity) intact and playing. Read and write as much as possible. Talk, discuss and keep your ears open to anyone offering a free piece of advice or a lecture. It’s you who has to filter out what is trash and what is gold.

The topic that you choose should be carefully picked, because you will have to eat, drink and sleep with that topic for the next 3 years (if you are doing a bachelor’s degree) or 2 years (if a Master’s) and so forth. When choosing a topic, bear in mind two questions: (1) Is it publishable? (2) Are you in love with it?

Let’s briefly talk about the literature review. I’m sure at some point or other in your academic career you were asked to do a literature review. First, why do we do a literature review? The simple answer is that it tells you what previous work has been done in your field of study; and if you do this review scientifically it will be gold, else it will be trash. The next question is, “How do you perform a scientific literature review?” The answer lies in a few well-articulated texts like ‘Kitchenham’s review’. Read it, if you want to get published. Collecting up to 10 research papers is OK to manage, but it spells doom when you have more than that. I suggest using EndNote or Mendeley. My personal favourite is EndNote. Why? Because it has a utility called ‘Smart group’, which lets you cluster and file similar papers together.

Continuing further, if you don’t have access to specialised databases like IEEE Xplore or the ACM Digital Library, then don’t fret or worry. Google Scholar is your best buddy, or Microsoft research center; it will help you out. My suggestion assumes that you don’t have access to a good research database like those above and that you wish to publish or make a career for yourself in the research community. In that case, engage in networking with others. You might find something useful on Facebook (that’s why I chose to write ‘might find’), but your best bet will be LinkedIn. There you will find various groups on your topic of interest. And please, if you are using LinkedIn, ensure that you have, if not a well-written profile, then at least a decent one. Once you have read enough using any of the aforementioned means and you can tell yourself that you now know a substantial amount about the topic, only then should you go and get yourself a subscription to any of the paid databases. Ahh.. and how could I forget, maintain a blog..it’s a healthy activity and will keep you happy, trust me…

Well, this is all for now. Do provide your views on it and be well.

How to write an abstract?

To begin with, this post is for my use only. It might serve as a blatant no for the readers of this column who, even though I wish I had them in the thousands, are quite few….

Anyway, continuing further, I have embarked on a tedious journey and this blog will serve as my flag-post, a constant reminder of major issues that I might forget during this long arduous journey.

I found this nice article on a website and am summarising it below:

The abstract has become very important these days, because no one has the time to read your long paper; honest, that’s the truth. Unless you write a few sentences that make the reader jump into their car seat and make a mad rush to the library to read your full paper, the abstract that you have written is trash. So here are some key points that you have got to remember while you sell your paper to the research community!

  1. Motivation: You get motivated after you have read something and identified a gap; that gap is your problem statement, which you write first, followed by your results. This section should include the importance of your work, the difficulty of the area that you’re studying and the impact that it might have if you are successful.
  2. Problem statement: What’s the problem that you are trying to solve, and what is its scope? Take care not to use too much jargon.
  3. Approach: How did you solve the problem, what important variables did you control, ignore or measure? Did you use simulation, analytic models, prototype construction, or analysis of field data for an actual product?
  4. Results: What’s the answer?
  5. Conclusions: What are the implications of your answer? Is it going to change the world (unlikely), be a significant “win”, be a nice hack, or simply serve as a road sign indicating that this path is a waste of time (all of the previous results are useful)?

Theo Priestley: Unlocking Big Data Silos Through Integration


