Error: Python version 2.7 required, which was not found in the registry- Solved

Python error

Today, while installing SciPy for Python 2.7, I came across this error message on my Windows 7 64-bit system.
Please note, I already have Python 2.7 installed.
The solution is:

1. Open Notepad and copy and paste the following into it. (The first line is the standard .reg file header; without it, Registry Editor will refuse the import.)

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Python\PythonCore\2.7\Help\Main Python Documentation]
@="Python 2.7"

2. Save the file to the desktop as key.reg.

3. Open the Start menu, type regedit in the search box, and press Enter.

4. The Registry Editor window opens. Click on the File menu and then click on Import. Navigate to the desktop and import the file named key.reg. You will see a message saying the key was successfully imported.

5. Try installing SciPy again. It should now work.

Clustering With K-Means in Python

Ashish Dutt:

Very interesting read.

Originally posted on The Data Science Lab:

A very common task in data analysis is that of grouping a set of objects into subsets such that all elements within a group are more similar among them than they are to the others. The practical applications of such a procedure are many: given a medical image of a group of cells, a clustering algorithm could aid in identifying the centers of the cells; looking at the GPS data of a user’s mobile device, their more frequently visited locations within a certain radius can be revealed; for any set of unlabeled observations, clustering helps establish the existence of some sort of structure that might indicate that the data is separable.

Mathematical background

The k-means algorithm takes a dataset X of N points as input, together with a parameter K specifying how many clusters to create. The output is a set of K cluster centroids and a labeling…

View original 898 more words
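For readers who want to experiment, the assignment-and-update loop that k-means performs can be sketched in plain Python. This is a minimal illustration only, not the implementation from the original post:

```python
import random

def kmeans(X, K, iters=100, seed=0):
    """Naive k-means on a list of 2-D points X, producing K centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(X, K)  # initialise centroids from the data itself
    labels = []
    for _ in range(iters):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [min(range(K),
                      key=lambda k: (p[0] - centroids[k][0]) ** 2 +
                                    (p[1] - centroids[k][1]) ** 2)
                  for p in X]
        # Update step: move each centroid to the mean of the points assigned to it.
        for k in range(K):
            members = [p for p, l in zip(X, labels) if l == k]
            if members:
                centroids[k] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, labels

# Two well-separated blobs: the centroids should land near (0.5, 0.5) and (10.5, 10.5).
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels = kmeans(data, 2)
```

Real implementations (e.g. scikit-learn's KMeans) add smarter initialisation and convergence checks, but the two alternating steps are exactly these.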

Data Processing with Weka (Part II)

Today, I will discuss and elaborate on data processing in Weka 3.6 (it’s the same in version 3.7 too). This post is the second part in the series “Data pre-processing with Weka”. If you have not seen my earlier post, I suggest reading that first.

Continuing further, assume that you have cleaned the data at hand and it’s now “noise free”. The next step is to process it. Now, how do we go about doing this?

Often when we have a dataset, the most common question that comes to mind is: which attributes are most closely related to each other, such that a relationship between them could be defined? (An attribute is synonymous with a column heading, and an instance with a row of records.) Let’s take the soybean dataset that comes free with Weka to answer these questions.

Click on Edit as shown in the picture to see whether the data has missing values in it.

Missing values

As is evident in the screenshot here, this soybean dataset is noisy. The same can be seen in this screenshot when it’s opened in Weka.

Missing values in Weka

So the first step is to clean it. You can either check my previous post on data cleaning, or clean the dataset manually. To clean the data manually, you will first have to save the dataset in .CSV (comma-separated values) format. This is shown below.

csv format

Then open it in Microsoft Excel, manually search for outliers in the data, and remove them.

How to manually delete outliers in MS Excel

In case you don’t know how to do this, here is a short tutorial on the same.

Step 1: Open the dataset in Microsoft Excel

Step 2: Ensure that the column headings row is selected.

column headings

Step 3: From the Editing section of the ribbon at the top of the screen, click on the drop-down beside “Sort & Filter” as shown in the screenshot, and click on “Filter”.

filter position

You will see that each of your column headings now has a filter on it.

filter added

Step 4: Click on a filter and check whether the column has an outlier value, such as a ? mark or any other value that you think is incomplete, then delete it as shown in the figure.

delete outliers

Note: Please ensure that you are deleting the rows, not just pressing the Delete key on the keyboard. That only clears the values, and your data will still contain blank cells. To delete the rows that contain outliers, first select the rows, then right-click on the selection and, from the drop-down menu, click “Delete Rows”.

Step 5: Save the file

Step 6: Now open this file in Weka. You will see, as shown below in the figure, that there are no missing values in the dataset. Save it in .ARFF format and you are done.

No missing values
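If your dataset is large, the same row-wise cleaning can be scripted instead of done by hand in Excel. Here is a minimal sketch using only Python’s standard csv module (the file names are placeholders); it drops every row containing the “?” missing-value marker:

```python
import csv

def drop_rows_with_missing(in_path, out_path, marker="?"):
    """Copy a CSV file, keeping the header row and skipping any
    data row that has the missing-value marker in any column."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))   # always keep the header row
        for row in reader:
            if marker not in row:       # drop rows with "?" in any field
                writer.writerow(row)
```

Note that this removes whole rows, exactly as the manual Excel steps above recommend, rather than just blanking out the offending cells.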

Now we are ready to answer the question we asked in the beginning. To reiterate: how do we find the attributes that are related to each other and could constitute a relationship? Looking at this soybean dataset, we have 36 attributes and 562 instances.

attribute & instances count

Next, to find the relationships between attributes: in the Weka Explorer, click on the “Select attributes” tab, then under Attribute Evaluator click the “Choose” button and select “FilteredAttributeEval”. You will see a dialog box; click Yes. Below that, make sure you choose “No class” from the drop-down menu, as shown in the figure.

filter attribute

Click on Start to find the attributes related to each other. If you have followed the steps above correctly, you should see an output similar to the one shown.

attribute list

Hope you learnt something meaningful and that I was able to bring a smile to your face. If so, do leave a comment. See you soon; until then, take care.

Java IO Exception when using Weka CSVLoader

Well, today I was trying to load a CSV file in Weka when I got the dreaded error message. I had seen this Java error message, “wrong number of values. Read 28, expected 18, read Token[EOL], line 25”, on an earlier occasion too, and that time, having no clue how to fix it, I ended up deleting most of the columns of the dataset. Perhaps a brief background first on what causes this problem. If you have very noisy data, especially if you are trying to consolidate a dataset from a data source, then typically you will be copying several columns and pasting them into a .xls or .csv file. Well, I was attempting to do something similar too.

Java Error

So when I fired up Weka and wanted to convert the .csv file to .arff format using the Weka Arff Viewer, I got the above error. To solve it, all you have to do is check for any trailing commas at the end of the text, any double quotes in between or anywhere within the text, or, simply put, any form of punctuation marks within your dataset. The cause of this error is shown in the picture.

java error cause

As you can see, this text in the data contains all sorts of punctuation marks. So, to resolve this Java error, remove any punctuation marks from your dataset and then try loading the data again in Weka. It will work.
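Removing the punctuation by hand is tedious on a big file, so here is a small sketch of the same fix in Python (standard library only; the file names are placeholders). It parses each field properly and strips punctuation from it before rewriting:

```python
import csv
import string

def strip_punctuation(in_path, out_path):
    """Rewrite a CSV file with all punctuation removed from every field,
    so stray commas or quotes inside text cannot confuse Weka's CSVLoader."""
    table = str.maketrans("", "", string.punctuation)
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow([field.translate(table) for field in row])
```

Because the csv module handles the quoting while reading, a field like "hi, there!" comes out as the single clean value hi there, and every output row ends up with the same number of columns.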

Hope this helped you.


Fix: Arff file not recognised or Unable to load data in Weka

If you are a beginner with Weka, then one of the most common problems you might face is being unable to load your dataset into Weka, because it gives you an error such as “Reason: premature end of file, read Token[EOF], line 1”, “Arff file not recognised”, or some other similar error.
In one of my previous posts, I had provided a link to a blog that answered this question. However, today, while helping someone else with a similar problem, I realised that the most important step (Step 2 below) was missing from the instructions on that blog. Therefore, I decided to write this post for the benefit of all.
The solution to this error is given below.
Step 1: Open your data file in Excel/Access or whatever its format is.
Step 2: Save the dataset file in CSV format. To save the file in .csv format, do the following steps.

Save as
Click on File –> Save As –> click on the drop-down menu beside ‘Save as type’ and change it to CSV. You will see a dialog saying something like “Some features will be lost blah blah blah”… Just click on Yes.
Step 3: Now go to Weka and then click on Tools–> Arff Viewer 
Step 4: In the Arff Viewer window, click on File –> Open, go to the location where you saved the data file in .CSV format, choose the data file, and click on Open.

fileOpened
Step 5: Voilà, your data file will open in the Arff Viewer. Now all you have to do is save it in Arff format by changing its extension to .arff and choosing Arff under ‘Save as type’.
Now, the reason you were getting that Java error: the instructions on the blog I had directed you to did not state that you first have to convert the data file, whether it is in Excel, Access, or some other format, to .CSV (comma-separated values) format. Always remember: for Weka to open your data file, your dataset should first be converted into CSV format. Only then will Weka be able to load it in the ARFF Viewer and subsequently let you save it to .arff format.
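The Arff Viewer does the conversion for you, but for the curious, the CSV-to-ARFF mapping itself is simple enough to sketch. The version below is a simplification I wrote for illustration: it declares every attribute as a string, whereas Weka’s own converter infers numeric and nominal types:

```python
import csv

def csv_to_arff(csv_path, arff_path, relation="dataset"):
    """Write a minimal ARFF file from a CSV file, treating every
    column as a string attribute (a simplification of what Weka does)."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    with open(arff_path, "w") as out:
        out.write("@relation %s\n\n" % relation)
        for name in header:                      # one @attribute line per column
            out.write("@attribute %s string\n" % name)
        out.write("\n@data\n")
        for row in data:                         # quote values in the data section
            out.write(",".join("'%s'" % v for v in row) + "\n")
```

Seeing the @relation / @attribute / @data layout spelled out also makes the “premature end of file” error easier to understand: Weka expects exactly this structure, which a raw CSV file does not have.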
Hope you found this useful.

Data Pre-processing with Weka (Part-1)

Please download and install Weka 3.7.11 from this URL

Some sample datasets for you to play with are present here or in Arff format

A Weka dataset needs to be in a specific format such as ARFF or CSV. How to convert to .arff format was explained in my previous post on clustering with Weka.

Step 1: Data Pre Processing or Cleaning

  1. Launch Weka-> click on the tab Explorer
  2. Load a dataset. (Click on “Open File” & locate the datafile)
  3. Click on the Preprocess tab, then, in the drop-down at the bottom right of the window, choose “No Class”
  4. Click on the “Edit” tab; a new window opens that shows the loaded data file. By looking at your dataset you can also find out whether there are missing values in it. Also note the attribute types in the column headers: each will be either ‘nominal’ or ‘numeric’.

4.1 If your data has missing values, it is best to clean it before you apply any form of mining algorithm to it. Looking below at Figure 1, you will see that the highlighted fields are blank: that means the data at hand is dirty and first needs to be cleaned.

missingValue

Figure: 1

4.2 Data Cleaning: To clean the data, you apply “Filters” to it. Generally the data will have missing values, so the filter to apply is “ReplaceMissingWithUserConstant” (the filter choice may vary according to your need; for more information, please consult the resources). Click on the Choose button below Filters -> unsupervised -> attribute -> ReplaceMissingWithUserConstant.

Please refer below to Figure: 2 to know how to edit the filter values.

Figure: 2 


A good choice for replacing missing numeric values is a constant like -1 or 0, and for string values it could be NULL. Refer to Figure 3.

Figure: 3 
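To make the effect of this filter concrete, here is a small sketch in Python of the idea behind it. This is an illustration only, not Weka’s implementation: every “?” marker in a numeric column becomes -1, and in any other column becomes NULL:

```python
def replace_missing(rows, numeric_cols, num_const=-1, str_const="NULL"):
    """Replace '?' missing-value markers in a table of rows:
    columns listed in numeric_cols get num_const, all others get
    str_const (mirroring the idea of ReplaceMissingWithUserConstant)."""
    cleaned = []
    for row in rows:
        cleaned.append([
            (num_const if i in numeric_cols else str_const) if v == "?" else v
            for i, v in enumerate(row)
        ])
    return cleaned
```

The key property, in the filter and in this sketch alike, is that real values pass through untouched; only the missing-value markers are rewritten.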


It’s worthwhile to also know how to check the total number of data values or instances in your dataset.

Refer to Figure: 4.

Figure: 4 checkTotalInstances

So, as you can see in Figure 4, the number of instances is 345446. The reason I want you to know about this is that later, when we apply clustering to this data, Weka will crash with an “OutOfMemory” error.

It logically follows, then: how do we partition or sample the dataset so that we have a smaller dataset that Weka can process? For this, again, we use the Filter option.

4.3 Sampling the Dataset : Click Filters-> unsupervised-> and then you can choose any of the following options below

  1. RemovePercentage – removes a given percentage from dataset
  2. RemoveRange- removes a given range of instances of a dataset
  3. RemoveWithValues
  4. Resample
  5. ReservoirSample

To learn about each of these, place your mouse cursor on its name and a tool-tip will appear explaining it.

For this dataset I am using the ‘ReservoirSample’ filter. In my experiments I have found that Weka is unable to handle a sample size of 999999 or greater. Therefore, when sampling your data, I suggest choosing a sample size less than or equal to 9999. The default sample size is 100; change it to 9999 as shown below in Figure 5, then click the Apply button to apply the filter to the dataset. Once the filter has been applied, look at the Instances value (also shown in Figure 6): the sample size is now 9999, compared with the previous full count of 345446 instances.

Figure: 5

Figure: 6
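The idea behind the ReservoirSample filter is the classic reservoir-sampling technique (Algorithm R): one pass over the data keeps a uniform random sample of fixed size, no matter how many instances stream past. A minimal sketch of that idea, not Weka’s code:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from an iterable of
    unknown length, using a single pass (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # replace with decreasing probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

This is why the filter can cut 345446 instances down to 9999 without ever holding the whole dataset’s sample candidates in memory at once.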


If you now click on the “Edit” tab at the top of the Explorer screen, you will see the cleaned dataset: all missing values have been replaced with your user-specified constants. Please see Figure 7 below. Congratulations! Step 1 of data pre-processing, or cleaning, is complete.


Figure: 7

It’s always a good idea to save the cleaned dataset. To do so, click on the Save button as shown below in Figure: 8.


 Figure: 8

I hope you enjoyed reading and experimenting. The next part will be on data processing. Do leave your comments.

Keywords explained for PSLC DataShop

PSLC Datashop

Knowledge Component

A knowledge component is a piece of information that can be used to accomplish tasks, perhaps along with other knowledge components. “Knowledge component” is a generalization of everyday terms like concept, principle, fact, or skill, and of cognitive-science terms like schema, production rule, misconception, or facet.

Each step in a problem requires the student to know something, a relevant concept or skill, to perform that step correctly. In DataShop, each step can be labelled with the hypothesized knowledge component needed to perform it.

Every knowledge component is associated with one or more steps. In DataShop, one or more knowledge components can be associated with a step. This association is typically originally defined by the problem author, but researchers can provide alternative knowledge components and associations with steps, also known as a Knowledge Component Model.

Learning Curve

A learning curve is a line graph displaying opportunities across the x-axis, and a measure of student performance along the y-axis. As a learning curve visualizes student performance over time, it should reveal improvement in student performance as opportunity count (i.e., practice with a given knowledge component) increases.

Measures of student performance available in learning curves are Error Rate, Assistance Score, number of incorrects or hints, Step Duration, Correct Step Duration, and Error Step Duration.
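For intuition, an error-rate learning curve boils down to a group-by over transaction data: at each opportunity count, what fraction of attempts were errors? A sketch of that computation (DataShop produces these curves for you; this is only an illustration of the metric):

```python
from collections import defaultdict

def error_rate_curve(transactions):
    """transactions: (opportunity_number, was_error) pairs for one
    knowledge component. Returns {opportunity: error rate}, the
    fraction of attempts at each opportunity that were errors."""
    errors = defaultdict(int)
    totals = defaultdict(int)
    for opportunity, was_error in transactions:
        totals[opportunity] += 1
        errors[opportunity] += int(was_error)
    return {opp: errors[opp] / totals[opp] for opp in totals}
```

If learning is occurring, the values of this dictionary should generally decrease as the opportunity number grows.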


Step

A step is an observable part of the solution to a problem. Because steps are observable, they are partly determined by the user interface available to the student for solving the problem.

In the example problem above, the correct steps for the first question are:

  • find the radius of the end of the can (a circle)
  • find the length of the square ABCD
  • find the area of the end of the can
  • find the area of the square ABCD
  • find the area of the left-over scrap

This whole collection of steps comprises the solution. The last step can be considered the ‘answer’, and the others as ‘intermediate’ steps.

Step List

The step list table describes all problems in a dataset. It provides a detailed listing of problem hierarchy (the unit and section divisions that contain the problem) and composition (the steps that make up a problem). Rows in the step list table are ordered alphabetically by problem hierarchy, problem name, and problem step.

It is not required, however, that a student complete a problem by performing only the correct steps—the student might request a hint from the tutor, or enter an incorrect value. How do we characterize the actions of a student who is working towards performing a step correctly? These actions are referred to as transactions and attempts.


Transaction

A transaction is an interaction between the student and the tutoring system.

Most reports in DataShop support analysis by knowledge component model, while some currently support comparing values from two KC models simultaneously; see the predicted values on the error-rate Learning Curve, for example. We plan to create new features in DataShop that support more direct knowledge component model comparison.

