Data preprocessing with R – Part II

This post is a sequel to my earlier post dated December 9th, 2014.

General background in R

I. R data types are:
a. numeric – numeric data (approximations of the real numbers, ℝ)
b. integer – integer data (whole numbers, ℤ)
c. factor – categorical data (simple classifications, like gender)
d. ordered – ordinal data (ordered classifications, like educational level)
e. character – character data (strings)
f. raw – binary data

Variable types and indexing techniques
I.A. Vector:
The most basic variable in R is a vector. An R vector is a sequence of values of the same type.
Elements of a vector can be selected or replaced using the square bracket operator [ ]. The square brackets accept a vector of names, a vector of index numbers, or a logical vector.
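For instance, all three indexing styles on a small made-up vector:

> v <- c(a = 1, b = 4, c = 9)
> v[2]
b
4
> v[c("a", "c")]
a c
1 9
> v[v > 2]
b c
4 9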

I.B. List:
A list is a generalization of a vector in that it can contain objects of different types, including other lists. There are two ways to index a list. The single bracket operator always returns a sub-list of the indexed list; that is, the resulting type is again a list. The double bracket operator ([[ ]]) can only select a single item, and it returns the object itself, not a sub-list. Besides indexing, the dollar operator $ can be used to retrieve a single element. To understand the above, check the results of the following statements.
capColor <- c(huey = "red", duey = "blue", louie = "green")  # named vector, needed for the example below
AL <- list(x = c(1:5), y = c("a", "b", "c"), z = capColor)
AL[[2]]
AL$y
AL[c(1, 3)]
AL[c("x", "y")]
AL[["z"]]
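For instance, the single versus double bracket difference shows up as:

> AL[[2]]
[1] "a" "b" "c"
> AL[2]
$y
[1] "a" "b" "c"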

I.C. Data Frame

A data.frame is not much more than a list of vectors, possibly of different types, but with every vector (now a column) of the same length. Since data.frames are a type of list, indexing them with a single index returns a sub-data.frame, that is, a data.frame with fewer columns. The dollar operator, in contrast, returns a vector, not a sub-data.frame. Rows can be indexed using two indices in the bracket operator, separated by a comma. The first index indicates rows, the second indicates columns. If one of the indices is left out, no selection is made on that dimension (so everything is returned). It is important to realize that the result of a two-index selection is simplified by R as much as possible. Hence, selecting a single column using two indices results in a vector. This behaviour may be switched off by passing drop = FALSE as an extra parameter. Here are some short examples demonstrating the above.

# the following command creates 10 observations of 3 variables
> d <- data.frame(x = 1:10, y = letters[1:10], z = LETTERS[1:10])

> d[1]
## x
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10

> d[, 1]
##[1] 1 2 3 4 5 6 7 8 9 10

> d[, "x", drop = FALSE]
## x
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10

> d[c("x", "z")]
## x z
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 6 F
7 7 G
8 8 H
9 9 I
10 10 J

> d[d$x > 3, "y", drop = FALSE]
## y
4 d
5 e
6 f
7 g
8 h
9 i
10 j

> d[2, ]
## x y z
2 2 b B

I.D. Special values
R has a number of special values that are exceptions to the normal values of a type. These are NA, NULL, ±Inf and NaN.

i. NA stands for not available. NA is a placeholder for a missing value.
The function is.na can be used to detect NAs.

> NA+1
[1] NA

> sum(c(NA, 1, 2))
[1] NA

ii. NULL You may think of NULL as the empty set from mathematics. NULL is special since it has no class (its class is NULL) and has length 0, so it does not take up any space in a vector. The function is.null can be used to detect NULL variables.
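For example:

> length(c(1, NULL, 3))
[1] 2
> is.null(NULL)
[1] TRUE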

iii. Inf stands for infinity and only applies to vectors of class numeric. A vector of class integer can never be Inf.

iv. NaN stands for not a number. This is generally the result of a calculation whose outcome is unknown, but is surely not a number. In particular, operations like 0/0, Inf-Inf and Inf/Inf result in NaN. Technically, NaN is of class numeric, which may seem odd since it is used to indicate that something is not numeric. Computations involving numbers and NaN always result in NaN.
The function is.nan can be used to detect NaNs.
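For example:

> 0/0
[1] NaN
> is.nan(c(1, NaN, NA))
[1] FALSE  TRUE FALSE
> is.na(NaN)
[1] TRUE

Note that is.na is also TRUE for NaN, while is.nan is FALSE for NA.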

2. What is technically correct data in R?
In the case of R, we define technically correct data as a data set that
– is stored in a data.frame with suitable column names, and
– each column of the data.frame is of the R type that adequately represents the value domain of the variable in that column.

Numeric data should be stored as numeric or integer, textual data should be stored as character, and categorical data should be stored as a factor or ordered vector, with the appropriate levels.

Best practice. A freshly read data.frame should always be inspected with functions like head, str, and summary.

2.1 Data preprocessing with R

A: Consider the following example. Create a .csv file with the following data in it. Save it as datafile.csv:

21,6.0
42,5.9
18,5.7*
21,<NA>

Now read the file with the read.csv command:

person <- read.csv(file = "datafile.csv", header = FALSE, col.names = c("age", "height"))

The command creates a data.frame when executed, but a problem arises because of a malformed numerical value in the data, 5.7*, which causes R to interpret the whole column as a text variable. Moreover, by default text variables are converted to factor, so we are now stuck with a height variable expressed as levels of a categorical variable:
str(person)
## 'data.frame': 4 obs. of 2 variables:
## $ age : int 21 42 18 21
## $ height: Factor w/ 3 levels "5.7*","5.9","6.0": 3 2 1 NA

Solution 1:
Using colClasses, we can force R to either interpret the columns in the way we want or throw an error when this is not possible.
read.csv("datafile.csv", header = FALSE, colClasses = c("numeric", "numeric"))
## Error: scan() expected 'a real', got '5.7*'

Solution 2:
As an alternative, columns can be read in as character by setting stringsAsFactors=FALSE.
Example:
dat <- read.csv(file = "datafile.csv", header = FALSE, col.names = c("age", "height"), stringsAsFactors = FALSE)
dat$height <- as.numeric(dat$height)
## Warning: NAs introduced by coercion

B. Cleaning a text file
Assume you have a text file with the following contents:
%% Data on the Amigo Brothers
Gratt,1861,1892
Bob,1892
1871,Emmet,1937
% Names, birth and death dates

The lines starting with % are comments, not data. To remove them, first read the lines, then detect the comment lines with grepl and negate the selection:

> txt <- readLines("datafile.txt")  # file name assumed
> i <- grepl("^%", txt)             # TRUE for lines starting with a % sign
> dat <- txt[!i]                    # keep only the data lines
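A possible next step (a sketch, assuming the remaining lines are comma-separated as above) is to split each line into fields:

> fields <- strsplit(dat, split = ",")
> fields[[1]]
[1] "Gratt" "1861"  "1892"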

C. Factors in R
In R, the value of categorical variables is stored in factor variables. A factor is an integer vector endowed with a table specifying what integer value corresponds to what level. The values in this translation table can be requested with the levels function.
> f <- factor(c("a", "b", "a", "a", "c"))
> levels(f)
## [1] "a" "b" "c"

For example, suppose we read in a vector where 1 stands for male, 2 stands for female and 0 stands for unknown. Conversion to a factor variable can be done as in the example below.

# example:
> gender <- c(2, 1, 1, 2, 0, 1, 1)
# recoding table, stored in a simple vector
> recode <- c(male = 1, female = 2)
> (gender <- factor(gender, levels = recode, labels = names(recode)))
## [1] female male male female <NA> male male
## Levels: male female
Note that we do not explicitly need to set NA as a label: every value that is encountered in the first argument but not in the levels argument is regarded as missing. Levels in a factor variable have no natural ordering.
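When a natural ordering is needed (ordinal data), an ordered factor can be used instead; a small example:

> edu <- factor(c("low", "high", "medium", "low"), levels = c("low", "medium", "high"), ordered = TRUE)
> edu[1] < edu[2]
[1] TRUE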

Reference

de Jonge, E., & van der Loo, M. (2013). An introduction to data cleaning with R. Technical Report 201313, Statistics Netherlands. URL http://www.cbs.nl/nl-NL/menu/methoden/onderzoek-methoden/discussionpapers/archief/2013/default.htm


Data Clustering – Theoretical concepts

Theoretical concepts are essentially the building blocks of a bigger picture. I'm writing this post so that I can keep coming back to it whenever in doubt. I like to learn bottom-up: get the basics correct and then build castles. I will be updating this post from time to time, and, as the adage goes, "all work and no play makes Jack a dull boy", this post will be followed by a practical implementation of the theoretical concepts stated here.

Data clustering, sometimes also called cluster analysis, is an unsupervised process for creating groups of objects, or clusters, such that objects in one cluster are similar to each other but dissimilar to objects in other clusters. Most clustering algorithms are very sensitive to their initial assumptions. Distance and similarity play an important role in clustering: in the literature, similarity coefficients and dissimilarity measures are used to define a quantitative measure of the degree of similarity or difference between two data points or two clusters.

The clustering process usually involves four design phases, as follows:

  1. Phase I – Data Representation: data can be represented in various ways, such as image, audio, video, text or numbers; these are known as data types. They are broadly divided into two kinds. Discrete: the data can take a finite number of values. Discrete data is further classified as nominal (a finite number of possible discrete values) and binary (only two possible values, i.e. 0 and 1); binary data in turn is either symmetrical (both values are of equal importance, as in nominal variables) or asymmetrical (one value carries more importance than the other). Continuous: the data can take an infinite number of values, as a continuous stream.
  2. Phase II – Data Modelling: in this phase the criteria that separate the desired group structures from each other are defined, and the clusters are characterised.
  3. Phase III – Data Optimisation
  4. Phase IV – Data Validation

Clustering problems are divided into two types:

  1. Hard clustering – also known as crisp clustering. In this type, a data point belongs to one and only one cluster.
  2. Soft clustering – also known as fuzzy clustering. In this type, a data point may belong to two or more clusters, each with some probability.

In general, clustering algorithms are of two types:

  1. Hierarchical clustering algorithms – these are of two kinds. a. Divisive hierarchical clustering: the algorithm proceeds from top to bottom; it initially begins with a single large cluster that contains all the data points in the dataset, and continues splitting clusters. b. Agglomerative hierarchical clustering: the algorithm proceeds from bottom to top; it starts with each data point in its own cluster, and continues merging clusters until one large cluster is formed (see the small illustration after this list).
  2. Partitional clustering algorithms – these divide the data directly into a single set of k clusters; k-means, discussed later in this post, is the classic example.
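To make the agglomerative flavour concrete, here is a minimal base-R illustration on five made-up one-dimensional points:

> x <- matrix(c(1, 2, 8, 9, 20), ncol = 1)   # five 1-d points
> hc <- hclust(dist(x))                      # bottom-up merging on the distance matrix
> cutree(hc, k = 2)                          # membership when the tree is cut into 2 clusters
[1] 1 1 1 1 2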

Note: for large datasets, hierarchical clustering algorithms are impractical because these methods take O(n²) memory space and O(n³) CPU time (Zait and Messatfa, 1997), where n is the number of data points.

Example: Assume you have a dataset that contains 200 attributes and 20 million records. Your computer has a 4GB RAM and 50GB free hard disk space. You apply a hierarchical clustering algorithm on this dataset. Cool, now go and sleep and wake up the next morning for the computer will still be churning it.

Dealing with missing values

Real-life datasets typically suffer from missing or unevenly populated attribute values, so it is wise to follow data pre-processing steps first. Fujikawa and Ho (2002) classified the methods for dealing with missing values into two groups:

  1. Group a – Pre-processing methods, which replace missing values before the data mining process
  2. Group b – Embedded methods, which deal with missing values during the data mining process

They also give three cases in which missing values can occur in a dataset:

  1. Missing values occur in several variables
  2. Missing values occur in a number of records
  3. Missing values occur randomly in variables and records

The table below lists methods for dealing with missing values, per the cases given above:

Method                           Group  Attribute type           Case  Cost
Mean-and-mode method             a      Numerical & Categorical  2     Low
Linear regression                a      Numerical                2     Low
Standard deviation method        a      Numerical                2     Low
Nearest neighbour estimator      a      Numerical & Categorical  1     High
Decision tree imputation         a      Categorical              1     Middle
Autoassociative neural network   a      Numerical & Categorical  1     High
Casewise deletion                b      Numerical & Categorical  2     Low
Lazy decision tree               b      Numerical & Categorical  1     High
Dynamic path generation          b      Numerical & Categorical  1     High
C4.5                             b      Numerical & Categorical  1     Middle
Surrogate split                  b      Numerical & Categorical  1     Middle

They also proposed three cluster-based algorithms for dealing with missing values, all built on the mean-and-mode method:

1) NCBMM (Natural Cluster Based Mean-and-Mode) algorithm

This is used for supervised data. It uses the class attribute to divide objects into natural clusters, and uses the mean or mode of each cluster, depending on the type of attribute, to fill in the missing values of objects in that cluster. Since most clustering algorithms are unsupervised, NCBMM cannot be applied to them directly.

2) RCBMM (attribute Rank Cluster Based Mean-and-Mode Method)

This is used for categorical attributes and is independent of the class attribute. RCBMM uses a distance function between two attributes, which can be based on Pearson's correlation coefficient, the chi-square statistic, optimal class prediction, group-based distance, or a combination of these. Given an attribute x with missing values, it works as follows:

  1. Step 1: rank all categorical attributes by their distance from x, and find the attribute with the smallest distance to x; this attribute will be used for clustering.
  2. Step 2: divide all records into clusters, each of which contains records with the same value of the attribute selected in Step 1.
  3. Step 3: use the mode of each cluster to fill in the missing values.

This process is applied to each missing attribute.

3) KMCMM (K-means Cluster Based Mean-and-Mode Method)

This is used for filling in the missing values for numerical attributes. It’s independent of the class attribute.
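To make the mean-and-mode idea underlying all three algorithms concrete, here is a minimal R sketch (not Fujikawa and Ho's implementation; the data frame, attribute names and cluster assignment are all made up) that fills numerical NAs with the cluster mean and categorical NAs with the cluster mode:

# hypothetical data: one numerical and one categorical attribute with NAs
d <- data.frame(height = c(5.6, NA, 6.1, 5.9),
                eyes = factor(c("blue", "brown", NA, "brown")))
cl <- c(1, 1, 2, 2)  # assumed cluster membership of the four records
mode_of <- function(x) names(which.max(table(x)))  # most frequent level
for (k in unique(cl)) {
  idx <- cl == k
  # numerical attribute: replace NAs by the cluster mean
  d$height[idx & is.na(d$height)] <- mean(d$height[idx], na.rm = TRUE)
  # categorical attribute: replace NAs by the cluster mode
  d$eyes[idx & is.na(d$eyes)] <- mode_of(d$eyes[idx])
}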

So by using any of the aforementioned methods you can obtain a complete dataset. Once this is done you can apply a data standardization method (mean, median, standard deviation, range, Huber's estimate, Tukey's biweight estimate or Andrew's wave estimate) to transform the data to be dimensionless. It is pertinent to standardize the dataset because dissimilarity measures such as the Euclidean distance are sensitive to differences in the magnitude or scale of the input variables.

How to compute the distance between attributes (the distance coefficient)

Now we discuss how to compute the distance coefficient between two attributes. There are several types; I will just mention them here, and for more details the reader should refer to the literature.

  1. Pearson's correlation coefficient
  2. Chi-square statistic – used for nominal variables; it represents the joint distribution of two categorical variables.
  3. Optimal class prediction – of two types, asymmetrical and symmetrical.
  4. Group-based distance (proposed by Mantaras, 1991) – the distance between two categorical variables in a dataset can be calculated as the distance between their partitions.

Now I discuss scale conversion.

Data Scale Conversion

Typically a dataset may contain variables of many different types: ordinal, nominal, binary, categorical, numeric, interval, etc. To compare variables of different types, it is necessary to convert them to a common scale. For the moment I'm concerned with numeric variables: conversion of numeric to categorical data can be done with clustering algorithms such as k-means, CURE, CLARANS, BIRCH and DENCLUE.

Data Standardization

Preparing the data for clustering requires some sort of data transformation to make the data dimensionless. It is pertinent to standardize the data because dissimilarity measures such as the Euclidean distance are sensitive to differences in the magnitude or scales of the input variables. Some well-known standardization methods are mean, median, standard deviation, range, Huber's estimate, Tukey's biweight estimate and Andrew's wave estimate.
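As a small illustration in R (the data vector is made up), mean/standard-deviation and range standardization of a single variable:

> x <- c(150, 160, 170, 180, 190)    # e.g. heights in cm
> (x - mean(x)) / sd(x)              # mean/standard-deviation (z-score) standardization
[1] -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111
> (x - min(x)) / (max(x) - min(x))   # range standardization to [0, 1]
[1] 0.00 0.25 0.50 0.75 1.00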

Note: data standardization concentrates on individual variables, while data transformation concentrates on the whole dataset.

Data Transformation

  1. Principal Component Analysis (PCA) – its main purpose is to reduce the dimensionality of a high-dimensional dataset consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the dataset. Applied to a dataset, it produces principal components (PCs): new variables that are uncorrelated and ordered so that the first few retain most of the variation present in all of the original variables (see the small example after this list).
  2. SVD (Singular Value Decomposition) – also known as a linear projection technique; it is widely used in data compression and visualization.
  3. Karhunen-Loève Transformation – like PCA, this projects n-dimensional points to lower-dimensional points in a way that minimises the error of the projections (the Sum of Squared Distances, SSD).
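For instance, PCA in R with the built-in prcomp function on the iris data, keeping the first two principal components:

> pc <- prcomp(iris[, 1:4], scale. = TRUE)  # standardize, then rotate
> summary(pc)$importance[2, 1:2]            # proportion of variance of PC1 and PC2
      PC1       PC2
0.7296245 0.2285076
> reduced <- pc$x[, 1:2]                    # data projected onto the first two PCs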

Similarity & Dissimilarity Measurement

The basics

  • A similarity coefficient indicates the strength of the relationship between two data points (Everitt, 1993). The more the two data points resemble one another, the larger the similarity coefficient is.

Let x = (x1, x2, …, xd) and y = (y1, y2, …, yd) be two d-dimensional data points. Then the similarity coefficient between x and y will be some function of their attribute values, i.e.

s(x, y) = s(x1, x2, …, xd, y1, y2, …, yd)

Similarity is usually symmetric, i.e. s(x, y) = s(y, x).

  • Proximity matrix – a matrix that contains the pairwise proximity indices of a dataset. Usually proximity matrices are symmetric.
  • Proximity graph – a weighted graph whose nodes are the data points being clustered and whose weighted edges represent the proximity indices between the points. A directed graph corresponds to an asymmetric proximity matrix, while an undirected graph corresponds to a symmetric one.
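In R, a proximity (distance) matrix for numerical data can be computed with dist; a small made-up example:

> x <- matrix(c(1, 2, 2, 4, 5, 7), ncol = 2, byrow = TRUE)  # three 2-d points
> as.matrix(dist(x))  # full symmetric matrix of pairwise Euclidean distances
         1        2        3
1 0.000000 2.236068 6.403124
2 2.236068 0.000000 4.242641
3 6.403124 4.242641 0.000000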

References

Everitt, B. (1993). Cluster Analysis, 3rd edition. New York, Toronto: Halsted Press.

Fujikawa, Y. and Ho, T. (2002). Cluster-based algorithms for dealing with missing values. In Cheng, M.-S., Yu, P. S., and Liu, B., editors, Advances in Knowledge Discovery and Data Mining, Proceedings of the 6th Pacific-Asia Conference, PAKDD 2002, Taipei, Taiwan, volume 2336 of Lecture Notes in Computer Science, pages 549-554. New York: Springer.

Mantaras, R. (1991). A distance-based attribute selection measure for decision tree induction. Machine Learning, 6:81-92. Boston: Kluwer Academic.

Zait, M. and Messatfa, H. (1997). A comparative study of clustering methods. Future Generation Computer Systems, 13(2-3):149-159.

A compendium of books, tutorials and datasets

On this page I will be posting links to all the resources that I find pertinent for my research work and for readers of this blog.

Books
1. Data Mining with R: Learning with Case Studies – a very interesting and practical book. My rating: 5 stars.
2. An Introduction to Statistical Learning with Applications in R (available for free online) by James, Witten, Hastie; Springer – a classic that requires no prior R programming experience, written in easy-to-understand language. My rating: 5 stars.
3. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations by Witten & Frank; Morgan Kaufmann – from the creators of Weka, this is perhaps the only well-established treatise on the application of Weka in a data mining context. A word of caution, however: prior knowledge of statistical concepts is a must to understand the algorithms presented in the book. If you are reading this book and need help, go here. My rating: 4 stars.

Journals
Leading data mining journals: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

Tutorials/MOOCs
Statistical Data Mining Tutorials – an exhaustive list of excellent tutorials on DM algorithms

Datasets (mostly free)
Open Discovery Space and LinkedUP – help in collecting, sharing and providing open access to educational datasets; see also the datasets from DataHub.
The Programme for International Student Assessment (PISA) – a triennial international survey which aims to evaluate education systems worldwide by testing the skills and knowledge of 15-year-old students.

Data preprocessing with R

In my previous post, I detailed how to load data into R. Continuing from there: after the data has been loaded, the next step is to clean it and apply some clustering algorithm to it, so as to reveal patterns.

Data variables are of two types: quantitative and qualitative (also known as categorical); I am not taking picture, sound or any other kind of data variable into account here. Quantitative variables take on numerical values; examples include a person's age, height or income, the value of a house, and the price of a stock. In contrast, qualitative variables take on values in one of K different classes, or categories; examples include a person's gender (male or female), the brand of product purchased (brand A, B, or C), whether a person defaults on a debt (yes or no), or a cancer diagnosis (e.g. Acute Myelogenous Leukemia).

We tend to refer to problems with a quantitative response as regression problems, while those involving a qualitative response are often termed classification problems. However, the distinction is not always that crisp. Least squares linear regression is used with a quantitative response, whereas logistic regression is typically used with a qualitative (two-class, or binary) response; as such it is often used as a classification method, but since it estimates class probabilities it can be thought of as a regression method as well. Some statistical methods, such as K-nearest neighbours and boosting, can be used in the case of either quantitative or qualitative responses. We tend to select statistical learning methods on the basis of whether the response is quantitative or qualitative, i.e. we might use linear regression when it is quantitative and logistic regression when it is qualitative; whether the predictors are qualitative or quantitative is generally considered less important.

I chose to discuss variables briefly because you have to understand the data that needs to be analysed; a proper understanding of your data variables can save you a lot of tears later. I emphasize this because today I wasted more than 3 hours trying to clean my data when the data was actually already clean. I cannot post the type of data that I'm dealing with here, but I will describe the problem I was facing and its solution.

Assume you have a dataset like the following:

Table Account

Account number   Account name   Account balance   Account type
111223334        Xyz            $89078            Current
234534578        Abc            $2345             Savings

Of course, a real-life dataset does not have only 2 rows, but this is a mere example. Now, you are tasked with searching for patterns in it. You take into consideration all four variables given in table Account. You import the data into R using the command mydata <- read.csv("C:/Users/Example/Book1.csv") (note the forward slashes; backslashes in a path would have to be escaped in R).

The command succeeds.
Next you view the data using the command
> View(mydata)
and R will show you your data in a tabulated format.

Problem
Now, you try to plot it or apply k-means or any other clustering algorithm to it, and you will get the following error:
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In kmeans(mydata, 10) : NAs introduced by coercion

Solution
So you might think, let's use the as.numeric function of R to convert all the data to numeric format, using the command

> mydata = as.numeric(mydata)

and you will get another error:

Error: (list) object cannot be coerced to type 'double'

So what is the problem here, when you know that your data is in the correct format?
The problem, my dear friend, is the choice of the data variables that you want to compare or cluster. What I mean is that in this example I have used k-means to cluster the data, and k-means understands only continuous data, so if you try to cluster the variable "Account number" it will not accept it and will give you the above errors. Therefore a judicious choice of variables is imperative for analysis. The solution: if you remove the Account number variable from the dataset (as sketched below), you will be able to cluster the data and plot it.
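For example (a sketch; after read.csv the column name is assumed to be Account.number):

> mydata2 <- mydata[, names(mydata) != "Account.number"]  # drop the identifier column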

Q. What if you don't want to lose the variable Account number, and want to use it in the clustering too?

Solution: You do not have to convert the whole data frame to numeric, because if you do, R will give you the error Error: (list) object cannot be coerced to type 'double'. Here R is saying that it cannot convert the data.frame object to numeric format, because it is a list-type object; you have to coerce it to a matrix instead. But before that, how do you check the data type of a variable in R? Use str(), as in > str(yourdatavariablename); in my case that is > str(mydata), and R gives me the attributes along with their data types.

Answer: Use the as.matrix() function to coerce the data.frame object to a matrix, as shown:

> mydata = as.matrix(mydata)

This command converts your data.frame object into a matrix, and now you are ready to apply any sort of clustering or plotting to your dataset.

How to apply k-means
Note: before you apply k-means or any other algorithm, ensure that you scale or standardise your dataset; for this use the following command:
> mydata = scale(mydata)
After this, store the k-means result in another variable:
> fit = kmeans(mydata, 10)
Now you are ready to plot it. The plotcluster function comes from the fpc package (not cluster), so install that if you don't have it:
> install.packages("fpc")
and load it:
> library(fpc)
To see the cluster results, use:
> plotcluster(mydata, fit$cluster)
Here the $ sign is used to refer to the cluster component of the k-means result.

I found a good book on learning R with statistics. The language is easy to understand and, in short, it's a great book.

That’s all for now. I hope to see you soon. Till then eat proper, stay safe and take care.

How to read CSV file into R

If you are using R much you will likely need to read in data at some point. While R can read Excel .xls and .xlsx files, these file types often cause problems; comma-separated files (.csv) are much easier to work with, so it's best to save your files as CSV before reading them into R. If you need to read a CSV into R, the best way to do it is with the command read.csv. Here is an example of how to read a CSV in R:

Step 1: Save the Excel file as a CSV file.

Step 2: On the R console, type the following command:
fileToOpen <- read.csv(file.choose(), header = TRUE)
The file.choose() command of R opens a file picker so you can select the file.
Here header is TRUE because the CSV file has column headings in it.

Step 3: Success.

The above reads the chosen file into a data frame, here called fileToOpen. header=TRUE specifies that the data includes a header row, and you may also pass sep="," to specify that the data is separated by commas (read.csv implies the same, but I think it's safer to be explicit). Also ensure that in a file path you use the forward slash ("/") instead of the usual backslash.
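Equivalently, with an explicit (hypothetical) path and separator:

> myData <- read.csv("C:/Users/Example/fileName.csv", header = TRUE, sep = ",")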

Using SSIS to transfer data from multiple SQL tables by executing a join query that writes the result to CSV

Problem: using SSIS to transfer data from multiple tables, via a SQL join query, into a single CSV file

So after the data was transferred to SQL Server, the next step was to analyze it using Weka. But Weka understands only the ARFF file format, and a file can be converted to ARFF if the data is in CSV (Comma Separated Value) format. The problem is that SQL Server does not provide any explicit means to export data to CSV format; the most you can get is Excel spreadsheet format. So the solution is to use SSIS (SQL Server Integration Services), through BIDS. One can use the Import/Export Data feature of SSMS, but it works well only for small datasets. I'm dealing here with hundreds of thousands of rows of data, and I found that Import/Export, if used, simply hangs the system. Therefore, my suggestion is to use SSIS to carry out such voluminous data transfers quickly and efficiently. Today, I will show how to use a SQL query that joins two tables and then writes the data to a CSV file dynamically.

Solution

  1. Launch SSIS and create an Integration Services project.
  2. Launch MS Excel and create a new blank worksheet. Provide the header column names in this file, then save it as a CSV file and exit. (On saving the file in CSV format you will be prompted that some features will be lost; just ignore it and click Save.)
  3. Drag a Data Flow Task and drop it in the control flow area.
  4. Right-click in the Connection Managers pane at the bottom and create an OLE DB source connection and a Flat File connection.
  5. Point the OLE DB source connection to your database in SQL Server and configure it accordingly.
  6. Point the Flat File connection manager to the CSV file created in Step 2. Make sure you place a check mark in the box at the bottom that says "Column names in the first data row".
  7. Drag and drop an OLE DB Source object and a Flat File Destination object into the data flow task area. Configure the OLE DB Source as follows:
    1. In Connection manager, choose the OLE DB connection that we created in Step 5.
    2. In Data access mode, click on the drop-down and choose SQL Command.
    3. In the SQL Command Text box, type your SQL query. (To be on the safe side, please ensure that you have parsed and executed this query in SQL Server; this validates that your SQL query is correct.)
    4. Optionally, you may click on the Parse Query button, which will tell you whether your SQL query is syntactically correct.
    5. Next, click on Columns and ensure that you see only those columns that you mentioned in your SQL query. (An interesting feature: if your database table has six columns but your query selects fewer, only the selected columns appear here.) Click on the OK button; the OLE DB configuration is now complete, and if all went OK the red X on the object vanishes.
  8. Now double-click on the Flat File Destination and configure it as follows:
    1. First ensure that the Flat File connection manager has been configured as stated in Step 6; if not, configure it first.
    2. Assuming it has been configured, ensure that this connection manager is listed in the drop-down under Flat file connection manager.
    3. Next click on Mappings and ensure that all the requisite columns are mapped properly. If all is correct, finally click on the OK button.
  9. Save your package and then run it.
  10. Voilà, the package execution is successful and you shall have your required data in CSV format, which you can now easily convert into ARFF format for Weka to parse.