Data Clustering- Theory

Theoretical concepts are essentially the building blocks towards a bigger picture. I’m writing this post so that I keep reverting back to it whenever in doubt. I like to learn from the bottom-up approach. Get the basics correct and then build castles. I will be updating this post time and again also as the adage goes “*All work and no play makes jack a dull boy”*. Therefore, this post will be followed by a practical implementation of the theoretical concepts stated here.

Data clustering sometimes also called as cluster analysis is an unsupervised process for creating a group of objects or clusters such that objects in one cluster are similar to each other within the cluster but dissimilar to others in different clusters. And most of the clustering algorithms are very sensitive to their initial assumption. Distance and similarities play an important role in clustering. In literature, it’s the similarity coefficient’s and dissimilarity measures that are used to define the degree and a quantitative measure of similarity or difference between two data points or two clusters.

Clustering process usually involves four design phases that are as follows;

- Phase I- Data Representation: data can be represented in various ways like image, audio, video, text, number etc. These are known as data types. They are broadly divided into two types a.
*Discrete*and*Continuous. Discrete data type*the data can take a finite number of values.*Discrete data type*is further classified as*Nominal*– finite number of possible discrete values and*Binary*– only two possible values i.e. 0 and 1.*Binary data type*is of two types*Symmetrical*– these are nominal variables and*Asymmetrical*– In this type one value carries more importance than the other. Continuous- data can take infinite number of values as a continuous stream - Phase II- Data Modelling– In this phase the criteria that separate the desired group structure from each other and the clusters are defined.
- Phase III- Data Optimisation
- Phase IV- Data Validation

Clustering problems are divided in to two types

- Hard clustering – also known as Crisp clustering. In this type a data point belongs to one and only one cluster
- Soft clustering – also known as Fuzzy clustering. In this type a data point may belong to two or more clusters based on some probability.

In general clustering algorithms are of two types

- Hierarchical clustering algorithm- is of two types
*a.**Divisive hierarchical clustering algorithm*– In this the algorithm proceeds from top to bottom. It means that it initially begins with a single large cluster that contains all the data points within the dataset and continues splitting clusters.*b. Agglomerative hierarchical clustering algorithm*– In this the algorithm proceeds from bottom to top. It starts with each cluster containing one data point and continues merging the clusters until one large cluster is formed.

*Note:** for large dataset’s hierarchical clustering algorithms are impractical because these methods take O (n ^{2}) memory space and O (n^{3}) CPU time (Zait and Messatfa, 1997) where n is the number of data points. *

*Example: *Assume you have a dataset that contains 200 attributes and 20 million records. Your computer has a 4GB RAM and 50GB free hard disk space. You apply a hierarchical clustering algorithm on this dataset. Cool, now go and sleep and wake up the next morning for the computer will still be churning it.

Dealing with missing values

Real life dataset typically have this problem of inequality in the values of attributes in a given dataset then its wise to follow data pre-processing steps first. (Fujikawa and Ho, 2002) have classified it into two groups as;

- Group a-Pre-processing methods- which replace missing values before the data mining process
- Group b- Embedded methods- deal with missing values during the data mining process

They have also given three cases in which missing values can occur in a dataset

- Missing values occur in several variables
- Missing values occur in a number of records
- Missing values occur in randomly in variables and records

Below table gives a list of methods to deal with missing values as per the cases given above

Method | Group | Attribute type | Case | Cost |

Mean-and-mode method | a. | Numerical & Categorical | 2. | Low |

Linear regression | a. | Numerical | 2. | Low |

Standard deviation method | a. | Numerical | 2. | Low |

Nearest neighbor estimator | a. | Numerical & Categorical | 1. | High |

Decision tree imputation | a. | Categorical | 1. | Middle |

Auto associative neural network | a. | Numerical & Categorical | 1. | High |

Casewise deletion | b. | Numerical & Categorical | 2. | Low |

Lazy decision tree | b. | Numerical & Categorical | 1. | High |

Dynamic path generation | b. | Numerical & Categorical | 1. | High |

C4.5 | b. | Numerical & Categorical | 1. | Middle |

Surrogate split | b. | Numerical & Categorical | 1. | Middle |

They have proposed three cluster based algorithms to deal with missing values based on the mean-and-mode method

1) **NCBMM (Natural Cluster Based Mean-and-Mode) algorithm**

This is used for supervised data. It uses the class attribute to divide objects into natural clusters and uses the mean and mode of each cluster to fill in the missing values of objects in that cluster depending on the type of attribute. Since most clustering algorithms are unsupervised type, NCBMM cannot be applied directly.

2) **RCBMM (attribute Rank Cluster Based mean-and-mode Method)**

This is used for categorical attributes and is independent of the class attributes. RCBMM uses a distance function between two attributes. This distance function can be found by using Pearson’s co-relation co-efficient or Chi-Square statistic or optimal class prediction or Group based distance or a combination of these. It works as follows, given a missing attribute x

- Step1: rank all categorical attributes by their distance function from x. find the attribute with smallest distance to x. this attribute will be used for clustering.
- Step2: all records are divided into clusters, each of which contains records with the same value of selected attributes as derived in step1.
- Step3: the mode of each cluster is used to fill in the missing values.

This process is applied to each missing attribute.

**KMCMM (K-means Cluster Based Mean-and mode Method)**

This is used for filling in the missing values for numerical attributes. It’s independent of the class attribute.

So by using any of the aforementioned methods you can have a uniformly distributed dataset. Once this is done you can apply a data standardization method like mean or median or standard deviation or range of Huber’s estimate or Tukey’s bi-weight estimate or Andrew’s wave estimate to transform the data to be dimensionless. It’s pertinent to standardize the dataset because in cases where the dissimilarity measure like the Euclidean distance is used in the dataset they are sensitive to differences in magnitude or scale of the input variable.

**How to compute the distance between attributes or the distance co-efficient**

Now we discuss how to compute the distance co-efficient between two attributes. There are several of these types. I will just mention them here, for more details on it the reader is suggested to refer to literature.

- Pearson’s co-relation co-efficient
- Chi-square statistic- used for nominal variables. Used to represent the joint distribution of two categorical variables.
- Optimal class prediction- these are of two types asymmetrical and symmetrical.
- Group based distance (proposed by Mantaras, 1991)- To calculate the distance between two categorical variables in a dataset the distance between their partitions can be used.

Now I discuss the scale conversion.

**Data Scale Conversion**

Typically in a dataset, there may be many variables of different types like ordinal, nominal, binary, categorical, numeric, interval etc. to compare variables of different types it’s required to convert them to different scales. For the moment I’m concerned with numeric variables. To convert numeric to categorical can be done by k-means, CURE, CLARANS, BIRCH, DENCLUE

**Data Standardization**

Preparing the data for clustering requires some sort of data transformation so as to make the data dimensionless. It’s pertinent to standardize the data because in cases where the dissimilarity measure like Euclidean distance is sensitive to differences in magnitude or scales of the input variable. Some well-known standardization methods are mean, median, standard deviation, range, Huber’s estimate, Tukey’s bi-weight estimate and Andrew’s wave estimate.

*Note:**Data standardization concentrates on variables while Data transformation concentrates on the whole datasets.*

**Data Transformation**

- Principal Component Analysis- its main purpose is to reduce the dimensionality of a high dimension dataset consisting of a large number of interrelated variables and at the same time to retain as much as possible of the variation present in the dataset. When applied on a dataset, it results in Principal Components (PCs) which is/are new variables in the dataset that are un-correlated and ordered such that they retain most of the variation present in all of the original variables in the dataset.
- SVD– also known as linear projection technique. It’s widely used in data compression and visualization.
- Karhunen-Loeve Transformation– like PCA, this too has an optimal way to project n-dimensional points to lower dimensional points such that the error of projections (Sum of Squared Distance (SSD)) is minimum.

**Similarity & Dissimilarity Measurement**

The basics

- A similarity coefficient indicates the strength of the relationship between two data points (Everitt, 1993). The more the two data points resemble one another, the larger the similarity co-efficient is.

Let x=(x_{1},x_{2},….x_{d}) and y=(y_{1},y_{2}, ….y_{d}) in a two dimensional data point. Then the similarity co-efficient between the two data points x and y will be some function of these attribute value i.e.

S(x,y)=s(x_{1},x_{2},….x_{d}, y_{1},y_{2}, ….y_{d})

Similarity is usually symmetric i.e. s(x,y)=s(y,x)

- Proximity matrix- is a matrix that contains the pairwise indices of a dataset. Usually proximity matrices are symmetrical.
- Proximity graph- is a weighted graph where the nodes are the data points being clustered and the weighted edges represent the proximity indices between the points. A directed graph corresponds to an asymmetrical proximity while and undirected graph corresponds to a symmetrical proximity graph.

**References**

Everitt, B. (1993). * Cluster analysis, *3^{rd} edition. New York, Toronto: Haslsted Press.

Fujikawa, Y. and Ho, T. (2002). Cluster based algorithms for dealing with missing values. In Cheng, M.-S.,Yu, P.S., and Liu, B., editors, *Advances in Knowledge Discovery and Data Mining, Proceedings of the 6 ^{th} Pacific-Asia Conference, *PAKDD 2002, Taipei, Taiwan, volume 2336 of

*Lecture Notes in Computer Science,*pages 549-554. New York; Springer.

Mantaras, R. (1991). A distance-based attribute selection measure for decision tree induction. In *Machine Learning, *volume 6, pages 81-92. Boston: Kluwer Academic

Zait, M. and Messatfa, H. (1997). A comparative study of clustering methods. * Future Generation Computer Systems,*13(2-3):149-159