The title of this week’s essay is derived from the famous speech (“Give me blood and I promise you freedom!”) delivered by the Indian nationalist Subhash Chandra Bose in Burma on July 4, 1944.
An essay makes more sense if its title relates to its contents. Thus, after considerable debate on how to aptly title it, I finally settled on this inspirational speech, because that is exactly what k-means does: you feed it numerical data and it gives you clusters. In the remainder of this article, I will discuss what the k-means algorithm is, when it should be used, what its drawbacks are and what results it can give us.
K-Means Clustering is one of the many techniques that constitute “unsupervised learning”. “Supervised learning”, simply put, is when you are given some prior training (“supervision”) on how to proceed. Often, however, no such prior knowledge (training data) exists and we are just given a dataset and told – “go extract some insights”. This mode, where we have to find patterns in data without the guidance of prior knowledge, is called “unsupervised learning”. There is no “supervision” in the form of examples of good classification, so we have to dive in and dig out some nugget(s) of wisdom, no matter what, to get started.
All right, I get it! Tell me, how does it work?
As the name implies, it is so called because it operates by computing the “mean” of some attributes. That “mean” then becomes the center of one cluster. There is a small number, K, of such clusters. That is, the technique consists of computing “K means”, leading to a “clustering” of the data around these “K means”.
Cool, so when can I use it? I mean, in which situations can it be applied?
To answer the question: k-means clustering is applicable during the initial exploration of the data, when we do not have any training samples. For example, it can be used to cluster the purchasing patterns of customers, after which we can drill down into the clusters by building predictive models inside each cluster, e.g. using multivariate linear regression. A good discussion of the drawbacks of k-means is given in this Stack Exchange post.
So, how do I choose this K?
If we have some idea of what we are looking for, or how many clusters we expect or want, then we set K to be that number before we start the process. If we don’t know how many groups there are in the data, then our exploration will take a little longer and involve some trial and error, as we try, say, K=3, 4, 5 until we see clusters that make some sense to us in our domain.
The K-Means algorithm is iterative. It starts by choosing K points at random from the data and uses these as “cluster centers” just to get started. Then at each step it decides which cluster to assign a point to based on which cluster center is closest.
Once that is done and we have a new arrangement of points, the “center” or “mean” of each cluster is computed again, because it will most probably have shifted. When does it stop shifting? When we have a stable clustering: we iterate until further iterations bring no change, and that is our result.
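The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the implementation used in the analysis below (real implementations add smarter initialization and multiple restarts):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: random init, assign to nearest center, recompute means."""
    rng = np.random.default_rng(seed)
    # Start by choosing K points at random from the data as cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points
        # (a center that attracted no points simply stays where it was)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the centers no longer shift -- the clustering is stable
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Because the starting centers are random, the result can vary from run to run; libraries such as scikit-learn mitigate this with k-means++ initialization and several restarts.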
Choosing the optimum K in k-means clustering can be done manually, by inspecting a visual representation of the data, but this method is infeasible for a large dataset with many features. For a large dataset, automated methods are used, such as the “elbow method”, information criterion approaches (including the Akaike Information Criterion, the Bayesian Information Criterion and the Deviance Information Criterion), the information-theoretic approach, silhouette scores and cross-validation. A good treatment of this subject is presented by Tibshirani et al. in their gap-statistic paper. See this Quora post for some other good answers on the same.
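To illustrate one of these automated methods, here is a small sketch of picking K by silhouette score with scikit-learn (the helper name `best_k_by_silhouette` is my own; the silhouette lies in [-1, 1], with higher values meaning better-separated clusters):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_values=range(2, 10), seed=0):
    """Fit k-means for each candidate K and keep the K with the
    highest mean silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores
```

Note that the silhouette is undefined for K=1, so this approach can only compare K≥2.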
I will use the gapminder dataset. In previous articles, I have described this dataset at length, so any interested reader is directed to this link.
If you have been following along, you will note that I’m interested in determining the relationship between alcohol consumption and breast cancer in working women.
The potential clustering variables
The data has a total of 157 rows in 9 columns. The data was randomly split into a training set that included 70% of the observations (N=117) and a test set that included 30% of the observations (N=40). A random forest analysis was performed on the training set (N=117) to evaluate a series of explanatory variables in predicting a categorical breast cancer rate. The following explanatory variables were evaluated:
- Alcohol Consumption – 2008 recorded and estimated average alcohol consumption, adult (15+) per capita as collected by the World Health Organization
- CO2 Emissions – Total amount of CO2 emission in metric tons from 1751 to 2006 as collected by CDIAC
- Female Employment Rate – Percentage of female population, age above 15, that has been employed during 2007 as collected by the International Labour Organization
- Internet Use Rate – 2010 Internet users per 100 people as collected by the World Bank
- Life Expectancy – 2011 life expectancy at birth (in years) as collected by various sources
- Polity Score – 2009 Democracy score as collected by the Polity IV Project
- Employment Rate – Percentage of total population, age above 15, that has been employed during 2009 as collected by the International Labour Organization
- Urbanization Rate – 2008 Urban population (% total population) as collected by the World Bank
- Income per person – is the 2010 Gross Domestic Product per capita in constant 2000 US$.
The following table 1 summarises the variable importance in classifying countries into levels of breast cancer per 100th person:
Table 1: Variable Importance
| Variable | Relative Importance |
| --- | --- |
| Income per person | 15% |
| Female Employ Rate | 13% |
| Internet Use Rate | 12% |
The explanatory variables with relative importance scores greater than or equal to 10% (income per person, life expectancy, alcohol consumption, female employ rate, internet use rate, urbanization rate and employ rate) were selected as clustering variables.
All clustering variables were standardized to have a mean of 0 and a standard deviation of 1. The data was randomly split into a training set that included 70% of the observations (N=109) and a test set that included 30% of the observations (N=48). A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The mean Euclidean distance was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for selecting the number of clusters. The following figure 1 illustrates the elbow curve.
Figure 1: Selecting K with the Elbow Curve Method
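The elbow curve above was produced along these lines (a simplified sketch of the analysis, not my exact script): standardize the clustering variables, fit k-means for each candidate K, and record the mean Euclidean distance from each observation to its nearest cluster center:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

def elbow_distances(X, k_values=range(1, 10), seed=123):
    """For each candidate K, fit k-means on standardized data and return
    the mean Euclidean distance from each point to its nearest center."""
    X = scale(np.asarray(X, dtype=float))  # mean 0, std 1 per column
    mean_dist = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        # distance of every point to its closest cluster center, averaged
        d = cdist(X, model.cluster_centers_, 'euclidean').min(axis=1)
        mean_dist.append(d.mean())
    return mean_dist
```

Plotting `mean_dist` against K (e.g. with matplotlib) gives the elbow curve; the “elbow”, where the decrease flattens out, suggests a reasonable K.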
The possible levels of K suggested by figure 1 include 2, 3, 4, 6, 7 and 8 (determined by the breaks in the blue line). Each level was evaluated to see if there was a significant difference in breast cancer per 100th person between clusters. Breaking the data into 2 clusters proved to be a good choice, though 7 was not bad either. My preference for no more complication than needed led to selecting k=2.
As shown in figure 2, cluster zero has a much higher alcohol consumption per person than cluster one, along with a higher rate of breast cancer.
Figure 2: Means for feature breastcancerper100th by cluster
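Comparing the clusters on the outcome variable boils down to a group-by mean. A toy sketch with made-up numbers (the real analysis uses the cluster labels fitted on the gapminder training data):

```python
import pandas as pd

# Hypothetical cluster labels paired with breast cancer rates
df = pd.DataFrame({
    'cluster': [0, 0, 0, 1, 1, 1],
    'breastcancerper100th': [85.0, 90.0, 95.0, 30.0, 25.0, 35.0],
})

# Mean of the outcome variable within each cluster
cluster_means = df.groupby('cluster')['breastcancerper100th'].mean()
print(cluster_means)
```

A large gap between the per-cluster means, as in figure 2, is what suggests the clusters differ on the outcome; a formal test (e.g. ANOVA) is still needed to call the difference significant.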
In figure 3, I show the clusters (zero in blue and one in green) for all the clustering variables.
Figure 3: Clustering variable Scatterplots
This examination was done as an exercise in doing K-means clustering. The data in the GapMinder dataset does not lend itself to clustering. The scatter plots above illustrate the lack of distinct clusters. Caution should be exercised in using/interpreting these results.
As always, the source code can be found in this GitHub repository.
Featured Image source: Modern Medical Dictionary. 2014. Index of /wp-content/uploads/2014/03. [ONLINE] Available at:http://modernmedicaldictionary.com/wp-content/uploads/2014/03/blind_men_elephant.jpg. [Accessed 20 April 2016].