from sklearn import datasets
from sklearn.metrics import adjusted_mutual_info_score
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib notebook
We are going to use the functions that scipy provides for hierarchical agglomerative clustering, and we will continue working with the iris dataset.
iris = datasets.load_iris()
plt.figure(figsize=(8,8))
# Petal length (x) vs. sepal width (y), colored by the true class labels
plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=iris['target'], s=100);
From the plot we can see that the classes given by the labels do not form well-separated clusters, so it is going to be difficult for hierarchical clustering to recover them.
First we apply single-linkage clustering to the iris dataset. Take note of the running time so that we can compare it with that of the other criteria.
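In single linkage the distance between two clusters is the distance between their closest members,

$$d(A, B) = \min_{a \in A,\ b \in B} \lVert a - b \rVert,$$

which tends to produce elongated, chained clusters.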
%time clust = linkage(iris['data'], method='single')
plt.figure(figsize=(15,15))
dendrogram(clust, distance_sort=True, orientation='right');
CPU times: user 5.79 ms, sys: 6.9 ms, total: 12.7 ms Wall time: 6.34 ms
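The linkage matrix returned by `linkage` encodes the whole merge history; here is a quick, minimal sketch of its contents, using only the `clust` variable defined above:

# Each row of the linkage matrix describes one merge:
# [cluster index 1, cluster index 2, merge distance, size of the new cluster].
# Indices >= n (here n = 150) refer to clusters created by earlier merges.
print(clust.shape)  # (149, 4): n - 1 merges for n = 150 points
print(clust[:5])    # the five closest merges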
The dendrogram only gives evidence of two distinct clusters, and the long chains characteristic of single linkage are also visible. If we cut the dendrogram to obtain three clusters we get the following:
plt.figure(figsize=(8,8))
# Cut the dendrogram into (at most) three flat clusters
clabels = fcluster(clust, 3, criterion='maxclust')
plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=clabels ,s=100);
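As an aside, `fcluster` can also cut the tree at a fixed distance instead of asking for a number of clusters; a minimal sketch (the threshold 0.8 is only an illustrative value, not one read off the dendrogram above):

# Undo every merge above the given cophenetic distance threshold
dlabels = fcluster(clust, t=0.8, criterion='distance')
print("clusters at t = 0.8:", len(set(dlabels)))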
We can compare the true labels with the ones obtained by this clustering algorithm using, for example, the adjusted mutual information (AMI) score:
print("AMI= ", adjusted_mutual_info_score(iris['target'], clabels))
AMI= 0.5820928222202184
Not a very good result.
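To see where single linkage goes wrong we can cross-tabulate the true labels against the cluster labels; a minimal sketch using scikit-learn's `contingency_matrix`:

from sklearn.metrics.cluster import contingency_matrix

# Rows are the true iris classes, columns the clusters found by single linkage;
# off-diagonal mass shows which classes were merged or split.
print(contingency_matrix(iris['target'], clabels))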
Let's apply the complete-linkage criterion to the data.
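Complete linkage uses the farthest pair of members instead,

$$d(A, B) = \max_{a \in A,\ b \in B} \lVert a - b \rVert,$$

which favors compact clusters of similar diameter.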
%time clust = linkage(iris['data'], method='complete')
plt.figure(figsize=(15,15))
dendrogram(clust, distance_sort=True, orientation='right');
CPU times: user 1.72 ms, sys: 298 µs, total: 2.02 ms Wall time: 1.62 ms
Again there are two apparent clusters, but if we cut the dendrogram to three clusters we obtain something a little better.
plt.figure(figsize=(10,10))
clabels = fcluster(clust, 3, criterion='maxclust')
plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=clabels,s=100);
If we compute the adjusted mutual information score:
print(adjusted_mutual_info_score(iris['target'], clabels))
0.6963483696671463
Now we apply the average-linkage criterion to the data.
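Average linkage uses the mean of all pairwise distances between the two clusters,

$$d(A, B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} \lVert a - b \rVert,$$

a compromise between the single and complete criteria.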
%time clust = linkage(iris['data'], method='average')
plt.figure(figsize=(15,15))
dendrogram(clust, distance_sort=True, orientation='right');
CPU times: user 442 µs, sys: 1.76 ms, total: 2.2 ms Wall time: 746 µs
We cut the dendrogram again to obtain three clusters:
plt.figure(figsize=(10,10))
clabels = fcluster(clust, 3, criterion='maxclust')
plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=clabels,s=100);
In this case the adjusted mutual information score is higher than for the previous criteria.
print(adjusted_mutual_info_score(iris['target'], clabels))
0.7934250515435666
Now we apply the Ward criterion, which is based on the variance of the clusters.
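For clusters with centroids $\mu_A$ and $\mu_B$, Ward's method merges the pair whose union least increases the total within-cluster sum of squares,

$$\Delta(A, B) = \frac{|A|\,|B|}{|A| + |B|}\, \lVert \mu_A - \mu_B \rVert^2.$$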
%time clust = linkage(iris['data'], method='ward')
plt.figure(figsize=(15,15))
dendrogram(clust, distance_sort=True, orientation='right');
CPU times: user 1.87 ms, sys: 542 µs, total: 2.41 ms Wall time: 2.01 ms
plt.figure(figsize=(10,10))
clabels = fcluster(clust, 3, criterion='maxclust')
plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=clabels,s=100);
This criterion scores a little lower than the previous one. Since we do not usually have labels for a real unsupervised dataset, we will have to use an internal quality criterion to decide which method to use for clustering the data, as in the sketch below.
print(adjusted_mutual_info_score(iris['target'], clabels))
0.7578034225092115
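A minimal sketch of such an internal criterion, using the silhouette coefficient to compare the four linkage methods without looking at the labels (this is an illustration, not part of the analysis above):

from sklearn.metrics import silhouette_score

# The silhouette coefficient only needs the data and the flat cluster labels,
# so it can rank the linkage criteria without any ground truth.
for method in ('single', 'complete', 'average', 'ward'):
    Z = linkage(iris['data'], method=method)
    labels = fcluster(Z, 3, criterion='maxclust')
    print(method, silhouette_score(iris['data'], labels))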