from sklearn import datasets
from sklearn.metrics import adjusted_mutual_info_score
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib notebook
We are going to use the functions that scipy provides for hierarchical agglomerative clustering, and we will continue working with the iris dataset.
iris = datasets.load_iris()
plt.figure(figsize=(8,8))
# Petal length (x) vs. sepal width (y), colored by the true class labels
plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=iris['target'], s=100);
From the plot we can see that the classes given by the labels do not form well-separated clusters, so it is going to be difficult for hierarchical clustering to recover them.
First we apply single-linkage clustering to the iris dataset. Take note of the running time so that we can compare it with that of the other criteria.
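In single linkage the distance between two clusters is the distance between their closest members,

$$d(A, B) = \min_{a \in A,\ b \in B} \lVert a - b \rVert,$$

which tends to produce elongated, chained clusters.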
%time clust = linkage(iris['data'], method='single')
plt.figure(figsize=(15,15))
dendrogram(clust, distance_sort=True, orientation='right');
CPU times: user 5.79 ms, sys: 6.9 ms, total: 12.7 ms Wall time: 6.34 ms
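The linkage matrix returned by `linkage` encodes the whole merge history; here is a quick, minimal sketch of its contents, using only the `clust` variable defined above:

# Each row of the linkage matrix describes one merge:
# [cluster index 1, cluster index 2, merge distance, size of the new cluster].
# Indices >= n (here n = 150) refer to clusters created by earlier merges.
print(clust.shape)  # (149, 4): n - 1 merges for n = 150 points
print(clust[:5])    # the five closest merges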
The dendrogram only gives evidence of two distinct clusters, and the long chains characteristic of single linkage are also visible. If we cut the dendrogram to obtain three clusters we get the following:
plt.figure(figsize=(8,8))
# Cut the dendrogram into (at most) three flat clusters
clabels = fcluster(clust, 3, criterion='maxclust')
plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=clabels ,s=100);
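As an aside, `fcluster` can also cut the tree at a fixed distance instead of asking for a number of clusters; a minimal sketch (the threshold 0.8 is only an illustrative value, not one read off the dendrogram above):

# Undo every merge above the given cophenetic distance threshold
dlabels = fcluster(clust, t=0.8, criterion='distance')
print("clusters at t = 0.8:", len(set(dlabels)))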
We can compare the true labels with the ones obtained by this clustering algorithm using, for example, the adjusted mutual information (AMI) score:
print("AMI= ", adjusted_mutual_info_score(iris['target'], clabels))
AMI= 0.5820928222202184
Not a very good result.
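To see where single linkage goes wrong we can cross-tabulate the true labels against the cluster labels; a minimal sketch using scikit-learn's `contingency_matrix`:

from sklearn.metrics.cluster import contingency_matrix

# Rows are the true iris classes, columns the clusters found by single linkage;
# off-diagonal mass shows which classes were merged or split.
print(contingency_matrix(iris['target'], clabels))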
Let's apply the complete-linkage criterion to the data.
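Complete linkage uses the farthest pair of members instead,

$$d(A, B) = \max_{a \in A,\ b \in B} \lVert a - b \rVert,$$

which favors compact clusters of similar diameter.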
%time clust = linkage(iris['data'], method='complete')
plt.figure(figsize=(15,15))
dendrogram(clust, distance_sort=True, orientation='right');
CPU times: user 1.72 ms, sys: 298 µs, total: 2.02 ms Wall time: 1.62 ms
Again there are two apparent clusters, but if we cut the dendrogram to three clusters we obtain something a little better.
plt.figure(figsize=(10,10))
clabels = fcluster(clust, 3, criterion='maxclust')
plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=clabels,s=100);
If we compute the adjusted mutual information score:
print(adjusted_mutual_info_score(iris['target'], clabels))
0.6963483696671463
Now we apply the average-linkage criterion to the data.
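Average linkage uses the mean of all pairwise distances between the two clusters,

$$d(A, B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} \lVert a - b \rVert,$$

a compromise between the single and complete criteria.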
%time clust = linkage(iris['data'], method='average')
plt.figure(figsize=(15,15))
dendrogram(clust, distance_sort=True, orientation='right');
CPU times: user 442 µs, sys: 1.76 ms, total: 2.2 ms Wall time: 746 µs
We cut the dendrogram again to obtain three clusters:
plt.figure(figsize=(10,10))
clabels = fcluster(clust, 3, criterion='maxclust')
plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=clabels,s=100);
In this case the adjusted mutual information score is higher than for the previous criteria.
print(adjusted_mutual_info_score(iris['target'], clabels))
0.7934250515435666
Now we apply the Ward criterion, which is based on the variance of the clusters.
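For clusters with centroids $\mu_A$ and $\mu_B$, Ward's method merges the pair whose union least increases the total within-cluster sum of squares,

$$\Delta(A, B) = \frac{|A|\,|B|}{|A| + |B|}\, \lVert \mu_A - \mu_B \rVert^2.$$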
%time clust = linkage(iris['data'], method='ward')
plt.figure(figsize=(15,15))
dendrogram(clust, distance_sort=True, orientation='right');
CPU times: user 1.87 ms, sys: 542 µs, total: 2.41 ms Wall time: 2.01 ms
plt.figure(figsize=(10,10))
clabels = fcluster(clust, 3, criterion='maxclust')
plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=clabels,s=100);
This criterion scores a little lower than the previous one. Since we do not usually have labels for a real unsupervised dataset, we will have to use an internal quality criterion to decide which method to use for clustering the data, as in the sketch below.
print(adjusted_mutual_info_score(iris['target'], clabels))
0.7578034225092115
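A minimal sketch of such an internal criterion, using the silhouette coefficient to compare the four linkage methods without looking at the labels (this is an illustration, not part of the analysis above):

from sklearn.metrics import silhouette_score

# The silhouette coefficient only needs the data and the flat cluster labels,
# so it can rank the linkage criteria without any ground truth.
for method in ('single', 'complete', 'average', 'ward'):
    Z = linkage(iris['data'], method=method)
    labels = fcluster(Z, 3, criterion='maxclust')
    print(method, silhouette_score(iris['data'], labels))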