Density-based clustering does not assume a specific shape for the clusters. Let's see what happens when we use data that does not fit into spherical clusters, for example an image.
from sklearn.datasets import load_sample_image
from sklearn.cluster import DBSCAN, KMeans
from pylab import *
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
We will transform the image into a dataset with one row per pixel, holding the pixel coordinates and its RGB values. To keep the dataset small, only every second pixel in each direction is used.
colors = 'rgbymc'
# you can use 'flower' or 'china' images
flower = load_sample_image('flower.jpg')
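# Keep every second pixel in each direction; each row is (row/2, col/2, R, G, B)
rows = range(0, flower.shape[0]-1, 2)
cols = range(0, flower.shape[1]-1, 2)
# Size the array from the ranges actually visited so no all-zero rows are left at the end
mdata = np.zeros((len(rows)*len(cols), flower.shape[2]+2))
cc = 0
for i in rows:
    for j in cols:
        mdata[cc][0] = i/2
        mdata[cc][1] = j/2
        for k in range(flower.shape[2]):
            mdata[cc][2+k] = flower[i, j, k]
        cc += 1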
plt.figure(figsize=(10,10))
plt.scatter(mdata[:, 0], mdata[:, 1], c=mdata[:, 2:]/255.0, s=2, marker='+')
plt.show()
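The same feature matrix can be built without explicit Python loops. The following is a minimal sketch, assuming the same pixel selection as the loops above; the function name `image_to_features` and the small synthetic image are illustrative and not part of the original notebook.

```python
import numpy as np

def image_to_features(img, step=2):
    """Return rows of (row/step, col/step, channels...) for every step-th pixel."""
    # Same pixel selection as the nested loops: indices 0, step, ... below shape-1
    rows = np.arange(0, img.shape[0] - 1, step)
    cols = np.arange(0, img.shape[1] - 1, step)
    # Row-major coordinate grid, matching the visiting order of the nested loops
    ii, jj = np.meshgrid(rows, cols, indexing='ij')
    coords = np.stack([ii.ravel() / step, jj.ravel() / step], axis=1)
    # Integer-array indexing picks the color of each selected pixel
    colors = img[ii.ravel(), jj.ravel()].astype(float)
    return np.hstack([coords, colors])

# Small synthetic RGB image (hypothetical data) to check the shapes
demo = np.arange(6 * 8 * 3, dtype=np.uint8).reshape(6, 8, 3)
features = image_to_features(demo)
print(features.shape)
```

With the real image this would be called as `mdata = image_to_features(flower)`.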
This is what happens when K-means looks for 3 clusters using both the coordinates and the colors
# n_jobs is no longer accepted by KMeans in recent scikit-learn releases
km = KMeans(n_clusters=3)
labels = km.fit_predict(mdata)
plt.figure(figsize=(10,10))
plt.scatter(mdata[:, 0], mdata[:, 1], c=np.array(labels)/len(np.unique(labels)), s=2, marker='+')
plt.show()
You can check what happens when you change the number of clusters
Now we will cluster only the colors (we are applying vector quantization to the color palette). You can play with the number of clusters to see how the cluster colors get closer to the original colors.
# n_jobs is no longer accepted by KMeans in recent scikit-learn releases
km = KMeans(n_clusters=3)
labels = km.fit_predict(mdata[:,2:])
ccent = km.cluster_centers_/255.0
lcols = [ccent[l,:] for l in labels]
plt.figure(figsize=(10,10))
plt.scatter(mdata[:, 0], mdata[:, 1], c=lcols, s=2, marker='+')
plt.show()
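The quantization step itself, replacing each pixel color by the nearest palette color, can be written in a few lines of NumPy. This is only a sketch: the `quantize` helper and the three-color palette are hypothetical stand-ins for `km.cluster_centers_`, and the K-means fitting that produces the palette is not reproduced here.

```python
import numpy as np

def quantize(pixels, palette):
    """Replace each pixel color by the nearest palette color (Euclidean distance)."""
    # Squared distance from every pixel to every palette entry: (n_pixels, n_colors)
    d2 = ((pixels[:, None, :] - palette[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # Indexing the palette with the label array maps every label to its color
    return labels, palette[labels]

# Hypothetical palette standing in for km.cluster_centers_
palette = np.array([[0.0, 0.0, 0.0], [255.0, 0.0, 0.0], [255.0, 255.0, 255.0]])
pixels = np.array([[10.0, 5.0, 0.0], [240.0, 20.0, 30.0], [250.0, 250.0, 240.0]])

labels, quantized = quantize(pixels, palette)
# Mean squared error between original and quantized colors
mse = ((pixels - quantized) ** 2).mean()
print(labels, mse)
```

The same indexing idiom applies to the notebook's variables: `ccent[labels]` gives the per-pixel colors directly as an array. The quantization error should shrink as the number of clusters grows, which is why the clustered image looks closer to the original with more clusters.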
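Whether or not the coordinates are included clearly changes the result: with the coordinates, spatial distance between pixels counts toward their dissimilarity; with colors only, pixels are grouped by color regardless of where they are in the image.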
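A small numeric illustration of this effect, using three made-up pixel rows in the same (row, col, R, G, B) layout as `mdata`. The helper `dist` and the pixel values are assumptions chosen for the example.

```python
import numpy as np

def dist(p, q, use_coords=True):
    """Euclidean distance between two (row, col, R, G, B) rows."""
    start = 0 if use_coords else 2  # skip the two coordinate columns if requested
    p = np.asarray(p[start:], dtype=float)
    q = np.asarray(q[start:], dtype=float)
    return float(np.linalg.norm(p - q))

a = [0, 0, 200, 30, 30]      # reddish pixel in one corner
b = [150, 150, 200, 30, 30]  # same color, far away
c = [1, 1, 30, 30, 200]      # neighbor of a, very different color

for use_coords in (True, False):
    print(use_coords, dist(a, b, use_coords), dist(a, c, use_coords))
```

With colors only, `a` and `b` are identical; once the coordinates are included, `a` is almost as far from `b` as it is from the differently colored `c`.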
For DBSCAN we will also use coordinates and colors. Selecting the parameters is now more complex, and different parameter values will yield very different results.
dbs = DBSCAN(eps=25, min_samples=100)
labels = dbs.fit_predict(mdata)
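# Number of distinct labels; one of them is the noise label -1 when noise is present
unq = len(np.unique(labels))
print(unq-1)
ecolors = np.array(labels)
# Move noise points (label -1) well above the cluster labels so they get a distinct color
ecolors[ecolors == -1] += unq+25
plt.figure(figsize=(10,10))
plt.scatter(mdata[:, 0], mdata[:, 1], c=ecolors/(unq+25), s=2, marker='+')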
plt.show()
3
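A commonly used heuristic for choosing `eps` is to sort the distance from each point to its k-th nearest neighbor and look for the "elbow" in the resulting curve. The sketch below is an assumption about how this could be applied here, not part of the original notebook: `k_distances` is a hypothetical helper, and the synthetic blobs stand in for a random subsample of `mdata` (the full pairwise distance matrix is too large to compute on every pixel).

```python
import numpy as np

def k_distances(X, k):
    """Sorted distance from each point to its k-th nearest neighbor (excluding itself)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # After sorting each row, column 0 is the point itself (distance 0)
    kth = np.sort(dist, axis=1)[:, k]
    return np.sort(kth)

rng = np.random.default_rng(0)
# Hypothetical stand-in for a subsample of mdata: two tight blobs plus scattered noise
blob_a = rng.normal(loc=0.0, scale=1.0, size=(100, 5))
blob_b = rng.normal(loc=20.0, scale=1.0, size=(100, 5))
noise = rng.uniform(-40.0, 60.0, size=(10, 5))
sample = np.vstack([blob_a, blob_b, noise])

kd = k_distances(sample, k=10)
print(kd[:3], kd[-3:])
```

Plotting `kd` with `plt.plot(kd)` shows a flat region for points inside dense areas followed by a sharp rise for isolated points; a value of `eps` near the start of the rise is a reasonable first guess. Subsampling lowers the density of the data, so the value found this way is only a rough guide and `min_samples` would need to be scaled accordingly.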