Most of the clustering algorithms that we have seen do not scale to a large number of examples, or even to a large number of clusters
from sklearn.cluster import Birch, MiniBatchKMeans, KMeans
from pylab import *
import matplotlib.pyplot as plt
import numpy as np
from numpy import loadtxt
import time
from sklearn.metrics import adjusted_mutual_info_score
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
citypath = '/home/bejar/Data/City/' # Adjust the path to your local dataset to run
data = 'LONcrime.csv'
citypos = loadtxt(citypath+data, delimiter=',')
print(citypos.shape)
(85345, 2)
Let's see what happens when we increase the number of clusters
ltimes = []
for nc in range(10, 101, 10):
    km = KMeans(n_clusters=nc, n_init=1)
    itime = time.perf_counter()
    kmlabels = km.fit_predict(citypos)
    etime = time.perf_counter()
    ltimes.append(etime-itime)
fig = plt.figure(figsize=(15,5))
ax = fig.add_subplot(111)
plt.plot(range(10, 101, 10), ltimes)
Minibatch K-means is an incremental version of K-means that processes random batches of examples to reduce memory use and computation time. It is designed for datasets that cannot be stored in memory. Increasing the batch size can improve the results when the clusters are difficult to find; otherwise, small batch sizes obtain results that are good enough in practice.
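Since minibatch K-means is incremental, it does not even need the whole dataset at once: `partial_fit` updates the centroids one batch at a time. A minimal sketch (the chunked stream here is synthetic data standing in for a dataset too large for memory):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Hypothetical stream: pretend the data arrives in chunks that do not fit in memory
stream = [rng.normal(loc=c, scale=0.1, size=(200, 2)) for c in (0.0, 5.0, 10.0)]

mbkm = MiniBatchKMeans(n_clusters=3, batch_size=100, n_init=1, random_state=0)
for chunk in stream:
    mbkm.partial_fit(chunk)  # update the centroids with one batch at a time

# Once the centroids are fitted, new examples can be assigned as they arrive
labels = mbkm.predict(np.vstack(stream))
```

In the cells below we use `fit_predict` on the full dataset instead, which lets minibatch K-means draw its own random batches internally.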
ltimes = []
ltimeskm = []
agree = []
for nc in range(10, 101, 10):
    mbkm = MiniBatchKMeans(n_clusters=nc, batch_size=100, n_init=1, max_iter=5000)
    itime = time.perf_counter()
    mbkmlabels = mbkm.fit_predict(citypos)
    etime = time.perf_counter()
    ltimes.append(etime-itime)
    km = KMeans(n_clusters=nc, n_init=1)
    itime = time.perf_counter()
    kmlabels = km.fit_predict(citypos)
    etime = time.perf_counter()
    ltimeskm.append(etime-itime)
    agree.append(adjusted_mutual_info_score(mbkmlabels, kmlabels))
fig = plt.figure(figsize=(15,5))
ax = fig.add_subplot(121)
plt.plot(range(10, 101, 10), ltimeskm, color='b')
plt.plot(range(10, 101, 10), ltimes, color='r')
ax = fig.add_subplot(122)
plt.plot(range(10, 101, 10), agree)
BIRCH is an algorithm that compresses the data to a certain granularity using the Leader algorithm to reduce the data size, and uses an indexing data structure for fast assignment; tuning its parameters is key for good results. It also assumes that the whole dataset does not fit in memory, but that the compressed data is a good approximation of the data densities.
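The `threshold` parameter controls the granularity of the compression: it is the maximum radius a subcluster may have before a new one is started, so larger values merge more points into fewer subclusters. A small sketch on synthetic blobs (passing `n_clusters=None` skips the final clustering step, so we can inspect the subclusters directly; the thresholds used here are arbitrary illustrations):

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=5, cluster_std=0.5, random_state=0)

# With n_clusters=None, fitting returns the compressed representation itself,
# so we can see how many subclusters each threshold produces
counts = {}
for th in (0.5, 2.0):
    birch = Birch(threshold=th, n_clusters=None, branching_factor=50).fit(X)
    counts[th] = birch.subcluster_centers_.shape[0]
print(counts)  # a larger threshold gives a coarser (smaller) summary
```

If the threshold is too large the summary becomes too coarse to recover the true clusters; too small and the CF-tree grows large and slow, which is why this parameter has to be tuned to the scale of the data.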
ltimes = []
ltimeskm = []
agree = []
for nc in range(10, 101, 10):
    birch = Birch(threshold=0.02, n_clusters=nc, branching_factor=100)
    itime = time.perf_counter()
    birchlabels = birch.fit_predict(citypos)
    etime = time.perf_counter()
    ltimes.append(etime-itime)
    km = KMeans(n_clusters=nc, n_init=1)
    itime = time.perf_counter()
    kmlabels = km.fit_predict(citypos)
    etime = time.perf_counter()
    ltimeskm.append(etime-itime)
    agree.append(adjusted_mutual_info_score(birchlabels, kmlabels))
fig = plt.figure(figsize=(15,5))
ax = fig.add_subplot(121)
plt.plot(range(10, 101, 10), ltimeskm, color='b')
plt.plot(range(10, 101, 10), ltimes, color='r')
ax = fig.add_subplot(122)
plt.plot(range(10, 101, 10), agree)
The scikit-learn implementation of BIRCH does not always do a good job because the final number of clusters is obtained by applying agglomerative hierarchical clustering to the subclusters
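In scikit-learn this final step can be controlled: when `n_clusters` is an integer, agglomerative clustering is run on the subcluster centroids, but `n_clusters` also accepts any clusterer instance to use for that global step instead. A sketch on synthetic blobs (the threshold and linkage choices are illustrative, not tuned):

```python
import numpy as np
from sklearn.cluster import Birch, AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=8, cluster_std=0.5, random_state=0)

# Default behavior: an int n_clusters runs agglomerative clustering
# (Ward linkage) on the subcluster centroids
b_default = Birch(threshold=1.0, n_clusters=8).fit(X)

# Alternative: supply a clusterer instance, fitted treating the
# subclusters as new samples
b_custom = Birch(threshold=1.0,
                 n_clusters=AgglomerativeClustering(n_clusters=8,
                                                    linkage='average')).fit(X)
```

Swapping in a different global clusterer can help when the default agglomerative step groups the subclusters poorly.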
Now let's see what happens when we increase the size of the dataset. For this we are going to randomly generate spherical blobs with increasing numbers of examples.
The three algorithms are run on the data, and the time used and the agreement with the original labels are collected. The batch size of minibatch K-means and the branching factor of BIRCH are set to 5% of the dataset size; this is just a rough guess, and these parameters have to be carefully set.
If you run this cell, it will take some time to complete. You also have this code in the /Code folder, so you can run it outside the notebook and explore different options.
from amltlearn.datasets import make_blobs
ltimes = []
agree = []
step = 10000
limit = 100001
ncenters = 50
for ns in range(step, limit, step):
    blobs, labels = make_blobs(n_samples=ns, n_features=20, centers=ncenters, cluster_std=1, center_box=(-3,3))
    # KMeans
    km = KMeans(n_clusters=ncenters, n_init=1)
    itime = time.perf_counter()
    kmlabels = km.fit_predict(blobs)
    etime = time.perf_counter()
    dtime1 = etime-itime
    agg1 = adjusted_mutual_info_score(labels, kmlabels)
    # Minibatch Kmeans
    mbkm = MiniBatchKMeans(n_clusters=ncenters, batch_size=int(ns/20), n_init=1)
    itime = time.perf_counter()
    mbkmlabels = mbkm.fit_predict(blobs)
    etime = time.perf_counter()
    dtime2 = etime-itime
    agg2 = adjusted_mutual_info_score(labels, mbkmlabels)
    # Birch
    birch = Birch(threshold=5, n_clusters=ncenters, branching_factor=int(ns/20))
    itime = time.perf_counter()
    birchlabels = birch.fit_predict(blobs)
    etime = time.perf_counter()
    dtime3 = etime-itime
    agg3 = adjusted_mutual_info_score(labels, birchlabels)
    ltimes.append((dtime1, dtime2, dtime3))
    agree.append((agg1, agg2, agg3))
fig = plt.figure(figsize=(15,5))
ax = fig.add_subplot(121)
plt.plot(range(step, limit, step), [x for x,_,_ in ltimes], color='r')
plt.plot(range(step, limit, step), [x for _, x,_ in ltimes], color='g')
plt.plot(range(step, limit, step), [x for _, _, x in ltimes], color='b')
ax = fig.add_subplot(122)
plt.plot(range(step, limit, step), [x for x,_,_ in agree], color='r')
plt.plot(range(step, limit, step), [x for _, x,_ in agree], color='g')
plt.plot(range(step, limit, step), [x for _, _, x in agree], color='b')
plt.show()
As you can see, the time of K-means grows faster, while minibatch K-means and BIRCH scale in different ways; be aware that changing their parameters also changes the time cost. Looking at the agreement with the original partitions, you can see that BIRCH in general does a better job.
You can play with the size of the datasets and especially with the number of clusters; you will see that with a small number of clusters, if the data fits in memory, K-means is in general really fast.