We'll start by loading some image data from the MNIST8M dataset. Since we have more resources now, we'll load 100k images instead of the 2-4k we've been working with in other labs. First, some important imports:
import numpy as np
import h5py    # for binary (HDF5) I/O
import time    # for wall-clock timing
import timeit
import math
Then let's load a block of images. While we're doing that, we'll time the process. (Note: Python has other timing routines, but they often measure something other than wall-clock time, e.g. CPU time.) Using time.time() gives us a direct measure of elapsed running time.
t = time.time()
train = np.loadtxt("/data/MNIST8M/parts/alls00.fmat.txt")
dt1 = time.time() - t
dt1
Which is a long time. Once we start dealing with large datasets, file formats become important. Let's compare with the time it takes to load the same data in binary format. We'll use HDF5, a standard binary file format for scientific data. It's accessible from Python (via the h5py package), from BIDMach, and from Matlab.
t = time.time()
f = h5py.File("/data/MNIST8M/parts/alls00.mat")
train = f['mat'][:]
dt2 = time.time() - t
dt2
Much better. On top of that, the binary file (which is transparently compressed) is 30 MB vs. 300 MB for the text version of the file. We could have compressed the text file as well, but that would make reading it even slower.
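For reference, here is roughly how such a compressed HDF5 file could be written with h5py. The filename and compression settings below are illustrative assumptions, not necessarily how this particular file was produced:

import numpy as np
import h5py

data = np.random.rand(1000, 784).astype(np.float32)  # stand-in for real images
# The gzip filter compresses each chunk on write; HDF5 decompresses
# transparently on read, which is why the loading code above is unchanged.
with h5py.File("example.mat", "w") as fout:
    fout.create_dataset("mat", data=data, compression="gzip", compression_opts=4)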
With larger datasets, you should use binary data representations whenever possible. Even if the data is text, you can represent it with integer word ids plus a dictionary. The dictionary is typically a small fraction of the size of the original text, and processing integer data is orders of magnitude faster.
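As a minimal sketch of that encoding (the documents and variable names here are made up for illustration):

import numpy as np

docs = ["the cat sat on the mat", "the dog sat"]

# Build a dictionary mapping each distinct word to an integer id,
# and encode every document as an array of those ids.
word_ids = {}
encoded = []
for doc in docs:
    ids = []
    for word in doc.split():
        if word not in word_ids:
            word_ids[word] = len(word_ids)
        ids.append(word_ids[word])
    encoded.append(np.array(ids, dtype=np.int32))

# word_ids is the (small) dictionary you store once; encoded is the
# compact integer representation you process instead of raw strings.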
Now let's train a KMeans model. Check the scikit-learn help page for this algorithm for an explanation of what these options mean.
from sklearn.cluster import KMeans
dim = 100   # number of clusters
kmodel = KMeans(n_clusters=dim, init='random', n_init=1, max_iter=10)
Let's time the model fitting:
t = time.time()
kmodel.fit(train)
dt = time.time() - t
dt
Now let's load another chunk of the data for testing:
f = h5py.File("/data/MNIST8M/parts/alls01.mat")
test = f['mat'][:]
To get the score of the clustering, we call the score function. For KMeans that returns the negative of the total sum of squared distances from the test points to their nearest cluster centroids, so higher (closer to zero) is better. This number scales with the number of test points, so to get a more meaningful figure we divide by the number of test points.
sc = kmodel.score(test) / test.shape[0]
print("Time=", dt, " score=", sc)
How do cluster quality and running time vary with the number of clusters (our dim)? Let's try...
dim = 300
kmodel300 = KMeans(n_clusters=dim, init='random', n_init=1, max_iter=10)
t = time.time()
kmodel300.fit(train)
dt = time.time() - t
sc = kmodel300.score(test) / test.shape[0]
print("Time=", dt, " score=", sc)
This cell takes about 4 minutes.
The score improved, but the time went up 10x while dim only went up 3x. That's not good: the running time seems to be growing quadratically with dim, when it should be linear (each kMeans iteration costs O(n·k·d) for n points, k clusters, and d features). This is probably related to scikit-learn's automatic distance precomputation. Let's try:
dim = 1000
kmodel = KMeans(n_clusters=dim, init='random', n_init=1, max_iter=10)
t = time.time()
kmodel.fit(train)
dt = time.time() - t
sc = kmodel.score(test) / test.shape[0]
print("Time=", dt, " score=", sc)
WARNING: Slow cell, you don't need to wait.
Which should eventually give you something like: Time= 807.521463871 score= -1623958.78602
This is good, and is back on track for linear complexity growth with dim: going from 300 to 1000 clusters is a 3.3x increase, and the running time went up by roughly the same factor (from about 240 seconds to about 807).
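If you want to probe the precomputation hypothesis directly: older scikit-learn releases (before 1.0) exposed a precompute_distances option on KMeans that could be switched off. It was deprecated in 0.23 and later removed, so treat this as a historical sketch:

# Only valid on scikit-learn < 1.0, where this parameter still existed.
kmodel300np = KMeans(n_clusters=300, init='random', n_init=1,
                     max_iter=10, precompute_distances=False)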
Switch to the BIDMach notebook now. You don't have to shut this one down: you can open another terminal to your instance and run bidmach notebook there. But if you leave this notebook running, make sure you check the port that bidmach notebook starts on. It will be something other than 8888, so point another browser window at
http://localhost:portnum
Let's load the model we just saved from BIDMach:
f = h5py.File("/data/MNIST8M/model.mat")
modelmat = f['model'][:]
modelmat.shape
kmodel300.cluster_centers_.shape
kmodelmat = kmodel300.cluster_centers_   # keep a copy of sklearn's centroids
kmodel300.score(test)
# The loaded model is transposed relative to sklearn's
# (n_clusters, n_features) cluster_centers_ layout, so transpose it first.
kmodel.cluster_centers_ = np.transpose(modelmat)
kmodel.score(test)
This difference is not significant. If you run KMeans again, you'll find the score varies due to the different random initializations.
You can save and shutdown this notebook now.