Let's try running KMeans using BIDMach instead. That will allow us to go quite a bit further. First, we need to import some classes and test the hardware. These lines are standard and will normally be at the start of any BIDMach code. When run from the command line, they are executed by a startup script.
import BIDMat.{CMat,CSMat,DMat,Dict,FMat,FND,GMat,GDMat,GIMat,GLMat,GSMat,GSDMat,HMat,IDict,Image,IMat,LMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,KMeansw,LDA,LDAgibbs,Model,NMF,SFA,RandomForest}
import BIDMach.networks.{DNN}
import BIDMach.datasources.{DataSource,MatDS,FilesDS,SFilesDS}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}
Mat.checkMKL
Mat.checkCUDA
We'll load BIDMach's own binary-format files later as well. First, let's grab HDF5 files for the train and test sets:
val train:FMat = load("/data/MNIST8M/parts/alls00.mat","mat")
val test:FMat = load("/data/MNIST8M/parts/alls01.mat","mat")
BIDMach is geared toward training and tuning larger models than Scikit-Learn. For that reason, its interface is rather different. First we define a learner class with some options:
val (kmlearner, opts) = KMeans.learner(train)
A learner is actually a mixture of classes that comprise a model, a datasource, an optimization strategy, and possibly other "mixins" like regularizers. The options object contains all the options for these various pieces, and is self-describing:
opts.what
Some of these are obvious, some are obscure. You don't normally need to touch the latter kind. Having an options object allows you to change one or two options and re-run the learner without entering a long command line each time.
opts.dim = 300 // Dimension = number of centroids
Now let's train the model:
kmlearner.train
Much faster than Scikit-Learn, for this model anyway. Let's check model quality. Since BIDMach reports a different loss (mean-squared value) than Scikit-Learn, the results are not directly comparable. To compare models across systems, it's best to use the same system to evaluate both models.
We can do that by saving BIDMach's model and passing it over to Scikit-Learn:
saveAs("/data/MNIST8M/model.mat",FMat(kmlearner.modelmat),"model")
Now go back to your Python notebook and continue from where you stopped before.
So far we've been limited to datasets that can fit in memory. But most large datasets won't. Luckily, many state-of-the-art machine learning algorithms process data in minibatches that can be streamed off the disk. The MNIST files we have processed so far comprise 100,000 images. The full dataset has 8 million images, stored as 81 files of 100k images each.
With BIDMach you can process arbitrary-sized datasets by streaming them from disk. Let's set up a learner to do this:
val (kmlearner, opts) = KMeans.learner("/data/MNIST8M/parts/alls%02d.fmat.lz4", 300)
The first argument is a filename pattern: substituting an integer into the printf-style "%02d" field yields the name of each actual file in the group (as illustrated below). The second argument is the model dimension.
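You can expand the pattern by hand with Scala's printf-style formatting to see which files the learner will read (just an illustration; the datasource does this internally):

val pattern = "/data/MNIST8M/parts/alls%02d.fmat.lz4"
pattern format 0 // "/data/MNIST8M/parts/alls00.fmat.lz4"
pattern format 9 // "/data/MNIST8M/parts/alls09.fmat.lz4"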
opts.nend = 10 // Limit of how many files to process
opts.npasses = 10 // Number of passes over the data
kmlearner.train
So in around 1/5 the time, we were able to process 10x more data compared to Scikit-Learn. But we haven't reached a good operating point for the GPU yet. Let's try a higher-dimensional model:
opts.dim = 1000
kmlearner.train
We just more than tripled the model dimension, but the running time only went up by 50%. Notice that the gflops, the algorithm's compute throughput, doubled. Graphics processors generally run faster when there is more parallelism in the problem. Here, the greater the model dimension, the more work can be done concurrently.
But how good is this model? We can check by using the centroids for k-NN prediction.
The dataset was already designed for this. If you look carefully at the matrix "train", it has 794 rows, only 784 of which are image features. The first ten rows contain a scaled, one-hot encoding of each point's label. Look at the first 10 rows in the cell below:
TODO: look at the first 10 rows. Why do you think they were encoded this way? Hint: we want images with the same label to be in the same cluster, so that centroids have an unambiguous label.
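For instance, a minimal peek at that block of the matrix (the column range here is just an arbitrary illustration):

train(0->10, 0->8) // rows 0..9 (the label encoding) for the first 8 images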
Once we have the model matrix, we have to clean it a little, since it might contain some empty centers. We can do:
val mm0 = FMat(kmlearner.modelmats(0)) // grab the main model matrix as FMat (CPU float matrix)
val mm1 = FMat(kmlearner.modelmats(1)) // this is an auxiliary model matrix
val igood = find(sum(mm0, 2) > 0) // find non-empty centroids
kmlearner.modelmats(0) = mm0(igood,?) // filter the model's rows
kmlearner.modelmats(1) = mm1(igood,?)
And let's load some test data that doesn't overlap the training set:
val test = loadFMat("/data/MNIST8M/parts/alls70.fmat.lz4")
This data actually includes the labels as well as the image data. Let's split them up:
val (dmy, labels) = maxi2(test) // Get the indices of max values, which are the one-hot elements in rows 0..9
test(0->10,?) = 0 // Clear those rows so that the data are unlabeled
labels(0->100) // Show the first 100 labels
We make an array to hold the model's predictions on the test data:
val pcats = izeros(1, test.ncols)
val (kmp, popts) = KMeans.predictor(kmlearner.model, test, pcats)
kmp.predict
pcats(0->100)
The above are the cluster indices for each test point. We still need a mapping from cluster center number to label. That's already encoded in the model matrix mm0. The maxi2 function finds the max value along an axis (1 for columns, 2 for rows); its first output holds the max values, and its second output holds their indices.
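Here is a quick illustration of that axis convention on a toy matrix (not part of the tutorial data):

val a = FMat(1\3 on 4\2) // 2x2 matrix with rows (1,3) and (4,2)
val (vc, ic) = maxi2(a, 1) // per-column maxima (4,3); their row indices are (1,0)
val (vr, ir) = maxi2(a, 2) // per-row maxima (3,4); their column indices are (1,0)

Applied along the rows of the model matrix, the max index picks out the label slot of each centroid: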
val (dmy,ibest0) = maxi2(kmlearner.modelmats(0),2)
val ibest = ibest0.t
Let's compare the predictions with the actual labels. pcats holds the cluster numbers; ibest maps from a cluster number to a label:
mean(float(labels == ibest(pcats)))
Let's try again with a larger number of clusters.
TODO: enter the requested code below. (A solution sketch follows this list, if you get stuck.)
// Remove the model's empty clusters
// Compute the cluster --> label mapping again
// Make a predictor
// Evaluate
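Here is one possible solution sketch, mirroring the earlier cells (the new value of opts.dim is just an example):

opts.dim = 3000 // example: a larger number of clusters
kmlearner.train
val mm0b = FMat(kmlearner.modelmats(0)) // main model matrix
val mm1b = FMat(kmlearner.modelmats(1)) // auxiliary model matrix
val igoodb = find(sum(mm0b, 2) > 0) // keep only non-empty centroids
kmlearner.modelmats(0) = mm0b(igoodb,?)
kmlearner.modelmats(1) = mm1b(igoodb,?)
val (dmyb, ibestb) = maxi2(kmlearner.modelmats(0), 2) // centroid --> label map
val ibest2 = ibestb.t
val pcats2 = izeros(1, test.ncols) // holds the predicted cluster per test point
val (kmp2, popts2) = KMeans.predictor(kmlearner.model, test, pcats2)
kmp2.predict
mean(float(labels == ibest2(pcats2))) // fraction predicted correctly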
So in less than two minutes (half the time Scikit-Learn took for 300 dims), we ran a model with 10x more dimensions on 10x the amount of data, i.e., roughly a 200x performance improvement. Model memory is an issue (recent GPUs have only about 12GB of memory), but dataset size isn't, since the minibatch design allows us to stream data from local disk (or HDFS). For models that fit in memory, it's difficult to beat single-GPU performance, even with large compute clusters. Some recent benchmarks are here.
GPU performance is also improving quite fast. Although the above numbers are good, the most recent GPUs (NVIDIA's Titan-X) are about 4x as fast for the calculation above, and nearly 1000x faster than Scikit-Learn on this dataset.