Before you Start
This lab involves some computing on modest datasets. You should set your VM to use at least 1.5 GB of RAM. If that's not possible, you'll have to shrink the datasets (reduce their number of columns) to be able to run the lab.
Once you've checked memory size and restarted if needed, grab the lab from the icon at the top right.
For this lab, we'll use a dataset of hand-written digit images called MNIST. You can download the dataset from here. Put it in a lab6 directory, and do:
tar xvzf MNIST.tar.gz
which will produce an MNIST subdirectory with several files. The data comprises training and test image sets, plus training and test labels (ictrain and ictest).
We'll be using numpy for this part of the lab, so let's import it:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
We can read matrices in text format with numpy's loadtxt command. Text file input is slow, so this may take a few seconds.
train0=np.loadtxt("MNIST/train.fmat.txt")
Let's look at the shape of this matrix:
train0.shape
The columns of the matrix are 28 * 28 images, stored flattened as single columns of 784 values since we read the data as a 2d array. Let's look at one of the images. We'll also convert the array to a matrix, so that more mathematical operators are defined on it. If you have trouble running the lab, reduce the dataset size by decreasing the column bound of 4000 below.
train = np.asmatrix(train0[:,0:4000])
and we'll save the number of training examples for later:
ntrain = train.shape[1];ntrain
We can take any of the columns of the array, and reshape it to a 28 x 28 matrix:
img = train[:,0].reshape([28,28])
To look at the image, we can use Matplotlib's imshow function. We give it the image, plus a color map which maps the data values 0..255 to grayscale colors:
plt.imshow(img,cmap='gray')
Try some different images:
img2 = train[:,7].reshape([28,28])
plt.imshow(img2,cmap='gray')
TODO: Now create a 10 x 10 grid of the first 100 images and show it below. It's probably easiest to make a 280 x 280 array and fill it with the individual images. Be sure to scale your displayed image so that it is big enough to see the individual digits, i.e., so that it fills most of the browser window.
Next we need to compute the pairwise distances between our test and training sets. Load the test set and check its dimensions:
test0=np.loadtxt("MNIST/test.fmat.txt")
test=test0[:,0:2000]
ntest=test.shape[1];ntest
The squared Euclidean distance between two vectors x and y is the squared length of their difference. We can write it as
(x-y) dot (x-y)
When working with matrices, we can also use a matrix formula, which is
transpose(x-y) * (x-y)
We'll use the shorthand ".T" for the transpose, and then we can write:
(x-y) dot (x-y) = x dot x - 2 * x.T * y + y dot y
This formula works when x and y are single columns. We chose this particular combination of dot products and a transpose product so that it generalizes to matrices of points.
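If you want to convince yourself of this identity, here is a minimal sketch (not part of the lab; the small random vectors are purely illustrative) that checks the expansion numerically:
x = np.asmatrix(np.random.rand(5,1))           # a random 5 x 1 column vector (illustrative only)
y = np.asmatrix(np.random.rand(5,1))
direct = (x - y).T * (x - y)                   # squared distance, direct form
expanded = x.T * x - 2 * x.T * y + y.T * y     # expanded form
direct, expanded                               # the two 1 x 1 results should agree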
We would like to efficiently compute an ntest x ntrain matrix of all pairwise squared distances using this formula. The first term consists of the x dot products, which we can compute like this:
xdots = np.sum(np.multiply(test,test),0).reshape(ntest,1) * np.ones((1,ntrain))
Note that we had to multiply by a row vector of ones to replicate the x dot product for each y. The formula for the y dot products is similar:
ydots = np.ones((ntest, 1)) * np.sum(np.multiply(train,train),0).reshape(1,ntrain)
Finally, we can complete the distance formula by adding the cross-terms:
dists = xdots + ydots - 2 * np.transpose(test) * train
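As an optional sanity check (purely illustrative, not required for the lab), you can compare one entry of dists against a directly computed squared distance for a single test/train pair:
i = 0; j = 0                                   # arbitrary test and train indices (illustrative)
diff = np.asmatrix(test[:,i]).T - train[:,j]   # difference of the two image columns
dists[i,j], (diff.T * diff).item()             # these should agree up to rounding error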
The raw distances will be larger for larger input instances, which is not what we want. We can improve the distance estimate by normalizing by the lengths of test and train instances:
TODO: produce a normalized distance matrix ndists by dividing by the Euclidean lengths of the test and train points. Hint: you already have the squared Euclidean lengths of all the test and train points.
ndists =
Now the i^th row of the distance matrix is the vector of distances from the i^th test point to the training points. To get the closest neighbors, we can sort this array along its rows (axis = 1). The argsort function returns the indices of the closest points, which we can use to look up their labels.
ibest = np.argsort(ndists,axis=1); ibest
To look up the distances along a given row, we can use the ibest array to provide the column (training point) indices.
irow = 0
ndists[irow,ibest[irow,:]]
And then to get their labels we should load the label data matrix:
labels0 = np.loadtxt("MNIST/ictrain.imat.txt")
labels = labels0[0:ntrain]
labels[ibest[irow,:]]
TODO: What label is the best match for test point 0?
Now we'll restrict ourselves to the k nearest neighbors. We currently have all the neighbors, sorted by distance, as columns of the ndists and ibest matrices. Let's produce the labels of only the k nearest, which is simply:
k = 3
knlabels = labels[ibest[:,0:k]]; knlabels
To get the k-nearest distances, we can sort the distances directly:
kndists = np.sort(ndists, axis=1)[:,0:k]; kndists
Now, let's find the majority vote for each test point. First we need to count the votes in a matrix whose columns are the possible digits (0-9). You'll need to do this with loops, since Python lacks a general accumulate function.
votes = np.zeros((ntest,10))
TODO: write code to fill out the votes array with the predicted labels using the k nearest neighbors
votes
Then we can find the position of the maximum vote in each row, which is also the predicted digit label:
bestlabels = np.argmax(votes, axis=1); bestlabels
How well did we do? Let's load the test labels and see:
testlabels0 = np.loadtxt("MNIST/ictest.imat.txt")
testlabels = testlabels0[0:ntest]
We can directly compare bestlabels (predicted) with testlabels using == to find matches. Matching labels produce a 1, and non-matching labels a 0. To find the fraction of matches, we simply take the mean of these 0-1 values.
accuracy = np.mean(bestlabels == testlabels); accuracy
Now let's try some other values of k. To speed things up, we'll skip the expensive sort step, which we've already done anyway.
k = 3
votes = np.zeros((ntest,10))
# your code for vote counting here
bestlabels = np.argmax(votes, axis=1)
np.mean(bestlabels == testlabels)
TODO: What was the best value of k <= 10? What is the corresponding accuracy?
Now let's try weighting by inverse distance. Modify the last compute cell to use the inverse distance (you probably want to use the kndists matrix) as each neighbor's vote.
k = 3
votes = np.zeros((ntest,10))
# your code for vote counting here
bestlabels = np.argmax(votes, axis=1)
np.mean(bestlabels == testlabels)
TODO: Now what is the best value of k <= 10? What is the corresponding accuracy?
Read the documentation for Naive Bayes in Scikit-learn here: http://scikit-learn.org/stable/modules/naive_bayes.html This section uses Scikit-learn, which is installed on your VM. Unfortunately, this module depends on the nose testing package, which isn't installed. Go to a shell prompt and type:
sudo pip install nose
We want to classify our image data, and so we have to choose a feature distribution. Although the MNIST data are real-valued 0..255 brightness values, most values are either 0 or 255, so we'll threshold them to produce binary features. Then we can use the Bernoulli Naive Bayes classifier.
We also need to transpose the data: Scikit-learn expects data points to be rows rather than columns.
btrain = np.transpose(train > 128)
btest = np.transpose(test > 128)
Now import the BernoulliNB classifier:
from sklearn.naive_bayes import BernoulliNB
nb = BernoulliNB()
This creates a new Bernoulli Naive Bayes classifier. Now train the classifier on your binarized training data:
model = nb.fit(btrain, labels)
Use the model to predict the labels of the binarized test data:
nbpreds = model.predict(btest)
And evaluate the predictions against the actual labels:
np.mean(nbpreds == testlabels)
TODO: How did the accuracy of Naive Bayes compare with the kNN classifier on this data?
TODO: How well do you think this dataset fits the assumptions of the kNN and Naive Bayes models respectively?
Let's dig a bit deeper into the results from both classifiers. One useful tool is a confusion matrix: this is a matrix indexed by the true and predicted labels, which holds the normalized counts for each combination. We can form the matrix as follows:
We start with a matrix C such that C[i,j] is a count of the points with predicted label i and true label j. We then normalize C so that its columns add up to 1. Then the j^th column corresponds to true label j, and is a vector of the probabilities that a point with true label j is classified as each of the possible labels.
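To make the direction of the normalization concrete, here is a tiny made-up example (toy counts, not MNIST data) showing how dividing by the column sums makes each column add up to 1:
toy = np.array([[8., 1.], [2., 9.]])     # made-up 2 x 2 count matrix (illustrative only)
toy / np.sum(toy, axis=0).reshape(1,2)   # each column now sums to 1: [[0.8, 0.1], [0.2, 0.9]]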
def confCount(a0, b0):
    # a0 holds the true labels, b0 the predicted labels
    a = a0.astype('int64')
    b = b0.astype('int64')
    # counts[i,j] is the number of points with predicted label i and true label j
    counts = np.zeros((10,10))
    for i in range(0,a.size):
        counts[b[i],a[i]] += 1
    return counts
In the cell below, we compute the confusion matrix for the Naive Bayes classifier.
ccounts = confCount(testlabels, nbpreds)
conf = ccounts / np.sum(ccounts, axis=0).reshape(1,10)
conf
TODO: Now, in the cell below, create a grayscale visualization of the confusion matrix, i.e., a 10 x 10 image whose elements are in the range 0..255 and are 255 times the values of the confusion matrix.
TODO: Using the visualization or the matrix itself, answer these questions:
Q1: which digit is misclassified most often? How did you derive your answer from the confusion matrix?
Q2: How many mislabeled pairs (a true digit paired with an incorrect predicted label) have probability > 0.05?
Enter your lab responses here