import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

X, y = make_circles(noise=.1, factor=.5)
print("X.shape:", X.shape)
print("unique labels:", np.unique(y))
Now let's plot the data again.
plt.prism() # this sets a nice color map
plt.scatter(X[:, 0], X[:, 1], c=y)
Take the first 50 examples for training and the rest for testing.
X_train = X[:50]
y_train = y[:50]
X_test = X[50:]
y_test = y[50:]
Import logistic regression and fit the model.
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Evaluate the logistic regression as we did before by plotting the decision surface and the predictions on the test data. We plot the training data as circles, colored with their true labels, and the test data as triangles, colored with their predicted labels.
plt.prism()
from utility import plot_decision_boundary
y_pred_test = logreg.predict(X_test)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred_test, marker='^')
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
plot_decision_boundary(logreg, X)
plt.xlim(-1.5, 1.5)
plt.ylim(-1.5, 1.5)
print "Accuracy of logistic regression on test set:", logreg.score(X_test, y_test)
That doesn't look as good as before. Notice that, as expected from a linear decision boundary, all test points to the right of the line are red and all to the left are green.
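The plot_decision_boundary function comes from the accompanying utility module. If you do not have that file, a minimal sketch of such a helper could look like the following (this is just a guess at what it does: predict on a grid over the two features and shade the class regions; the resolution parameter is made up and the real implementation may differ).
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundary(clf, X, resolution=200):
    # build a grid that covers the range of the two features
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, resolution),
                         np.linspace(y_min, y_max, resolution))
    # predict a class for every grid point and shade the class regions
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=.2)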
Now let us look at how K Nearest Neighbors works here. Let us import the classifier and create an object.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5) # we specify that this knn should always use 5 neighbors
knn.fit(X_train, y_train)
y_pred_test = knn.predict(X_test)
plt.prism() # gives us a nice color map
plt.xlim(-1.5, 1.5)
plt.ylim(-1.5, 1.5)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred_test, marker='^')
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
print "Accuracy of KNN test set:", knn.score(X_test, y_test)
This looks much better, which is also reflected by the test set score. You can try to change $k$ to make the prediction better.
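For example, one quick and purely illustrative way to see the effect of $k$ is to loop over a few values; the candidate values below are arbitrary, and strictly speaking $k$ should be tuned on a separate validation set, as we will do for MNIST.
# illustration only: compare a few (arbitrary) values of k on the test set
for k in [1, 3, 5, 15, 25]:
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print("k=%2d  test accuracy: %.2f" % (k, knn_k.score(X_test, y_test)))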
We'll now have a look at MNIST again.
from sklearn.datasets import fetch_openml
from sklearn.utils import shuffle

# fetch_mldata("MNIST original") is no longer available in scikit-learn;
# fetch_openml provides the same 70000 MNIST digits
mnist = fetch_openml("mnist_784", as_frame=False)
X_digits, y_digits = mnist.data, mnist.target
X_digits, y_digits = shuffle(X_digits, y_digits)
This time we use all classes, but only a small training set (because KNN predictions usually take a while). To do model selection, we also create a validation set to adjust $k$.
X_digits_train = X_digits[:1000]
y_digits_train = y_digits[:1000]
X_digits_valid = X_digits[1000:2000]
y_digits_valid = y_digits[1000:2000]
X_digits_test = X_digits[2000:3000]
y_digits_test = y_digits[2000:3000]
Now let us fit the model. For KNN, fitting really just stores the training data. Then we will evaluate on the validation set.
This time we choose $k=20$. You can find a good value later.
knn_digits = KNeighborsClassifier(n_neighbors=20)
knn_digits.fit(X_digits_train, y_digits_train)
print "KNN validation accuracy on MNIST digits: ", knn_digits.score(X_digits_valid, y_digits_valid)
After you have found a good value of $k$, you can evaluate again on the test set (only do this once, so the result stays meaningful!).
print "KNN test accuracy on MNIST digits: ", knn_digits.score(X_digits_test, y_digits_test)
To get a better understanding of the classifier, let us take a closer look at some of the mistakes it makes with $k=3$.
knn_digits = KNeighborsClassifier(n_neighbors=3)
knn_digits.fit(X_digits_train, y_digits_train)
y_digits_valid_pred = knn_digits.predict(X_digits_valid)
Get the neighbors of the validation data from the training data.
# indices into the training set of the 3 nearest neighbors of each validation point
neighbors = knn_digits.kneighbors(X_digits_valid, n_neighbors=3, return_distance=False)
Now let's look at them. Let's start with an image where the classification worked. First plot the validation image itself, then its three neighbors.
plt.rc("image", cmap="binary") # this sets a black on white colormap
# plot X_digits_valid[0]
plt.subplot(1, 4, 1)
plt.imshow(X_digits_valid[0].reshape(28, 28))
plt.title("Query")
# plot three nearest neighbors from the training set
for i in [0, 1, 2]:
    plt.subplot(1, 4, 2 + i)
    plt.title("Neighbor %d" % (i + 1))
    plt.imshow(X_digits_train[neighbors[0, i]].reshape(28, 28))
Find out where we went wrong on the validation set, so we can have a look.
wrong = np.where(y_digits_valid_pred != y_digits_valid)[0] # the != part gives a mask, the "where" gives us the indices
print "Wrong prediction on the following images: ", wrong
Now take one of the misclassified examples and visualize its 3 closest neighbors. That will hopefully give us some insight into why the error happened.
index = wrong[0]
plt.rc("image", cmap="binary") # this sets a black on white colormap
# plot X_digits_valid[index]
plt.subplot(1, 4, 1)
plt.imshow(X_digits_valid[index].reshape(28, 28))
plt.title("Query")
# plot three nearest neighbors from the training set
for i in [0, 1, 2]:
    plt.subplot(1, 4, 2 + i)
    plt.title("Neighbor %d" % (i + 1))
    plt.imshow(X_digits_train[neighbors[index, i]].reshape(28, 28))