This exercise will walk you through the process of using machine learning for facial recognition.
from __future__ import print_function, division
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# use seaborn for better matplotlib styles
import seaborn; seaborn.set(style='white')
The data we'll use consists of snapshots of the faces of world leaders, drawn from the Labeled Faces in the Wild collection. We'll fetch the data as follows:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
Use plt.imshow to plot several of the images. How many pixels are in each image? Use sklearn.model_selection.train_test_split to split the data into a training set and a test set.
faces.keys()
dict_keys(['DESCR', 'target', 'images', 'target_names', 'data'])
n_samples, n_features = faces.data.shape
print(n_samples, n_features)
1288 1850
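Since faces.images keeps the unflattened snapshots, we can read the pixel dimensions straight off its shape; as a quick sanity check, the per-image pixel count should match the 1850 features above:
# faces.data flattens each image; faces.images keeps the 2D layout
print(faces.images.shape)
# pixels per image = height * width (should equal n_features = 1850)
print(faces.images.shape[1] * faces.images.shape[2])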
print(faces.target_names)
['Ariel Sharon' 'Colin Powell' 'Donald Rumsfeld' 'George W Bush' 'Gerhard Schroeder' 'Hugo Chavez' 'Tony Blair']
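It's also worth a quick look at how many snapshots we have of each person, since the classes are far from balanced (a small sketch using the numpy import from above):
# count how many images belong to each target label
counts = np.bincount(faces.target)
for name, count in zip(faces.target_names, counts):
    print("{0:<20s} {1}".format(name, count))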
fig, axes = plt.subplots(4, 8, figsize=(12, 9))
for i, ax in enumerate(axes.flat):
    ax.imshow(faces.images[i], cmap='binary_r')
    ax.set_title(faces.target_names[faces.target[i]], fontsize=10)
    ax.set_xticks([]); ax.set_yticks([])
Let's use some dimensionality-reduction routines to try to understand the data. Just a warning: you'll probably find that, unlike in the case of the handwritten digits, the projections are a bit too jumbled to offer much insight. Still, it's always a useful step in understanding your data!
X = faces.data
y = faces.target
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
X_pca = PCA(n_components=2).fit_transform(X)
X_iso = Isomap(n_components=2).fit_transform(X)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=faces.target, cmap='Blues')
plt.title('PCA projection');
plt.scatter(X_iso[:, 0], X_iso[:, 1], c=faces.target, cmap='Blues')
plt.title('Isomap projection');
It's not obvious from these projections that the data can be well separated; on the other hand, we've reduced our 1850-dimensional data to two dimensions!
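One way to quantify how much structure survives this compression is the explained variance ratio of the PCA fit (a quick check; the exact values depend on the data):
# refit PCA, keeping the fitted object so we can inspect it
pca = PCA(n_components=2).fit(X)
# fraction of total variance captured by each of the two components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())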
Here we'll perform a classification task on our data: given a training set, we want to build a classifier that will accurately predict the labels of the test set.

- Split the data into a training set and a test set (sklearn.model_selection.train_test_split).
- Use a support vector classifier (sklearn.svm.SVC) to classify the data. Import this and instantiate the estimator.
- Fit the model to the training data, and use sklearn.metrics.accuracy_score to see how well you're doing.
- The accuracy depends on the C parameter of SVC. Look at the SVC doc string and try some choices for the kernel, for C, and for gamma. What's the best accuracy you can find?
- Finally, use sklearn.metrics.classification_report and sklearn.metrics.confusion_matrix, and plot some of the images with the true and predicted labels. How well does it do?

# split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)
(966, 1850) (322, 1850)
# instantiate the estimator
from sklearn.svm import SVC
clf = SVC()
# Do a fit and check accuracy
from sklearn.metrics import accuracy_score
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
0.40993788819875776
# Note that we can also do this:
clf.score(X_test, y_test)
0.40993788819875776
# Try out various hyper parameters
for kernel in ['linear', 'rbf', 'poly']:
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    print("{0}: accuracy = {1}".format(kernel, score))
linear: accuracy = 0.8260869565217391
rbf: accuracy = 0.40993788819875776
poly: accuracy = 0.8012422360248447
It looks like the linear kernel gives the best results.
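To tune C and gamma more systematically than the loop above, a cross-validated grid search is one option. Here is a minimal sketch using sklearn.model_selection.GridSearchCV; the parameter grid is an arbitrary starting point, not a tuned one:
from sklearn.model_selection import GridSearchCV
# an illustrative grid; widen or refine the ranges as needed
param_grid = {'kernel': ['linear', 'rbf'],
              'C': [0.1, 1, 10, 100],
              'gamma': ['scale', 0.0001, 0.001]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)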
best_clf = SVC(kernel='linear').fit(X_train, y_train)
y_pred = best_clf.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=faces.target_names))
                   precision    recall  f1-score   support

     Ariel Sharon       0.76      0.79      0.77        28
     Colin Powell       0.79      0.84      0.82        63
  Donald Rumsfeld       0.65      0.71      0.68        24
    George W Bush       0.91      0.86      0.88       132
Gerhard Schroeder       0.76      0.80      0.78        20
      Hugo Chavez       0.90      0.82      0.86        22
       Tony Blair       0.77      0.82      0.79        33

      avg / total       0.83      0.83      0.83       322
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)
array([[ 22,   4,   0,   2,   0,   0,   0],
       [  3,  53,   1,   2,   0,   1,   3],
       [  2,   2,  17,   1,   1,   0,   1],
       [  0,   7,   8, 113,   0,   1,   3],
       [  1,   0,   0,   2,  16,   0,   1],
       [  0,   1,   0,   1,   2,  18,   0],
       [  1,   0,   0,   3,   2,   0,  27]])
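The raw array is easier to read as an image; here's a minimal sketch using matplotlib's imshow (seaborn.heatmap would work just as well):
# visualize the confusion matrix: rows are true labels, columns predictions
mat = confusion_matrix(y_test, y_pred)
plt.imshow(mat, cmap='Blues')
plt.colorbar()
ticks = np.arange(len(faces.target_names))
plt.xticks(ticks, faces.target_names, rotation=90)
plt.yticks(ticks, faces.target_names)
plt.xlabel('predicted label')
plt.ylabel('true label');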
shape = faces.images.shape[-2:]
last_names = [label.split()[-1] for label in faces.target_names]
titles = ["True: {0}\nPred: {1}".format(last_names[i_test],
last_names[i_pred])
for (i_test, i_pred) in zip(y_test, y_pred)]
fig, axes = plt.subplots(4, 8, figsize=(12, 9),
subplot_kw=dict(xticks=[], yticks=[]))
for i, ax in enumerate(axes.flat):
ax.imshow(X_test[i].reshape(shape), cmap='binary_r')
ax.set_title(titles[i], fontsize=10)
It still amazes me that with such a simple algorithm, we can get ~80% prediction accuracy on data like this!