Face Recognition on a subset of the Labeled Faces in the Wild dataset

In [1]:
%pylab inline

import pylab as pl
import numpy as np

Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

The LFW dataset

Let's start by downloading the data with a scikit-learn utility function.

In [2]:
from sklearn.datasets import fetch_lfw_people

lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

Let's introspect the image arrays to find their shapes (needed for plotting with matplotlib).

In [3]:
X = lfw_people.data
y = lfw_people.target
names = lfw_people.target_names

n_samples, n_features = X.shape
_, h, w = lfw_people.images.shape
n_classes = len(names)

print("n_samples: {}".format(n_samples))
print("n_features: {}".format(n_features))
print("n_classes: {}".format(n_classes))
n_samples: 1288
n_features: 1850
n_classes: 7
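The feature count is just the flattened image size: at resize=0.4 each face comes out as a 50 × 37 pixel image (the h and w read off lfw_people.images.shape above), and 50 × 37 = 1850. A quick sanity check:

```python
# h and w as reported by lfw_people.images.shape at resize=0.4
h, w = 50, 37

# Each row of X is one flattened image, so n_features must equal h * w
n_features = h * w
assert n_features == 1850
```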

In [4]:
def plot_gallery(images, titles, h, w, n_row=3, n_col=6):
    """Helper function to plot a gallery of portraits"""
    pl.figure(figsize=(1.7 * n_col, 2.3 * n_row))
    pl.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        pl.subplot(n_row, n_col, i + 1)
        pl.imshow(images[i].reshape((h, w)), cmap=pl.cm.gray)
        pl.title(titles[i], size=12)
        pl.xticks(())
        pl.yticks(())

plot_gallery(X, names[y], h, w)

Let's have a look at the distribution of samples among the target classes:

In [5]:
pl.figure(figsize=(14, 3))

y_unique = np.unique(y)
counts = [(y == i).sum() for i in y_unique]

pl.xticks(y_unique, names[y_unique])
locs, labels = pl.xticks()
pl.setp(labels, rotation=45, size=20)
_ = pl.bar(y_unique, counts)
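As a side note, for contiguous integer labels the list comprehension above is equivalent to np.bincount:

```python
import numpy as np

y = np.array([0, 0, 1, 2, 2, 2])  # toy labels standing in for the LFW targets

counts = [(y == i).sum() for i in np.unique(y)]  # the notebook's approach
assert counts == list(np.bincount(y))            # same per-class counts
```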

Splitting the dataset for model development and then evaluation

Let's split the data into a development set and a final evaluation set.

In [6]:
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
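With test_size=0.25 and 1288 samples, the split leaves 966 faces for development and holds out 322 for the final evaluation, matching the counts reported further down:

```python
n_samples = 1288                 # total faces in the filtered dataset
n_test = int(n_samples * 0.25)   # held-out evaluation set: 322 faces
n_train = n_samples - n_test     # development set: 966 faces

assert (n_train, n_test) == (966, 322)
```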

Unsupervised Feature Extraction with Randomized PCA

To train a model we will first reduce the dimensionality of the original pictures by projecting them onto a 150-dimensional PCA space: unsupervised feature extraction.

In [7]:
from sklearn.decomposition import RandomizedPCA

n_components = 150

print("Extracting the top %d eigenfaces from %d faces" % (
    n_components, X_train.shape[0]))

pca = RandomizedPCA(n_components=n_components, whiten=True)

%time pca.fit(X_train)

eigenfaces = pca.components_.reshape((n_components, h, w))
Extracting the top 150 eigenfaces from 966 faces
CPU times: user 559 ms, sys: 69.4 ms, total: 629 ms
Wall time: 449 ms

Let's plot the gallery of the most significant eigenfaces:

In [8]:
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)

Projecting the input data onto the eigenfaces orthonormal basis:

In [9]:
X_train_pca = pca.transform(X_train)
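Under the hood the projection is centering followed by a dot product with the components, and whiten=True rescales each projected dimension to unit variance. A minimal numpy sketch on synthetic data (not the face data itself):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(20, 5)  # synthetic stand-in for the face vectors

# PCA via SVD of the centered data
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2
components = Vt[:k]                    # analogous to pca.components_
X_proj = X_centered.dot(components.T)  # analogous to pca.transform(X)

# Whitening: rescale each projected dimension to unit variance
X_white = X_proj * np.sqrt(X.shape[0] - 1) / S[:k]
assert np.allclose(X_white.std(axis=0, ddof=1), 1.0)
```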

Training a Gaussian Kernel Support Vector Machine

Let's now train a kernel Support Vector Machine on the projected data. We will run an automated parameter search to find good values for gamma and C:

In [10]:
from sklearn.svm import SVC

svm = SVC(kernel='rbf', class_weight='auto')

svm
Out[10]:
SVC(C=1.0, cache_size=200, class_weight='auto', coef0=0.0, degree=3,
  gamma=0.0, kernel='rbf', max_iter=-1, probability=False, shrinking=True,
  tol=0.001, verbose=False)

Unfortunately an SVM is very sensitive to the parameters C and gamma, and it is very unlikely that the default parameters will yield good predictive accuracy:
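The gamma parameter controls the width of the Gaussian kernel: similarity between two samples decays as exp(-gamma * ||x - y||²), so a gamma that is too large makes every pair look dissimilar while one that is too small makes every pair look alike. A minimal sketch of the kernel itself:

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel value: exp(-gamma * ||x - y||^2)."""
    x, y = np.asarray(x), np.asarray(y)
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 0.0])
assert rbf_kernel(x, x, gamma=0.001) == 1.0               # identical points: similarity 1
assert rbf_kernel(x, -x, 0.1) < rbf_kernel(x, -x, 0.001)  # larger gamma decays faster
```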

In [11]:
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.cross_validation import cross_val_score

cv = StratifiedShuffleSplit(y_train, test_size=0.20, n_iter=3)

%time svm_cv_scores = cross_val_score(svm, X_train_pca, y_train, cv=cv, scoring='f1', n_jobs=2)
svm_cv_scores
CPU times: user 15.6 ms, sys: 21.8 ms, total: 37.4 ms
Wall time: 531 ms

Out[11]:
array([ 0.73740893,  0.75845638,  0.74661801])
In [12]:
svm_cv_scores.mean(), svm_cv_scores.std()
Out[12]:
(0.74749443841822638, 0.0086149047467813464)

Fortunately we can automate the search for the best combination of parameters:

In [13]:
from sklearn.grid_search import GridSearchCV

param_grid = {
    'C': [1e3, 5e3, 1e4, 5e4, 1e5],
    'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
}

clf = GridSearchCV(svm, param_grid, scoring='f1', cv=cv, n_jobs=2)
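With 5 candidate values for C, 6 for gamma, and 3 shuffle-split iterations, the search fits 90 SVMs in total, which explains the roughly 12 s wall time reported below:

```python
n_C, n_gamma, n_cv_iter = 5, 6, 3    # sizes of param_grid and the CV splitter
total_fits = n_C * n_gamma * n_cv_iter
assert total_fits == 90              # 90 SVM training runs during the search
```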
In [14]:
%time clf = clf.fit(X_train_pca, y_train)

print("Best estimator found by grid search:")
print(clf.best_params_)
print("Best parameters validation score: {:.3f}".format(clf.best_score_))
CPU times: user 560 ms, sys: 187 ms, total: 747 ms
Wall time: 12.3 s
Best estimator found by grid search:
{'C': 5000.0, 'gamma': 0.001}
Best parameters validation score: 0.809

Final evaluation of the best model on the held-out test set

Let's start with a qualitative inspection of some of the predictions:

In [15]:
X_test_pca = pca.transform(X_test)
y_pred = clf.predict(X_test_pca)
In [16]:
def title(y_pred, y_test, target_names, i):
    pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
    true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
    return 'predicted: %s\ntrue:      %s' % (pred_name, true_name)

prediction_titles = [title(y_pred, y_test, names, i)
                     for i in range(y_pred.shape[0])]

plot_gallery(X_test, prediction_titles, h, w)
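The rsplit(' ', 1)[-1] trick in the helper splits once from the right, keeping only the last name so the gallery titles stay short:

```python
name = 'George W Bush'
surname = name.rsplit(' ', 1)[-1]  # split once from the right, keep the last token
assert surname == 'Bush'
```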
In [17]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=names))
                   precision    recall  f1-score   support

     Ariel Sharon       0.81      0.76      0.79        17
     Colin Powell       0.89      0.84      0.86        61
  Donald Rumsfeld       0.85      0.74      0.79        31
    George W Bush       0.90      0.96      0.93       134
Gerhard Schroeder       0.76      0.84      0.80        19
      Hugo Chavez       0.89      0.89      0.89        19
       Tony Blair       0.84      0.78      0.81        41

      avg / total       0.87      0.87      0.87       322


In [18]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred, labels=range(n_classes))
print(cm)
[[ 13   2   1   1   0   0   0]
 [  2  51   0   4   1   0   3]
 [  0   1  23   5   1   1   0]
 [  1   1   2 129   0   0   1]
 [  0   0   1   0  16   1   1]
 [  0   0   0   0   1  17   1]
 [  0   2   0   5   2   0  32]]
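Per-class recall and precision can be read straight off the confusion matrix: recall is the diagonal divided by the row sums (true counts per class), precision is the diagonal divided by the column sums (predicted counts). Checking George W Bush (class index 3) against the classification report above:

```python
import numpy as np

# Confusion matrix as printed above
cm = np.array([[13,  2,  1,   1,  0,  0,  0],
               [ 2, 51,  0,   4,  1,  0,  3],
               [ 0,  1, 23,   5,  1,  1,  0],
               [ 1,  1,  2, 129,  0,  0,  1],
               [ 0,  0,  1,   0, 16,  1,  1],
               [ 0,  0,  0,   0,  1, 17,  1],
               [ 0,  2,  0,   5,  2,  0, 32]])

recall = cm.diagonal() / cm.sum(axis=1)     # per-class recall
precision = cm.diagonal() / cm.sum(axis=0)  # per-class precision

assert round(recall[3], 2) == 0.96    # George W Bush recall, as in the report
assert round(precision[3], 2) == 0.90  # George W Bush precision
```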

In [19]:
pl.gray()
_ = pl.imshow(cm, interpolation='nearest')