This exercise will walk you through the process of using machine learning for facial recognition.
from __future__ import print_function, division
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# use seaborn for better matplotlib styles
import seaborn; seaborn.set(style='white')
The data we'll use consists of snapshots of the faces of world leaders, drawn from the Labeled Faces in the Wild collection. We'll fetch the data as follows:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
Use plt.imshow to plot several of the images. How many pixels are in each image? Use sklearn.model_selection.train_test_split to split the data into a training set and a test set.
faces.keys()
dict_keys(['DESCR', 'target', 'images', 'target_names', 'data'])
n_samples, n_features = faces.data.shape
print(n_samples, n_features)
1288 1850
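Since faces.images keeps the unflattened snapshots, we can read the pixel dimensions straight off its shape; as a quick sanity check, the per-image pixel count should match the 1850 features above:
# faces.data flattens each image; faces.images keeps the 2D layout
print(faces.images.shape)
# pixels per image = height * width (should equal n_features = 1850)
print(faces.images.shape[1] * faces.images.shape[2])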
print(faces.target_names)
['Ariel Sharon' 'Colin Powell' 'Donald Rumsfeld' 'George W Bush' 'Gerhard Schroeder' 'Hugo Chavez' 'Tony Blair']
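It's also worth a quick look at how many snapshots we have of each person, since the classes are far from balanced (a small sketch using the numpy import from above):
# count how many images belong to each target label
counts = np.bincount(faces.target)
for name, count in zip(faces.target_names, counts):
    print("{0:<20s} {1}".format(name, count))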
fig, axes = plt.subplots(4, 8, figsize=(12, 9))
for i, ax in enumerate(axes.flat):
    ax.imshow(faces.images[i], cmap='binary_r')
    ax.set_title(faces.target_names[faces.target[i]], fontsize=10)
    ax.set_xticks([]); ax.set_yticks([])
Let's use some dimensionality-reduction routines to try to understand the data. Just a warning: you'll probably find that, unlike in the case of the handwritten digits, the projections are a bit too jumbled to offer much insight. Still, it's always a useful step in understanding your data!
X = faces.data
y = faces.target
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
X_pca = PCA(n_components=2).fit_transform(X)
X_iso = Isomap(n_components=2).fit_transform(X)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=faces.target, cmap='Blues')
plt.title('PCA projection');
plt.scatter(X_iso[:, 0], X_iso[:, 1], c=faces.target, cmap='Blues')
plt.title('Isomap projection');
It's not obvious from these projections that the data can be well separated; on the other hand, we've reduced our 1850-dimensional data to two dimensions!
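One way to quantify how much structure survives this compression is the explained variance ratio of the PCA fit (a quick check; the exact values depend on the data):
# refit PCA, keeping the fitted object so we can inspect it
pca = PCA(n_components=2).fit(X)
# fraction of total variance captured by each of the two components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())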
Here we'll perform a classification task on our data: given a training set, we want to build a classifier that will accurately predict the labels of the test set.

- Split the data into a training set and a test set (sklearn.model_selection.train_test_split).
- Use a support vector classifier (sklearn.svm.SVC) to classify the data. Import this and instantiate the estimator.
- Fit the model to the training data, and use sklearn.metrics.accuracy_score to see how well you're doing.
- The accuracy depends on the C parameter of SVC. Look at the SVC doc string and try some choices for the kernel, for C, and for gamma. What's the best accuracy you can find?
- Finally, use sklearn.metrics.classification_report and sklearn.metrics.confusion_matrix, and plot some of the images with the true and predicted labels. How well does it do?

# split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)
(966, 1850) (322, 1850)
# instantiate the estimator
from sklearn.svm import SVC
clf = SVC()
# Do a fit and check accuracy
from sklearn.metrics import accuracy_score
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
0.40993788819875776
# Note that we can also do this:
clf.score(X_test, y_test)
0.40993788819875776
# Try out various hyper parameters
for kernel in ['linear', 'rbf', 'poly']:
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    print("{0}: accuracy = {1}".format(kernel, score))
linear: accuracy = 0.8260869565217391
rbf: accuracy = 0.40993788819875776
poly: accuracy = 0.8012422360248447
It looks like the linear kernel gives the best results.
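To tune C and gamma more systematically than the loop above, a cross-validated grid search is one option. Here is a minimal sketch using sklearn.model_selection.GridSearchCV; the parameter grid is an arbitrary starting point, not a tuned one:
from sklearn.model_selection import GridSearchCV
# an illustrative grid; widen or refine the ranges as needed
param_grid = {'kernel': ['linear', 'rbf'],
              'C': [0.1, 1, 10, 100],
              'gamma': ['scale', 0.0001, 0.001]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)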
best_clf = SVC(kernel='linear').fit(X_train, y_train)
y_pred = best_clf.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=faces.target_names))
                   precision    recall  f1-score   support

     Ariel Sharon       0.76      0.79      0.77        28
     Colin Powell       0.79      0.84      0.82        63
  Donald Rumsfeld       0.65      0.71      0.68        24
    George W Bush       0.91      0.86      0.88       132
Gerhard Schroeder       0.76      0.80      0.78        20
      Hugo Chavez       0.90      0.82      0.86        22
       Tony Blair       0.77      0.82      0.79        33

      avg / total       0.83      0.83      0.83       322
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)
array([[ 22,   4,   0,   2,   0,   0,   0],
       [  3,  53,   1,   2,   0,   1,   3],
       [  2,   2,  17,   1,   1,   0,   1],
       [  0,   7,   8, 113,   0,   1,   3],
       [  1,   0,   0,   2,  16,   0,   1],
       [  0,   1,   0,   1,   2,  18,   0],
       [  1,   0,   0,   3,   2,   0,  27]])
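The raw array is easier to read as an image; here's a minimal sketch using matplotlib's imshow (seaborn.heatmap would work just as well):
# visualize the confusion matrix: rows are true labels, columns predictions
mat = confusion_matrix(y_test, y_pred)
plt.imshow(mat, cmap='Blues')
plt.colorbar()
ticks = np.arange(len(faces.target_names))
plt.xticks(ticks, faces.target_names, rotation=90)
plt.yticks(ticks, faces.target_names)
plt.xlabel('predicted label')
plt.ylabel('true label');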
shape = faces.images.shape[-2:]
last_names = [label.split()[-1] for label in faces.target_names]
titles = ["True: {0}\nPred: {1}".format(last_names[i_test],
last_names[i_pred])
for (i_test, i_pred) in zip(y_test, y_pred)]
fig, axes = plt.subplots(4, 8, figsize=(12, 9),
subplot_kw=dict(xticks=[], yticks=[]))
for i, ax in enumerate(axes.flat):
ax.imshow(X_test[i].reshape(shape), cmap='binary_r')
ax.set_title(titles[i], fontsize=10)
It still amazes me that with such a simple algorithm, we can get ~80% prediction accuracy on data like this!