Here we'll practice the process of model validation, and evaluating how we can improve our model. We'll return to the Labeled Faces in the Wild dataset that we saw previously, and use the cross-validation techniques we covered to find the best possible model.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# use seaborn plotting defaults
# If this causes an error, you can comment it out.
import seaborn as sns
sns.set()
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X, y = faces.data, faces.target
Fit a RandomForestClassifier with the default parameters, and use 10-fold cross-validation to determine the optimal accuracy. Then plot a validation curve exploring the effect of max_depth on the result. What is the optimal max_depth (approximately)? What is the best score for this estimator? Finally, plot the learning curve for this optimal max_depth.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
scores = cross_val_score(RandomForestClassifier(), X, y, cv=10)
print("score = {0:.2f} +- {1:.2f}".format(scores.mean(), scores.std()))
score = 0.58 +- 0.04
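As a sanity check of the mechanics, cross_val_score simply returns one score per fold. Here is a minimal sketch on a small synthetic dataset (make_classification is a hypothetical stand-in for the faces data, used only so this runs quickly without a download):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the faces data (illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=10, random_state=0)

# cross_val_score returns one accuracy value per fold: shape (10,) for cv=10
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X_demo, y_demo, cv=10)
print("score = {0:.2f} +- {1:.2f}".format(scores.mean(), scores.std()))
```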
# Use our plot_with_err utility routine from the lecture
def plot_with_err(x, data, color, **kwargs):
    """Plot the mean of data with a band of +/- one standard deviation."""
    mu, std = data.mean(1), data.std(1)
    plt.plot(x, mu, '-', c=color, **kwargs)
    plt.fill_between(x, mu - std, mu + std,
                     edgecolor='none', facecolor=color, alpha=0.2)
from sklearn.model_selection import validation_curve
max_depths = np.arange(1, 20, 2)
val_train, val_test = validation_curve(RandomForestClassifier(), X, y,
                                       param_name='max_depth',
                                       param_range=max_depths, cv=5)
plot_with_err(max_depths, val_train, 'red', label='train')
plot_with_err(max_depths, val_test, 'blue', label='test')
plt.xlabel('max_depth')
plt.ylabel('score')
plt.legend(loc='best');
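To read the optimal max_depth off the validation curve programmatically, take the parameter value whose mean test score is highest. A sketch with made-up score arrays (the numbers below are illustrative, not the actual curve values):

```python
import numpy as np

max_depths = np.arange(1, 20, 2)

# Hypothetical stand-in for validation_curve's test-score output:
# one row per max_depth value, one column per CV fold.
val_test = np.array([[0.40, 0.42], [0.50, 0.52], [0.55, 0.57],
                     [0.58, 0.60], [0.59, 0.61], [0.58, 0.60],
                     [0.57, 0.59], [0.56, 0.58], [0.55, 0.57],
                     [0.54, 0.56]])

# The best parameter is the one with the highest mean test score
best_depth = max_depths[val_test.mean(axis=1).argmax()]
print(best_depth)  # -> 9
```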
Apparently our model is strongly over-fitting the data (i.e. the training and test scores are separated by a large margin!)
from sklearn.model_selection import learning_curve
clf = RandomForestClassifier(max_depth=9)
train_sizes = np.linspace(0.05, 1, 20)
N_train, val_train, val_test = learning_curve(clf, X, y, train_sizes=train_sizes)
plot_with_err(N_train, val_train, 'r', label='training scores')
plot_with_err(N_train, val_test, 'b', label='validation scores')
plt.xlabel('Training Set Size'); plt.ylabel('score')
plt.legend();
It's clear here that the weakness of our model is not the model complexity, but the amount of training data available. Our model will not be data-saturated until the two lines above meet – and it will take a whole lot more data to bring the lines together!
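The learning_curve routine used above returns the actual training-set sizes alongside the per-fold train/test scores. A minimal sketch on synthetic data (a hypothetical stand-in for the faces, so it runs fast):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X_demo, y_demo = make_classification(n_samples=200, random_state=0)
train_sizes = np.linspace(0.1, 1.0, 5)

# Returns the absolute training-set sizes actually used, plus
# train/test scores of shape (n_sizes, n_folds)
N_train, train_sc, test_sc = learning_curve(
    RandomForestClassifier(random_state=0),
    X_demo, y_demo, train_sizes=train_sizes, cv=5)
print(N_train)  # increasing absolute training-set sizes
```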
The Support Vector Classifier is often a much more powerful model than random forests, especially for smaller datasets.
Here we'll repeat the above exercise, but use sklearn.svm.SVC
instead.
The support vector classifier that we'll use below does not scale well with data dimension. For this reason, we'll start by doing a dimensionality reduction of the data.
Fit an SVC with the default parameters, and use 3-fold cross-validation to determine the optimal accuracy. You'll notice that this computation takes a relatively long time in comparison to the Random Forest Classifier. This is because the data has a very high dimension, and SVC does not scale well with data dimension. In order to make the remaining tasks computationally viable, we'll reduce the dimension of the data.
from sklearn.svm import SVC
scores = cross_val_score(SVC(), X, y, cv=3)
print("score = {0:.2f} +- {1:.2f}".format(scores.mean(), scores.std()))
score = 0.41 +- 0.00
Use the PCA estimator to project the data down to 100 dimensions, then repeat the SVC cross-validation on this result. Is the score similar?
from sklearn.decomposition import PCA
X_proj = PCA(100).fit_transform(X)
scores = cross_val_score(SVC(), X_proj, y, cv=3)
print("score = {0:.2f} +- {1:.2f}".format(scores.mean(), scores.std()))
score = 0.41 +- 0.00
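For intuition about what the projection does: PCA with n_components=100 maps each sample onto its 100 highest-variance directions. A minimal sketch on random data (a stand-in for the face images, not the real dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_demo = rng.randn(200, 500)  # hypothetical high-dimensional data

# Project onto the 100 highest-variance linear combinations of features
pca = PCA(n_components=100)
X_demo_proj = pca.fit_transform(X_demo)
print(X_demo_proj.shape)  # (200, 100): same samples, 100 features each
```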
Now we'll carry on with the learning/validation curves using this projected data.
Plot a validation curve for the SVC on this (projected) data, using a linear kernel (kernel='linear') and exploring the effect of C on the result. Note that the effect of C only changes on very large scales: you should try logarithmically-spaced values between, say, $10^{-10}$ and $10^{-1}$. What is the optimal value of C (approximately)? What is the best score for this estimator? What is the score for this value if you use the entire dataset? Is this much different than for the projected data? Finally, plot the learning curve for this optimal C.
C = 10 ** np.linspace(-9, -4, 20)
val_train, val_test = validation_curve(SVC(kernel='linear'), X_proj, y,
                                       param_name='C', param_range=C, cv=5)
plt.axes(xscale='log')
plot_with_err(C, val_train, 'red', label='train')
plot_with_err(C, val_test, 'blue', label='test')
plt.xlabel('C')
plt.ylabel('score')
plt.legend(loc='best');
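As an aside, the logarithmically-spaced grid built above with 10 ** np.linspace(...) is exactly what np.logspace produces directly:

```python
import numpy as np

C_grid = 10 ** np.linspace(-9, -4, 20)

# np.logspace(-9, -4, 20) builds the same grid in one call
assert np.allclose(C_grid, np.logspace(-9, -4, 20))
print(C_grid.min(), C_grid.max())
```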
# Compute the score on the full dataset using the optimal value of C:
scores = cross_val_score(SVC(kernel='linear', C=1E-6), X_proj, y, cv=3)
print("projected score = {0:.2f} +- {1:.2f}".format(scores.mean(), scores.std()))
scores = cross_val_score(SVC(kernel='linear', C=1E-6), X, y, cv=3)
print("full score = {0:.2f} +- {1:.2f}".format(scores.mean(), scores.std()))
projected score = 0.82 +- 0.01
full score = 0.83 +- 0.01
These scores are statistically equivalent: it seems that our projection is not losing significant information relevant to the classification!
# Compute the learning curve for this value of C
train_sizes = np.linspace(0.05, 1, 20)
N_train, val_train, val_test = learning_curve(SVC(kernel='linear', C=1E-6),
                                              X_proj, y, train_sizes=train_sizes)
plot_with_err(N_train, val_train, 'r', label='training scores')
plot_with_err(N_train, val_test, 'b', label='validation scores')
plt.xlabel('Training Set Size'); plt.ylabel('score')
plt.legend();
Comparing this to the random forest, we see that the SVC classifier over-fits the data much less! Still, though, it's clear that adding more data would benefit this model: we've not yet reached the point where the model is saturated and the training/testing scores are equal.
For this particular dataset, we find the following:

- RandomForestClassifier over-fits the data to a large extent ($\Delta$score $\approx$ 0.5).
- SVC also over-fits the data, but not by as much ($\Delta$score $\approx$ 0.1).

One important piece of this is that random forests scale much better with data size than support vector machines do. If you double the size of the data, a random forest will take about twice as long to train, but a support vector machine will take about four times as long. So for small datasets like this one, SVC will often be the better choice, while for large datasets, RandomForestClassifier becomes better.
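The scaling claim above can be made concrete with a back-of-envelope sketch, assuming roughly linear training cost for random forests and roughly quadratic cost for kernel SVMs (these exponents are approximations for illustration, not exact complexities):

```python
def growth_factor(data_ratio, exponent):
    """Factor by which training time grows when the data grows by data_ratio,
    for a model whose training cost scales as N ** exponent."""
    return data_ratio ** exponent

print(growth_factor(2, 1))  # random forest (~O(N)): about 2x slower
print(growth_factor(2, 2))  # kernel SVM (~O(N^2)): about 4x slower
```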