Here we'll practice the process of model validation, and evaluating how we can improve our model. We'll return to the Labeled Faces in the Wild dataset that we saw previously, and use the cross-validation techniques we covered to find the best possible model.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# use seaborn plotting defaults
# If this causes an error, you can comment it out.
import seaborn as sns
sns.set()
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X, y = faces.data, faces.target
Fit a RandomForestClassifier with the default parameters, and use 10-fold cross-validation to determine the optimal accuracy. Then plot a validation curve exploring the effect of max_depth on the result. What is the optimal max_depth (approximately)? What is the best score for this estimator? Finally, plot the learning curve for this optimal max_depth.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
scores = cross_val_score(RandomForestClassifier(), X, y, cv=10)
print("score = {0:.2f} +- {1:.2f}".format(scores.mean(), scores.std()))
score = 0.58 +- 0.04
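As a sanity check of the mechanics, cross_val_score simply returns one score per fold. Here is a minimal sketch on a small synthetic dataset (make_classification is a hypothetical stand-in for the faces data, used only so this runs quickly without a download):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the faces data (illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=10, random_state=0)

# cross_val_score returns one accuracy value per fold: shape (10,) for cv=10
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X_demo, y_demo, cv=10)
print("score = {0:.2f} +- {1:.2f}".format(scores.mean(), scores.std()))
```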
# Use our plot_with_err utility routine from the lecture
def plot_with_err(x, data, color, **kwargs):
    """Plot the mean of data with a band of +/- one standard deviation."""
    mu, std = data.mean(1), data.std(1)
    plt.plot(x, mu, '-', c=color, **kwargs)
    plt.fill_between(x, mu - std, mu + std,
                     edgecolor='none', facecolor=color, alpha=0.2)
from sklearn.model_selection import validation_curve
max_depths = np.arange(1, 20, 2)
val_train, val_test = validation_curve(RandomForestClassifier(), X, y,
                                       param_name='max_depth',
                                       param_range=max_depths, cv=5)
plot_with_err(max_depths, val_train, 'red', label='train')
plot_with_err(max_depths, val_test, 'blue', label='test')
plt.xlabel('max_depth')
plt.ylabel('score')
plt.legend(loc='best');
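To read the optimal max_depth off the validation curve programmatically, take the parameter value whose mean test score is highest. A sketch with made-up score arrays (the numbers below are illustrative, not the actual curve values):

```python
import numpy as np

max_depths = np.arange(1, 20, 2)

# Hypothetical stand-in for validation_curve's test-score output:
# one row per max_depth value, one column per CV fold.
val_test = np.array([[0.40, 0.42], [0.50, 0.52], [0.55, 0.57],
                     [0.58, 0.60], [0.59, 0.61], [0.58, 0.60],
                     [0.57, 0.59], [0.56, 0.58], [0.55, 0.57],
                     [0.54, 0.56]])

# The best parameter is the one with the highest mean test score
best_depth = max_depths[val_test.mean(axis=1).argmax()]
print(best_depth)  # -> 9
```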
Apparently our model is strongly over-fitting the data (i.e. the training and test scores are separated by a large margin!)
from sklearn.model_selection import learning_curve
clf = RandomForestClassifier(max_depth=9)
train_sizes = np.linspace(0.05, 1, 20)
N_train, val_train, val_test = learning_curve(clf, X, y, train_sizes=train_sizes)
plot_with_err(N_train, val_train, 'r', label='training scores')
plot_with_err(N_train, val_test, 'b', label='validation scores')
plt.xlabel('Training Set Size'); plt.ylabel('score')
plt.legend();
It's clear here that the weakness of our model is not the model complexity, but the amount of training data available. Our model will not be data-saturated until the two lines above meet – and it will take a whole lot more data to bring the lines together!
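The learning_curve routine used above returns the actual training-set sizes alongside the per-fold train/test scores. A minimal sketch on synthetic data (a hypothetical stand-in for the faces, so it runs fast):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X_demo, y_demo = make_classification(n_samples=200, random_state=0)
train_sizes = np.linspace(0.1, 1.0, 5)

# Returns the absolute training-set sizes actually used, plus
# train/test scores of shape (n_sizes, n_folds)
N_train, train_sc, test_sc = learning_curve(
    RandomForestClassifier(random_state=0),
    X_demo, y_demo, train_sizes=train_sizes, cv=5)
print(N_train)  # increasing absolute training-set sizes
```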
The Support Vector Classifier is often a much more powerful model than random forests, especially for smaller datasets.
Here we'll repeat the above exercise, but use sklearn.svm.SVC
instead.
The support vector classifier that we'll use below does not scale well with data dimension. For this reason, we'll start by doing a dimensionality reduction of the data.
Fit an SVC with the default parameters, and use 3-fold cross-validation to determine the optimal accuracy. You'll notice that this computation takes a relatively long time in comparison to the Random Forest Classifier. This is because the data has a very high dimension, and SVC does not scale well with data dimension. In order to make the remaining tasks computationally viable, we'll reduce the dimension of the data.
from sklearn.svm import SVC
scores = cross_val_score(SVC(), X, y, cv=3)
print("score = {0:.2f} +- {1:.2f}".format(scores.mean(), scores.std()))
score = 0.41 +- 0.00
Use the PCA estimator to project the data down to 100 dimensions, then repeat the SVC cross-validation on this result. Is the score similar?
from sklearn.decomposition import PCA
X_proj = PCA(100).fit_transform(X)
scores = cross_val_score(SVC(), X_proj, y, cv=3)
print("score = {0:.2f} +- {1:.2f}".format(scores.mean(), scores.std()))
score = 0.41 +- 0.00
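For intuition about what the projection does: PCA with n_components=100 maps each sample onto its 100 highest-variance directions. A minimal sketch on random data (a stand-in for the face images, not the real dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_demo = rng.randn(200, 500)  # hypothetical high-dimensional data

# Project onto the 100 highest-variance linear combinations of features
pca = PCA(n_components=100)
X_demo_proj = pca.fit_transform(X_demo)
print(X_demo_proj.shape)  # (200, 100): same samples, 100 features each
```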
Now we'll carry on with the learning/validation curves using this projected data.
Plot a validation curve for the SVC on this (projected) data, using a linear kernel (kernel='linear') and exploring the effect of C on the result. Note that the effect of C only changes on very large scales: you should try logarithmically-spaced values between, say, $10^{-10}$ and $10^{-1}$. What is the optimal value of C (approximately)? What is the best score for this estimator? What is the score for this value if you use the entire dataset? Is this much different than for the projected data? Finally, plot the learning curve for this optimal C.
C = 10 ** np.linspace(-9, -4, 20)
val_train, val_test = validation_curve(SVC(kernel='linear'), X_proj, y,
                                       param_name='C', param_range=C, cv=5)
plt.axes(xscale='log')
plot_with_err(C, val_train, 'red', label='train')
plot_with_err(C, val_test, 'blue', label='test')
plt.xlabel('C')
plt.ylabel('score')
plt.legend(loc='best');
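As an aside, the logarithmically-spaced grid built above with 10 ** np.linspace(...) is exactly what np.logspace produces directly:

```python
import numpy as np

C_grid = 10 ** np.linspace(-9, -4, 20)

# np.logspace(-9, -4, 20) builds the same grid in one call
assert np.allclose(C_grid, np.logspace(-9, -4, 20))
print(C_grid.min(), C_grid.max())
```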
# Compute the score on the full dataset using the optimal value of C:
scores = cross_val_score(SVC(kernel='linear', C=1E-6), X_proj, y, cv=3)
print("projected score = {0:.2f} +- {1:.2f}".format(scores.mean(), scores.std()))
scores = cross_val_score(SVC(kernel='linear', C=1E-6), X, y, cv=3)
print("full score = {0:.2f} +- {1:.2f}".format(scores.mean(), scores.std()))
projected score = 0.82 +- 0.01
full score = 0.83 +- 0.01
These scores are statistically equivalent: it seems that our projection is not losing significant information relevant to the classification!
# Compute the learning curve for this value of C
train_sizes = np.linspace(0.05, 1, 20)
N_train, val_train, val_test = learning_curve(SVC(kernel='linear', C=1E-6),
                                              X_proj, y, train_sizes=train_sizes)
plot_with_err(N_train, val_train, 'r', label='training scores')
plot_with_err(N_train, val_test, 'b', label='validation scores')
plt.xlabel('Training Set Size'); plt.ylabel('score')
plt.legend();
Comparing this to the random forest, we see that the SVC classifier over-fits the data much less! Still, though, it's clear that adding more data would benefit this model: we've not yet reached the point where the model is saturated and the training/testing scores are equal.
For this particular dataset, we find the following:

- RandomForestClassifier over-fits the data to a large extent ($\Delta$score $\approx$ 0.5).
- SVC also over-fits the data, but not by as much ($\Delta$score $\approx$ 0.1).

One important piece of this is that random forests scale much better with data size than support vector machines do. If you double the size of the data, a random forest will take about twice as long to train, but a support vector machine will take about four times as long. So for small datasets like this one, SVC will often be the better choice, while for large datasets, RandomForestClassifier becomes better.
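The scaling claim above can be made concrete with a back-of-envelope sketch, assuming roughly linear training cost for random forests and roughly quadratic cost for kernel SVMs (these exponents are approximations for illustration, not exact complexities):

```python
def growth_factor(data_ratio, exponent):
    """Factor by which training time grows when the data grows by data_ratio,
    for a model whose training cost scales as N ** exponent."""
    return data_ratio ** exponent

print(growth_factor(2, 1))  # random forest (~O(N)): about 2x slower
print(growth_factor(2, 2))  # kernel SVM (~O(N^2)): about 4x slower
```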