Here we'll practice the process of model validation, and evaluate how we can improve our model. We'll return to the Labeled Faces in the Wild dataset that we saw previously, and use the cross-validation techniques we covered to find the best possible model.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# use seaborn plotting defaults
# If this causes an error, you can comment it out.
import seaborn as sns
sns.set()
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X, y = faces.data, faces.target
1. Fit a `RandomForestClassifier` with the default parameters, and use 10-fold cross-validation to determine the accuracy.
2. Plot a validation curve exploring the effect of `max_depth` on the result.
3. What is the optimal `max_depth` (approximately)? What is the best score for this estimator?
4. Plot a learning curve for this value of `max_depth`.
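The random-forest steps above can be sketched as follows. To keep the sketch fast and self-contained it uses a small synthetic stand-in (`X_demo`, `y_demo`); in the notebook you would pass the faces `X`, `y` loaded above instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, validation_curve

# Synthetic stand-in for the faces data (assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=10, random_state=0)

# 1. Default parameters, 10-fold cross-validation
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X_demo, y_demo, cv=10)
print(scores.mean())

# 2-3. Validation curve over max_depth; pick the depth with the
# best mean validation score
depths = np.arange(1, 11)
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=0), X_demo, y_demo,
    param_name='max_depth', param_range=depths, cv=5)
best_depth = depths[val_scores.mean(axis=1).argmax()]
print(best_depth)
```

On the real faces data the shapes and the optimal depth will differ; the pattern of calls is the same.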
The Support Vector Classifier is often a much more powerful model than random forests, especially for smaller datasets.
Here we'll repeat the above exercise, but use `sklearn.svm.SVC` instead.
The support vector classifier that we'll use below does not scale well with data dimension. For this reason, we'll start by doing a dimensionality reduction of the data.
1. Fit an `SVC` with the default parameters, and use 3-fold cross-validation to determine the accuracy.

You'll notice that this computation takes a relatively long time in comparison to the random forest classifier. This is because the data has a very high dimension, and `SVC` does not scale well with data dimension.
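A minimal sketch of step 1, again on a synthetic high-dimensional stand-in (the faces data has 1850 features); in the notebook you would pass the faces `X`, `y` instead:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# High-dimensional synthetic stand-in (assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=500,
                                     n_informative=20, random_state=0)

# Default SVC (RBF kernel), scored with 3-fold cross-validation
scores = cross_val_score(SVC(), X_demo, y_demo, cv=3)
print(scores.mean())
```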
In order to make the remaining tasks computationally viable, we'll reduce the dimension of the data.

2. Use a `PCA` estimator to project the data down to 100 dimensions, then repeat the `SVC` cross-validation on this result. Is the score similar?

Now we'll carry on with the learning/validation curves using this projected data.
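One way to sketch step 2 is to chain the projection and the classifier in a pipeline, so the `PCA` is re-fit inside each cross-validation fold. The stand-in data and the 50-component projection are assumptions for illustration; the notebook uses 100 components on the 1850-dimensional faces data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the faces data (assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=200,
                                     n_informative=20, random_state=0)

# Project down before fitting the SVC, then cross-validate the whole pipeline
model = make_pipeline(PCA(n_components=50, random_state=0), SVC())
scores = cross_val_score(model, X_demo, y_demo, cv=3)
print(scores.mean())
```

Fitting `PCA` inside the pipeline avoids leaking information from the validation fold into the projection.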
3. Plot a validation curve for the `SVC` on this (projected) data, using a linear kernel (`kernel='linear'`) and exploring the effect of `C` on the result. Note that the effect of `C` only changes on very large scales: you should try logarithmically-spaced values between, say, $10^{-10}$ and $10^{-1}$.
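Step 3 can be sketched with `np.logspace` and `validation_curve`; the synthetic stand-in data is an assumption for illustration, and in the notebook you would pass the projected faces data instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic stand-in for the projected data (assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=50,
                                     n_informative=10, random_state=0)

# Logarithmically-spaced values of C between 1e-10 and 1e-1, as suggested above
C_range = np.logspace(-10, -1, 10)
train_scores, val_scores = validation_curve(
    SVC(kernel='linear'), X_demo, y_demo,
    param_name='C', param_range=C_range, cv=3)

# The C with the best mean validation score
best_C = C_range[val_scores.mean(axis=1).argmax()]
print(best_C)
```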
4. What is the optimal `C`? What is the best score for this estimator? What is the score for this value if you use the entire dataset? Is this much different than for the projected data?

5. Plot a learning curve for this estimator, as you did above for `max_depth`.
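A sketch of the learning curve in step 5, using `learning_curve`. The value `C=0.1` is a hypothetical placeholder for whatever optimal `C` you found above, and the stand-in data is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic stand-in for the projected data (assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=50,
                                     n_informative=10, random_state=0)

# Training-set sizes from 10% to 100% of the available data;
# C=0.1 is a placeholder for the optimal value found above
sizes, train_scores, val_scores = learning_curve(
    SVC(kernel='linear', C=0.1), X_demo, y_demo,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=3)
print(sizes)
```

Plotting the mean of `train_scores` and `val_scores` against `sizes` shows whether the model would benefit from more training data.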