In this breakout, we'll use Principal Component Analysis (PCA) to explore the faces dataset we saw earlier.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# use seaborn plotting defaults
import seaborn as sns; sns.set()
We'll use this code to load the data:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X, y = faces.data, faces.target
from sklearn.decomposition import PCA
pca = PCA().fit(X)
pca
PCA(copy=True, n_components=None, whiten=False)
pca.n_components_
1850
plt.axes(xscale='log')
plt.plot(pca.explained_variance_ratio_.cumsum())
plt.xlabel('number of components')
plt.ylabel('cumulative variance ratio');
We see that with about 100 components, we retain 90% of the variance.
Note that we could also have determined this automatically, using the following:
pca = PCA(n_components=0.90)
pca.fit(X)
pca.n_components_
105
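As a quick check of this behavior, here is a minimal sketch on synthetic data (the array `X_demo` is made up for illustration): passing a float in (0, 1) as `n_components` tells PCA to keep just enough components to reach that fraction of the total variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic data of rank at most 10, embedded in 50 dimensions
X_demo = rng.randn(200, 10) @ rng.randn(10, 50)

# A float n_components in (0, 1) is interpreted as a target
# fraction of variance to retain
pca_demo = PCA(n_components=0.90).fit(X_demo)
print(pca_demo.n_components_)                    # at most 10 here
print(pca_demo.explained_variance_ratio_.sum())  # >= 0.90
```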
The mean of the data (found in the mean_
attribute) and each component of the data (found in the rows of the components_
attribute) can be reshaped and interpreted as an image.
Use plt.imshow to visualize the rows of the components_
matrix. You'll have to play around with the colormap and grid settings to make this look OK.
imshape = faces.images.shape[-2:]
plt.axes(xticks=[], yticks=[])
plt.imshow(pca.mean_.reshape(imshape), cmap='binary_r');
fig, ax = plt.subplots(2, 5, figsize=(14, 6),
                       subplot_kw=dict(xticks=[], yticks=[]))
for i in range(10):
    ax.flat[i].imshow(pca.components_[i].reshape(imshape),
                      cmap='binary_r')
We see that the main components measure things like how off-center the face is, how much shadow there is, how deep the eye sockets are, etc.
For several faces, plot the true image plus the reconstruction (computed using inverse_transform) for several different values of n_components. (You might even use IPython's interactive functions to make this exploration easier.)
Does the 90% variance choice seem to correspond to a good visual representation of each picture?
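As a sanity check on the inverse_transform route, here is a minimal sketch on synthetic data (`X_demo` is a made-up array for illustration): for an un-whitened PCA, inverse_transform reproduces exactly the mean-plus-projection reconstruction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_demo = rng.randn(100, 30)

pca_k = PCA(n_components=10).fit(X_demo)
coeffs = pca_k.transform(X_demo)

# inverse_transform undoes the projection: with whiten=False it is
# exactly mean_ + coefficients @ components_
recon_builtin = pca_k.inverse_transform(coeffs)
recon_manual = pca_k.mean_ + coeffs @ pca_k.components_
print(np.allclose(recon_builtin, recon_manual))  # True
```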
Note: As you experiment with this, you may want to use RandomizedPCA rather than PCA for this task. RandomizedPCA is an approximate method with the same interface as PCA, but operates much more quickly.
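Note that RandomizedPCA was later merged into PCA itself; on recent scikit-learn versions the same randomized algorithm is selected via the svd_solver argument. A minimal sketch on synthetic data (`X_demo` is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_demo = rng.randn(300, 200)

# On recent scikit-learn, the algorithm behind the old
# RandomizedPCA class is chosen with svd_solver='randomized'
rpca = PCA(n_components=20, svd_solver='randomized',
           random_state=0).fit(X_demo)
print(rpca.components_.shape)  # (20, 200)
```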
pca = PCA().fit(X)
def plot_face(i=279):
    fig, ax = plt.subplots(1, 6, figsize=(14, 3),
                           subplot_kw=dict(xticks=[], yticks=[]))
    ax[0].imshow(X[i].reshape(imshape), cmap='binary_r')
    for j, ncomp in enumerate([10, 20, 40, 80, 100]):
        approx = pca.mean_ + np.dot(pca.transform(X[i:i + 1])[:, :ncomp],
                                    pca.components_[:ncomp])
        ax[j + 1].imshow(approx.reshape(imshape), cmap='binary_r')
        ax[j + 1].set_title('{0} components'.format(ncomp))
plot_face(700)
# On newer installations, interact lives in the ipywidgets package
# (IPython.html.widgets is the legacy location)
from ipywidgets import interact
interact(plot_face, i=(0, X.shape[0] - 1));