This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.
We import NumPy, scikit-learn, and matplotlib.
import numpy as np
import sklearn
import sklearn.decomposition as dec
import sklearn.datasets as ds
import matplotlib.pyplot as plt
%matplotlib inline
We load the Iris flower dataset from scikit-learn's datasets module and display the first two features (sepal length and width), with each point colored by its class.
iris = ds.load_iris()
X = iris.data
y = iris.target
print(X.shape)
plt.figure(figsize=(6,3));
plt.scatter(X[:,0], X[:,1], c=y,
            s=30, cmap=plt.cm.rainbow);
We now apply PCA to the dataset: we create a PCA model and call its fit_transform method. This function first computes the principal components, then projects the data onto them.
X_bis = dec.PCA().fit_transform(X)
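Note that fit_transform is equivalent to calling fit followed by transform (up to floating-point details); the two-step form is useful when the fitted model must later project new samples. A minimal sketch (the names pca and X_proj are ours):
# Fit the model first, then project the data in a second step.
pca = dec.PCA()
pca.fit(X)
X_proj = pca.transform(X)
print(np.allclose(X_bis, X_proj))  # expected: True, up to numerical precision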
plt.figure(figsize=(6,3));
plt.scatter(X_bis[:,0], X_bis[:,1], c=y,
s=30, cmap=plt.cm.rainbow);
Points belonging to the same classes are now grouped together, even though the PCA estimator did not use the labels. PCA finds a projection maximizing the variance of the data, which here corresponds to a projection where the classes are well separated.
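We can quantify this by looking at the proportion of variance captured by each component, exposed by the fitted estimator's explained_variance_ratio_ attribute. A minimal sketch (the variable name pca_model is ours); on this dataset the first component should account for most of the variance:
pca_model = dec.PCA().fit(X)
# Fraction of the total variance explained by each principal component.
print(pca_model.explained_variance_ratio_)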
The sklearn.decomposition module contains several variants of the classic PCA estimator: ProbabilisticPCA, SparsePCA, RandomizedPCA, KernelPCA, and others. As an example, let's take a look at KernelPCA, a non-linear version of PCA.
X_ter = dec.KernelPCA(kernel='rbf').fit_transform(X)
plt.figure(figsize=(6,3));
plt.scatter(X_ter[:,0], X_ter[:,1], c=y, s=30, cmap=plt.cm.rainbow);
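As another illustration of these variants, here is a sketch using SparsePCA, which adds a sparsity constraint on the components (the choice of n_components=2 and the variable name X_quater are ours, not from the recipe):
# Sparse variant of PCA: components are encouraged to have few nonzero loadings.
X_quater = dec.SparsePCA(n_components=2).fit_transform(X)
plt.figure(figsize=(6,3));
plt.scatter(X_quater[:,0], X_quater[:,1], c=y, s=30, cmap=plt.cm.rainbow);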
You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).
IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).