This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.
We import NumPy, scikit-learn, and matplotlib.
import numpy as np
import sklearn
import sklearn.decomposition as dec
import sklearn.datasets as ds
import matplotlib.pyplot as plt
%matplotlib inline
We load the Iris flower dataset from scikit-learn's datasets module and display the first two features (sepal length and width), with each point colored by its class.
iris = ds.load_iris()
X = iris.data
y = iris.target
print(X.shape)
plt.figure(figsize=(6,3));
plt.scatter(X[:,0], X[:,1], c=y,
            s=30, cmap=plt.cm.rainbow);
We now apply PCA to the dataset: we create a PCA model and call its fit_transform method. This function first computes the principal components, then projects the data onto them.
X_bis = dec.PCA().fit_transform(X)
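Note that fit_transform is equivalent to calling fit followed by transform (up to floating-point details); the two-step form is useful when the fitted model must later project new samples. A minimal sketch (the names pca and X_proj are ours):
# Fit the model first, then project the data in a second step.
pca = dec.PCA()
pca.fit(X)
X_proj = pca.transform(X)
print(np.allclose(X_bis, X_proj))  # expected: True, up to numerical precision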
plt.figure(figsize=(6,3));
plt.scatter(X_bis[:,0], X_bis[:,1], c=y,
s=30, cmap=plt.cm.rainbow);
Points belonging to the same classes are now grouped together, even though the PCA estimator did not use the labels. PCA finds a projection maximizing the variance of the data, which here corresponds to a projection where the classes are well separated.
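We can quantify this by looking at the proportion of variance captured by each component, exposed by the fitted estimator's explained_variance_ratio_ attribute. A minimal sketch (the variable name pca_model is ours); on this dataset the first component should account for most of the variance:
pca_model = dec.PCA().fit(X)
# Fraction of the total variance explained by each principal component.
print(pca_model.explained_variance_ratio_)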
The sklearn.decomposition module contains several variants of the classic PCA estimator: ProbabilisticPCA, SparsePCA, RandomizedPCA, KernelPCA, and others. As an example, let's take a look at KernelPCA, a non-linear version of PCA.
X_ter = dec.KernelPCA(kernel='rbf').fit_transform(X)
plt.figure(figsize=(6,3));
plt.scatter(X_ter[:,0], X_ter[:,1], c=y, s=30, cmap=plt.cm.rainbow);
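As another illustration of these variants, here is a sketch using SparsePCA, which adds a sparsity constraint on the components (the choice of n_components=2 and the variable name X_quater are ours, not from the recipe):
# Sparse variant of PCA: components are encouraged to have few nonzero loadings.
X_quater = dec.SparsePCA(n_components=2).fit_transform(X)
plt.figure(figsize=(6,3));
plt.scatter(X_quater[:,0], X_quater[:,1], c=y, s=30, cmap=plt.cm.rainbow);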
You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).
IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).