In [1]:

```
import warnings
warnings.filterwarnings('ignore')
```

In [2]:

```
%load_ext watermark
```

In [3]:

```
%watermark -v -d -a 'Sebastian Raschka' -p scikit-learn,plotly,numpy,pandas
```

*In other words, PCA projects the entire dataset onto a different feature (sub)space, and LDA tries to determine a suitable feature (sub)space in order to distinguish between patterns that belong to different classes.*

Often, the desired goal is to reduce the dimensions of a $d$-dimensional dataset by projecting it onto a $(k)$-dimensional subspace (where $k\;<\;d$) in order to increase the computational efficiency while retaining most of the information. An important question is "what is the size of $k$ that represents the data 'well'?"

Later, we will compute eigenvectors (the principal components) of a dataset and collect them in a projection matrix. Each of those eigenvectors is associated with an eigenvalue which can be interpreted as the "length" or "magnitude" of the corresponding eigenvector. If some eigenvalues have a significantly larger magnitude than others that the reduction of the dataset via PCA onto a smaller dimensional subspace by dropping the "less informative" eigenpairs is reasonable.

- Standardize the data.
- Obtain the Eigenvectors and Eigenvalues from the covariance matrix or correlation matrix, or perform Singular Vector Decomposition.
- Sort eigenvalues in descending order and choose the $k$ eigenvectors that correspond to the $k$ largest eigenvalues where $k$ is the number of dimensions of the new feature subspace ($k \le d$)/.
- Construct the projection matrix $\mathbf{W}$ from the selected $k$ eigenvectors.
- Transform the original dataset $\mathbf{X}$ via $\mathbf{W}$ to obtain a $k$-dimensional feature subspace $\mathbf{Y}$.

For the following tutorial, we will be working with the famous "Iris" dataset that has been deposited on the UCI machine learning repository

(https://archive.ics.uci.edu/ml/datasets/Iris).

The iris dataset contains measurements for 150 iris flowers from three different species.

The three classes in the Iris dataset are:

- Iris-setosa (n=50)
- Iris-versicolor (n=50)
- Iris-virginica (n=50)

And the four features of in Iris dataset are:

- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm

In [16]:

```
import pandas as pd
df = pd.read_csv(
filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
header=None,
sep=',')
df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True) # drops the empty line at file-end
df.tail()
```

Out[16]:

In [17]:

```
# split data table into data X and class labels y
X = df.ix[:,0:4].values
y = df.ix[:,4].values
```

Our iris dataset is now stored in form of a $150 \times 4$ matrix where the columns are the different features, and every row represents a separate flower sample. Each sample row $\mathbf{x}$ can be pictured as a 4-dimensional vector

$\mathbf{x^T} = \begin{pmatrix} x_1 \ x_2 \ x_3 \ x_4 \end{pmatrix} = \begin{pmatrix} \text{sepal length} \ \text{sepal width} \\text{petal length} \ \text{petal width} \end{pmatrix}$

In [6]:

```
import plotly.plotly as py
from plotly.graph_objs import *
import plotly.tools as tls
```

In [ ]:

```
# plotting histograms
traces = []
legend = {0:False, 1:False, 2:False, 3:True}
colors = {'Iris-setosa': 'rgb(31, 119, 180)',
'Iris-versicolor': 'rgb(255, 127, 14)',
'Iris-virginica': 'rgb(44, 160, 44)'}
for col in range(4):
for key in colors:
traces.append(Histogram(x=X[y==key, col],
opacity=0.75,
xaxis='x%s' %(col+1),
marker=Marker(color=colors[key]),
name=key,
showlegend=legend[col]))
data = Data(traces)
layout = Layout(barmode='overlay',
xaxis=XAxis(domain=[0, 0.25], title='sepal length (cm)'),
xaxis2=XAxis(domain=[0.3, 0.5], title='sepal width (cm)'),
xaxis3=XAxis(domain=[0.55, 0.75], title='petal length (cm)'),
xaxis4=XAxis(domain=[0.8, 1], title='petal width (cm)'),
yaxis=YAxis(title='count'),
title='Distribution of the different Iris flower features')
fig = Figure(data=data, layout=layout)
py.image.save_as({'data': data}, 'iris_histos.png')
py.iplot(fig)
```

In [19]:

```
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)
```

The classic approach to PCA is to perform the eigendecomposition on the covariance matrix $\Sigma$, which is a $d \times d$ matrix where each element represents the covariance between two features. The covariance between two features is calculated as follows:

$\sigma_{jk} = \frac{1}{n-1}\sum_{i=1}^{N}\left( x_{ij}-\bar{x}_j \right) \left( x_{ik}-\bar{x}_k \right).$

We can summarize the calculation of the covariance matrix via the following matrix equation:

$\Sigma = \frac{1}{n-1} \left( (\mathbf{X} - \mathbf{\bar{x}})^T\;(\mathbf{X} - \mathbf{\bar{x}}) \right)$

where $\mathbf{\bar{x}}$ is the mean vector
$\mathbf{\bar{x}} = \sum\limits_{k=1}^n x_{i}.$

The mean vector is a $d$-dimensional vector where each value in this vector represents the sample mean of a feature column in the dataset.

In [9]:

```
import numpy as np
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
print('Covariance matrix \n%s' %cov_mat)
```

`cov`

function:

In [10]:

```
print('NumPy covariance matrix: \n%s' %np.cov(X_std.T))
```

Next, we perform an eigendecomposition on the covariance matrix:

In [11]:

```
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
```

Eigendecomposition of the standardized data based on the correlation matrix:

In [12]:

```
cor_mat1 = np.corrcoef(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cor_mat1)
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
```

Eigendecomposition of the raw data based on the correlation matrix:

In [13]:

```
cor_mat2 = np.corrcoef(X.T)
eig_vals, eig_vecs = np.linalg.eig(cor_mat2)
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
```

We can clearly see that all three approaches yield the same eigenvectors and eigenvalue pairs:

- Eigendecomposition of the covariance matrix after standardizing the data.
- Eigendecomposition of the correlation matrix.
- Eigendecomposition of the correlation matrix after standardizing the data.

In [20]:

```
u,s,v = np.linalg.svd(X_std.T)
u
```

Out[20]:

In [28]:

```
for ev in eig_vecs:
np.testing.assert_array_almost_equal(1.0, np.linalg.norm(ev))
print('Everything ok!')
```

In order to do so, the common approach is to rank the eigenvalues from highest to lowest in order choose the top $k$ eigenvectors.

In [29]:

```
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort()
eig_pairs.reverse()
# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
print(i[0])
```

In [30]:

```
tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
trace1 = Bar(
x=['PC %s' %i for i in range(1,5)],
y=var_exp,
showlegend=False)
trace2 = Scatter(
x=['PC %s' %i for i in range(1,5)],
y=cum_var_exp,
name='cumulative explained variance')
data = Data([trace1, trace2])
layout=Layout(
yaxis=YAxis(title='Explained variance in percent'),
title='Explained variance by different principal components')
fig = Figure(data=data, layout=layout)
py.iplot(fig)
```

Out[30]:

It's about time to get to the really interesting part: The construction of the projection matrix that will be used to transform the Iris data onto the new feature subspace. Although, the name "projection matrix" has a nice ring to it, it is basically just a matrix of our concatenated top *k* eigenvectors.

Here, we are reducing the 4-dimensional feature space to a 2-dimensional feature subspace, by choosing the "top 2" eigenvectors with the highest eigenvalues to construct our $d \times k$-dimensional eigenvector matrix $\mathbf{W}$.

In [33]:

```
matrix_w = np.hstack((eig_pairs[0][1].reshape(4,1),
eig_pairs[1][1].reshape(4,1)))
print('Matrix W:\n', matrix_w)
```

$\mathbf{Y} = \mathbf{X} \times \mathbf{W}$, where $\mathbf{Y}$ is a $150\times 2$ matrix of our transformed samples.

In [34]:

```
Y = X_std.dot(matrix_w)
```

In [36]:

```
traces = []
for name in ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'):
trace = Scatter(
x=Y[y==name,0],
y=Y[y==name,1],
mode='markers',
name=name,
marker=Marker(
size=12,
line=Line(
color='rgba(217, 217, 217, 0.14)',
width=0.5),
opacity=0.8))
traces.append(trace)
data = Data(traces)
layout = Layout(showlegend=True,
scene=Scene(xaxis=XAxis(title='PC1'),
yaxis=YAxis(title='PC2'),))
fig = Figure(data=data, layout=layout)
py.iplot(fig)
```

Out[36]:

In [37]:

```
from sklearn.decomposition import PCA as sklearnPCA
sklearn_pca = sklearnPCA(n_components=2)
Y_sklearn = sklearn_pca.fit_transform(X_std)
```

In [38]:

```
traces = []
for name in ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'):
trace = Scatter(
x=Y_sklearn[y==name,0],
y=Y_sklearn[y==name,1],
mode='markers',
name=name,
marker=Marker(
size=12,
line=Line(
color='rgba(217, 217, 217, 0.14)',
width=0.5),
opacity=0.8))
traces.append(trace)
data = Data(traces)
layout = Layout(xaxis=XAxis(title='PC1', showline=False),
yaxis=YAxis(title='PC2', showline=False))
fig = Figure(data=data, layout=layout)
py.iplot(fig)
```

Out[38]: