Setting up the environment

In [ ]:

```
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.optimize as opt
import pickle
%matplotlib inline
```

Write a function that generates data from two Gaussians with unit variance, centered at $\mathbf{1}$ and $-\mathbf{1}$ respectively. $\mathbf{1}$ is the vector of all ones.

Use the function to generate 100 samples from each Gaussian, with a 5 dimensional feature space.

In [ ]:

```
# Solution goes here
```
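One possible solution, as a sketch (the function and parameter names here are my own choices, not prescribed by the exercise):

```python
import numpy as np

def generate_data(n_samples=100, n_features=5, seed=None):
    """Draw n_samples points from each of two unit-variance Gaussians,
    centred at the all-ones vector and its negation."""
    rng = np.random.default_rng(seed)
    centre = np.ones(n_features)
    # loc broadcasts across rows, so every sample is shifted by the centre
    X_pos = rng.normal(loc=centre, size=(n_samples, n_features))
    X_neg = rng.normal(loc=-centre, size=(n_samples, n_features))
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(n_samples), -np.ones(n_samples)])
    return X, y

X, y = generate_data(100, 5, seed=0)
print(X.shape, y.shape)  # (200, 5) (200,)
```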

The **singular values** of a matrix $A$ are defined as the square roots of the eigenvalues of $A^T A$ (they are defined for any matrix, not just square ones). Given a matrix $X$, the singular value decomposition (SVD) is given by
$$
X = U S V^T
$$
where $U$ and $V$ are orthogonal matrices containing the left and right singular vectors respectively, and $S$ is a matrix with the singular values along the diagonal.

Recall that PCA considers the covariance matrix of a data matrix $X$. Using the definition of SVD above, derive expressions for:

- the eigenvectors
- the projection of $X$ onto the eigenvectors corresponding to the $k$ largest eigenvalues
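One way to obtain these (a sketch, using the economy-size SVD where $S$ is square and diagonal, and assuming the columns of $X$ have already been centred):

$$
X^T X = V S U^T U S V^T = V S^2 V^T,
$$

so the eigenvectors of the covariance (which is $X^T X$ up to a factor of $1/n$) are the columns of $V$, with eigenvalues $s_i^2$ up to the same factor. The projection of $X$ onto the top $k$ eigenvectors is then

$$
X V_k = U_k S_k,
$$

where the subscript $k$ keeps only the first $k$ columns.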

Implement the principal component analysis method using `numpy.linalg.svd`. Your function should take the data matrix and return two matrices:

- The projection of the data onto the principal components
- The actual components (eigenvectors) themselves.

*Hint: do not forget to center the data by removing the mean*

In [ ]:

```
# Solution goes here
```
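A minimal sketch of such a function (the name `pca` and the exact return convention are my own choices; the exercise only fixes what the two outputs should contain):

```python
import numpy as np

def pca(X):
    """PCA via SVD of the centred data.

    Returns (P, V) where the columns of V are the principal
    directions (eigenvectors of the covariance, in decreasing order
    of eigenvalue) and P is the projection of the centred data
    onto those directions.
    """
    Xc = X - X.mean(axis=0)                # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = U * S                              # same as Xc @ Vt.T
    return P, Vt.T
```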

Obtain the projection of the toy data above to its first two principal components. Plot the results. You should be able to see that the first principal component already gives you the axis of discrimination.

In [ ]:

```
# Solution goes here
```
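A self-contained sketch of this step (it regenerates toy data and inlines the SVD, so it does not depend on the earlier solution cells; variable names are my own):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Toy data as above: 100 samples per class, 5 features, centres at +/-1.
X = np.vstack([rng.normal(loc=np.ones(5), size=(100, 5)),
               rng.normal(loc=-np.ones(5), size=(100, 5))])
y = np.concatenate([np.ones(100), -np.ones(100)])

# PCA via SVD of the centred data, keeping the first two components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Xc @ Vt[:2].T

plt.scatter(P[y == 1, 0], P[y == 1, 1], marker='o', label='class +1')
plt.scatter(P[y == -1, 0], P[y == -1, 1], marker='x', label='class -1')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend()
```

The two classes should separate cleanly along the horizontal (PC 1) axis, since the direction between the two Gaussian centres carries the most variance.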

*You have seen this dataset in earlier tutorials*

We will predict the incidence of diabetes based on various measurements (see description). Instead of directly using the raw data, we use a normalised version, where the label to be predicted (the incidence of diabetes) is in the first column. Download the data from mldata.org.

Read in the data using pandas.

In [ ]:

```
names = ['diabetes', 'num preg', 'plasma', 'bp', 'skin fold', 'insulin', 'bmi', 'pedigree', 'age']
data = pd.read_csv('diabetes_scale.csv', header=None, names=names)
data['diabetes'] = data['diabetes'].replace(-1, 0)  # map labels from {-1, 1} to {0, 1}
data.head()
```

Find the first two principal components of the features in the classification data set. Plot the scatter plot showing the examples projected onto the first two principal components. Use the labels to produce different symbols for each class. Discuss whether the first two principal components discriminate well.

In [ ]:

```
# Solution goes here
```

Plot the scatter plot of the first two principal components of the classification dataset, before and after the normalisations from Tutorials 2 and 5 respectively.

In [ ]:

```
# Solution goes here
```

Write a file containing the features projected onto the first 4 principal components, using the `to_csv` method of `pandas`.

In [ ]:

```
# Solution goes here
```
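A sketch of the `to_csv` step. Since the diabetes file may not be available here, a random matrix stands in for the projected features; the column names `pc1`..`pc4` are my own, but the file name matches the read-back cell below:

```python
import numpy as np
import pandas as pd

# Stand-in for the real projection: in the tutorial this would be the
# diabetes features projected onto the first 4 principal components.
rng = np.random.default_rng(0)
proj4 = rng.normal(size=(10, 4))

df = pd.DataFrame(proj4, columns=['pc1', 'pc2', 'pc3', 'pc4'])
df.to_csv('prin_feat4.csv', index=False)   # index=False avoids a spurious column

# Read it back to check the round trip.
back = pd.read_csv('prin_feat4.csv')
print(back.shape)  # (10, 4)
```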

In [ ]:

```
# Data should look something like the below
data = pd.read_csv('prin_feat4.csv')
data.head()
```

Use the first four principal components with your logistic regression code from Tutorial 3, and compare the results. For simplicity, compare the training error.

In [ ]:

```
# Solution goes here
```
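If you no longer have the Tutorial 3 code to hand, a minimal logistic regression via `scipy.optimize` looks roughly like the sketch below. It runs on synthetic two-Gaussian data as a stand-in (the real exercise fits both the raw diabetes features and their 4-component projection, then compares the two training errors):

```python
import numpy as np
import scipy.optimize as opt

def nll(w, X, y):
    """Negative log-likelihood of logistic regression (no bias term).
    Uses logaddexp for numerical stability."""
    z = X @ w
    return np.sum(np.logaddexp(0, z) - y * z)

def train_logreg(X, y):
    w0 = np.zeros(X.shape[1])
    res = opt.minimize(nll, w0, args=(X, y), method='BFGS')
    return res.x

def training_error(w, X, y):
    pred = (X @ w > 0).astype(float)
    return np.mean(pred != y)

# Synthetic stand-in data with labels in {0, 1}.
rng = np.random.default_rng(0)
X_full = np.vstack([rng.normal(1, 1, (100, 5)),
                    rng.normal(-1, 1, (100, 5))])
y = np.concatenate([np.ones(100), np.zeros(100)])

w = train_logreg(X_full, y)
print(training_error(w, X_full, y))
```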

In [ ]:

```
# Solution goes here
```

Use `numpy.random.randn` to generate 20 random features and add them to the diabetes dataset. How does logistic regression perform on all features? Can PCA be used to identify the signal?

In [ ]:

```
# Solution goes here
```
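A sketch of the PCA half of this question, on synthetic stand-in data (the real exercise uses the diabetes features). The point is that the class-separation direction carries far more variance than any of the appended noise columns, so it dominates the singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in dataset: 200 samples with 5 informative features
# (two shifted unit-variance Gaussians).
X = np.vstack([rng.normal(1, 1, (100, 5)),
               rng.normal(-1, 1, (100, 5))])
# Append 20 pure-noise features, as the exercise asks.
X_noisy = np.hstack([X, rng.standard_normal((200, 20))])

# Singular values of the centred data.
Xc = X_noisy - X_noisy.mean(axis=0)
S = np.linalg.svd(Xc, compute_uv=False)
print(S[:3])  # the first singular value stands out from the rest
```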

The aim of this section of the tutorial is to see that in some cases, the principal components can be human interpretable.

The images below are of Colin Powell, resized to a smaller image, from LFW. Download the images from the course website.

In [ ]:

```
# Visualising images
def plot_gallery(images, titles, h, w, n_row=2, n_col=6):
    """Helper function to plot a gallery of portraits"""
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())
```

In [ ]:

```
lfw_colin = pickle.load(open('lfw_colin.pkl', 'rb'))
# introspect the images array to find the shapes (for plotting)
n_samples, h, w = lfw_colin['images'].shape
plot_gallery(lfw_colin['images'], range(n_samples), h, w)
```

Use the `pca` function you wrote above to find the first 15 principal components. Visualise them. Discuss what the components potentially capture, for example lighting from the right.

*Hint: Images need to be converted into a vector for PCA, and the results need to be converted back*

In [ ]:

```
# Solution goes here
```
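A sketch of the reshaping mechanics the hint refers to, using random arrays as a stand-in for `lfw_colin['images']` (the real images come from the pickle file on the course website):

```python
import numpy as np

def pca(X):
    """PCA via SVD of the centred data, as implemented earlier."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U * S, Vt.T

# Synthetic stand-in for the face images.
rng = np.random.default_rng(0)
n_samples, h, w = 30, 20, 15
images = rng.random((n_samples, h, w))

X = images.reshape(n_samples, h * w)          # flatten each image to a vector
P, V = pca(X)
eigenfaces = V[:, :15].T.reshape(15, h, w)    # back to image shape for plotting
print(eigenfaces.shape)  # (15, 20, 15)
```

The reshaped components can then be passed to `plot_gallery`, e.g. `plot_gallery(eigenfaces, range(15), h, w, n_row=3, n_col=5)`.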