Dimensionality reduction

COMP4670/8600 - Introduction to Statistical Machine Learning - Tutorial 6

Setting up the environment

In [ ]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.optimize as opt
import pickle

%matplotlib inline

Toy dataset for debugging

Write a function that generates data from two Gaussians with unit variance, centered at $\mathbf{1}$ and $-\mathbf{1}$ respectively. $\mathbf{1}$ is the vector of all ones.

Use the function to generate 100 samples from each Gaussian, with a 5-dimensional feature space.

In [ ]:
# Solution goes here
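One possible sketch, assuming nothing beyond the task statement (the name generate_toy_data is illustrative, not prescribed):

In [ ]:
def generate_toy_data(n_samples=100, n_features=5):
    """Generate n_samples points from each of two unit-variance Gaussians,
    centred at the all-ones vector and its negative respectively."""
    pos = np.random.randn(n_samples, n_features) + 1.0   # centred at +1
    neg = np.random.randn(n_samples, n_features) - 1.0   # centred at -1
    X = np.vstack([pos, neg])
    y = np.hstack([np.ones(n_samples), -np.ones(n_samples)])  # class labels
    return X, y

X_toy, y_toy = generate_toy_data()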

Principal component analysis (PCA)

The singular values of a matrix $A$ are defined as the square roots of the eigenvalues of $A^T A$. Given a matrix $X$, its singular value decomposition (SVD) is $$ X = U S V^T $$ where $U$ and $V$ are orthogonal matrices containing the left and right singular vectors respectively, and $S$ is a rectangular diagonal matrix containing the singular values.

Recall that PCA considers the covariance matrix of a data matrix $X$. Using the definition of SVD above, derive expressions for:

  1. the eigenvectors
  2. the projection of $X$ onto the eigenvectors corresponding to the $k$ largest eigenvalues

Solution description
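One way to derive this, assuming $X$ has been centred so that the covariance matrix is proportional to $X^T X$: substituting the SVD gives $$ X^T X = V S^T U^T U S V^T = V (S^T S) V^T $$ since $U^T U = I$. Hence the columns of $V$ are the eigenvectors of the covariance matrix, with eigenvalues proportional to the squared singular values $s_i^2$. The projection of $X$ onto the eigenvectors corresponding to the $k$ largest eigenvalues is $$ X V_k = U_k S_k $$ where $V_k$, $U_k$ and $S_k$ keep only the first $k$ singular vectors and singular values.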

Implement PCA

Implement the principal component analysis method, using numpy.linalg.svd. Your function should take the data matrix and return two matrices:

  1. The projection of the data onto the principal components
  2. The actual components (eigenvectors) themselves.

Hint: do not forget to center the data by subtracting the mean.

In [ ]:
# Solution goes here
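A minimal sketch of such a function (the exact signature is up to you):

In [ ]:
def pca(X, k=None):
    """Principal component analysis via the SVD.

    Returns the projection of X onto the principal components and the
    components (eigenvectors of the covariance matrix) as rows.
    """
    X_centred = X - X.mean(axis=0)                        # centre the data
    U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    if k is None:
        k = X_centred.shape[1]
    components = Vt[:k]                                   # rows are eigenvectors
    projection = X_centred @ components.T                 # equals U[:, :k] * S[:k]
    return projection, components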

Obtain the projection of the toy data above onto its first two principal components. Plot the results. You should be able to see that the first principal component already gives you the axis of discrimination.

In [ ]:
# Solution goes here
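A sketch of the projection and plot, assuming the generate_toy_data and pca sketches above (or your own equivalents):

In [ ]:
proj, comps = pca(X_toy, k=2)
plt.scatter(proj[y_toy == 1, 0], proj[y_toy == 1, 1], marker='o', label='Gaussian at +1')
plt.scatter(proj[y_toy == -1, 0], proj[y_toy == -1, 1], marker='x', label='Gaussian at -1')
plt.xlabel('first principal component')
plt.ylabel('second principal component')
plt.legend()
plt.show()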

The classification data set

You have seen this dataset in earlier tutorials.

We will predict the incidence of diabetes based on various measurements (see description). Instead of directly using the raw data, we use a normalised version, where the label to be predicted (the incidence of diabetes) is in the first column. Download the data from mldata.org.

Read in the data using pandas.

In [ ]:
names = ['diabetes', 'num preg', 'plasma', 'bp', 'skin fold', 'insulin', 'bmi', 'pedigree', 'age']
data = pd.read_csv('diabetes_scale.csv', header=None, names=names)
data.diabetes.replace(-1, 0, inplace=True) # replace -1 with 0 because we need labels to be in {0, 1}
data.head()

Find the first two principal components of the features in the classification data set. Produce a scatter plot of the examples projected onto the first two principal components, using the labels to produce different symbols for each class. Discuss whether the first two principal components discriminate well.

In [ ]:
# Solution goes here
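A possible sketch, assuming the pca function above:

In [ ]:
features = data[names[1:]].values        # the eight measurement columns
labels = data['diabetes'].values         # 0/1 labels
proj, comps = pca(features, k=2)
for label, marker in [(0, 'x'), (1, 'o')]:
    mask = labels == label
    plt.scatter(proj[mask, 0], proj[mask, 1], marker=marker, label='diabetes = {}'.format(label))
plt.xlabel('first principal component')
plt.ylabel('second principal component')
plt.legend()
plt.show()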

(optional) Effect of normalisation on principal components

Plot scatter plots of the first two principal components of the classification dataset before and after the normalisations used in Tutorials 2 and 5.

In [ ]:
# Solution goes here
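A sketch of the comparison, using z-score standardisation as a stand-in for the exact normalisations of Tutorials 2 and 5 (substitute your own transforms):

In [ ]:
standardised = (features - features.mean(axis=0)) / features.std(axis=0)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, feats, title in [(axes[0], features, 'before normalisation'),
                         (axes[1], standardised, 'after normalisation')]:
    proj, _ = pca(feats, k=2)
    for label, marker in [(0, 'x'), (1, 'o')]:
        mask = labels == label
        ax.scatter(proj[mask, 0], proj[mask, 1], marker=marker)
    ax.set_title(title)
plt.show()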

Using principal components as features for classification

Write a file containing the features projected onto the first four principal components, using the to_csv method of pandas.

In [ ]:
# Solution goes here
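One possible sketch; the file name prin_feat4.csv matches the read-back cell below, and the column names are illustrative:

In [ ]:
proj4, _ = pca(features, k=4)
prin_df = pd.DataFrame(proj4, columns=['pc1', 'pc2', 'pc3', 'pc4'])
prin_df.insert(0, 'diabetes', labels)    # keep the label in the first column
prin_df.to_csv('prin_feat4.csv', index=False)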
In [ ]:
# Data should look something like the below
data = pd.read_csv('prin_feat4.csv')
data.head()

Use the first four principal components with your logistic regression code from Tutorial 3, and compare the results. For simplicity, compare the training error.

In [ ]:
# Solution goes here
In [ ]:
# Solution goes here
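A sketch of the comparison, using a simple maximum-likelihood fit via scipy.optimize as a stand-in for your Tutorial 3 logistic regression code:

In [ ]:
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(w, X, y):
    p = sigmoid(X @ w)
    eps = 1e-12                                           # avoid log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def training_error(X, y):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])         # prepend a bias column
    w = opt.minimize(neg_log_likelihood, np.zeros(Xb.shape[1]), args=(Xb, y)).x
    predictions = sigmoid(Xb @ w) > 0.5
    return np.mean(predictions != y)

print('training error, first four PCs:', training_error(proj4, labels))
print('training error, all features:  ', training_error(features, labels))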

(optional) Explore noisy features

Use numpy.random.randn to generate 20 random features and add them to the diabetes dataset. How does logistic regression perform on all features? Can PCA be used to identify the signal?

In [ ]:
# Solution goes here
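A sketch, reusing the training_error helper above: append 20 noise columns, refit, and look at how the variance spreads across the principal components.

In [ ]:
noise = np.random.randn(features.shape[0], 20)            # 20 pure-noise features
noisy_features = np.hstack([features, noise])
print('training error with noisy features:', training_error(noisy_features, labels))

# Fraction of variance captured by each principal component of the noisy data.
centred = noisy_features - noisy_features.mean(axis=0)
_, S, _ = np.linalg.svd(centred, full_matrices=False)
plt.plot(S**2 / np.sum(S**2), 'o-')
plt.xlabel('principal component')
plt.ylabel('fraction of variance')
plt.show()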

Eigenfaces

The aim of this section of the tutorial is to see that in some cases, the principal components can be human interpretable.

The images below are of Colin Powell, from the Labeled Faces in the Wild (LFW) dataset, resized to a smaller resolution. Download the images from the course website.

In [ ]:
# Visualising images
def plot_gallery(images, titles, h, w, n_row=2, n_col=6):
    """Helper function to plot a gallery of portraits"""
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())
In [ ]:
lfw_colin = pickle.load(open('lfw_colin.pkl', 'rb'))

# introspect the images array to find the shapes (for plotting)
n_samples, h, w = lfw_colin['images'].shape
plot_gallery(lfw_colin['images'], range(n_samples), h, w)

Use the pca function you wrote above to find the first 15 principal components. Visualise them. Discuss what the components potentially capture, for example lighting from the right.

Hint: each image needs to be flattened into a vector for PCA, and the resulting components need to be reshaped back into images.

In [ ]:
# Solution goes here
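A sketch, assuming the pca function above: flatten each image, take 15 components, and reshape them back into images for plot_gallery.

In [ ]:
X_faces = lfw_colin['images'].reshape(n_samples, h * w)   # one flattened image per row
proj_faces, components = pca(X_faces, k=15)
eigenfaces = components.reshape(15, h, w)                 # back to image shape
plot_gallery(eigenfaces, ['component %d' % i for i in range(15)], h, w, n_row=3, n_col=5)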