This notebook covers a Python-based solution for the seventh programming exercise of the machine learning class on Coursera. Please refer to the exercise text for detailed descriptions and equations.
In this exercise we'll implement K-means clustering and use it to compress an image. We'll start with a simple 2D data set to see how K-means works, then we'll apply it to image compression. We'll also experiment with principal component analysis and see how it can be used to find a low-dimensional representation of images of faces.
To start out we're going to implement and apply K-means to a simple 2-dimensional data set to gain some intuition about how it works. K-means is an iterative, unsupervised clustering algorithm that groups similar instances together into clusters. The algorithm starts by guessing the initial centroids for each cluster, and then repeatedly assigns instances to the nearest cluster and re-computes the centroid of that cluster. The first piece that we're going to implement is a function that finds the closest centroid for each instance in the data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
%matplotlib inline
def find_closest_centroids(X, centroids):
m = X.shape[0]
k = centroids.shape[0]
idx = np.zeros(m)
for i in range(m):
min_dist = 1000000
for j in range(k):
dist = np.sum((X[i,:] - centroids[j,:]) ** 2)
if dist < min_dist:
min_dist = dist
idx[i] = j
return idx
Let's test the function to make sure it's working as expected. We'll use the test case provided in the exercise.
data = loadmat('data/ex7data2.mat')
X = data['X']
initial_centroids = initial_centroids = np.array([[3, 3], [6, 2], [8, 5]])
idx = find_closest_centroids(X, initial_centroids)
idx[0:3]
array([ 0., 2., 1.])
The output matches the expected values in the text (remember our arrays are zero-indexed instead of one-indexed so the values are one lower than in the exercise). Next we need a function to compute the centroid of a cluster. The centroid is simply the mean of all of the examples currently assigned to the cluster.
def compute_centroids(X, idx, k):
m, n = X.shape
centroids = np.zeros((k, n))
for i in range(k):
indices = np.where(idx == i)
centroids[i,:] = (np.sum(X[indices,:], axis=1) / len(indices[0])).ravel()
return centroids
compute_centroids(X, idx, 3)
array([[ 2.42830111, 3.15792418], [ 5.81350331, 2.63365645], [ 7.11938687, 3.6166844 ]])
This output also matches the expected values from the exercise. So far so good. The next part involves actually running the algorithm for some number of iterations and visualizing the result. This step was implmented for us in the exercise, but since it's not that complicated I'll build it here from scratch. In order to run the algorithm we just need to alternate between assigning examples to the nearest cluster and re-computing the cluster centroids.
def run_k_means(X, initial_centroids, max_iters):
m, n = X.shape
k = initial_centroids.shape[0]
idx = np.zeros(m)
centroids = initial_centroids
for i in range(max_iters):
idx = find_closest_centroids(X, centroids)
centroids = compute_centroids(X, idx, k)
return idx, centroids
idx, centroids = run_k_means(X, initial_centroids, 10)
cluster1 = X[np.where(idx == 0)[0],:]
cluster2 = X[np.where(idx == 1)[0],:]
cluster3 = X[np.where(idx == 2)[0],:]
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(cluster1[:,0], cluster1[:,1], s=30, color='r', label='Cluster 1')
ax.scatter(cluster2[:,0], cluster2[:,1], s=30, color='g', label='Cluster 2')
ax.scatter(cluster3[:,0], cluster3[:,1], s=30, color='b', label='Cluster 3')
ax.legend()
<matplotlib.legend.Legend at 0x18b0abe0>
One step we skipped over is a process for initializing the centroids. This can affect the convergence of the algorithm. We're tasked with creating a function that selects random examples and uses them as the initial centroids.
def init_centroids(X, k):
m, n = X.shape
centroids = np.zeros((k, n))
idx = np.random.randint(0, m, k)
for i in range(k):
centroids[i,:] = X[idx[i],:]
return centroids
init_centroids(X, 3)
array([[ 1.15354031, 4.67866717], [ 6.27376271, 2.24256036], [ 2.20960296, 4.91469264]])
Our next task is to apply K-means to image compression. The intuition here is that we can use clustering to find a small number of colors that are most representative of the image, and map the original 24-bit colors to a lower-dimensional color space using the cluster assignments. Here's the image we're going to compress.
from IPython.display import Image
Image(filename='data/bird_small.png')
The raw pixel data has been pre-loaded for us so let's pull it in.
image_data = loadmat('data/bird_small.mat')
image_data
{'A': array([[[219, 180, 103], [230, 185, 116], [226, 186, 110], ..., [ 14, 15, 13], [ 13, 15, 12], [ 12, 14, 12]], [[230, 193, 119], [224, 192, 120], [226, 192, 124], ..., [ 16, 16, 13], [ 14, 15, 10], [ 11, 14, 9]], [[228, 191, 123], [228, 191, 121], [220, 185, 118], ..., [ 14, 16, 13], [ 13, 13, 11], [ 11, 15, 10]], ..., [[ 15, 18, 16], [ 18, 21, 18], [ 18, 19, 16], ..., [ 81, 45, 45], [ 70, 43, 35], [ 72, 51, 43]], [[ 16, 17, 17], [ 17, 18, 19], [ 20, 19, 20], ..., [ 80, 38, 40], [ 68, 39, 40], [ 59, 43, 42]], [[ 15, 19, 19], [ 20, 20, 18], [ 18, 19, 17], ..., [ 65, 43, 39], [ 58, 37, 38], [ 52, 39, 34]]], dtype=uint8), '__globals__': [], '__header__': 'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Tue Jun 5 04:06:24 2012', '__version__': '1.0'}
A = image_data['A']
A.shape
(128L, 128L, 3L)
Now we need to apply some pre-processing to the data and feed it into the K-means algorithm.
# normalize value ranges
A = A / 255.
# reshape the array
X = np.reshape(A, (A.shape[0] * A.shape[1], A.shape[2]))
X.shape
(16384L, 3L)
# randomly initialize the centroids
initial_centroids = init_centroids(X, 16)
# run the algorithm
idx, centroids = run_k_means(X, initial_centroids, 10)
# get the closest centroids one last time
idx = find_closest_centroids(X, centroids)
# map each pixel to the centroid value
X_recovered = centroids[idx.astype(int),:]
X_recovered.shape
(16384L, 3L)
# reshape to the original dimensions
X_recovered = np.reshape(X_recovered, (A.shape[0], A.shape[1], A.shape[2]))
X_recovered.shape
(128L, 128L, 3L)
plt.imshow(X_recovered)
<matplotlib.image.AxesImage at 0x1abed198>
Cool! You can see that we created some artifacts in the compression but the main features of the image are still there. That's it for K-means. We'll now move on to principal component analysis.
PCA is a linear transformation that finds the "principal components", or directions of greatest variance, in a data set. It can be used for dimension reduction among other things. In this exercise we're first tasked with implementing PCA and applying it to a simple 2-dimensional data set to see how it works. Let's start off by loading and visualizing the data set.
data = loadmat('data/ex7data1.mat')
data
{'X': array([[ 3.38156267, 3.38911268], [ 4.52787538, 5.8541781 ], [ 2.65568187, 4.41199472], [ 2.76523467, 3.71541365], [ 2.84656011, 4.17550645], [ 3.89067196, 6.48838087], [ 3.47580524, 3.63284876], [ 5.91129845, 6.68076853], [ 3.92889397, 5.09844661], [ 4.56183537, 5.62329929], [ 4.57407171, 5.39765069], [ 4.37173356, 5.46116549], [ 4.19169388, 4.95469359], [ 5.24408518, 4.66148767], [ 2.8358402 , 3.76801716], [ 5.63526969, 6.31211438], [ 4.68632968, 5.6652411 ], [ 2.85051337, 4.62645627], [ 5.1101573 , 7.36319662], [ 5.18256377, 4.64650909], [ 5.70732809, 6.68103995], [ 3.57968458, 4.80278074], [ 5.63937773, 6.12043594], [ 4.26346851, 4.68942896], [ 2.53651693, 3.88449078], [ 3.22382902, 4.94255585], [ 4.92948801, 5.95501971], [ 5.79295774, 5.10839305], [ 2.81684824, 4.81895769], [ 3.88882414, 5.10036564], [ 3.34323419, 5.89301345], [ 5.87973414, 5.52141664], [ 3.10391912, 3.85710242], [ 5.33150572, 4.68074235], [ 3.37542687, 4.56537852], [ 4.77667888, 6.25435039], [ 2.6757463 , 3.73096988], [ 5.50027665, 5.67948113], [ 1.79709714, 3.24753885], [ 4.3225147 , 5.11110472], [ 4.42100445, 6.02563978], [ 3.17929886, 4.43686032], [ 3.03354125, 3.97879278], [ 4.6093482 , 5.879792 ], [ 2.96378859, 3.30024835], [ 3.97176248, 5.40773735], [ 1.18023321, 2.87869409], [ 1.91895045, 5.07107848], [ 3.95524687, 4.5053271 ], [ 5.11795499, 6.08507386]]), '__globals__': [], '__header__': 'MATLAB 5.0 MAT-file, Platform: PCWIN64, Created on: Mon Nov 14 22:41:44 2011', '__version__': '1.0'}
X = data['X']
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(X[:, 0], X[:, 1])
<matplotlib.collections.PathCollection at 0x18e19dd8>
The algorithm for PCA is fairly simple. After ensuring that the data is normalized, the output is simply the singular value decomposition of the covariance matrix of the original data.
def pca(X):
# normalize the features
X = (X - X.mean()) / X.std()
# compute the covariance matrix
X = np.matrix(X)
cov = (X.T * X) / X.shape[0]
# perform SVD
U, S, V = np.linalg.svd(cov)
return U, S, V
U, S, V = pca(X)
U, S, V
(matrix([[-0.79241747, -0.60997914], [-0.60997914, 0.79241747]]), array([ 1.43584536, 0.56415464]), matrix([[-0.79241747, -0.60997914], [-0.60997914, 0.79241747]]))
Now that we have the principal components (matrix U), we can use these to project the original data into a lower-dimensional space. For this task we'll implement a function that computes the projection and selects only the top K components, effectively reducing the number of dimensions.
def project_data(X, U, k):
U_reduced = U[:,:k]
return np.dot(X, U_reduced)
Z = project_data(X, U, 1)
Z
matrix([[-4.74689738], [-7.15889408], [-4.79563345], [-4.45754509], [-4.80263579], [-7.04081342], [-4.97025076], [-8.75934561], [-6.2232703 ], [-7.04497331], [-6.91702866], [-6.79543508], [-6.3438312 ], [-6.99891495], [-4.54558119], [-8.31574426], [-7.16920841], [-5.08083842], [-8.54077427], [-6.94102769], [-8.5978815 ], [-5.76620067], [-8.2020797 ], [-6.23890078], [-4.37943868], [-5.56947441], [-7.53865023], [-7.70645413], [-5.17158343], [-6.19268884], [-6.24385246], [-8.02715303], [-4.81235176], [-7.07993347], [-5.45953289], [-7.60014707], [-4.39612191], [-7.82288033], [-3.40498213], [-6.54290343], [-7.17879573], [-5.22572421], [-4.83081168], [-7.23907851], [-4.36164051], [-6.44590096], [-2.69118076], [-4.61386195], [-5.88236227], [-7.76732508]])
We can also attempt to recover the original data by reversing the steps we took to project it.
def recover_data(Z, U, k):
U_reduced = U[:,:k]
return np.dot(Z, U_reduced.T)
X_recovered = recover_data(Z, U, 1)
X_recovered
matrix([[ 3.76152442, 2.89550838], [ 5.67283275, 4.36677606], [ 3.80014373, 2.92523637], [ 3.53223661, 2.71900952], [ 3.80569251, 2.92950765], [ 5.57926356, 4.29474931], [ 3.93851354, 3.03174929], [ 6.94105849, 5.3430181 ], [ 4.93142811, 3.79606507], [ 5.58255993, 4.29728676], [ 5.48117436, 4.21924319], [ 5.38482148, 4.14507365], [ 5.02696267, 3.8696047 ], [ 5.54606249, 4.26919213], [ 3.60199795, 2.77270971], [ 6.58954104, 5.07243054], [ 5.681006 , 4.37306758], [ 4.02614513, 3.09920545], [ 6.76785875, 5.20969415], [ 5.50019161, 4.2338821 ], [ 6.81311151, 5.24452836], [ 4.56923815, 3.51726213], [ 6.49947125, 5.00309752], [ 4.94381398, 3.80559934], [ 3.47034372, 2.67136624], [ 4.41334883, 3.39726321], [ 5.97375815, 4.59841938], [ 6.10672889, 4.70077626], [ 4.09805306, 3.15455801], [ 4.90719483, 3.77741101], [ 4.94773778, 3.80861976], [ 6.36085631, 4.8963959 ], [ 3.81339161, 2.93543419], [ 5.61026298, 4.31861173], [ 4.32622924, 3.33020118], [ 6.02248932, 4.63593118], [ 3.48356381, 2.68154267], [ 6.19898705, 4.77179382], [ 2.69816733, 2.07696807], [ 5.18471099, 3.99103461], [ 5.68860316, 4.37891565], [ 4.14095516, 3.18758276], [ 3.82801958, 2.94669436], [ 5.73637229, 4.41568689], [ 3.45624014, 2.66050973], [ 5.10784454, 3.93186513], [ 2.13253865, 1.64156413], [ 3.65610482, 2.81435955], [ 4.66128664, 3.58811828], [ 6.1549641 , 4.73790627]])
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(X_recovered[:, 0], X_recovered[:, 1])
<matplotlib.collections.PathCollection at 0x1af80320>
Notice that the projection axis for the first principal component was basically a diagonal line through the data set. When we reduced the data to one dimension, we lost the variations around that diagonal line, so in our reproduction everything falls along that diagonal.
Our last task in this exercise is to apply PCA to images of faces. By using the same dimension reduction techniques we can capture the "essence" of the images using much less data than the original images.
faces = loadmat('data/ex7faces.mat')
X = faces['X']
X.shape
(5000L, 1024L)
The exercise code includes a function that will render the first 100 faces in the data set in a grid. Rather than try to re-produce that here, you can look in the exercise text for an example of what they look like. We can at least render one image fairly easily though.
face = np.reshape(X[3,:], (32, 32))
plt.imshow(face)
<matplotlib.image.AxesImage at 0x1b208cf8>
Yikes, that looks awful. These are only 32 x 32 grayscale images though (it's also rendering sideways, but we can ignore that for now). Anyway's let's proceed. Our next step is to run PCA on the faces data set and take the top 100 principal components.
U, S, V = pca(X)
Z = project_data(X, U, 100)
Now we can attempt to recover the original structure and render it again.
X_recovered = recover_data(Z, U, 100)
face = np.reshape(X_recovered[3,:], (32, 32))
plt.imshow(face)
<matplotlib.image.AxesImage at 0x1b5b8f60>
Observe that we lost some detail, though not as much as you might expect for a 10x reduction in the number of dimensions.
That concludes exercise 7. In the final exercise we'll implement algorithms for anomaly detection and build a recommendation system using collaborative filtering.