Guillermo Moncecchi
This notebook contains just some code notes on Principal Component Analysis, based on a very simple example. If you want the original, much better explanation, refer to this page from Sebastian Raschka.
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from mpl_toolkits.mplot3d import proj3d
np.set_printoptions(suppress=True,precision=3)
from matplotlib.patches import FancyArrowPatch
Let us define a very simple dataset with 6 people, identified by their height, weight, and the length of their middle finger (?). Each column is an instance with 3 attributes (we use this format so we can call the np.cov() function directly).
X_men=np.array([[1.97,110,5],[1.80,70,4.8],[1.70,90,4.9]]).transpose()
X_women=np.array([[1.65,52,4.7],[1.75,65,4.8],[1.67,58,4.6]]).transpose()
X = np.hstack((X_men,X_women))
print (X)
[[   1.97    1.8     1.7     1.65    1.75    1.67]
 [ 110.     70.     90.     52.     65.     58.  ]
 [   5.      4.8     4.9     4.7     4.8     4.6 ]]
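A side note on the format: by default np.cov() treats each row as a variable and each column as an observation, which is exactly why the instances were stacked as columns. If we preferred instances in rows we could pass rowvar=False instead; a quick check of this (not needed for the rest of the notebook) would be:
# np.cov() treats rows as variables by default (rowvar=True).
# With instances in rows we would pass rowvar=False and get the same matrix.
print(np.allclose(np.cov(X), np.cov(X.T, rowvar=False)))  # expected: True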
mean=np.mean(X,axis=1)
std=np.std(X,axis=1)
var=np.var(X,axis=1)
print ("Means:",mean)
print ("Standard deviations:",std)
print ("Variances:",var)
Means: [  1.757  74.167   4.8  ]
Standard deviations: [  0.108  19.953   0.129]
Variances: [   0.012  398.139    0.017]
Rescale every attribute (row) linearly to the [0, 1] range
def scale_linear_byrow(X):
mins = np.min(X, axis=1).reshape(3,1)
maxs = np.max(X, axis=1).reshape(3,1)
rng = maxs - mins
return ((X - mins)/rng)
X_r=scale_linear_byrow(X)
print(X_r)
[[ 1.     0.469  0.156  0.     0.313  0.063]
 [ 1.     0.31   0.655  0.     0.224  0.103]
 [ 1.     0.5    0.75   0.25   0.5    0.   ]]
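As a side check (assuming scikit-learn is installed), the same rescaling can be obtained with MinMaxScaler; it works per column (feature), so we transpose to instances-in-rows and back:
from sklearn.preprocessing import MinMaxScaler
# MinMaxScaler rescales each column (feature) to [0, 1], so we feed it the
# transposed matrix (instances in rows) and transpose the result back.
X_r_sk = MinMaxScaler().fit_transform(X.T).T
np.testing.assert_array_almost_equal(X_r, X_r_sk)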
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d')
plt.rcParams['legend.fontsize'] = 10
ax.plot(X_r[0,0:3], X_r[1,0:3], X_r[2,0:3], 'o', markersize=8, color='red', alpha=0.5, label='men')
ax.plot(X_r[0,3:6], X_r[1,3:6], X_r[2,3:6], 'o', markersize=8, color='blue', alpha=0.5, label='women')
ax.legend()
plt.show()
Calculate the mean, standard deviation and variance of every attribute, using the axis argument of each method
mean=np.mean(X_r,axis=1)
std=np.std(X_r,axis=1)
var=np.var(X_r,axis=1)
print ("Means:",mean)
print ("Standard deviations:",std)
print ("Variances:",var)
Means: [ 0.333  0.382  0.5  ]
Standard deviations: [ 0.336  0.344  0.323]
Variances: [ 0.113  0.118  0.104]
Now, obtain the covariance matrix for our sample
cvm=np.cov(X_r)
print (cvm)
[[ 0.136  0.112  0.102]
 [ 0.112  0.142  0.122]
 [ 0.102  0.122  0.125]]
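Just to make the formula explicit, the same matrix can be computed by hand: center every attribute and divide by n - 1, which is the default (unbiased) normalization used by np.cov:
# Covariance matrix by hand: center each attribute (row) and divide by n - 1.
n = X_r.shape[1]
X_c = X_r - np.mean(X_r, axis=1).reshape(3, 1)
print(X_c.dot(X_c.T) / (n - 1))  # should match the matrix above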
Calculate the eigenvalues and eigenvectors of the covariance matrix. Since the matrix is symmetric, we know they are real
eig_val_cov, eig_vec_cov = np.linalg.eig(cvm)
for i in range(len(eig_val_cov)):
eigvec_cov = eig_vec_cov[:,i].reshape(1,3).T
print('Eigenvalue {} from covariance matrix: {}'.format(i+1, eig_val_cov[i]))
print('Eigenvector:')
print(eigvec_cov)
Eigenvalue 1 from covariance matrix: 0.3587407970726606
Eigenvector:
[[ 0.561]
 [ 0.608]
 [ 0.562]]
Eigenvalue 2 from covariance matrix: 0.033429425967488116
Eigenvector:
[[ 0.825]
 [-0.356]
 [-0.439]]
Eigenvalue 3 from covariance matrix: 0.010660453485812363
Eigenvector:
[[ 0.066]
 [-0.71 ]
 [ 0.701]]
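A quick cross-check: since the covariance matrix is symmetric, we could also use np.linalg.eigh(), which is specialized for symmetric matrices and returns its eigenvalues in ascending order:
# np.linalg.eigh() is meant for symmetric matrices; its eigenvalues come back
# in ascending order, so we reverse them to compare with the values above.
eig_val_h, eig_vec_h = np.linalg.eigh(cvm)
print(eig_val_h[::-1])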
Verify that the eigenvalue equation holds for each pair, i.e. that cvm · v = λ · v:
for i in range(len(eig_val_cov)):
eigv = eig_vec_cov[:,i].reshape(1,3).T
np.testing.assert_array_almost_equal(cvm.dot(eigv), eig_val_cov[i] * eigv,
decimal=6, err_msg='', verbose=True)
# Sort eigenvectors given their eigenvalue
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_val_cov[i]), eig_vec_cov[:,i]) for i in range(len(eig_val_cov))]
# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort()
eig_pairs.reverse()
# Visually confirm that the list is correctly sorted by decreasing eigenvalues
for i in eig_pairs:
print(i)
# print(i[0])
(0.3587407970726606, array([ 0.561,  0.608,  0.562]))
(0.033429425967488116, array([ 0.825, -0.356, -0.439]))
(0.010660453485812363, array([ 0.066, -0.71 ,  0.701]))
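The eigenvalues also tell us how much of the total variance each principal component captures (each eigenvalue divided by the sum of all of them). Here the first component alone accounts for roughly 89% of the variance, so keeping two (or even one) components loses little information:
# Explained variance ratio: eigenvalue over the sum of all eigenvalues
total = sum(ev for ev, vec in eig_pairs)
for i, (ev, vec) in enumerate(eig_pairs):
    print('PC{}: {:.1%} of the total variance'.format(i + 1, ev / total))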
# Keep just two principal components
matrix_w = np.hstack((eig_pairs[0][1].reshape(3,1), eig_pairs[1][1].reshape(3,1)))
print('Matrix W:\n', matrix_w)
Matrix W:
 [[ 0.561  0.825]
 [ 0.608 -0.356]
 [ 0.562 -0.439]]
Transform instances to the new subspace...
# Transform the instances to the new subspace
transformed = np.dot(X_r.T,matrix_w).transpose()
print (transformed)
[[ 1.731  0.733  0.907  0.141  0.593  0.098]
 [ 0.03   0.057 -0.434 -0.11  -0.041  0.015]]
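As an optional sanity check (assuming scikit-learn is available), sklearn.decomposition.PCA should give essentially the same projection. Note that scikit-learn centers the data before projecting and that each component is defined only up to a sign flip, so we re-center our scores and compare absolute values:
from sklearn.decomposition import PCA
# scikit-learn expects instances in rows, so we pass X_r.T; it centers the data
# internally, so we subtract the per-component mean from our own scores before
# comparing, and use absolute values because eigenvector signs are arbitrary.
pca = PCA(n_components=2)
scores = pca.fit_transform(X_r.T)
ours_centered = transformed.T - transformed.T.mean(axis=0)
np.testing.assert_array_almost_equal(np.abs(ours_centered), np.abs(scores))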
plt.plot(transformed[0,0:3], transformed[1,0:3], 'o', markersize=7, color='blue', alpha=0.5, label='men')
plt.plot(transformed[0,3:6], transformed[1,3:6], '^', markersize=7, color='red', alpha=0.5, label='women')
#plt.xlim([-4,4])
#plt.ylim([-4,4])
plt.xlabel('x_values')
plt.ylabel('y_values')
plt.legend()
plt.title('Transformed samples with class labels')
plt.show()
# Just one dimension
matrix_w2 = eig_pairs[0][1].reshape(3,1)
print('Matrix W:\n', matrix_w2)
Matrix W:
 [[ 0.561]
 [ 0.608]
 [ 0.562]]
transformed2 = np.dot(X_r.T,matrix_w2).transpose()
print (transformed2)
[[ 1.731 0.733 0.907 0.141 0.593 0.098]]
plt.plot(transformed2[0,0:3],[0,0,0], 'o', markersize=7, color='blue', alpha=0.5, label='men')
plt.plot(transformed2[0,3:6],[0,0,0], '^', markersize=7, color='red', alpha=0.5, label='women')
#plt.xlim([-4,4])
#plt.ylim([-4,4])
plt.xlabel('x_values')
plt.ylabel('y_values')
plt.legend()
plt.title('Transformed samples with class labels')
plt.show()