Sometimes, it is more appropriate to think in terms of continuous underlying causes (factors) that control the observed data.
include("scripts/pca_demo_helpers.jl")
X = readDataSet("datasets/virus3.dat")
$\Rightarrow$ FA, pPCA and PCA differ only by their model for the noise variance $\Psi$ (namely, diagonal, isotropic and 'zeros', respectively).
$\Rightarrow$ PCA is very widely applied to image and signal processing tasks!
$\Rightarrow$ FA has a strong history in the social sciences.
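To make the difference between the three noise models concrete, here is a minimal sketch that builds $\Psi$ for each case and the observation covariance $WW^T + \Psi$ it implies. It assumes a standard Gaussian prior on the latent variables; the loading matrix `W` and all variances are made-up illustrative values, not taken from the Tobamovirus data.

```julia
using LinearAlgebra

D, M = 4, 2                               # observed and latent dimensions (illustrative)
W = [1.0 0.0; 0.5 1.0; 0.0 2.0; 1.0 1.0]  # hypothetical D×M factor loading matrix

# The three noise models for Ψ (all values are made up)
Ψ_fa   = Matrix(Diagonal([0.1, 0.2, 0.3, 0.4]))  # FA: diagonal, one variance per observed dimension
Ψ_ppca = Matrix(0.25I, D, D)                     # pPCA: isotropic, a single shared variance
Ψ_pca  = zeros(D, D)                             # PCA: zero noise

# Covariance of the observations implied by each model, assuming z ~ N(0, I)
Σ_fa   = W * W' + Ψ_fa
Σ_ppca = W * W' + Ψ_ppca
Σ_pca  = W * W' + Ψ_pca

println("rank of Σ_pca:  ", rank(Σ_pca))   # without noise, the covariance lives in an M-dimensional subspace
println("rank of Σ_ppca: ", rank(Σ_ppca))  # the noise term makes the covariance full rank
```

Without a noise term the covariance is confined to the subspace spanned by the columns of `W`, which is why the zero-noise case behaves like a pure projection.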
Let's perform pPCA on the example (Tobamovirus) data set using EM. We'll find the two principal components ($M=2$) and then visualize the data in a 2-D plot. The implementation is quite straightforward; have a look at the source file if you're interested in the details.
using LinearAlgebra
include("scripts/pca_demo_helpers.jl")
X = readDataSet("datasets/virus3.dat")
(θ, Z) = pPCA(convert(Matrix,X'), 2) # uses EM, implemented in scripts/pca_demo_helpers.jl. Feel free to try more or fewer dimensions.
using PyPlot
plot(Z[1,:], Z[2,:], "w") # drawn in white, so that only the labels added below stand out
for n=1:size(Z,2)
    PyPlot.text(Z[1,n], Z[2,n], string(n), fontsize=10) # put a label on the position of the data point
end
title("Projection of Tobamovirus data set on two dimensions (numbers correspond to data points)", fontsize=10);
Note that the solution is not unique (for instance, the latent coordinates are only determined up to a rotation), but the clusters should be more or less persistent.
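The rotational ambiguity can be checked numerically. The sketch below uses a made-up loading matrix `W` and an arbitrary rotation `R` (both are illustrative assumptions, not output of `pPCA`): rotating the loadings leaves $WW^T$ unchanged, and the corresponding latent coordinates are rotated rigidly, so distances between points (and hence clusters) are preserved.

```julia
using LinearAlgebra

W = [1.0 0.0; 0.5 1.0; 0.0 2.0; 1.0 1.0]    # hypothetical 4×2 loading matrix
φ = π / 6                                   # arbitrary rotation angle
R = [cos(φ) -sin(φ); sin(φ) cos(φ)]         # 2-D rotation matrix, R * R' = I

W_rot = W * R                               # rotated loadings

# Both loading matrices imply the same W * W', so the data cannot tell them apart
same_cov = W * W' ≈ W_rot * W_rot'

# Latent points under the rotated solution are R' * z, since (W * R) * (R' * z) = W * z
z1, z2 = [1.0, 2.0], [-0.5, 0.3]            # two made-up latent points
d_before = norm(z1 - z2)
d_after  = norm(R' * z1 - R' * z2)          # the rotation preserves distances

println(same_cov, "  ", d_before ≈ d_after)
```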
Now let's randomly remove about 20% of the data (each entry is dropped independently with probability 0.2):
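X_corrupt = convert(Matrix{Float64}, X) # convert to floating point matrix so we can use NaN to indicate missing values
indices = findall(rand(Float64,size(X)) .< 0.2) # each entry is selected independently with probability 0.2
X_corrupt[indices] .= NaN # mark the selected entries as missing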
println(X_corrupt)
(θ, Z) = pPCA(convert(Matrix,X_corrupt'), 2) # Perform pPCA on the corrupted data set
plot(Z[1,:], Z[2,:], "w") # drawn in white, so that only the labels added below stand out
for n=1:size(Z,2)
    PyPlot.text(Z[1,n], Z[2,n], string(n), fontsize=10) # put a label on the position of the data point
end
title("Projection of CORRUPTED Tobamovirus data set on two dimensions", fontsize=10);
As you can see, pPCA is quite robust in the face of missing data.
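One way to see why a probabilistic model copes with missing entries is that the latent position of a data point can be inferred from its observed dimensions alone. The sketch below shows this for a pPCA model with known parameters. It is an illustration under stated assumptions, not necessarily what `scripts/pca_demo_helpers.jl` does: the function name `latent_mean_observed` and all parameter values are hypothetical.

```julia
using LinearAlgebra

# Posterior mean of the latent vector z given only the observed entries of x,
# for a pPCA model with loadings W, mean μ and noise variance σ2.
# NaN marks a missing entry, as in X_corrupt.
function latent_mean_observed(x, W, μ, σ2)
    obs = .!isnan.(x)                 # mask of the observed dimensions
    Wo  = W[obs, :]                   # rows of W that belong to observed dimensions
    Mo  = Wo' * Wo + σ2 * I           # M×M matrix built from the observed rows only
    return Mo \ (Wo' * (x[obs] - μ[obs]))
end

# Made-up model parameters and a noise-free data point
W = [1.0 0.0; 0.5 1.0; 0.0 2.0; 1.0 1.0]
μ = [0.0, 1.0, -1.0, 0.5]
z = [0.7, -0.3]
x_full = W * z + μ
x_miss = copy(x_full)
x_miss[3] = NaN                       # remove one entry

z_full = latent_mean_observed(x_full, W, μ, 1e-8)
z_miss = latent_mean_observed(x_miss, W, μ, 1e-8)
println(z_full, "  ", z_miss)
```

In this noise-free toy case both calls recover essentially the same latent point, because the remaining observed dimensions still determine `z`.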