underlying causes (factors) that control the observed data.
Tobamovirus data set, see section 4.1 in Tipping and Bishop (1999) (Originally from Ripley (1996)).
include("scripts/pca_demo_helpers.jl")
X = readDataSet("datasets/virus3.dat")
38×18 Array{Int64,2}: 17 13 14 16 4 9 14 1 13 0 11 13 5 7 1 4 11 5 12 11 9 12 6 5 12 1 9 1 7 12 5 6 0 4 8 2 18 16 16 16 8 6 14 1 14 0 9 12 4 8 0 2 11 3 18 16 15 19 8 6 11 1 15 1 7 13 5 8 0 2 9 3 17 13 13 22 8 4 18 1 10 3 8 11 7 6 1 2 10 2 16 13 16 21 9 3 17 1 10 4 7 12 7 5 1 2 11 3 22 19 10 16 10 4 18 1 12 2 8 11 6 8 0 1 8 2 20 10 24 10 6 9 21 0 7 0 7 18 4 9 1 4 8 2 20 21 12 15 9 7 11 1 9 3 8 14 6 7 0 1 10 3 20 21 12 15 9 7 11 1 9 3 9 14 5 7 0 1 10 3 18 11 24 10 9 6 19 0 12 0 7 14 4 11 0 4 9 1 20 12 23 10 8 5 20 0 13 0 6 13 4 11 0 4 10 1 18 19 18 16 8 4 12 0 12 0 10 15 8 6 1 1 12 1 ⋮ ⋮ ⋮ ⋮ 17 12 22 10 8 5 18 0 14 0 5 13 4 10 0 3 9 1 17 16 16 16 8 6 15 1 14 0 9 12 4 8 0 2 11 3 19 17 15 17 7 6 15 1 14 0 8 12 4 8 0 2 10 3 18 16 16 19 8 6 11 1 15 1 7 13 5 8 0 2 9 3 18 17 15 17 8 6 15 1 14 0 8 12 4 8 0 3 9 3 15 12 14 23 8 3 17 1 9 4 7 15 6 6 1 2 11 2 13 11 14 22 7 3 17 1 10 4 8 13 6 6 1 3 11 2 16 11 15 23 10 4 18 1 10 3 7 12 6 5 1 2 9 3 14 11 14 25 11 3 19 2 10 2 7 12 6 5 1 2 9 3 11 11 15 24 10 5 18 1 11 1 7 14 5 7 2 3 11 2 15 9 12 21 8 4 21 1 10 3 7 15 7 6 1 3 10 3 15 11 15 22 7 3 19 1 8 3 4 14 6 5 1 2 10 2
or equivalently $$\begin{align*} p(x|z) &= \mathcal{N}(x|\,W z + \mu,\Psi) \tag{likelihood}\\ p(z) &= \mathcal{N}(z|\,0,I) \tag{prior} \end{align*}$$
$p(z|x)$ distributions are also Gaussian.
since the mean evaluates to $$\begin{align*} \mathrm{E}[x] &= \mathrm{E}[W z + \mu+ \epsilon] \\ &= W \mathrm{E}[z] + \mu + \mathrm{E}[\epsilon] \\ &= \mu \end{align*}$$ and the covariance matrix is $$\begin{align*} \mathrm{cov}[x] &= \mathrm{E}[({x}-{\mu})({x}-{\mu})^T] \\ &= \mathrm{E}[(W z +\epsilon)(W z +\epsilon)^T] \\ &= W \mathrm{E}[z z^T] W^T + \mathrm{E}[\epsilon \epsilon^T] \\ &= W W^T + \Psi \end{align*}$$
long skinny matrices plus a diagonal matrix.
i.e., the noise model is discarded altogether and the columns of $W$ are orthonormal.
$\Rightarrow$ FA, pPCA and PCA differ only by their model for the noise variance $\Psi$ (namely, diagonal, isotropic and 'zeros').
$\Rightarrow$ PCA is very widely applied to image and signal processing tasks!
$\Rightarrow$ FA has strong history in 'social sciences'
and observations ${D}=\{x_1,\dotsc,x_N\}$, find ML estimates for the parameters $\theta=\{W,\mu,\sigma\}$
Now subtract $\hat {\mu}$ from all data points (${x}_n:= {x}_n-\hat {\mu}$) and assume that we have zero-mean data.
and optimize w.r.t. $W$ and $\sigma^2$.
Let's perform pPCA on the example (Tobamovirus) data set using EM. We'll find the two principal components ($M=2$), and then visualize the data in a 2-D plot. The implementation is quite straightforward, have a look at the source file if you're interested in the details.
(θ, Z) = pPCA(X', 2) # uses EM, implemented in scripts/pca_demo_helpers.jl. Feel free to try more/less dimensions.
using PyPlot
plot(Z[1,:]', Z[2,:]', "w")
for n=1:size(Z,2)
PyPlot.text(Z[1,n], Z[2,n], string(n), fontsize=10) # put a label on the position of the data point
end
title("Projection of Tobamovirus data set on two dimensions (numbers correspond to data points)", fontsize=10);
Note that the solution is not unique, but the clusters should be more or less persistent.
Now let's randomly remove 20% of the data:
X_corrupt = convert(Matrix{Float64}, X) # convert to floating point matrix so we can use NaN to indicate missing values
X_corrupt[rand(length(X)) .< 0.2] = NaN # set the elements of X_corrupt to NaN with probability 0.2
X_corrupt
38×18 Array{Float64,2}: 17.0 13.0 14.0 16.0 4.0 … 7.0 NaN NaN 11.0 5.0 12.0 11.0 9.0 12.0 6.0 NaN 0.0 4.0 NaN 2.0 18.0 16.0 NaN 16.0 NaN NaN 0.0 2.0 11.0 3.0 18.0 16.0 15.0 NaN 8.0 8.0 0.0 2.0 9.0 3.0 17.0 13.0 13.0 22.0 8.0 6.0 1.0 2.0 10.0 NaN 16.0 13.0 16.0 21.0 9.0 … 5.0 1.0 2.0 11.0 3.0 22.0 19.0 10.0 16.0 10.0 8.0 0.0 1.0 8.0 NaN NaN 10.0 NaN 10.0 6.0 9.0 NaN 4.0 8.0 2.0 NaN NaN 12.0 15.0 NaN 7.0 0.0 NaN 10.0 3.0 20.0 21.0 12.0 15.0 9.0 NaN 0.0 NaN 10.0 3.0 NaN 11.0 24.0 10.0 9.0 … 11.0 NaN 4.0 9.0 1.0 20.0 12.0 23.0 NaN 8.0 11.0 NaN 4.0 10.0 1.0 18.0 19.0 18.0 NaN 8.0 6.0 NaN 1.0 12.0 NaN ⋮ ⋱ ⋮ 17.0 12.0 22.0 10.0 8.0 10.0 0.0 3.0 9.0 NaN 17.0 16.0 16.0 NaN 8.0 8.0 0.0 2.0 NaN 3.0 19.0 17.0 15.0 NaN NaN 8.0 0.0 2.0 10.0 NaN 18.0 16.0 NaN 19.0 8.0 NaN 0.0 2.0 9.0 3.0 18.0 17.0 15.0 17.0 8.0 … 8.0 0.0 3.0 9.0 3.0 15.0 NaN 14.0 23.0 NaN 6.0 1.0 NaN 11.0 2.0 13.0 11.0 14.0 22.0 7.0 6.0 1.0 3.0 11.0 NaN NaN 11.0 15.0 NaN 10.0 5.0 1.0 2.0 9.0 3.0 14.0 11.0 14.0 25.0 NaN 5.0 1.0 2.0 9.0 3.0 NaN 11.0 15.0 24.0 NaN … 7.0 2.0 NaN NaN 2.0 NaN 9.0 12.0 21.0 8.0 6.0 1.0 NaN 10.0 3.0 15.0 11.0 15.0 22.0 7.0 5.0 1.0 2.0 10.0 2.0
(θ, Z) = pPCA(X_corrupt', 2) # Perform pPCA on the corrupted data set
plot(Z[1,:]', Z[2,:]', "w")
for n=1:size(Z,2)
PyPlot.text(Z[1,n], Z[2,n], string(n), fontsize=10) # put a label on the position of the data point
end
title("Projection of CORRUPTED Tobamovirus data set on two dimensions", fontsize=10);
As you can see, pPCA is quite robust in the face of missing data.
The cell below loads the style file
open("../../styles/aipstyle.html") do f
display("text/html", readstring(f))
end