The purpose of this document is to help you learn about Gaussian processes (GPs) by showing the relevant formulas in code and demonstrating their calculation. It is based on Nando de Freitas's code. Unfortunately, I could not understand every detail of the code, but I have annotated it as best I could.
Read an introductory text before going over this document; not much effort is made here to explain the concepts. The canonical text for Gaussian processes is Gaussian Processes for Machine Learning by Carl Rasmussen and Christopher Williams.
For completeness, I'm just going to say that a Gaussian process is a generalization of a Gaussian distribution in which the mean, $m$, and covariance, $k$, are functions (as opposed to a vector and a matrix).
$$ f \sim \mathcal{GP}(m(x),k(x,x')) $$
However, a zero prior mean is going to be assumed here.
from __future__ import division # so that / always does float division, even under Python 2
import numpy as np
import matplotlib.pyplot as pl
%matplotlib inline
""" This is code for simple GP regression.
It assumes a zero mean GP Prior """;
# This is the true unknown function we are trying to approximate
f = lambda x: np.sin(0.9*x).flatten() #flatten makes it 1-D
#f = lambda x: (0.25*(x**2)).flatten()
np.random.seed(123) # fix the seed for reproducible results; comment out for different random draws
Squared exponential kernel
$$ k(x,x') = \exp\left(-\frac{1}{2\theta}|x-x'|^2\right) $$
# Define the kernel
def kernel(x, xp):
    """ GP squared exponential kernel """
    theta = 0.1  # length-scale parameter
    # squared Euclidean distance between every row of x and every row of xp,
    # computed as x^2 + x'^2 - 2*x*x' via broadcasting
    sqdist = np.sum(x**2, 1).reshape(-1, 1) + np.sum(xp**2, 1) - 2*np.dot(x, xp.T)
    return np.exp(-.5 * (1/theta) * sqdist)
The NumPy code is written so that it returns a matrix of kernel values for all pairs of $x$ and $x'$.
kernel(np.array([[1],[2]]),np.array([[3],[2],[1]]))
array([[ 2.06115362e-09, 6.73794700e-03, 1.00000000e+00], [ 6.73794700e-03, 1.00000000e+00, 6.73794700e-03]])
What's VERY cool is that the same code handles inputs of arbitrary dimension!
kernel(np.array([[1,0,0],[2,0,0]]),np.array([[3,0,0],[2,0,0],[1,0,0]]))
array([[ 2.06115362e-09, 6.73794700e-03, 1.00000000e+00], [ 6.73794700e-03, 1.00000000e+00, 6.73794700e-03]])
By expanding the squared distance as $|x-x'|^2 = x^2 + x'^2 - 2xx'$, we can let NumPy vectorize the whole matrix at once. The code uses array broadcasting to perform an 'outer addition' that pairs every row of x with every row of xp.
The line of code calculating sqdist is clever but difficult to decipher! It might have been clearer to write explicit loops (compiled with Numba when speed is an issue), as in the sketch below.
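For reference, here is a loop-based version of the same computation. This is my own sketch, not part of the original code (the name kernel_loops is made up), and it fixes theta to the same value of 0.1; if speed mattered, such loops could be JIT-compiled with numba.njit.
def kernel_loops(x, xp, theta=0.1):
    """ Same squared exponential kernel, written with explicit loops. """
    K = np.zeros((x.shape[0], xp.shape[0]))
    for i in range(x.shape[0]):
        for j in range(xp.shape[0]):
            d = x[i] - xp[j]  # difference vector between one pair of points
            K[i, j] = np.exp(-0.5 * np.dot(d, d) / theta)
    return K

# agrees with the vectorized version up to floating-point error
print(np.allclose(kernel_loops(np.array([[1],[2]]), np.array([[3],[2],[1]])),
                  kernel(np.array([[1],[2]]), np.array([[3],[2],[1]]))))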
Assuming a zero-mean prior, we write the joint distribution of the training and test (star) values as (following Rasmussen)
$ \left[ \begin{array}{c} {\bf y} \\ {\bf f_{*}} \end{array} \right] \sim \mathcal{N}\left( {\bf 0} , \left[\begin{array}{cc} K(X,X)+\sigma_n^2 I & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{array} \right] \right) $ (Eqn. 2.21 in Rasmussen)
Primarily we want to compute mean values at our 'test' points given data.
$ {\bf\bar{f_*}} = K(X_*,X)[K(X,X)+\sigma_n^2I]^{-1}{\bf y} $ (2.23)
The covariance matrix allows us to draw samples from the distribution.
$ \text{cov}\left({\bf{f_*}}\right) = K(X_*,X_*) - K(X_*,X)[K(X,X)+\sigma_n^2I]^{-1}K(X,X_*) $ (2.24)
Here is the correspondence between the math symbols and code variables. Keep in mind the $K$'s are matrices.
symbol | code | shorthand |
------------|-----------------|-----------|
$K(X,X)$ | K | $K$ |
$K(X,X_*)$ | Ktest | $K_*$ |
$K(X_*,X_*)$ | Kstarstar | $K_{**}$ |
$\sigma_n$ | s | |
${\bf\bar{f_*}}$ | mu | |
n = 10 # number of training points.
N = 200 # number of test points. (star)
s = 0.05 # noise level (used below as the std of the observation noise)
# Sample some input points and noisy versions of the function evaluated at
# these points.
X = np.random.uniform(-5, 5, size=(n,1)) # "training"
y = f(X) + s*np.random.randn(n)
K = kernel(X, X)
# points we're going to make predictions at.
Xtest = np.linspace(-5, 5, N ).reshape(-1,1)
Ktest = kernel(X, Xtest) # K(X, X_*), i.e. K_* in the table above
If you see an inverted matrix in a text, it's probably a good idea not to invert it explicitly when implementing the calculation; use some kind of matrix factorization instead. Here, a Cholesky decomposition lets us calculate the mean vector and covariance matrix as well as draw samples from the prior and posterior distributions.
If we use a Cholesky decomposition to factor the training covariance $K+\sigma_n^2I$ as $LL^T$, we can write the mean as
$ {\bf\bar{f_*}} = K_*^T[K+\sigma_n^2I]^{-1}{\bf y} = K_*^T(LL^T)^{-1}{\bf y} = (L^{-1}K_*)^T(L^{-1}{\bf y}) = u^Tv $,
where $v$ is obtained by solving the triangular system $Lv={\bf y}$ and $u$ by solving $Lu=K_*$, so no matrix is ever inverted. This is exactly what the code below does with np.linalg.solve. An equivalent way to calculate the mean is ${\bf\bar{f_*}}=K_*^TL^{-T}(L^{-1}{\bf y})=K_*^Tw$, where $w$ solves the second triangular system $L^Tw=v$; that variant appears in the commented-out lines.
The (individual) predictive variances are the diagonal of the covariance matrix $\text{cov}({\bf f_*}) = K_{**} - K_*^T[K+\sigma_n^2I]^{-1}K_* = K_{**} - u^Tu$. Since $\text{diag}(u^Tu)_i = \sum_j u_{ji}^2$, the code only needs np.diag(Kstarstar) minus the column-wise sums of u**2, without ever forming the full matrix.
L = np.linalg.cholesky(K + s*np.eye(n)) # add the noise term s*I to the training covariance
#(the added diagonal also stabilizes the factorization numerically)
# compute the mean at our test points.
u = np.linalg.solve(L, Ktest) # u = L^{-1} K_*  (solves L u = K_*)
v = np.linalg.solve(L, y)     # v = L^{-1} y    (solves L v = y)
mu = np.dot(u.T, v)           # mean = u^T v = K_*^T [K + sI]^{-1} y
#another way to calculate is commented out
#w=np.linalg.solve(L.T,v)
#mu = np.dot( Ktest.T , w )
# compute the variance at our test points.
Kstarstar = kernel(Xtest, Xtest)
s2 = np.diag(Kstarstar) - np.sum(u**2, axis=0) # diag(K_** - u^T u); column sums of u**2 give diag(u^T u)
ss = np.sqrt(s2) # predictive standard deviations
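As a sanity check (this snippet is my addition, not part of the original code; the names Kinv, mu_direct and s2_direct are made up), we can compare the Cholesky-based results with the textbook formulas computed via an explicit inverse; they should agree to numerical precision.
Kinv = np.linalg.inv(K + s*np.eye(n))         # explicit inverse (fine at this small size)
mu_direct = np.dot(Ktest.T, np.dot(Kinv, y))  # K_*^T [K + sI]^{-1} y
s2_direct = np.diag(Kstarstar - np.dot(Ktest.T, np.dot(Kinv, Ktest)))
print(np.allclose(mu, mu_direct), np.allclose(s2, s2_direct))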
# PLOTS:
import seaborn as sns
pl.figure(1)
pl.clf()
pl.gca().fill_between(Xtest.flat, mu-2*ss, mu+2*ss, color="#dddddd")
pl.scatter(X, y, s=200, marker='+', c='black')
pl.plot(Xtest, f(Xtest), 'b-')
pl.plot(Xtest, mu, 'r--', lw=2)
pl.title('Mean prediction plus/minus 2 standard deviations')
pl.axis([-5, 5, -3, 3])
#pl.savefig('predictive.png', bbox_inches='tight')
pl.show()
Analytical solutions are nice, but to really believe we are working with random variables we can draw samples from the GP. Here is where the Cholesky decomposition comes in handy: we can generate samples from $ \mathcal{N}({\bf m},K) $ by using a standard (scalar) Gaussian generator and transforming its output as
$ {\bf m} + L\mathcal{N}(0,I) $,
where $K = LL^T$. Some details are in Appendix A.2 of Rasmussen. We could just use numpy.random.multivariate_normal, but then it wouldn't be as fun!
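Here is a tiny check (my addition, not in the original; the _demo names are made up) that the transformation ${\bf m} + L{\bf z}$ with ${\bf z}\sim\mathcal{N}(0,I)$ really does produce samples with covariance $K$: the sample covariance of many such draws should be close to $K$.
# build a small 3x3 covariance from the kernel defined above
K_demo = kernel(np.arange(3.).reshape(-1, 1), np.arange(3.).reshape(-1, 1)) + 1e-6*np.eye(3)
L_demo = np.linalg.cholesky(K_demo)
z = np.random.normal(size=(3, 100000))   # standard normal draws
samples = np.dot(L_demo, z)              # zero mean here, so m is omitted
print(np.round(np.cov(samples), 3))      # close to K_demo
print(np.round(K_demo, 3))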
Notice the prior only involves 'test' points and no data.
# draw samples from the prior at our test points.
npr_samples=20
L = np.linalg.cholesky(Kstarstar + 1e-6*np.eye( N ))
f_prior = np.dot(L, np.random.normal(size=(N,npr_samples)))
#notice no data goes in f_prior
pl.figure(2)
pl.clf()
#pl.plot(Xtest, f_prior)
sns.tsplot(f_prior.T, time=Xtest, err_style='unit_points')
pl.title(str(npr_samples)+' samples from the GP prior')
pl.axis([-5, 5, -3, 3])
#pl.savefig('prior.png', bbox_inches='tight')
pl.show()
In the posterior sampling, the 'data' enter through the covariance
$$\text{cov}({\bf f_*}) = K_{**} - K_*^T[K+\sigma_n^2I]^{-1}K_* = K_{**} - u^Tu, $$
which is verified numerically below.
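A quick numerical check of that identity, using the variables already defined above (this snippet is my addition, not part of the original code):
cov_direct = Kstarstar - np.dot(Ktest.T, np.dot(np.linalg.inv(K + s*np.eye(n)), Ktest))
cov_chol = Kstarstar - np.dot(u.T, u)
print(np.allclose(cov_direct, cov_chol))  # True: the two expressions agree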
# draw samples from the posterior at test points.
# commented out below is a more direct way to get L (it inverts the training covariance explicitly):
#L = np.linalg.cholesky(Kstarstar
#                       - np.dot(np.dot(Ktest.T, np.linalg.inv(K + s*np.eye(n))), Ktest)
#                       + 1e-6*np.eye(N))
nps_samples = 20
L = np.linalg.cholesky(Kstarstar + 1e-6*np.eye(N) - np.dot(u.T, u)) # posterior covariance K_** - u^T u (plus jitter)
f_post = mu.reshape(-1,1) + np.dot(L, np.random.normal(size=(N,nps_samples)))
pl.figure(3)
pl.clf()
pl.plot(Xtest, f(Xtest), 'b-')
#pl.plot(Xtest, f_post) #mpl plot
sns.tsplot(f_post.T, time=Xtest , err_style='unit_points')
pl.scatter(X, y, s=1000, marker='+', c='red', alpha=1, linewidths=3) # thicker strokes (linewidths) make the '+' markers more visible
#pl.plot(Xtest, mu, 'r--', lw=2)
#^^^turns out the theoretical mu is close to the estimated mu from tsplot
pl.title(str(nps_samples)+' samples from the GP posterior')
pl.axis([-5, 5, -3, 3])
#pl.savefig('post.png', bbox_inches='tight')
pl.show()