$% DEFINITIONS \newcommand{\bff}{\mathbf{f}} \newcommand{\bm}{\mathbf{m}} \newcommand{\bk}{\mathbf{k}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\bz}{\mathbf{z}} \newcommand{\bA}{\mathbf{A}} \newcommand{\bB}{\mathbf{B}} \newcommand{\bC}{\mathbf{C}} \newcommand{\bD}{\mathbf{D}} \newcommand{\bI}{\mathbf{I}} \newcommand{\bK}{\mathbf{K}} \newcommand{\bL}{\mathbf{L}} \newcommand{\bM}{\mathbf{M}} \newcommand{\bX}{\mathbf{X}} \newcommand{\bLambda}{\boldsymbol{\Lambda}} \newcommand{\bSigma}{\boldsymbol{\Sigma}} \newcommand{\bmu}{\boldsymbol{\mu}} \newcommand{\calN}{\mathcal{N}} \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\C}{\mathbb{C}} \newcommand{\Rd}{\R^d} \newcommand{\Rdd}{\R^{d\times d}} \newcommand{\bzero}{\mathbf{0}} \newcommand{\GP}{\mbox{GP}} % END OF DEFINITIONS $ In many engineering problems we have to deal with functions that are unknown. For example, in oil reservoir modeling, the permeability tensor $\bK(\bx)$ or the porosity $\phi(\bx)$ of the ground are, generally, unknown quantities. Therefore, we would like to treat them as if they were random. That is, we have to talk about probabilities on function spaces. Such a thing is achieved via the theory of random fields. However, instead of developing the generic mathematical theory of random fields, we concentrate on a special class of random fields, the Gaussian random fields or Gaussian processes.
A Gaussian process (GP) is a generalization of the multivariate Gaussian distribution to infinite dimensions. It essentially defines a probability measure on a function space. When we say that $f(\cdot)$ is a GP, we mean that it is a random variable that is actually a function. Mathematically, we write: \begin{equation} f(\cdot) | m(\cdot), k(\cdot, \cdot) \sim \GP\left(f(\cdot) | m(\cdot), k(\cdot, \cdot) \right), \end{equation} where $m:\Rd\rightarrow \R$ is the mean function and $k:\Rd\times\Rd\rightarrow\R$ is the covariance function. So, compared to a multivariate normal, the mean vector is replaced by a mean function and the covariance matrix by a covariance function.
But what does this definition actually mean? It gets its meaning from the multivariate Gaussian distribution. Here is how:
Consider $n$ arbitrary input points $\bx_1,\dots,\bx_n\in\Rd$ and collect these points in an $n\times d$ matrix: \begin{equation} \bX = \left( \begin{array}{c} \bx_1\\ \vdots\\ \bx_n \end{array} \right). \end{equation}
Then, the joint probability density of the outputs $\bff = (f(\bx_1),\dots,f(\bx_n))$ at the arbitrary inputs in $\bX$ is the following multivariate normal: \begin{equation} \bff | \bX, m(\cdot), k(\cdot, \cdot) \sim \calN\left(\bff | \bm(\bX), \bk(\bX, \bX) \right), \end{equation} with mean vector: $$ \bm(\bX) = \left( \begin{array}{c} m(\bx_1)\\ \vdots\\ m(\bx_n) \end{array} \right), $$ and covariance matrix: $$ \bk(\bX, \bX) = \left( \begin{array}{ccc} k(\bx_1,\bx_1) & \dots & k(\bx_1, \bx_n)\\ \vdots & \ddots & \vdots\\ k(\bx_n, \bx_1) & \dots & k(\bx_n, \bx_n) \end{array} \right). $$
What is the meaning of $m(\cdot)$? Well, it is quite easy to grasp. For any point $\bx\in\R^d$, $m(\bx)$ should give us the value we believe is most probable for $f(\bx)$. Mathematically, $m(\bx)$ is nothing more than the expected value of the random variable $f(\bx)$. That is: \begin{equation} m(\bx) = \mathbb{E}[f(\bx)]. \end{equation}
In practical applications, we usually take the mean to be one of the following (a small code sketch follows the list):

- A constant, $$ m(\bx) = c, $$ where $c$ is a parameter.
- A linear function of the input, $$ m(\bx) = c_0 + \sum_{i=1}^d c_i x_i, $$ where $c_i, i=0,\dots,d$ are parameters.
- A generalized linear model, $$ m(\bx) = \sum_i c_i \phi_i(\bx), $$ where $c_i$ and $\phi_i(\cdot)$ are parameters and basis functions, respectively.
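To make this concrete, here is a minimal sketch of how the first two choices could be written in Python. This is my own illustrative code, not a fixed API; the names constant_mean and linear_mean are hypothetical.
import numpy as np

def constant_mean(x, c=0.):
    """Constant mean function: m(x) = c."""
    return c

def linear_mean(x, c0=0., c=None):
    """Linear mean function: m(x) = c0 + sum_i c_i * x_i."""
    x = np.asarray(x)
    c = np.zeros(x.shape[0]) if c is None else np.asarray(c)
    return c0 + np.dot(c, x)

print(linear_mean([1., 2.], c0=0.5, c=[1., -1.]))   # 0.5 + 1.0 * 1.0 - 1.0 * 2.0 = -0.5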
What is the meaning of $k(\cdot, \cdot)$? This concept is considerably more challenging than the mean. Let's try to break it down (a small numerical check follows the list):

- For $\bx = \bx'$, $k(\bx, \bx)$ is the variance of the random variable $f(\bx)$. In other words, we believe that there is about a $95\%$ probability that the value of $f(\bx)$ falls within the interval: $$ \left(m(\bx) - 2\sqrt{k(\bx, \bx)},\; m(\bx) + 2\sqrt{k(\bx,\bx)}\right). $$
- For $\bx \neq \bx'$, $k(\bx, \bx')$ measures how strongly the random variables $f(\bx)$ and $f(\bx')$ are correlated. In particular, $k(\bx,\bx')$ is equal to the covariance of the random variables $f(\bx)$ and $f(\bx')$, i.e., $$ k(\bx, \bx') = \mathbb{C}[f(\bx), f(\bx')] = \mathbb{E}\left[ \left(f(\bx) - m(\bx)\right) \left(f(\bx') - m(\bx')\right) \right]. $$
- Not every function of two inputs can serve as a covariance function. Namely, for any $\bx\in\Rd$ we must have $k(\bx, \bx) > 0$. This is easily understood from the interpretation of $k(\bx, \bx)$ as the variance of the random variable $f(\bx)$.
- Typically, $k(\bx, \bx')$ becomes smaller as the distance between $\bx$ and $\bx'$ grows.
- For any choice of points $\bX\in\R^{n\times d}$, the covariance matrix $\bk(\bX, \bX)$ has to be positive-definite (so that the vector of outputs $\bff$ is indeed a multivariate normal distribution).
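As a quick numerical illustration of the $95\%$ statement above (this check is my own addition; the values chosen for $m(\bx)$ and $k(\bx,\bx)$ below are arbitrary), we can draw many samples of a Gaussian random variable and count how often they land inside the two-standard-deviation interval:
import numpy as np

m_x = 0.5                                   # a hypothetical value of m(x)
k_xx = 2.0                                  # a hypothetical value of k(x, x)
z = m_x + np.sqrt(k_xx) * np.random.randn(100000)
lo, hi = m_x - 2. * np.sqrt(k_xx), m_x + 2. * np.sqrt(k_xx)
print(np.mean((z > lo) & (z < hi)))         # approximately 0.95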
The squared exponential (SE) is the most commonly used covariance function. Its formula is as follows: $$ k(\bx, \bx') = s^2\exp\left\{-\frac{1}{2}\sum_{i=1}^d\frac{(x_i - x_i')^2}{\ell_i^2}\right\}, $$ where $s,\ell_i>0, i=1,\dots,d$ are parameters. The interpretation of the parameters is as follows:

- $s$ is the signal strength. It controls the magnitude of the fluctuations of $f(\cdot)$ about the mean.
- $\ell_i$ is the length scale of the $i$-th input dimension. The bigger it is, the smoother the samples of $f(\cdot)$ appear along the $i$-th input dimension.
If you do not understand these concepts at this point, do not worry. They will become crystal clear when we learn how we can actually obtain samples of $f(\cdot)$.
For now, let us implement the SE covariance function in Python:
import numpy as np
import math
def se_cov(x1, x2, s=1., ell=1.):
"""
A function that computes the SE covariance.
:param x1: One input point (1D array).
:param x2: Another input point (1D array).
:param s: The signal strength (> 0).
:param ell: The length scale(s). Either a positive number or a 1D array of
positive numbers.
:returns: The value of the SE covariance evaluated at x1 and x2 with
signal strength s and length scale(s) ell.
.. pre::
+ x1 must have the same dimensions as x2
+ ell must either be a float or a 1D array with the same dimensions as x1
"""
tmp = (x1 - x2) / ell
return s ** 2 * math.exp(-0.5 * np.dot(tmp, tmp))
As you can see, se_cov() implements the SE covariance function.
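As a quick, illustrative sanity check (my own addition, not part of the original code), the covariance of a point with itself equals $s^2$, and the value decays as the two points move apart:
x1 = np.array([0.])
x2 = np.array([2.])
print(se_cov(x1, x1, s=2., ell=1.))   # 4.0, i.e., s ** 2
print(se_cov(x1, x2, s=2., ell=1.))   # about 0.54, much smaller: the points are two length scales apart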
Let's look at it in one dimension, i.e., $d=1$. We will fix $x'=0$
and plot $k(x,0)$ as a function of $x$.
Here we go:
import matplotlib.pyplot as plt
# The signal strength we want to use
s = 1.
# The length scale we want to use
ell = 1.
x = np.linspace(-5, 5, 100)
kx = [se_cov(x[i], 0, s=s, ell=ell) for i in range(x.shape[0])]
plt.plot(x, kx, 'b', linewidth=2, label='SE: $s=%1.2f,\;\ell=%1.2f$' % (s, ell))
plt.ylim([0., 1.5])
plt.xlabel('$x$', fontsize=16)
plt.ylabel('$k(x,0)$', fontsize=16)
plt.legend(loc='best', fontsize=16)
plt.show()
For convenience, let us also create a wrapper on top of the covariance function we just implemented that will allow us to compute covariance matrices. The following code computes the cross-covariance matrix $$ \bk(\bX_1, \bX_2) = \left( \begin{array}{ccc} k(\bx_{11}, \bx_{21}) & \dots & k(\bx_{11}, \bx_{2n_2})\\ \vdots & \ddots & \vdots \\ k(\bx_{1n_1}, \bx_{21}) & \dots & k(\bx_{1n_1}, \bx_{2n_2}) \end{array} \right). $$
def cov_mat(X1, X2, cov_fun=se_cov, **cov_params):
"""
Compute the cross covariance matrix of `X1` and `X2` for the covariance
function `cov_fun`.
:param X1: A matrix of input points (n1 x d)
:param X2: A matrix of input points (n2 x d)
:param cov_fun: The covariance function to use
:param cov_params: Any parameters that we should pass to the covariance
function `cov_fun`.
"""
X1 = np.array(X1)
X2 = np.array(X2)
    return np.array([[cov_fun(X1[i, :], X2[j, :], **cov_params) for j in range(X2.shape[0])]
                     for i in range(X1.shape[0])])
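As a quick illustration (my own addition; X_test is just a small set of random points), the covariance matrix of a set of points with itself is square, symmetric, and has $s^2$ on its diagonal:
X_test = np.random.rand(5, 2)              # 5 hypothetical input points in 2D
K_test = cov_mat(X_test, X_test, s=1., ell=0.3)
print(K_test.shape)                        # (5, 5)
print(np.allclose(K_test, K_test.T))       # True: the matrix is symmetric
print(np.allclose(np.diag(K_test), 1.))    # True: the diagonal equals s ** 2 = 1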
Let us use this code to plot the SE covariance function for various choices of length scales:
ells = [0.25, 0.5, 1., 1.5, 2.]
s = 1.
for ell in ells:
k = cov_mat(x[:, None], [[0]], ell=ell)
plt.plot(x, k, linewidth=2, label='SE: $s=%1.2f,\;\ell=%1.2f$' % (s, ell))
plt.ylim([0., 1.])
plt.xlim([-6, 20])
plt.xlabel('$x$', fontsize=16)
plt.ylabel('$k(x,0)$', fontsize=16)
plt.legend(loc='best', fontsize=16)
plt.show()
Now, let's repeat this for various signal strengths:
ell = 1.
ss = [0.5, 0.8, 1., 1.5, 2.]
for s in ss:
k = cov_mat(x[:, None], [[0]], s=s)
plt.plot(x, k, linewidth=2, label='SE: $s=%1.2f,\;\ell=%1.2f$' % (s, ell))
plt.xlim([-6, 20])
plt.xlabel('$x$', fontsize=16)
plt.ylabel('$k(x,0)$', fontsize=16)
plt.legend(loc='best', fontsize=16)
plt.show()
Now, let's look at the covariance function in 2D. We will plot $k(\bx = (x_1, x_2), \bzero_2)$ as a function of $x_1$ and $x_2$.
First we need a grid:
x = np.linspace(-3, 3, 16)
X1, X2 = np.meshgrid(x, x)
We will evaluate the covariance at these points:
X = np.hstack([X1.flatten()[:, None], X2.flatten()[:, None]])
plt.plot(X[:, 0], X[:, 1], '.')
plt.xlabel('$x_1$', fontsize=16)
plt.ylabel('$x_2$', fontsize=16)
Let's plot the covariance:
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
s = 1.
ell = 1.
k = cov_mat(X, [[0, 0]], s=s, ell=ell)
K = k.reshape(X1.shape)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_wireframe(X1, X2, K, color='black')
ax.contourf(X1, X2, K, alpha=0.5)
Now, let us plot it, with different length scales per input:
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
s = 1.
ell = [1, 0.5]
k = cov_mat(X, [[0, 0]], s=s, ell=ell)
K = k.reshape(X1.shape)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_wireframe(X1, X2, K, color='black')
ax.contourf(X1, X2, K, alpha=0.5)
Samples from a Gaussian process are functions. But how can we get them? Well, actually, we are not going to sample real functions. What we are going to do is pick a dense set of points $\bX\in\R^{n\times d}$ and sample the values of the GP, $\bff = (f(\bx_1),\dots,f(\bx_n))$, at these points. We saw above that the probability density of $\bff$ is just a multivariate normal with a mean vector that is specified by the mean function and a covariance matrix that is specified by the covariance function. Therefore, all we need to know is how to sample from the multivariate normal. But we have learned this already...
Here is the code:
(For the moment, ignore the noise_variance parameter; here its sole purpose is to stabilize the Cholesky decomposition. We will say more about it later on.)
import scipy.linalg
from scipy.stats import norm
def sample_gp(X, cov_fun=se_cov, num_samples=1, noise_variance=1e-12, **cov_params):
"""
Sample a zero-mean Gaussian process at the inputs specified by X.
:param X: The input points on which we will sample the GP.
:param cov_fun: The covariance function we use.
:param num_samples: The number of samples to take.
:param noise_variance: The noise of the process.
:param cov_params: Any parameters of the covariance function.
"""
# Compute the covariance matrix:
K = cov_mat(X, X, cov_fun=cov_fun, **cov_params) + noise_variance * np.eye(X.shape[0])
# Compute the cholesky of this:
L = scipy.linalg.cholesky(K, lower=True)
# Take a sample of standard normals
z = norm.rvs(size=(X.shape[0], num_samples))
# Build the sample from the multivariate normal
f = np.dot(L, z)
return f.T
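To see why the tiny noise_variance jitter helps, here is a sketch of my own (not part of the original code; whether the un-jittered factorization fails can depend on the machine): we try the Cholesky factorization of a dense SE covariance matrix with and without a small diagonal term.
def try_cholesky(K, label):
    """Report whether the Cholesky factorization of K succeeds."""
    try:
        scipy.linalg.cholesky(K, lower=True)
        print(label + ': success')
    except np.linalg.LinAlgError:
        print(label + ': failed (matrix is numerically not positive-definite)')

x_dense = np.linspace(-3, 3, 200)[:, None]
K_dense = cov_mat(x_dense, x_dense, s=1., ell=1.)
try_cholesky(K_dense, 'no jitter')
try_cholesky(K_dense + 1e-12 * np.eye(200), 'jitter 1e-12')
try_cholesky(K_dense + 1e-6 * np.eye(200), 'jitter 1e-6')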
Perhaps the best way to demonstrate this is via a 1D example. We will take the mean function to be zero (dashed blue line) and draw 5 samples of $f(\cdot)$ (red lines).
cov_fun=se_cov
s = 1.
ell = 1.
x = np.linspace(-3, 3, 50)[:, None]
f = sample_gp(x, num_samples=5, cov_fun=cov_fun, s=s, ell=ell)
plt.plot(x, np.zeros(x.shape[0]), '--b', linewidth=2, label='$m(x)$')
plt.plot(x, f.T, 'r', linewidth=2, label='GP samples')
plt.ylim([-4, 4])
plt.title('GP with SE ($s=%1.2f\; \ell=%1.2f$) and zero mean: samples' % (s, ell), fontsize=16)
plt.xlabel('$x$', fontsize=16)
plt.ylabel('$f(x)$', fontsize=16)
Let us now repeat this experiment with a smaller length scale:
cov_fun=se_cov
s = 1.
ell = .5
x = np.linspace(-3, 3, 50)[:, None]
f = sample_gp(x, num_samples=5, cov_fun=cov_fun, s=s, ell=ell)
plt.plot(x, np.zeros(x.shape[0]), '--b', linewidth=2, label='$m(x)$')
plt.plot(x, f.T, 'r', linewidth=2, label='GP samples')
plt.ylim([-4, 4])
plt.title('GP with SE ($s=%1.2f\; \ell=%1.2f$) and zero mean: samples' % (s, ell), fontsize=16)
plt.xlabel('$x$', fontsize=16)
plt.ylabel('$f(x)$', fontsize=16)
You can probably notice the increased variability of the samples.
Now we make $\ell$ bigger and repeat the experiment:
cov_fun=se_cov
s = 1.
ell = 4.
x = np.linspace(-3, 3, 50)[:, None]
f = sample_gp(x, num_samples=5, cov_fun=cov_fun, s=s, ell=ell)
plt.plot(x, np.zeros(x.shape[0]), '--b', linewidth=2, label='$m(x)$')
plt.plot(x, f.T, 'r', linewidth=2, label='GP samples')
plt.ylim([-4, 4])
plt.title('GP with SE ($s=%1.2f\; \ell=%1.2f$) and zero mean: samples' % (s, ell), fontsize=16)
plt.xlabel('$x$', fontsize=16)
plt.ylabel('$f(x)$', fontsize=16)
And you can see that the samples have become smoother.
Let's increase the signal strength $s$ and see what happens:
cov_fun=se_cov
s = 2.
ell = 1.
x = np.linspace(-3, 3, 50)[:, None]
f = sample_gp(x, num_samples=5, cov_fun=cov_fun, s=s, ell=ell)
plt.plot(x, np.zeros(x.shape[0]), '--b', linewidth=2, label='$m(x)$')
plt.plot(x, f.T, 'r', linewidth=2, label='GP samples')
plt.ylim([-4, 4])
plt.title('GP with SE ($s=%1.2f\; \ell=%1.2f$) and zero mean: samples' % (s, ell), fontsize=16)
plt.xlabel('$x$', fontsize=16)
plt.ylabel('$f(x)$', fontsize=16)
Finally, we decrease the signal strength to see what will happen to the samples:
cov_fun=se_cov
s = 0.5
ell = 1.
x = np.linspace(-3, 3, 50)[:, None]
f = sample_gp(x, num_samples=5, cov_fun=cov_fun, s=s, ell=ell)
plt.plot(x, np.zeros(x.shape[0]), '--b', linewidth=2, label='$m(x)$')
plt.plot(x, f.T, 'r', linewidth=2, label='GP samples')
plt.ylim([-4, 4])
plt.title('GP with SE ($s=%1.2f\; \ell=%1.2f$) and zero mean: samples' % (s, ell), fontsize=16)
plt.xlabel('$x$', fontsize=16)
plt.ylabel('$f(x)$', fontsize=16)
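Before moving on, here is a small empirical check of my own connecting the signal strength to the samples: the pointwise standard deviation of many GP samples should be roughly equal to $s$.
s_check = 0.5
x_check = np.linspace(-3, 3, 50)[:, None]
f_many = sample_gp(x_check, num_samples=2000, cov_fun=se_cov, s=s_check, ell=1.)
print(f_many.std(axis=0).mean())   # roughly 0.5, i.e., the signal strength s_check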
How do we sample from a GP with a non-zero mean function? You just add the mean function to the samples you take. Let us look at some samples from a GP with an SE covariance and a mean function given by: $$ m(x) = 5\sin(x) e^{-|x|}. $$ Here is the code:
cov_fun=se_cov
s = 1.
ell = 1.
x = np.linspace(-6, 6, 50)[:, None]
m = 5. * np.sin(x) * np.exp(-np.abs(x))
f = m.T + sample_gp(x, num_samples=5, cov_fun=cov_fun, s=s, ell=ell)
plt.plot(x, m, '--b', linewidth=2, label='$m(x)$')
plt.plot(x, f.T, 'r', linewidth=2, label='GP samples')
plt.ylim([-4, 4])
plt.title('GP with SE ($s=%1.2f\; \ell=%1.2f$) and mean $m(x)=5\sin(x)e^{-|x|}$: samples' % (s, ell), fontsize=16)
plt.xlabel('$x$', fontsize=16)
plt.ylabel('$f(x)$', fontsize=16)
Sampling a 2D GP, i.e., one whose input $\bx$ has two components, is as easy as sampling a 1D GP.
The first step is to get a grid of points in two dimensions. You do this as follows:
x = np.linspace(-3, 3, 32)
X1, X2 = np.meshgrid(x, x)
These points can be visualized as follows:
X = np.hstack([X1.flatten()[:, None], X2.flatten()[:, None]])
plt.plot(X[:, 0], X[:, 1], '.')
plt.show()
What we are going to do is sample the GP on each one of these points:
num_samples = 5
s = 1.
ell = 1.
f = sample_gp(X, num_samples=num_samples, cov_fun=cov_fun, s=s, ell=ell)
F = f.reshape((num_samples, ) + X1.shape)
for i in range(num_samples):
fig = plt.figure()
ax = fig.add_subplot(111)
ax.contourf(X1, X2, F[i, :, :])
Let us now repeat the experiment with a much smaller length scale:
num_samples = 5
s = 1.
ell = .5
f = sample_gp(X, num_samples=num_samples, cov_fun=cov_fun, s=s, ell=ell)
F = f.reshape((num_samples, ) + X1.shape)
for i in range(num_samples):
fig = plt.figure()
ax = fig.add_subplot(111)
ax.contourf(X1, X2, F[i, :, :])
Now, let us make one length scale small (the horizontal) and one large:
num_samples = 5
s = 1.
ell = [0.5, 1.]
f = sample_gp(X, num_samples=num_samples, cov_fun=cov_fun, s=s, ell=ell)
F = f.reshape((num_samples, ) + X1.shape)
for i in range(num_samples):
fig = plt.figure()
ax = fig.add_subplot(111)
ax.contourf(X1, X2, F[i, :, :])
Here are a few exercises to try on your own. A covariance function $k(\cdot,\cdot)$ is called stationary if, for any $\bx,\bx'\in\Rd$, $k(\bx, \bx')$ depends only on the difference $\bx - \bx'$, i.e., $$ k(\bx, \bx') = k(\bx - \bx'). $$ Show that the SE covariance is a stationary covariance function.
The exponential covariance function, which in one dimension reads $$ k(x, x') = s^2\exp\left\{-\frac{|x - x'|}{\ell}\right\}, $$ is commonly used to model soil permeability. Plot the exponential covariance in 1D in the same figure as the SE covariance with the same signal strength and length scale parameters. What do you notice?
Take samples of $f(\cdot)$ using the exponential covariance, as we did for the SE covariance. Notice that the sampled paths are continuous but not differentiable.
Repeat the 2D sampling experiment with the exponential covariance, as we did for the SE covariance.
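If you want a starting point for the last two exercises, here is one possible (illustrative, not the only) way to write an exponential covariance with per-dimension length scales; it plugs directly into cov_mat() and sample_gp() through their cov_fun argument:
def exp_cov(x1, x2, s=1., ell=1.):
    """A possible exponential covariance with per-dimension length scales."""
    tmp = np.abs(np.asarray(x1) - np.asarray(x2)) / ell
    return s ** 2 * math.exp(-np.sum(tmp))

# For example: f = sample_gp(x, num_samples=5, cov_fun=exp_cov, s=1., ell=1.)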