Introduction to GPy: Covariance Functions

20th January 2014

Neil D. Lawrence

In [3]:
%matplotlib inline
import numpy as np
import pods
import pylab as plt
import GPy
from IPython.display import display

Covariance Functions in GPy

In the last session we introduced Gaussian process models through constructing covariance functions in numpy. The GPy software is a BSD licensed software package for modeling with Gaussian processes in python. It is designed to make it easy for the user to construct models and interact with data. The software is BSD licensed to ensure that there are as few as possible constraints on its use. The GPy documentation is produced with Sphinx and is available here.

In the introduction to Gaussian processes we defined covariance functions (or kernels) as functions to which we passed objects in GPy covariance functions are objects.

In GPy the covariance object is stored in GPy.kern. There are several covariance functions available. The exponentiated quadratic covariance is stored as RBF and can be created as follows.

In [6]:
k = GPy.kern.RBF(input_dim=1)
rbf. Value Constraint Prior Tied to
variance 1.0 +ve
lengthscale 1.0 +ve

Here it's been given the name 'rbf' by default and some default values have been given for the lengthscale and variance, we can also name the covariance and change the initial parameters as follows,

In [8]:
k = GPy.kern.RBF(input_dim=1, name='signal', variance=4.0, lengthscale=2.0)
signal. Value Constraint Prior Tied to
variance 4.0 +ve
lengthscale 2.0 +ve

Plotting the Kernel

The covariance function expresses the covariation between two points. the .plot() method can be applied to the kernel to discover how the covariation changes as a function as one of the inputs with the other fixed (i.e. it plots k(x, z) as a function of a one dimensional x whilst keeping z fixed. By default z is set to 0.0.

In [9]:

Here the title of the kernel is taken from the name we gave it (signal).

Changing Covariance Function Parameters

When we constructed the covariance function we gave it particular parameters. These parameters can be changed later in a couple of different ways firstly, we can simple set the field value of the parameters. If we want to change the lengthscale of the covariance to 3.5 we do so as follows.

In [11]:
k.lengthscale = 3.5
signal. Value Constraint Prior Tied to
variance 4.0 +ve
lengthscale 3.5 +ve

We can even change the naming of the parameters, let's imagine that this covariance function was operating over time instead of space, we might prefer the name timescale for the lengthscale parameter.

In [12]: = 'timescale'
signal. Value Constraint Prior Tied to
variance 4.0 +ve
timescale 3.5 +ve

Now we can set the time scale appropriately.

In [13]:
k.timescale = 10.
signal. Value Constraint Prior Tied to
variance 4.0 +ve
timescale 10.0 +ve

Further Covariance Functions in GPy

There are other types of basic covariance function in GPy. For example the Linear kernel, $$ k(\mathbf{x}, \mathbf{z}) = \alpha \mathbf{x}^\top \mathbf{z} $$ and the Bias kernel, $$ k(\mathbf{x}, \mathbf{z}) = \alpha $$ where everything is equally correlated to each other. Brownian implements Brownian motion which has a covariance function of the form, $$ k(t, t^\prime) = \alpha \text{min}(t, t^\prime). $$

Broadly speaking covariances fall into two classes, stationary covariance functions, for which the kernel can always be written in the form $$ k(\mathbf{x}, \mathbf{z}) = f(r) $$ where $f(\cdot)$ is a function and $r$ is the distance between the vectors $\mathbf{x}$ and $\mathbf{z}$, i.e., $$ r = ||\mathbf{x} - \mathbf{z}||_2 = \sqrt{\left(\mathbf{x} - \mathbf{z}\right)^\top \left(\mathbf{x} - \mathbf{z}\right)}. $$ This partitioning is reflected in the object hierarchy in GPy. There is a base object Kern and this is inherited by Stationary to form the stationary covariance functions (like RBF and Matern32).

Computing the Covariance Function

When using numpy to construct covariances we defined a function, kern_compute which returned a covariance matrix given the name of the covariance function. In GPy, the base object Kern implements a method K. That allows us to compute the associated covariance. Visualizing the structure of this covariance matrix is often informative in understanding whether we have appropriately captured the nature of the relationship between our variables in our covariance function. In GPy the input data is assumed to be provided in a matrix with $n$ rows and $p$ columns where $n$ is the number of data points and $p$ is the number of features we are dealing with. We can compute the entries to the covariance matrix for a given set of inputs X as follows.

In [ ]:
data = pods.datasets.olympic_marathon_men()
# Load in the times of the olympics
X = data['X']

Now to visualize this covariation between the time points we plot it as an image using the imshow command from matplotlib.

imshow(K, interpolation='None')

Setting the interpolation to 'None' prevents the pixels being smoothed in the image. To better visualize the structure of the covariance we've also drawn white lines at the points where the First World War and the Second World War begin. At other points the Olympics are held every four years and the covariation is given by the computing the covariance function for two points which are four years apart, appropriately scaled by the time scale.

In [47]:
def visualize_olympics(K):
    """Helper function for visualizing a covariance computed on the Olympics training data."""
    fig, ax = plt.subplots(figsize=(8,8))
    im = ax.imshow(K, interpolation='None')

    WWI_index = np.argwhere(X==1912)[0][0]
    WWII_index = np.argwhere(X==1936)[0][0]

In [23]:
Index signal.timescale Constraint Prior Tied to

of 10 is ensuring that neighbouring Olympic years are correlated, but there is only weak dependency across the period of each of the world wars. If we increase the timescale to 20 we obtain stronger dependencies between the wars.

In [48]:
k.timescale = 20

Covariance Matrices and Covariance Functions

A Gaussian process specifieds that any finite set of realizations of the function will jointly have a Gaussian density. In other words, if we are interested in the joint distribution of function values at a set of particular points, realizations of those functions can be sampled according to a Gaussian density. The Gaussian process provides a prior over an infinite dimensional object: the function. For our purposes we can think of a function as being like an infinite dimensional vector. The mean of our Gaussian process is normally specified by a vector, but is now itself specified by a mean function. The covariance of the process, instead of being a matrix is also a covariance function. But to construct the infinite dimensional matrix we need to make it a function of two arguments, $k(\mathbf{x}, \mathbf{z})$. When we compute the covariance matrix using k.K(X) we are computing the values of that matrix for the different entries in $\mathbf{X}$. GPy also allows us to compute the cross covariance between two input matrices, $\mathbf{X}$ and $\mathbf{Z}$.

Sampling from a Gaussian Process

We cannot sample a full function from a process because it consists of infinite values, however, we can obtain a finite sample from a function as described by a Gaussian process and visualize these samples as functions. This is a useful exercise as it allows us to visualize the type of functions that fall within the support of our Gaussian process prior. Careful selection of the right covariance function can improve the performance of a Gaussian process based model in practice because the covariance function allows us to bring domain knowledge to bear on the problem. If the input domain is low dimensional, then we can at least ensure that we are encoding something reasonable in the covariance through visualizing samples from the Gaussian process.

For a one dimensional function, if we select a vector of input values, represented by X, to be equally spaced, and ensure that the spacing is small relative to the lengthscale of our process, then we can visualize a sample from the function by sampling from the Gaussian (or a multivariate normal) with the numpy command

F = np.random.multivariate_normal(mu, K, num_samps).T

where mu is the mean (which we will set to be the zero vector) and K is the covariance matrix computed at the points where we wish to visualize the function. The transpose at the end ensures that the the matrix F has num_samps columns and $n$ rows, where $n$ is the dimensionality of the square covariance matrix K.

Below we build a simple helper function for sampling from a Gaussian process and visualizing the result.

In [75]:
def sample_covariance(kern, X, num_samps=10):
    """Sample a one dimensional function as if its from a Gaussian process with the given covariance function."""
    from IPython.display import HTML
    display(HTML('<h2>Samples from a Gaussian Process with ' + + ' Covariance</h2>'))
    K = kern.K(X) 

    # Generate samples paths from a Gaussian with zero mean and covariance K
    F = np.random.multivariate_normal(np.zeros(X.shape[0]), K, num_samps).T

    fig, ax = plt.subplots(figsize=(8,8))

We are now in a position to define a vector of inputs, a covariance function, and then to visualize the samples from the process.

In [76]:
# create an input vector
X = np.linspace(-2, 2, 200)[:, None]

# create a covariance to visualize
k = GPy.kern.RBF(input_dim=1)

# perform the samples.
sample_covariance(k, X)

Samples from a Gaussian Process with rbf Covariance

rbf. Value Constraint Prior Tied to
variance 1.0 +ve
lengthscale 1.0 +ve
In [82]:
k = GPy.kern.Poly(input_dim=1, order=6)
X = np.linspace(-.8, .8, 500)[:, None]
sample_covariance(k, X)

Samples from a Gaussian Process with poly Covariance

poly. Value Constraint Prior Tied to
variance 1.0 +ve

Combining Covariance Functions

Covariance functions can be combined in various ways to make new covariance functions.

Adding Covariance Functions

Perhaps simplest thing you can do to combine covariance functions is to add them. The sum of two Gaussian random variables is also Gaussian distributed, with a covariance equal to the sum of the covariances of the original variables (similarly for the mean). This implies that if we add two functions drawn from Gaussian processes together, then the result is another function drawn from a Gaussian process, and the covariance function is the sum of the two covariance functions of the original process. $$ k(\mathbf{x}, \mathbf{z}) = k_1(\mathbf{x}, \mathbf{z}) + k_2(\mathbf{x}, \mathbf{z}). $$ Here the domains of the two processes don't even need to be the same, so one of the covariance functions could perhaps depend on time and the other on space, $$ k\left(\left[\mathbf{x}\quad t\right], \left[\mathbf{z}\quad t^\prime\right]\right) = k_1(\mathbf{x}, \mathbf{z}) + k_2(t, t^\prime). $$

In GPy the addition operator is overloaded so that it is easy to construct new covariance functions by adding other covariance functions together.

In [63]:
k1 = GPy.kern.RBF(1, variance=4.0, lengthscale=10., name='long term trend')
k2 = GPy.kern.RBF(1, variance=1.0, lengthscale=2., name='short term trend')
k = k1 + k2 = 'signal' = 'timescale' = 'timescale'
signal. Value Constraint Prior Tied to
long term trend.variance 4.0 +ve
long term trend.timescale 10.0 +ve
short term trend.variance 1.0 +ve
short term trend.timescale 2.0 +ve

Multiplying Covariance Functions

An alternative is to multiply covariance functions together. This also leads to a valid covariance function (i.e. the kernel is within the space of Mercer kernels), although the interpretation isn't as straightforward as the additive covariance.

In [62]:
k1 = GPy.kern.Linear(1)
k2 = GPy.kern.RBF(1)
k = k1*k2
mul. Value Constraint Prior Tied to
linear.variances 1.0 +ve
rbf.variance 1.0 +ve
rbf.lengthscale 1.0 +ve

Our choice of X means that the points are close enough together to look like functions. We can see the structure of the covariance matrix we are plotting from if we visualize C.

In [253]:
<matplotlib.image.AxesImage at 0x10fae050>

Now try a range of different covariance functions and values and plot the corresponding sample paths for each using the same approach given above.

In [254]:
# Try plotting sample paths here

Exercise 3

Can you tell the covariance structures that have been used for generating the sample paths shown in the figure below?

Figure a Figure b Figure c Figure d Figure e Figure f
# Exercise 3 answer

A Gaussian Process Regression Model

We will now combine the Gaussian process prior with some data to form a GP regression model with GPy. We will generate data from the function $f ( x ) = − \cos(\pi x ) + \sin(4\pi x )$ over $[0, 1]$, adding some noise to give $y(x) = f(x) + \epsilon$, with the noise being Gaussian distributed, $\epsilon \sim \mathcal{N}(0, 0.01)$.

In [255]:
In [256]:
X = np.linspace(0.05,0.95,10)[:,None]
Y = -np.cos(np.pi*X) + np.sin(4*np.pi*X) + np.random.normal(loc=0.0, scale=0.1, size=(10,1)) 
[<matplotlib.lines.Line2D at 0x10ab9bd0>]

A GP regression model based on an exponentiated quadratic covariance function can be defined by first defining a covariance function,

In [265]:
k = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=0.2)

And then combining it with the data to form a Gaussian process model,

In [266]:
m = GPy.models.GPRegression(X,Y,k)

Just as for the covariance function object, we can find out about the model using the command print m.

In [25]:

Model: GP regression
Log-likelihood: -413.019393422
Number of Parameters: 5

GP_regression. Value Constraint Prior Tied to
add.rbf.variance 243.874122742 +ve
add.rbf.lengthscale (2,) +ve
add.bias.variance 94.4144417427 +ve
Gaussian_noise.variance 1e-05 fixed

Note that by default the model includes some observation noise with variance 1. We can see the posterior mean prediction and visualize the marginal posterior variances using m.plot().

In [268]:
{'dataplot': [<matplotlib.lines.Line2D at 0x11259950>],
 'gpplot': [[<matplotlib.lines.Line2D at 0x1124b650>],
  [<matplotlib.patches.Polygon at 0x1124bb50>],
  [<matplotlib.lines.Line2D at 0x1124bcd0>],
  [<matplotlib.lines.Line2D at 0x11259450>]]}

The actual predictions of the model for a set of points Xstar (an $m \times p$ array) can be computed using Ystar, Vstar, up95, lo95 = m.predict(Xstar)

Exercise 4

a) What do you think about this first fit? Does the prior given by the GP seem to be adapted?

# Exercise 4 a) answer

b) The parameters of the models can be modified using a regular expression matching the parameters names (for example m['noise'] = 0.001 ). Change the values of the parameters to obtain a better fit.

In [269]:
# Exercise 4 b) answer

c) As in Section 2, random sample paths from the conditional GP can be obtained using np.random.multivariate_normal(mu[:,0],C) where the mean vector and covariance matrix mu, C are obtained through the predict function mu, C, up95, lo95 = m.predict(Xp,full_cov=True). Obtain 10 samples from the posterior sample and plot them alongside the data below.

In [270]:
# Exercise 4 c) answer

Covariance Function Parameter Estimation

As we have seen during the lectures, the parameters values can be estimated by maximizing the likelihood of the observations. Since we don’t want one of the variance to become negative during the optimization, we can constrain all parameters to be positive before running the optimisation.

In [271]:
m.constrain_positive('.*') # Constrains all parameters matching .* to be positive, i.e. everything
[0 1 2]
WARNING: reconstraining parameters ['rbf.variance', 'rbf.lengthscale', 'Gaussian_noise.variance']

The warnings are because the parameters are already constrained by default, the software is warning us that they are being reconstrained.

Now we can optimize the model using the m.optimize() method.

In [272]:
print m
  GP_regression.           |       Value        |  Constraint  |  Prior  |  Tied to
  rbf.variance             |    0.675369033494  |     +ve      |         |         
  rbf.lengthscale          |    0.127867676628  |     +ve      |         |         
  Gaussian_noise.variance  |  0.00698660648594  |     +ve      |         |         

The parameters obtained after optimisation can be compared with the values selected by hand above. As previously, you can modify the kernel used for building the model to investigate its influence on the model.

The Running Example

Now we'll consider a small example with real world data, data giving the pace of all marathons run at the olympics. To load the data use

In [17]:
data = pods.datasets.olympic_marathon_men()
print data['details']
Olympic mens' marathon gold medal winning times from 1896 to 2012. Time given in pace (minutes per kilometer). Data is originally downloaded and collated from Wikipedia, we are not responsible for errors in the data
In [18]:
X = data['X']
Y = data['Y']
plt.plot(X, Y, 'bx')
plt.ylabel('marathon pace min/km')
<matplotlib.text.Text at 0x10e71c350>

Exercise 5

a) Build a Gaussian process model for the olympic data set using a combination of an exponentiated quadratic and a bias covariance function. Fit the covariance function parameters and the noise to the data. Plot the fit and error bars from 1870 to 2030. Do you think the predictions are reasonable? If not why not?

In [19]:
# Exercise 5 a) answer 
kern = GPy.kern.RBF(1, lengthscale=80) + GPy.kern.Matern52(1, lengthscale=10) + GPy.kern.Bias(1)
model = GPy.models.GPRegression(X, Y, kern)
model.plot()# Exercise 5 d) answer

Model: GP regression
Log-likelihood: -5.99078279431
Number of Parameters: 6

GP_regression. Value Constraint Prior Tied to
add.rbf.variance 2.58425103654 +ve
add.rbf.lengthscale 102.447254296 +ve
add.Mat52.variance 0.0204066240251 +ve
add.Mat52.lengthscale 6.52778209539 +ve
add.bias.variance 17.5408872485 +ve
Gaussian_noise.variance0.0368133074589 +ve

b) Fit the same model, but this time intialize the length scale of the exponentiated quadratic to 0.5. What has happened? Which of model has the higher log likelihood, this one or the one from (a)?

Hint: use model.log_likelihood() for computing the log likelihood.

In [276]:
# Exercise 5 b) answer
kern=GPy.kern.RBF(1, lengthscale=0.5)+GPy.kern.Matern52(1,lengthscale=10)+GPy.kern.Bias(1)

c) Modify your model by including two covariance functions. Intitialize a covariance function with an exponentiated quadratic part, a Matern 3/2 part and a bias covariance. Set the initial lengthscale of the exponentiated quadratic to 80 years, set the initial length scale of the Matern 3/2 to 10 years. Optimize the new model and plot the fit again. How does it compare with the previous model?

In [277]:
# Exercise 5 c) answer

d) Repeat part c) but now initialize both of the covariance functions' lengthscales to 20 years. Check the model parameters, what happens now?

In [278]:
# Exercise 5 d) answer

e) Now model Model the data with a product of an exponentiated quadratic covariance function and a linear covariance function. Fit the covariance function parameters. Why are the variance parameters of the linear part so small? How could this be fixed?

In [279]:
# Exercise 5 e) answer

More Advanced: Uncertainty Propagation

Let $x$ be a random variable defined over the real numbers, $\Re$, and $f(\cdot)$ be a function mapping between the real numbers $\Re \rightarrow \Re$. Uncertainty propagation is the study of the distribution of the random variable $f ( x )$.

We will see in this section the advantage of using a model when only a few observations of $f$ are available. We consider here the 2-dimensional Branin test function defined over [−5, 10] × [0, 15] and a set of 25 observations as seen in Figure 3.

In [23]:
# Definition of the Branin test function
def branin(X):
    y = (X[:,1]-5.1/(4*np.pi**2)*X[:,0]**2+5*X[:,0]/np.pi-6)**2
    y += 10*(1-1/(8*np.pi))*np.cos(X[:,0])+10

# Training set defined as a 5*5 grid:
xg1 = np.linspace(-5,10,5)
xg2 = np.linspace(0,15,5)
X = np.zeros((xg1.size * xg2.size,2))
for i,x1 in enumerate(xg1):
    for j,x2 in enumerate(xg2):
        X[i+xg1.size*j,:] = [x1,x2]

Y = branin(X)[:,None]

We assume here that we are interested in the distribution of $f (U )$ where $U$ is a random variable with uniform distribution over the input space of $f$. We will focus on the computation of two quantities: $E[ f (U )]$ and $P( f (U ) > 200)$.

Computation of $E[f(U)]$

The expectation of $f (U )$ is given by $\int_x f ( x )\text{d}x$. A basic approach to approximate this integral is to compute the mean of the 25 observations: np.mean(Y). Since the points are distributed on a grid, this can be seen as the approximation of the integral by a rough Riemann sum. The result can be compared with the actual mean of the Branin function which is 54.31.

Alternatively, we can fit a GP model and compute the integral of the best predictor by Monte Carlo sampling:

In [24]:
# Fit a GP
# Create an exponentiated quadratic plus bias covariance function
kg = GPy.kern.RBF(input_dim=2, ARD = True)
kb = GPy.kern.Bias(input_dim=2)
k = kg + kb

# Build a GP model
m = GPy.models.GPRegression(X,Y,k)
m.Gaussian_noise.variance = 1e-5

# constrain parameters to be bounded

# fix the noise variance

# Randomize the model and optimize

# Plot the resulting approximation to Brainin
# Here you get a two-d plot becaue the function is two dimensional.

# Compute the mean of model prediction on 1e5 Monte Carlo samples
Xp = np.random.uniform(size=(1e5,2))
Xp[:,0] = Xp[:,0]*15-5
Xp[:,1] = Xp[:,1]*15
Yp = m.predict(Xp)[0]
creating /var/folders/22/6ls22g994bdfdpwx4f9gcmsw0000gn/T/scipy-neil-mcRzlh/python27_intermediate/compiler_08f57918657bc3b6af9ef49d104a532a

Model: GP regression
Log-likelihood: -74285.8774977
Number of Parameters: 5

GP_regression. Value Constraint Prior Tied to
add.rbf.variance 1.0 +ve
add.rbf.lengthscale (2,) +ve
add.bias.variance 1.0 +ve
Gaussian_noise.variance1e-05 +ve
WARNING: reconstraining parameters GP_regression.Gaussian_noise.variance

Model: GP regression
Log-likelihood: -74285.8774977
Number of Parameters: 5

GP_regression. Value Constraint Prior Tied to
add.rbf.variance 1.0 +ve
add.rbf.lengthscale (2,) +ve
add.bias.variance 1.0 +ve
Gaussian_noise.variance1e-05 fixed
Index GP_regression.add.rbf.lengthscale Constraint Prior Tied to

Exercise 6

a) Has the approximation of the mean been improved by using the GP model?

# Exercise 6 a) answer

b) One particular feature of GPs we have not use for now is their prediction variance. Can you use it to define some confidence intervals around the previous result?

In [ ]:
# Exercise 6 b) answer

Computation of $P( f (U ) > 200)$

In various cases it is interesting to look at the probability that $f$ is greater than a given threshold. For example, assume that $f$ is the response of a physical model representing the maximum constraint in a structure depending on some parameters of the system such as Young’s modulus of the material (say $Y$) and the force applied on the structure (say $F$). If the later are uncertain, the probability of failure of the structure is given by $P( f (Y, F ) > \text{f_max} )$ where $f_\text{max}$ is the maximum acceptable constraint.

Exercise 7

a) As previously, use the 25 observations to compute a rough estimate of the probability that $f (U ) > 200$.

In [ ]:
# Exercise 7 a) answer

b) Compute the probability that the best predictor is greater than the threshold.

In [ ]:
# Exercise 7 b) answer

c) Compute some confidence intervals for the previous result

In [ ]:
# Exercise 7 c) answer

These two values can be compared with the actual value {$P( f (U ) > 200) = 1.23\times 10^{−2}$ .

We now assume that we have an extra budget of 10 evaluations of f and we want to use these new evaluations to improve the accuracy of the previous result.

Exercise 8

a) Given the previous GP model, where is it interesting to add the new observations if we want to improve the accuracy of the estimator and reduce its variance?

# Exercise 8 a) answer

b) Can you think about (and implement!) a procedure that updates sequentially the model with new points in order to improve the estimation of $P( f (U ) > 200)$?

In [ ]:
# Exercise 8 b) answer