$% DEFINITIONS \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\bz}{\mathbf{z}} \newcommand{\bA}{\mathbf{A}} \newcommand{\bB}{\mathbf{B}} \newcommand{\bC}{\mathbf{C}} \newcommand{\bD}{\mathbf{D}} \newcommand{\bI}{\mathbf{I}} \newcommand{\bL}{\mathbf{L}} \newcommand{\bM}{\mathbf{M}} \newcommand{\bLambda}{\boldsymbol{\Lambda}} \newcommand{\bSigma}{\boldsymbol{\Sigma}} \newcommand{\bmu}{\boldsymbol{\mu}} \newcommand{\calN}{\mathcal{N}} \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\C}{\mathbb{C}} \newcommand{\Rd}{\R^d} \newcommand{\Rdd}{\R^{d\times d}} \newcommand{\bzero}{\mathbf{0}} % END OF DEFINITIONS $ We provide a few technical details on multivariate normal (MN) probability densities that cannot be easily found in standard textbooks. These details are extremely important when we face the problem of sampling from a MN.
We say that the $d$-dimensional random vector $\bx$ follows a MN with mean $\bmu\in\R^d$ and positive definite covariance matrix $\bSigma\in\Rdd$, if its PDF is:
$$\bx | \bmu, \bSigma \sim p(\bx | \bmu, \bSigma)= \calN\left(\bx | \bmu, \bSigma\right) = (2\pi)^{-\frac{d}{2}}|\bSigma|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}(\bx-\bmu)^T\bSigma^{-1}(\bx - \bmu)\right\}$$,
where $|\bA|$ is the determinant of the matrix $\bA$.
The following proposition is of extreme importance. It justifies the names of $\bmu$ and $\bSigma$.
Let $\bx\sim \calN\left(\bx | \bmu, \bSigma\right)$ for some $\bmu\in\Rd$ and $\bSigma\in\Rdd$. Then \begin{eqnarray} \E[\bx | \bmu, \bSigma] &=& \int \bx p(\bx | \bmu, \bSigma)d\bx = \bmu,\\ \C[\bx|\bmu, \bSigma] &=& \int (\bx - \E[\bx|\bmu, \bSigma])(\bx - \E[\bx|\bmu, \bSigma])^Tp(\bx|\bmu, \bSigma)d\bx = \bSigma, \end{eqnarray} where $\E[\cdot|\cdot]$ and $\C[\cdot|\cdot]$ are the mean and covariance operators.
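As a quick empirical illustration of Proposition 1 (a minimal sketch of ours; the particular $\bmu$ and $\bSigma$ below are arbitrary choices, and we rely on numpy's built-in sampler numpy.random.multivariate_normal, which we essentially re-implement later on), the sample moments of many draws should approximate $\bmu$ and $\bSigma$:

import numpy as np

# An arbitrary mean and positive definite covariance matrix (our choice)
mu = np.array([1., -2.])
Sigma = np.array([[2., 0.5],
                  [0.5, 1.]])
# Draw many samples with numpy's built-in sampler and compare the
# empirical moments against mu and Sigma
samples = np.random.multivariate_normal(mu, Sigma, size=100000)
print('sample mean:', samples.mean(axis=0))                   # approx mu
print('sample covariance:\n', np.cov(samples, rowvar=False))  # approx Sigma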
Our goal is to be able to do the following things in Python: evaluate the PDF (in practice, its logarithm) of a MN at an arbitrary point, and draw random samples from a MN.
The standard normal probability density is the special case with $d=1$, $\mu=0$, and $\Sigma=\sigma^2=1$. Before we deal with the general MN, we must be able to evaluate the PDF of, and sample from, the standard normal. We say that the random variable $z$ follows a standard normal if its PDF is:
$$
z \sim p(z|\mu=0, \sigma^2=1) = \calN\left(z | 0, 1 \right) = \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}z^2\right\} =: \phi(z).
$$
The preferred way to evaluate the standard normal PDF, $\phi(z)$, is to use scipy.stats.norm.pdf(). You may sample from it using either scipy.stats.norm.rvs() or numpy.random.randn(). We demonstrate both ways with the following examples.
The scipy.stats way

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1)
# ppf = percent point function (inverse CDF)
x = np.linspace(norm.ppf(0.01), norm.ppf(0.99), 100)
# Plot \phi(x)
ax.plot(x, norm.pdf(x), 'r-', lw=5, alpha=0.6, label='norm pdf')
# Generate some random values
r = norm.rvs(size=1000)
# Draw a histogram out of these random values and compare it to the PDF
ax.hist(r, density=True, histtype='stepfilled', alpha=0.2, label='histogram')
ax.set_xlabel('$x$', fontsize=16)
ax.set_ylabel(r'$\phi(x)$', fontsize=16)
ax.legend(loc='best', frameon=False)
The numpy.random way

import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1)
# Just draw random samples and plot the histogram
r = np.random.randn(1000)
ax.hist(r, density=True, histtype='stepfilled', alpha=0.2)
ax.set_xlabel('$x$', fontsize=16)
ax.set_ylabel(r'$\phi(x)$', fontsize=16)
The standard MN probability density is the MN density with $\bmu = \bzero_d$ and $\bSigma = \bI_d$, where $\bzero_d\in\Rd$ and $\bI_d\in\Rdd$ are the zero vector and the unit matrix in $d$ dimensions, respectively. If $\bx$ follows a standard MN, then its components, $x_i, i=1,\dots,d$, are independent identically distributed (iid) according to a standard normal. That is, it is trivial to show that: $$ p(\bx | \bmu = \bzero_d, \bSigma = \bI_d ) = \prod_{i=1}^d\calN(x_i | 0, 1) = \prod_{i=1}^d\phi(x_i). $$ No big deal. So, how can we sample from this distribution?
# Here it is in two dimensions
d = 2
# We will take n = 10 samples
n = 10
samples = np.random.randn(n, d)
print(samples)
[[-0.61175564 -0.09948978]
 [-0.69621572 -1.28704937]
 [ 1.68880195 -0.37657685]
 [ 1.38192068  1.67934031]
 [ 0.08537035 -0.1594675 ]
 [-0.20848104 -0.80711706]
 [ 0.44521631 -2.08741765]
 [ 0.16485541 -0.22667149]
 [ 0.00561693  0.33818003]
 [ 1.14610005  0.70019393]]
# Taking some more and plotting them (by the way, we use scipy.stats.norm
# this time just because we can :-)):
n = 1000
samples = norm.rvs(size=(n, d))
plt.plot(samples[:, 0], samples[:, 1], 'k.', markersize=10)
# Now let's draw some contours along with these samples
x = np.linspace(-3, 3, 64)
X1, X2 = np.meshgrid(x, x)
Y = norm.pdf(X1) * norm.pdf(X2)
plt.contourf(X1, X2, Y)
plt.colorbar()
plt.plot(samples[:, 0], samples[:, 1], 'k.', markersize=10)
plt.xlabel('$x_1$', fontsize=16)
plt.ylabel('$x_2$', fontsize=16)
Assume now that we have a generic MN random vector $\mathbf{x}$ with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. We want to be able to do two tasks: (1) evaluate the logarithm of its PDF at an arbitrary point, and (2) sample from it.
Usually, we do not compute $p(\bx|\bmu, \bSigma) =\calN\left(\bx |\bmu, \bSigma\right)$, but its logarithm, $\log p(\bx|\bmu, \bSigma)$. The reason is that $p(\bx|\bmu, \bSigma)$ can be an extremely small number when $d$ is large, easily underflowing to zero in double precision.
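To see this concretely, here is a small demonstration (a sketch assuming a scipy version that ships scipy.stats.multivariate_normal, i.e., 0.14 or later): for a standard MN in $d=1000$ dimensions, the PDF underflows to exactly zero in double precision, while the log PDF is a perfectly ordinary number:

import numpy as np
from scipy.stats import multivariate_normal

d = 1000
x = np.ones(d)  # a point slightly away from the mode
# The PDF underflows to 0.0 in double precision...
print(multivariate_normal.pdf(x, mean=np.zeros(d), cov=np.eye(d)))
# ...but the log PDF (about -1418.9 here) is perfectly representable.
print(multivariate_normal.logpdf(x, mean=np.zeros(d), cov=np.eye(d)))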
The computation of $\log p(\bx|\bmu, \bSigma)$ is not very difficult. However, it can become problematic quite fast. The difficulty arises from two factors: the determinant $|\bSigma|$ can itself overflow or underflow in high dimensions, and forming $\bSigma^{-1}$ explicitly is both expensive and numerically unstable.
The classical way is as follows:

1. Compute the Cholesky decomposition of the covariance matrix, $\bSigma = \bL\bL^T$, with $\bL$ lower triangular.
2. Compute the log determinant via $\log|\bSigma| = 2\sum_{i=1}^d\log L_{ii}$.
3. Solve the lower triangular system $\bL\bz = \bx - \bmu$.
4. Observe that $(\bx-\bmu)^T\bSigma^{-1}(\bx-\bmu) = \bz^T\bz$.
5. Put everything together: $\log p(\bx|\bmu,\bSigma) = -\frac{1}{2}\left(d\log(2\pi) + \log|\bSigma| + \bz^T\bz\right)$.
import scipy.linalg
import numpy as np

def mn_logpdf(x, mu, Sigma):
    """
    Evaluate the log of the multivariate normal PDF with mean ``mu`` and
    covariance matrix ``Sigma`` at ``x``.

    :param x: The point at which we want to evaluate the PDF.
    :param mu: The mean vector.
    :param Sigma: The covariance matrix.

    .. preconditions::
        The covariance matrix has to be positive definite.
    """
    d = mu.shape[0]
    # Step 1: Cholesky decomposition, Sigma = L L^T
    L = scipy.linalg.cholesky(Sigma, lower=True)
    # Step 2: log |Sigma| = 2 * sum(log(diag(L)))
    log_det_Sigma = 2. * np.sum(np.log(np.diag(L)))
    # Step 3: solve the triangular system L z = x - mu
    z = scipy.linalg.solve_triangular(L, x - mu, lower=True)
    # Steps 4-5: (x - mu)^T Sigma^{-1} (x - mu) = z^T z
    return -0.5 * (d * np.log(2. * np.pi) + log_det_Sigma + np.dot(z, z))
Here is an example using the code:
# We experiment in two dimensions
# The mean vector
mu = np.zeros(2)
# The covariance matrix
Sigma = np.array([[1, 0.9], [0.9, 2.]])
# Evaluating at a particular point
print('log p([0.1, 0.2]) =', mn_logpdf(np.array([0.1, 0.2]), mu, Sigma))
x = np.linspace(-3., 3, 64)
X1, X2 = np.meshgrid(x, x)
# Put all these in a list:
X = np.hstack([X1.flatten()[:, None], X2.flatten()[:, None]])
y = np.array([mn_logpdf(X[i, :], mu, Sigma) for i in range(X.shape[0])])
Y = y.reshape(X1.shape)
plt.contourf(X1, X2, Y)
plt.colorbar()
log p([0.1, 0.2]) = -1.93493775355
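As a sanity check (assuming, again, that your scipy version provides scipy.stats.multivariate_normal), we can compare our implementation against scipy's:

from scipy.stats import multivariate_normal

# Should agree with mn_logpdf above (about -1.9349) up to round-off
print(multivariate_normal.logpdf(np.array([0.1, 0.2]), mean=mu, cov=Sigma))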
To sample from a MN, we must make use of the following theorem (see Exercise 3):
Let $\bmu\in\Rd$ be an arbitrary vector and $\bSigma\in\Rdd$ a positive definite matrix. If $\mathbf{A}\in\mathbb{R}^{d\times d}$ is such that: $$ \boldsymbol{\Sigma} = \mathbf{A}\mathbf{A}^T, $$ and $\bz$ is a $d$-dimensional standard MN random vector, i.e., $$ \bz \sim \calN(\bz | \bzero_d, \bI_d), $$ then the random vector $$ \bx = \bmu + \bA \bz $$ follows a $\calN\left(\bx | \bmu, \bSigma\right)$.
Theorem 1 is all we need in order to sample from a MN. The idea is to sample $\bz$ from a standard MN, and then compute $\bx = \bmu + \bA \bz$ to get a sample from the original MN. The choice of the matrix $\bA$ is completely up to us. The most usual choice is, of course, the Cholesky decomposition. However, other choices may be preferable depending on the situation.
import numpy as np
import scipy.linalg
from scipy.stats import norm

def mn_sample(mu, Sigma):
    """
    Sample the multivariate normal distribution with mean ``mu``
    and covariance matrix ``Sigma``.

    :param mu: The mean.
    :param Sigma: The covariance matrix.
    """
    d = mu.shape[0]
    # Any A with Sigma = A A^T works; we use the Cholesky factor
    L = scipy.linalg.cholesky(Sigma, lower=True)
    # Sample a standard MN vector and transform it (Theorem 1)
    z = norm.rvs(size=(d,))
    return mu + np.dot(L, z)
Here is an example using the code:
# The mean vector
mu = np.zeros(2)
# The covariance matrix
Sigma = np.array([[1, 0.9], [0.9, 1.]])
# Take samples
for i in range(10):
    print('sample', i + 1, mn_sample(mu, Sigma))
# Take many samples
samples = np.vstack([mn_sample(mu, Sigma) for i in range(100)])
# Plot the samples along with the log of the PDF
x = np.linspace(-3., 3, 64)
X1, X2 = np.meshgrid(x, x)
# Put all these in a list:
X = np.hstack([X1.flatten()[:, None], X2.flatten()[:, None]])
y = np.array([mn_logpdf(X[i, :], mu, Sigma) for i in range(X.shape[0])])
Y = y.reshape(X1.shape)
plt.contourf(X1, X2, Y)
plt.colorbar()
plt.plot(samples[:, 0], samples[:, 1], 'k.', markersize=10)
sample 1 [ 1.97596671  1.71269622]
sample 2 [-1.21595971 -0.37169496]
sample 3 [ 0.36997546 -0.36585857]
sample 4 [ 0.44644418  0.27787761]
sample 5 [-0.76557377 -0.46927614]
sample 6 [ 0.15706456 -0.08921951]
sample 7 [ 1.17396166  0.36139712]
sample 8 [-1.20464106 -0.93459037]
sample 9 [ 3.74328045  3.91864596]
sample 10 [-1.88862526 -1.21592455]
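As a quick empirical check of Theorem 1 (a sketch of ours), the sample moments of many draws from mn_sample should recover $\bmu$ and $\bSigma$:

# Empirical check of Theorem 1: the sample moments of many draws
# should be close to mu and Sigma
many = np.vstack([mn_sample(mu, Sigma) for _ in range(100000)])
print('sample mean:', many.mean(axis=0))                   # approx [0, 0]
print('sample covariance:\n', np.cov(many, rowvar=False))  # approx Sigma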
Let $$ \bx | \bmu, \bSigma \sim \calN(\bx | \bmu, \bSigma) $$ and suppose that $\bx$ can be naturally split in two subsets: an observed part, $\bx_o$, of dimensionality $d_o$, and a query part, $\bx_q$, of dimensionality $d_q$. Mathematically, we write: $$ \bx = \left(\begin{array}{c} \bx_o\\ \bx_q \end{array}\right). $$ The question we want to pose is this: $$ \bx_q | \bx_o, \bmu, \bSigma \sim ? $$ In other words: What is the distribution of $\bx_q$ given $\bx_o$?
To answer this question, think of $\bmu$ and $\bSigma$ as the block matrices $$ \bmu = \left( \begin{array}{c} \bmu_o\\ \bmu_q \end{array} \right), $$ where $\bmu_o\in\R^{d_o}$ and $\bmu_q\in\R^{d_q}$, and $$ \bSigma = \left( \begin{array}{cc} \bSigma_{oo} & \bSigma_{oq}\\ \bSigma_{qo} & \bSigma_{qq} \end{array} \right), $$ with $\bSigma_{oo}\in\R^{d_o\times d_o}, \bSigma_{oq}\in\R^{d_o\times d_q}, \bSigma_{qo} = \bSigma_{oq}^T$, and $\bSigma_{qq}\in\R^{d_q\times d_q}$, respectively. Then \begin{equation} \bx_q | \bx_o, \bmu, \bSigma \sim \calN\left( \bx_q | \bmu_{q|o}, \bSigma_{q|o} \right), \end{equation} where $$ \bmu_{q|o} = \bmu_q + \bSigma_{qo}\bSigma_{oo}^{-1}(\bx_o - \bmu_o), $$ and $$ \bSigma_{q|o} = \bSigma_{qq} - \bSigma_{qo}\bSigma_{oo}^{-1}\bSigma_{oq}. $$ Remember, this is an extremely useful result that occurs over and over again, e.g., in Gaussian process regression and in Kalman filtering. A code sketch follows below.
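Here is a minimal sketch of these formulas in code (the function name mn_condition and its convention that the observed components come first are our own, matching the block notation above):

import numpy as np
import scipy.linalg

def mn_condition(mu, Sigma, x_o, d_o):
    """
    Condition a MN on observing its first ``d_o`` components.

    :param mu: The mean vector.
    :param Sigma: The covariance matrix.
    :param x_o: The observed values of the first ``d_o`` components.
    :param d_o: The number of observed components.
    :returns: The conditional mean and covariance of the query part.
    """
    mu_o, mu_q = mu[:d_o], mu[d_o:]
    Sigma_oo = Sigma[:d_o, :d_o]
    Sigma_oq = Sigma[:d_o, d_o:]
    Sigma_qq = Sigma[d_o:, d_o:]
    # Compute A = Sigma_oo^{-1} Sigma_oq by solving a linear system
    # instead of forming the inverse explicitly
    A = scipy.linalg.solve(Sigma_oo, Sigma_oq, assume_a='pos')
    mu_qo = mu_q + A.T.dot(x_o - mu_o)
    Sigma_qo = Sigma_qq - Sigma_oq.T.dot(A)
    return mu_qo, Sigma_qo

# Example: for the strongly correlated 2D MN used above,
# observing x_1 = 1 pulls the conditional mean of x_2 toward 0.9
mu = np.zeros(2)
Sigma = np.array([[1., 0.9], [0.9, 1.]])
m, S = mn_condition(mu, Sigma, np.array([1.]), 1)
print(m, S)  # approx [0.9] and [[0.19]]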
Assume that things are exactly as above. However, we are now interested in finding the marginal probability density of $\bx_o$ given $\bmu$ and $\bSigma$. That is, we are looking for: $$ \bx_o | \bmu, \bSigma \sim ?. $$ The answer is: $$ \bx_o | \bmu, \bSigma \sim \calN\left(\bx_o | \bmu_o, \bSigma_{oo} \right). $$ This might seem trivial, but it is not that trivial to prove (see Ex. 10).
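In code, marginalization is just slicing (again a sketch with our own hypothetical helper; as before, the components of interest are assumed to come first):

def mn_marginal(mu, Sigma, d_o):
    """
    Marginal of the first ``d_o`` components of a MN.

    :param mu: The mean vector.
    :param Sigma: The covariance matrix.
    :param d_o: The number of components to keep.
    :returns: The marginal mean and covariance.
    """
    return mu[:d_o], Sigma[:d_o, :d_o]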
Let $\bA\in\Rdd$ be lower triangular with $\bSigma = \bA\bA^T$ (the Cholesky factor of $\bSigma$). Show that: $$ |\bSigma| = \left(\prod_{i=1}^dA_{ii}\right)^2. $$
Prove the 4th step of the classical way to compute the log PDF of the MN, i.e., show that $(\bx-\bmu)^T\bSigma^{-1}(\bx-\bmu) = \bz^T\bz$, where $\bz$ solves $\bL\bz = \bx - \bmu$.
Prove Theorem 1. It suffices to show that the mean of $\bx$ is $$ \E[\bx | \bmu, \bSigma] = \bmu, $$ and that its covariance is $$ \C[\bx | \bmu, \bSigma] = \bSigma. $$
Prove Proposition 1. Hint: Diagonalize $\bSigma$ using its eigenvalues and eigenvectors.
The Python code we have provided for sampling is very inefficient because it recomputes the Cholesky decomposition on every call. Write a Python function that computes the Cholesky of $\bSigma$ once and then produces many samples. Use the following prototype:
def mn_random(mu, Sigma, num_samples=1):
    """
    :param mu: The mean of the multivariate normal.
    :param Sigma: The covariance matrix of the multivariate normal.
    :param num_samples: The number of samples to take.

    :returns: A :class:`numpy.ndarray` of dimensions ``num_samples x mu.shape[0]``.
    """
    pass
num_dim is defined as a Python property (note the @property decorator). If you are inside a class method, you can simply access it like this: self.num_dim.
Here is the code you have to complete:
class MultivariateNormal(object):
    """
    A class implementing a MN.

    :param mu: The mean of the multivariate normal.
    :param Sigma: The covariance matrix of the multivariate normal.
    """

    # The internally stored mu
    mu = None
    # The internally stored Sigma
    Sigma = None
    # The internally stored Cholesky of Sigma
    L = None
    # The internally stored log determinant of Sigma
    log_det_Sigma = None

    @property
    def num_dim(self):
        """
        The number of dimensions of this MN.
        """
        return self.mu.shape[0]

    def __init__(self, mu, Sigma):
        """
        Initialize the object.

        Make sure you initialize all fields.
        """
        self.mu = mu
        # Initialize the rest (self.Sigma, self.L, self.log_det_Sigma).

    def logpdf(self, x):
        """
        Evaluate the log of the pdf of the MN.

        :param x: The point at which you want to compute the log of the PDF.
        :returns: The log of the PDF.

        Use the quantities you calculated at the beginning.
        """
        pass

    def rvs(self, num_samples=1):
        """
        Sample from the MN.

        :param num_samples: The number of samples.
        :returns: The samples in a 2D array of size ``num_samples x self.num_dim``.
        """
        pass
The Cholesky decomposition is not the only way to find a matrix $\bA$ that satisfies the assumptions of Theorem 1. Show how the eigenvalues and eigenvectors of $\bSigma$ may be used in order to find another matrix $\bA$ with the same property. After figuring out the mathematical form of $\bA$, re-implement the class MultivariateNormal (or a copy of it) so that it uses this $\bA$ instead of the Cholesky decomposition. Hint: Remember that $\bSigma$ is positive definite. It therefore has positive eigenvalues. As soon as you figure out what you have to do, you may use scipy.linalg.eigh.
Let $\bA\in\R^{n\times n}, \bB\in\R^{n \times m}, \bC\in\R^{m\times n}$, and $\bD\in\R^{m\times m}$ be matrices. Assume that $\bA$ and $\bD$ are invertible. Show that: $$ \left( \begin{array}{cc} \bA & \bB\\ \bC & \bD \end{array} \right)^{-1} = \left( \begin{array}{cc} \bM & -\bM\bB\bD^{-1}\\ -\bD^{-1}\bC\bM & \bD^{-1} + \bD^{-1}\bC\bM\bB\bD^{-1} \end{array} \right), $$ where we have: $$ \bM = (\bA - \bB\bD^{-1}\bC)^{-1}. $$ The quantity $\bM^{-1} = \bA - \bB\bD^{-1}\bC$ is known as the Schur complement. Hint: Look for a block matrix:
$$ \left( \begin{array}{cc} \bA_i & \bB_i\\ \bC_i & \bD_i \end{array} \right), $$ that satisfies: $$ \left( \begin{array}{cc} \bA & \bB\\ \bC & \bD \end{array} \right) \left( \begin{array}{cc} \bA_i & \bB_i\\ \bC_i & \bD_i \end{array} \right) = \left( \begin{array}{cc} \bI_{n} & \bzero_{n\times m} \\ \bzero_{m\times n} & \bI_{m} \end{array} \right). $$

Define the precision matrix $\bLambda$ to be the inverse of the covariance matrix $\bSigma$, i.e., $$ \bLambda = \bSigma^{-1}. $$ Write $\bLambda$ in a block format $$ \bLambda = \left( \begin{array}{cc} \bLambda_{oo} & \bLambda_{oq}\\ \bLambda_{qo} & \bLambda_{qq} \end{array} \right), $$ and use the Schur complement (see Ex. 8) to find the blocks. Then, use the definition of the conditional probability density: $$ p(\bx_q | \bx_o, \bmu, \bSigma) = \frac{p(\bx_o, \bx_q|\bmu, \bSigma)}{p(\bx_o|\bmu, \bSigma)} \propto p(\bx |\bmu, \bSigma) \propto \exp\left\{-\frac{1}{2}(\bx - \bmu)^T\bSigma^{-1}(\bx - \bmu) \right\}, $$ and show that the term inside the exponential can be written as:
$$ \frac{1}{2}(\bx - \bmu)^T\bSigma^{-1}(\bx - \bmu) = \frac{1}{2}(\bx_q - \bmu_{q|o})^T\bSigma_{q|o}^{-1}(\bx_q - \bmu_{q|o}) + \text{const}, $$ where the constant may depend on anything except $\bx_q$. Then you may conclude that $\bx_q | \bx_o, \bmu, \bSigma$ is indeed distributed according to the multivariate normal distribution we have provided above. Why? This procedure is known as "completing the square".
Hint: Remember that the marginal is defined as: $$ p(\bx_o | \bmu, \bSigma) = \int p(\bx_o, \bx_q|\bmu, \bSigma)d\bx_q. $$ So, all you have to do is to compute this integral.