%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
Difficult-to-sample probability distributions arise very naturally in the Bayesian approach to Statistics. Here we use Bayes' rule in the following form: $$ \mathbb P(\theta | D ) = \frac{\mathbb P(D | \theta) \mathbb P(\theta)}{\mathbb P(D)} $$ where $\theta$ represents our statistical model (usually a parameterised model, hence the symbol $\theta$) and $D$ represents our data. We typically don't have access to $\mathbb P(D)$, the probability of the data. $\mathbb P(\theta)$ is our prior belief about the model. $\mathbb P(D | \theta)$ is the likelihood, the probability of seeing the data given a choice of $\theta$. Hence we apply Bayes' rule in the form $$ \mathbb P(\theta | D ) \propto \mathbb P(D | \theta) \mathbb P(\theta) $$ Notice that we get an un-normalised probability distribution, which is fortunately exactly what we've been dealing with in our MCMC methods.
What question do we want to answer? Back in my undergrad days, my basic statistics course spent some time on maximum-likelihood estimation. Once we have fixed our model, we find the value of $\theta$ which maximises the likelihood $\mathbb P(D|\theta)$. In the Bayesian setting, if we fix a uniform ("uninformative") prior, then this is the same as asking which value of $\theta$ maximises the posterior probability $\mathbb P(\theta|D)$. However, it would be more in keeping with Bayesian philosophy to actually think about the distribution $\mathbb P(\theta|D)$. To see why, consider two possibilities: the posterior might be sharply peaked around a single value of $\theta$, or it might be broad and rather flat.
In the first case, the maximum likelihood seems a very reasonable single fact to report; in the second, it might be fairer to say that the data has told us rather little, and we actually have rather little knowledge of $\theta$.
There are, of course, also situations where the prior becomes important (to pick one example, dealing with many bits of data, and using previous data to inform our prior when dealing with new data).
For complicated models, $\mathbb P(D | \theta) \mathbb P(\theta)$ becomes hard to sample from, and this is where MCMC methods become useful: we draw many samples from the distribution $\mathbb P(\theta|D)$ and use these to give a numerical approximation to the distribution.
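To make the idea concrete, here is a minimal sketch (not the sampler we'll actually use below — that will be emcee) of a random-walk Metropolis sampler, which only needs an un-normalised log-density; the function and parameter names are just for illustration.
def metropolis(log_prob_fn, theta0, n_steps, step_size=0.5):
    """Random-walk Metropolis sketch: returns samples approximately distributed
    as exp(log_prob_fn), which need only be known up to an additive constant."""
    theta = np.array(theta0, dtype=float)
    lp = log_prob_fn(theta)
    samples = []
    for _ in range(n_steps):
        proposal = theta + step_size * np.random.normal(size=theta.shape)
        lp_new = log_prob_fn(proposal)
        # Accept with probability min(1, exp(lp_new - lp))
        if np.log(np.random.random()) < lp_new - lp:
            theta, lp = proposal, lp_new
        samples.append(theta.copy())
    return np.array(samples)
# e.g. metropolis(lambda t: -0.5 * np.sum(t**2), [0.0], 5000) samples a standard normal.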
You can easily find tutorials and resources online; the one I'll refer to is the Pythonic Perambulations post on MCMC. In fact, I have some issues with that presentation, so let's attack simple regression first.
The mathematical model is that $(x_i)$ are given points, and $y_i = a + bx_i + e_i$ where $e_i \sim N(0,\sigma^2)$ represents the uncertainty in our measurements. Note the asymmetry here: each $x_i$ is known with certainty, and all the uncertainty is in $y_i$. Much more on this can be read here: http://arxiv.org/abs/1008.4686
So $$ f(y | a, b, \sigma^2) = \prod_i (2\pi\sigma^2)^{-1/2} \exp\Big( -\frac{(y_i - a - bx_i)^2}{2\sigma^2} \Big). $$
That just leaves the issue of finding the prior. Let's defer this...
def sample_data(a, b, sigma, n):
    x = np.sort(np.random.random(size=n) * 50)
    y = a + b*x + np.random.normal(scale=sigma, size=n)
    return x, y
x, y = sample_data(5, 3, 8, 20)
_ = plt.scatter(x,y)
We define a Python function which is the log of the likelihood, and then evaluate it at the known "actual" parameters. Then we'll use the scipy package to find the maximum-likelihood estimate (although, in this case, we of course know a closed-form solution).
def log_likelihood(theta, x, y):
    a, b, sigma = theta
    return - np.sum((y - a - b*x)**2) / (2 * sigma**2) - x.size * np.log(sigma*sigma) * 0.5
log_likelihood((5, 3, 8), x, y)
-47.495561772643228
import scipy.optimize as op
nll = lambda *args: -log_likelihood(*args) # Minimise the negative to get the maximum!
result = op.minimize(nll, [5, 3, 8], args=(x, y))
a_est, b_est, sigma_est = result["x"]
print(a_est, b_est, sigma_est, "-->", log_likelihood((a_est, b_est, sigma_est), x, y) )
3.51146082401 3.06088920588 6.08486037495 --> -46.1160736448
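As a quick sanity check (not part of the original flow): for this model the maximum-likelihood estimates of $a$ and $b$ are just the ordinary least-squares fit, and the MLE of $\sigma$ is the root-mean-square residual, so the following sketch should roughly reproduce the numbers above.
b_ls, a_ls = np.polyfit(x, y, 1)          # least-squares slope and intercept
resid = y - (a_ls + b_ls * x)
sigma_ls = np.sqrt(np.mean(resid**2))     # MLE of sigma uses 1/n, not 1/(n-2)
print(a_ls, b_ls, sigma_ls)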
We'll now introduce a prior: at present, I'm using a flat prior on $a$ and $b$ but the Jeffreys prior on $\sigma$.
def log_prior(theta):
    a, b, sigma = theta
    return - np.log(abs(sigma))

def log_prob(theta, x, y):
    return log_prior(theta) + log_likelihood(theta, x, y)
import emcee
ndim, nwalkers = 3, 50
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, args=[x, y])
starting_guesses = np.random.normal(0, 0.1, (nwalkers, ndim)) + [a_est, b_est, sigma_est]
sampler.run_mcmc(starting_guesses, 2000)
emcee_trace = sampler.chain[:, 1000:, :].reshape(-1, ndim).T
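(A quick diagnostic, not in the original: emcee exposes the acceptance fraction of each walker, and its mean is a rough health check — values of around 0.2 to 0.5 are usually considered reasonable.)
print("Mean acceptance fraction:", np.mean(sampler.acceptance_fraction))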
I'll use the Triangle Plot package to show the marginals.
import triangle
figure = triangle.corner(emcee_trace.T, labels=["$a$", "$b$", "$\sigma$"], truths=[5, 3, 8])
The histograms are the marginal distributions, and the three density plots show the pairwise joint distributions (the correlations between parameters). The lines mark the "real" parameters.
What the Pythonic Perambulations post now does is the following piece of code.
The numpy pattern is as follows:

- alpha, beta are set to the arrays of samples from our posterior.
- alpha[:, None] makes a new view of alpha of shape (number of samples, 1).
- yfit is of shape (number of samples, 20) (20 because that's the size of xfit).
- .mean(0) and .std(0) compute the (sample) mean and standard deviation along the first axis, i.e. over the samples.

def plot_MCMC_model(ax, xdata, ydata, trace):
    """Plot the linear model and 2sigma contours"""
    ax.plot(xdata, ydata, 'ok')
    alpha, beta = trace[:2]
    xfit = np.linspace(0, 50, 20)
    yfit = alpha[:, None] + beta[:, None] * xfit
    mu = yfit.mean(0)
    sig = 2 * yfit.std(0)
    ax.plot(xfit, mu, '-k')
    ax.fill_between(xfit, mu - sig, mu + sig, color='lightgray')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
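To make the broadcasting explicit, here's a tiny sketch of the shapes involved; the demo_* names and the sample count of 1000 are made up for illustration.
demo_alpha = np.random.random(1000)     # pretend 1000 posterior samples of a
demo_beta = np.random.random(1000)      # and of b
demo_xfit = np.linspace(0, 50, 20)
demo_yfit = demo_alpha[:, None] + demo_beta[:, None] * demo_xfit
print(demo_yfit.shape, demo_yfit.mean(0).shape, demo_yfit.std(0).shape)
# (1000, 20) (20,) (20,)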
fig, axes = plt.subplots(figsize=(8,6))
axes.set_xlim([0,50])
plot_MCMC_model(axes, x, y, emcee_trace)
alpha, beta = emcee_trace[:2]
print("Line:", alpha.mean(), beta.mean())
Line: 3.67671802535 3.05446620054
This code has the following effect: for each posterior sample $(a_i, b_i)$ it evaluates the fitted line $a_i + b_i x$ on a grid of $x$ values, and then at each $x$ it plots the mean of these values together with a band of $\pm 2$ standard deviations.
For a linear model, we can compute this exactly. Fix a value of $x$. Then
$$ \mu = \frac1n \sum_{i=1}^n a_i + b_ix = \overline a + \overline b x $$
\begin{align*} \sigma^2 &= \frac1n \sum_{i=1}^n (a_i + b_ix)^2 - \mu^2 = \Big(\frac1n \sum_i a_i^2 - \overline a^2\Big) + \Big(\frac1n \sum_i b_i^2 - \overline b^2\Big)x^2 + 2x \Big( \frac1n\sum_i a_ib_i - \overline a\overline b \Big) \\ &= \sigma_a^2 + \sigma_b^2x^2 + 2x \Big( \frac1n\sum_i a_ib_i - \overline a\overline b \Big) \end{align*}
So the line is just obtained by taking the mean of $(a_i)$ and $(b_i)$.
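We can verify this closed form directly against the sampled trace; a sketch which just re-does the computation from plot_MCMC_model and compares.
alpha, beta = emcee_trace[:2]
xfit = np.linspace(0, 50, 20)
yfit = alpha[:, None] + beta[:, None] * xfit
cov_ab = np.mean(alpha * beta) - alpha.mean() * beta.mean()
mu_exact = alpha.mean() + beta.mean() * xfit
var_exact = alpha.var() + beta.var() * xfit**2 + 2 * xfit * cov_ab
print(np.allclose(mu_exact, yfit.mean(0)), np.allclose(var_exact, yfit.var(0)))  # expect True True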
Here I'm following Jordan's lecture course. It's usual to work with a matrix model $y = X\beta + \epsilon$ so in our notation we get
$$ \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix} $$
For reference, then
$$ X^TX = \begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix} \implies (X^TX)^{-1} = \frac{1}{n^2\sigma_x^2} \begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{pmatrix} $$
We'll introduce what you might call a "hyper-parameter" $g$, with here $p(g) \propto g^{-3/2} \exp(-n/2g)$ (an Inverse Gamma distribution) and then
$$ \beta \sim N(0, g\sigma^2 (X^TX)^{-1}). $$
We also put the Jeffreys prior $p(\sigma) \propto 1/\sigma$ on $\sigma$. Thus the posterior is
\begin{align*} p(\beta,\sigma,g | y) &\propto p(y|\beta,\sigma,g) p(\beta,\sigma,g) = p(y|\beta,\sigma,g) p(\beta|\sigma,g) p(\sigma) p(g) \\ &\propto \prod_i (2\pi\sigma^2)^{-1/2} \exp\Big(-\frac{1}{2\sigma^2} (y_i - (X\beta)_i)^2 \Big) \frac{\sqrt{\det(X^TX)}}{2\pi g\sigma^2} \exp\Big(-\frac{1}{2g\sigma^2} \beta^TX^TX\beta\Big) \frac{1}{\sigma} g^{-3/2} \exp(-n/2g) \\ &\propto \prod_i (2\pi\sigma^2)^{-1/2} \exp\Big(-\frac{1}{2\sigma^2} (y_i - a - bx_i)^2 \Big) \frac{1}{g\sigma^2} \exp\Big(-\frac{1}{2g\sigma^2} \sum_i (a+bx_i)^2\Big) \frac{1}{\sigma} g^{-3/2} \exp(-n/2g) \end{align*}

def log_prior(theta):
    a, b, sigma, g = theta
    if g <= 0:
        return -np.inf
    # Note: this uses the global array x for the design-matrix term beta^T X^T X beta.
    beta_prior = -np.log(g * sigma**2) - np.sum((a + b * x)**2) / (2 * g * sigma**2)
    sigma_prior = -np.log(abs(sigma))
    g_prior = -1.5 * np.log(g) - len(x) / (2 * g)
    return beta_prior + sigma_prior + g_prior
def log_prob(theta, x, y):
    lp = log_prior(theta)
    if not np.isfinite(lp):
        return -np.inf
    return lp + log_likelihood(theta[:3], x, y)
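(As a numerical check of the $(X^TX)^{-1}$ formula quoted above — a sketch, not in the original, using the simulated x.)
X = np.column_stack([np.ones_like(x), x])
n = x.size
sigma_x2 = np.mean(x**2) - np.mean(x)**2
closed_form = np.array([[np.sum(x**2), -np.sum(x)], [-np.sum(x), n]]) / (n**2 * sigma_x2)
print(np.allclose(closed_form, np.linalg.inv(np.dot(X.T, X))))  # expect True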
ndim, nwalkers = 4, 50
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, args=[x, y])
starting_guesses = np.random.normal(0, 0.1, (nwalkers, ndim)) + [a_est, b_est, sigma_est, 1]
sampler.run_mcmc(starting_guesses, 2000)
emcee_trace = sampler.chain[:, 1000:, :].reshape(-1, ndim).T
figure = triangle.corner(emcee_trace.T, labels=["$a$", "$b$", "$\sigma$", "$g$"], truths=[5, 3, 8, 1])
Let's just marginalise over $g$, to get the following.
figure = triangle.corner(emcee_trace[:3,:].T, labels=["$a$", "$b$", "$\sigma$"], truths=[5, 3, 8])
fig, axes = plt.subplots(figsize=(8,6))
axes.set_xlim([0,50])
plot_MCMC_model(axes, x, y, emcee_trace[:3,:])
alpha, beta = emcee_trace[:2]
print("Line:", alpha.mean(), beta.mean())
Line: 3.56815932092 3.05726399527
Perhaps as we might hope, this makes rather little difference.
Instead of attacking the same problem as elsewhere, I want to pick on a problem, again from my undergrad days¹, which I felt unhappy attacking with the standard tools: it's a regression model.
A beaker of liquid is heated, and then removed from the source of heat and left to cool in a room whose temperature is 9 degrees. The temperature of the liquid is measured at the moment it is removed from the heat, and at intervals of 1 minute thereafter. The following results are observed:
Time (mins) | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
Temperature | 100 | 82 | 60 | 50 | 40 | 32 | 28 |
Setup and study an appropriate regression model.
¹ Can you tell I've been brushing up on my probability and statistics recently?
The relevant bit of Physics here is Newton's Law of Cooling: "the rate of heat loss of a body is proportional to the difference in temperatures between the body and its surroundings." This means that $$ \frac{dT}{dt} = -r(T-T_e)$$ where $T$ is the temperature, $t$ is time, $T_e$ is the temperature of the "environment" (our room) and $r>0$ is a constant depending on the beaker and the liquid and the room (and which we presumably wish to estimate from our data).
The solution to this differential equation is $$ T(t) - T_e = (T(0)-T_e) \exp(-rt). $$ We're told that $T_e = 9$, and so $T(t) = 9 + (T(0)-9)\exp(-rt)$.
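(A quick check of this solution — a sketch with illustrative values $r=0.26$, $T(0)=100$, $T_e=9$: a crude Euler integration of the differential equation should agree with the closed form to a couple of decimal places.)
r0, Te0, T0 = 0.26, 9.0, 100.0
T_euler, dt = T0, 0.001
for _ in range(int(6 / dt)):                      # integrate dT/dt = -r (T - Te) up to t = 6
    T_euler += -r0 * (T_euler - Te0) * dt
print(T_euler, Te0 + (T0 - Te0) * np.exp(-r0 * 6))  # Euler estimate vs closed-form T(6)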
From now on, write $\Delta T = T - T_e = T - 9$. The data becomes
Time (mins) | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
$\Delta T$ | 91 | 73 | 51 | 41 | 31 | 23 | 19 |
Perhaps the temperature of 100 degrees (Celsius) at the start is a red herring. So, 100 degrees at the start might make us think of boiling water, and so tempt us to treat this reading as exact. But I don't think we should: all the readings should carry an error.
A silly way to set up a standard regression model is to note that $$ \frac{T(t)-9}{T(0)-9} = e^{-rt} \implies \log\Big(\frac{T(t)-9}{T(0)-9}\Big) = -rt.$$ This then gives us data which we can plug into a least-squares regression model. However, I see a problem here: the initial reading $T(0)$ appears in every transformed data point and is treated as if it were exact, even though it is surely subject to the same measurement error as all the other readings.
Instead, let's re-arrange to get $$ \log( \Delta T(t_i) ) = \log(\Delta T(0)) - rt_i. $$ So setting $y_i$ to be the measured value of $\log( \Delta T(t_i) )$ we have a regression model $y_i = a - rt_i + e_i$. (Then $y_0 = a + e_0$ so we treat the initial reading in a more "symmetric" way).
I still find this problematic. Again, let $x_i$ be the reading of $\Delta T(t_i)$, so $x_i = \Delta T(t_i) + \epsilon_i$ with $\epsilon_i$ an error term. Then $$ \Delta T(t_i) + \epsilon_i = x_i = \exp(y_i) = e^a e^{-rt_i} \exp(e_i) = \Delta T(t_i) \exp(e_i). $$ Thus $$ \epsilon_i = \Delta T(t_i)\big( e^{e_i} - 1 \big). $$
If $X\sim N(0,\sigma^2)$ and $Y = \alpha(e^X-1)$ for some $\alpha > 0$ then $Y$ has the following density $$ f_Y(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \frac{1}{y+\alpha} \exp\Big( -\frac{1}{2\sigma^2}\log\Big(\frac{y}{\alpha}+1\Big)^2 \Big), $$ supported on $y/\alpha + 1 > 0$, that is, $y > -\alpha$. This is a shifted (and scaled) log-normal distribution.
# alpha = 1, sigma = 1
y = np.arange(-0.99, 2, 0.01)
f = np.exp(-0.5 * np.log(y+1) * np.log(y+1)) / (y + 1) / np.sqrt(2 * np.pi)
plt.plot(y,f)
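As a quick Monte Carlo check of this change of variables (a sketch, again with $\alpha = 1$ and $\sigma = 1$), we can compare a histogram of simulated values of $Y = e^X - 1$ against the density plotted above:
samples = np.exp(np.random.normal(0, 1, size=100000)) - 1        # Y = e^X - 1 with alpha = sigma = 1
counts, edges = np.histogram(samples, bins=200, range=(-0.99, 2), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
plt.plot(centres, counts)
_ = plt.plot(y, f, lw=2)                                         # the density computed above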
Anyway, for the sake of doing some Bayesian analysis, let's suppose the errors in our readings, $\epsilon_i$, are distributed $N(0,\sigma^2)$.
def exact(T0, r, ti):
    Ti = T0 * np.exp(-r * ti)
    return Ti

def simulate(T0, r, ti, variance):
    Ti = T0 * np.exp(-r * ti)
    Ti += np.random.normal(0, np.sqrt(variance), Ti.size)
    return Ti
simulate(91, 0.26, np.arange(0,7), 0.00001)
array([ 90.9949331 , 70.16341598, 54.1047304 , 41.70874508, 32.16502375, 24.79957468, 19.12166433])
simulate(91, 0.26, np.arange(0,7), 5)
array([ 97.00893952, 71.49114008, 56.27001536, 43.71170836, 36.91663608, 25.04277522, 21.88228346])
In our naive model, we have $Y_i = \log \Delta T(t_i) + e_i$ and the maximum likelihood estimates for $a$ and $r$ are: $$ \hat r = - \Big( \sum_i (t_i - \overline t)(y_i - \overline y) \Big) / \Big( \sum_i (t_i - \overline t)^2 \Big), \qquad \hat a = \overline y + \hat r \overline t. $$
def naive_regress(yi, ti):
    """Returns (a,r)"""
    tmean = np.sum(ti) / ti.size
    ymean = np.sum(yi) / yi.size
    r = - np.sum((ti - tmean) * (yi - ymean)) / np.sum((ti - tmean)**2)
    a = ymean + r * tmean
    return a, r
ti = np.arange(0,7)
Ti = simulate(91, 0.26, ti, 5)
xi = np.log(Ti)
a_naive, r_naive = naive_regress(xi, ti)
print(a_naive, r_naive, "-->", np.exp(a_naive), r_naive)
4.54473069836 0.275487935096 --> 94.135073066 0.275487935096
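(A quick cross-check, not in the original: np.polyfit on the logged data should reproduce these values, with slope $-\hat r$ and intercept $\hat a$.)
slope, intercept = np.polyfit(ti, xi, 1)
print(intercept, -slope)        # should match a_naive, r_naive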
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
_ = axes[0].plot(ti, a_naive - r_naive * ti, lw = 2)
_ = axes[0].plot(ti, xi, lw = 3)
_ = axes[1].plot(ti, np.exp(a_naive - r_naive * ti), lw = 2)
_ = axes[1].plot(ti, Ti, lw = 3)
My model is that $x_i = \Delta T(t_i) + \epsilon_i$ with $\epsilon_i \sim N(0,\sigma^2)$. So $X_i \sim N(e^{a-rt_i}, \sigma^2)$, and thus $$ \mathbb P(x|a, r, \sigma^2) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big( -\frac{(x_i - e^{a-rt_i})^2}{2\sigma^2} \Big). $$
I'll just use the Jeffreys prior on $\sigma$ and flat priors on $a$ and $r$.
def log_likelihood(theta, ti, Ti):
    a, r, sigma = theta
    return - np.sum((Ti - np.exp(a - r * ti))**2) / (2 * sigma**2) - Ti.size * np.log(sigma*sigma) * 0.5
log_likelihood((np.log(91), 0.26, 5), ti, Ti)
-11.704905380534159
import scipy.optimize as op
nll = lambda *args: -log_likelihood(*args) # Minimise the negative to get the maximum!
result = op.minimize(nll, [np.log(91), 0.26, 5], args=(ti, Ti))
a_est, r_est, sigma_est = result["x"]
print(np.exp(a_est), r_est, sigma_est, "-->", log_likelihood((a_est, r_est, sigma_est), ti, Ti) )
92.085032471 0.265229447515 1.69725825583 --> -7.20309913771
def log_prior(theta):
    a, r, sigma = theta
    if sigma <= 0:
        return -np.inf
    return - np.log(sigma)
def log_prob(theta, ti, Ti):
    lp = log_prior(theta)
    if not np.isfinite(lp):
        return -np.inf
    return lp + log_likelihood(theta, ti, Ti)
import emcee
ndim, nwalkers = 3, 50
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, args=[ti, Ti])
starting_guesses = np.random.normal(0, 0.1, (nwalkers, ndim)) + [a_est, r_est, sigma_est]
sampler.run_mcmc(starting_guesses, 2000)
emcee_trace = sampler.chain[:, 1000:, :].reshape(-1, ndim).T
import triangle
figure = triangle.corner(emcee_trace.T, labels=["$a$", "$r$", "$\sigma$"], truths=[np.log(91), 0.26, np.sqrt(5)])  # the simulated noise std is sqrt(5), since 5 was the variance
fig, axes = plt.subplots(figsize=(12, 5))
_ = axes.plot(ti, Ti, "ok")
x = np.linspace(0,6,30)
a, r = emcee_trace[:2]
y = np.exp(a[:,None] - r[:,None]*x)
mu = y.mean(0)
s = y.std(0)
_ = axes.plot(x, mu, color="black")
_ = axes.fill_between(x, mu - s, mu + s, color='lightgray')
_ = axes.plot(ti, np.exp(a_naive - r_naive * ti))
_ = axes.legend(["Data", "Bayesian", "naive"])
Ti = np.array( [91, 73, 51, 41, 31, 23, 19] )
ti = np.array( range(7) )
xi = np.log(Ti)
a_naive, r_naive = naive_regress(xi, ti)
print(a_naive, r_naive, "-->", np.exp(a_naive), r_naive)
4.51298712746 0.26810823074 --> 91.1938196201 0.26810823074
nll = lambda *args: -log_likelihood(*args) # Minimise the negative to get the maximum!
result = op.minimize(nll, [np.log(91), 0.26, 5], args=(ti, Ti))
a_est, r_est, sigma_est = result["x"]
print(np.exp(a_est), r_est, sigma_est, "-->", log_likelihood((a_est, r_est, sigma_est), ti, Ti) )
91.9417249709 0.271455483938 1.54476101614 --> -6.54408505649
ndim, nwalkers = 3, 50
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, args=[ti, Ti])
starting_guesses = np.random.normal(0, 0.1, (nwalkers, ndim)) + [a_est, r_est, sigma_est]
sampler.run_mcmc(starting_guesses, 2000)
emcee_trace = sampler.chain[:, 1000:, :].reshape(-1, ndim).T
figure = triangle.corner(emcee_trace.T, labels=["$a$", "$r$", "$\sigma$"])
fig, axes = plt.subplots(figsize=(12, 5))
_ = axes.plot(ti, Ti, "ok")
x = np.linspace(0,6,30)
a, r = emcee_trace[:2]
y = np.exp(a[:,None] - r[:,None]*x)
mu = y.mean(0)
s = y.std(0)
_ = axes.plot(x, mu, color="black")
_ = axes.fill_between(x, mu - s, mu + s, color='lightgray')
_ = axes.plot(ti, np.exp(a_naive - r_naive * ti))
_ = axes.legend(["Data", "Bayesian", "naive"])
a, r = emcee_trace[:2]
print("Naive regression r:", r_naive, "Bayesian model r:", r.mean())
Naive regression r: 0.26810823074 Bayesian model r: 0.272503086349