This is the material from a hands-on session we had in the NeuroBiophysics research group of the Institute of Physics of São Carlos (USP). As part of a journal club, it followed our reading of the first chapter of a classic textbook on Markov Chain Monte Carlo methods (Gilks et al.). For this reason, you will not find here all the basic details of MCMC methods, nor their derivations.
We basically studied two separate cases. In Part 1, we are interested in drawing random samples from a given probability distribution so that we can perform statistical inference on them. Part 2 is then concerned with Bayesian statistical inference, using Markov chains to draw samples from a posterior distribution given a probabilistic model. In this latter case, I have used the pymc package (version 2.2 at the time of writing).
I appreciate comments and suggestions.
All the following Python code and data are released under the WTFPL license (details at the end).
First, let's start with a simple Gibbs sampling process. The scenario can be the following: you have an intractable expression describing some probabilistic model over several variables (usually way more than one). You cannot compute the marginal distributions, but you can compute the conditionals. The objective is to construct a Markov chain whose stationary (equilibrium) distribution is exactly that intractable distribution, by repeatedly sampling each variable from its conditional given the current values of the others.
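For reference, you can read the full conditionals straight off the code below; a quick calculation (which you should check) shows they correspond to the unnormalized joint density

$$ f(x, y) \propto x^2 \exp\left( -x y^2 - y^2 + 2y - 4x \right), $$

whose full conditionals are standard distributions:

$$ X \mid y \sim \mathrm{Gamma}\!\left(3,\ \text{scale} = \frac{1}{y^2 + 4}\right), \qquad Y \mid x \sim \mathcal{N}\!\left(\frac{1}{x+1},\ \frac{1}{2x+2}\right). $$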
import random, math
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm                      # colormap for the surface plot below
from mpl_toolkits.mplot3d import Axes3D        # enables the '3d' projection
from matplotlib.ticker import LinearLocator, FormatStrFormatter
%matplotlib inline
def gibbs(N=5000, thin=1000, outequilibrium=500):
    # Burn-in: iterate the chain until it (hopefully) reaches equilibrium
    x0 = 0.
    y0 = 0.
    for i in range(outequilibrium):
        x0 = random.gammavariate(3, 1.0/( y0*y0 + 4 ))
        y0 = random.gauss(1.0/( x0 + 1 ), 1.0/math.sqrt( 2*x0 + 2 ))
    x = np.array([x0])
    y = np.array([y0])
    xaux = x0
    yaux = y0
    for i in range(N):
        # Thinning: keep only every 'thin'-th state to reduce autocorrelation
        for j in range(thin):
            xaux = random.gammavariate(3, 1.0/( yaux*yaux + 4 ))
            yaux = random.gauss(1.0/( xaux + 1 ), 1.0/math.sqrt( 2*xaux + 2 ))
        # Storing (approximately) uncorrelated data
        x = np.append( x, xaux )
        y = np.append( y, yaux )
    return x, y
Let's generate some observations
x,y = gibbs(N=1000,thin=10)
We can now use them to evaluate some average values.
m = np.mean(x)
print '<X> = %f' %(m)
m = np.mean(y)
print '<Y> = %f' %(m)
<X> = 0.649957
<Y> = 0.638119
Even some non-trivial average values...
n = x.shape[0]
m = 0
for j in range(n):
    m += x[j]*math.cos(y[j])
m /= float(n)
print '< X cos(Y) > = %f' %(m)
< X cos(Y) > = 0.476095
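By the way, the same expectation can be computed in a single vectorized line with NumPy, which is faster and less error-prone than the explicit loop:

# Vectorized equivalent of the loop above
m = np.mean( x * np.cos(y) )
print '< X cos(Y) > = %f' %(m)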
It is very instructive to see how the average values evolve with the number of Monte Carlo steps we perform.
Nmedias = 100    # number of running averages to record
Nj = 100         # new samples added per batch
avgX = np.array([])
avgY = np.array([])
x = np.array([])
y = np.array([])
for j in range(Nmedias):
    xn, yn = gibbs(N = Nj, thin = 50)
    # Accumulate samples and record the running averages
    x = np.append( x, xn )
    avgX = np.append( avgX, np.mean(x) )
    y = np.append( y, yn )
    avgY = np.append( avgY, np.mean(y) )
plt.plot(avgX, '-r')
plt.plot(avgY, '-b')
Next, let's estimate the whole distribution from our observations
Nbins = 30
H, xedges, yedges = np.histogram2d(x, y, bins=( Nbins, Nbins ))
# Grids of bin edges; H is transposed below so that its first axis (x)
# matches the meshgrid orientation
X, Y = np.meshgrid(xedges[0:Nbins], yedges[0:Nbins])
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
surf = ax.plot_surface(X, Y, H.T, rstride=1, cstride=1, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)
ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter(FormatStrFormatter('%.02f'))
fig.colorbar(surf, shrink=0.5, aspect=5)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
I now move on to the second part and start playing with the pymc package. The main idea is to perform a Bayesian inference and find out everything we can about the statistical mixture underlying a bimodal data set. As pointed out by Brenno Barbosa, this is indeed the subject of chapter 24 of the main reference I cited at the beginning of this text (Gilks et al.); you can check there for more information.
Let's start by loading some libraries
%reset
Once deleted, variables cannot be recovered. Proceed (y/[n])? y
import numpy as np
import matplotlib.pyplot as plt
import pylab
%matplotlib inline
To start understanding Bayesian inference and how it can be implemented with pymc, we begin with a simpler example: a normally distributed data set. See for yourself:
print 'Reading the data...'
s = np.loadtxt('data/dados_nonmixture.data')
print 'We have %d observations' %(s.shape[0])
print 'Building histogram...'
hist, bin_edges = np.histogram(s, density=True, bins=15)
plt.bar(bin_edges[:-1], hist, width=0.2)
plt.show()
Reading the data...
We have 2000 observations
Building histogram...
Of course, we would not need any Bayesian inference to discover the average value and the variance, but suppose we are not interested in point estimates (even more: we are not frequentists, we are Bayesians!). What we actually want is to draw the distribution of both the mean and the variance (i.e., they are no longer fixed parameters, they are random variables), given the data we have in our hands. We hope that each posterior has a small support and a well-defined bell shape, so that we can learn something about the mean and the variance of the data set.
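In symbols, this is a standard application of Bayes' theorem: writing $D = \{s_1, \dots, s_n\}$ for the data, $\mu$ for the mean and $\tau$ for the precision, we want to sample from the posterior

$$ p(\mu, \tau \mid D) \propto p(\mu)\, p(\tau) \prod_{i=1}^{n} \mathcal{N}(s_i \mid \mu, \tau^{-1}), $$

and pymc will construct for us a Markov chain whose equilibrium distribution is exactly this posterior, just as in Part 1.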
import pymc
from pymc import *
In the following, we define a model in which the mean and the variance (here given as the precision, the inverse of the variance) are drawn from uniform priors.
# Defining the model
def mymodel():
    # The data
    s = np.loadtxt('data/dados_nonmixture.data')
    # Uniform priors on the mean and on the precision (1/variance)
    mean = Uniform( 'mean', lower = s.min(), upper = s.max() )
    precision = Uniform('precision', lower = 0.0001, upper = 1.5)
    # The likelihood: the observed data, normally distributed
    process = Normal('process', mu = mean, tau = precision, observed = True, value = s)
    return locals()
# Model object
model = MCMC( mymodel() )
To run the model, we just call the sample() method. The argument names are self-explanatory and play the same roles as the corresponding ones in Part 1 (burn, for instance, is our outequilibrium).
# Sampling procedure
model.sample( iter = 2000, thin = 10 , burn = 100)
[****************100%******************] 2000 of 2000 complete
model.stats()
{'mean': {'95% HPD interval': array([-1.02737691, -0.95901627]), 'mc error': 0.0016671214231405488, 'mean': -0.98841406086924199, 'n': 190, 'quantiles': {2.5: -1.027376910881334, 25: -0.99926181151830407, 50: -0.98583565879817858, 75: -0.97672509856598488, 97.5: -0.95379519481852748}, 'standard deviation': 0.017442049317452416}, 'precision': {'95% HPD interval': array([ 1.49875542, 1.49959646]), 'mc error': 0.00045840846910202588, 'mean': 1.4986940543572966, 'n': 190, 'quantiles': {2.5: 1.4939545646260397, 25: 1.4987554173346109, 50: 1.4987554173346109, 75: 1.4994072196730017, 97.5: 1.4995964588047999}, 'standard deviation': 0.003371510905781389}}
Let's check the mean and precision distributions
# Checking the model variables
Matplot.plot(model)
Plotting precision
Plotting mean
Please draw your own conclusions from the above distributions. Are both of them good enough? Are our priors good guesses? What needs to be changed? What can be changed?
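As a hint for that discussion: notice in the stats above that the precision is piled up against the upper bound of its Uniform(0.0001, 1.5) prior, which suggests the prior is truncating the posterior. A minimal sketch of one possible change (the new upper bound of 10.0 is an arbitrary choice of mine) would be:

# Sketch: same model, but with a less restrictive prior on the precision
def mymodel_wide():
    s = np.loadtxt('data/dados_nonmixture.data')
    mean = Uniform( 'mean', lower = s.min(), upper = s.max() )
    # Wider upper bound (arbitrary assumption); rerun and compare the posteriors
    precision = Uniform('precision', lower = 0.0001, upper = 10.0)
    process = Normal('process', mu = mean, tau = precision, observed = True, value = s)
    return locals()

model_wide = MCMC( mymodel_wide() )
model_wide.sample( iter = 2000, thin = 10, burn = 100 )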
We can also see a graphical representation of this model
graph = pymc.graph.graph(model)
graph.write_png("graph_nomix.png")
True
Now for almost the same thing, but with an interesting new ingredient: a mixture of two Gaussians. Our assumption is that each observation is drawn from a Gaussian with one mean value with probability, say, p, and from a Gaussian with a different mean value with probability 1 - p. We want to find out the mean of each Gaussian, the variance, and at what rate the two are mixed.
For the sake of simplicity, I will assume both Gaussians have the same variance. This is far from necessary; you can already start wondering how you would solve this problem without this assumption (a sketch is given right after the model below).
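In symbols, the generative model we are about to encode is, for each observation $s_i$ with a latent Bernoulli indicator $z_i$:

$$ z_i \sim \mathrm{Bernoulli}(\theta), \qquad s_i \mid z_i \sim \mathcal{N}\big(z_i \mu_1 + (1 - z_i)\,\mu_2,\ \sigma^2\big), $$

so each observation comes from the first Gaussian with probability $\theta$ and from the second with probability $1 - \theta$.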
print 'Reading the data...'
s = np.loadtxt('data/dados_mixture.data')
print 'We have %d observations' %(s.shape[0])
print 'Building histogram...'
hist, bin_edges = np.histogram(s, density=True, bins=15)
plt.bar(bin_edges[:-1], hist, width=0.2)
plt.show()
Reading the data...
We have 2000 observations
Building histogram...
Our model will be as follows.
# Definition of the model
def hip():
    # This is the data
    s = np.loadtxt('data/dados_mixture.data')
    # The Bernoulli parameter
    theta = Uniform("theta", lower = 0., upper = 1. )
    # Bernoulli process to mix the two distributions
    bern = Bernoulli("bern", p = theta, size = s.shape[0])
    # Means and common standard deviation for the two Gaussian distributions
    mean1 = Uniform('mean1', lower = s.min(), upper = s.max() )
    mean2 = Uniform('mean2', lower = s.min(), upper = s.max() )
    std_dev = Uniform('std_dev', lower = 0., upper = 5.)
    # Constructing the mean value for the mixed process
    @deterministic(plot = False)
    def mean(bern = bern, mean1 = mean1, mean2 = mean2):
        return bern * mean1 + (1 - bern) * mean2
    # Precision (1/variance) from the common standard deviation
    @deterministic(plot = False)
    def precision(std_dev = std_dev):
        return 1.0 / ( std_dev * std_dev )
    # The process itself
    process = Normal('process', mu = mean, tau = precision, value = s,
                     observed = True)
    return locals()
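(Aside: as promised, here is a minimal, untested sketch of how one could drop the equal-variance assumption: give each component its own standard deviation, std_dev1 and std_dev2 being names I am introducing here, and mix the precisions exactly as we mix the means. We proceed with the equal-variance model below.)

# Sketch: one standard deviation per component (unequal variances)
def hip_twostd():
    s = np.loadtxt('data/dados_mixture.data')
    theta = Uniform("theta", lower = 0., upper = 1. )
    bern = Bernoulli("bern", p = theta, size = s.shape[0])
    mean1 = Uniform('mean1', lower = s.min(), upper = s.max() )
    mean2 = Uniform('mean2', lower = s.min(), upper = s.max() )
    std_dev1 = Uniform('std_dev1', lower = 0., upper = 5.)
    std_dev2 = Uniform('std_dev2', lower = 0., upper = 5.)
    @deterministic(plot = False)
    def mean(bern = bern, mean1 = mean1, mean2 = mean2):
        return bern * mean1 + (1 - bern) * mean2
    @deterministic(plot = False)
    def precision(bern = bern, std_dev1 = std_dev1, std_dev2 = std_dev2):
        # Each observation gets the precision of the component it is assigned to
        return bern / (std_dev1 * std_dev1) + (1 - bern) / (std_dev2 * std_dev2)
    process = Normal('process', mu = mean, tau = precision, value = s,
                     observed = True)
    return locals()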
# Defining the model
model = MCMC( hip() )
# Steps
MCstep = 200
Nburns = 10000
Nsteps = MCstep*200 + Nburns
# Iteration process
model.sample( iter = Nsteps, burn = Nburns, thin = MCstep )
[****************100%******************] 50000 of 50000 complete
model.stats( variables = ['mean1', 'mean2'] )
{'mean1': {'95% HPD interval': array([-1.03187514, -0.3999318 ]), 'mc error': 0.022692637182771567, 'mean': -0.86364250126878428, 'n': 700, 'quantiles': {2.5: -1.0221344185334451, 25: -1.0084127363144639, 50: -0.99853025989043254, 75: -0.63354873573445469, 97.5: -0.29597917456679484}, 'standard deviation': 0.23164490052596071}, 'mean2': {'95% HPD interval': array([-0.63747047, 1.04457518]), 'mc error': 0.068787299351651399, 'mean': 0.55667255585550057, 'n': 700, 'quantiles': {2.5: -0.70568504677328225, 25: -0.43520448807601259, 50: 0.98332536793538394, 75: 1.0025922455921525, 97.5: 1.0299970299783643}, 'standard deviation': 0.69400477536385885}}
# Checking the dynamics of mean1 and mean2
Matplot.plot( model.trace('mean1') )
Matplot.plot( model.trace('mean2') )
Plotting mean1
Plotting mean2
We need more steps to reach convergence...
# Trying again with larger burn interval
model.sample( iter = 800*MCstep, burn = 300*MCstep, thin = MCstep )
[****************100%******************] 160000 of 160000 complete
Matplot.plot( model.trace('mean1') )
Matplot.plot( model.trace('mean2') )
Plotting mean1
Plotting mean2
Now it seems that we have gone far enough to reach an equilibrium condition. Let's take a quick look at the other two variables.
Matplot.plot( model.trace('theta') )
Matplot.plot( model.trace('std_dev') )
Plotting theta
Plotting std_dev
Getting the statistics of the model, discarding the first 300 stored samples.
model.stats(start = 300)
{'bern': {'95% HPD interval': array([[ 1., 1.], [ 1., 1.], [ 0., 0.], ..., [ 0., 0.], [ 0., 1.], [ 1., 1.]]), 'mc error': array([ 0. , 0. , 0. , ..., 0. , 0.04277485, 0. ]), 'mean': array([ 1. , 1. , 0. , ..., 0. , 0.7375, 1. ]), 'n': 400, 'quantiles': {2.5: array([ 1., 1., 0., ..., 0., 0., 1.]), 25: array([ 1., 1., 0., ..., 0., 0., 1.]), 50: array([ 1., 1., 0., ..., 0., 1., 1.]), 75: array([ 1., 1., 0., ..., 0., 1., 1.]), 97.5: array([ 1., 1., 0., ..., 0., 1., 1.])}, 'standard deviation': array([ 0. , 0. , 0. , ..., 0. , 0.4399929, 0. ])}, 'mean': {'95% HPD interval': array([[-1.02325994, -0.9836972 ], [-1.02325994, -0.9836972 ], [ 0.95414825, 1.03166316], ..., [ 0.95414825, 1.03166316], [-1.02325994, 1.01421082], [-1.02325994, -0.9836972 ]]), 'mc error': array([ 0.00051406, 0.00051406, 0.00106816, ..., 0.00106816, 0.08540785, 0.00051406]), 'mean': array([-1.00347714, -1.00347714, 0.9961384 , ..., 0.9961384 , -0.47916034, -1.00347714]), 'n': 400, 'quantiles': {2.5: array([-1.02325994, -1.02325994, 0.95482211, ..., 0.95482211, -1.02123343, -1.02325994]), 25: array([-1.01000676, -1.01000676, 0.98205589, ..., 0.98205589, -1.00770635, -1.01000676]), 50: array([-1.00361561, -1.00361561, 0.99725366, ..., 0.99725366, -0.99829576, -1.00361561]), 75: array([-0.99627449, -0.99627449, 1.01060757, ..., 1.01060757, 0.95750727, -0.99627449]), 97.5: array([-0.9836972 , -0.9836972 , 1.03394489, ..., 1.03394489, 1.01945825, -0.9836972 ])}, 'standard deviation': array([ 0.01045394, 0.01045394, 0.02001978, ..., 0.02001978, 0.87833447, 0.01045394])}, 'mean1': {'95% HPD interval': array([-1.02325994, -0.9836972 ]), 'mc error': 0.00051406400622172946, 'mean': -1.0034771428317881, 'n': 400, 'quantiles': {2.5: -1.0232599398349786, 25: -1.0100067619684092, 50: -1.0036156103530811, 75: -0.99627449437075899, 97.5: -0.98369719572723935}, 'standard deviation': 0.010453936320626071}, 'mean2': {'95% HPD interval': array([ 0.95414825, 1.03166316]), 'mc error': 0.0010681575460294817, 'mean': 0.99613840086734851, 'n': 400, 'quantiles': {2.5: 0.95482210920101107, 25: 0.98205588880816674, 50: 0.99725366267368132, 75: 1.0106075740696616, 97.5: 1.0339448861048506}, 'standard deviation': 0.020019784690695663}, 'precision': {'95% HPD interval': array([ 5.78728904, 6.57111383]), 'mc error': 0.010101309396832252, 'mean': 6.1729449300525649, 'n': 400, 'quantiles': {2.5: 5.762178465512406, 25: 6.0360094542061757, 50: 6.1802201331222504, 75: 6.3064494491035239, 97.5: 6.5606939643129909}, 'standard deviation': 0.198547699951762}, 'std_dev': {'95% HPD interval': array([ 0.39010409, 0.41568314]), 'mc error': 0.00033097829183626018, 'mean': 0.40264546302423981, 'n': 400, 'quantiles': {2.5: 0.39051092420533512, 25: 0.39820607128876617, 50: 0.40229082686851181, 75: 0.40707568062516614, 97.5: 0.41692774708686103}, 'standard deviation': 0.0064948079945427494}, 'theta': {'95% HPD interval': array([ 0.74323787, 0.77982117]), 'mc error': 0.00050010167963355145, 'mean': 0.75987424839119411, 'n': 400, 'quantiles': {2.5: 0.74115288756009057, 25: 0.75326977356496794, 50: 0.76036854609327098, 75: 0.76628608606936777, 97.5: 0.77816423924817746}, 'standard deviation': 0.0096024160929834186}}
Saving a graphical representation of the model
graph = pymc.graph.graph(model)
graph.write_png("graph_mix.png")
True
Finally, I have provided the file I used to generate this data ('scripts/twogaussiasn.py'). You can compare the values we obtained here with the true ones and see how good our inference is.
Moreover, it is important that you get your hands dirty. There are two things you could do in this sense:
1. What could you change in this code to make it better? Rewrite it, and think about the assumptions and the priors that I have used.
2. Try another simple problem: I have supplied a Poisson process with a sudden change in its rate. Find out at which time this change happens (see the sketch below for a possible starting point).
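For the Poisson exercise, a possible starting point is the classic changepoint model sketched below. This is only a sketch: the file name 'data/dados_poisson.data' and the Exponential priors on the rates are my own assumptions, so adapt them to the supplied data.

# Sketch of a Poisson changepoint model (hypothetical file name and priors)
def changepoint():
    counts = np.loadtxt('data/dados_poisson.data').astype(int)   # assumed path, adjust as needed
    n = counts.shape[0]
    # A priori, the change can happen anywhere in the observation window
    tau = DiscreteUniform('tau', lower = 0, upper = n - 1)
    # One rate before the change, another one after it
    rate1 = Exponential('rate1', beta = 1.)
    rate2 = Exponential('rate2', beta = 1.)
    @deterministic(plot = False)
    def rate(tau = tau, rate1 = rate1, rate2 = rate2):
        # Piecewise-constant rate: rate1 up to the changepoint, rate2 afterwards
        out = np.empty(n)
        out[:tau] = rate1
        out[tau:] = rate2
        return out
    process = Poisson('process', mu = rate, value = counts, observed = True)
    return locals()

model = MCMC( changepoint() )
model.sample( iter = 40000, burn = 10000, thin = 10 )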
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
Version 2, December 2004
Copyright (C) 2004 Sam Hocevar sam@hocevar.net
Everyone is permitted to copy and distribute verbatim or modified
copies of this license document, and changing it is allowed as long
as the name is changed.
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION