Introduction to Neural Networks (Psy 5038): Python & MCMC

MCMC with PyMC

PyMC 2.3.4 is the production version.

For introductions to PyMC and Probabilistic programming in Python see: and

A tutorial from the authors of PyMC.

If you want to explore PyMC 3, which is under development, see:

Also see for examples from John K. Kruschke's book (1rst ed.), "Doing Bayesian Data Analysis" translated from R to Python.

In [ ]:

PyMC example: Unsupervised clustering assuming a mixture model

Neural network motivation

In the next lectures, we will begin to look at large-scale system neural architectures for vision. The basic structure consists of lateral, feedforward, and feedback computations. A conjectured function of lateral processing is to link features of a similar type, for example similar orientations from nearby spatial positions. Texture synthesis Gibbs sampler is an example of a lateral linking process which links similar colors. These can be extended to other more abstract features.

Information can also be linked based on a model of "common cause". Often there is more than one explanation for any piece of data, and one needs to estimate the probability of where it came from. These kinds of models are naturally hierarchical. One example is segmentation where one decides which pixels belong to one object (e.g. in the foreground) vs. another (the background).

To illustrate a basic computation, here we look a simple toy version of the problem in which a scalar data point could come from one of two processes. We don't know which process led to any observed value, we don't know the central tendencies or variances of the processes. But we assume we have enough data to make good guesses.

Specifically, we'll see how a gaussian mixture model can be modeled using MCMC sampling with PyMC. The model is simple enough that we could implement an MCMC solution without the help of PyMC. However, using PyMC will help to see how to use it to model more complex problems.

Later we'll see how from a machine learning perspective, grouping operations are a form of unsupervised learning and be studied in the context of clustering algorithms such as EM and K-means. In the next lecture, we'll look at the EM solution and apply it to a clustering problem but in a different context.

This example is based on and uses code from Chapter 3 of Probabilistic Programming and Bayesian Methods for Hackers

In [1]:
import numpy as np
import pymc as pm
import matplotlib.pyplot as plt
%matplotlib inline

#We'll use this next function to show the graphical relations
from IPython.display import Image
Couldn't import dot_parser, loading of dot files will not be possible.
Synthetic data

There are a number of ways of drawing random numbers using Python packages. For example try np.random.normal(0,1). We'll use the scipy.stats module, because it supports a number of methods for many standard distributions. Good to know about.

Our true underlying model is a mixture of gaussians, with mixing parameter truep:

In [2]:
Object `norm` not found.
In [3]:
from scipy.stats import bernoulli,norm



for i in xrange(len(data)):
    p = bernoulli.rvs(truep)
    data[i]=p*norm.rvs(truemean1, truestd)+(1-p)*norm.rvs(truemean2, truestd)

Generative model

Here's where we begin to use PyMC. Suppose we were presented with the data, but don't know the generative process. So we begin to guess the possible general structure.

We will assume that each data point comes from one of two normal distributions, but we don't know the probability p that it comes from the first. The true value is truep (above), but we don't know that. Let's define a PyMC random variable p that reflects our uncertainty. The true value of p is fixed for the process that generated the data; but we just don't know its value, so assume it is uniformly distributed between 0 and 1. We will want to estimate the value of p.

Now imagine that for each data point we'd also like to estimate whether it came from the first or second distribution. We define a second random (vector) variable "assignment" that depends on p. In the language of graphical models, 'p' is the parent of the 'assignment' (vector) variable. The assignment vector thus contains n random values, one for each data point.

In [4]:
p = pm.Uniform("p", 0, 1)

assignment = pm.Categorical("assignment", [p, 1 - p], size=np.shape(data)[0])

These variables are called stochastic variables.

Experiment with:

pm.Uniform("test", 0, 1,value=.2,observed=True).value
pm.Uniform("test", 0, 1,observed=False).value


In [4]:

We also want to estimate the two means of the normal distributions assumed to have generated the data.

We'll assume the priors on the means of the two generating distributions are normally distributed. We just guess that their centers are 110 and 210, and allow for our uncertainty with a guess that each of their standard deviations is 10. You could explore how much bad guesses cost in terms of iterations to convergence (or in general, non-convergence).

Call these random variables 'centers'.

Note that these standard deviations reflect our uncertainty in the means, not the variability in data generated by either of the two assumed distributions. We also want to estimate standard deviations for each of the two distributions.

PyMC works with precision $\tau$, rather than standard deviation $\sigma$:

$$\tau = {1 \over \sigma^2}$$

Call the precisions 'taus'. And we'll assume a uniform distribution of the standard deviations from 0 to 100.

In [5]:
centers = pm.Normal("centers", [110, 210], [10**-2, 10**-2], size=2)
taus = 1.0 / pm.Uniform("stds", 0, 100, size=2) ** 2

We've defined the priors. Now we need to describe how the generating variables could produce the observed data. PyMC has a "decorator" function pm.deterministic, that is used to model functions in which the inputs, if known, deterministically specify the outputs. IF we did know which distribution a data point came from, we could assign it to the appropriate mean (center 0 or 1) with certainty. Similarity for the taus.

The functions below map an assignment of 0 or 1, to a set of parameters representing the centers and taus.

In [6]:
def center_i(assignment=assignment, centers=centers):
    return centers[assignment]

def tau_i(assignment=assignment, taus=taus):
    return taus[assignment]

# OK. Here's the final summary line representing our model that relates the centers and taus to the data
# We assume that observations, the data, can be explained by the mixture model 

observations = pm.Normal("obs", center_i, tau_i, value=data, observed=True)


print "Random assignments: ", assignment.value[:4], "..."

print "Assigned center mean: ", center_i.value[:4], "..."

print "Assigned precision: ", tau_i.value[:4], "..."

print centers[assignment].value

These show current values, which may be accidents of the initial set-up, or reflect requests for random draws. Ideally, their exact values shouldn't matter to the final estimates.

While center_i and tau_i are deterministic functions, their values while running the simulation will vary reflecting the stochastic property of their parents. So they are still random variables.

In [17]:

Finally we ask PyMC to create a model class.

In [7]:
# create a model class
model = pm.Model([p, assignment, observations, taus, centers])

The model will get handed off to pm.MCMC(model) class below to run the simulations. But first, let's get a graphical view of what we've done using the directed graph function: pm.graph.dag.

Graphical view

Ellipses represent nodes to be estimated or are fixed. Triangles represent deterministic functions. It looks more complicated than it is.

In [8]:
In [7]:

MCMC simulation

It can help to try to estimate the MAP value of the parameters. This can then be used as the starting condition for MCMC. We'll leave this step out for now.

In [8]:
#map1 = pm.MAP(model) #stores the fitted variables' values in foo.value
In [9]:
mcmc = pm.MCMC(model)
#now use MCMC's sample method
 [-----------------100%-----------------] 50000 of 50000 complete in 31.3 sec


In [10]:
from IPython.core.pylabtools import figsize

figsize(12.5, 9)
lw = 1
center_trace = mcmc.trace("centers")[:]

# for pretty colors.
colors = ["#348ABD", "#A60628"] \
if center_trace[-1, 0] > center_trace[-1, 1] \
    else ["#A60628", "#348ABD"]

plt.plot(center_trace[:, 0], label="trace of center 0", c=colors[0], lw=lw)
plt.plot(center_trace[:, 1], label="trace of center 1", c=colors[1], lw=lw)
plt.title("Traces of unknown parameters")
leg = plt.legend(loc="upper right")

std_trace = mcmc.trace("stds")[:]
plt.plot(std_trace[:, 0], label="trace of standard deviation of cluster 0",
     c=colors[0], lw=lw)
plt.plot(std_trace[:, 1], label="trace of standard deviation of cluster 1",
     c=colors[1], lw=lw)
plt.legend(loc="upper left")

p_trace = mcmc.trace("p")[:]
plt.plot(p_trace, label="$p$: frequency of assignment to cluster 0",
     color="#467821", lw=lw)
plt.ylim(0, 1)
<matplotlib.legend.Legend at 0x10d7ec7d0>

The traces don't appear to have converged yet. mcmc remembers its samples, so we can continue where we left off and gather another 100,000.

In [11]:
 [-----------------100%-----------------] 100000 of 100000 complete in 63.5 sec
In [12]:
figsize(12.5, 4)
center_trace = mcmc.trace("centers", chain=1)[:]
prev_center_trace = mcmc.trace("centers", chain=0)[:]

x = np.arange(50000)
plt.plot(x, prev_center_trace[:, 0], label="previous trace of center 0",
     lw=lw, alpha=0.4, c=colors[1])
plt.plot(x, prev_center_trace[:, 1], label="previous trace of center 1",
     lw=lw, alpha=0.4, c=colors[0])

x = np.arange(50000, 150000)
plt.plot(x, center_trace[:, 0], label="new trace of center 0", lw=lw, c="#348ABD")
plt.plot(x, center_trace[:, 1], label="new trace of center 1", lw=lw, c="#A60628")

plt.title("Traces of unknown center parameters")
leg = plt.legend(loc="upper right")
<matplotlib.text.Text at 0x10d72e0d0>

Let's get the means of the traces and compare with the true values

In [15]:
print mcmc.trace('centers')[:,0].mean(),mcmc.trace('centers')[:,1].mean(),mcmc.trace('p')[:].mean()

print truemean1, truemean2, truep
91.0217581589 192.749727837 0.175733625088
100.0 200.0 0.25

But we shouldn't include samples from the early part of the chain when estimating the means.

Try estimating the means using only the last 50000 steps of the traces.

If there is insufficent storage, one can specify a burn-in period where the first part of the chain gets discarded. Further, there may be concern over the independence of samples. One can also just keep every mth sample. This is called thinning, and can be specified during the sampling call.

In [19]:

One can check for corrrelations by looking the auto-correlations of the traces. Plotting the histogram gives some sense of whether how the space is being sampled.

In [16]:
from pymc.Matplot import plot as mcplot

mcplot(mcmc.trace("centers", 2), common_scale=False)
Plotting centers_0
Plotting centers_1
/Users/kersten/anaconda/lib/python2.7/site-packages/numpy/core/ VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
In [49]:
Plotting p

Once we leave toy problems, questions of convergence, adequate mixing become more of a challenge. See Chapter 3 from Probabilistic Programming and Bayesian Methods for Hackers for a discussion in the context of this example.

Further information and references on probabilistic programming using python and other software tools

HDDM: Hierarchical Bayesian estimation of the Drift-Diffusion Model in Python.

Non-python, probablistic computation for cognition using Church:

For tutorials, see: and for software,including python:

In [55]:
from IPython.display import YouTubeVideo
In [54]:
In [ ]: