Given fixed $K$ and $D$, the beta-Bernoulli mixture model is a generative model for $D$-dimensional binary vectors ($y_i \in \{0,1\}^{D}$) drawn from a $K$-component mixture.
We can describe the probabilistic model as:
\begin{align*} \pi|\alpha &\sim \text{Dirichlet}(\{ \frac{\alpha}{K} \}_{k=1}^{K}) \\ c_i|\pi &\sim \text{Discrete}(\pi) \\ p_d^{k} | \beta,\gamma &\sim \text{Beta}(\beta, \gamma) \\ y_i^{d} | c_i{=}k,p_d^{k} &\sim \text{Bernoulli}(p_d^{k}) \end{align*} Let us clear up some of this notation. $\pi$ is a $K$-dimensional vector living in the $(K-1)$-dimensional probability simplex. For each component $k \in [K]$, $\{p_d^k\}_{d=1}^{D} \in [0,1]^{D}$ is the vector of per-dimension Bernoulli probabilities. Given $N$ data points, the assignment vector is $\{c_i\}_{i=1}^{N} \in \{0, \dots, K-1\}^{N}$ (zero-indexed, to match the code below). The hyperparameters of this model are $\mathcal{H} = (\alpha, \beta, \gamma)$.
As we will see later on, this model has convenient analytical properties for Gibbs sampling, since the beta distribution is the conjugate prior for the Bernoulli likelihood.
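Concretely, conjugacy means that, conditioned on the cluster assignments, the posterior over each $p_d^k$ is again a beta distribution with count-based updates: \begin{align*} p_d^{k} \mid \{y_i : c_i = k\}, \beta, \gamma \;\sim\; \text{Beta}\Big(\beta + \sum_{i : c_i = k} y_i^{d},\;\; \gamma + |\{i : c_i = k\}| - \sum_{i : c_i = k} y_i^{d}\Big) \end{align*} These same counts will reappear inside the Gibbs conditional below.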
The inference problem we will consider is the following. Given a dataset $\mathcal{Y} = \{y_i\}_{i=1}^{N}$, we want to learn the posterior distribution on the assignment vector $\mathcal{C} = \{c_i\}_{i=1}^{N}$. That is, we want to be able to estimate and draw samples from $p(\mathcal{C} | \mathcal{Y}; \mathcal{H})$.
This problem can be solved efficiently with Gibbs sampling, a Markov chain Monte Carlo (MCMC) method. Define $\mathcal{C}_{\neg i} = \{ c_j \in \mathcal{C} : j \neq i \}$. The Gibbs sampler assumes we can efficiently sample from the full conditional $p(c_i | \mathcal{C}_{\neg i}, \mathcal{Y}; \mathcal{H})$. Each iteration then sweeps over $i \in [N]$, drawing \begin{align*} c^{(t)}_{i} \sim p(c_i \mid \{ c_j \in \mathcal{C}^{(t)} : j < i \}, \{ c_j \in \mathcal{C}^{(t-1)} : j > i \}, \mathcal{Y}; \mathcal{H}) \end{align*}
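Schematically, a single sweep of this systematic-scan Gibbs sampler looks like the following sketch; here full_conditional_sample is a hypothetical placeholder for the model-specific conditional we derive next.
def gibbs_sweep(assignments, full_conditional_sample):
    # one systematic-scan sweep: resample each c_i in turn, conditioned on the
    # most recent values of all other assignments.
    # full_conditional_sample(i, assignments) is a placeholder that draws c_i from
    # p(c_i | C_{-i}, Y; H); the concrete beta-Bernoulli version is implemented below.
    for i in xrange(len(assignments)):
        assignments[i] = full_conditional_sample(i, assignments)
    return assignments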
Define $\mathcal{Y}^{k}_{\neg i} = \{ y_j \in \mathcal{Y} : c_j = k, j \neq i \}$. It turns out that, for the beta-Bernoulli model, we can derive \begin{align*} p(c_i{=}k|\mathcal{C}_{\neg i}, \mathcal{Y};\mathcal{H}) \propto \frac{ |\mathcal{Y}^{k}_{\neg i}| + \frac{\alpha}{K} }{ N - 1 + \alpha } \prod_{d=1}^{D} \frac{(\beta+\sum_{y_j\in \mathcal{Y}^{k}_{\neg i}} y_j^{d})^{y_i^d} ( \gamma + |\mathcal{Y}^{k}_{\neg i}| - \sum_{y_j\in \mathcal{Y}^{k}_{\neg i}} y_j^{d})^{(1-y_i^{d})} }{ \beta + \gamma + |\mathcal{Y}^{k}_{\neg i}| } \end{align*} We can then construct $p(c_i{=}k|\mathcal{C}_{\neg i}, \mathcal{Y};\mathcal{H})$ exactly by enumerating all $K$ clusters and normalizing.
We'll first implement a sampler from a discrete distribution. While scipy.stats.rv_discrete (docs) implements this functionality, it is quite heavyweight to call inside a tight loop, so we roll our own.
# workspace setup
import numpy as np
import scipy as sp
import scipy.io
import scipy.misc
import scipy.special
%matplotlib inline
import matplotlib.pylab as plt
import itertools as it
import math
def discrete_sample(pmf):
    # inverse-CDF sampling: walk the cumulative sum until it exceeds a uniform draw
    coin = np.random.random()
    acc = 0.0
    n, = pmf.shape
    for i in xrange(n):
        acc0 = acc + pmf[i]
        if coin <= acc0:
            return i
        acc = acc0
    return n-1
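As a quick sanity check (purely illustrative), the empirical frequencies of many draws should roughly match the pmf we sample from:
# illustrative sanity check of discrete_sample: empirical frequencies vs. the pmf
pmf = np.array([0.2, 0.5, 0.3])
draws = np.array([discrete_sample(pmf) for _ in xrange(10000)])
print 'empirical:', np.bincount(draws, minlength=pmf.shape[0]) / float(draws.shape[0])
print 'pmf      :', pmf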
Ok enough playing around! Now the actual Gibbs sampler implementation. Note we make use of numpy broadcasting for performance:
def gibbs_beta_bernoulli(Y, K, alpha, beta, gamma, niters):
    N, D = Y.shape
    alpha, beta, gamma = map(float, [alpha, beta, gamma])
    # start with random assignment
    assignments = np.random.randint(0, K, size=N)
    # initialize the sufficient statistics (cluster sums and counts) accordingly
    sums = np.zeros((D, K), dtype=np.int64)
    cnts = np.zeros(K, dtype=np.int64)
    for yi, ci in zip(Y, assignments):
        sums[:,ci] += yi
        cnts[ci] += 1
    history = np.zeros((niters, N), dtype=np.int64)
    # precomputations
    nplog = np.log
    npexp = np.exp
    nparray = np.array
    logsumexp = sp.misc.logsumexp
    lg_denom = nplog(N - 1 + alpha)
    alpha_over_K = alpha/K
    beta_plus_gamma = beta + gamma
    for t in xrange(niters):
        for i, (yi, ci) in enumerate(zip(Y, assignments)):
            # remove y_i from the sufficient statistics
            sums[:,ci] -= yi
            cnts[ci] -= 1
            lg_term1 = nplog(cnts + alpha_over_K) - lg_denom - D*nplog(beta_plus_gamma + cnts)
            lg_term2 = nplog(beta + sums)
            lg_term3 = nplog(gamma + cnts - sums)
            lg_dist = lg_term1 + (lg_term2*yi[:,np.newaxis] + lg_term3*(1-yi[:,np.newaxis])).sum(axis=0)
            lg_dist -= logsumexp(lg_dist) # normalize
            # reassign
            ci = discrete_sample(npexp(lg_dist))
            assignments[i] = ci
            sums[:,ci] += yi
            cnts[ci] += 1
        history[t] = assignments
    return history
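For readability, here is an unvectorized sketch of the per-point conditional that the broadcasted lines above compute. It is purely illustrative (it is not used anywhere else) and assumes, as in the sampler, that $y_i$ has already been removed from sums and cnts:
# illustrative, loop-based version of the unnormalized log conditional
# log p(c_i = k | C_{-i}, Y; H); assumes yi was already removed from sums/cnts.
def lg_conditional_naive(yi, sums, cnts, N, K, D, alpha, beta, gamma):
    lg_dist = np.zeros(K)
    for k in xrange(K):
        lg_dist[k] = np.log(cnts[k] + alpha/K) - np.log(N - 1 + alpha)
        for d in xrange(D):
            if yi[d]:
                lg_dist[k] += np.log(beta + sums[d, k])
            else:
                lg_dist[k] += np.log(gamma + cnts[k] - sums[d, k])
            lg_dist[k] -= np.log(beta + gamma + cnts[k])
    # subtract logsumexp(lg_dist) and exponentiate to get the normalized distribution
    return lg_dist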
Let us first generate a small toy dataset (using the beta-bernoulli model as the generative process). Then we will test our Gibbs sampler on this small dataset.
alpha, beta, gamma = 2., 1., 1.
K = 2
D = 3
N = 4
pis = np.random.dirichlet(alpha/K*np.ones(K))
cis = np.array([discrete_sample(pis) for _ in xrange(N)])
aks = np.random.beta(beta, gamma, size=(K, D))
print 'Pi:', pis
print 'C :', cis
def bernoulli(p):
    return 1 if np.random.random() <= p else 0
Y = np.zeros((N, D), dtype=np.int64)
for i in xrange(N):
    Y[i] = np.array([bernoulli(aks[cis[i], d]) for d in xrange(D)])
Pi: [ 0.17704708  0.82295292]
C : [1 1 0 1]
niters = 50000
chain = gibbs_beta_bernoulli(Y, K, alpha, beta, gamma, niters)
Now the natural question is: how do we verify that the Gibbs sampler is indeed drawing samples from $p(\mathcal{C} | \mathcal{Y})$? This is a bit trickier in the Bayesian case than in the non-Bayesian case, since what we are really after is a full distribution rather than, e.g., a point estimate.
Luckily, for small problems (small $K$ and $N$), we can actually calculate $p(\mathcal{C} | \mathcal{Y}; \mathcal{H})$ exactly by brute force enumeration. This is because $p(\mathcal{C} | \mathcal{Y} ; \mathcal{H})$ is a discrete distribution of size $K^{N}$, and we can analytically calculate the joint distribution $p(\mathcal{C}, \mathcal{Y}; \mathcal{H})$; from the joint, the posterior follows by $p(\mathcal{C} | \mathcal{Y} ) = \frac{p(\mathcal{C}, \mathcal{Y})}{\sum_{c} p(c ,\mathcal{Y})}$, where the summation in the denominator is over all possible $K^N$ assignment vectors (from here on, we drop the hyperparameter dependence to ease notation).
For an arbitrary $\mathcal{C}, \mathcal{Y}$, we have $p(\mathcal{C}, \mathcal{Y}) = p(\mathcal{C}) p(\mathcal{Y} | \mathcal{C})$. From page 2, we have \begin{align*} p(\mathcal{C}) = \frac{\Gamma(\alpha)}{\Gamma(|\mathcal{Y}|+\alpha)} \prod_{k=1}^{K} \frac{\Gamma( |\mathcal{Y}^{k}| + \frac{\alpha}{K})}{ \Gamma(\frac{\alpha}{K})} \end{align*} where $\Gamma(\cdot)$ is the Gamma function. We can derive $p(\mathcal{Y}|\mathcal{C})$ as follows. \begin{align*} p(\mathcal{Y} | \mathcal{C}) &= \prod_{k=1}^{K} p(\mathcal{Y}^{k} | \mathcal{C}^{k}) \\ &= \prod_{k=1}^{K} \int_{\Theta_k} [\prod_{y_i \in \mathcal{Y}^k} p(y_i | \Theta_k) ] p(\Theta_k ; \mathcal{H}) \; d\Theta_k \end{align*} Now for each $k$, we evaluate the inner integral as: \begin{align*} \int_{\theta_1,...,\theta_D} \prod_{d=1}^{D} \left(\prod_{y_i \in \mathcal{Y}^k} \theta_d^{y_i^d} (1-\theta_d)^{1-y_i^d}\right) \frac{1}{B(\beta,\gamma)} \theta_d^{\beta-1}(1-\theta_d)^{\gamma-1} \; d\theta_1...\theta_{D} \end{align*} where $B(\beta,\gamma)$ is the Beta function. Noting that $\int_{0}^{1} x^{(m-1)} (1-x)^{(n-1)} \; dx = \frac{\Gamma(m)\Gamma(n)}{\Gamma(m+n)}$, we can simplify the above integral to: \begin{align*} \frac{1}{B(\beta,\gamma)^{D}} \prod_{d=1}^{D} \frac{\Gamma( \sum_{y_i \in \mathcal{Y}^k} y_i^d + \beta) \Gamma( |\mathcal{Y}^{k}| - \sum_{y_i \in \mathcal{Y}^k} y_i^d + \gamma)}{ \Gamma( |\mathcal{Y}^{k}| + \beta + \gamma )} \end{align*} And therefore, \begin{align*} p(\mathcal{Y} | \mathcal{C}) = \frac{1}{B(\beta,\gamma)^{KD}} \prod_{k=1}^{K} \prod_{d=1}^{D} \frac{\Gamma( \sum_{y_i \in \mathcal{Y}^k} y_i^d + \beta) \Gamma( |\mathcal{Y}^{k}| - \sum_{y_i \in \mathcal{Y}^k} y_i^d + \gamma)}{ \Gamma( |\mathcal{Y}^{k}| + \beta + \gamma )} \end{align*}
Therefore, we can compute the posterior distribution of the data as follows:
def all_assignment_vectors():
    return it.product(range(K), repeat=N)
def lg_pr_joint(C, Y, K, alpha, beta, gamma):
    N, D = Y.shape
    nks = np.bincount(C, minlength=K)
    assert nks.shape[0] == K
    assert C.shape[0] == N
    # log P(C)
    gammaln = sp.special.gammaln
    betaln = sp.special.betaln
    term1 = gammaln(alpha) - gammaln(N + alpha) - K*gammaln(alpha/K)
    term2 = sum(gammaln(nk + alpha/K) for nk in nks)
    lg_pC = term1 + term2
    # log P(Y|C)
    term1 = K*D*betaln(beta, gamma)
    term2 = D*sum(gammaln(nk + beta + gamma) for nk in nks)
    sums = np.zeros((K, D))
    for yi, ci in zip(Y, C):
        sums[ci] += yi
    def fn1(nk, sum_yid):
        assert nk >= sum_yid
        return gammaln(sum_yid + beta) + gammaln(nk - sum_yid + gamma)
    term3 = sum(sum(fn1(nk, yid) for yid in row) for nk, row in zip(nks, sums))
    lg_pYgC = -term1 - term2 + term3
    return lg_pC + lg_pYgC
def brute_force_posterior(Y, K, alpha, beta, gamma):
    N, _ = Y.shape
    # enumerate all K^N cluster assignments and normalize the joint
    lg_pis = np.array([lg_pr_joint(np.array(C), Y, K, alpha, beta, gamma) for C in all_assignment_vectors()])
    lg_pis -= sp.misc.logsumexp(lg_pis)
    return np.exp(lg_pis)
# generate an ID for each K^N element
idmap = { C : i for i, C in enumerate(all_assignment_vectors()) }
revidmap = { i : C for i, C in enumerate(all_assignment_vectors()) }
actual_posterior = brute_force_posterior(Y, K, alpha, beta, gamma)
Let's pause for a second and take a look at the MAP estimator:
print 'P(C=actual|Y):', actual_posterior[idmap[tuple(cis)]]
print 'max_C P(C|Y):', actual_posterior.max()
print 'argmax_C P(C|Y):', revidmap[actual_posterior.argmax()]
P(C=actual|Y): 0.0864616067769
max_C P(C|Y): 0.132805028009
argmax_C P(C|Y): (0, 0, 0, 0)
Notice how the ground truth cluster assignment is not the MAP estimator!
To measure the distance between the actual posterior distribution and the one produced by our Gibbs sampler, we compute the KL divergence between the actual and the empirical distributions. Recall that for discrete distributions the KL divergence (relative entropy) is defined as \begin{align*} D(P||Q) = \sum_{x} P(x) \log\frac{P(x)}{Q(x)} \end{align*}
smoothing = 1e-5
skip = 100
skipped_chain = chain[::skip]
window = 10000
def kl(a, b):
    return np.sum([p*np.log(p/q) for p, q in zip(a, b)])
def histify(history, K):
    _, N = history.shape
    hist = np.zeros(K**N, dtype=np.float)
    for h in history:
        hist[idmap[tuple(h)]] += 1.0
    return hist
def fn(i):
    hist = histify(skipped_chain[max(0, i-window):(i+1)], K) + smoothing
    hist /= hist.sum()
    return kl(actual_posterior, hist)
kls = map(fn, xrange(skipped_chain.shape[0]))
Let's take a peek at the posterior distribution of the last window compared with our brute force posterior distribution:
H = histify(skipped_chain[skipped_chain.shape[0]-1-window:], K) + smoothing
H /= H.sum()
print "C\t\tP_g(C|Y)\tP_a(C|Y)\t|Diff|"
def f(x): return '%.7f' % (x)
for c, (a, b) in zip(all_assignment_vectors(), zip(H, actual_posterior)):
    print "\t".join(map(str, [c, f(a), f(b), f(math.fabs(a-b))]))
C		P_g(C|Y)	P_a(C|Y)	|Diff|
(0, 0, 0, 0)	0.1240000	0.1328050	0.0088050
(0, 0, 0, 1)	0.0900000	0.0864616	0.0035384
(0, 0, 1, 0)	0.0860000	0.0864616	0.0004616
(0, 0, 1, 1)	0.0200000	0.0227718	0.0027718
(0, 1, 0, 0)	0.0280000	0.0288205	0.0008205
(0, 1, 0, 1)	0.1200000	0.0910871	0.0289129
(0, 1, 1, 0)	0.0280000	0.0227718	0.0052282
(0, 1, 1, 1)	0.0340000	0.0288205	0.0051795
(1, 0, 0, 0)	0.0260000	0.0288205	0.0028205
(1, 0, 0, 1)	0.0180000	0.0227718	0.0047718
(1, 0, 1, 0)	0.0620000	0.0910871	0.0290871
(1, 0, 1, 1)	0.0380000	0.0288205	0.0091795
(1, 1, 0, 0)	0.0180000	0.0227718	0.0047718
(1, 1, 0, 1)	0.0980000	0.0864616	0.0115384
(1, 1, 1, 0)	0.0660000	0.0864616	0.0204616
(1, 1, 1, 1)	0.1440000	0.1328050	0.0111949
This looks reasonable. We can also get a sense for the trend as we increase the number of iterations:
plt.plot(np.arange(0, chain.shape[0], skip) + 1, kls)
plt.xlabel('Iterations')
plt.ylabel('Posterior KL-divergence')
Another distribution we can look at is the posterior predictive distribution $p(y|\mathcal{C},\mathcal{Y};\mathcal{H})$, that is, the distribution over the next data point. To derive this, we first note that $p(y|\mathcal{C},\mathcal{Y}) = \sum_{k=1}^{K} p(y,c{=}k|\mathcal{C},\mathcal{Y}) = \sum_{k=1}^{K} p(c{=}k|\mathcal{C}) p(y|\mathcal{Y}_{k})$.
From Eq. 9, pg. 3, we have $p(c{=}k|\mathcal{C}) = \frac{|\mathcal{C}_{k}| + \frac{\alpha}{K}}{|\mathcal{C}| + \alpha}$. Next, $p(y|\mathcal{Y}_{k})$ is simply the posterior predictive of the beta-bernoulli likelihood model. That is, $p(y_d{=}1|\mathcal{Y}_{k})= \frac{\beta + \sum_{y_i \in \mathcal{Y}_{k}} y_i^d}{ \beta + \gamma + |\mathcal{Y}_{k}| } $. Putting it together, we have \begin{align*} p(y|\mathcal{C},\mathcal{Y}) = \sum_{k=1}^{K} \frac{|\mathcal{C}_{k}| + \frac{\alpha}{K}}{|\mathcal{C}| + \alpha} \prod_{d=1}^{D} (\hat{p}_d^k)^{y_d} (1-\hat{p}_d^k)^{1-y_d} \end{align*} where $\hat{p}_d^k = \frac{\beta + \sum_{y_i \in \mathcal{Y}_{k}} y_i^d}{ \beta + \gamma + |\mathcal{Y}_{k}| } $.
We can also look at the non-Bayesian case, where we know the exact values for $\pi$ and $\{ p_d^k \}$. \begin{align*} p(y | \pi, \{p_d^k\}) = \sum_{k=1}^{K} \pi_k \prod_{d=1}^{D} (p_d^k)^{y_d} (1-p_d^k)^{1-y_d} \end{align*}
This makes explicit how our prior information affects our estimates $\hat{\pi}$ and $\hat{p}_d^k$. Now for the reference distributions, we know both the ground truth cluster assignment and the exact model parameters, so we can compute $ p(y|\mathcal{C},\mathcal{Y}) $ and $ p(y | \pi, \{p_d^k\})$ exactly. For our Gibbs sampler, we estimate $p(y|\mathcal{C},\mathcal{Y})$ by averaging over our samples of $\mathcal{C}$. Details below:
def posterior_predictive(C, Y, K, alpha, beta, gamma):
    N, D = Y.shape
    assert C.shape[0] == N
    nks = np.bincount(C, minlength=K)
    sums = np.zeros((K, D), dtype=np.int64)
    for yi, ci in zip(Y, C):
        sums[ci] += yi
    def fn(yvalue):
        def fn1(nk, sum_yid, yd):
            assert nk >= sum_yid
            theta = (beta + sum_yid) / (beta + gamma + nk)
            assert theta >= 0.0 and theta <= 1.0
            return np.log(theta) if yd else np.log(1.-theta)
        def fn2(nk, row):
            assert len(yvalue) == row.shape[0]
            term1 = np.log(nk + alpha/K) - np.log(N + alpha)
            term2 = sum(fn1(nk, sum_yid, yd) for sum_yid, yd in zip(row, yvalue))
            return term1 + term2
        return sp.misc.logsumexp([fn2(nk, row) for nk, row in zip(nks, sums)])
    yvalues = it.product([0, 1], repeat=D)
    lg_pr_yvalue = map(fn, yvalues)
    return np.exp(lg_pr_yvalue)
def posterior_predictive_nonbayes(pis, aks):
    K, = pis.shape
    assert aks.shape[0] == K
    _, D = aks.shape
    def fn(yvalue):
        def fn2(yd, theta_kd):
            return np.log(theta_kd) if yd else np.log(1.-theta_kd)
        def fn1(pi_k, theta_k):
            return np.log(pi_k) + sum(fn2(yd, theta_kd) for yd, theta_kd in zip(yvalue, theta_k))
        return sp.misc.logsumexp([fn1(pi_k, theta_k) for pi_k, theta_k in zip(pis, aks)])
    yvalues = it.product([0, 1], repeat=D)
    lg_pr_yvalue = map(fn, yvalues)
    return np.exp(lg_pr_yvalue)
# reference distributions
actual_posterior_predictive = posterior_predictive(cis, Y, K, alpha, beta, gamma)
nonbayes_posterior_predictive = posterior_predictive_nonbayes(pis, aks)
posterior_predictives = np.array([
posterior_predictive(assignment, Y, K, alpha, beta, gamma) for assignment in skipped_chain])
def fn1(i):
    # note: max (not min), so we only average over the trailing window of samples
    posteriors = posterior_predictives[max(0, i-window):(i+1)].mean(axis=0)
    return kl(actual_posterior_predictive, posteriors)
posterior_predictive_kls = map(fn1, xrange(posterior_predictives.shape[0]))
Let's take a look at the posterior predictive distributions of the last window compared to the ground truth distributions:
PH = posterior_predictives[posterior_predictives.shape[0]-1-window:].mean(axis=0)
print "y\t\tP(y|C) [gibbs]\tP(y|C) [actual]\tP(y|params)\t|gibbs-actual|"
for y, ((a, b), c) in zip(it.product([0,1],repeat=D), zip(zip(PH, actual_posterior_predictive), nonbayes_posterior_predictive)):
    print "\t".join(map(str, [y, a, b, c, math.fabs(a-b)]))
y		P(y|C) [gibbs]	P(y|C) [actual]	P(y|params)	|gibbs-actual|
(0, 0, 0)	0.129465256173	0.130765432099	0.145796887467	0.00130017592593
(0, 0, 1)	0.0661453919753	0.0707160493827	0.0602427431159	0.00457065740741
(0, 1, 0)	0.107173910494	0.0973827160494	0.271569844867	0.00979119444444
(0, 1, 1)	0.0656987746914	0.0566913580247	0.135681228984	0.00900741666667
(1, 0, 0)	0.196573169753	0.177382716049	0.0713598709559	0.0191904537037
(1, 0, 1)	0.107571737654	0.110024691358	0.0312461999697	0.0024529537037
(1, 1, 0)	0.197682108025	0.216691358025	0.189236544177	0.01900925
(1, 1, 1)	0.129689651235	0.140345679012	0.0948666804639	0.0106560277778
Notice how our Gibbs estimate of the distribution does not quite match the non-Bayesian one, which makes sense: since $N$ is small here, the prior is still weighing in non-negligibly. Finally, let's take a look at the KL divergence (w.r.t. the Bayesian distribution computed with the ground truth clusters) as the number of iterations increases:
plt.plot(np.arange(0, niters, skip) + 1, posterior_predictive_kls)
plt.xlabel('Iterations')
plt.ylabel('Posterior predictive KL-divergence')
Note that this distribution seems to converge more slowly than the posterior in the previous section.
In this section, we use our Gibbs sampler to learn clusters of handwritten digits from the USPS digit dataset. We'll focus on just the digit 2 for this example.
usps_digits = sp.io.loadmat('data/usps_resampled/usps_resampled.mat')
usps_class_2 = usps_digits['train_labels'][2,:] == 1
usps_Y0 = usps_digits['train_patterns'][:,usps_class_2]
usps_Y = np.zeros(usps_Y0.shape)
usps_Y[usps_Y0>0.1] = 1.0 # make the image binary
plt.imshow(usps_Y[:,3].reshape((16,16)), cmap=plt.cm.binary)
usps_Y = np.array(usps_Y.T, dtype=np.int64) # transpose to our (N, D) layout
We are ready to run our Gibbs sampler on this dataset. We use the hyperparameters used by Maaten.
# USPS priors
usps_alpha, usps_beta, usps_gamma = 50., .5, .5
usps_K = 40
usps_N, usps_D = usps_Y.shape
usps_niters = 1000
print usps_N, usps_D
475 256
usps_chain = gibbs_beta_bernoulli(usps_Y, usps_K, usps_alpha, usps_beta, usps_gamma, usps_niters)
Let's plot the posterior predictive of each cluster separately, to see what each cluster looks like.
def posterior_thetas(C, Y, K, beta, gamma):
    N, D = Y.shape
    assert C.shape[0] == N
    thetas = np.zeros((K, D))
    for k in xrange(K):
        # posterior mean of p_d^k: (beta + cluster sum) / (beta + gamma + cluster size)
        nk = (C == k).sum()
        thetas[k] = (beta + Y[C == k].sum(axis=0)) / (beta + gamma + nk)
    return thetas
usps_chain = usps_chain[::skip]
pts = np.array([posterior_thetas(c, usps_Y, usps_K, usps_beta, usps_gamma) for c in usps_chain])
pts = pts.mean(axis=0)
digits_per_row = 5
assert not (pts.shape[0] % digits_per_row)
img = np.vstack([
    np.hstack([p.reshape((16, 16)) for p in pts[i:(i+digits_per_row)]])
    for i in xrange(0, pts.shape[0], digits_per_row)])
plt.rcParams['figure.figsize'] = (10.0, 8.0)
plt.imshow(img, cmap=plt.cm.binary)
We'll notice some of the clusters are empty. This is because our sampler assigns zero mass to some of the clusters; we can see this by plotting a histogram of, e.g., the last assignment vector generated by our Gibbs sampler.
plt.hist(usps_chain[-1], bins=40)
(array([ 3., 14., 0., 4., 33., 0., 11., 6., 6., 23., 14., 0., 22., 13., 2., 0., 6., 2., 4., 19., 0., 4., 2., 8., 14., 0., 29., 64., 0., 4., 8., 11., 23., 8., 16., 30., 11., 10., 4., 47.]), array([ 0. , 0.975, 1.95 , 2.925, 3.9 , 4.875, 5.85 , 6.825, 7.8 , 8.775, 9.75 , 10.725, 11.7 , 12.675, 13.65 , 14.625, 15.6 , 16.575, 17.55 , 18.525, 19.5 , 20.475, 21.45 , 22.425, 23.4 , 24.375, 25.35 , 26.325, 27.3 , 28.275, 29.25 , 30.225, 31.2 , 32.175, 33.15 , 34.125, 35.1 , 36.075, 37.05 , 38.025, 39. ]), <a list of 40 Patch objects>)
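Equivalently, a quick way to read off the occupancy counts (and spot the empty clusters) without a plot:
# cluster occupancy counts for the final sample; zero entries are empty clusters
print np.bincount(usps_chain[-1], minlength=usps_K)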
This is a great segue into infinite-dimensional models!