Given fixed $K$ and $D$, the beta-Bernoulli mixture model is a generative model for $D$-dimensional binary vectors ($y_i \in \{0,1\}^{D}$) drawn from a $K$-component mixture.
We can describe the probabilistic model as:
\begin{align*} \pi|\alpha &\sim \text{Dirichlet}(\{ \frac{\alpha}{K} \}_{k=1}^{K}) \\ c_i|\pi &\sim \text{Discrete}(\pi) \\ p_d^{k} | \beta,\gamma &\sim \text{Beta}(\beta, \gamma) \\ y_i^{d} | c_i{=}k,p_d^{k} &\sim \text{Bernoulli}(p_d^{k}) \end{align*} Let us clear up some of this notation. $\pi$ is a $K$-dimensional vector living in the $(K-1)$-dimensional probability simplex. For each component $k \in [K]$, $\{p_d^k\}_{d=1}^{D} \in [0,1]^{D}$ is the vector of per-dimension Bernoulli probabilities. Given $N$ data points, the assignment vector is $\{c_i\}_{i=1}^{N} \in \{0, \dots, K-1\}^{N}$ (zero-indexed, to match the code below). The hyperparameters of this model are $\mathcal{H} = (\alpha, \beta, \gamma)$.
As we will see later on, this model has convenient analytical properties for Gibbs sampling, since the beta distribution is the conjugate prior for the Bernoulli likelihood.
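Concretely, conjugacy means that, conditioned on the cluster assignments, the posterior over each $p_d^k$ is again a beta distribution with count-based updates: \begin{align*} p_d^{k} \mid \{y_i : c_i = k\}, \beta, \gamma \;\sim\; \text{Beta}\Big(\beta + \sum_{i : c_i = k} y_i^{d},\;\; \gamma + |\{i : c_i = k\}| - \sum_{i : c_i = k} y_i^{d}\Big) \end{align*} These same counts will reappear inside the Gibbs conditional below.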
The inference problem we will consider is the following. Given a dataset $\mathcal{Y} = \{y_i\}_{i=1}^{N}$, we want to learn the posterior distribution on the assignment vector $\mathcal{C} = \{c_i\}_{i=1}^{N}$. That is, we want to be able to estimate and draw samples from $p(\mathcal{C} | \mathcal{Y}; \mathcal{H})$.
This problem can be solved efficiently with Gibbs sampling, a Markov chain Monte Carlo (MCMC) method. Define $\mathcal{C}_{\neg i} = \{ c_j \in \mathcal{C} : j \neq i \}$. The Gibbs sampler assumes we can efficiently sample from the full conditional $p(c_i | \mathcal{C}_{\neg i}, \mathcal{Y}; \mathcal{H})$. Each iteration then sweeps over $i \in [N]$, drawing \begin{align*} c^{(t)}_{i} \sim p(c_i \mid \{ c_j \in \mathcal{C}^{(t)} : j < i \}, \{ c_j \in \mathcal{C}^{(t-1)} : j > i \}, \mathcal{Y}; \mathcal{H}) \end{align*}
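Schematically, a single sweep of this systematic-scan Gibbs sampler looks like the following sketch; here full_conditional_sample is a hypothetical placeholder for the model-specific conditional we derive next.
def gibbs_sweep(assignments, full_conditional_sample):
    # one systematic-scan sweep: resample each c_i in turn, conditioned on the
    # most recent values of all other assignments.
    # full_conditional_sample(i, assignments) is a placeholder that draws c_i from
    # p(c_i | C_{-i}, Y; H); the concrete beta-Bernoulli version is implemented below.
    for i in xrange(len(assignments)):
        assignments[i] = full_conditional_sample(i, assignments)
    return assignments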
Define $\mathcal{Y}^{k}_{\neg i} = \{ y_j \in \mathcal{Y} : c_j = k, j \neq i \}$. It turns out that, for the beta-Bernoulli model, we can derive \begin{align*} p(c_i{=}k|\mathcal{C}_{\neg i}, \mathcal{Y};\mathcal{H}) \propto \frac{ |\mathcal{Y}^{k}_{\neg i}| + \frac{\alpha}{K} }{ N - 1 + \alpha } \prod_{d=1}^{D} \frac{(\beta+\sum_{y_j\in \mathcal{Y}^{k}_{\neg i}} y_j^{d})^{y_i^d} ( \gamma + |\mathcal{Y}^{k}_{\neg i}| - \sum_{y_j\in \mathcal{Y}^{k}_{\neg i}} y_j^{d})^{(1-y_i^{d})} }{ \beta + \gamma + |\mathcal{Y}^{k}_{\neg i}| } \end{align*} We can then construct $p(c_i{=}k|\mathcal{C}_{\neg i}, \mathcal{Y};\mathcal{H})$ exactly by enumerating all $K$ clusters and normalizing.
We'll first implement a sampler from a discrete distribution. While scipy.stats.rv_discrete (docs) implements this functionality, it is quite heavyweight to call inside a tight loop, so we roll our own.
# workspace setup
import numpy as np
import scipy as sp
import scipy.io
import scipy.misc
import scipy.special
%matplotlib inline
import matplotlib.pylab as plt
import itertools as it
import math
def discrete_sample(pmf):
    # inverse-CDF sampling: walk the cumulative sum until it exceeds a uniform draw
    coin = np.random.random()
    acc = 0.0
    n, = pmf.shape
    for i in xrange(n):
        acc0 = acc + pmf[i]
        if coin <= acc0:
            return i
        acc = acc0
    return n-1
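As a quick sanity check (purely illustrative), the empirical frequencies of many draws should roughly match the pmf we sample from:
# illustrative sanity check of discrete_sample: empirical frequencies vs. the pmf
pmf = np.array([0.2, 0.5, 0.3])
draws = np.array([discrete_sample(pmf) for _ in xrange(10000)])
print 'empirical:', np.bincount(draws, minlength=pmf.shape[0]) / float(draws.shape[0])
print 'pmf      :', pmf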
Ok enough playing around! Now the actual Gibbs sampler implementation. Note we make use of numpy broadcasting for performance:
def gibbs_beta_bernoulli(Y, K, alpha, beta, gamma, niters):
    N, D = Y.shape
    alpha, beta, gamma = map(float, [alpha, beta, gamma])
    # start with random assignment
    assignments = np.random.randint(0, K, size=N)
    # initialize the sufficient statistics (cluster sums and counts) accordingly
    sums = np.zeros((D, K), dtype=np.int64)
    cnts = np.zeros(K, dtype=np.int64)
    for yi, ci in zip(Y, assignments):
        sums[:,ci] += yi
        cnts[ci] += 1
    history = np.zeros((niters, N), dtype=np.int64)
    # precomputations
    nplog = np.log
    npexp = np.exp
    nparray = np.array
    logsumexp = sp.misc.logsumexp
    lg_denom = nplog(N - 1 + alpha)
    alpha_over_K = alpha/K
    beta_plus_gamma = beta + gamma
    for t in xrange(niters):
        for i, (yi, ci) in enumerate(zip(Y, assignments)):
            # remove y_i from the sufficient statistics
            sums[:,ci] -= yi
            cnts[ci] -= 1
            lg_term1 = nplog(cnts + alpha_over_K) - lg_denom - D*nplog(beta_plus_gamma + cnts)
            lg_term2 = nplog(beta + sums)
            lg_term3 = nplog(gamma + cnts - sums)
            lg_dist = lg_term1 + (lg_term2*yi[:,np.newaxis] + lg_term3*(1-yi[:,np.newaxis])).sum(axis=0)
            lg_dist -= logsumexp(lg_dist) # normalize
            # reassign
            ci = discrete_sample(npexp(lg_dist))
            assignments[i] = ci
            sums[:,ci] += yi
            cnts[ci] += 1
        history[t] = assignments
    return history
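For readability, here is an unvectorized sketch of the per-point conditional that the broadcasted lines above compute. It is purely illustrative (it is not used anywhere else) and assumes, as in the sampler, that $y_i$ has already been removed from sums and cnts:
# illustrative, loop-based version of the unnormalized log conditional
# log p(c_i = k | C_{-i}, Y; H); assumes yi was already removed from sums/cnts.
def lg_conditional_naive(yi, sums, cnts, N, K, D, alpha, beta, gamma):
    lg_dist = np.zeros(K)
    for k in xrange(K):
        lg_dist[k] = np.log(cnts[k] + alpha/K) - np.log(N - 1 + alpha)
        for d in xrange(D):
            if yi[d]:
                lg_dist[k] += np.log(beta + sums[d, k])
            else:
                lg_dist[k] += np.log(gamma + cnts[k] - sums[d, k])
            lg_dist[k] -= np.log(beta + gamma + cnts[k])
    # subtract logsumexp(lg_dist) and exponentiate to get the normalized distribution
    return lg_dist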
Let us first generate a small toy dataset (using the beta-bernoulli model as the generative process). Then we will test our Gibbs sampler on this small dataset.
alpha, beta, gamma = 2., 1., 1.
K = 2
D = 3
N = 4
pis = np.random.dirichlet(alpha/K*np.ones(K))
cis = np.array([discrete_sample(pis) for _ in xrange(N)])
aks = np.random.beta(beta, gamma, size=(K, D))
print 'Pi:', pis
print 'C :', cis
def bernoulli(p):
    return 1 if np.random.random() <= p else 0
Y = np.zeros((N, D), dtype=np.int64)
for i in xrange(N):
    Y[i] = np.array([bernoulli(aks[cis[i], d]) for d in xrange(D)])
Pi: [ 0.17704708  0.82295292]
C : [1 1 0 1]
niters = 50000
chain = gibbs_beta_bernoulli(Y, K, alpha, beta, gamma, niters)
Now the natural question is: how do we verify that the Gibbs sampler is indeed drawing samples from $p(\mathcal{C} | \mathcal{Y})$? This is a bit trickier in the Bayesian case than in the non-Bayesian case, since what we are really after is a full distribution rather than, e.g., a point estimate.
Luckily, for small problems (small $K$ and $N$), we can actually calculate $p(\mathcal{C} | \mathcal{Y}; \mathcal{H})$ exactly by brute force enumeration. This is because $p(\mathcal{C} | \mathcal{Y} ; \mathcal{H})$ is a discrete distribution of size $K^{N}$, and we can analytically calculate the joint distribution $p(\mathcal{C}, \mathcal{Y}; \mathcal{H})$; from the joint, the posterior follows by $p(\mathcal{C} | \mathcal{Y} ) = \frac{p(\mathcal{C}, \mathcal{Y})}{\sum_{c} p(c ,\mathcal{Y})}$, where the summation in the denominator is over all possible $K^N$ assignment vectors (from here on, we drop the hyperparameter dependence to ease notation).
For an arbitrary $\mathcal{C}, \mathcal{Y}$, we have $p(\mathcal{C}, \mathcal{Y}) = p(\mathcal{C}) p(\mathcal{Y} | \mathcal{C})$. From page 2, we have \begin{align*} p(\mathcal{C}) = \frac{\Gamma(\alpha)}{\Gamma(|\mathcal{Y}|+\alpha)} \prod_{k=1}^{K} \frac{\Gamma( |\mathcal{Y}^{k}| + \frac{\alpha}{K})}{ \Gamma(\frac{\alpha}{K})} \end{align*} where $\Gamma(\cdot)$ is the Gamma function. We can derive $p(\mathcal{Y}|\mathcal{C})$ as follows. \begin{align*} p(\mathcal{Y} | \mathcal{C}) &= \prod_{k=1}^{K} p(\mathcal{Y}^{k} | \mathcal{C}^{k}) \\ &= \prod_{k=1}^{K} \int_{\Theta_k} [\prod_{y_i \in \mathcal{Y}^k} p(y_i | \Theta_k) ] p(\Theta_k ; \mathcal{H}) \; d\Theta_k \end{align*} Now for each $k$, we evaluate the inner integral as: \begin{align*} \int_{\theta_1,...,\theta_D} \prod_{d=1}^{D} \left(\prod_{y_i \in \mathcal{Y}^k} \theta_d^{y_i^d} (1-\theta_d)^{1-y_i^d}\right) \frac{1}{B(\beta,\gamma)} \theta_d^{\beta-1}(1-\theta_d)^{\gamma-1} \; d\theta_1...\theta_{D} \end{align*} where $B(\beta,\gamma)$ is the Beta function. Noting that $\int_{0}^{1} x^{(m-1)} (1-x)^{(n-1)} \; dx = \frac{\Gamma(m)\Gamma(n)}{\Gamma(m+n)}$, we can simplify the above integral to: \begin{align*} \frac{1}{B(\beta,\gamma)^{D}} \prod_{d=1}^{D} \frac{\Gamma( \sum_{y_i \in \mathcal{Y}^k} y_i^d + \beta) \Gamma( |\mathcal{Y}^{k}| - \sum_{y_i \in \mathcal{Y}^k} y_i^d + \gamma)}{ \Gamma( |\mathcal{Y}^{k}| + \beta + \gamma )} \end{align*} And therefore, \begin{align*} p(\mathcal{Y} | \mathcal{C}) = \frac{1}{B(\beta,\gamma)^{KD}} \prod_{k=1}^{K} \prod_{d=1}^{D} \frac{\Gamma( \sum_{y_i \in \mathcal{Y}^k} y_i^d + \beta) \Gamma( |\mathcal{Y}^{k}| - \sum_{y_i \in \mathcal{Y}^k} y_i^d + \gamma)}{ \Gamma( |\mathcal{Y}^{k}| + \beta + \gamma )} \end{align*}
Therefore, we can compute the posterior distribution of the data as follows:
def all_assignment_vectors():
    return it.product(range(K), repeat=N)
def lg_pr_joint(C, Y, K, alpha, beta, gamma):
    N, D = Y.shape
    nks = np.bincount(C, minlength=K)
    assert nks.shape[0] == K
    assert C.shape[0] == N
    # log P(C)
    gammaln = sp.special.gammaln
    betaln = sp.special.betaln
    term1 = gammaln(alpha) - gammaln(N + alpha) - K*gammaln(alpha/K)
    term2 = sum(gammaln(nk + alpha/K) for nk in nks)
    lg_pC = term1 + term2
    # log P(Y|C)
    term1 = K*D*betaln(beta, gamma)
    term2 = D*sum(gammaln(nk + beta + gamma) for nk in nks)
    sums = np.zeros((K, D))
    for yi, ci in zip(Y, C):
        sums[ci] += yi
    def fn1(nk, sum_yid):
        assert nk >= sum_yid
        return gammaln(sum_yid + beta) + gammaln(nk - sum_yid + gamma)
    term3 = sum(sum(fn1(nk, yid) for yid in row) for nk, row in zip(nks, sums))
    lg_pYgC = -term1 - term2 + term3
    return lg_pC + lg_pYgC
def brute_force_posterior(Y, K, alpha, beta, gamma):
    N, _ = Y.shape
    # enumerate all K^N cluster assignments and normalize the joint
    lg_pis = np.array([lg_pr_joint(np.array(C), Y, K, alpha, beta, gamma) for C in all_assignment_vectors()])
    lg_pis -= sp.misc.logsumexp(lg_pis)
    return np.exp(lg_pis)
# generate an ID for each K^N element
idmap = { C : i for i, C in enumerate(all_assignment_vectors()) }
revidmap = { i : C for i, C in enumerate(all_assignment_vectors()) }
actual_posterior = brute_force_posterior(Y, K, alpha, beta, gamma)
Let's pause for a second and take a look at the MAP estimator:
print 'P(C=actual|Y):', actual_posterior[idmap[tuple(cis)]]
print 'max_C P(C|Y):', actual_posterior.max()
print 'argmax_C P(C|Y):', revidmap[actual_posterior.argmax()]
P(C=actual|Y): 0.0864616067769
max_C P(C|Y): 0.132805028009
argmax_C P(C|Y): (0, 0, 0, 0)
Notice how the ground truth cluster assignment is not the MAP estimator!
To measure the distance between the actual posterior distribution and the one produced by our Gibbs sampler, we compute the KL divergence between the actual and the empirical distributions. Recall that for discrete distributions the KL divergence (relative entropy) is defined as \begin{align*} D(P||Q) = \sum_{x} P(x) \log\frac{P(x)}{Q(x)} \end{align*}
smoothing = 1e-5
skip = 100
skipped_chain = chain[::skip]
window = 10000
def kl(a, b):
    return np.sum([p*np.log(p/q) for p, q in zip(a, b)])
def histify(history, K):
    _, N = history.shape
    hist = np.zeros(K**N, dtype=np.float)
    for h in history:
        hist[idmap[tuple(h)]] += 1.0
    return hist
def fn(i):
    hist = histify(skipped_chain[max(0, i-window):(i+1)], K) + smoothing
    hist /= hist.sum()
    return kl(actual_posterior, hist)
kls = map(fn, xrange(skipped_chain.shape[0]))
Let's take a peek at the posterior distribution of the last window compared with our brute force posterior distribution:
H = histify(skipped_chain[skipped_chain.shape[0]-1-window:], K) + smoothing
H /= H.sum()
print "C\t\tP_g(C|Y)\tP_a(C|Y)\t|Diff|"
def f(x): return '%.7f' % (x)
for c, (a, b) in zip(all_assignment_vectors(), zip(H, actual_posterior)):
    print "\t".join(map(str, [c, f(a), f(b), f(math.fabs(a-b))]))
C		P_g(C|Y)	P_a(C|Y)	|Diff|
(0, 0, 0, 0)	0.1240000	0.1328050	0.0088050
(0, 0, 0, 1)	0.0900000	0.0864616	0.0035384
(0, 0, 1, 0)	0.0860000	0.0864616	0.0004616
(0, 0, 1, 1)	0.0200000	0.0227718	0.0027718
(0, 1, 0, 0)	0.0280000	0.0288205	0.0008205
(0, 1, 0, 1)	0.1200000	0.0910871	0.0289129
(0, 1, 1, 0)	0.0280000	0.0227718	0.0052282
(0, 1, 1, 1)	0.0340000	0.0288205	0.0051795
(1, 0, 0, 0)	0.0260000	0.0288205	0.0028205
(1, 0, 0, 1)	0.0180000	0.0227718	0.0047718
(1, 0, 1, 0)	0.0620000	0.0910871	0.0290871
(1, 0, 1, 1)	0.0380000	0.0288205	0.0091795
(1, 1, 0, 0)	0.0180000	0.0227718	0.0047718
(1, 1, 0, 1)	0.0980000	0.0864616	0.0115384
(1, 1, 1, 0)	0.0660000	0.0864616	0.0204616
(1, 1, 1, 1)	0.1440000	0.1328050	0.0111949
This looks reasonable. We can also get a sense for the trend as we increase the number of iterations:
plt.plot(np.arange(0, chain.shape[0], skip) + 1, kls)
plt.xlabel('Iterations')
plt.ylabel('Posterior KL-divergence')
Another distribution we can look at is the posterior predictive distribution $p(y|\mathcal{C},\mathcal{Y};\mathcal{H})$, that is, the distribution over the next data point. To derive this, we first note that $p(y|\mathcal{C},\mathcal{Y}) = \sum_{k=1}^{K} p(y,c{=}k|\mathcal{C},\mathcal{Y}) = \sum_{k=1}^{K} p(c{=}k|\mathcal{C}) p(y|\mathcal{Y}_{k})$.
From Eq. 9, pg. 3, we have $p(c{=}k|\mathcal{C}) = \frac{|\mathcal{C}_{k}| + \frac{\alpha}{K}}{|\mathcal{C}| + \alpha}$. Next, $p(y|\mathcal{Y}_{k})$ is simply the posterior predictive of the beta-bernoulli likelihood model. That is, $p(y_d{=}1|\mathcal{Y}_{k})= \frac{\beta + \sum_{y_i \in \mathcal{Y}_{k}} y_i^d}{ \beta + \gamma + |\mathcal{Y}_{k}| } $. Putting it together, we have \begin{align*} p(y|\mathcal{C},\mathcal{Y}) = \sum_{k=1}^{K} \frac{|\mathcal{C}_{k}| + \frac{\alpha}{K}}{|\mathcal{C}| + \alpha} \prod_{d=1}^{D} (\hat{p}_d^k)^{y_d} (1-\hat{p}_d^k)^{1-y_d} \end{align*} where $\hat{p}_d^k = \frac{\beta + \sum_{y_i \in \mathcal{Y}_{k}} y_i^d}{ \beta + \gamma + |\mathcal{Y}_{k}| } $.
We can also look at the non-Bayesian case, where we know the exact values for $\pi$ and $\{ p_d^k \}$. \begin{align*} p(y | \pi, \{p_d^k\}) = \sum_{k=1}^{K} \pi_k \prod_{d=1}^{D} (p_d^k)^{y_d} (1-p_d^k)^{1-y_d} \end{align*}
This makes explicit how our prior information affects our estimates $\hat{\pi}$ and $\hat{p}_d^k$. Now for the reference distributions, we know both the ground truth cluster assignment and the exact model parameters, so we can compute $ p(y|\mathcal{C},\mathcal{Y}) $ and $ p(y | \pi, \{p_d^k\})$ exactly. For our Gibbs sampler, we estimate $p(y|\mathcal{C},\mathcal{Y})$ by averaging over our samples of $\mathcal{C}$. Details below:
def posterior_predictive(C, Y, K, alpha, beta, gamma):
    N, D = Y.shape
    assert C.shape[0] == N
    nks = np.bincount(C, minlength=K)
    sums = np.zeros((K, D), dtype=np.int64)
    for yi, ci in zip(Y, C):
        sums[ci] += yi
    def fn(yvalue):
        def fn1(nk, sum_yid, yd):
            assert nk >= sum_yid
            theta = (beta + sum_yid) / (beta + gamma + nk)
            assert theta >= 0.0 and theta <= 1.0
            return np.log(theta) if yd else np.log(1.-theta)
        def fn2(nk, row):
            assert len(yvalue) == row.shape[0]
            term1 = np.log(nk + alpha/K) - np.log(N + alpha)
            term2 = sum(fn1(nk, sum_yid, yd) for sum_yid, yd in zip(row, yvalue))
            return term1 + term2
        return sp.misc.logsumexp([fn2(nk, row) for nk, row in zip(nks, sums)])
    yvalues = it.product([0, 1], repeat=D)
    lg_pr_yvalue = map(fn, yvalues)
    return np.exp(lg_pr_yvalue)
def posterior_predictive_nonbayes(pis, aks):
    K, = pis.shape
    assert aks.shape[0] == K
    _, D = aks.shape
    def fn(yvalue):
        def fn2(yd, theta_kd):
            return np.log(theta_kd) if yd else np.log(1.-theta_kd)
        def fn1(pi_k, theta_k):
            return np.log(pi_k) + sum(fn2(yd, theta_kd) for yd, theta_kd in zip(yvalue, theta_k))
        return sp.misc.logsumexp([fn1(pi_k, theta_k) for pi_k, theta_k in zip(pis, aks)])
    yvalues = it.product([0, 1], repeat=D)
    lg_pr_yvalue = map(fn, yvalues)
    return np.exp(lg_pr_yvalue)
# reference distributions
actual_posterior_predictive = posterior_predictive(cis, Y, K, alpha, beta, gamma)
nonbayes_posterior_predictive = posterior_predictive_nonbayes(pis, aks)
posterior_predictives = np.array([
posterior_predictive(assignment, Y, K, alpha, beta, gamma) for assignment in skipped_chain])
def fn1(i):
    # note: max (not min), so we only average over the trailing window of samples
    posteriors = posterior_predictives[max(0, i-window):(i+1)].mean(axis=0)
    return kl(actual_posterior_predictive, posteriors)
posterior_predictive_kls = map(fn1, xrange(posterior_predictives.shape[0]))
Let's take a look at the posterior predictive distributions of the last window compared to the ground truth distributions:
PH = posterior_predictives[posterior_predictives.shape[0]-1-window:].mean(axis=0)
print "y\t\tP(y|C) [gibbs]\tP(y|C) [actual]\tP(y|params)\t|gibbs-actual|"
for y, ((a, b), c) in zip(it.product([0,1],repeat=D), zip(zip(PH, actual_posterior_predictive), nonbayes_posterior_predictive)):
    print "\t".join(map(str, [y, a, b, c, math.fabs(a-b)]))
y		P(y|C) [gibbs]	P(y|C) [actual]	P(y|params)	|gibbs-actual|
(0, 0, 0)	0.129465256173	0.130765432099	0.145796887467	0.00130017592593
(0, 0, 1)	0.0661453919753	0.0707160493827	0.0602427431159	0.00457065740741
(0, 1, 0)	0.107173910494	0.0973827160494	0.271569844867	0.00979119444444
(0, 1, 1)	0.0656987746914	0.0566913580247	0.135681228984	0.00900741666667
(1, 0, 0)	0.196573169753	0.177382716049	0.0713598709559	0.0191904537037
(1, 0, 1)	0.107571737654	0.110024691358	0.0312461999697	0.0024529537037
(1, 1, 0)	0.197682108025	0.216691358025	0.189236544177	0.01900925
(1, 1, 1)	0.129689651235	0.140345679012	0.0948666804639	0.0106560277778
Notice how our Gibbs estimate of the distribution does not quite match the non-Bayesian one, which makes sense: since $N$ is small here, the prior is still weighing in non-negligibly. Finally, let's take a look at the KL divergence (w.r.t. the Bayesian distribution computed with the ground truth clusters) as the number of iterations increases:
plt.plot(np.arange(0, niters, skip) + 1, posterior_predictive_kls)
plt.xlabel('Iterations')
plt.ylabel('Posterior predictive KL-divergence')
Note that this distribution seems to converge more slowly than the posterior in the previous section.
In this section, we use our Gibbs sampler to learn clusters of handwritten digits from the USPS digit dataset. We'll focus on just the digit 2 for this example.
usps_digits = sp.io.loadmat('data/usps_resampled/usps_resampled.mat')
usps_class_2 = usps_digits['train_labels'][2,:] == 1
usps_Y0 = usps_digits['train_patterns'][:,usps_class_2]
usps_Y = np.zeros(usps_Y0.shape)
usps_Y[usps_Y0>0.1] = 1.0 # make the image binary
plt.imshow(usps_Y[:,3].reshape((16,16)), cmap=plt.cm.binary)
usps_Y = np.array(usps_Y.T, dtype=np.int64) # transpose to our (N, D) layout
We are ready to run our Gibbs sampler on this dataset. We use the hyperparameters used by Maaten.
# USPS priors
usps_alpha, usps_beta, usps_gamma = 50., .5, .5
usps_K = 40
usps_N, usps_D = usps_Y.shape
usps_niters = 1000
print usps_N, usps_D
475 256
usps_chain = gibbs_beta_bernoulli(usps_Y, usps_K, usps_alpha, usps_beta, usps_gamma, usps_niters)
Let's plot the posterior predictive of each cluster separately, to see what each cluster looks like.
def posterior_thetas(C, Y, K, beta, gamma):
    N, D = Y.shape
    assert C.shape[0] == N
    thetas = np.zeros((K, D))
    for k in xrange(K):
        # posterior mean of p_d^k: (beta + cluster sum) / (beta + gamma + cluster size)
        nk = (C == k).sum()
        thetas[k] = (beta + Y[C == k].sum(axis=0)) / (beta + gamma + nk)
    return thetas
usps_chain = usps_chain[::skip]
pts = np.array([posterior_thetas(c, usps_Y, usps_K, usps_beta, usps_gamma) for c in usps_chain])
pts = pts.mean(axis=0)
digits_per_row = 5
assert not (pts.shape[0] % digits_per_row)
img = np.vstack([
    np.hstack([p.reshape((16, 16)) for p in pts[i:(i+digits_per_row)]])
    for i in xrange(0, pts.shape[0], digits_per_row)])
plt.rcParams['figure.figsize'] = (10.0, 8.0)
plt.imshow(img, cmap=plt.cm.binary)
We'll notice some of the clusters are empty. This is because our sampler assigns zero mass to some of the clusters; we can see this by plotting a histogram of, e.g., the last assignment vector generated by our Gibbs sampler.
plt.hist(usps_chain[-1], bins=40)
(array([ 3., 14., 0., 4., 33., 0., 11., 6., 6., 23., 14., 0., 22., 13., 2., 0., 6., 2., 4., 19., 0., 4., 2., 8., 14., 0., 29., 64., 0., 4., 8., 11., 23., 8., 16., 30., 11., 10., 4., 47.]), array([ 0. , 0.975, 1.95 , 2.925, 3.9 , 4.875, 5.85 , 6.825, 7.8 , 8.775, 9.75 , 10.725, 11.7 , 12.675, 13.65 , 14.625, 15.6 , 16.575, 17.55 , 18.525, 19.5 , 20.475, 21.45 , 22.425, 23.4 , 24.375, 25.35 , 26.325, 27.3 , 28.275, 29.25 , 30.225, 31.2 , 32.175, 33.15 , 34.125, 35.1 , 36.075, 37.05 , 38.025, 39. ]), <a list of 40 Patch objects>)
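Equivalently, a quick way to read off the occupancy counts (and spot the empty clusters) without a plot:
# cluster occupancy counts for the final sample; zero entries are empty clusters
print np.bincount(usps_chain[-1], minlength=usps_K)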
This is a great segue into infinite-dimensional models!