# Utils
```python
from IPython.display import Image
```
This notebook describes in painful detail the Gibbs sampling algorithm for Latent Dirichlet Allocation. It is based on "Finding scientific topics", the PNAS paper by Griffiths and Steyvers (link), and not on the original paper by Blei et al.
We assume the following generative story for document creation:
There are $D$ documents, generated independently. The documents are characterized by $T$ topics, and there are $W$ unique words in the vocabulary. Each document $d$ has a length of $L_d$ words.
Each document $d$ has a topic distribution $\theta_d$, a categorical distribution with $T$ categories (and there are $D$ such distributions).
Each topic $t$ has a word distribution $\phi_t$, a categorical distribution with $W$ categories (and there are $T$ such distributions).
To generate a document, we generate each word independently. For each document $d$ and word position $l$, a topic corresponding to that position is sampled from the topic distribution:
$P(z_{dl} = t) = \theta_d(t)$
According to the sampled topic, the word is generated according to:
$P(w_{dl} = w) = \phi_{z_{dl}}(w)$
We place Dirichlet conjugate priors on $\theta$:
$\theta \sim Dirichlet(\alpha)$
and $\phi$:
$\phi \sim Dirichlet(\beta)$.
Furthermore, to simplify things, we assume $\alpha$ and $\beta$ are symmetric priors, and in experiments we initialize them to $1$.
The probabilistic model is illustrated in plate notation below:
```python
lda_pm = Image(filename='images/lda_pm.png')
lda_pm
```
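Before deriving the sampler, it may help to see the generative story as code. Below is a minimal numpy sketch of the process just described; all sizes and variable names are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, purely for illustration.
D, T, W = 5, 3, 20            # documents, topics, vocabulary size
L = rng.integers(10, 20, D)   # document lengths L_d
alpha, beta = 1.0, 1.0        # symmetric Dirichlet hyperparameters

theta = rng.dirichlet(alpha * np.ones(T), size=D)  # D topic distributions
phi = rng.dirichlet(beta * np.ones(W), size=T)     # T word distributions

docs = []
for d in range(D):
    z_d = rng.choice(T, size=L[d], p=theta[d])              # sample topics z_dl
    w_d = np.array([rng.choice(W, p=phi[t]) for t in z_d])  # sample words w_dl
    docs.append(w_d)
```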
The joint distribution is:
$P(w,z, \theta, \phi ; \alpha, \beta) = P(\theta ; \alpha) P(z | \theta) P(\phi ; \beta) P(w | z,\phi)$
Integrating out $\phi$ gives $P(w|z)$, which we will call $f_1$:

$$ f_1 = P(w|z) = k_1 \prod_t \frac{\prod_w \Gamma(n(w,t)+\beta)}{\Gamma(n(t)+\beta W)} $$

Where:
$k_1 = \left(\frac{\Gamma(\beta W)}{\Gamma(\beta)^W}\right)^T$
$n(w,t)$ is the number of times word $w$ has been assigned to topic $t$ in all documents
and
$n(t)$ is the number of words assigned to topic $t$ in all documents.
The exact same steps, this time integrating out $\theta$, can be applied to compute $f_2 = P(z)$:

$$ f_2 = k_2 \prod_d \frac{\prod_t \Gamma(n(d,t)+\alpha)}{\Gamma(n(d)+\alpha T)} $$

Where:
$k_2 = \left(\frac{\Gamma(\alpha T)}{\Gamma(\alpha)^T}\right)^D$
$n(d,t)$ is the number of times a word from document $d$ has been assigned to topic $t$
and
$n(d)$ is the number of words in document $d$, i.e. $L_d$ from the plate notation.
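Since products of Gamma functions overflow almost immediately, $f_1$ and $f_2$ are best evaluated in log space. As a sketch (function and variable names are mine), the four count statistics and the two log values can be computed as:

```python
import numpy as np
from scipy.special import gammaln  # log of the Gamma function

def count_statistics(docs, z, W, T):
    """Tally n(w,t), n(t), n(d,t) and n(d) from topic assignments z."""
    D = len(docs)
    nwt = np.zeros((W, T))
    ndt = np.zeros((D, T))
    for d, (w_d, z_d) in enumerate(zip(docs, z)):
        for w, t in zip(w_d, z_d):
            nwt[w, t] += 1
            ndt[d, t] += 1
    return nwt, nwt.sum(axis=0), ndt, ndt.sum(axis=1)

def log_f1_f2(nwt, nt, ndt, nd, alpha, beta):
    """Evaluate log f_1 and log f_2 term by term from the formulas above."""
    W, T = nwt.shape
    D = ndt.shape[0]
    log_k1 = T * (gammaln(beta * W) - W * gammaln(beta))
    log_f1 = log_k1 + gammaln(nwt + beta).sum() - gammaln(nt + beta * W).sum()
    log_k2 = D * (gammaln(alpha * T) - T * gammaln(alpha))
    log_f2 = log_k2 + gammaln(ndt + alpha).sum() - gammaln(nd + alpha * T).sum()
    return log_f1, log_f2
```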
The goal of the sampling procedure is to estimate $\phi$ and $\theta$. $\phi$ informs us what the topics might be, and $\theta$ tells us how important each topic is in a document. But we have just integrated them out! Instead of sampling the values directly, we will estimate them based on sampling $P(z|w)$.
$P(z|w) \varpropto P(w|z)P(z)$
Concretely, we update:
$$P(z_{ij}=k | z_{-ij}, w)$$

Where $i$ is a document index, $j$ is a position in document $i$, and $k$ is a topic. $z_{ij}$ is the topic of word $w_{ij}$. All the other assignments $z_{-ij}$ are assumed to be known.
$$P(z_{ij}=k | z_{-ij}, w) = \frac{P(z_{ij}=k, z_{-ij}, w)}{P(z_{-ij}, w)} = \frac{f_1}{g_1} \frac{f_2}{g_2}$$

$$g_1 = p(w|z_{-ij})$$

$$g_2 = p(z_{-ij})$$

$g_1$, resp. $g_2$, are similar to $f_1$, resp. $f_2$, except that we don't count the assignment $z_{ij}=k$, i.e.
$$g_1 = k_1 \left(\prod_{t\neq k} \frac{\prod_w \Gamma(n(w,t)+\beta)}{\Gamma(n(t)+\beta W)}\right) \frac{\left(\prod_{w\neq w_{ij}} \Gamma(n(w,k)+\beta)\right)\Gamma(n(w_{ij},k)-1 +\beta)}{\Gamma(n(k)-1+\beta W)}$$

Consequently:
$$ \frac{f_1}{g_1} = \frac{\Gamma(n(w_{ij},k)+\beta)}{\Gamma(n(k)+\beta W)} \frac{\Gamma(n(k)-1+\beta W)}{\Gamma(n(w_{ij},k)-1 +\beta)} = \frac{n(w_{ij},k)-1+ \beta}{n(k)-1+\beta W} $$

where the last step uses the identity $\Gamma(x) = (x-1)\Gamma(x-1)$. Similarly:
$$g_2 = k_2 \left( \prod_{d\neq i} \frac{\prod_t \Gamma(n(d,t)+\alpha)}{\Gamma(n(d)+\alpha T)} \right) \frac{ \left(\prod_{t \neq k} \Gamma(n(i,t)+\alpha)\right) \Gamma(n(i,k)-1+\alpha)}{\Gamma(n(i)-1+\alpha T)}$$

Then:
$$ \frac{f_2}{g_2} = \frac{\Gamma(n(i,k)+\alpha)}{\Gamma(n(i)+\alpha T)} \frac{\Gamma(n(i)-1+\alpha T)}{\Gamma(n(i,k)-1+\alpha)} = \frac{n(i,k)-1+\alpha}{n(i)-1+\alpha T} $$

Finally, the update is:
$$P(z_{ij}=k | z_{-ij}, w) \varpropto \frac{n(w_{ij},k)-1+ \beta}{n(k)-1+\beta W}\frac{n(i,k)-1+\alpha}{n(i)-1+\alpha T}$$

We compute this value for all $k \in \lbrace 1, \dots, T\rbrace$ and sample:
$$z^{(next)}_{ij} \sim Cat(P'(z_{ij}| z_{-ij}, w))$$

We use $P'$ to indicate that the parameters of the conditional should be normalized.
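As a sketch, one such update can be computed directly from the raw counts (names are mine). The counts here still include the current assignment $z_{ij} = t_{old}$, so we first remove it, which is exactly what the $-1$ terms in the update express:

```python
import numpy as np

def conditional(w_ij, i, t_old, nwt, nt, ndt, nd, alpha, beta):
    """Normalized P'(z_ij = k | z_-ij, w) for all T topics at once."""
    W, T = nwt.shape
    # Remove the current assignment z_ij = t_old from the counts.
    nwt_i = nwt[w_ij].copy()
    nwt_i[t_old] -= 1            # n(w_ij, .) without z_ij
    nt_i = nt.copy()
    nt_i[t_old] -= 1             # n(.) without z_ij
    ndt_i = ndt[i].copy()
    ndt_i[t_old] -= 1            # n(i, .) without z_ij
    p = (nwt_i + beta) / (nt_i + beta * W) \
        * (ndt_i + alpha) / (nd[i] - 1 + alpha * T)
    return p / p.sum()

# Hypothetical usage, with counts from count_statistics above:
# p = conditional(w_ij, i, z[i][j], nwt, nt, ndt, nd, alpha, beta)
# z_ij_next = rng.choice(T, p=p)
```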
This notation is not the one commonly used in presentations of LDA. Instead, it is often preferred to use a notation closer to the data structures of the algorithm. For instance, let us convert our notation to match that of Probabilistic Topic Models. They use two matrices:
$C^{WT}$ is a $W \times T$ matrix and an element $C^{WT}_{wk}$ counts the number of times word $w$ has been assigned to topic $k$ excluding the current instance $w_{ij}$.
$n(w_{ij},k)-1 = C^{WT}_{w_{ij}k}$.
$n(k)-1 = \sum_{w} C^{WT}_{wk}$
$C^{DT}$ is a $D \times T$ matrix and an element $C^{DT}_{dt}$ counts the number of times topic $t$ has been assigned to a word in document $d$ excluding the current instance $w_{ij}$.
$n(i,k)-1 = C^{DT}_{ik}$
$n(i)-1 = \sum_{t} C^{DT}_{it}$
Rewriting the equation for the conditional, we obtain the same result as Griffiths and Steyvers:

$$P(z_{ij}=k | z_{-ij}, w) \varpropto \frac{C^{WT}_{w_{ij}k}+ \beta}{\sum_{w} C^{WT}_{wk}+\beta W}\frac{C^{DT}_{ik}+\alpha}{\sum_{t} C^{DT}_{it}+\alpha T}$$

The sampling procedure, in pseudocode, is then:

```
z = initialize()
for it = 1, ..., maxIt:
    for i = 1, ..., D:
        for j = 1, ..., L_i:
            C_WT, C_DT = update(z, i, j)
            sample z_ij_new ~ Cat(P'(z_ij | z_-ij, w))
            C_WT, C_DT = update(z, z_ij_new)
    yield z
```
$initialize$ simply assigns random topics to the words (or samples $\theta$ and then $z$). $update$ updates the count matrices based on the assignment $z$ and the current instance $i,j$. The update before the sample decrements $C^{WT}_{w_{ij},z^{(old)}_{ij}}$ and $C^{DT}_{i,z^{(old)}_{ij}}$, and after the sample we increment $C^{WT}_{w_{ij},z^{(new)}_{ij}}$ and $C^{DT}_{i,z^{(new)}_{ij}}$.
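Putting the pieces together, here is a minimal runnable translation of the whole procedure, assuming `docs` is a list of integer word-id arrays as in the generative sketch above (all other names are mine):

```python
import numpy as np

def gibbs_lda(docs, W, T, alpha=1.0, beta=1.0, max_it=100, seed=0):
    """Collapsed Gibbs sampler for LDA; a sketch, not tuned for speed."""
    rng = np.random.default_rng(seed)
    # initialize(): a random topic for every word position
    z = [rng.integers(T, size=len(w_d)) for w_d in docs]
    C_wt = np.zeros((W, T))          # C^{WT}
    C_dt = np.zeros((len(docs), T))  # C^{DT}
    for i, (w_d, z_d) in enumerate(zip(docs, z)):
        for w, t in zip(w_d, z_d):
            C_wt[w, t] += 1
            C_dt[i, t] += 1
    for it in range(max_it):
        for i, w_d in enumerate(docs):
            for j, w in enumerate(w_d):
                # update before the sample: remove the current assignment
                t_old = z[i][j]
                C_wt[w, t_old] -= 1
                C_dt[i, t_old] -= 1
                # conditional from the Griffiths & Steyvers update
                p = (C_wt[w] + beta) / (C_wt.sum(axis=0) + beta * W) \
                    * (C_dt[i] + alpha) / (C_dt[i].sum() + alpha * T)
                p /= p.sum()
                t_new = rng.choice(T, p=p)
                # update after the sample: add the new assignment
                C_wt[w, t_new] += 1
                C_dt[i, t_new] += 1
                z[i][j] = t_new
        yield z, C_wt, C_dt
```

Yielding the count matrices along with $z$ makes the recovery step below immediate.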
The sampler returns a set of assignments of topics to the words. However, we are interested in $\phi$ and possibly $\theta$, which we have integrated out (or collapsed). We can recover them using:
$$\phi'_t(w) = \frac{C^{WT}_{wt}+ \beta}{\sum_{w'} C^{WT}_{w't}+\beta W}$$

$$\theta'_d(t) = \frac{C^{DT}_{dt}+\alpha}{\sum_{t'} C^{DT}_{dt'}+\alpha T}$$
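As a sketch (names mine), using the final count matrices returned by the sampler above:

```python
import numpy as np

def recover_phi_theta(C_wt, C_dt, alpha, beta):
    """Point estimates of phi and theta from the final counts."""
    W, T = C_wt.shape
    phi = (C_wt + beta) / (C_wt.sum(axis=0) + beta * W)                     # (W, T)
    theta = (C_dt + alpha) / (C_dt.sum(axis=1, keepdims=True) + alpha * T)  # (D, T)
    return phi.T, theta  # phi.T[t] is the word distribution of topic t
```

In practice these estimates are typically averaged over several samples taken after a burn-in period, rather than read off a single draw.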