# Discrete Data and the Multinomial Distribution

### Preliminaries

• Goal
  • Simple Bayesian and maximum likelihood-based density estimation for discretely valued data sets
• Materials
  • Mandatory
    • These lecture notes
  • Optional
    • Bishop pp. 67-70, 74-76, 93-94

### Discrete Data: the 1-of-K Coding Scheme

• Consider a coin-tossing experiment with outcomes $x \in\{0,1\}$ (tail and head) and let $0\leq \mu \leq 1$ represent the probability of heads. This model can be written as a Bernoulli distribution: $$p(x|\mu) = \mu^{x}(1-\mu)^{1-x}$$
• Note that the variable $x$ acts as a (binary) selector for the tail or head probabilities. Think of this as an 'if'-statement in programming.
• Now consider a $K$-sided coin (e.g., a six-faced die (pl.: dice)). How should we encode outcomes?
• Option 1: $x \in \{1,2,\ldots,K\}$.

• E.g., for $K=6$, if the die lands on the 3rd face $\,\Rightarrow x=3$.
• Option 2: $x=(x_1,\ldots,x_K)^T$ with binary selection variables $$x_k = \begin{cases} 1 & \text{if die landed on kth face}\\ 0 & \text{otherwise} \end{cases}$$

• E.g., for $K=6$, if the die lands on the 3rd face $\,\Rightarrow x=(0,0,1,0,0,0)^T$.
• This coding scheme is called a 1-of-K or one-hot coding scheme.
• It turns out that the one-hot coding scheme is mathematically more convenient!
• Consider a $K$-sided die. We use a one-hot coding scheme. Assume the probabilities $p(x_k=1) = \mu_k$ with $\sum_k \mu_k = 1$. The data generating distribution is then (note the similarity to the Bernoulli distribution)
$$p(x|\mu) = \mu_1^{x_1} \mu_2^{x_2} \cdots \mu_K^{x_K}=\prod_{k=1}^K \mu_k^{x_k} \tag{B-2.26}$$
• This generalized Bernoulli distribution is called the categorical distribution (or sometimes the 'multi-noulli' distribution).
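As a minimal Python sketch (the helper names `one_hot` and `categorical_pmf` are ours, purely for illustration), the 1-of-K coding and Eq. B-2.26 can be written as:

```python
import math

def one_hot(k, K):
    """1-of-K encoding: a length-K binary vector with a 1 at position k (0-based)."""
    return [1 if i == k else 0 for i in range(K)]

def categorical_pmf(x, mu):
    """p(x|mu) = prod_k mu_k^{x_k} for a one-hot vector x (Eq. B-2.26).
    The exponent x_k acts as a selector: exactly one factor survives."""
    p = 1.0
    for xk, muk in zip(x, mu):
        p *= muk ** xk
    return p

mu = [0.1, 0.1, 0.5, 0.1, 0.1, 0.1]   # a loaded six-sided die
x = one_hot(2, 6)                      # die landed on the 3rd face
assert x == [0, 0, 1, 0, 0, 0]
assert math.isclose(categorical_pmf(x, mu), mu[2])  # the selector picks out mu_3
```

Note how the product over $\mu_k^{x_k}$ implements exactly the 'if'-statement behaviour mentioned above: all factors with $x_k=0$ evaluate to $1$.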

### Bayesian Density Estimation for a Loaded Die

• Now let's proceed with Bayesian density estimation $p(x|\theta)$ for an observed data set $D=\{x_1,\ldots,x_N\}$ of $N$ IID rolls of a $K$-sided die, with
$$x_{nk} = \begin{cases} 1 & \text{if the nth throw landed on kth face}\\ 0 & \text{otherwise} \end{cases}$$

##### Model specification
• The data generating PDF is $$p(D|\mu) = \prod_n \prod_k \mu_k^{x_{nk}} = \prod_k \mu_k^{\sum_n x_{nk}} = \prod_k \mu_k^{m_k} \tag{B-2.29}$$ where $m_k= \sum_n x_{nk}$ is the total number of throws that landed on the $k$th face. Note that $\sum_k m_k = N$.
• This distribution depends on the observations only through the quantities $\{m_k\}$.
• We need a prior for the parameters $\mu = (\mu_1,\mu_2,\ldots,\mu_K)$. In the binary coin toss example, we used a beta prior, which is conjugate to the binomial likelihood and lets us encode our prior knowledge as pseudo-counts.

• The generalization of the beta prior to the $K$ parameters $\{\mu_k\}$ is the Dirichlet distribution: $$p(\mu|\alpha) = \mathrm{Dir}(\mu|\alpha) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\Gamma(\alpha_1)\cdots \Gamma(\alpha_K)} \prod_{k=1}^K \mu_k^{\alpha_k-1}$$

• Again, you can interpret $\alpha_k$ as the prior number of (pseudo-)observations that the die landed on the $k$th face.
##### Inference for $\{\mu_k\}$
• The posterior for $\{\mu_k\}$ can be obtained through Bayes rule:
\begin{align*} p(\mu|D,\alpha) &\propto p(D|\mu) \cdot p(\mu|\alpha) \\ &\propto \prod_k \mu_k^{m_k} \cdot \prod_k \mu_k^{\alpha_k-1} \\ &= \prod_k \mu_k^{\alpha_k + m_k -1}\\ &\propto \mathrm{Dir}\left(\mu\,|\,\alpha + m \right) \tag{B-2.41} \\ &= \frac{\Gamma\left(\sum_k (\alpha_k + m_k) \right)}{\Gamma(\alpha_1+m_1) \Gamma(\alpha_2+m_2) \cdots \Gamma(\alpha_K + m_K)} \prod_{k=1}^K \mu_k^{\alpha_k + m_k -1} \end{align*}

where $m = (m_1,m_2,\ldots,m_K)^T$.

• Again, we recognize the $\alpha_k$'s as prior pseudo-counts, and the Dirichlet distribution turns out to be a conjugate prior to the categorical/multinomial:
\begin{align*} \underbrace{\text{Dirichlet}}_{\text{posterior}} &\propto \underbrace{\text{categorical}}_{\text{likelihood}} \cdot \underbrace{\text{Dirichlet}}_{\text{prior}} \end{align*}
• This is actually a generalization of the conjugate relation that we found for the binary coin toss:
\begin{align*} \underbrace{\text{Beta}}_{\text{posterior}} &\propto \underbrace{\text{binomial}}_{\text{likelihood}} \cdot \underbrace{\text{Beta}}_{\text{prior}} \end{align*}
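Thanks to conjugacy, the posterior update B-2.41 is just count addition: the posterior Dirichlet parameters are $\alpha + m$. A small Python sketch (function name ours, for illustration only):

```python
from collections import Counter

def dirichlet_posterior(alpha, rolls, K):
    """Conjugate update: posterior is Dir(mu | alpha + m), where m_k counts
    how often face k (0-based) appears in the observed rolls (Eq. B-2.41)."""
    m = Counter(rolls)
    return [alpha[k] + m.get(k, 0) for k in range(K)]

alpha = [1, 1, 1, 1, 1, 1]           # uniform prior pseudo-counts
rolls = [2, 2, 5, 0, 2]              # five observed throws (faces 0..5)
post = dirichlet_posterior(alpha, rolls, 6)
assert post == [2, 1, 4, 1, 1, 2]
assert sum(post) == sum(alpha) + len(rolls)   # total counts = prior + data
```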
##### Prediction of next toss for the loaded die
• Let's apply what we have learned about the loaded die to compute the probability that we throw the $k$th face at the next toss.
\begin{align*} p(x_{\bullet,k}=1|D) &= \int p(x_{\bullet,k}=1|\mu)\,p(\mu|D) \,\mathrm{d}\mu \\ &= \int \mu_k \times \mathrm{Dir}(\mu|\,\alpha+m) \,\mathrm{d}\mu \\ &= \mathrm{E}\left[ \mu_k \right] \\ &= \frac{m_k + \alpha_k }{ N+ \sum_k \alpha_k} \end{align*}
• In the above derivation, we noticed that the data generating distribution for $N$ die tosses $D=\{x_1,\ldots,x_N\}$ only depends on the data frequencies $m_k$: $$p(D|\mu) = \prod_n \underbrace{\prod_k \mu_k^{x_{nk}}}_{\text{categorical dist.}} = \prod_k \mu_k^{\sum_n x_{nk}} = \prod_k \mu_k^{m_k} \tag{B-2.29}$$
• A related distribution is the distribution over data frequency observations $D_m=\{m_1,\ldots,m_K\}$, which is called the multinomial distribution, $$p(D_m|\mu) =\frac{N!}{m_1! m_2!\ldots m_K!} \,\prod_k \mu_k^{m_k}\,.$$
• Verify for yourself that (Exercise):
• the categorical distribution is a special case of the multinomial for $N=1$.
• the Bernoulli is a special case of the categorical distribution for $K=2$.
• the binomial is a special case of the multinomial for $K=2$.
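The posterior predictive rule above reduces to simple arithmetic on counts. A hypothetical Python sketch (the name `predictive` is ours):

```python
def predictive(k, alpha, m):
    """Posterior predictive p(x_k = 1 | D) = (m_k + alpha_k) / (N + sum_j alpha_j)."""
    N = sum(m)
    return (m[k] + alpha[k]) / (N + sum(alpha))

alpha = [1] * 6                      # uniform prior pseudo-counts
m = [3, 0, 5, 1, 0, 1]               # observed counts for N = 10 throws
probs = [predictive(k, alpha, m) for k in range(6)]
assert abs(sum(probs) - 1.0) < 1e-12         # predictions form a valid distribution
assert probs[2] == (5 + 1) / (10 + 6)        # face 2: 6/16
```

Note that faces that were never observed ($m_k = 0$) still get non-zero predictive probability thanks to the prior pseudo-counts.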

### Maximum Likelihood Estimation for the Multinomial

##### Maximum likelihood as a special case of Bayesian estimation
• We can get the maximum likelihood estimate $\hat{\mu}_k$ for $\mu_k$ based on $N$ throws of a $K$-sided die from the Bayesian framework by using a uniform prior for $\mu$ and taking the mode of the posterior for $\mu$: \begin{align*} \hat{\mu}_k &= \arg\max_{\mu_k} p(D|\mu) \\ &= \arg\max_{\mu_k} p(D|\mu)\cdot \mathrm{Uniform}(\mu) \\ &= \arg\max_{\mu_k} p(D|\mu) \cdot \left.\mathrm{Dir}(\mu|\alpha)\right|_{\alpha=(1,1,\ldots,1)} \\ &= \arg\max_{\mu_k} \left.p(\mu|D,\alpha)\right|_{\alpha=(1,1,\ldots,1)} \\ &= \arg\max_{\mu_k} \left.\mathrm{Dir}\left( \mu | m + \alpha \right)\right|_{\alpha=(1,1,\ldots,1)} \\ &= \frac{m_k}{\sum_k m_k} = \frac{m_k}{N} \end{align*} where we used the fact that the mode of the Dirichlet distribution $\mathrm{Dir}(\mu|\alpha)$ is attained at $\mu_k = (\alpha_k-1)/\left(\sum_k\alpha_k - K\right)$.
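We can verify numerically that with a uniform prior the posterior mode coincides with the sample proportions. A minimal sketch (helper name ours; the counts are made up):

```python
def dirichlet_mode(alpha):
    """Mode of Dir(alpha): (alpha_k - 1) / (sum_j alpha_j - K); valid for alpha_k >= 1."""
    K, s = len(alpha), sum(alpha)
    return [(a - 1) / (s - K) for a in alpha]

m = [3, 0, 5, 1, 0, 1]                        # observed counts, N = 10
alpha = [1] * 6                               # uniform prior: one pseudo-count per face
mode = dirichlet_mode([a + mk for a, mk in zip(alpha, m)])
N = sum(m)
# posterior mode equals the ML estimate m_k / N
assert all(abs(mo - mk / N) < 1e-12 for mo, mk in zip(mode, m))
```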
##### Maximum likelihood estimation by optimizing a constrained log-likelihood
• Of course, we shouldn't have to go through the full Bayesian framework to get the maximum likelihood estimate. Alternatively, we can find the maximum of the likelihood directly.

• The log-likelihood for the multinomial distribution is given by

\begin{align*} \mathrm{L}(\mu) &\triangleq \log p(D_m|\mu) \propto \log \prod_k \mu_k^{m_k} = \sum_k m_k \log \mu_k \end{align*}

• When doing ML estimation, we must obey the constraint $\sum_k \mu_k = 1$, which can be accomplished by a Lagrange multiplier (see Bishop App.E). The augmented log-likelihood with Lagrange multiplier is then
$$\mathrm{L}^\prime(\mu) = \sum_k m_k \log \mu_k + \lambda \cdot \left(1 - \sum_k \mu_k \right)$$
• Setting the derivative to zero yields $$\begin{equation*} \nabla_{\mu_k} \mathrm{L}^\prime = \frac{m_k } {\hat\mu_k } - \lambda \overset{!}{=} 0 \; \Rightarrow \; \hat\mu_k = \frac{m_k }{\lambda} \end{equation*}$$ where we get $\lambda$ from the constraint $$\begin{equation*} \sum_k \hat \mu_k = \sum_k \frac{m_k} {\lambda} = \frac{N}{\lambda} \overset{!}{=} 1 \; \Rightarrow \; \lambda = N\,, \end{equation*}$$ so that $\hat\mu_k = m_k/N$, the sample proportion.
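As a sanity check, the sample proportions should score at least as high in log-likelihood as any other point on the simplex. A small sketch (counts and alternative candidates are made up):

```python
import math

def log_lik(mu, m):
    """Multinomial log-likelihood up to a constant: sum_k m_k log mu_k."""
    return sum(mk * math.log(muk) for mk, muk in zip(m, mu) if mk > 0)

m = [3, 2, 5]
N = sum(m)
mu_hat = [mk / N for mk in m]                 # closed-form answer: sample proportions

# the sample proportions beat a few hand-picked competitors on the simplex
for mu in ([0.2, 0.3, 0.5], [1/3, 1/3, 1/3], [0.25, 0.25, 0.5]):
    assert log_lik(mu, m) <= log_lik(mu_hat, m)
```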

### Recap Maximum Likelihood Estimation for Gaussian and Multinomial Distributions

Given $N$ IID observations $D=\{x_1,\dotsc,x_N\}$.

• For a multivariate Gaussian model $p(x_n|\theta) = \mathcal{N}(x_n|\mu,\Sigma)$, we obtain ML estimates
\begin{align} \hat \mu &= \frac{1}{N} \sum_n x_n \tag{sample mean} \\ \hat \Sigma &= \frac{1}{N} \sum_n (x_n-\hat\mu)(x_n - \hat \mu)^T \tag{sample covariance} \end{align}
• For discrete outcomes modeled by a 1-of-K categorical distribution we find
\begin{align} \hat\mu_k = \frac{1}{N} \sum_n x_{nk} \quad \left(= \frac{m_k}{N} \right) \tag{sample proportion} \end{align}
• Note the similarity for the means between discrete and continuous data.

## OPTIONAL SLIDES

### Some Useful Matrix Calculus

When taking derivatives with respect to matrices, e.g., for maximum likelihood estimation, it helps to be familiar with some matrix calculus. We briefly recapitulate the formulas used here.

• We define the gradient of a scalar function $f(A)$ w.r.t. an $n \times k$ matrix $A$ as $$\nabla_A f \triangleq \begin{bmatrix} \frac{\partial{f}}{\partial a_{11}} & \frac{\partial{f}}{\partial a_{12}} & \cdots & \frac{\partial{f}}{\partial a_{1k}}\\ \frac{\partial{f}}{\partial a_{21}} & \frac{\partial{f}}{\partial a_{22}} & \cdots & \frac{\partial{f}}{\partial a_{2k}}\\ \vdots & \vdots & \cdots & \vdots\\ \frac{\partial{f}}{\partial a_{n1}} & \frac{\partial{f}}{\partial a_{n2}} & \cdots & \frac{\partial{f}}{\partial a_{nk}} \end{bmatrix}$$
• The following formulas are useful (see Bishop App.-C) \begin{align*} |A^{-1}|&=|A|^{-1} \tag{B-C.4} \\ \nabla_A \log |A| &= (A^{T})^{-1} = (A^{-1})^T \tag{B-C.28} \\ \mathrm{Tr}[ABC]&= \mathrm{Tr}[CAB] = \mathrm{Tr}[BCA] \tag{B-C.9} \\ \nabla_A \mathrm{Tr}[AB] &=\nabla_A \mathrm{Tr}[BA]= B^T \tag{B-C.25} \\ \nabla_A \mathrm{Tr}[ABA^T] &= A(B+B^T) \tag{B-C.27}\\ \nabla_x x^TAx &= (A+A^T)x \tag{from B-C.27}\\ \nabla_X a^TXb &= \nabla_X \mathrm{Tr}[ba^TX] = ab^T \notag \end{align*}
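Identities like B-C.28 are easy to check numerically. A minimal sketch for a $2\times 2$ matrix (helper names ours), comparing a finite-difference gradient of $\log |A|$ against $(A^{-1})^T$:

```python
import math

def det2(A):
    """Determinant of a 2x2 matrix given as nested lists."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def inv2(A):
    """Inverse of a 2x2 matrix."""
    d = det2(A)
    return [[ A[1][1] / d, -A[0][1] / d],
            [-A[1][0] / d,  A[0][0] / d]]

def grad_logdet_fd(A, eps=1e-6):
    """Central finite-difference gradient of log|A| w.r.t. each entry a_ij."""
    G = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            Ap = [row[:] for row in A]; Ap[i][j] += eps
            Am = [row[:] for row in A]; Am[i][j] -= eps
            G[i][j] = (math.log(det2(Ap)) - math.log(det2(Am))) / (2 * eps)
    return G

A = [[3.0, 1.0], [2.0, 4.0]]                   # det = 10 > 0, so log|A| is defined
G = grad_logdet_fd(A)
Ainv = inv2(A)
for i in range(2):
    for j in range(2):                          # B-C.28: grad_A log|A| = (A^{-1})^T
        assert abs(G[i][j] - Ainv[j][i]) < 1e-4
```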