- Goal
- Introduction to the Expectation-Maximization (EM) Algorithm with application to Gaussian Mixture Models (GMM)

- Materials
- Mandatory
- These lecture notes

- Optional
- Bishop pp. 430-439 for Gaussian Mixture Models

- Mandatory

Sofar, model inference was solved analytically, but we used strong assumptions:

- IID sampling, $p(D) = \prod_n p(x_n)$

- Simple Gaussian (or multinomial) PDFs, $p(x_n) = \mathcal{N}(x_n|\mu,\Sigma)$

- Some limitations of Simple Gaussian Models with IID Sampling:
- What if the PDF is
**multi-modal**(or is just not Gaussian in any other way)? - Covariance matrix $\Sigma$ has $D(D+1)/2$ parameters.
- This quickly becomes
**a very large number**for increasing dimension $D$.

- This quickly becomes
- Temporal signals are often
**not IID**.

- What if the PDF is

- What if the PDF is multi-modal (or is just not Gaussian in any other way)?
**Discrete latent**variable models (a.k.a.**mixture**models). We'll cover this case in this lesson.

- Covariance matrix $\Sigma$ has $D(D+1)/2$ parameters. This quickly becomes very large for increasing dimension $D$.
**Continuous latent**variable models (a.k.a.**dimensionality reduction**models). Covered in lesson 11.

- Temporal signals are often not IID.
- Introduce
**Markov dependencies**and**latent state**variable models. This will be covered in lesson 13.

- Introduce

- You're now asked to build a density model for a data set (Old Faithful, Bishop pg. 681) that clearly is not distributed as a single Gaussian:

Consider again a set of observed data $D=\{x_1,\dotsc,x_N\}$

- This time we suspect that there are unobserved class labels that would help explain (or predict) the data, e.g.,
- the observed data are the color of living things; the unobserved classes are animals and plants.
- observed are wheel sizes; unobserved categories are trucks and personal cars.
- observed is an audio signal; unobserved classes include speech, music, traffic noise, etc.

- Classification problems with unobserved classes are called
**Clustering**problems. The learning algorithm needs to**discover the classes from the observed data**.

- If the categories were observed as well, these data could be nicely modeled by the previously discussed generative classification framework.

- Let us use
**equivalent model assumptions to linear generative classification**: $$\begin{align*} p(x_n,\mathcal{C}_k) &= \overbrace{p(\mathcal{C}_k)}^{\text{class prior}} \, \overbrace{p(x_n|\mathcal{C}_k)}^{\text{likelihood}} = \pi_k \,\mathcal{N}\left(x_n|\mu_k,\Sigma_k \right) \end{align*}$$ where, as previously, we use notational shorthand $\mathcal{C}_k \triangleq (z_{nk}=1)$ and a*hidden*$1$-of-$K$ selector variable $z_{nk}$ with each observation $x_n$ as $$ z_{nk} = \begin{cases} 1 & \text{if } \, z_n \in \mathcal{C}_k\\ 0 & \text{otherwise} \end{cases} $$

- Since the class labels are not observed, the probability for observations in a GMM is obtained through
*marginalization*over the classes (this is different in classification problems, where the classes are observed): $$\begin{align*} p(x_n) = \boxed{\sum_k \pi_k \mathcal{N}\left(x_n|\mu_k,\Sigma_k \right)} \end{align*}$$- The class priors $p(\mathcal{C}_k) =\pi_k$ are often called
*mixture coefficients*.

- The class priors $p(\mathcal{C}_k) =\pi_k$ are often called

- GMMs are very popular models. They have decent computational properties and are
**universal approximators of densities**(as long as there are enough Gaussians of course)

- The log-likelihood for observed data $D=\{x_1,\dotsc,x_N\}$, $$\begin{align*} \log p(D|\theta) &\stackrel{\text{IID}}{=} \sum_n \log p(x_n|\theta) = \sum_n \log \left( \sum_{k=1}^K p(\mathcal{C}_k) \, p(x_n|\mathcal{C}_k) \right) \\ &= \sum_n \log \left( \sum_k \pi_k\mathcal{N}(x_n|\mu_k,\Sigma_k) \right) \end{align*}$$ ... and now the log-of-sum cannot be further simplified.

- Compare this to the log-likelihood for $D=\{(x_1,y_1),\dotsc,(x_N,y_N)\}$ in (generative) classification: $$\begin{align*} \log p(D|\theta) &= \sum_n \log \prod_k p(x_n,\mathcal{C}_{k}|\theta)^{y_{nk}} = \sum_{n,k} y_{nk} \log p(x_n,\mathcal{C}_{k}|\theta) \\ &= \sum_k m_k \log \pi_k + \sum_{n,k} y_{nk} \log \mathcal{N}(x_n|\mu_k,\Sigma) \end{align*}$$ which led to easy Gaussian and multinomial parameter estimation.

- Fortunately GMMs can be trained by maximum likelihood using an efficient algorithm: Expectation-Maximization.

Let's introduce a $1$-of-$K$ selector variable $z_{nk}$ to represent the *unobserved* classes $\mathcal{C}_k$ by
$$
z_{nk} = \begin{cases} 1 & \text{if $z_n$ in class $\mathcal{C}_k$}\\
0 & \text{otherwise} \end{cases}
$$

- We don't
*observe*the class labels $z_{nk}$, but we can compute the posterior probability for the class labels, given the observation $x_n$: $$\begin{align*} p(\mathcal{C}_k | x_n ) &= \frac{p(x_n|\mathcal{C}_k)p(\mathcal{C}_k)}{\sum_{k^\prime} p(x_n|\mathcal{C}_{k^\prime})p(\mathcal{C}_{k^\prime})} = \frac{\pi_k \mathcal{N}(x_n | \mu_k,\Sigma_k)}{\sum_{k^\prime} \pi_{k^\prime} \mathcal{N}(x_n | \mu_{k^\prime},\Sigma_{k^\prime})} \end{align*}$$

- The $\gamma_{nk} \triangleq p(\mathcal{C}_k | x_n ) = p(z_{nk}=1 | x_n )$ are also called
**responsibilities**. Note that $0 \leq \gamma_{nk} \leq 1$ by definition and that they can be evaluated (i.e., we can compute it).

- The responsibilities $\gamma_{nk}$ are soft class indicators and they play the same role in clustering as class selection variables ($y_{nk}$) do in classification problems.

**IDEA**: Let's apply the (generative) classification formulas and substitute the reponsibilities $\gamma_{nk}$ wherever the formulas use the binary class indicators $y_{nk}$.

- Try parameter updates (same as for generative Gaussian classification): $$\begin{align*} \hat \pi_k &= \frac{m_k}{N}\,,\quad \text{where }m_k = \sum_n \gamma_{nk} \\ \hat \mu_k &= \frac{1}{m_k} \sum_n \gamma_{nk} x_n \\ \hat \Sigma_k &= \frac{1}{m_k} \sum_{n} \gamma_{nk} (x_n-\hat \mu_k)(x_n-\hat \mu_k)^T \end{align*}$$

- But wait ..., the responsibilities $\gamma_{nk}=\frac{\pi_k \mathcal{N}(x_n|\mu_k,\Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_n|\mu_j,\Sigma_j)}$ are a function of the model parameters $\{\pi,\mu,\Sigma\}$ and the parameter updates depend on the responsibilities ...

**Solution**: Iterate updating between $\gamma_{nk}$ and the parameters $\{\pi,\mu,\Sigma\}$. This procedure is called the**Expectation-Maximization (EM)**algorithm.

We'll perform clustering on the data set from the illustrative example by fitting a GMM consisting of two Gaussians using the EM algorithm.

In [1]:

```
using DataFrames, CSV, LinearAlgebra
include("scripts/gmm_plot.jl") # Holds plotting function
old_faithful = CSV.read("datasets/old_faithful.csv")
X = convert(Matrix{Float64}, [old_faithful[1] old_faithful[2]]')
N = size(X, 2)
# Initialize the GMM. We assume 2 clusters.
clusters = [MvNormal([4.;60.], [.5 0;0 10^2]);
MvNormal([2.;80.], [.5 0;0 10^2])]
π_hat = [0.5; 0.5] # Mixing weights
γ = fill!(Matrix{Float64}(undef,2,N), NaN) # Responsibilities (row per cluster)
# Define functions for updating the parameters and responsibilities
function updateResponsibilities!(X, clusters, π_hat, γ)
# Expectation step: update γ
norm = [pdf(clusters[1], X) pdf(clusters[2], X)] * π_hat
γ[1,:] = (π_hat[1] * pdf(clusters[1],X) ./ norm)'
γ[2,:] = 1 .- γ[1,:]
end
function updateParameters!(X, clusters, π_hat, γ)
# Maximization step: update π_hat and clusters using ML estimation
m = sum(γ, dims=2)
π_hat = m / N
μ_hat = (X * γ') ./ m'
for k=1:2
Z = (X .- μ_hat[:,k])
Σ_k = Symmetric(((Z .* (γ[k,:])') * Z') / m[k])
clusters[k] = MvNormal(μ_hat[:,k], convert(Matrix, Σ_k))
end
end
# Execute the algorithm: iteratively update parameters and responsibilities
subplot(2,3,1); plotGMM(X, clusters, γ); title("Initial situation")
updateResponsibilities!(X, clusters, π_hat, γ)
subplot(2,3,2); plotGMM(X, clusters, γ); title("After first E-step")
updateParameters!(X, clusters, π_hat, γ)
subplot(2,3,3); plotGMM(X, clusters, γ); title("After first M-step")
iter_counter = 1
for i=1:3
for j=1:i+1
updateResponsibilities!(X, clusters, π_hat, γ)
updateParameters!(X, clusters, π_hat, γ)
iter_counter += 1
end
subplot(2,3,3+i);
plotGMM(X, clusters, γ);
title("After $(iter_counter) iterations")
end
PyPlot.tight_layout()
```

Note that you can step through the interactive demo yourself by running this script in julia. You can run a script in julia by

`julia> include("path/to/script-name.jl")`

Classification | Clustering | |

1 | Class label $y_n$ is observed | Class label $z_n$ is latent |

2 | log-likelihood conditions on observed class$$\propto \sum_{nk} y_{nk} \log \mathcal{N}(x_n|\mu_k,\Sigma_k)$$ | log-likelihood marginalizes over latent classes$$\propto \sum_{n}\log \sum_k \pi_k \mathcal{N}(x_n|\mu_k,\Sigma_k)$$ |

3 | 'Hard' class selector $$ y_{nk} = \begin{cases} 1 & \text{if $y_n$ in class $\mathcal{C}_k$}\\ 0 & \text{otherwise} \end{cases} $$ | 'Soft' class responsibility $$\gamma_{nk} = p(\mathcal{C}_k|x_n)$$ |

4 | Estimation: $$\begin{align*} \hat{\pi}_k &= \frac{1}{N}\sum_n y_{nk} \\ \hat{\mu}_k &= \frac{\sum_n y_{nk} x_n}{\sum_n y_{nk}} \\ \hat{\Sigma}_k &= \frac{\sum_n y_{nk} (x_n-\hat\mu_k)(x_n-\hat\mu_k)^T}{\sum_n y_{nk}} \end{align*}$$ |
Estimation (1 update of M-step!) $$\begin{align*} \hat{\pi}_k &= \frac{1}{N}\sum_n \gamma_{nk} \\ \hat{\mu}_k &= \frac{\sum_n \gamma_{nk} x_n}{\sum_n \gamma_{nk}} \\ \hat{\Sigma}_k &= \frac{\sum_n \gamma_{nk} (x_n-\hat\mu_k)(x_n-\hat\mu_k)^T}{\sum_n \gamma_{nk}} \end{align*}$$ |

In [2]:

```
open("../../styles/aipstyle.html") do f
display("text/html", read(f,String))
end
```

In [ ]:

```
```