# Clustering with Gaussian Mixture Models¶

### Preliminaries¶

• Goal
• Introduction to the Expectation-Maximization (EM) Algorithm with application to Gaussian Mixture Models (GMM)
• Materials
• Mandatory
• These lecture notes
• Optional
• Bishop pp. 430-439 for Gaussian Mixture Models

### Limitations of Simple IID Gaussian Models¶

Sofar, model inference was solved analytically, but we used strong assumptions:

• IID sampling, $p(D) = \prod_n p(x_n)$
• Simple Gaussian (or multinomial) PDFs, $p(x_n) = \mathcal{N}(x_n|\mu,\Sigma)$
• Some limitations of Simple Gaussian Models with IID Sampling:
• What if the PDF is multi-modal (or is just not Gaussian in any other way)?
• Covariance matrix $\Sigma$ has $D(D+1)/2$ parameters.
• This quickly becomes a very large number for increasing dimension $D$.
• Temporal signals are often not IID.

### Towards More Flexible Models¶

• What if the PDF is multi-modal (or is just not Gaussian in any other way)?
• Discrete latent variable models (a.k.a. mixture models). We'll cover this case in this lesson.
• Covariance matrix $\Sigma$ has $D(D+1)/2$ parameters. This quickly becomes very large for increasing dimension $D$.
• Continuous latent variable models (a.k.a. dimensionality reduction models). Covered in lesson 11.
• Temporal signals are often not IID.
• Introduce Markov dependencies and latent state variable models. This will be covered in lesson 13.

### Illustrative Example¶

• You're now asked to build a density model for a data set (Old Faithful, Bishop pg. 681) that clearly is not distributed as a single Gaussian:

### Unobserved Classes¶

Consider again a set of observed data $D=\{x_1,\dotsc,x_N\}$

• This time we suspect that there are unobserved class labels that would help explain (or predict) the data, e.g.,
• the observed data are the color of living things; the unobserved classes are animals and plants.
• observed are wheel sizes; unobserved categories are trucks and personal cars.
• observed is an audio signal; unobserved classes include speech, music, traffic noise, etc.
• Classification problems with unobserved classes are called Clustering problems. The learning algorithm needs to discover the classes from the observed data.

### The Gaussian Mixture Model (GMM)¶

• Let us use equivalent model assumptions to linear generative classification: \begin{align*} p(x_n,\mathcal{C}_k) &= \overbrace{p(\mathcal{C}_k)}^{\text{class prior}} \, \overbrace{p(x_n|\mathcal{C}_k)}^{\text{likelihood}} = \pi_k \,\mathcal{N}\left(x_n|\mu_k,\Sigma_k \right) \end{align*} where, as previously, we use notational shorthand $\mathcal{C}_k \triangleq (z_{nk}=1)$ and a hidden $1$-of-$K$ selector variable $z_{nk}$ with each observation $x_n$ as $$z_{nk} = \begin{cases} 1 & \text{if } \, z_n \in \mathcal{C}_k\\ 0 & \text{otherwise} \end{cases}$$
• Since the class labels are not observed, the probability for observations in a GMM is obtained through marginalization over the classes (this is different in classification problems, where the classes are observed): \begin{align*} p(x_n) = \boxed{\sum_k \pi_k \mathcal{N}\left(x_n|\mu_k,\Sigma_k \right)} \end{align*}
• The class priors $p(\mathcal{C}_k) =\pi_k$ are often called mixture coefficients.

### Gaussian Mixture Models¶

• GMMs are very popular models. They have decent computational properties and are universal approximators of densities (as long as there are enough Gaussians of course)

### Inference: Log-Likelihood for GMM¶

• The log-likelihood for observed data $D=\{x_1,\dotsc,x_N\}$, \begin{align*} \log p(D|\theta) &\stackrel{\text{IID}}{=} \sum_n \log p(x_n|\theta) = \sum_n \log \left( \sum_{k=1}^K p(\mathcal{C}_k) \, p(x_n|\mathcal{C}_k) \right) \\ &= \sum_n \log \left( \sum_k \pi_k\mathcal{N}(x_n|\mu_k,\Sigma_k) \right) \end{align*} ... and now the log-of-sum cannot be further simplified.
• Compare this to the log-likelihood for $D=\{(x_1,y_1),\dotsc,(x_N,y_N)\}$ in (generative) classification: \begin{align*} \log p(D|\theta) &= \sum_n \log \prod_k p(x_n,\mathcal{C}_{k}|\theta)^{y_{nk}} = \sum_{n,k} y_{nk} \log p(x_n,\mathcal{C}_{k}|\theta) \\ &= \sum_k m_k \log \pi_k + \sum_{n,k} y_{nk} \log \mathcal{N}(x_n|\mu_k,\Sigma) \end{align*} which led to easy Gaussian and multinomial parameter estimation.
• Fortunately GMMs can be trained by maximum likelihood using an efficient algorithm: Expectation-Maximization.

### Introducing a Soft Class Indicator¶

Let's introduce a $1$-of-$K$ selector variable $z_{nk}$ to represent the unobserved classes $\mathcal{C}_k$ by $$z_{nk} = \begin{cases} 1 & \text{if z_n in class \mathcal{C}_k}\\ 0 & \text{otherwise} \end{cases}$$

• We don't observe the class labels $z_{nk}$, but we can compute the posterior probability for the class labels, given the observation $x_n$: \begin{align*} p(\mathcal{C}_k | x_n ) &= \frac{p(x_n|\mathcal{C}_k)p(\mathcal{C}_k)}{\sum_{k^\prime} p(x_n|\mathcal{C}_{k^\prime})p(\mathcal{C}_{k^\prime})} = \frac{\pi_k \mathcal{N}(x_n | \mu_k,\Sigma_k)}{\sum_{k^\prime} \pi_{k^\prime} \mathcal{N}(x_n | \mu_{k^\prime},\Sigma_{k^\prime})} \end{align*}
• The $\gamma_{nk} \triangleq p(\mathcal{C}_k | x_n ) = p(z_{nk}=1 | x_n )$ are also called responsibilities. Note that $0 \leq \gamma_{nk} \leq 1$ by definition and that they can be evaluated (i.e., we can compute it).
• The responsibilities $\gamma_{nk}$ are soft class indicators and they play the same role in clustering as class selection variables ($y_{nk}$) do in classification problems.

### ML estimation for Clustering: The Expectation-Maximization (EM) Algorithm Idea¶

• IDEA: Let's apply the (generative) classification formulas and substitute the reponsibilities $\gamma_{nk}$ wherever the formulas use the binary class indicators $y_{nk}$.
• Try parameter updates (same as for generative Gaussian classification): \begin{align*} \hat \pi_k &= \frac{m_k}{N}\,,\quad \text{where }m_k = \sum_n \gamma_{nk} \\ \hat \mu_k &= \frac{1}{m_k} \sum_n \gamma_{nk} x_n \\ \hat \Sigma_k &= \frac{1}{m_k} \sum_{n} \gamma_{nk} (x_n-\hat \mu_k)(x_n-\hat \mu_k)^T \end{align*}
• But wait ..., the responsibilities $\gamma_{nk}=\frac{\pi_k \mathcal{N}(x_n|\mu_k,\Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_n|\mu_j,\Sigma_j)}$ are a function of the model parameters $\{\pi,\mu,\Sigma\}$ and the parameter updates depend on the responsibilities ...
• Solution: Iterate updating between $\gamma_{nk}$ and the parameters $\{\pi,\mu,\Sigma\}$. This procedure is called the Expectation-Maximization (EM) algorithm.

#### CODE EXAMPLE¶

We'll perform clustering on the data set from the illustrative example by fitting a GMM consisting of two Gaussians using the EM algorithm.

In [1]:
using DataFrames, CSV, LinearAlgebra
include("scripts/gmm_plot.jl") # Holds plotting function
X = convert(Matrix{Float64}, [old_faithful[1] old_faithful[2]]')
N = size(X, 2)

# Initialize the GMM. We assume 2 clusters.
clusters = [MvNormal([4.;60.], [.5 0;0 10^2]);
MvNormal([2.;80.], [.5 0;0 10^2])]
π_hat = [0.5; 0.5]                    # Mixing weights
γ = fill!(Matrix{Float64}(undef,2,N), NaN)  # Responsibilities (row per cluster)

# Define functions for updating the parameters and responsibilities
function updateResponsibilities!(X, clusters, π_hat, γ)
# Expectation step: update γ
norm = [pdf(clusters[1], X) pdf(clusters[2], X)] * π_hat
γ[1,:] = (π_hat[1] * pdf(clusters[1],X) ./ norm)'
γ[2,:] = 1 .- γ[1,:]
end
function updateParameters!(X, clusters, π_hat, γ)
# Maximization step: update π_hat and clusters using ML estimation
m = sum(γ, dims=2)
π_hat = m / N
μ_hat = (X * γ') ./ m'
for k=1:2
Z = (X .- μ_hat[:,k])
Σ_k = Symmetric(((Z .* (γ[k,:])') * Z') / m[k])
clusters[k] = MvNormal(μ_hat[:,k], convert(Matrix, Σ_k))
end
end

# Execute the algorithm: iteratively update parameters and responsibilities
subplot(2,3,1); plotGMM(X, clusters, γ); title("Initial situation")
updateResponsibilities!(X, clusters, π_hat, γ)
subplot(2,3,2); plotGMM(X, clusters, γ); title("After first E-step")
updateParameters!(X, clusters, π_hat, γ)
subplot(2,3,3); plotGMM(X, clusters, γ); title("After first M-step")
iter_counter = 1
for i=1:3
for j=1:i+1
updateResponsibilities!(X, clusters, π_hat, γ)
updateParameters!(X, clusters, π_hat, γ)
iter_counter += 1
end
subplot(2,3,3+i);
plotGMM(X, clusters, γ);
title("After $(iter_counter) iterations") end PyPlot.tight_layout()  Note that you can step through the interactive demo yourself by running this script in julia. You can run a script in julia by julia> include("path/to/script-name.jl") ### Clustering vs. (Generative) Classification¶  Classification Clustering 1 Class label$y_n$is observed Class label$z_n\$ is latent 2 log-likelihood conditions on observed class$$\propto \sum_{nk} y_{nk} \log \mathcal{N}(x_n|\mu_k,\Sigma_k)$$ log-likelihood marginalizes over latent classes$$\propto \sum_{n}\log \sum_k \pi_k \mathcal{N}(x_n|\mu_k,\Sigma_k)$$ 3 'Hard' class selector$$y_{nk} = \begin{cases} 1 & \text{if y_n in class \mathcal{C}_k}\\ 0 & \text{otherwise} \end{cases}$$ 'Soft' class responsibility$$\gamma_{nk} = p(\mathcal{C}_k|x_n)$$ 4 Estimation: \begin{align*} \hat{\pi}_k &= \frac{1}{N}\sum_n y_{nk} \\ \hat{\mu}_k &= \frac{\sum_n y_{nk} x_n}{\sum_n y_{nk}} \\ \hat{\Sigma}_k &= \frac{\sum_n y_{nk} (x_n-\hat\mu_k)(x_n-\hat\mu_k)^T}{\sum_n y_{nk}} \end{align*} Estimation (1 update of M-step!) \begin{align*} \hat{\pi}_k &= \frac{1}{N}\sum_n \gamma_{nk} \\ \hat{\mu}_k &= \frac{\sum_n \gamma_{nk} x_n}{\sum_n \gamma_{nk}} \\ \hat{\Sigma}_k &= \frac{\sum_n \gamma_{nk} (x_n-\hat\mu_k)(x_n-\hat\mu_k)^T}{\sum_n \gamma_{nk}} \end{align*}
In [2]:
open("../../styles/aipstyle.html") do f