# Mixture Models¶

###### COMP4670/8600 - Introduction to Statistical Machine Learning - Tutorial¶

$\newcommand{\trace}[1]{\operatorname{tr}\left\{#1\right\}}$ $\newcommand{\Norm}[1]{\lVert#1\rVert}$ $\newcommand{\RR}{\mathbb{R}}$ $\newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $\newcommand{\DD}{\mathscr{D}}$ $\newcommand{\grad}[1]{\operatorname{grad}#1}$ $\DeclareMathOperator*{\argmin}{arg\,min}$ Run this cell to set up $\LaTeX$.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

## Bishop Exercise 9.7¶

Verify that maximization of the complete-data log likelihood, \begin{align} \ln p(\mathbf{X},\mathbf{Z}|\mathbf{\mu},\mathbf{\Sigma},\mathbf{\pi}) & = \sum_{n=1}^N \sum_{k=1}^K z_{nk} \{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma}_k \}, \end{align} for a Gaussian mixture model leads to the result that the means and covariances of each component are fitted independently to the corresponding group of data points, and the mixing coefficients are given by the fractions of points in each group.

## Bishop Exercise 9.8¶

Show that if we maximize \begin{align} \mathbb{E}_\mathbf{Z} \left[ \ln p(\mathbf{X},\mathbf{Z}|\mathbf{\mu},\mathbf{\Sigma},\mathbf{\pi}) \right] & = \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk}) \{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n | \mathbf{\mu}_k \mathbf{\Sigma}_k \}, \end{align} with respect to $\mathbf{\mu}_k$ while keeping the responsibilities $\gamma(z_{nk})$ fixed, we obtain the closed form solution \begin{align} \mathbf{\mu}_k & = \frac{1}{N_k} \sum_{n=1}^N \gamma(z_{nk}). \end{align}

## Bishop Exercise 9.9¶

Show that if we instead maximize with respect to $\mathbf{\Sigma}_k$ and $\mathbf{\pi}$ while keeping the responsibilities $\gamma(z_{nk})$ fixed, we obtain the closed form solution \begin{align} \mathbf{\Sigma}_k & = \frac{1}{N_k} \sum_{n=1}^N \gamma(z_{nk}) (\mathbf{x}_n-\mathbf{\mu}_k)(\mathbf{x}_n-\mathbf{\mu}_k)^\top, \\ \pi_k &= \frac{N_k}{N}. \end{align}

## Mixture Model Implementation¶

### Get the data¶

https://machlearn.gitlab.io/isml2017/tutorial/faithful.csv

The data is two dimensional, denoting the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

In [ ]:
# load the old faithful dataset and standardise

data_labels = ('Eruption length','Eruption wait')

data -= np.mean(data, axis=0)
data /= np.std(data, axis=0)
data = data.T

assert data.shape == (2, 272)

### Plotting helper functions¶

In [ ]:
# plot_cov_ellipse was taken from here:
# http://www.nhsilbert.net/source/2014/06/bivariate-normal-ellipse-plotting-in-python/

def plot_cov_ellipse(cov, pos, volume=.5, ax=None, fc='none', ec=[0,0,0], a=1, lw=2):
"""
Plots an ellipse enclosing *volume* based on the specified covariance
matrix (*cov*) and location (*pos*). Additional keyword arguments are passed on to the
ellipse patch artist.

Parameters
----------
cov : The 2x2 covariance matrix to base the ellipse on
pos : The location of the center of the ellipse. Expects a 2-element
sequence of [x0, y0].
volume : The volume inside the ellipse; defaults to 0.5
ax : The axis that the ellipse will be plotted on. Defaults to the
current axis.
"""

import numpy as np
from scipy.stats import chi2
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def eigsorted(cov):
vals, vecs = np.linalg.eigh(cov)
order = vals.argsort()[::-1]
return vals[order], vecs[:,order]

if ax is None:
ax = plt.gca()

vals, vecs = eigsorted(cov)
theta = np.degrees(np.arctan2(*vecs[:,0][::-1]))

kwrg = {'facecolor':fc, 'edgecolor':ec, 'alpha':a, 'linewidth':lw}

# Width and height are "full" widths, not radius
width, height = 2 * np.sqrt(chi2.ppf(volume,2)) * np.sqrt(vals)
ellip = Ellipse(xy=pos, width=width, height=height, angle=theta, **kwrg)

def plot_components(mu, Sigma, colours, *args, **kwargs):
'''
Plot ellipses for the bivariate normals with mean mu[:,i] and covariance Sigma[:,:,i]
'''
assert mu.shape[1] == Sigma.shape[2]
assert mu.shape[0] == 2
assert Sigma.shape[0] == 2
assert Sigma.shape[1] == 2
for i in range(mu.shape[1]):
kwargs['ec'] = colours[i]
plot_cov_ellipse(Sigma[:,:,i], mu[:,i], *args, **kwargs)

import matplotlib.colors as mcol

br_cmap = mcol.LinearSegmentedColormap.from_list("MyCmapName",["b","r"])

def plot_data(redness=None):
if redness is not None:
assert len(redness) == data.shape[1]
assert all(_ >= 0 and _ <= 1 for _ in redness)
c = redness
else:
c = 'g'
plt.figure(figsize=(8,8))
plt.scatter(data[0,:],data[1,:], marker='.', s=8, linewidths=2, c=c, cmap=br_cmap)
plt.xlabel(data_labels[0])
plt.ylabel(data_labels[1])
plt.axis([-2,2,-2,2], 'equal')
In [ ]:
# initialise the mixture components and plot

pi_0 = np.array([0.5,0.5])
mu_0 = np.array([[-1,1],[1,-1]], dtype=np.float)
Sigma_0 = np.concatenate(
[np.eye(2).reshape(2,2,1) for _ in range(mu_0.shape[1])],
axis=2,
)

plot_data()
plot_components(mu_0, Sigma_0, ['b','r'], 0.2)
plt.show()

## Optimise the Gaussian Mixture with EM¶

Using the equations you have derived, along with the relevant related material from the Bishop textbook, perform maximum likelihood estimation for $\mathbf{\mu}$, $\mathbf{\Sigma}$ and $\mathbf{\pi}$.

Make some nice plots. Note that the plot_data function takes an argument to control the color of the scatter plot.

In [ ]:
# Solution goes here

## Interpret the Previous Algorithm in terms of General EM¶

Refer to Bishop and/or the lecture slides, and explain with equations:

1. the assumed generative data generating process for the mixture of Gaussians,
2. what the parameters of the model are,
3. what the latent (unobserved) variables are and why they make maximum likelihood parameter estimation non-trivial,
4. how EM provides a solution to the maximum likelihood estimation problem.

Make yourself comfortable with EM to the point that you could readily apply it to another model.