Mixture Models

COMP4670/8600 - Introduction to Statistical Machine Learning - Tutorial

$\newcommand{\trace}[1]{\operatorname{tr}\left\{#1\right\}}$ $\newcommand{\Norm}[1]{\lVert#1\rVert}$ $\newcommand{\RR}{\mathbb{R}}$ $\newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $\newcommand{\DD}{\mathscr{D}}$ $\newcommand{\grad}[1]{\operatorname{grad}#1}$ $\DeclareMathOperator*{\argmin}{arg\,min}$ Run this cell to set up $\LaTeX$.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

Bishop Exercise 9.7

Verify that maximization of the complete-data log likelihood, \begin{align} \ln p(\mathbf{X},\mathbf{Z}|\mathbf{\mu},\mathbf{\Sigma},\mathbf{\pi}) & = \sum_{n=1}^N \sum_{k=1}^K z_{nk} \{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma}_k) \}, \end{align} for a Gaussian mixture model leads to the result that the means and covariances of each component are fitted independently to the corresponding group of data points, and the mixing coefficients are given by the fractions of points in each group.

Solution description
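A hedged sketch of one possible derivation (to be checked against your own working): because each $\mathbf{z}_n$ is a one-of-$K$ binary vector, the double sum splits into $K$ independent terms, one per component, each involving only the points assigned to that component. Setting the derivatives to zero and enforcing $\sum_k \pi_k = 1$ with a Lagrange multiplier gives \begin{align} \mathbf{\mu}_k = \frac{\sum_{n} z_{nk}\,\mathbf{x}_n}{\sum_{n} z_{nk}}, \qquad \mathbf{\Sigma}_k = \frac{\sum_{n} z_{nk}\,(\mathbf{x}_n-\mathbf{\mu}_k)(\mathbf{x}_n-\mathbf{\mu}_k)^\top}{\sum_{n} z_{nk}}, \qquad \pi_k = \frac{1}{N}\sum_{n} z_{nk}, \end{align} i.e. each component is fitted to its own group of points and the mixing coefficients are the group fractions.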

Bishop Exercise 9.8

Show that if we maximize \begin{align} \mathbb{E}_\mathbf{Z} \left[ \ln p(\mathbf{X},\mathbf{Z}|\mathbf{\mu},\mathbf{\Sigma},\mathbf{\pi}) \right] & = \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk}) \{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma}_k) \}, \end{align} with respect to $\mathbf{\mu}_k$ while keeping the responsibilities $\gamma(z_{nk})$ fixed, we obtain the closed-form solution \begin{align} \mathbf{\mu}_k & = \frac{1}{N_k} \sum_{n=1}^N \gamma(z_{nk})\,\mathbf{x}_n, \end{align} where $N_k = \sum_{n=1}^N \gamma(z_{nk})$.

Solution description
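A hedged sketch: taking the derivative of the expectation with respect to $\mathbf{\mu}_k$ and setting it to zero gives \begin{align} \sum_{n=1}^N \gamma(z_{nk})\, \mathbf{\Sigma}_k^{-1} (\mathbf{x}_n - \mathbf{\mu}_k) = 0, \end{align} and multiplying through by $\mathbf{\Sigma}_k$ and rearranging yields $\mathbf{\mu}_k = \frac{1}{N_k}\sum_{n=1}^N \gamma(z_{nk})\,\mathbf{x}_n$ with $N_k = \sum_{n=1}^N \gamma(z_{nk})$.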

Bishop Exercise 9.9

Show that if we instead maximize with respect to $\mathbf{\Sigma}_k$ and $\mathbf{\pi}$ while keeping the responsibilities $\gamma(z_{nk})$ fixed, we obtain the closed-form solutions \begin{align} \mathbf{\Sigma}_k & = \frac{1}{N_k} \sum_{n=1}^N \gamma(z_{nk}) (\mathbf{x}_n-\mathbf{\mu}_k)(\mathbf{x}_n-\mathbf{\mu}_k)^\top, \\ \pi_k &= \frac{N_k}{N}. \end{align}

Solution description
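A hedged sketch: differentiating with respect to $\mathbf{\Sigma}_k$ (or, more conveniently, $\mathbf{\Sigma}_k^{-1}$) and setting the result to zero gives the responsibility-weighted sample covariance. For the mixing coefficients, maximize $\sum_{n}\sum_{k}\gamma(z_{nk})\ln\pi_k + \lambda\left(\sum_k \pi_k - 1\right)$; setting the derivative with respect to $\pi_k$ to zero gives $N_k/\pi_k + \lambda = 0$, and summing over $k$ shows $\lambda = -N$, hence $\pi_k = N_k/N$.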

Mixture Model Implementation

Get the data

Download the famous old faithful dataset from here:

https://machlearn.gitlab.io/isml2017/tutorial/faithful.csv

The data is two-dimensional, giving the waiting time between eruptions and the duration of each eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
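If you prefer to fetch the file from within the notebook, here is a minimal sketch (assuming the URL above is still reachable; it saves the file as faithful.csv in the working directory so the loading cell below can find it):

In [ ]:
import os
import urllib.request

# download the CSV once; skip if it is already present locally
if not os.path.exists('faithful.csv'):
    urllib.request.urlretrieve(
        'https://machlearn.gitlab.io/isml2017/tutorial/faithful.csv',
        'faithful.csv',
    )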

In [ ]:
# load the old faithful dataset and standardise

data = np.loadtxt('faithful.csv', delimiter=',')
data_labels = ('Eruption length','Eruption wait')

data -= np.mean(data, axis=0)
data /= np.std(data, axis=0)
data = data.T

assert data.shape == (2, 272)

Plotting helper functions

In [ ]:
# plot_cov_ellipse was taken from here:
# http://www.nhsilbert.net/source/2014/06/bivariate-normal-ellipse-plotting-in-python/

def plot_cov_ellipse(cov, pos, volume=.5, ax=None, fc='none', ec=[0,0,0], a=1, lw=2):
    """
    Plots an ellipse enclosing *volume* based on the specified covariance
    matrix (*cov*) and location (*pos*). The keyword arguments fc, ec, a and lw are
    passed to the ellipse patch artist as facecolor, edgecolor, alpha and linewidth.

    Parameters
    ----------
        cov : The 2x2 covariance matrix to base the ellipse on
        pos : The location of the center of the ellipse. Expects a 2-element
            sequence of [x0, y0].
        volume : The volume inside the ellipse; defaults to 0.5
        ax : The axis that the ellipse will be plotted on. Defaults to the 
            current axis.
    """

    import numpy as np
    from scipy.stats import chi2
    import matplotlib.pyplot as plt
    from matplotlib.patches import Ellipse

    def eigsorted(cov):
        vals, vecs = np.linalg.eigh(cov)
        order = vals.argsort()[::-1]
        return vals[order], vecs[:,order]

    if ax is None:
        ax = plt.gca()

    vals, vecs = eigsorted(cov)
    theta = np.degrees(np.arctan2(*vecs[:,0][::-1]))

    kwrg = {'facecolor':fc, 'edgecolor':ec, 'alpha':a, 'linewidth':lw}

    # Width and height are "full" widths, not radius
    width, height = 2 * np.sqrt(chi2.ppf(volume,2)) * np.sqrt(vals)
    ellip = Ellipse(xy=pos, width=width, height=height, angle=theta, **kwrg)

    ax.add_artist(ellip)
    

def plot_components(mu, Sigma, colours, *args, **kwargs):
    '''
    Plot ellipses for the bivariate normals with mean mu[:,i] and covariance Sigma[:,:,i]
    '''
    assert mu.shape[1] == Sigma.shape[2]
    assert mu.shape[0] == 2
    assert Sigma.shape[0] == 2
    assert Sigma.shape[1] == 2
    for i in range(mu.shape[1]):
        kwargs['ec'] = colours[i]
        plot_cov_ellipse(Sigma[:,:,i], mu[:,i], *args, **kwargs)

import matplotlib.colors as mcol

br_cmap = mcol.LinearSegmentedColormap.from_list("MyCmapName",["b","r"])

def plot_data(redness=None):
    if redness is not None:
        assert len(redness) == data.shape[1]
        assert all(_ >= 0 and _ <= 1 for _ in redness)
        c = redness
    else:
        c = 'g'
    plt.figure(figsize=(8,8))
    plt.scatter(data[0,:],data[1,:], marker='.', s=8, linewidths=2, c=c, cmap=br_cmap)
    plt.xlabel(data_labels[0])
    plt.ylabel(data_labels[1])
    plt.axis([-2, 2, -2, 2])
    plt.gca().set_aspect('equal')
In [ ]:
# initialise the mixture components and plot

pi_0 = np.array([0.5,0.5])          # initial mixing coefficients
mu_0 = np.array([[-1,1],[1,-1]], dtype=float)   # columns are the initial component means
Sigma_0 = np.concatenate(           # shape (2, 2, K): one covariance matrix per component
    [np.eye(2).reshape(2,2,1) for _ in range(mu_0.shape[1])],
    axis=2,
)

plot_data()
plot_components(mu_0, Sigma_0, ['b','r'], 0.2)
plt.show()

Optimise the Gaussian Mixture with EM

Using the equations you have derived, along with the relevant material from the Bishop textbook, perform maximum likelihood estimation for $\mathbf{\mu}$, $\mathbf{\Sigma}$ and $\mathbf{\pi}$.

Make some nice plots. Note that the plot_data function takes a redness argument to control the colour of the scatter plot.

In [ ]:
# Solution goes here
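One possible way to structure the solution is sketched below. This is a minimal sketch only: it reuses data, pi_0, mu_0 and Sigma_0 from above, em_gmm is a name introduced here purely for illustration, and no convergence check is performed (a fixed number of iterations is used instead).

In [ ]:
from scipy.stats import multivariate_normal

def em_gmm(X, pi, mu, Sigma, n_iter=50):
    """EM for a Gaussian mixture. X is (D, N), pi is (K,), mu is (D, K), Sigma is (D, D, K)."""
    N = X.shape[1]
    K = len(pi)
    for _ in range(n_iter):
        # E step: responsibilities gamma[n, k] proportional to pi_k * N(x_n | mu_k, Sigma_k)
        gamma = np.zeros((N, K))
        for k in range(K):
            gamma[:, k] = pi[k] * multivariate_normal.pdf(X.T, mean=mu[:, k], cov=Sigma[:, :, k])
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M step: re-estimate the parameters with the responsibilities held fixed
        Nk = gamma.sum(axis=0)          # effective number of points per component
        pi = Nk / N
        mu = (X @ gamma) / Nk           # columns are the updated component means
        for k in range(K):
            diff = X - mu[:, [k]]       # (D, N) deviations from the k-th mean
            Sigma[:, :, k] = (gamma[:, k] * diff) @ diff.T / Nk[k]
    return pi, mu, Sigma, gamma

pi_em, mu_em, Sigma_em, gamma_em = em_gmm(data, pi_0.copy(), mu_0.copy(), Sigma_0.copy())

plot_data(redness=gamma_em[:, 1])       # colour each point by its responsibility for component 1
plot_components(mu_em, Sigma_em, ['b','r'], 0.2)
plt.show()

Monitoring the log likelihood after each iteration is a useful convergence check, since EM is guaranteed not to decrease it.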

Interpret the Previous Algorithm in terms of General EM

Refer to Bishop and/or the lecture slides, and explain with equations:

  1. the generative (data-generating) process assumed for the mixture of Gaussians,
  2. what the parameters of the model are,
  3. what the latent (unobserved) variables are and why they make maximum likelihood parameter estimation non-trivial,
  4. how EM provides a solution to the maximum likelihood estimation problem.

Make yourself comfortable with EM to the point that you could readily apply it to another model.

Solution description
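As a possible scaffold for your answer (a sketch of the standard decomposition from Bishop Chapter 9, not a complete solution): for any distribution $q(\mathbf{Z})$ over the latent variables, \begin{align} \ln p(\mathbf{X}|\mathbf{\theta}) &= \mathcal{L}(q,\mathbf{\theta}) + \mathrm{KL}\left(q \,\|\, p(\mathbf{Z}|\mathbf{X},\mathbf{\theta})\right), \\ \mathcal{L}(q,\mathbf{\theta}) &= \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X},\mathbf{Z}|\mathbf{\theta})}{q(\mathbf{Z})}. \end{align} The E step sets $q(\mathbf{Z}) = p(\mathbf{Z}|\mathbf{X},\mathbf{\theta}^{\text{old}})$, which for the Gaussian mixture amounts to computing the responsibilities $\gamma(z_{nk})$; the M step maximizes $\mathbb{E}_{\mathbf{Z}|\mathbf{X},\mathbf{\theta}^{\text{old}}}\left[\ln p(\mathbf{X},\mathbf{Z}|\mathbf{\theta})\right]$ with respect to $\mathbf{\theta}$, which for the Gaussian mixture gives exactly the updates from Exercises 9.8 and 9.9. Each iteration cannot decrease $\ln p(\mathbf{X}|\mathbf{\theta})$.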