Consider an experiment with two coins: coin 1 comes up heads with probability $\alpha_1$ and coin 2 with probability $\alpha_2$. We draw $m$ samples as follows: for each sample, pick one of the coins uniformly at random, flip it $n$ times, and record the number of heads. If we also recorded which coin was used for each sample, we would have complete information and could estimate $\alpha_1$ and $\alpha_2$ in closed form. To be very explicit, suppose we drew 5 samples with the numbers of heads collected in a vector $x$, and the sequence of coins chosen was 1, 2, 2, 1, 2. Knowing the sequence of coins, estimation is easy: for each coin, sum the number of heads over its samples and divide by its total number of tosses. If we do not know the sequence of coins, however, the problem is harder.
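For concreteness, here is a minimal sketch of the complete-information estimate for this scenario; the head counts are made-up values for illustration.

n = 10                  # tosses per sample (illustrative)
x = [3, 7, 8, 2, 6]     # number of heads per sample (made-up values)
z = [1, 2, 2, 1, 2]     # coin used for each sample, observed in this scenario

for j in (1, 2):
    heads = sum(xi for xi, zi in zip(x, z) if zi == j)
    tosses = n * z.count(j)
    print(f"alpha_{j} = {heads}/{tosses} = {heads / tosses:.3f}")

Without the coin labels we need a different approach, and we will first start by writing the likelihood function.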
The log-likelihood function is
$$ \begin{split} l(\theta) &= \sum_{i=1}^m \log p(x_i|\theta) \\ &= \sum_{i=1}^m \log\left(B(x_i|\ n,\alpha_1) + B(x_i|\ n,\alpha_2)\right) + \textsf{constant} \end{split} $$where $B(x|n,\alpha)$ denotes the binomial pmf and the constant is $m\log\frac{1}{2}$ from the uniform coin choice. We will now write the complete-data likelihood function, which assumes we observe both the number of heads and the coin identity. Let $z_i \in \{1, 2\}$ be the latent variable that identifies the coin used for sample $i$. The complete-data log-likelihood is
$$ \begin{split} l_c(\theta) &= \sum_{i=1}^m \log p(x_i, z_i|\theta) \\ &= \sum_{i=1}^m \log \prod_{j=1}^2\left(p(z_i=j)p(x_i|z_i=j)\right)^{\mathbb{I}_{\{z_i = j\}}} \\ &= \sum_{i=1}^m \sum_{j=1}^2 \mathbb{I}_{\{z_i = j\}}\log\left(p(z_i=j)p(x_i|z_i=j)\right) \end{split} $$This function cannot be computed because we do not observe $z_i$. We work around this difficulty by defining the expected complete-data log-likelihood:
$$ \begin{split} Q(\theta, \theta^{t-1}) &= E[l_c(\theta)|\mathcal{D}, \theta^{t-1}] \\ &= \sum_{i=1}^m \sum_{j=1}^2 E[\mathbb{I}_{\{z_i = j\}}|\mathcal{D}, \theta^{t-1}]\log\left(p(z_i=j)p(x_i|z_i=j)\right) \\ &= \sum_{i=1}^m \sum_{j=1}^2 p(z_i = j|x_i, \theta^{t-1})\log\left(\frac{1}{2}B(x_i|n, \alpha_j)\right) \end{split} $$The expectation is over the latent variables $z_i$ under the parameters $\theta^{t-1}$, conditional on the observed data $\mathcal{D} = \{x_i\}_{i=1}^m$; the factor $\frac{1}{2}$ appears because each coin is chosen with probability $p(z_i=j)=\frac{1}{2}$.
The E-step simply evaluates
$$ p(z_i = j|x_i, \theta^{t-1}) = \frac{\alpha_{j,t-1}^{x_i}(1-\alpha_{j, t-1})^{n-x_i}}{\alpha_{1,t-1}^{x_i}(1-\alpha_{1,t-1})^{n-x_i} + \alpha_{2,t-1}^{x_i}(1-\alpha_{2,t-1})^{n-x_i}} \triangleq r_{ij} $$where $r_{ij}$ is the probability that the $i$th observation $x_i$ came from coin $j$. The calculation is a simple application of Bayes' rule; the binomial coefficients and the $\frac{1}{2}$ prior cancel between numerator and denominator. Substituting this expression, we get $$ Q(\theta, \theta^{t-1}) = \sum_{i=1}^m \sum_{j=1}^2 r_{ij}\log\left(\frac{1}{2}B(x_i|n, \alpha_j)\right). $$
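As a quick numeric illustration of the E-step, consider made-up values $n = 10$, $x_i = 8$, and $\theta^{t-1} = (0.2, 0.7)$:

import numpy as np
from scipy.stats import binom

n, x_i = 10, 8                     # made-up values for illustration
theta_prev = [0.2, 0.7]            # hypothetical previous iterate
# unnormalized posteriors; the 1/2 prior and the binomial
# coefficient cancel when the row is normalized
w = np.array([binom.pmf(x_i, n, a) for a in theta_prev])
r = w / w.sum()
print(r)  # roughly [0.0003, 0.9997]: 8 heads almost surely came from coin 2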
The M-step is to compute
$$ \theta^t \in \mathrm{argmax}_{\theta}\ Q(\theta, \theta^{t-1}) $$Taking the derivative with respect to $\alpha_j$ and setting it to zero, we get
$$ \frac{\sum_{i=1}^m r_{ij} x_i}{\alpha_j} - \frac{\sum_{i=1}^m r_{ij}(n-x_i)}{1-\alpha_j} = 0. $$Solving, the maximizer is $$\alpha_j = \frac{\sum_{i=1}^m r_{ij}\ x_i/n}{\sum_{i=1}^m r_{ij}}.$$ This expression is very interpretable: $x_i/n$ is the per-sample estimate of the probability of heads, and the update averages these estimates with weights $r_{ij}$, the probability that sample $i$ came from coin $j$.
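To see the weighted-average interpretation concretely, here is a small sketch with hypothetical responsibilities $r_{i1}$ and made-up head counts:

import numpy as np

n = 10
x = np.array([3, 7, 8, 2, 6])                  # made-up head counts
r_1 = np.array([0.9, 0.1, 0.05, 0.95, 0.2])    # hypothetical r_{i1}
alpha_1 = np.sum(r_1 * x / n) / np.sum(r_1)
print(alpha_1)  # r-weighted average of per-sample estimates x_i/n, about 0.31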
It can be shown that the likelihood of the observed data is monotonically non-decreasing across EM iterations, which makes a convenient sanity check while debugging!
Reference implementation
import math as mth

import numpy as np
from scipy.stats import binom, bernoulli
# Define problem parameters
nc = 2 # number of coins
act_theta = [0.2, 0.7] # probability of success for coin 1 and coin 2
m, n = 20, 10 # m = number of coin picks, n = number of tosses for each coin pick
def gen_data():
    """Draw m samples: pick a coin uniformly at random, toss it n times."""
    x = []
    for _ in range(m):
        # pick one of the two coins uniformly at random
        p = act_theta[bernoulli.rvs(0.5)]
        x.append(binom.rvs(n, p))  # record number of heads in n tosses
    return x
def next_theta(theta):
    """
    One EM iteration: compute responsibilities (E-step), then the
    weighted-average update for each coin (M-step).
    """
    # E-step: r[i, j] = p(z_i = j | x_i, theta); the 1/2 prior and the
    # binomial coefficient cancel when each row is normalized
    r = np.zeros((m, nc))
    for i in range(m):
        r[i] = [binom.pmf(x[i], n, theta[j]) for j in range(nc)]
        r[i] = r[i] / r[i].sum()
    # M-step: alpha_j = sum_i r_ij * x_i / (n * sum_i r_ij)
    new_theta = [0.0, 0.0]
    for j in range(nc):
        new_theta[j] = sum(r[i, j] * x[i] for i in range(m)) / (n * r[:, j].sum())
    return new_theta
def log_likelihood(theta):
    """
    Observed-data log-likelihood for the given theta, up to the
    additive constant m*log(1/2).
    """
    return sum(mth.log(binom.pmf(k, n, theta[0]) + binom.pmf(k, n, theta[1]))
               for k in x)
def EM():
    """
    Run EM from a random starting point until the observed-data
    log-likelihood stops improving (or 500 iterations).
    """
    theta = np.random.rand(nc).tolist()
    obj = log_likelihood(theta)
    for _ in range(500):  # max_iter is 500
        theta = next_theta(theta)
        new_obj = log_likelihood(theta)
        # Sanity check: EM never decreases the observed-data likelihood
        # (small tolerance guards against floating-point noise)
        assert new_obj >= obj - 1e-10, "likelihood decreased!"
        # stopping criterion: negligible improvement
        if new_obj <= obj + 1e-7:
            break
        obj = new_obj
    return theta
x = gen_data()
print(x)
EM()
[6, 4, 2, 1, 7, 10, 2, 2, 3, 8, 6, 4, 1, 1, 2, 2, 8, 3, 6, 3]
[0.22917590030584753, 0.70378821853450535]
A subtle point is that the two entries of the estimated $\theta$ may come out in either order: the model is statistically unchanged if we swap which coin we call coin 1 and which we call coin 2, so the labels can switch between runs!
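If a fixed ordering is wanted, a simple convention is to sort the estimate, so that coin 1 always refers to the coin with the smaller heads probability:

theta_hat = sorted(EM())  # by convention, coin 1 has the smaller heads probability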