Consider an experiment with two coins: coin 1 comes up heads with probability $\alpha_1$ and coin 2 with probability $\alpha_2$. We draw $m$ samples as follows: for each sample, pick one of the coins uniformly at random, flip it $n$ times, and record the number of heads. If we also recorded which coin was used for each sample, we would have complete information and could estimate $\alpha_1$ and $\alpha_2$ in closed form. To be very explicit, suppose we drew 5 samples with the numbers of heads collected in a vector $x$, and the sequence of coins chosen was 1, 2, 2, 1, 2. Knowing the sequence of coins, estimation is easy: for each coin, sum the number of heads over its samples and divide by its total number of tosses. If we do not know the sequence of coins, however, the problem is harder.
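For concreteness, here is a minimal sketch of the complete-information estimate for this scenario; the head counts are made-up values for illustration.

n = 10                  # tosses per sample (illustrative)
x = [3, 7, 8, 2, 6]     # number of heads per sample (made-up values)
z = [1, 2, 2, 1, 2]     # coin used for each sample, observed in this scenario

for j in (1, 2):
    heads = sum(xi for xi, zi in zip(x, z) if zi == j)
    tosses = n * z.count(j)
    print(f"alpha_{j} = {heads}/{tosses} = {heads / tosses:.3f}")

Without the coin labels we need a different approach, and we will first start by writing the likelihood function.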
The log-likelihood function is
$$ \begin{split} l(\theta) &= \sum_{i=1}^m \log p(x_i|\theta) \\ &= \sum_{i=1}^m \log\left(B(x_i|\ n,\alpha_1) + B(x_i|\ n,\alpha_2)\right) + \textsf{constant} \end{split} $$where $B(x|n,\alpha)$ denotes the binomial pmf and the constant is $m\log\frac{1}{2}$ from the uniform coin choice. We will now write the complete-data likelihood function, which assumes we observe both the number of heads and the coin identity. Let $z_i \in \{1, 2\}$ be the latent variable that identifies the coin used for sample $i$. The complete-data log-likelihood is
$$ \begin{split} l_c(\theta) &= \sum_{i=1}^m \log p(x_i, z_i|\theta) \\ &= \sum_{i=1}^m \log \prod_{j=1}^2\left(p(z_i=j)p(x_i|z_i=j)\right)^{\mathbb{I}_{\{z_i = j\}}} \\ &= \sum_{i=1}^m \sum_{j=1}^2 \mathbb{I}_{\{z_i = j\}}\log\left(p(z_i=j)p(x_i|z_i=j)\right) \end{split} $$This function cannot be computed because we do not observe $z_i$. We work around this difficulty by defining the expected complete-data log-likelihood:
$$ \begin{split} Q(\theta, \theta^{t-1}) &= E[l_c(\theta)|\mathcal{D}, \theta^{t-1}] \\ &= \sum_{i=1}^m \sum_{j=1}^2 E[\mathbb{I}_{\{z_i = j\}}|\mathcal{D}, \theta^{t-1}]\log\left(p(z_i=j)p(x_i|z_i=j)\right) \\ &= \sum_{i=1}^m \sum_{j=1}^2 p(z_i = j|x_i, \theta^{t-1})\log\left(\frac{1}{2}B(x_i|n, \alpha_j)\right) \end{split} $$The expectation is over the latent variables $z_i$ under the parameters $\theta^{t-1}$, conditional on the observed data $\mathcal{D} = \{x_i\}_{i=1}^m$; the factor $\frac{1}{2}$ appears because each coin is chosen with probability $p(z_i=j)=\frac{1}{2}$.
The E-step simply evaluates
$$ p(z_i = j|x_i, \theta^{t-1}) = \frac{\alpha_{j,t-1}^{x_i}(1-\alpha_{j, t-1})^{n-x_i}}{\alpha_{1,t-1}^{x_i}(1-\alpha_{1,t-1})^{n-x_i} + \alpha_{2,t-1}^{x_i}(1-\alpha_{2,t-1})^{n-x_i}} \triangleq r_{ij} $$where $r_{ij}$ is the probability that the $i$th observation $x_i$ came from coin $j$. The calculation is a simple application of Bayes' rule; the binomial coefficients and the $\frac{1}{2}$ prior cancel between numerator and denominator. Substituting this expression, we get $$ Q(\theta, \theta^{t-1}) = \sum_{i=1}^m \sum_{j=1}^2 r_{ij}\log\left(\frac{1}{2}B(x_i|n, \alpha_j)\right). $$
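As a quick numeric illustration of the E-step, consider made-up values $n = 10$, $x_i = 8$, and $\theta^{t-1} = (0.2, 0.7)$:

import numpy as np
from scipy.stats import binom

n, x_i = 10, 8                     # made-up values for illustration
theta_prev = [0.2, 0.7]            # hypothetical previous iterate
# unnormalized posteriors; the 1/2 prior and the binomial
# coefficient cancel when the row is normalized
w = np.array([binom.pmf(x_i, n, a) for a in theta_prev])
r = w / w.sum()
print(r)  # roughly [0.0003, 0.9997]: 8 heads almost surely came from coin 2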
The M-step is to compute
$$ \theta^t \in \mathrm{argmax}_{\theta}\ Q(\theta, \theta^{t-1}) $$Taking the derivative with respect to $\alpha_j$ and setting it to zero, we get
$$ \frac{\sum_{i=1}^m r_{ij} x_i}{\alpha_j} - \frac{\sum_{i=1}^m r_{ij}(n-x_i)}{1-\alpha_j} = 0. $$Solving, the maximizer is $$\alpha_j = \frac{\sum_{i=1}^m r_{ij}\ x_i/n}{\sum_{i=1}^m r_{ij}}.$$ This expression is very interpretable: $x_i/n$ is the per-sample estimate of the probability of heads, and the update averages these estimates with weights $r_{ij}$, the probability that sample $i$ came from coin $j$.
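To see the weighted-average interpretation concretely, here is a small sketch with hypothetical responsibilities $r_{i1}$ and made-up head counts:

import numpy as np

n = 10
x = np.array([3, 7, 8, 2, 6])                  # made-up head counts
r_1 = np.array([0.9, 0.1, 0.05, 0.95, 0.2])    # hypothetical r_{i1}
alpha_1 = np.sum(r_1 * x / n) / np.sum(r_1)
print(alpha_1)  # r-weighted average of per-sample estimates x_i/n, about 0.31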
It can be shown that the likelihood of the observed data is monotonically non-decreasing across EM iterations, which makes a convenient sanity check while debugging!
Reference implementation
import math as mth

import numpy as np
from scipy.stats import binom, bernoulli
# Define problem parameters
nc = 2 # number of coins
act_theta = [0.2, 0.7] # probability of success for coin 1 and coin 2
m, n = 20, 10 # m = number of coin picks, n = number of tosses for each coin pick
def gen_data():
    """Draw m samples: pick a coin uniformly at random, toss it n times."""
    x = []
    for _ in range(m):
        # pick one of the two coins uniformly at random
        p = act_theta[bernoulli.rvs(0.5)]
        x.append(binom.rvs(n, p))  # record number of heads in n tosses
    return x
def next_theta(theta):
    """
    One EM iteration: compute responsibilities (E-step), then the
    weighted-average update for each coin (M-step).
    """
    # E-step: r[i, j] = p(z_i = j | x_i, theta); the 1/2 prior and the
    # binomial coefficient cancel when each row is normalized
    r = np.zeros((m, nc))
    for i in range(m):
        r[i] = [binom.pmf(x[i], n, theta[j]) for j in range(nc)]
        r[i] = r[i] / r[i].sum()
    # M-step: alpha_j = sum_i r_ij * x_i / (n * sum_i r_ij)
    new_theta = [0.0, 0.0]
    for j in range(nc):
        new_theta[j] = sum(r[i, j] * x[i] for i in range(m)) / (n * r[:, j].sum())
    return new_theta
def log_likelihood(theta):
    """
    Observed-data log-likelihood for the given theta, up to the
    additive constant m*log(1/2).
    """
    return sum(mth.log(binom.pmf(k, n, theta[0]) + binom.pmf(k, n, theta[1]))
               for k in x)
def EM():
    """
    Run EM from a random starting point until the observed-data
    log-likelihood stops improving (or 500 iterations).
    """
    theta = np.random.rand(nc).tolist()
    obj = log_likelihood(theta)
    for _ in range(500):  # max_iter is 500
        theta = next_theta(theta)
        new_obj = log_likelihood(theta)
        # Sanity check: EM never decreases the observed-data likelihood
        # (small tolerance guards against floating-point noise)
        assert new_obj >= obj - 1e-10, "likelihood decreased!"
        # stopping criterion: negligible improvement
        if new_obj <= obj + 1e-7:
            break
        obj = new_obj
    return theta
x = gen_data()
print(x)
EM()
[6, 4, 2, 1, 7, 10, 2, 2, 3, 8, 6, 4, 1, 1, 2, 2, 8, 3, 6, 3]
[0.22917590030584753, 0.70378821853450535]
A subtle point is that the two entries of the estimated $\theta$ may come out in either order: the model is statistically unchanged if we swap which coin we call coin 1 and which we call coin 2, so the labels can switch between runs!
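If a fixed ordering is wanted, a simple convention is to sort the estimate, so that coin 1 always refers to the coin with the smaller heads probability:

theta_hat = sorted(EM())  # by convention, coin 1 has the smaller heads probability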