This can be represented graphically as a **fully connected** directed graph having $K$ nodes, with each node having **incoming** links from all **lower** numbered nodes.

Now consider the general case where the graph is **not necessarily** fully connected. Let a given node $x_k$ have **incoming** links from the set of nodes $pa_k$, known as the *parent nodes* of $x_k$. Then the joint distribution defined by the directed acyclic graph over all nodes is given by the product of a conditional distribution for *each node conditioned on its parent nodes*. A graph with $K$ nodes has the joint distribution defined as

$$p\left(x_1, \ldots, x_K\right) = \prod_{k=1}^{K} p\left(x_k \mid pa_k\right)$$
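As an illustrative sketch (my own, not from the notes), this product of per-node conditionals can be evaluated for a toy chain $a \rightarrow b \rightarrow c$ with boolean variables; the node names and probability tables below are made up:

```python
# Sketch: evaluate the DAG factorization p(x_1,...,x_K) = prod_k p(x_k | pa_k)
# for a toy chain a -> b -> c with boolean variables (tables are illustrative).

# parents[node] lists the parent nodes; cpt[node] maps a tuple of parent
# values to P(node = 1 | parents).
parents = {"a": [], "b": ["a"], "c": ["b"]}
cpt = {
    "a": {(): 0.3},                # P(a = 1)
    "b": {(0,): 0.2, (1,): 0.9},   # P(b = 1 | a)
    "c": {(0,): 0.5, (1,): 0.7},   # P(c = 1 | b)
}

def joint(assignment):
    """P(assignment) as the product of each node's conditional given its parents."""
    p = 1.0
    for node, pa in parents.items():
        pa_vals = tuple(assignment[q] for q in pa)
        p1 = cpt[node][pa_vals]
        p *= p1 if assignment[node] == 1 else 1.0 - p1
    return p

# The joint must sum to 1 over all 2^3 assignments.
total = sum(joint({"a": a, "b": b, "c": c})
            for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

Storing one conditional table per node is exactly the factorization above: the full joint over $K$ boolean variables would need $2^K - 1$ numbers, while the tables only grow with the number of parents per node.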

Represent multiple nodes of the form $t_1, \ldots, t_N$ as a single node with a box, known as a *plate*, drawn around it and annotated with $N$, the number of repeated nodes.

Deterministic model parameters, e.g. the mean of a distribution, are drawn as solid circles, and random variables as open circles

Observed variables are drawn as shaded open circles

When this condition holds, using the product rule, we may write the joint distribution of $a$ conditioned on $b$ and $c$ as

$$p\left(a, b \mid c \right) = p\left(a\mid b,c\right)p\left(b\mid c\right) = p\left(a\mid c\right) p\left(b \mid c\right)$$

A path is said to be *blocked*, given the conditioning set $C$, if either of the following holds:

**(a)** There is a node, $n \in C$, in the path such that the arrows on the path meet head-to-tail or tail-to-tail at $n$

**(b)** There is a node, $n \notin C$ with none of its descendants in $C$, in the path such that the arrows meet head-to-head at the node. A node $d$ is considered to be a descendant of a node $p$ if there is a path from $p$ to $d$ in which each step is in the direction of the arrows connecting the nodes on the path.

Note that model parameters (e.g. the mean of a distribution) will always be tail-to-tail and therefore play no role in determining d-separation.
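Conditions (a) and (b) can be turned into a small path-blocking check. The sketch below is my own illustration (not from the lecture): it enumerates the undirected paths between two nodes of a toy DAG and applies the two blocking rules, so `d_separated` returns `True` exactly when every path is blocked:

```python
# Sketch (my own illustration): d-separation test via the two blocking rules.
# A DAG is given as {parent: [children, ...]}.

def descendants(graph, node):
    """All nodes reachable from `node` following the arrows."""
    seen, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def undirected_paths(graph, a, b):
    """All simple paths from a to b, ignoring arrow directions."""
    nbrs = {}
    for parent, children in graph.items():
        for child in children:
            nbrs.setdefault(parent, set()).add(child)
            nbrs.setdefault(child, set()).add(parent)
    paths = []
    def dfs(node, path):
        if node == b:
            paths.append(path)
            return
        for nxt in nbrs.get(node, ()):
            if nxt not in path:
                dfs(nxt, path + [nxt])
    dfs(a, [a])
    return paths

def path_blocked(graph, path, C):
    """True if some interior node blocks the path given conditioning set C."""
    for i in range(1, len(path) - 1):
        prev, n, nxt = path[i - 1], path[i], path[i + 1]
        head_to_head = n in graph.get(prev, []) and n in graph.get(nxt, [])
        if head_to_head:
            # rule (b): a collider blocks unless n or one of its descendants is in C
            if n not in C and not (descendants(graph, n) & C):
                return True
        elif n in C:
            # rule (a): a head-to-tail or tail-to-tail node in C blocks the path
            return True
    return False

def d_separated(graph, a, b, C):
    C = set(C)
    return all(path_blocked(graph, p, C) for p in undirected_paths(graph, a, b))
```

On the collider $a \rightarrow c \leftarrow b$, the path is blocked when $C$ is empty but opens when $c$ (or a descendant of $c$) is conditioned on; on the chain $a \rightarrow c \rightarrow b$, conditioning on $c$ blocks the path.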

where $\delta(x) = 1$ if $x$ is true, and $0$ otherwise

Approach:

$\arg\max_{\theta} E_{P(Z|X,\theta)}\left[\log P(X,Z\mid\theta) \right]$

using some probability distribution on $Z$, namely, $P(Z|X,\theta)$ - do we need a model for this distribution?


see slide at 29:59

EM is a general procedure for learning from partly observed data. Given observed variables, $X$, and unobserved variables, $Z$, define

$Q\left(\theta' \mid \theta \right) = E_{P(Z|X,\theta)} \left[\log P(X,Z \mid \theta') \right]$

Iterate until convergence:

E Step: Use $X$ and the current $\theta$ to calculate $P(Z|X,\theta)$; this is done for every variable in $Z$ for each training example.

M Step: Replace current $\theta$ by

$\theta \leftarrow \arg\max_{\theta'} Q\left(\theta' \mid \theta\right)$

using the E step result, plug into the last equation on the slide at 29:59 and pick the maximizing $\theta$
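As a minimal concrete sketch of this E/M loop (my own toy example, not from the lecture): estimate a coin's bias $\theta$ when some flips went unrecorded. The E step fills each missing flip with its expectation $E[Z] = \theta$ under the current parameters, and the M step maximizes $Q$ in closed form by treating those expectations as counts:

```python
# Toy EM sketch: estimate coin bias theta = P(heads) when some flips are missing.
# Observed flips are 0/1; None marks an unobserved flip (the hidden variable Z).
data = [1, 1, 0, 1, None, None]  # made-up data for illustration

theta = 0.5  # initial guess
for _ in range(50):
    # E step: expected value of each missing flip under the current theta
    expected = [x if x is not None else theta for x in data]
    # M step: argmax of Q in closed form, i.e. the MLE with expected counts
    theta = sum(expected) / len(expected)
```

In this particular model the iteration converges to the observed-data MLE ($\theta = 3/4$), since a missing flip carries no information about $\theta$; the point of the sketch is only the mechanics of alternating the two steps.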

Example: see slide 38:59, 50:59

More generally: Given observed variables $X$ and unobserved variables $Z$, all of which are boolean:

E Step: Calculate for each training example, $k$, the expected value of each unobserved variable in $Z$

M Step: Calculate estimates similar to MLE, but replacing each count (i.e. observed data proportions) by its expected count

$\delta(Y=1) \rightarrow E_{Z|X,\theta}[Y]$

$\delta(Y=0) \rightarrow 1 - E_{Z|X,\theta}[Y]$
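The expected-count recipe can be made concrete with the classic two-coin mixture (a standard textbook example, not from these slides): each training example is $n$ flips of one of two coins, the coin's identity $Z$ is hidden, and the M step is the usual MLE with $\delta$-counts replaced by the responsibilities $E[Z]$; the data and initial values below are illustrative:

```python
# EM for two coins with hidden identity: each example is n flips of coin A or B,
# with an (assumed) equal prior on which coin was picked.
# theta_a, theta_b are the heads probabilities of the two coins.

def em_two_coins(heads, n, theta_a, theta_b, iters=50):
    for _ in range(iters):
        # Expected counts of heads/tails attributed to each coin
        a_h = a_t = b_h = b_t = 0.0
        for h in heads:
            t = n - h
            # E step: responsibility E[Z] that coin A generated this example
            la = theta_a ** h * (1 - theta_a) ** t
            lb = theta_b ** h * (1 - theta_b) ** t
            r = la / (la + lb)
            a_h += r * h;       a_t += r * t
            b_h += (1 - r) * h; b_t += (1 - r) * t
        # M step: MLE with expected counts in place of observed delta-counts
        theta_a = a_h / (a_h + a_t)
        theta_b = b_h / (b_h + b_t)
    return theta_a, theta_b

# 5 sessions of 10 flips each; head counts are made up for illustration
ta, tb = em_two_coins([5, 9, 8, 4, 7], 10, 0.6, 0.5)
```

Each $E[Z]$ plays exactly the role of $\delta(Y=1)$ above: a fractional "soft count" assigning the example to coin A, with $1 - E[Z]$ assigning the remainder to coin B.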