Graphical Models

Probabilistic graphical models provide an intuitive way to express joint probability relationships. The nodes in a graph represent random variables, and the edges between nodes represent probabilistic relationships between variables.

Bayesian Networks

Here we consider directed acyclic graphical models that express the joint probability relationship between some set of variables, known as Bayesian Networks because of the prominent use of Bayes' Theorem. To begin, consider the joint probability distribution $p\left( x_1, \ldots, x_K \right)$ over $K$ variables. Using repeated application of the product rule, this can be expressed as

$$p \left(x_1, \ldots, x_K \right) = p \left(x_K \mid x_1, \ldots, x_{K-1} \right) \ldots p\left(x_2 \mid x_1 \right) p\left(x_1\right) $$

This can be represented graphically as a fully connected directed graph having $K$ nodes, with each node having incoming links from all lower numbered nodes.
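
For example, with $K = 3$ the factorization reads

$$p \left(x_1, x_2, x_3 \right) = p \left(x_3 \mid x_1, x_2 \right) p\left(x_2 \mid x_1 \right) p\left(x_1\right)$$

and the corresponding fully connected graph has edges $x_1 \rightarrow x_2$, $x_1 \rightarrow x_3$, and $x_2 \rightarrow x_3$.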

Now consider the general case where the graph is not necessarily fully connected. Let a given node $x_k$ have incoming links from the set of nodes $pa_k$, known as the parent nodes of $x_k$. Then the joint distribution defined by the directed acyclic graph over all nodes is given by the product of a conditional distribution for each node conditioned on its parent nodes. A graph with $K$ nodes has the joint distribution defined as

$$ p\left(\mathbf{x}\right) = \prod_{k=1}^K p\left(x_k \mid pa_k \right) $$
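
As an illustration, the following is a minimal sketch (not from the source; the network, CPT values, and variable names are made up) of evaluating this factorization for a small binary network $a \rightarrow c \leftarrow b$, where each node's conditional distribution is stored as a lookup table.

```python
# Minimal sketch: evaluate p(x) = prod_k p(x_k | pa_k) for a hypothetical
# binary network a -> c <- b. The CPT values below are made up.
p_a = {1: 0.3, 0: 0.7}              # p(a)
p_b = {1: 0.6, 0: 0.4}              # p(b)
p_c1_given_ab = {                   # p(c=1 | a, b)
    (0, 0): 0.1, (0, 1): 0.5,
    (1, 0): 0.4, (1, 1): 0.9,
}

def joint(a, b, c):
    """Joint probability p(a, b, c) as the product of per-node conditionals."""
    pc1 = p_c1_given_ab[(a, b)]
    pc = pc1 if c == 1 else 1.0 - pc1
    return p_a[a] * p_b[b] * pc

# Sanity check: the joint must sum to 1 over all assignments.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(joint(1, 0, 1), total)        # 0.048 and a total of 1.0 (up to float error)
```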

Graphical Notation

There are several conventions used to convey information in graphical models. They are briefly summarized here - see Bishop 363-365 for more detail.

  • Represent multiple nodes of the form $t_1, \ldots, t_N$ as a single node with a box, known as a plate, drawn around it and annotated with $N$, the number of repeated nodes.

  • Deterministic model parameters, e.g. the mean of a distribution, are drawn as solid circles, and random variables as open circles.

  • Observed random variables are drawn as shaded circles.

Conditional Independence

Consider three random variables $a, b, c$. We use the following notation to indicate that $a$ is conditionally independent of $b$ given $c$:

$$a \perp\!\!\!\perp b \mid c$$

When this condition holds, using the product rule, we may write the joint distribution of $a$ and $b$ conditioned on $c$ as

$$p\left(a, b \mid c \right) = p\left(a\mid b,c\right)p\left(b\mid c\right) = p\left(a\mid c\right) p\left(b \mid c\right)$$
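
As a quick numeric check (a hypothetical construction, not from the source), if we build a joint distribution directly in the factored form $p(c)\,p(a \mid c)\,p(b \mid c)$, the identity above holds by construction and can be verified explicitly:

```python
# Hypothetical binary distributions constructed so that a and b are
# conditionally independent given c: p(a, b, c) = p(c) p(a|c) p(b|c).
p_c = {1: 0.5, 0: 0.5}
p_a1_given_c = {1: 0.8, 0: 0.2}     # p(a=1 | c)
p_b1_given_c = {1: 0.7, 0: 0.3}     # p(b=1 | c)

def p(a, b, c):
    pa = p_a1_given_c[c] if a == 1 else 1 - p_a1_given_c[c]
    pb = p_b1_given_c[c] if b == 1 else 1 - p_b1_given_c[c]
    return p_c[c] * pa * pb

for c in (0, 1):
    # p(a=1, b=1 | c) should equal p(a=1 | c) * p(b=1 | c)
    lhs = p(1, 1, c) / p_c[c]
    p_a1 = (p(1, 0, c) + p(1, 1, c)) / p_c[c]   # marginal p(a=1 | c)
    p_b1 = (p(0, 1, c) + p(1, 1, c)) / p_c[c]   # marginal p(b=1 | c)
    assert abs(lhs - p_a1 * p_b1) < 1e-12
```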

D-separation

Given the description of conditional independence above, we now consider a general directed acyclic graph. Let $A$, $B$, and $C$ be arbitrary sets of non-intersecting nodes. We wish to determine whether the graph structure implies the conditional independence $A \perp\!\!\!\perp B \mid C$. It turns out this condition is satisfied if all possible paths from any node in $A$ to any node in $B$ are blocked, in which case $A$ is said to be d-separated from $B$; a code sketch of this test is given after the list below. A path is considered blocked if either of the following holds:

(a) There is a node $n \in C$ in the path such that the arrows on the path meet head-to-tail or tail-to-tail at $n$.

(b) There is a node $n \notin C$, with none of its descendants in $C$, in the path such that the arrows meet head-to-head at $n$. A node $d$ is considered to be a descendant of a node $p$ if there is a path from $p$ to $d$ in which each step follows the direction of the arrows connecting the nodes on the path.

Note that model parameters (e.g. the mean of a distribution) will always be tail-to-tail and therefore play no role in determining d-separation.
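
The blocking rules (a) and (b) can be checked mechanically. Below is a brute-force sketch (not from the source; the example graph is hypothetical) that enumerates all undirected paths between two node sets and tests each interior node against the two rules:

```python
# Brute-force d-separation test for small graphs, following rules (a), (b).
from itertools import product

edges = {("a", "c"), ("b", "c"), ("c", "d")}   # a -> c <- b, c -> d

def descendants(node):
    """All nodes reachable from `node` following arrow directions."""
    out, stack = set(), [node]
    while stack:
        n = stack.pop()
        for (p, ch) in edges:
            if p == n and ch not in out:
                out.add(ch)
                stack.append(ch)
    return out

def undirected_paths(x, y, visited=()):
    """All simple paths from x to y, ignoring arrow direction."""
    if x == y:
        yield visited + (x,)
        return
    for (p, ch) in edges:
        for (u, v) in ((p, ch), (ch, p)):
            if u == x and v not in visited + (x,):
                yield from undirected_paths(v, y, visited + (x,))

def blocked(path, C):
    """True if some interior node blocks the path under rule (a) or (b)."""
    for i in range(1, len(path) - 1):
        prev, n, nxt = path[i - 1], path[i], path[i + 1]
        head_to_head = (prev, n) in edges and (nxt, n) in edges
        if head_to_head:
            if n not in C and not (descendants(n) & C):
                return True      # rule (b): collider not in C, no descendant in C
        elif n in C:
            return True          # rule (a): head-to-tail or tail-to-tail node in C
    return False

def d_separated(A, B, C):
    return all(blocked(p, C) for x, y in product(A, B) for p in undirected_paths(x, y))

print(d_separated({"a"}, {"b"}, set()))    # True: the collider c blocks the path
print(d_separated({"a"}, {"b"}, {"d"}))    # False: conditioning on descendant d unblocks c
```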


Training a Bayes Net

Maximum likelihood estimation (MLE) for training a Bayes net means estimating the $\theta$ parameters for the nodes, which is equivalent to maximizing the likelihood of the data. For example, suppose the probability of a node $s$ is conditioned on two parents $f$ and $a$; then, given $K$ training examples, we have (TODO: generalize this to $N$ conditioning variables)

$$\theta_{s|ij} = \frac{\sum_{k=1}^K \delta \left(f_k = i, a_k =j, s_k =1 \right)}{\sum_{k=1}^K \delta \left(f_k=i, a_k=j \right)}$$

where $\delta(x) = 1$ if $x$ is true and $0$ otherwise.
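
A minimal sketch of this counting estimate in code (the training examples and the field names f, a, s below are made up, assuming fully observed binary data):

```python
# Sketch of the MLE counting estimate theta_{s|ij} for a node s with
# parents f and a, from fully observed binary data (made-up examples).
data = [
    {"f": 1, "a": 0, "s": 1},
    {"f": 1, "a": 0, "s": 0},
    {"f": 1, "a": 0, "s": 1},
    {"f": 0, "a": 1, "s": 0},
]

def theta_s_given(i, j):
    """MLE of P(s=1 | f=i, a=j): count of matching examples with s=1,
    divided by count of examples with f=i, a=j."""
    num = sum(1 for d in data if d["f"] == i and d["a"] == j and d["s"] == 1)
    den = sum(1 for d in data if d["f"] == i and d["a"] == j)
    return num / den if den else None   # undefined if this parent config never occurs

print(theta_s_given(1, 0))   # 2/3
```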

Partially Observed Data

The MLE approach above can't be used directly because some of the data are unobserved. Let $X$ be all observed variable values and $Z$ be all unobserved variables. Assume we always observe (or fail to observe) the same variables; it is possible to generalize this.

Approach:

$\arg\max_{\theta} E_{P(Z \mid X,\theta)}\left[\log P(X, Z \mid \theta) \right]$

using some probability distribution on $Z$, namely, $P(Z|X,\theta)$ - do we need a model for this distribution?

Expectation Maximization (EM) Algorithm

EM is guaranteed to find a local maximum of the expected log likelihood above:

$\arg\max_{\theta} E_{P(Z \mid X,\theta)}\left[\log P(X, Z \mid \theta) \right]$

see slide at 29:59

EM is a general procedure for learning from partly observed data. Given observed variables, $X$, and unobserved variables, $Z$, define

$Q\left(\theta' \mid \theta \right) = E_{P(Z \mid X,\theta)} \left[\log P(X, Z \mid \theta') \right]$

Iterate until convergence:

  • E Step: Use $X$ and the current $\theta$ to calculate $P(Z \mid X,\theta)$; this is done for every variable in $Z$ for each training example.

  • M Step: Replace current $\theta$ by

$\theta \leftarrow \arg\max_{\theta'} Q\left(\theta' \mid \theta \right)$

Using the distribution computed in the E step, plug into the expression for $Q$ (the last equation on slide 29:59) and pick the maximizing $\theta'$.

Example: see slide 38:59, 50:59

More generally: given observed variables $X$ and unobserved variables $Z$, all of which are Boolean:

  • E Step: Calculate for each training example, $k$, the expected value of each unobserved variable in $Z$

  • M Step: Calculate estimates similar to MLE, but replacing each count (i.e. observed data proportions) by its expected count

$\delta(Y=1) \rightarrow E_{Z|X,\theta}[Y]$

$\delta(Y=0) \rightarrow 1 - E_{Z|X,\theta}[Y]$
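
To make the E and M steps concrete, here is a small sketch (not from the lecture; the model, parameter names, and data are hypothetical) of EM for a network with one hidden Boolean variable $Z$ and two observed Boolean children $X_1, X_2$, where the M step replaces counts with the expected counts described above:

```python
# Sketch of EM for a hypothetical Bayes net with a hidden Boolean Z and
# observed Boolean children X1, X2:
#   P(Z=1) = pi,  P(X1=1 | Z=z) = a[z],  P(X2=1 | Z=z) = b[z]
def bern(p, x):
    return p if x == 1 else 1.0 - p

def em(data, pi=0.6, a=(0.3, 0.7), b=(0.4, 0.8), iters=50):
    a, b = list(a), list(b)
    for _ in range(iters):
        # E step: responsibility gamma_k = P(Z=1 | x1_k, x2_k, theta)
        gammas = []
        for x1, x2 in data:
            p1 = pi * bern(a[1], x1) * bern(b[1], x2)
            p0 = (1 - pi) * bern(a[0], x1) * bern(b[0], x2)
            gammas.append(p1 / (p1 + p0))
        # M step: MLE-style ratios with expected counts E[Z] in place of
        # delta(Z=1); assumes both expected counts n1, n0 are nonzero.
        n1 = sum(gammas)
        n0 = len(data) - n1
        pi = n1 / len(data)
        a[1] = sum(g * x1 for g, (x1, _) in zip(gammas, data)) / n1
        a[0] = sum((1 - g) * x1 for g, (x1, _) in zip(gammas, data)) / n0
        b[1] = sum(g * x2 for g, (_, x2) in zip(gammas, data)) / n1
        b[0] = sum((1 - g) * x2 for g, (_, x2) in zip(gammas, data)) / n0
    return pi, a, b

# Made-up observations of (x1, x2); Z is never observed.
data = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1), (0, 0)]
print(em(data))
```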

Example: Linear Gaussian Model

TODO: A useful example is the linear-Gaussian model; see Bishop pages 370-372.