Bayesian Machine Learning and Information Processing (5SSD0)

  • In this notebook, we provide a set of exercises that should help you prepare for the exam.

  • Feel free to make use of Sam Roweis' cheat sheets for Matrix identities and Gaussian identities.

    • These cheat sheets are not allowed at the exam but if needed we will provide the necessary formulas at the exam sheet.

Machine Learning Overview

Probability Theory Review

  • Proof that $p(A|I) + p(\bar{A}|I) = 1$ follows from the sum rule.

  • Proof that

  • Is it more correct to speak about the likelihood of a model (or model parameters) than about the likelihood of an observed data set. And why?

  • Is a speech signal a 'probabilistic' (random) or a deterministic signal?

  • A dark bag contains five red balls and seven green ones. (a) What is the probability of drawing a red ball on the first draw? Balls are not returned to the bag after each draw. (b) If you know that on the second draw the ball was a green one, what is now the probability of drawing a red ball on the first draw?

  • A bag contains one ball, known to be either white or black. A white ball is put in, the bag is shaken, and a ball is drawn out, which proves to be white. What is now the chance of drawing a white ball?

  • Proof that, the mean and (co-)variance of any distribution $p(x)$ is processed by a linear tranformation as $$\begin{align} \mathrm{E}[Ax +b] &= A\mathrm{E}[x] + b \tag{SRG-3a}\\ \mathrm{cov}[Ax +b] &= A\,\mathrm{cov}[x]\,A^T \tag{SRG-3b} \end{align}$$

  • Proof that, for any distribution of $x$ and $y$ and $z=x+y$ $$\begin{align*} \mathrm{E}[z] &= \mathrm{E}[x] + \mathrm{E}[y] \\ \mathrm{var}[z] &= \mathrm{var}[x] + \mathrm{var}[y] + 2\mathrm{cov}[x,y] \end{align*}$$

  • $$\begin{align*} \mathrm{E}[z] &= \int_z z \left[\int_x p_x(x)p_y(z-x) \,\mathrm{d}{x} \right] \,\mathrm{d}{z} \\ &= \int_x p_x(x) \left[ \int_z z p_y(z-x)\,\mathrm{d}{z} \right] \,\mathrm{d}{x} \\ &= \int_x p_x(x) \left[ \int_{y^\prime} (y^\prime +x)p_y(y^\prime)\,\mathrm{d}{y^\prime} \right] \,\mathrm{d}{x} \notag\\ &= \int_x p_x(x) \left( \mathrm{E}[y]+x \right) \,\mathrm{d}{x} \notag\\ &= \mathrm{E}[x] + \mathrm{E}[y] \qquad \text{(always; follows from SRG-3a)} \end{align*}$$

Bayesian Machine Learning

  • Proof that the Bayes estimate minimizes the expected mean-square error, i.e., proof that
$$ \hat \theta_{bayes} = \arg\min_{\hat \theta} \int_\theta (\hat \theta -\theta)^2 p \left( \theta |D \right) \,\mathrm{d}{\theta} $$
  • on the Bayes factor ...

Continuous Data and the Gaussian Distribution

  • Proof that a linear transformation $z=Ax+b$ of a Gaussian variable $\mathcal{N}(x|\mu,\Sigma)$ is Gaussian distributed as $$ p(z) = \mathcal{N} \left(z \,|\, A\mu+b, A\Sigma A^T \right) \tag{SRG-4a} $$
  • Given independent variables $x \sim \mathcal{N}(\mu_x,\sigma_y^2)$ and $y \sim \mathcal{N}(\mu_y,\sigma_y^2)$, what is the PDF for $z = A\cdot(x -y) + b$ ?
  • (Answer): $z$ is also Gaussian with $$ p_z(z) = \mathcal{N}(z \,|\, A(\mu_x-\mu_y)+b, \, A (\sigma_x^2 + \sigma_y^2) A^T) $$

  • Show that Eq.SRG-8 is a special case of Eq.SRG-4a.

  • Proof $$ p(x,\theta) = \mathcal{N} \left( \begin{bmatrix} x\\ \theta \end{bmatrix} \,\left|\, \begin{bmatrix} \mu_0\\ \mu_0\end{bmatrix},
       \begin{bmatrix} \sigma_0^2+\sigma^2  & \sigma_0^2\\ 
       \sigma_0^2 &\sigma_0^2 
    \end{bmatrix} \right. \right) $$
  • Look up conditioning and marginalization in canonical coordinates and compare to the formulas for the moment parameterization of the Gaussian. Any conclusions?

  • Derive $$\begin{align*} p(\theta|x) &= \mathcal{N} \left( \theta\,|\,\mu_1, \sigma_1^2 \right)\,, \end{align*}$$ with $$\begin{align*} K &= \frac{\sigma_0^2}{\sigma_0^2+\sigma^2} \qquad \text{($K$ is called: Kalman gain)}\\ \mu_1 &= \mu_0 + K \cdot (x-\mu_0)\\ \sigma_1^2 &= \left( 1-K \right) \sigma_0^2 \end{align*}$$

Discrete Data and the Multinomial Distribution

  • The probability that we throw the $k$th face at the next toss was $$\begin{align*} p(x_{\bullet,k}=1|D) = \frac{m_k + \alpha_k }{ N+ \sum_k \alpha_k} \end{align*}$$ Does this answer make sense?

  • Verify for yourself that (Exercise):

    • the categorial distribution is a special case of the multinomial for $N=1$.
    • the Bernoulli is a special case of the categorial distribution for $K=2$.
    • the binomial is a special case of the multinomial for $K=2$.
  • Show that Laplace's generalized rule of succession can be worked out to a prediction that is composed of a prior prediction and data-based correction term.

  • We didn't use a co-variance matrix for discrete data. Why?


  • Let's work out the log-likelihood for multiple observations $$\begin{align*} \log p(D|w) &\stackrel{\text{IID}}{=} \sum_n \log \mathcal{N}(y_n|\,w^T x_n,\sigma^2) \propto -\frac{1}{2\sigma^2} \sum_{n} {(y_n - w^T x_n)^2}\\ &= -\frac{1}{2\sigma^2}\left( {y - \mathbf{X}w } \right)^T \left( {y - \mathbf{X} w } \right) \end{align*}$$ where we defined $N\times 1$ vector $y = \left(y_1 ,y_2 , \ldots ,y_N \right)^T$ and $(N\times D)$-dim matrix $\mathbf{X} = \left( x_1 ,x_2 , \ldots ,x_n \right)^T$.
    • Now, we want to apply the trained model. New data points can be predicted by $$\begin{equation*} p(y_\bullet \,|\, x_\bullet,\hat w_{\text{ML}}) = \mathcal{N}(y_\bullet \,|\, \hat w_{\text{ML}}^T x_\bullet, \sigma^2 ) \end{equation*}$$
  • Note that the expected value of a predicted new data point

$$ \mathrm{E}[y_\bullet] = \hat w_{\text{ML}}^T x_\bullet = x_\bullet^T \hat{w}_{\text{ML}} = \left( x_\bullet^T \mathbf{X}^\dagger \right) y $$

can also be expressed as a linear combination of the observed data points

$$y = \left( {y_1 ,y_1 , \ldots ,y_N } \right)^T \,.$$


Latent Variable Models and Variational Bayes

  1. Given the free energy functional $ F[q] = \sum_z q(z) \log \frac{q(z)}{p(x,z)}$, proof the EE, DE and AC decompositions.

  2. Bishop exercise 9.3

  • Unfortunately, the KL functional is not computable in this form because it contains the Bayesian posterior $p(z|x)$ to which we have no access.

Factor Graphs

  • (Exercise) Derive now that the message coming from the open end of a half-edge always equals $1$.

  • (Exercise) Verfiy for yourself that all maginals in a cycle-free graph (a tree) can be computed exactly by starting with messages at the terminals and working towards the root of the tree.

  • (Ex.1) Reflect on the fact that we now have methods for both marginalization and processing observations in FFGs. In principle, we are sufficiently equipped to do inference in probabilistic models through message passing. Draw the graph for $$p(x_1,x_2,x_3)=f_a(x_1)\cdot f_b(x_1,x_2)\cdot f_c(x_2,x_3)$$ and show which boxes need to be closed for computing $p(x_1|x_2)$.

  • (Ex.2) Consider a variable $X$ with measurements $D=\{x_1,x_2\}$. We assume the following model for $X$: $$\begin{align*} p(D,\theta) &= p(\theta)\cdot \prod_{n=1}^2 p(x_n|\theta) \\ p(\theta) &= \mathcal{N}(\theta \mid 0,1) \\ p(x_n \mid\theta) &= \mathcal{N}(x_n \mid \theta,1) \end{align*}$$

    • Draw the factor graph and infer $\theta$ through the Sum-Product Algorithm.

Dynamic Models

  • Show that in a Markovian state-space model, the observation sequence $x^T$ is not a first-order Markov chain, i.e., show that for the model $$\begin{align*} p(x^T,z^T) &= p(z_1) \prod_{t=2}^T p(z_t\,|\,z_{t-1})\,\prod_{t=1}^T p(x_t\,|\,z_t) \end{align*}$$ the following statement holds: $$p(x_t\,|\,x_{t-1},x_{t-2}) \neq p(x_t\,|\,x_{t-1})\,.$$ In other words, the latent variables $z_t$ represent a memory bank for past observations beyond $t-1$.

Intelligent Agents and Active Inference

In [1]:
open("../../styles/aipstyle.html") do f
    display("text/html", read(f,String))
In [ ]: