Let $\mathbf{X} \in \mathbb{R}^{n \times d}$ be the matrix of input vectors, $\mathbf{y} \in \mathbb{R}^{n \times 1}$ be the vector of targets, and $\boldsymbol{\theta} \in \mathbb{R}^{d \times 1}$ be the vector of weights. Assume that the likelihood is a Gaussian: $$p\left(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta}, \mathbf\Sigma \right) = \mathcal{N}(\mathbf{y} | \mathbf{X}\boldsymbol{\theta}, \mathbf{\Sigma}) = |2\pi\mathbf\Sigma|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2} \left(\mathbf{y}-\mathbf{X}\boldsymbol\theta\right)^T \mathbf\Sigma^{-1} \left(\mathbf{y}-\mathbf{X}\boldsymbol\theta\right) \right\}$$ where $\mathbf{\Sigma} \in \mathbb{R}^{n \times n}$ is the covariance matrix, which we assume is known. Assume also that the prior for $\boldsymbol{\theta}$ is a Gaussian: $$p\left(\boldsymbol{\theta}\right) = \mathcal{N}(\boldsymbol\theta | \mathbf{0}, \mathbf{\Delta}) = |2\pi\mathbf{\Delta}|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2} \boldsymbol{\theta}^T \mathbf\Delta^{-1} \boldsymbol{\theta}\right\}$$ where $\mathbf{\Delta} \in \mathbb{R}^{d \times d}$ is the covariance matrix.
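As a quick sanity check of this setup, the following sketch (assuming NumPy, with arbitrary illustrative sizes $n=50$, $d=3$ and arbitrary choices for $\mathbf\Sigma$ and $\mathbf\Delta$ that are not part of the exercise) samples $\boldsymbol\theta$ from the prior and $\mathbf{y}$ from the likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3                       # illustrative sizes, not from the exercise

X = rng.normal(size=(n, d))        # input matrix
Sigma = 0.5 * np.eye(n)            # known noise covariance (chosen diagonal here)
Delta = 2.0 * np.eye(d)            # prior covariance of the weights

theta = rng.multivariate_normal(np.zeros(d), Delta)  # theta ~ N(0, Delta)
y = rng.multivariate_normal(X @ theta, Sigma)        # y | X, theta ~ N(X theta, Sigma)
```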

Exercise 1

Then the posterior for $\boldsymbol{\theta}$ is: $$p\left( \boldsymbol{\theta} | \mathbf{y}, \mathbf{X}, \mathbf{\Sigma} \right) \propto p\left(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta}, \mathbf\Sigma \right) p\left(\boldsymbol{\theta}\right) \propto \exp\left\{ -\frac{1}{2} \boldsymbol{\theta}^T \left(\mathbf{X}^T\mathbf{\Sigma}^{-1}\mathbf{X} + \mathbf{\Delta}^{-1}\right) \boldsymbol{\theta} + \boldsymbol{\theta}^T\mathbf{X}^T\mathbf{\Sigma}^{-1}\mathbf{y} \right\}$$

Now if we want the posterior of $\boldsymbol\theta$ to be a Gaussian of the form: $$\mathcal{N}\left( \boldsymbol\theta | \boldsymbol{\theta}_n, \mathbf{V}_n \right) = |2\pi\mathbf{V}_n|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2} \left(\boldsymbol\theta - \boldsymbol{\theta}_n\right)^T \mathbf{V}_{n}^{-1} \left(\boldsymbol\theta - \boldsymbol{\theta}_n\right) \right\} \propto \exp\left\{ -\frac{1}{2} \boldsymbol{\theta}^T\mathbf{V}_{n}^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\mathbf{V}_{n}^{-1}\boldsymbol{\theta}_n \right\}$$ we have to equate: $$\mathbf{V}_n^{-1} = \mathbf{X}^T\mathbf{\Sigma}^{-1}\mathbf{X} + \mathbf{\Delta}^{-1}$$ and $$\mathbf{V}_n^{-1}\boldsymbol{\theta}_n = \mathbf{X}^T\mathbf{\Sigma}^{-1}\mathbf{y} $$

Thus we can write the posterior for $\boldsymbol\theta$ as: $$p\left( \boldsymbol{\theta} | \mathbf{y}, \mathbf{X}, \mathbf{\Sigma} \right) = \mathcal{N}\left( \boldsymbol\theta | \boldsymbol{\theta}_n, \mathbf{V}_n \right)$$ where $\mathbf{V}_n^{-1} = \mathbf{X}^T\mathbf{\Sigma}^{-1}\mathbf{X} + \mathbf{\Delta}^{-1}$ and $\boldsymbol{\theta}_n = \left( \mathbf{X}^T\mathbf{\Sigma}^{-1}\mathbf{X} + \mathbf{\Delta}^{-1} \right)^{-1} \mathbf{X}^T\mathbf{\Sigma}^{-1}\mathbf{y}$
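A minimal NumPy sketch of this posterior update, reusing the illustrative setup above (sizes and covariances are arbitrary choices, not part of the exercise):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
Sigma = 0.5 * np.eye(n)
Delta = 2.0 * np.eye(d)
theta_true = rng.multivariate_normal(np.zeros(d), Delta)
y = rng.multivariate_normal(X @ theta_true, Sigma)

Sigma_inv = np.linalg.inv(Sigma)

# V_n^{-1} = X^T Sigma^{-1} X + Delta^{-1}  and  theta_n = V_n X^T Sigma^{-1} y
Vn_inv = X.T @ Sigma_inv @ X + np.linalg.inv(Delta)
Vn = np.linalg.inv(Vn_inv)
theta_n = Vn @ X.T @ Sigma_inv @ y

print(theta_true)   # the weights that generated the data
print(theta_n)      # posterior mean, close to theta_true for moderate n
```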

The ridge estimator for $\boldsymbol\theta$ is given by $$\hat{\boldsymbol{\theta}}_R = \left( \mathbf{X}^T\mathbf{X} + \delta^2I_d \right)^{-1} \mathbf{X}^T\mathbf{y}$$ It equals the posterior mean $\boldsymbol{\theta}_n$ when $\mathbf\Sigma = \sigma^2I_n$ (i.e. the observations are uncorrelated and have the same variance), $\mathbf{\Delta}=\tau^2I_d$ (i.e. the weights are a priori uncorrelated and have the same variance) and $\delta^2 = \sigma^2/\tau^2$. Indeed, in this case the posterior mean becomes $$\boldsymbol{\theta}_n = \left( \frac{1}{\sigma^2} \mathbf{X}^T\mathbf{X} + \frac{1}{\tau^2}I_d \right)^{-1} \frac{1}{\sigma^2}\mathbf{X}^T\mathbf{y} = \left( \mathbf{X}^T\mathbf{X} + \frac{\sigma^2}{\tau^2}I_d \right)^{-1} \mathbf{X}^T\mathbf{y}$$
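A quick numerical check of this equivalence (a sketch assuming NumPy; the values of $\sigma^2$ and $\tau^2$ are arbitrary), with the ridge penalty set to $\delta^2 = \sigma^2/\tau^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
sigma2, tau2 = 0.5, 2.0            # illustrative noise and prior variances

# Posterior mean with Sigma = sigma2 * I_n and Delta = tau2 * I_d
Vn_inv = X.T @ X / sigma2 + np.eye(d) / tau2
theta_n = np.linalg.solve(Vn_inv, X.T @ y / sigma2)

# Ridge estimator with delta^2 = sigma2 / tau2
theta_ridge = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(d), X.T @ y)

assert np.allclose(theta_n, theta_ridge)
```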

The maximum likelihood estimator for $\boldsymbol\theta$ is given by $$\hat{\boldsymbol{\theta}}_{ML} = \left( \mathbf{X}^T\mathbf{X} \right)^{-1} \mathbf{X}^T\mathbf{y}$$ It equals the posterior mean $\boldsymbol{\theta}_n$ when $\mathbf\Sigma = \sigma^2I_n$ (i.e. the observations are uncorrelated and have the same variance) and $\mathbf{\Delta}^{-1}=0$ (i.e. the prior variance tends to infinity, so the prior becomes uninformative). Indeed, in this case the posterior mean becomes $$\boldsymbol{\theta}_n = \left( \frac{1}{\sigma^2} \mathbf{X}^T\mathbf{X} \right)^{-1} \frac{1}{\sigma^2}\mathbf{X}^T\mathbf{y} = \left( \mathbf{X}^T\mathbf{X} \right)^{-1} \mathbf{X}^T\mathbf{y}$$
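Similarly, a sketch (again NumPy with arbitrary sizes) showing that a very broad prior, i.e. $\mathbf{\Delta}^{-1} \approx 0$, recovers the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
sigma2 = 0.5

# Posterior mean with an almost uninformative prior (tau2 very large)
tau2 = 1e12
Vn_inv = X.T @ X / sigma2 + np.eye(d) / tau2
theta_n = np.linalg.solve(Vn_inv, X.T @ y / sigma2)

# Maximum likelihood / ordinary least squares estimate
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)

assert np.allclose(theta_n, theta_ml, atol=1e-6)
```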

Exercise 2

We now calculate the maximum likelihood estimate for $\mathbf{\Sigma}$, treating $\boldsymbol\theta$ as fixed. To do so we differentiate the log-likelihood with respect to $\mathbf{\Sigma}^{-1}$, keeping only the terms that depend on it:

$$\frac{\partial}{\partial{\mathbf{\Sigma}^{-1}}} \left[ -\frac{1}{2}\mathbf{y}^T\mathbf{\Sigma}^{-1}\mathbf{y} + \boldsymbol{\theta}^T\mathbf{X}^T\mathbf{\Sigma}^{-1}\mathbf{y} - \frac{1}{2} \boldsymbol{\theta}^T\mathbf{X}^T\mathbf{\Sigma}^{-1}\mathbf{X}\boldsymbol{\theta} + \frac{1}{2} \log{|\mathbf{\Sigma}^{-1}|} \right]$$$$=-\frac{1}{2}\mathbf{y}\mathbf{y}^T + \frac{1}{2}\left(\mathbf{y}\boldsymbol{\theta}^T\mathbf{X}^T + \mathbf{X}\boldsymbol{\theta}\mathbf{y}^T\right) - \frac{1}{2} \mathbf{X}\boldsymbol{\theta}\boldsymbol{\theta}^T\mathbf{X}^T + \frac{1}{2}\mathbf{\Sigma}$$ where the derivative of the cross term is written in symmetric form because $\boldsymbol{\theta}^T\mathbf{X}^T\mathbf{\Sigma}^{-1}\mathbf{y}$ is a scalar, hence equal to its transpose $\mathbf{y}^T\mathbf{\Sigma}^{-1}\mathbf{X}\boldsymbol{\theta}$.

Setting the derivative of the log-likelihood to zero gives $$\mathbf{\Sigma}_{ML} = \mathbf{yy}^T - \mathbf{y}\boldsymbol{\theta}^T\mathbf{X}^T - \mathbf{X}\boldsymbol{\theta}\mathbf{y}^T + \mathbf{X}\boldsymbol{\theta\theta}^T\mathbf{X}^T = \left( \mathbf{y}-\mathbf{X}\boldsymbol\theta \right) \left( \mathbf{y}-\mathbf{X}\boldsymbol\theta \right)^T$$
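The matrix-derivative identities used above can be checked numerically. The sketch below (NumPy, with an arbitrary small example and $\mathbf{r} = \mathbf{y} - \mathbf{X}\boldsymbol\theta$ as the residual) compares the closed-form derivative $-\frac{1}{2}\mathbf{r}\mathbf{r}^T + \frac{1}{2}\mathbf{\Sigma}$ with a finite-difference approximation of the gradient with respect to $\mathbf{\Sigma}^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4                                  # small size so the loop below stays cheap
r = rng.normal(size=n)                 # residual r = y - X theta

A = rng.normal(size=(n, n))
Lam = A @ A.T + n * np.eye(n)          # a symmetric positive definite precision Sigma^{-1}

def loglik(Lam):
    # log-likelihood terms that depend on Sigma^{-1} (constants dropped)
    return -0.5 * r @ Lam @ r + 0.5 * np.linalg.slogdet(Lam)[1]

# Closed-form derivative: -1/2 r r^T + 1/2 Sigma
grad = -0.5 * np.outer(r, r) + 0.5 * np.linalg.inv(Lam)

# Finite-difference gradient, entry by entry
eps = 1e-6
grad_fd = np.zeros_like(Lam)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(Lam)
        E[i, j] = eps
        grad_fd[i, j] = (loglik(Lam + E) - loglik(Lam - E)) / (2 * eps)

assert np.allclose(grad, grad_fd, atol=1e-5)
```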

Exercise 3

Assume now that the covariance matrix is unknown and is given an inverse Wishart prior with fixed, known parameters $\alpha$ and $\mathbf{\Sigma}^*$: $$p\left( \mathbf{\Sigma} | \alpha, \mathbf{\Sigma}^* \right) \propto |\mathbf{\Sigma}|^{-\left(\alpha+n+1\right)/2} \exp\left\{ -\frac{1}{2} \mathrm{trace}\left( \mathbf{\Sigma}^* \mathbf{\Sigma}^{-1} \right) \right\}$$ Assume again that the likelihood is a Gaussian; since the quadratic form in the exponent is a scalar, it equals its own trace: $$p\left(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta}, \mathbf\Sigma \right) = \mathcal{N}(\mathbf{y} | \mathbf{X}\boldsymbol{\theta}, \mathbf{\Sigma}) = |2\pi\mathbf\Sigma|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2} \left(\mathbf{y}-\mathbf{X}\boldsymbol\theta\right)^T \mathbf\Sigma^{-1} \left(\mathbf{y}-\mathbf{X}\boldsymbol\theta\right) \right\}= |2\pi\mathbf\Sigma|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2} \mathrm{trace}\left( \left(\mathbf{y}-\mathbf{X}\boldsymbol\theta\right)^T \mathbf\Sigma^{-1} \left(\mathbf{y}-\mathbf{X}\boldsymbol\theta\right) \right)\right\}$$ and since $\mathrm{trace}\left(\mathbf{z}^T\mathbf{Az}\right) = \mathrm{trace}\left(\mathbf{z}\mathbf{z}^T\mathbf{A}\right)$ $$p\left(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta}, \mathbf\Sigma \right) \propto |\mathbf\Sigma|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2} \mathrm{trace}\left( \left(\mathbf{y}-\mathbf{X}\boldsymbol\theta\right) \left(\mathbf{y}-\mathbf{X}\boldsymbol\theta\right)^T \mathbf\Sigma^{-1} \right)\right\}$$
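The trace identity used here is easy to confirm numerically; a small sketch (NumPy, arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
z = rng.normal(size=(n, 1))            # column vector
A = rng.normal(size=(n, n))

# z^T A z is a scalar, so it equals its own trace, and by the cyclic
# property trace(z^T A z) = trace(z z^T A)
lhs = (z.T @ A @ z).item()
rhs = np.trace(z @ z.T @ A)
assert np.isclose(lhs, rhs)
```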

We can write the posterior as $$p\left( \mathbf{\Sigma} | \mathbf{y}, \mathbf{X}, \boldsymbol\theta, \alpha, \mathbf{\Sigma}^* \right) \propto p\left(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta}, \mathbf\Sigma \right) p\left( \mathbf{\Sigma} | \alpha, \mathbf{\Sigma}^* \right) \propto |\mathbf\Sigma|^{-\left(\left(\alpha+1\right)+n+1\right)/2} \exp\left\{ -\frac{1}{2} \mathrm{trace}\left[ \left( \left( \mathbf{y}-\mathbf{X}\boldsymbol\theta \right) \left( \mathbf{y}-\mathbf{X}\boldsymbol\theta \right)^T + \mathbf{\Sigma}^* \right) \mathbf{\Sigma}^{-1} \right] \right\}$$ thus we can write the posterior as an inverse Wishart with parameters $\alpha + 1$ and $\left( \mathbf{y}-\mathbf{X}\boldsymbol\theta \right) \left( \mathbf{y}-\mathbf{X}\boldsymbol\theta \right)^T + \mathbf{\Sigma}^*$.

If $\alpha = n+1$ and $\mathbf{\Sigma}^* = 0$ then the posterior has parameters $\alpha + 1 = n+2$ and $\left( \mathbf{y}-\mathbf{X}\boldsymbol\theta \right) \left( \mathbf{y}-\mathbf{X}\boldsymbol\theta \right)^T$. Since the mean of an inverse Wishart with parameters $\nu$ and $\mathbf{\Psi}$ (in the parameterization above) is $\mathbf{\Psi}/\left(\nu - n - 1\right)$, the expectation of the posterior is $$ \mathbb{E}\left(\mathbf\Sigma\right) = \frac{\left( \mathbf{y}-\mathbf{X}\boldsymbol\theta \right) \left( \mathbf{y}-\mathbf{X}\boldsymbol\theta \right)^T}{\left(n+2\right) - n - 1} = \left( \mathbf{y}-\mathbf{X}\boldsymbol\theta \right) \left( \mathbf{y}-\mathbf{X}\boldsymbol\theta \right)^T$$ which is the maximum likelihood estimate $\mathbf{\Sigma}_{ML}$.
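Putting the pieces of Exercise 3 together, a short sketch (NumPy only, with arbitrary illustrative data; the inverse Wishart mean formula $\mathbf{\Psi}/(\nu - n - 1)$ is applied directly) that forms the posterior parameters and checks that the choice $\alpha = n+1$, $\mathbf{\Sigma}^* = 0$ makes the posterior mean equal to $\mathbf{\Sigma}_{ML}$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 4, 2
X = rng.normal(size=(n, d))
theta = rng.normal(size=d)
y = X @ theta + rng.normal(size=n)

r = y - X @ theta                          # residual y - X theta
S = np.outer(r, r)                         # (y - X theta)(y - X theta)^T = Sigma_ML

# Prior parameters chosen as in the text
alpha = n + 1
Sigma_star = np.zeros((n, n))

# Posterior is inverse Wishart with parameters alpha + 1 and S + Sigma_star
alpha_post = alpha + 1                     # = n + 2
Psi_post = S + Sigma_star

# Mean of an inverse Wishart with parameters nu, Psi (this parameterization):
# E[Sigma] = Psi / (nu - n - 1), defined for nu > n + 1
E_Sigma = Psi_post / (alpha_post - n - 1)  # denominator equals 1 here

assert np.allclose(E_Sigma, S)             # posterior mean equals Sigma_ML
```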