Bayesian Machine Learning

Preliminaries

  • Goals
    • Introduction to Bayesian (i.e., probabilistic) modeling
  • Materials
    • Mandatory
      • These lecture notes
    • Optional
      • Bishop pp. 21-24

Example Problem: Predicting a Coin Toss

  • Question. We observe a the following sequence of heads (h) and tails (t) when tossing the same coin repeatedly $$D=\{hthhtth\}\,.$$

  • What is the probability that heads (h) comes up next?

  • Answer later in this lecture.

The Bayesian Machine Learning Framework

  • Suppose that your task is to predict a future observation $x$, based on $N$ past observations $D=\{x_1,\dotsc,x_N\}$.
  • The Bayesian approach for this task involves three stages:
    1. Model specification
    2. Parameter estimation (i.e., learning from observed data; using Bayesian inference)
    3. Prediction (apply the model)
  • Next, we discuss these three stages in a bit more detail.

(1) Model specification

Your first task is to propose a model with tuning parameters $\theta$ for generating the observations $x$.

  • This involves specification of $p(x|\theta)$ and a prior for the parameters $p(\theta)$.
  • You choose the distribution $p(x|\theta)$ based on your physical understanding of the data generating process.
  • Note that, for a given data set $D=\{x_1,x_2,\dots,x_N\}$ with independent observations $x_n$, $$ p(D|\theta) = \prod_{n=1}^N p(x_n|\theta)$$ so usually you select a model for generating one observation $x_n$ and then use (in-)dependence assumptions to combine these models into a model for your observed data set $D$.
  • You choose the prior $p(\theta)$ to reflect what you know about the parameter values before you see the data $D$.

(2) Parameter estimation

  • After model specification, you need to measure/collect a data set $D$. Then, use Bayes rule to find the posterior distribution for the parameters, $$ p(\theta|D) = \frac{p(D|\theta) p(\theta)}{p(D)} \propto p(D|\theta) p(\theta) $$
    • The denominator is only a normalization factor.
  • Note that there's no need for you to design a smart parameter estimation algorithm. The only complexity lies in the computational issues.
  • This "recipe" works only if the RHS factors can be evaluated; this is what machine learning is about
    $\Rightarrow$ Machine learning is EASY, apart from computational details :)

(3) Prediction

  • Given the data $D$, our knowledge about the yet unobserved datum $x$ is captured by $$\begin{align*} p(x|D) &= \int p(x,\theta|D) \,\mathrm{d}\theta\\ &= \int p(x|\theta,D) p(\theta|D) \,\mathrm{d}\theta\\ &= \int p(x|\theta) p(\theta|D) \,\mathrm{d}\theta\\ \end{align*}$$
  • Again, no need to invent a special prediction algorithm. Probability theory takes care of all that. The complexity of prediction is just computational: how to carry out the marginalization over $\theta$.
  • In order to execute prediction, you need to have access to the factors $p(x|\theta)$ and $p(\theta|D)$. Where do these factors come from? Are they available?
  • What did we learn from $D$? Without access to $D$, we would predict new observations through $$ p(x) = \int p(x,\theta) \,\mathrm{d}\theta = \int p(x|\theta) p(\theta) \,\mathrm{d}\theta $$
  • NB The application of the learned posterior $p(\theta|D)$ not necessarily has to be prediction. We use it here as an example, but other applications are of course also possible.

Bayesian Model Comparison

  • There appears to be a remaining problem: How good really were our model assumptions $p(x|\theta)$ and $p(\theta)$?
  • Technically, this is a model comparison problem
  • [Q.] What if I have more candidate models, say $\mathcal{M} = \{m_1,\ldots,m_K\}$ where each model relates to specific prior $p(\theta|m_k)$ and likelihood $p(D|\theta,m_k)$? Can we evaluate the relative performance of a model against another model from the set?
  • [A.]: Start again with model specification. Specify a prior $p(m_k)$ for each of the models and then solve the desired inference problem:
    $$\begin{align*} p(m_k|D) &= \frac{p(D|m_k) p(m_k)}{p(D)} \\ &\propto p(m_k) \cdot p(D|m_k) \\ &= p(m_k)\cdot \int_\theta p(D,\theta|m_k) \,\mathrm{d}\theta\\ &= \underbrace{p(m_k)}_{\substack{\text{model}\\\text{prior}}}\cdot \int_\theta \underbrace{p(D|\theta,m_k)}_{\text{likelihood}} \,\underbrace{p(\theta|m_k)}_{\text{prior}}\, \mathrm{d}\theta\\ \end{align*}$$

Bayesian Model Comparison (continued)

  • You, the engineer, have to choose the factors $p(D|\theta,m_k)$, $p(\theta|m_k)$ and $p(m_k)$. After that, for a given data set $D$, the model posterior $p(m_k|D)$ can be computed.
  • If you need to work with one model,select the model with largest posterior $p(m_k|D)$
  • Alternatively, if you don't want to choose a model, you can do prediction by Bayesian model averaging to utilitize the predictive power from all models: $$\begin{align*} p(x|D) &= \sum_k \int p(x,\theta,m_k|D)\,\mathrm{d}\theta \\ &= \sum_k \underbrace{p(m_k|D)}_{\substack{\text{model}\\\text{posterior}}} \cdot \int \underbrace{p(\theta|D)}_{\substack{\text{parameter}\\\text{posterior}}} \, \underbrace{p(x|\theta,m_k)}_{\text{likelihood}} \,\mathrm{d}\theta \end{align*}$$
  • $\Rightarrow$ In a Bayesian framework, model comparison follows the same recipe as parameter estimation; it just works at one higher hierarchical level.
  • More on this in part 2 (Tjalkens).

Machine Learning and the Scientific Method Revisited

  • Bayesian probability theory provides a unified framework for information processing (and even the Scientific Method).

Now Solve the Example Problem: Predicting a Coin Toss

  • We observe a the following sequence of heads (h) and tails (t) when tossing the same coin repeatedly $$D=\{hthhtth\}\,.$$

  • What is the probability that heads (h) comes up next? We solve this in the next slides ...

Coin toss example (1): Model Specification

We observe a sequence of $N$ coin tosses $D=\{x_1,\ldots,x_N\}$ with $n$ heads.

Likelihood
  • Assume a Bernoulli distributed variable $p(x_k=h|\mu)=\mu$, which leads to a binomial distribution for the likelihood (assume $n$ times heads were thrown): $$ p(D|\mu) = \prod_{k=1}^N p(x_k|\mu) = \mu^n (1-\mu)^{N-n} $$
Prior
  • Assume the prior belief is governed by a beta distribution $$ p(\mu) = \mathcal{B}(\mu|\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \mu^{\alpha-1}(1-\mu)^{\beta-1} $$
  • The Beta distribution is a conjugate prior for the Binomial distribution, which means that $$ \text{beta} \propto \text{binomial} \times \text{beta} $$
  • $\alpha$ and $\beta$ are called hyperparameters, since they parameterize the distribution for another parameter ($\mu$). E.g., $\alpha=\beta=1$ (uniform).

Coin toss example (2): Parameter estimation

  • Infer posterior PDF over $\mu$ through Bayes rule
$$\begin{align*} p(\mu|D) &\propto p(D|\mu)\,p(\mu|\alpha,\beta) \\ &= \mu^n (1-\mu)^{N-n} \times \mu^{\alpha-1} (1-\mu)^{\beta-1} \\ &= \mu^{n+\alpha-1} (1-\mu)^{N-n+\beta-1} \end{align*}$$

hence the posterior is also beta distributed as

$$ p(\mu|D) = \mathcal{B}(\mu|\,n+\alpha, N-n+\beta) $$
  • Essentially, here ends the machine learning activity

Coin Toss Example (3): Prediction

  • Now, we want to use the trained model. Let's use it to predict future observations.
  • Marginalize over the parameter posterior to get the predictive PDF for a new coin toss $x_\bullet$, given the data $D$,
$$\begin{align*} p(x_\bullet=h|D) &= \int_0^1 p(x_\bullet=h|\mu)\,p(\mu|D) \,\mathrm{d}\mu \\ &= \int_0^1 \mu \times \mathcal{B}(\mu|\,n+\alpha, N-n+\beta) \,\mathrm{d}\mu \\ &= \frac{n+\alpha}{N+\alpha+\beta} \qquad \mbox{(a.k.a. Laplace rule)}\hfill \end{align*}$$
  • Finally, we're ready to solve our example problem: for $D=\{hthhtth\}$ and uniform prior ($\alpha=\beta=1$), we get
$$ p(x_\bullet=h|D)=\frac{n+1}{N+2} = \frac{4+1}{7+2} = \frac{5}{9}$$

Coin Toss Example: What did we learn?

  • What did we learn from the data? Before seeing any data, we think that $$p(x_\bullet=h)=\left. p(x_\bullet=h|D) \right|_{n=N=0} = \frac{\alpha}{\alpha + \beta}\,.$$
  • After the $N$ coin tosses, we think that $p(x_\bullet=h|D) = \frac{n+\alpha}{N+\alpha+\beta}$.
  • Note the following decomposition
$$\begin{align*} p(x_\bullet=h|\,D) &= \frac{n+\alpha}{N+\alpha+\beta} = \frac{n}{N+\alpha+\beta} + \frac{\alpha}{N+\alpha+\beta} \\ &= \frac{N}{N+\alpha+\beta}\cdot \frac{n}{N} + \frac{\alpha+\beta}{N+\alpha+\beta} \cdot \frac{\alpha}{\alpha+\beta} \\ &= \underbrace{\frac{\alpha}{\alpha+\beta}}_{prior} + \underbrace{\frac{N}{N+\alpha+\beta}}_{gain}\cdot \big( \underbrace{\frac{n}{N}}_{MLE} - \underbrace{\frac{\alpha}{\alpha+\beta}}_{prior} \big) \end{align*}$$
  • Note that, since $0\leq\text{gain}\lt 1$, the Bayesian estimate lies between prior and maximum likelihood estimate.
  • For large $N$, the gain goes to $1$ and $p(x_\bullet=h|D)$ goes to the maximum likelihood estimate (the relative frequency) $n/N$.

CODE EXAMPLE

Bayesian evolution of $p(\mu|D)$ for the coin toss

Let's see how $p(\mu|D)$ evolves as we increase the number of coin tosses $N$. We'll use two different priors to demonstrate the effect of the prior on the posterior (set $N=0$ to inspect the prior).

In [1]:
using Reactive, Interact, PyPlot, Distributions
f = figure()
range_grid = range(0.0, stop=1.0, length=100)
μ = 0.4
samples = rand(192) .<= μ # Flip 192 coins
@manipulate for N=0:1:192; withfig(f) do
        n = sum(samples[1:N]) # Count number of heads in first N flips
        posterior1 = Beta(1+n, 1+(N-n))
        posterior2 = Beta(5+n, 5+(N-n))
        plot(range_grid, pdf.(posterior1,range_grid), "k-")
        plot(range_grid, pdf.(posterior2,range_grid), "k--")
        xlabel(L"\mu"); ylabel(L"p(\mu|\mathcal{D})"); grid()
        title(L"p(\mu|\mathcal{D})"*" for N=$(N), n=$(n) (real \$\\mu\$=$(μ))")
        legend(["Based on uniform prior "*L"B(1,1)","Based on prior "*L"B(5,5)"], loc=4)
    end
end
Out[1]:
WebIO.mount(this.previousSibling,{"props":{},"nodeType":"DOM","type":"node","instanceArgs":{"namespace":"html","tag":"div"},"children":[{"props":{"className":"field"},"nodeType":"DOM","type":"node","instanceArgs":{"namespace":"html","tag":"div"},"children":[{"props":{},"nodeType":"Scope","type":"node","instanceArgs":{"imports":{"data":[{"name":"knockout","type":"js","url":"/assetserver/88bf0b823d344fcff0b77616d12840286b6c8fa3-knockout.js"},{"name":"knockout_punches","type":"js","url":"/assetserver/2bd5b097e96d6b4fb8b3b86c01ec3eb6f00ba8bc-knockout_punches.js"},{"name":null,"type":"js","url":"/assetserver/33fc17ead5d6c83c74b2449f7c19558c51387837-all.js"},{"name":null,"type":"css","url":"/assetserver/f0f1a82ab037979b58e3919f91c0a1436c2b13ea-style.css"},{"name":null,"type":"css","url":"/assetserver/84fa6bb423ab1a438691a358fbcb123d0820e96c-main.css"}],"type":"async_block"},"id":"knockout-component-4ba54b13-d2c1-42a7-80b9-41e2e57dc0e1","handlers":{"_promises":{"importsLoaded":[function (ko, koPunches) { ko.punches.enableAll(); ko.bindingHandlers.numericValue = { init : function(element, valueAccessor, allBindings, data, context) { var stringified = ko.observable(ko.unwrap(valueAccessor())); stringified.subscribe(function(value) { var val = parseFloat(value); if (!isNaN(val)) { valueAccessor()(val); } }) valueAccessor().subscribe(function(value) { var str = JSON.stringify(value); if ((str == "0") && (["-0", "-0."].indexOf(stringified()) >= 0)) return; if (["null", ""].indexOf(str) >= 0) return; stringified(str); }) ko.applyBindingsToNode(element, { value: stringified, valueUpdate: allBindings.get('valueUpdate')}, context); } }; var json_data = JSON.parse("{\"changes\":0,\"value\":96}"); var self = this; function AppViewModel() { for (var key in json_data) { var el = json_data[key]; this[key] = Array.isArray(el) ? ko.observableArray(el) : ko.observable(el); } [this["changes"].subscribe((function (val){!(this.valueFromJulia["changes"]) ? (WebIO.setval({"name":"changes","scope":"knockout-component-4ba54b13-d2c1-42a7-80b9-41e2e57dc0e1","id":"ob_02","type":"observable"},val)) : undefined; return this.valueFromJulia["changes"]=false}),self),this["value"].subscribe((function (val){!(this.valueFromJulia["value"]) ? (WebIO.setval({"name":"value","scope":"knockout-component-4ba54b13-d2c1-42a7-80b9-41e2e57dc0e1","id":"ob_01","type":"observable"},val)) : undefined; return this.valueFromJulia["value"]=false}),self)] } self.model = new AppViewModel(); self.valueFromJulia = {}; for (var key in json_data) { self.valueFromJulia[key] = false; } ko.applyBindings(self.model, self.dom); } ]},"changes":[(function (val){return (val!=this.model["changes"]()) ? (this.valueFromJulia["changes"]=true, this.model["changes"](val)) : undefined})],"value":[(function (val){return (val!=this.model["value"]()) ? (this.valueFromJulia["value"]=true, this.model["value"](val)) : undefined})]},"systemjs_options":null,"observables":{"changes":{"sync":false,"id":"ob_02","value":0},"value":{"sync":true,"id":"ob_01","value":96}}},"children":[{"props":{"attributes":{"class":"interact-flex-row"}},"nodeType":"DOM","type":"node","instanceArgs":{"namespace":"html","tag":"div"},"children":[{"props":{"attributes":{"class":"interact-flex-row-left"}},"nodeType":"DOM","type":"node","instanceArgs":{"namespace":"html","tag":"div"},"children":[{"props":{"className":"interact ","style":{"padding":"5px 10px 0px 10px"}},"nodeType":"DOM","type":"node","instanceArgs":{"namespace":"html","tag":"label"},"children":["N"]}]},{"props":{"attributes":{"class":"interact-flex-row-center"}},"nodeType":"DOM","type":"node","instanceArgs":{"namespace":"html","tag":"div"},"children":[{"props":{"max":192,"min":0,"attributes":{"type":"range","data-bind":"numericValue: value, valueUpdate: 'input', event: {change : function () {this.changes(this.changes()+1)}}","orient":"horizontal"},"step":1,"className":"slider slider is-fullwidth","style":{}},"nodeType":"DOM","type":"node","instanceArgs":{"namespace":"html","tag":"input"},"children":[]}]},{"props":{"attributes":{"class":"interact-flex-row-right"}},"nodeType":"DOM","type":"node","instanceArgs":{"namespace":"html","tag":"div"},"children":[{"props":{"attributes":{"data-bind":"text: value"}},"nodeType":"DOM","type":"node","instanceArgs":{"namespace":"html","tag":"p"},"children":[]}]}]}]}]},{"props":{},"nodeType":"Scope","type":"node","instanceArgs":{"imports":{"data":[],"type":"async_block"},"id":"scope-b2cdd369-f329-483a-82fb-04f269a831ce","handlers":{"obs-output":[function (updated_htmlstr) { var el = this.dom.querySelector("#out"); WebIO.propUtils.setInnerHtml(el, updated_htmlstr); }]},"systemjs_options":null,"observables":{"obs-output":{"sync":false,"id":"ob_05","value":"<div class='display:none'></div><unsafe-script style='display:none'>\nWebIO.mount(this.previousSibling,{&quot;props&quot;:{&quot;attributes&quot;:{&quot;class&quot;:&quot;interact-flex-row&quot;}},&quot;nodeType&quot;:&quot;DOM&quot;,&quot;type&quot;:&quot;node&quot;,&quot;instanceArgs&quot;:{&quot;namespace&quot;:&quot;html&quot;,&quot;tag&quot;:&quot;div&quot;},&quot;children&quot;:[{&quot;props&quot;:{&quot;setInnerHtml&quot;:&quot;&lt;img src=&#39;&#39;&gt;&lt;/img&gt;&quot;},&quot;nodeType&quot;:&quot;DOM&quot;,&quot;type&quot;:&quot;node&quot;,&quot;instanceArgs&quot;:{&quot;namespace&quot;:&quot;html&quot;,&quot;tag&quot;:&quot;div&quot;},&quot;children&quot;:[]}]})</unsafe-script>"}}},"children":[{"props":{"id":"out","setInnerHtml":"<div class='display:none'></div><unsafe-script style='display:none'>\nWebIO.mount(this.previousSibling,{&quot;props&quot;:{&quot;attributes&quot;:{&quot;class&quot;:&quot;interact-flex-row&quot;}},&quot;nodeType&quot;:&quot;DOM&quot;,&quot;type&quot;:&quot;node&quot;,&quot;instanceArgs&quot;:{&quot;namespace&quot;:&quot;html&quot;,&quot;tag&quot;:&quot;div&quot;},&quot;children&quot;:[{&quot;props&quot;:{&quot;setInnerHtml&quot;:&quot;&lt;img src=&#39;&#39;&gt;&lt;/img&gt;&quot;},&quot;nodeType&quot;:&quot;DOM&quot;,&quot;type&quot;:&quot;node&quot;,&quot;instanceArgs&quot;:{&quot;namespace&quot;:&quot;html&quot;,&quot;tag&quot;:&quot;div&quot;},&quot;children&quot;:[]}]})</unsafe-script>"},"nodeType":"DOM","type":"node","instanceArgs":{"namespace":"html","tag":"div"},"children":[]}]}]})

$\Rightarrow$ With more data, the relevance of the prior diminishes!

From Posterior to Point-Estimate

  • Sometimes we want just one 'best' parameter (vector), rather than a posterior distribution over parameters. Why?
  • Recall Bayesian prediction
$$ p(x|D) = \int p(x|\theta)p(\theta|D)\,\mathrm{d}{\theta} $$
  • If we approximate posterior $p(\theta|D)$ by a delta function for one 'best' value $\hat\theta$, then the predictive distribution collapses to
$$ p(x|D)= \int p(x|\theta)\,\delta(\theta-\hat\theta)\,\mathrm{d}{\theta} = p(x|\hat\theta) $$
  • This is the model $p(x|\theta)$ evaluated at $\theta=\hat\theta$.
  • Note that $p(x|\hat\theta)$ is much easier to evaluate than the integral for full Bayesian prediction.

Some Well-known Point-Estimates

  • Bayes estimate
$$ \hat \theta_{bayes} = \int \theta \, p\left( \theta |D \right) \,\mathrm{d}{\theta} $$
  • (homework). Proof that the Bayes estimate minimizes the expected mean-square error, i.e., proof that
$$ \hat \theta_{bayes} = \arg\min_{\hat \theta} \int_\theta (\hat \theta -\theta)^2 p \left( \theta |D \right) \,\mathrm{d}{\theta} $$
  • Maximum A Posteriori (MAP) estimate $$ \hat \theta_{\text{map}}= \arg\max _{\theta} p\left( \theta |D \right) = \arg \max_{\theta} p\left(D |\theta \right) \, p\left(\theta \right) $$
  • Maximum Likelihood (ML) estimate $$ \hat \theta_{ml} = \arg \max_{\theta} p\left(D |\theta\right) $$
    • Note that Maximum Likelihood is MAP with uniform prior

Bayesian vs Maximum Likelihood Learning

Consider the task: predict a datum $x$ from an observed data set $D$.

Bayesian Maximum Likelihood
1. Model SpecificationChoose a model $m$ with data generating distribution $p(x|\theta,m)$ and parameter prior $p(\theta|m)$Choose a model $m$ with same data generating distribution $p(x|\theta,m)$. No need for priors.
2. Learninguse Bayes rule to find the parameter posterior, $$ p(\theta|D) = \propto p(D|\theta) p(\theta) $$ By Maximum Likelihood (ML) optimization, $$ \hat \theta = \arg \max_{\theta} p(D |\theta) $$
3. Prediction$$ p(x|D) = \int p(x|\theta) p(\theta|D) \,\mathrm{d}\theta $$ $$ p(x|D) = p(x|\hat\theta) $$

Report Card on Maximum Likelihood Estimation

  • Maximum Likelihood (ML) is MAP with uniform prior, or MAP is 'penalized' ML $$ \hat \theta_{map} = \arg \max _\theta \{ \overbrace{\log p\left( D|\theta \right)}^{\mbox{log-likelihood}} + \overbrace{\log p\left( \theta \right)}^{\mbox{penalty}} \} $$
  • (good!). Works rather well if we have a lot of data because the influence of the prior diminishes with more data.
  • (bad). Cannot be used for model comparison. E.g. best model does generally not correspond to largest likelihood (see part-2, Tjalkens).
  • (good). Computationally often do-able. Useful fact (since $\log$ is monotonously increasing): $$\arg\max_\theta \log p(D|\theta) = \arg\max_\theta p(D|\theta)$$

$\Rightarrow$ ML estimation is an approximation to Bayesian learning, but for good reason a very popular learning method when faced with lots of available data.

In [2]:
open("../../styles/aipstyle.html") do f display("text/html", read(f, String)) end
In [ ]: