Question. We observe the following sequence of heads (h) and tails (t) when tossing the same coin repeatedly $$D=\{hthhtth\}\,.$$
What is the probability that heads (h) comes up next?
Your first task is to propose a model with tuning parameters $\theta$ for generating the observations $x$.
Usually, you first select a model for generating a single observation $x_n$, and then use (in-)dependence assumptions to combine these single-observation models into a model for the full data set $D$.
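For instance, under the common assumption that the observations are i.i.d. given $\theta$ (an assumption we make explicit here for illustration), the single-observation models combine by simple multiplication:

$$ p(D|\theta) = p(x_1,\ldots,x_N|\theta) = \prod_{n=1}^N p(x_n|\theta) $$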
We observe the following sequence of heads (h) and tails (t) when tossing the same coin repeatedly $$D=\{hthhtth\}\,.$$
What is the probability that heads (h) comes up next? We solve this in the next slides ...
We observe a sequence of $N$ coin tosses $D=\{x_1,\ldots,x_N\}$ with $n$ heads. Assume a Bernoulli data-generating model $p(x_n=h|\mu)=\mu$, so that the likelihood is $p(D|\mu)=\mu^n(1-\mu)^{N-n}$, together with a beta prior $p(\mu)=\mathcal{B}(\mu|\alpha,\beta)$. Since the beta prior is conjugate to the Bernoulli likelihood, the posterior is also beta distributed:
$$ p(\mu|D) = \mathcal{B}(\mu|\,n+\alpha, N-n+\beta) $$

Bayesian evolution of $p(\mu|D)$ for the coin toss
Let's see how $p(\mu|D)$ evolves as we increase the number of coin tosses $N$. We'll use two different priors to demonstrate the effect of the prior on the posterior (set $N=0$ to inspect the prior).
using Reactive, Interact, PyPlot, Distributions
f = figure()
range_grid = range(0.0, stop=1.0, length=100)
μ = 0.4
samples = rand(192) .<= μ # Flip 192 coins
@manipulate for N=0:1:192; withfig(f) do
n = sum(samples[1:N]) # Count number of heads in first N flips
posterior1 = Beta(1+n, 1+(N-n))
posterior2 = Beta(5+n, 5+(N-n))
plot(range_grid, pdf.(posterior1,range_grid), "k-")
plot(range_grid, pdf.(posterior2,range_grid), "k--")
xlabel(L"\mu"); ylabel(L"p(\mu|\mathcal{D})"); grid()
title(L"p(\mu|\mathcal{D})"*" for N=$(N), n=$(n) (real \$\\mu\$=$(μ))")
legend(["Based on uniform prior "*L"B(1,1)","Based on prior "*L"B(5,5)"], loc=4)
end
end
$\Rightarrow$ With more data, the relevance of the prior diminishes!
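This claim can also be checked numerically: the posterior means under the two priors converge as $N$ grows. A minimal sketch (our own illustration, plain Julia, assuming for simplicity that exactly $40\%$ of tosses come up heads, matching the $\mu=0.4$ used above):

```julia
# Posterior mean of μ under priors B(1,1) and B(5,5) as N grows,
# assuming a fixed fraction of 40% heads (true μ = 0.4 above)
for N in (10, 100, 1000, 10000)
    n = round(Int, 0.4 * N)     # number of heads
    m1 = (n + 1) / (N + 2)      # posterior mean under B(1,1) prior
    m2 = (n + 5) / (N + 10)     # posterior mean under B(5,5) prior
    println("N = $N: |m1 - m2| = $(abs(m1 - m2))")
end
```

The gap between the two posterior means shrinks roughly as $1/N$: the data overwhelm the prior.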
Consider the task: predict a datum $x$ from an observed data set $D$.
| | Bayesian | Maximum Likelihood |
|---|---|---|
| 1. Model Specification | Choose a model $m$ with data generating distribution $p(x|\theta,m)$ and parameter prior $p(\theta|m)$ | Choose a model $m$ with the same data generating distribution $p(x|\theta,m)$. No need for priors. |
| 2. Learning | Use Bayes rule to find the parameter posterior, $$ p(\theta|D) \propto p(D|\theta) p(\theta) $$ | By Maximum Likelihood (ML) optimization, $$ \hat \theta = \arg \max_{\theta} p(D |\theta) $$ |
| 3. Prediction | $$ p(x|D) = \int p(x|\theta) p(\theta|D) \,\mathrm{d}\theta $$ | $$ p(x|D) = p(x|\hat\theta) $$ |
$\Rightarrow$ ML estimation is an approximation to Bayesian learning, but for good reason it is a very popular learning method when lots of data are available.
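For the coin-toss data the difference between the two columns is easy to compute by hand. The sketch below (our own illustration, plain Julia) contrasts the two predictions of heads for $D=\{hthhtth\}$, i.e. $N=7$, $n=4$:

```julia
# ML vs Bayesian prediction of heads for D = {h,t,h,h,t,t,h}
N, n = 7, 4

# ML: plug the point estimate μ̂ = n/N into p(x=h|μ̂)
μ_hat = n / N                      # = 4/7 ≈ 0.571
p_ml = μ_hat

# Bayesian: average over the posterior B(μ|n+α, N-n+β); with the
# uniform prior α = β = 1 this posterior is B(μ|5,4), and the
# predictive ∫ μ B(μ|5,4) dμ is just its mean
α, β = 1, 1
p_bayes = (n + α) / (N + α + β)    # = 5/9 ≈ 0.556

println("ML: $p_ml   Bayesian: $p_bayes")
```

Note that the Bayesian prediction is pulled slightly toward the prior mean $0.5$; as $N$ grows, the two predictions coincide.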