There are two fundamental types of statistical questions we'll want to answer:
Model fitting: given this model, what parameters best fit my data?
Model selection: given two potential models, which better describes my data?
Often one of the two models is a null hypothesis, or a baseline model in which the effect you're interested in is not observed.
Both the model fitting and model selection problems can be approached from either a frequentist or a Bayesian standpoint. Fundamentally, the difference between these lies in the definition of probability that they use:
A frequentist probability is a measure of the long-run frequency of (real or imagined) repeated trials. Among other things, this generally means that observed data can be described probabilistically (you can repeat the experiment and get a different realization), while model parameters are fixed and cannot be described probabilistically (the universe remains the same no matter how many times you observe it).
A Bayesian probability is a quantification of belief. Among other things, this generally means that observed data are treated as fixed (you know exactly what values you measured) while model parameters – including the "true" values of the data reflected by noisy measurements – are treated probabilistically (your degree of knowledge about the universe changes as you gather more data).
Perhaps surprisingly to the uninitiated, scientists and statisticians have been vehemently fighting in favor of one approach or the other for decades. For a more fleshed-out discussion of these different definitions and their consequences, you can see my series of blog posts on the topic.
Thus the end-goal of a Bayesian analysis is a probabilistic statement about the universe. Roughly we want to measure
$$ P(science) $$Where "science" might be encapsulated in the cosmological model, the mass of a planet around a star, or whatever else we're interested in learning about.
We don't of course measure this without reference to data, so more specifically we want to measure
$$ P(science~|~data) $$which should be read "the probability of the science given the data."
Of course, we should be explicit that this measurement is not done in a vacuum: generally before observing any data we have some degree of background information that informs the science, so we should actually write
$$ P(science~|~data, background\ info) $$This should be read "the probability of the science given the data and the background information".
Finally, there are often things in the scientific model that we don't particularly care about: these are known as "nuisance parameters". As an example of a nuisance parameter, if you are searching for a planet in radial velocity data, the secular motion of the star is extremely important to model correctly, but in the end you don't really care about its best-fit value.
With that in mind, we can write:
$$ P(science,nuisance\ parameters~|~data, background\ info) $$where, as before, the comma should be read as an "and".
This is starting to get a bit cumbersome, so let's create some symbols that will let us express this more easily:
$$ P(\theta_S, \theta_N~|~D, I) $$Finally, we'll often just write $\theta = (\theta_S, \theta_N)$ as a shorthand for all the model parameters.
This quantity, $P(\theta~|~D,I)$ is called the "posterior probability" and determining this quantity is the ultimate goal of a Bayesian analysis.
Now all we need to do is compute it!
The core problem is this: We do not have a way to directly calculate $P(\theta~|~D,I)$. We often do have an expression for $P(D~|~\theta,I)$, but these two expressions are not equal.
$$ P(\theta~|~D,I) \ne P(D~|~\theta,I) $$We need to figure out how these two expressions are related.
We will generally be dealing with probability densities, that is, $P(A) dA$ is the probability of a value falling between $A$ and $A + dA$.
Probability densities are normalized such that the union of all possible events has a probability of unity; mathematically that criterion looks like this:
$$ \int P(A) dA = 1 $$Among other things, consider the units implied by this expression: because probability is dimensionless, the units of $P(A)$ must be the inverse of the units of $A$. This can be very useful to keep in mind as you manipulate probabilistic expressions!
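As a quick sanity check, here is a minimal numerical sketch of this normalization, using a Gaussian density as an arbitrary example (all values here are made up for illustration):

```python
import numpy as np

# If x carries units (say, meters), then p has units 1/meters,
# so the integral of p dx is dimensionless, as a probability must be.
sigma = 2.0
x = np.linspace(-10, 10, 10001)
p = np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Approximate the integral of P(x) dx as a discrete sum:
total = np.sum(p) * (x[1] - x[0])
print(total)  # ≈ 1.0
```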
When we're talking about two different variables, we can write a joint probability:
$$ P(A,B) $$This should be read "probability of $A$ and $B$".
In the case that $A$ and $B$ are independent, $P(A,B) = P(A)P(B)$
Sometimes we want to express the probability of one variable conditioned on the value of another variable. We can write it like this:
$$ P(A\mid B) = \frac{P(A, B)}{P(B)} $$This should be read "the probability of $A$ given $B$".
Rearranging this we get the more common expression of this:
$$ P(A, B) = P(A\mid B)P(B) $$Note that conditional probabilities are still probabilities, so that the normalization is the same as above:
$$ \int P(A\mid B)dA = 1 $$This makes clear that $P(A \mid B)$ has units equivalent to $1/A$ – and no units related to $B$. Among other things, this means that expressions like $\int P(A\mid B)dB$ should raise a very big red flag!
Given the above two properties, we can quite easily show that $$ \int P(A, B) dA = P(B) $$ This is known as marginalization: we have integrated $A$ out of the joint probability, and we are left with the raw probability of $B$.
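These identities are easy to verify numerically. Here is a minimal sketch using a made-up 2×2 joint distribution (discrete, so the integrals become sums; the values are arbitrary, chosen only to sum to 1):

```python
import numpy as np

# A hypothetical discrete joint distribution P(A, B):
# rows index values of A, columns index values of B.
P_AB = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

# Marginalization: summing over A leaves the probability of B.
P_B = P_AB.sum(axis=0)            # -> [0.4, 0.6]

# Conditional probability: P(A|B) = P(A,B) / P(B).
# Each column of P(A|B) is itself a normalized distribution over A.
P_A_given_B = P_AB / P_B
print(P_A_given_B.sum(axis=0))    # -> [1. 1.]
```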
The definition of conditional probability is entirely symmetric, so we can write
$$ P(A, B) = P(B, A) $$$$ P(A\mid B)P(B) = P(B\mid A)P(A) $$which is more commonly rearranged in this form:
$$ P(A\mid B) = \frac{P(B\mid A)P(A)}{P(B)} $$This is known as Bayes' Theorem or Bayes' Rule, and is important because it gives a formula for "flipping" conditional probabilities.
If we replace these labels, we find the usual expression of Bayes' theorem as it relates to model fitting:
$$ P(\theta \mid D) = \frac{P(D\mid\theta)P(\theta)}{P(D)} $$Technically, all the probabilities should be conditioned on the information $I$:
$$ P(\theta \mid D,I) = \frac{P(D \mid \theta,I)P(\theta \mid I)}{P(D \mid I)} $$Recall $\theta$ is the model we're interested in, $D$ is the observed data, and $I$ encodes all the prior information, including what led us to choose the particular model we're using.
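To make the "flipping" concrete, here is a toy discrete example with made-up numbers (a parameter taking just two values, with an assumed prior and likelihood):

```python
# Bayes' rule on a made-up discrete example: flipping P(D|theta) into P(theta|D).
prior = [0.5, 0.5]        # P(theta|I): no preference between the two values
likelihood = [0.8, 0.1]   # P(D|theta,I): data is much more probable under value 0

# P(D|I) is the normalizing sum over both parameter values:
evidence = sum(l * p for l, p in zip(likelihood, prior))

# Bayes' rule gives the posterior P(theta|D,I):
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]
print(posterior)  # -> [0.888..., 0.111...]
```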
When we write Bayes' rule this way, we're all of a sudden doing something controversial: can you see where this controversy lies?
Two controversial points:
We have a probability distribution over model parameters. A frequentist would say this is meaningless!
The answer depends on the prior $P(\theta\mid I)$. This is the probability of the model parameters without any data: how are we supposed to know that?
Nevertheless, applying Bayes' rule in this manner gives us a means of quantifying our knowledge of the parameters $\theta$ given observed data.
We have four terms in the above expression, and we need to make sure we understand them:
The posterior, $P(\theta\mid D,I)$: this is the quantity we want to compute, namely our knowledge of the model given the data & background knowledge (including the choice of model).
The likelihood, $P(D\mid\theta,I)$: this measures the probability of seeing our data given the model. This is identical to the quantity maximized in frequentist maximum-likelihood approaches.
The prior, $P(\theta\mid I)$: this encodes any knowledge we had about the answer before measuring the current data.
The Fully Marginalized Likelihood, $P(D\mid I)$: you might prefer the acronym FML (it's also called the Evidence, among other things). In the context of model fitting, it acts as a normalization constant and in most cases can be ignored.
In general the Fully Marginalized Likelihood (FML) is very costly to compute, which makes the acronym doubly appropriate in any situation where you actually need it. To see why it's so costly, consider that the FML can be expressed as an integral using the identities we covered above: $$ P(D\mid I) = \int P(D\mid\theta, I) P(\theta\mid I) d\theta $$ In other words, it is the integral of the likelihood, weighted by the prior, over all possible values of $\theta$. When your likelihood is a complicated function of many parameters, computing this integral can become extremely costly (a manifestation of the curse of dimensionality).
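For intuition, here is a sketch of that integral for a toy one-parameter model, computed by brute-force quadrature on a grid (all numbers are made up for illustration):

```python
import numpy as np

# Toy model: one datum d observed with assumed Gaussian noise around theta.
d, sigma = 1.0, 0.5
theta = np.linspace(-5, 5, 2001)     # grid over the single parameter
dtheta = theta[1] - theta[0]
prior = np.full_like(theta, 0.1)     # flat prior on [-5, 5]; integrates to 1
like = np.exp(-0.5 * ((d - theta) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# P(D|I) = integral of P(D|theta,I) P(theta|I) dtheta, approximated as a sum.
fml = np.sum(like * prior) * dtheta
print(fml)  # ≈ 0.1 for these made-up numbers
```

In one dimension this is cheap, but a $k$-parameter model on the same grid would need $2001^k$ likelihood evaluations, which is the curse of dimensionality in action.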
In general, for model fitting, you can ignore the FML as a simple normalization term. In model selection, the FML can become important.
At first blush, this might all seem needlessly complicated. Why not simply maximize the likelihood and be done with it? Why multiply by a prior at all?
There are a couple good reasons to go through all of this:
Many advocates of the Bayesian approach argue for its philosophical purity: you quantify knowledge in terms of a probability, then follow the math to compute the answer. The fact that you need to specify a prior might be inconvenient, but we can't simply pretend it away. There are good reasons to think that the Bayesian posterior is just the quantity we wish to compute; in that case we should compute it, however inconvenient!
Perhaps the most vocal 20th century proponent of this view was Jaynes; I'd highly suggest looking at his book, Probability Theory: The Logic of Science (PDF here).
Whether frequentist or Bayesian, the maximum likelihood "point estimate" is only a small part of the picture. What we're really interested in scientifically is the uncertainty of the estimates. So simply reporting a point estimate is not appropriate.
In frequentist approaches, "error bars" are generally computed from Confidence Intervals, which effectively measure $P(\hat\theta\mid\theta)$, rather than $P(\theta\mid D)$. It takes some mental gymnastics to relate the confidence interval to the quantity we as scientists have in mind when we say "uncertainty".
In the Bayesian approach, we are actually measuring $P(\theta\mid D)$ from the beginning.
For some approachable reading on frequentist vs. Bayesian uncertainties, I'd suggest The Fallacy of Placing Confidence in Confidence Intervals, as well as my (rather opinionated) blog post on the topic, Confidence, Credibility, and why Frequentism and Science do not Mix.
Additionally, Bayesian approaches offer a very natural way to systematically account for nuisance parameters.
Consider that we now have a recipe for computing the posterior, $P(\theta\mid D,I)$. Recall that in general our model parameters $\theta$ contain some parameters of interest (what we called $\theta_S$, or the "science") and some nuisance parameters (what we called $\theta_N$). That is, we're really measuring
$$ P(\theta_S,\theta_N\mid D, I) $$What we're really interested in, though, is $P(\theta_S\mid D, I)$ which we can compute via a straightforward marginalization integral:
$$ P(\theta_S\mid D, I) = \int P(\theta_S, \theta_N\mid D, I)d\theta_N $$where the integral is over the entire space of $\theta_N$. This quantity, the marginalized posterior, is our final result of interest.
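On a grid, this marginalization is a single numerical integral. A minimal sketch, using a made-up 2D Gaussian as a stand-in for the posterior:

```python
import numpy as np

# Hypothetical posterior P(theta_S, theta_N | D, I) evaluated on a grid.
theta_S = np.linspace(-5, 5, 401)
theta_N = np.linspace(-5, 5, 401)
dS, dN = theta_S[1] - theta_S[0], theta_N[1] - theta_N[0]
S, N = np.meshgrid(theta_S, theta_N, indexing="ij")  # axis 0: theta_S, axis 1: theta_N
post = np.exp(-0.5 * (S**2 + (N - 1.0)**2))          # unnormalized posterior
post /= post.sum() * dS * dN                         # normalize on the grid

# Integrate theta_N out to get the marginalized posterior P(theta_S | D, I):
post_S = post.sum(axis=1) * dN
print(post_S.sum() * dS)  # still normalized -> 1.0
```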
At first glance, you might think this problem would be nearly as difficult as the computation of the FML above. We'll see later, however, that a feature of the MCMC approaches typical for large Bayesian problems is that such marginalized posteriors come essentially for free.
We now have all the probabilistic machinery we need to do Bayesian inference... next we will get our fingers on our keyboards and fit some Bayesian models!