This is a little mathematical background to justify why we might require priors to be "invariant" under certain transformations of the parameter space. This is not a new idea, but the derivation is my own (and hence so are errors...)
I'll use sympy to do some examples.
%matplotlib inline
import sympy
sympy.init_printing(use_latex = True)
As motivation, consider a model where $(X_i)_{i=1}^n$ are IID samples from $N(\mu,\sigma^2)$ where we wish to use a sample $(x_i)$ to make inferences about $\mu, \sigma$. For some variety, I'll use the precision $\tau = 1 / \sigma^2$.
So our random variable $x=(x_i)$ takes values in $\Omega = \mathbb R^n$. Consider the transformation $\phi : \Omega\rightarrow\Omega; (x_i) \mapsto (bx_i+a)$. That is, for each sample, we scale by $b$ and add $a$. Then $\phi(x)$ is a sample of IID $N(b\mu+a, b^2\sigma^2)$. If $\Theta$ is the parameter space (in our case, say $\{ (\mu,\tau) : \tau > 0 \}$) then consider the transformation $\psi : \Theta \rightarrow \Theta; (\mu,\tau) \mapsto (b\mu+a, b^{-2} \tau )$. It is hence reasonable to require the following: any inference we make about $(x_i)$ in terms of $(\mu,\tau)$ should agree with the inference we make about $\phi(x)$ in terms of $\psi(\mu,\tau)$.
As an example, if $\phi(x) = ( 2x_i - 1 )$ and our inference about $(x_i)$ gives a value to $\mathbb P( 0\leq \mu \leq 2, 1 \leq \tau \leq 3 )$ then, as $\psi(\mu,\tau) = (2\mu-1, \tau/4)$, the inference about $(2x_i-1)$ should give the same value to $\mathbb P( -1 \leq \mu \leq 3, 1/4 \leq \tau \leq 3/4 )$.
Let $f(x|\theta)$ be the PDF of $x$ given the parameter value $\theta$. So $\mathbb P(x\in A | \theta) = \int_A f(x|\theta) dx$ for measurable $A\subseteq\Omega$. Our assumption is that applying $\phi$ to $\Omega$ is "the same as" applying $\psi$ to $\Theta$. That is, $$ \mathbb P(\phi(x)\in A | \theta) = \mathbb P(x\in A | \psi(\theta) ). $$ By the change of variables formula the distribution of $y = \phi(x)$ has density $f(\phi^{-1}(y)|\theta) |\det J(\phi^{-1})|$, where $J$ is the Jacobian. Using that $|\det J(\phi^{-1})| = 1 / |\det J(\phi)|$ we find that $$ f(\phi^{-1}(y)|\theta) = f(y|\psi(\theta)) |\det J(\phi)|, $$ or equivalently, with $x = \phi^{-1}(y)$, $$ f(x|\theta) = f(\phi(x)|\psi(\theta)) |\det J(\phi)|. $$
Let's check this for our normal distribution example.
Here sympy defeats me a little: I don't know how to specify $x_1, x_2, \cdots, x_n$ for an unknown $n$, so instead write $\sum_i (x_i - \mu)^2 = \sum_i x_i^2 - 2\mu \sum_i x_i + n\mu^2 = S_{xx} - 2\mu S_x + n\mu^2$.
def density(Sxx, Sx, n, mu, tau):
    r"""Return the density (2\pi\sigma^2)^{-n/2} \exp(-\sum (x_i-\mu)^2 / 2\sigma^2)
    = (2\pi)^{-n/2} \tau^{n/2} \exp(-\tau (S_{xx} - 2\mu S_x + n\mu^2)/2)"""
    return (2 * sympy.pi)**(-n/2) * tau**(n/2) * sympy.exp( - tau * (Sxx - 2*mu*Sx + n*mu**2) / 2 )
n = sympy.symbols("n", integer = True, positive = True)
Sx, mu = sympy.symbols("Sx mu", real = True)
Sxx, tau = sympy.symbols("Sxx tau", real = True, positive = True)
density(Sxx, Sx, n, mu, tau)
Our transformation $\phi$ is $x_i \mapsto bx_i+a$ and so $S_{xx} \mapsto \sum_i (bx_i+a)^2 = b^2S_{xx} + 2abS_x + na^2$ and $S_x \mapsto \sum_i bx_i+a = bS_x+na$.
def phi(a, b):
    def func(Sxx, Sx, n):
        return b*b*Sxx + 2*a*b*Sx + n*a*a, b*Sx + n*a, n
    return func
a, b = sympy.symbols("a b", real = True)
density(*phi(a,b)(Sxx,Sx,n),mu=mu,tau=tau)
Our transformation $\psi$ is $\mu\mapsto b\mu+a, \tau \mapsto b^{-2} \tau$.
def psi(a, b):
    def func(mu, tau):
        return b*mu+a, b**(-2) * tau
    return func
density(Sxx,Sx,n,*psi(a,b)(mu,tau))
Unfortunately, our "hack" means we can't automate the calculation of the determinant. We have $\phi:\mathbb R^n \rightarrow \mathbb R^n; (x_i) \mapsto (bx_i+a)$ so $J(\phi) = bI_n$ and hence $\det J(\phi) = b^n$.
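We can at least spot-check this for a fixed small $n$; for example, with $n=3$:
# For n = 3, phi(x) = (b*x_i + a) has Jacobian matrix b*I_3, so det J(phi) = b**3.
x1, x2, x3 = sympy.symbols("x1 x2 x3", real = True)
phi_vec = sympy.Matrix([b*x1 + a, b*x2 + a, b*x3 + a])
phi_vec.jacobian([x1, x2, x3]).det()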
rhs = density(*(phi(a,b)(Sxx,Sx,n) + psi(a,b)(mu,tau))) * sympy.Abs(b)**n
sympy.simplify(rhs)
lhs = density(Sxx,Sx,n,mu,tau)
lhs
Unfortunately, the following is a little distressing...
rhs == lhs
False
rhs - lhs == 0
False
sympy.simplify(rhs - lhs)
sympy.simplify(rhs - lhs) == 0
True
(In passing: == on sympy expressions tests structural equality, not mathematical equality, which is why we need simplify to see that rhs and lhs agree.) Okay, so what do we wish to be true of our inference, that is, of the posterior distribution?
$$ \mathbb P(\theta \in \psi(A) | \phi(x) ) = \mathbb P(\theta\in A | x). $$Again, if we write $p(\theta|x)$ for the posterior density, then this is equivalent to:
$$ p(\theta|x) = p(\psi(\theta)| \phi(x)) |\det J(\psi)|. $$By Bayes' Theorem, we want the following to be true:
$$ \frac{f(x|\theta) p(\theta)}{p(x)} = \frac{f(\phi(x)|\psi(\theta)) p(\psi(\theta))}{p(\phi(x))} |\det J(\psi)|. $$By our assumption $f(x|\theta) = f(\phi(x)|\psi(\theta)) |\det J(\phi)|$, this is equivalent to:
$$ \frac{f(\phi(x)|\psi(\theta)) |\det J(\phi)| p(\theta)}{p(x)} = \frac{f(\phi(x)|\psi(\theta)) p(\psi(\theta))}{p(\phi(x))} |\det J(\psi)|. $$We need to deal with the terms $p(x)$ and $p(\phi(x))$. Now, these are likely to be "improper distributions", much as we expect $p(\theta)$ to be improper. Thus, $\int p(x) dx$ may fail to be finite, and so we more correctly think of $p(x)$ as measuring "proportional" or "relative" probability: $\int_A p(x) dx$ cannot be interpreted as a probability, but we can make sense of the ratio $\int_A p(x) dx / \int_B p(x) dx$ as the ratio of how likely $x$ is to be in $A$ relative to how likely it is to be in $B$.
With that out of the way, it still makes sense to transform $x$. Our philosophy is that $\phi$ isn't "changing" the data, and so $\phi(x)$ should be no more or less likely than $x$. Formally then, we want $$ \mathbb P(x \in A) = \mathbb P(\phi(x)\in A). $$ By change of variables, this means that $$ p(x) = p(\phi(x)) |\det J(\phi)|. $$
Combining this with the above, what we hope to be true is that:
$$ \frac{f(\phi(x)|\psi(\theta)) |\det J(\phi)| p(\theta)}{p(\phi(x))|\det J(\phi)|} = \frac{f(\phi(x)|\psi(\theta)) p(\psi(\theta))}{p(\phi(x))} |\det J(\psi)|. $$This cancels down to:
$$ p(\theta) = p(\psi(\theta)) |\det J(\psi)|. $$Thus we have derived a relationship which the prior should have!
We have $\psi(\mu,\tau) = (b\mu+a, b^{-2} \tau)$ and so $|\det J(\psi)| = |b|^{-1}$. Thus we need
$$ p(\mu,\tau) = |b|^{-1} p(b\mu+a, b^{-2} \tau) $$for all $\mu,\tau,a,b$. Set $b=1$ and $a=-\mu$ to get
$$ p(\mu,\tau) = p(0, \tau) $$So $p(\mu,\tau) = h(\tau)$ for some function $h$. Then set $b = \tau^{1/2}$ to find that
$$ h(\tau) = |b|^{-1} h(|b|^{-2}\tau) \implies h(\tau) = \tau^{-1/2} h(1) \implies h(\tau) \propto \tau^{-1/2} $$as $\tau > 0$. Thus $p(\mu,\tau) \propto \tau^{-1/2}$.
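As a quick sanity check, sympy can verify this invariance relation for the derived prior; I restrict to $b > 0$ so that the powers expand cleanly.
# Check the derived invariance p(theta) = p(psi(theta)) |det J(psi)|, where
# |det J(psi)| = 1/|b|; b_pos is a positive symbol introduced just for this check.
b_pos = sympy.symbols("b_pos", positive = True)
def derived_prior(mu, tau):
    # The prior we just derived, p(mu, tau) proportional to tau**(-1/2).
    return tau**sympy.Rational(-1, 2)
mu2, tau2 = psi(a, b_pos)(mu, tau)
diff = derived_prior(mu, tau) - derived_prior(mu2, tau2) / b_pos
sympy.expand_power_base(diff, force = True)   # should be 0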
Firstly, let's check that at least the posterior is well-defined.
posterior = density(Sxx, Sx, n, mu, tau) * tau**sympy.Rational(-1, 2)   # prior tau**(-1/2), kept as an exact exponent
posterior
int1 = sympy.integrate(posterior, (mu, -sympy.oo, sympy.oo))
int1 = sympy.simplify(int1)
int1
sympy.integrate(int1, (tau, 1, sympy.oo))
Again, sympy defeats me: I'm not sure how to tell it $|S_{x}| = |\sum x_i| \leq n^{1/2} S_{xx}^{1/2}$ by Cauchy-Schwarz and so $S_x^2 / n \leq S_{xx}$.
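Spelling the inequality out, it is Cauchy-Schwarz applied to $(x_i)$ and $(1,\dots,1)$: $$ S_x^2 = \Big(\sum_i x_i \cdot 1\Big)^2 \leq \Big(\sum_i x_i^2\Big)\Big(\sum_i 1^2\Big) = n S_{xx}, $$so $\alpha = S_{xx} - S_x^2/n \geq 0$, with equality only when all the $x_i$ are equal.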
alpha = sympy.symbols("alpha", real = True, positive = True)
int2 = (1/sympy.sqrt(n)) * (2*sympy.pi)**((1-n)/2) * tau**((n-2)/2) * sympy.exp(-tau * alpha / 2)
int2
sympy.simplify( sympy.integrate(int2, (tau,0,sympy.oo)) )
def normalise_my_prior(alpha, n):
    return 2 * sympy.sqrt(2) * sympy.pi**((1-n)/2) * sympy.gamma((n+2)/2) / ( alpha**(n/2) * n * sympy.sqrt(n) )
normalise_my_prior(alpha, n)
sympy.simplify( int1 / normalise_my_prior(alpha, n) )
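As a further spot check, for concrete values (say $n=5$, $\alpha=2$) the integral of int2 over $\tau$ should equal normalise_my_prior(alpha, n), so the ratio should simplify to 1.
# Substitute n = 5, alpha = 2 as exact sympy integers and compare the tau-integral
# with the claimed normalising constant.
n_val, alpha_val = sympy.Integer(5), sympy.Integer(2)
check = sympy.integrate(int2.subs({n: n_val, alpha: alpha_val}), (tau, 0, sympy.oo))
sympy.simplify(check / normalise_my_prior(alpha_val, n_val))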
The Jeffreys Prior is one way to construct a "noninformative" prior which is "invariant": what this means is that you are given a procedure to follow to construct a prior, and if you then transform your coordinates, and follow the procedure in the new coordinate system, then you get a prior which is the transformed version of your previous prior. (There is some category theory here: this is a "covariant" prior in the category sense, not the stats sense, of the word.)
From Duke lecture notes we find that $p(s) = s^{-3/2}$ if $s=\sigma^2$. From Berkeley lecture notes we find $p(\sigma) = \sigma^{-2}$, though we're warned that this gives "poor convergence properties" (I don't fully understand what that means).
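For reference, the procedure takes $p_J(\theta) \propto \sqrt{\det I(\theta)}$, where $I$ is the Fisher information of a single observation; for the normal distribution $$ I(\mu,\sigma) = \begin{pmatrix} \sigma^{-2} & 0 \\ 0 & 2\sigma^{-2} \end{pmatrix} \implies p_J(\mu,\sigma) \propto \sigma^{-2}, \qquad I(\mu,s) = \begin{pmatrix} s^{-1} & 0 \\ 0 & \tfrac12 s^{-2} \end{pmatrix} \implies p_J(\mu,s) \propto s^{-3/2}, $$which matches the two priors quoted above.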
Let's just check that these are consistent:
$$ p_\sigma(\sigma) = p_s(s) \Big|\frac{ds}{d\sigma}\Big| = 2 s^{-3/2} \sigma \propto \sigma^{-2}. $$My prior is in terms of $\tau = \sigma^{-2}$ and is $p(\tau) = \tau^{-1/2}$. Thus
$$ p_\sigma(\sigma) = p_\tau(\tau) \Big|\frac{d\tau}{d\sigma}\Big| = 2\tau^{-1/2} \sigma^{-3} \propto \sigma^{-2}. $$So, this agrees with the Jeffreys prior.
For comparison, if $p_\sigma(\sigma) = \sigma^{-1}$ then $p_\tau(\tau) = \sigma^{-1} \Big|\frac{d\sigma}{d\tau}\Big| = \sigma^{-1} \frac12 \sigma^3 \propto \sigma^2 = 1/\tau$.
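These change-of-variables calculations are easy to get wrong, so here is a quick sympy check of both (introducing a positive symbol sigma just for this purpose):
sigma = sympy.symbols("sigma", positive = True)
s_of_sigma = sigma**2          # s = sigma^2
tau_of_sigma = sigma**(-2)     # tau = 1/sigma^2
# Jeffreys prior in s, pushed to sigma: p_s(s) |ds/dsigma|.
p_sigma_from_s = s_of_sigma**sympy.Rational(-3, 2) * sympy.Abs(sympy.diff(s_of_sigma, sigma))
# My prior in tau, pushed to sigma: p_tau(tau) |dtau/dsigma|.
p_sigma_from_tau = tau_of_sigma**sympy.Rational(-1, 2) * sympy.Abs(sympy.diff(tau_of_sigma, sigma))
sympy.simplify(p_sigma_from_s), sympy.simplify(p_sigma_from_tau)   # both should be 2/sigma**2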
The posterior distribution is then
$$ p(\mu,\tau | x) \propto p(x | \mu,\tau) p(\mu,\tau) = \Big(\frac{\tau}{2\pi}\Big)^{n/2} \exp\Big(-\frac{\tau}{2} \sum (x_i-\mu)^2 \Big) \tau^{-1/2}. $$With $\alpha = S_{xx} - S_x^2 / n$, from the above calculation, the normalised posterior is
$$ p(\mu,\tau | x) = \frac{\alpha^{n/2} n^{3/2}}{2\sqrt 2} \pi^{-1/2} \Gamma\Big(\frac{n}{2}+1\Big)^{-1} \Big(\frac{\tau}{2}\Big)^{n/2} \exp\Big(-\frac{\tau}{2} \sum (x_i-\mu)^2 \Big) \tau^{-1/2} $$What I do below is generate some samples from $N(0,1)$ to compare this distribution with the one from using the prior $p(\sigma) = 1/\sigma$.
Firstly we find the normalisation for this new prior, repeating the work above.
posterior = density(Sxx, Sx, n, mu, tau) * tau**(-1)
sympy.simplify( sympy.integrate(posterior, (mu,-sympy.oo, sympy.oo)))
int1 = (1/sympy.sqrt(n)) * (2*sympy.pi)**((1-n)/2) * tau**((n-3)/2) * sympy.exp(-tau * alpha / 2)
sympy.integrate(int1, (tau,0,sympy.oo))
def normalise_other_prior(alpha, n):
    return ( 2 / (alpha * sympy.sqrt(n)) ) * (2*sympy.pi)**((1-n)/2) * (alpha/2)**((3-n)/2) * sympy.gamma((n-1)/2)
normalise_other_prior(alpha, n)
sympy.simplify( int1 / normalise_other_prior(alpha, n) )
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
def make_example(n):
    x = np.random.normal(size = n)
    Sxx = np.sum(x*x)
    Sx = np.sum(x)
    def posterior(mu, tau):
        return ( (tau / (2*sympy.pi))**(n/2) * sympy.exp( - tau * (Sxx - 2*mu*Sx + n*mu*mu) / 2 ) * tau**(-1/2)
                / normalise_my_prior(Sxx-Sx*Sx/n, n) )
    def posterior1(mu, tau):
        return ( (tau / (2*sympy.pi))**(n/2) * sympy.exp( - tau * (Sxx - 2*mu*Sx + n*mu*mu) / 2 ) * tau**(-1)
                / normalise_other_prior(Sxx-Sx*Sx/n, n) )
    return posterior, posterior1
posterior, posterior1 = make_example(10)
mubins = np.linspace(-1, 1, 50)
taubins = np.linspace(0.1, 2, 50)
# Slow as our functions use sympy primitives not numpy primitives.
Z = np.zeros([len(mubins),len(taubins)])
Z1 = np.zeros([len(mubins),len(taubins)])
for x, mux in enumerate(mubins):
    for y, tauy in enumerate(taubins):
        Z[x,y] = posterior(mux, tauy)
        Z1[x,y] = posterior1(mux, tauy)
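As an aside, sympy.lambdify would let numpy evaluate a grid like this in one go; a sketch, which draws its own fresh sample (so its numbers won't exactly match the contours below):
# Build the normalised posterior symbolically in (mu, tau) for a fresh sample with
# n = 10, then turn it into a numpy-callable function with lambdify.
n_num = 10
xs = np.random.normal(size = n_num)
Sxx_num, Sx_num = np.sum(xs * xs), np.sum(xs)
expr = ( density(Sxx_num, Sx_num, n_num, mu, tau) * tau**sympy.Rational(-1, 2)
        / normalise_my_prior(Sxx_num - Sx_num**2 / n_num, n_num) )
post_fn = sympy.lambdify((mu, tau), expr, "numpy")
MU, TAU = np.meshgrid(mubins, taubins, indexing = "ij")
Z_fast = post_fn(MU, TAU)   # same shape as Z, evaluated without a Python loop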
plt.contour(mubins, taubins, Z.T)
<matplotlib.contour.QuadContourSet at 0x17e40c50>
plt.contour(mubins, taubins, Z1.T)
<matplotlib.contour.QuadContourSet at 0x17e9e198>
Comparing two dimensional distributions is hard: what do "confidence intervals" mean, for example? So, let's marginalise out $\mu$ and get the posterior distribution of $\tau$. We copy some formulae from above.
posterior = (alpha * tau / 2)**(n/2) / (tau * sympy.gamma(n/2)) * sympy.exp(-alpha * tau / 2)
posterior
posterior1 = (alpha * tau / 2)**((n-1)/2) / (tau * sympy.gamma((n-1)/2)) * sympy.exp(-alpha * tau / 2)
posterior1
So actually the only difference is that the "other" prior multiplies the posterior by a further $\tau^{-1/2}$ (and renormalises).
from scipy.special import gamma
def make_example(n):
    x = np.random.normal(size=n)
    alpha = np.sum(x*x) - np.sum(x)**2/n
    def p1(tau):
        return (alpha * tau / 2)**(n/2) / (tau * gamma(n/2)) * np.exp(-alpha * tau / 2)
    def p2(tau):
        return (alpha * tau / 2)**((n-1)/2) / (tau * gamma((n-1)/2)) * np.exp(-alpha * tau / 2)
    return p1, p2
p1, p2 = make_example(10)
x = np.linspace(0.1, 2, 50)
y1 = p1(x)
y2 = p2(x)
plt.plot(x, y1)
plt.plot(x, y2)
[<matplotlib.lines.Line2D at 0x19b0f9e8>]
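As a final cross-check: both marginal posteriors are Gamma densities in $\tau$ with rate $\alpha/2$, with shape $n/2$ for my prior and $(n-1)/2$ for the other, so scipy.stats.gamma should reproduce p1 and p2 (sketched here with a fresh sample and the hypothetical names n_check, alpha_check).
from scipy.stats import gamma as gamma_dist

n_check = 10
xs = np.random.normal(size = n_check)
alpha_check = np.sum(xs*xs) - np.sum(xs)**2 / n_check
taus = np.linspace(0.1, 2, 50)

# p1 again, written out with this sample's alpha (gamma here is scipy.special.gamma).
def p1_check(tau):
    return (alpha_check * tau / 2)**(n_check/2) / (tau * gamma(n_check/2)) * np.exp(-alpha_check * tau / 2)

# scipy parametrises the Gamma distribution by shape `a` and scale = 1/rate.
np.allclose(p1_check(taus), gamma_dist.pdf(taus, a = n_check/2, scale = 2/alpha_check))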