This is a little mathematical background to justify why we might require priors to be "invariant" under certain transformations of the parameter space. This is not a new idea, but the derivation is my own (and hence so are errors...)
I'll use sympy to do some examples.
%matplotlib inline
import sympy
sympy.init_printing(use_latex = True)
As motivation, consider a model where $(X_i)_{i=1}^n$ are IID samples from $N(\mu,\sigma^2)$ where we wish to use a sample $(x_i)$ to make inferences about $\mu, \sigma$. For some variety, I'll use the precision $\tau = 1 / \sigma^2$.
So our random variable $x=(x_i)$ takes values in $\Omega = \mathbb R^n$. Consider the transformation $\phi : \Omega\rightarrow\Omega; (x_i) \mapsto (bx_i+a)$. That is, for each sample, we scale by $b$ and add $a$. Then $\phi(x)$ is a sample of IID $N(b\mu+a, b^2\sigma^2)$. If $\Theta$ is the parameter space (in our case, say $\{ (\mu,\tau) : \tau > 0 \}$) then consider the transformation $\psi : \Theta \rightarrow \Theta; (\mu,\tau) \mapsto (b\mu+a, b^{-2} \tau )$. It is hence reasonable to require the following: any inference we make about $(x_i)$ in terms of $(\mu,\tau)$ should agree with the inference we make about $\phi(x)$ in terms of $\psi(\mu,\tau)$.
As an example, if $\phi(x) = ( 2x_i - 1 )$ and our inference about $(x_i)$ gives a value to $\mathbb P( 0\leq \mu \leq 2, 1 \leq \tau \leq 3 )$ then, as $\psi(\mu,\tau) = (2\mu-1, \tau/4)$, the inference about $(2x_i-1)$ should give the same value to $\mathbb P( -1 \leq \mu \leq 3, 1/4 \leq \tau \leq 3/4 )$.
Let $f(x|\theta)$ be the PDF of $x$ given the parameter value $\theta$. So $\mathbb P(x\in A | \theta) = \int_A f(x|\theta) dx$ for measurable $A\subseteq\Omega$. Our assumption is that applying $\phi$ to $\Omega$ is "the same as" applying $\psi$ to $\Theta$. That is, $$ \mathbb P(\phi(x)\in A | \theta) = \mathbb P(x\in A | \psi(\theta) ). $$ By the change of variables formula the distribution of $y = \phi(x)$ has density $f(\phi^{-1}(y)|\theta) |\det J(\phi^{-1})|$, where $J$ is the Jacobian. Using that $|\det J(\phi^{-1})| = 1 / |\det J(\phi)|$ we find that $$ f(\phi^{-1}(y)|\theta) = f(y|\psi(\theta)) |\det J(\phi)|, $$ or equivalently, with $x = \phi^{-1}(y)$, $$ f(x|\theta) = f(\phi(x)|\psi(\theta)) |\det J(\phi)|. $$
Let's check this for our normal distribution example.
Here sympy defeats me a little: I don't know how to specify $x_1, x_2, \cdots, x_n$ for an unknown $n$, so instead write $\sum_i (x_i - \mu)^2 = \sum_i x_i^2 - 2\mu \sum_i x_i + n\mu^2 = S_{xx} - 2\mu S_x + n\mu^2$.
def density(Sxx, Sx, n, mu, tau):
    r"""Return the density (2\pi\sigma^2)^{-n/2} \exp(-\sum (x_i-\mu)^2 / 2\sigma^2)
    = (2\pi)^{-n/2} \tau^{n/2} \exp(-\tau (S_{xx} - 2\mu S_x + n\mu^2)/2)"""
    return (2 * sympy.pi)**(-n/2) * tau**(n/2) * sympy.exp( - tau * (Sxx - 2*mu*Sx + n*mu**2) / 2 )
n = sympy.symbols("n", integer = True, positive = True)
Sx, mu = sympy.symbols("Sx mu", real = True)
Sxx, tau = sympy.symbols("Sxx tau", real = True, positive = True)
density(Sxx, Sx, n, mu, tau)
Our transformation $\phi$ is $x_i \mapsto bx_i+a$ and so $S_{xx} \mapsto \sum_i (bx_i+a)^2 = b^2S_{xx} + 2abS_x + na^2$ and $S_x \mapsto \sum_i bx_i+a = bS_x+na$.
def phi(a, b):
    def func(Sxx, Sx, n):
        return b*b*Sxx + 2*a*b*Sx + n*a*a, b*Sx + n*a, n
    return func
a, b = sympy.symbols("a b", real = True)
density(*phi(a,b)(Sxx,Sx,n),mu=mu,tau=tau)
Our transformation $\psi$ is $\mu\mapsto b\mu+a, \tau \mapsto b^{-2} \tau$.
def psi(a, b):
    def func(mu, tau):
        return b*mu+a, b**(-2) * tau
    return func
density(Sxx,Sx,n,*psi(a,b)(mu,tau))
Unfortunately, our "hack" means we can't automate the calculation of the determinant. We have $\phi:\mathbb R^n \rightarrow \mathbb R^n; (x_i) \mapsto (bx_i+a)$ so $J(\phi) = bI_n$ and hence $\det J(\phi) = b^n$.
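We can at least spot-check this for a fixed small $n$; for example, with $n=3$:
# For n = 3, phi(x) = (b*x_i + a) has Jacobian matrix b*I_3, so det J(phi) = b**3.
x1, x2, x3 = sympy.symbols("x1 x2 x3", real = True)
phi_vec = sympy.Matrix([b*x1 + a, b*x2 + a, b*x3 + a])
phi_vec.jacobian([x1, x2, x3]).det()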
rhs = density(*(phi(a,b)(Sxx,Sx,n) + psi(a,b)(mu,tau))) * sympy.Abs(b)**n
sympy.simplify(rhs)
lhs = density(Sxx,Sx,n,mu,tau)
lhs
Unfortunately, the following is a little distressing...
rhs == lhs
False
rhs - lhs == 0
False
sympy.simplify(rhs - lhs)
sympy.simplify(rhs - lhs) == 0
True
(In passing: == on sympy expressions tests structural equality, not mathematical equality, which is why we need simplify to see that rhs and lhs agree.) Okay, so what do we wish to be true of our inference, that is, of the posterior distribution?
$$ \mathbb P(\theta \in \psi(A) | \phi(x) ) = \mathbb P(\theta\in A | x). $$Again, if we write $p(\theta|x)$ for the posterior density, then this is equivalent to:
$$ p(\theta|x) = p(\psi(\theta)| \phi(x)) |\det J(\psi)|. $$By Bayes' Theorem, we want the following to be true:
$$ \frac{f(x|\theta) p(\theta)}{p(x)} = \frac{f(\phi(x)|\psi(\theta)) p(\psi(\theta))}{p(\phi(x))} |\det J(\psi)|. $$By our assumption $f(x|\theta) = f(\phi(x)|\psi(\theta)) |\det J(\phi)|$, this is equivalent to:
$$ \frac{f(\phi(x)|\psi(\theta)) |\det J(\phi)| p(\theta)}{p(x)} = \frac{f(\phi(x)|\psi(\theta)) p(\psi(\theta))}{p(\phi(x))} |\det J(\psi)|. $$We need to deal with the terms $p(x)$ and $p(\phi(x))$. Now, these are likely to be "improper distributions", much as we expect $p(\theta)$ to be improper. Thus, $\int p(x) dx$ may fail to be finite, and so we more correctly think of $p(x)$ as measuring "proportional" or "relative" probability: $\int_A p(x) dx$ cannot be interpreted as a probability, but we can make sense of the ratio $\int_A p(x) dx / \int_B p(x) dx$ as the ratio of how likely $x$ is to be in $A$ relative to how likely it is to be in $B$.
With that out of the way, it still makes sense to transform $x$. Our philosophy is that $\phi$ isn't "changing" the data, and so $\phi(x)$ should be no more or less likely than $x$. Formally then, we want $$ \mathbb P(x \in A) = \mathbb P(\phi(x)\in A). $$ By change of variables, this means that $$ p(x) = p(\phi(x)) |\det J(\phi)|. $$
Combining this with the above, what we hope to be true is that:
$$ \frac{f(\phi(x)|\psi(\theta)) |\det J(\phi)| p(\theta)}{p(\phi(x))|\det J(\phi)|} = \frac{f(\phi(x)|\psi(\theta)) p(\psi(\theta))}{p(\phi(x))} |\det J(\psi)|. $$This cancels down to:
$$ p(\theta) = p(\psi(\theta)) |\det J(\psi)|. $$Thus we have derived a relationship which the prior should have!
We have $\psi(\mu,\tau) = (b\mu+a, b^{-2} \tau)$ and so $|\det J(\psi)| = |b|^{-1}$. Thus we need
$$ p(\mu,\tau) = |b|^{-1} p(b\mu+a, b^{-2} \tau) $$for all $\mu,\tau,a,b$. Set $b=1$ and $a=-\mu$ to get
$$ p(\mu,\tau) = p(0, \tau) $$So $p(\mu,\tau) = h(\tau)$ for some function $h$. Then set $b = \tau^{1/2}$ to find that
$$ h(\tau) = |b|^{-1} h(|b|^{-2}\tau) \implies h(\tau) = \tau^{-1/2} h(1) \implies h(\tau) \propto \tau^{-1/2} $$as $\tau > 0$. Thus $p(\mu,\tau) \propto \tau^{-1/2}$.
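As a quick sanity check, sympy can verify this invariance relation for the derived prior; I restrict to $b > 0$ so that the powers expand cleanly.
# Check the derived invariance p(theta) = p(psi(theta)) |det J(psi)|, where
# |det J(psi)| = 1/|b|; b_pos is a positive symbol introduced just for this check.
b_pos = sympy.symbols("b_pos", positive = True)
def derived_prior(mu, tau):
    # The prior we just derived, p(mu, tau) proportional to tau**(-1/2).
    return tau**sympy.Rational(-1, 2)
mu2, tau2 = psi(a, b_pos)(mu, tau)
diff = derived_prior(mu, tau) - derived_prior(mu2, tau2) / b_pos
sympy.expand_power_base(diff, force = True)   # should be 0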
Firstly, let's check that at least the posterior is well-defined.
posterior = density(Sxx, Sx, n, mu, tau) * tau**sympy.Rational(-1, 2)   # prior tau**(-1/2), kept as an exact exponent
posterior
int1 = sympy.integrate(posterior, (mu, -sympy.oo, sympy.oo))
int1 = sympy.simplify(int1)
int1
sympy.integrate(int1, (tau, 1, sympy.oo))
Again, sympy defeats me: I'm not sure how to tell it $|S_{x}| = |\sum x_i| \leq n^{1/2} S_{xx}^{1/2}$ by Cauchy-Schwarz and so $S_x^2 / n \leq S_{xx}$.
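Spelling the inequality out, it is Cauchy-Schwarz applied to $(x_i)$ and $(1,\dots,1)$: $$ S_x^2 = \Big(\sum_i x_i \cdot 1\Big)^2 \leq \Big(\sum_i x_i^2\Big)\Big(\sum_i 1^2\Big) = n S_{xx}, $$so $\alpha = S_{xx} - S_x^2/n \geq 0$, with equality only when all the $x_i$ are equal.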
alpha = sympy.symbols("alpha", real = True, positive = True)
int2 = (1/sympy.sqrt(n)) * (2*sympy.pi)**((1-n)/2) * tau**((n-2)/2) * sympy.exp(-tau * alpha / 2)
int2
sympy.simplify( sympy.integrate(int2, (tau,0,sympy.oo)) )
def normalise_my_prior(alpha, n):
    return 2 * sympy.sqrt(2) * sympy.pi**((1-n)/2) * sympy.gamma((n+2)/2) / ( alpha**(n/2) * n * sympy.sqrt(n) )
normalise_my_prior(alpha, n)
sympy.simplify( int1 / normalise_my_prior(alpha, n) )
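As a further spot check, for concrete values (say $n=5$, $\alpha=2$) the integral of int2 over $\tau$ should equal normalise_my_prior(alpha, n), so the ratio should simplify to 1.
# Substitute n = 5, alpha = 2 as exact sympy integers and compare the tau-integral
# with the claimed normalising constant.
n_val, alpha_val = sympy.Integer(5), sympy.Integer(2)
check = sympy.integrate(int2.subs({n: n_val, alpha: alpha_val}), (tau, 0, sympy.oo))
sympy.simplify(check / normalise_my_prior(alpha_val, n_val))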
The Jeffreys Prior is one way to construct a "noninformative" prior which is "invariant": what this means is that you are given a procedure to follow to construct a prior, and if you then transform your coordinates, and follow the procedure in the new coordinate system, then you get a prior which is the transformed version of your previous prior. (There is some category theory here: this is a "covariant" prior in the category sense, not the stats sense, of the word.)
From Duke lecture notes we find that $p(s) = s^{-3/2}$ if $s=\sigma^2$. From Berkeley lecture notes we find $p(\sigma) = \sigma^{-2}$, though we're warned that this gives "poor convergence properties" (I don't fully understand what that means).
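For reference, the procedure takes $p_J(\theta) \propto \sqrt{\det I(\theta)}$, where $I$ is the Fisher information of a single observation; for the normal distribution $$ I(\mu,\sigma) = \begin{pmatrix} \sigma^{-2} & 0 \\ 0 & 2\sigma^{-2} \end{pmatrix} \implies p_J(\mu,\sigma) \propto \sigma^{-2}, \qquad I(\mu,s) = \begin{pmatrix} s^{-1} & 0 \\ 0 & \tfrac12 s^{-2} \end{pmatrix} \implies p_J(\mu,s) \propto s^{-3/2}, $$which matches the two priors quoted above.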
Let's just check that these are consistent:
$$ p_\sigma(\sigma) = p_s(s) \Big|\frac{ds}{d\sigma}\Big| = 2 s^{-3/2} \sigma \propto \sigma^{-2}. $$My prior is in terms of $\tau = \sigma^{-2}$ and is $p(\tau) = \tau^{-1/2}$. Thus
$$ p_\sigma(\sigma) = p_\tau(\tau) \Big|\frac{d\tau}{d\sigma}\Big| = 2\tau^{-1/2} \sigma^{-3} \propto \sigma^{-2}. $$So, this agrees with the Jeffreys prior.
For comparison, if $p_\sigma(\sigma) = \sigma^{-1}$ then $p_\tau(\tau) = \sigma^{-1} \Big|\frac{d\sigma}{d\tau}\Big| = \sigma^{-1} \frac12 \sigma^3 \propto \sigma^2 = 1/\tau$.
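These change-of-variables calculations are easy to get wrong, so here is a quick sympy check of both (introducing a positive symbol sigma just for this purpose):
sigma = sympy.symbols("sigma", positive = True)
s_of_sigma = sigma**2          # s = sigma^2
tau_of_sigma = sigma**(-2)     # tau = 1/sigma^2
# Jeffreys prior in s, pushed to sigma: p_s(s) |ds/dsigma|.
p_sigma_from_s = s_of_sigma**sympy.Rational(-3, 2) * sympy.Abs(sympy.diff(s_of_sigma, sigma))
# My prior in tau, pushed to sigma: p_tau(tau) |dtau/dsigma|.
p_sigma_from_tau = tau_of_sigma**sympy.Rational(-1, 2) * sympy.Abs(sympy.diff(tau_of_sigma, sigma))
sympy.simplify(p_sigma_from_s), sympy.simplify(p_sigma_from_tau)   # both should be 2/sigma**2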
The posterior distribution is then
$$ p(\mu,\tau | x) \propto p(x | \mu,\tau) p(\mu,\tau) = \Big(\frac{\tau}{2\pi}\Big)^{n/2} \exp\Big(-\frac{\tau}{2} \sum (x_i-\mu)^2 \Big) \tau^{-1/2}. $$With $\alpha = S_{xx} - S_x^2 / n$, from the above calculation, the normalised posterior is
$$ p(\mu,\tau | x) = \frac{\alpha^{n/2} n^{3/2}}{2\sqrt 2} \pi^{-1/2} \Gamma\Big(\frac{n}{2}+1\Big)^{-1} \Big(\frac{\tau}{2}\Big)^{n/2} \exp\Big(-\frac{\tau}{2} \sum (x_i-\mu)^2 \Big) \tau^{-1/2} $$What I do below is generate some samples from $N(0,1)$ to compare this distribution with the one from using the prior $p(\sigma) = 1/\sigma$.
Firstly we find the normalisation for this new prior, repeating the work above.
posterior = density(Sxx, Sx, n, mu, tau) * tau**(-1)
sympy.simplify( sympy.integrate(posterior, (mu,-sympy.oo, sympy.oo)))
int1 = (1/sympy.sqrt(n)) * (2*sympy.pi)**((1-n)/2) * tau**((n-3)/2) * sympy.exp(-tau * alpha / 2)
sympy.integrate(int1, (tau,0,sympy.oo))
def normalise_other_prior(alpha, n):
    return ( 2 / (alpha * sympy.sqrt(n)) ) * (2*sympy.pi)**((1-n)/2) * (alpha/2)**((3-n)/2) * sympy.gamma((n-1)/2)
normalise_other_prior(alpha, n)
sympy.simplify( int1 / normalise_other_prior(alpha, n) )
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
def make_example(n):
    x = np.random.normal(size = n)
    Sxx = np.sum(x*x)
    Sx = np.sum(x)
    def posterior(mu, tau):
        return ( (tau / (2*sympy.pi))**(n/2) * sympy.exp( - tau * (Sxx - 2*mu*Sx + n*mu*mu) / 2 ) * tau**(-1/2)
                / normalise_my_prior(Sxx-Sx*Sx/n, n) )
    def posterior1(mu, tau):
        return ( (tau / (2*sympy.pi))**(n/2) * sympy.exp( - tau * (Sxx - 2*mu*Sx + n*mu*mu) / 2 ) * tau**(-1)
                / normalise_other_prior(Sxx-Sx*Sx/n, n) )
    return posterior, posterior1
posterior, posterior1 = make_example(10)
mubins = np.linspace(-1, 1, 50)
taubins = np.linspace(0.1, 2, 50)
# Slow as our functions use sympy primitives not numpy primitives.
Z = np.zeros([len(mubins),len(taubins)])
Z1 = np.zeros([len(mubins),len(taubins)])
for x, mux in enumerate(mubins):
    for y, tauy in enumerate(taubins):
        Z[x,y] = posterior(mux, tauy)
        Z1[x,y] = posterior1(mux, tauy)
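As an aside, sympy.lambdify would let numpy evaluate a grid like this in one go; a sketch, which draws its own fresh sample (so its numbers won't exactly match the contours below):
# Build the normalised posterior symbolically in (mu, tau) for a fresh sample with
# n = 10, then turn it into a numpy-callable function with lambdify.
n_num = 10
xs = np.random.normal(size = n_num)
Sxx_num, Sx_num = np.sum(xs * xs), np.sum(xs)
expr = ( density(Sxx_num, Sx_num, n_num, mu, tau) * tau**sympy.Rational(-1, 2)
        / normalise_my_prior(Sxx_num - Sx_num**2 / n_num, n_num) )
post_fn = sympy.lambdify((mu, tau), expr, "numpy")
MU, TAU = np.meshgrid(mubins, taubins, indexing = "ij")
Z_fast = post_fn(MU, TAU)   # same shape as Z, evaluated without a Python loop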
plt.contour(mubins, taubins, Z.T)
<matplotlib.contour.QuadContourSet at 0x17e40c50>
plt.contour(mubins, taubins, Z1.T)
<matplotlib.contour.QuadContourSet at 0x17e9e198>
Comparing two dimensional distributions is hard: what do "confidence intervals" mean, for example? So, let's marginalise out $\mu$ and get the posterior distribution of $\tau$. We copy some formulae from above.
posterior = (alpha * tau / 2)**(n/2) / (tau * sympy.gamma(n/2)) * sympy.exp(-alpha * tau / 2)
posterior
posterior1 = (alpha * tau / 2)**((n-1)/2) / (tau * sympy.gamma((n-1)/2)) * sympy.exp(-alpha * tau / 2)
posterior1
So actually the only difference is that the "other" prior multiplies the posterior by a further $\tau^{-1/2}$ (and renormalises).
from scipy.special import gamma
def make_example(n):
    x = np.random.normal(size=n)
    alpha = np.sum(x*x) - np.sum(x)**2/n
    def p1(tau):
        return (alpha * tau / 2)**(n/2) / (tau * gamma(n/2)) * np.exp(-alpha * tau / 2)
    def p2(tau):
        return (alpha * tau / 2)**((n-1)/2) / (tau * gamma((n-1)/2)) * np.exp(-alpha * tau / 2)
    return p1, p2
p1, p2 = make_example(10)
x = np.linspace(0.1, 2, 50)
y1 = p1(x)
y2 = p2(x)
plt.plot(x, y1)
plt.plot(x, y2)
[<matplotlib.lines.Line2D at 0x19b0f9e8>]
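As a final cross-check: both marginal posteriors are Gamma densities in $\tau$ with rate $\alpha/2$, with shape $n/2$ for my prior and $(n-1)/2$ for the other, so scipy.stats.gamma should reproduce p1 and p2 (sketched here with a fresh sample and the hypothetical names n_check, alpha_check).
from scipy.stats import gamma as gamma_dist

n_check = 10
xs = np.random.normal(size = n_check)
alpha_check = np.sum(xs*xs) - np.sum(xs)**2 / n_check
taus = np.linspace(0.1, 2, 50)

# p1 again, written out with this sample's alpha (gamma here is scipy.special.gamma).
def p1_check(tau):
    return (alpha_check * tau / 2)**(n_check/2) / (tau * gamma(n_check/2)) * np.exp(-alpha_check * tau / 2)

# scipy parametrises the Gamma distribution by shape `a` and scale = 1/rate.
np.allclose(p1_check(taus), gamma_dist.pdf(taus, a = n_check/2, scale = 2/alpha_check))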