I've been having trouble with the normal distribution for some time now.
Although I'm quite familiar with its shape, there is some stuff that I don't fully understand.
The goal of this post is to "rediscover" some of its attributes with a couple of computations.
The standard way to write the normal distribution is with this formula:

$$ f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}} $$

Now, for me, trying to memorize this formula raises several questions: where does the normalizing factor in front come from, and what exactly do $\mu$ and $\sigma$ do?
While trying to answer these questions, I've found a couple of points that make understanding (and memorizing) the formula easier.
The normal distribution formula can be thought of as a generalization of the so-called Gaussian integral:
$$ \int_{-\infty}^{\infty} e^{-x^2}\,dx $$

There is a neat trick that involves rewriting this as a polar integral (well explained on the Wikipedia page for the Gaussian integral), which allows computing the exact value of the integral: $\sqrt{\pi}$.
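As a sanity check, we can verify this numerically (a quick sketch using scipy.integrate.quad, which isn't used elsewhere in this post):

import numpy as np
from scipy.integrate import quad

# Integrate e^(-x^2) over the whole real line; quad returns (value, error estimate).
value, abs_err = quad(lambda x: np.exp(-x**2), -np.inf, np.inf)
print(value, np.sqrt(np.pi))  # both should print ~1.7724538509055159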
Once this is known, one can rescale the $x$ inside the integrand. Substituting $x \to x / (\sqrt{2}\,\sigma)$ and applying the change of variables, we obtain:

$$ \int_{-\infty}^{\infty} e^{-\frac{x^2}{2 \sigma^2}}\,dx = \sqrt{2 \pi \sigma^2} $$

Dividing the $x$ by $\sigma$ multiplies the integral by $\sigma$; with the $1/(2\sigma^2)$ scaling actually used in the normal distribution, the integral evaluates to $\sqrt{2\pi\sigma^2}$. This explains the normalizing factor $1/\sqrt{2\pi\sigma^2}$ in front of the exponential.
Finally, translating $x$ by $\mu$ is just another change of variables and doesn't change the value of the integral, so the same normalizing factor works for any mean.
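Both facts are easy to check numerically (a small sketch; the integration range and grid size are arbitrary but wide enough for these sigmas):

import numpy as np

x = np.linspace(-100, 100, num=1_000_000)
for mu, sigma in [(0, 1), (0, 2.5), (3, 2.5)]:
    # Integral of the unnormalized density: should equal sqrt(2 pi sigma^2),
    # regardless of the shift mu.
    integral = np.trapz(np.exp(-(x - mu)**2 / (2 * sigma**2)), x)
    print(mu, sigma, integral, np.sqrt(2 * np.pi * sigma**2))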
So that gives some intuition for the formula. But what about the roles of $\sigma$ and $\mu$? What do they actually do?
Let's do some plots.
import numpy as np
import holoviews as hv
hv.extension('bokeh', logo=False)

def normal(x, mu, sigma):
    """Normal probability density function."""
    return 1/np.sqrt(2 * np.pi * sigma**2) * np.exp(-(x - mu)**2/(2 * sigma**2))

def normal_curve(mu, sigma):
    """Plot the normal density over a fixed x range."""
    x = np.linspace(-5, 5, num=100)
    y = normal(x, mu, sigma)
    return hv.Curve((x, y)).opts(show_grid=True, tools=['hover'])

# One curve per (mu, sigma) combination, laid out as a grid of plots.
mus = [-2, -0.5, 0, 2, 4]
sigmas = [0.3, 0.75, 1.0, 2.5, 4]
curve_dict_2D = {(mu, sigma): normal_curve(mu, sigma) for mu in mus for sigma in sigmas}
gridspace = hv.GridSpace(curve_dict_2D, kdims=['mu', 'sigma'])
gridspace
I think this grid of plots summarizes well what $\mu$ and $\sigma$ do: $\mu$ shifts the curve along the x-axis, while $\sigma$ spreads it out (and lowers the peak, since the total area stays 1).
Of course, one would also like to memorize some interesting facts about how the parameters relate to probabilities.
Two things are interesting here: the value the distribution takes at certain points, as well as the value the integral of the distribution takes over certain intervals. Why these two different things? Because the first describes the shape of the density curve, while the second gives actual probabilities.
A quote from the scipy documentation:
"The function has its peak at the mean, and its 'spread' increases with the standard deviation (the function reaches 0.607 times its maximum at $x + \sigma$ and $x - \sigma$)."
Let's see if we can verify this. At $x = \mu \pm \sigma$, the exponent in the formula becomes $-\sigma^2 / (2\sigma^2) = -1/2$, so the density there should be $e^{-1/2}$ times the peak value:
np.exp(-1/2)
0.6065306597126334
for n_sigma in range(1, 4):
    # The density at mu + n*sigma is exp(-n^2/2) times the density at the peak.
    print(f"value at mu + {n_sigma} sigma: {np.exp(-n_sigma**2/2):.3f} times its maximum")
value at mu + 1 sigma: 0.607 times its maximum
value at mu + 2 sigma: 0.135 times its maximum
value at mu + 3 sigma: 0.011 times its maximum
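We can cross-check this with the normal function defined above; the ratio doesn't depend on the particular mu and sigma chosen here:

mu, sigma = 2, 4.5
for n_sigma in range(1, 4):
    # Ratio of the density at mu + n*sigma to the density at the peak.
    ratio = normal(mu + n_sigma * sigma, mu, sigma) / normal(mu, mu, sigma)
    print(f"ratio at mu + {n_sigma} sigma: {ratio:.3f}")  # 0.607, 0.135, 0.011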
Here, we want to know how much of the probability mass lies between $\mu - n\sigma$ and $\mu + n\sigma$ for 1, 2 or 3 sigmas. Let's do some numerical integration to find out!
mu, sigma = 2, 4.5
for n_sigma in range(1, 4):
    # Integrate the density over [mu - n*sigma, mu + n*sigma] with the trapezoidal rule.
    x = np.linspace(mu - n_sigma * sigma, mu + n_sigma * sigma, num=10000)
    y = normal(x, mu, sigma)
    integral = np.trapz(y, x)
    print(f"integral between mu - {n_sigma} sigma and mu + {n_sigma} sigma: {integral:.4f}")
integral between mu - 1 sigma and mu + 1 sigma: 0.6827
integral between mu - 2 sigma and mu + 2 sigma: 0.9545
integral between mu - 3 sigma and mu + 3 sigma: 0.9973
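These numbers match the closed-form answer: the mass within $n$ sigmas of the mean is $2\Phi(n) - 1$, where $\Phi$ is the standard normal CDF. A quick check (a sketch using scipy.stats, not needed elsewhere in this post):

from scipy.stats import norm

for n_sigma in range(1, 4):
    # Probability mass of a standard normal within n sigmas of the mean.
    print(f"exact mass within {n_sigma} sigma: {2 * norm.cdf(n_sigma) - 1:.4f}")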
If you have a sample following a given normal distribution, you can estimate the parameters of that distribution with the following estimators:

$$ \hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2 $$

(Warning: this sample variance is biased for small $N$, see https://en.wikipedia.org/wiki/Normal_distribution#Statistical_inference!)

Let's work through an example.
for n in [10, 100, 1000, 10000]:
    sample = np.random.normal(loc=10., scale=4.5, size=(n,))
    # Sample mean and (biased) sample variance, as in the estimators above.
    mu_hat = 1/sample.size * sample.sum()
    sigma_square_hat = 1/sample.size * np.square(sample - mu_hat).sum()
    print(f"n={n}, true mu: 10., estimated mu: {mu_hat:.2f}, true sigma: 4.5, estimated sigma: {np.sqrt(sigma_square_hat):.2f}")
n=10, true mu: 10., estimated mu: 9.21, true sigma: 4.5, estimated sigma: 3.50
n=100, true mu: 10., estimated mu: 10.02, true sigma: 4.5, estimated sigma: 3.99
n=1000, true mu: 10., estimated mu: 9.85, true sigma: 4.5, estimated sigma: 4.60
n=10000, true mu: 10., estimated mu: 9.99, true sigma: 4.5, estimated sigma: 4.54
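About the bias warning above: dividing by $N - 1$ instead of $N$ (Bessel's correction) makes the variance estimator unbiased, which matters most for small samples. A small sketch comparing the two, averaged over many repeated small samples (the seed and repetition count are arbitrary):

import numpy as np

rng = np.random.default_rng(42)
n, n_repeats = 10, 100_000
samples = rng.normal(loc=10., scale=4.5, size=(n_repeats, n))
# Average variance estimate over many size-10 samples: ddof=0 divides by N
# (biased), ddof=1 divides by N - 1 (unbiased). True variance is 4.5**2 = 20.25.
print(samples.var(axis=1, ddof=0).mean(), samples.var(axis=1, ddof=1).mean())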
Why are these inference formulae useful? Because you often want to describe a distribution by its mean and standard deviation. If your sample is Gaussian distributed, then reporting the standard deviation means that about 68% of the values are expected to fall in the $\mu \pm \sigma$ interval.
In any case, you can always compute the mean and standard deviation, which serve as a summary of your noisy measurements.
A nice shortcut for this in numpy is:
sample.mean(), sample.std()
(9.991956526467522, 4.535559846165549)
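Note that NumPy's std (and var) default to ddof=0, i.e. the biased estimator used above; pass ddof=1 if you want Bessel's correction:

sample.std(ddof=1)  # uses N - 1 in the denominator of the variance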