I've been having trouble with the normal distribution for some time now.
Although I'm quite familiar with its shape, there is some stuff that I don't fully understand.
The goal of this post is to "rediscover" some of its attributes with a couple of computations.
The standard way to write the normal distribution is with this formula:

$$ f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}} $$

Now, for me, trying to memorize this formula raises several questions: where does the normalizing factor in front come from, and what exactly do $\mu$ and $\sigma$ do?
While trying to answer these questions, I've found a couple of points that make understanding (and memorizing) the formula easier.
The normal distribution formula can be thought of as a generalization of the so-called Gaussian integral:
$$ \int_{-\infty}^{\infty} e^{-x^2}\,dx $$

There is a neat trick that involves rewriting this as a polar integral (well explained on the Wikipedia page for the Gaussian integral), which allows computing the exact value of the integral: $\sqrt{\pi}$.
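As a sanity check, we can verify this numerically (a quick sketch using scipy.integrate.quad, which isn't used elsewhere in this post):

import numpy as np
from scipy.integrate import quad

# Integrate e^(-x^2) over the whole real line; quad returns (value, error estimate).
value, abs_err = quad(lambda x: np.exp(-x**2), -np.inf, np.inf)
print(value, np.sqrt(np.pi))  # both should print ~1.7724538509055159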
Once this is known, one can rescale the $x$ inside the integrand. Substituting $x \to x / (\sqrt{2}\,\sigma)$ and applying the change of variables, we obtain:

$$ \int_{-\infty}^{\infty} e^{-\frac{x^2}{2 \sigma^2}}\,dx = \sqrt{2 \pi \sigma^2} $$

Dividing the $x$ by $\sigma$ multiplies the integral by $\sigma$; with the $1/(2\sigma^2)$ scaling actually used in the normal distribution, the integral evaluates to $\sqrt{2\pi\sigma^2}$. This explains the normalizing factor $1/\sqrt{2\pi\sigma^2}$ in front of the exponential.
Finally, translating $x$ by $\mu$ is just another change of variables and doesn't change the value of the integral, so the same normalizing factor works for any mean.
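Both facts are easy to check numerically (a small sketch; the integration range and grid size are arbitrary but wide enough for these sigmas):

import numpy as np

x = np.linspace(-100, 100, num=1_000_000)
for mu, sigma in [(0, 1), (0, 2.5), (3, 2.5)]:
    # Integral of the unnormalized density: should equal sqrt(2 pi sigma^2),
    # regardless of the shift mu.
    integral = np.trapz(np.exp(-(x - mu)**2 / (2 * sigma**2)), x)
    print(mu, sigma, integral, np.sqrt(2 * np.pi * sigma**2))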
So that gives some intuition for the formula. But what about the roles of $\sigma$ and $\mu$? What do they actually do?
Let's do some plots.
import numpy as np
import holoviews as hv
hv.extension('bokeh', logo=False)

def normal(x, mu, sigma):
    """Normal probability density function."""
    return 1/np.sqrt(2 * np.pi * sigma**2) * np.exp(-(x - mu)**2/(2 * sigma**2))

def normal_curve(mu, sigma):
    """Plot the normal density over a fixed x range."""
    x = np.linspace(-5, 5, num=100)
    y = normal(x, mu, sigma)
    return hv.Curve((x, y)).opts(show_grid=True, tools=['hover'])

# One curve per (mu, sigma) combination, laid out as a grid of plots.
mus = [-2, -0.5, 0, 2, 4]
sigmas = [0.3, 0.75, 1.0, 2.5, 4]
curve_dict_2D = {(mu, sigma): normal_curve(mu, sigma) for mu in mus for sigma in sigmas}
gridspace = hv.GridSpace(curve_dict_2D, kdims=['mu', 'sigma'])
gridspace
I think this grid of plots summarizes well what $\mu$ and $\sigma$ do: $\mu$ shifts the curve along the x-axis, while $\sigma$ spreads it out (and lowers the peak, since the total area stays 1).
Of course, one would also like to memorize some interesting facts about how the parameters relate to probabilities.
Two things are interesting here: the value the distribution takes at certain points, as well as the value the integral of the distribution takes over certain intervals. Why these two different things? Because the first describes the shape of the density curve, while the second gives actual probabilities.
A quote from the scipy documentation:
"The function has its peak at the mean, and its 'spread' increases with the standard deviation (the function reaches 0.607 times its maximum at $x + \sigma$ and $x - \sigma$)."
Let's see if we can verify this. At $x = \mu \pm \sigma$, the exponent in the formula becomes $-\sigma^2 / (2\sigma^2) = -1/2$, so the density there should be $e^{-1/2}$ times the peak value:
np.exp(-1/2)
0.6065306597126334
for n_sigma in range(1, 4):
    # The density at mu + n*sigma is exp(-n^2/2) times the density at the peak.
    print(f"value at mu + {n_sigma} sigma: {np.exp(-n_sigma**2/2):.3f} times its maximum")
value at mu + 1 sigma: 0.607 times its maximum
value at mu + 2 sigma: 0.135 times its maximum
value at mu + 3 sigma: 0.011 times its maximum
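We can cross-check this with the normal function defined above; the ratio doesn't depend on the particular mu and sigma chosen here:

mu, sigma = 2, 4.5
for n_sigma in range(1, 4):
    # Ratio of the density at mu + n*sigma to the density at the peak.
    ratio = normal(mu + n_sigma * sigma, mu, sigma) / normal(mu, mu, sigma)
    print(f"ratio at mu + {n_sigma} sigma: {ratio:.3f}")  # 0.607, 0.135, 0.011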
Here, we want to know how much of the probability mass lies between $\mu - n\sigma$ and $\mu + n\sigma$ for 1, 2 or 3 sigmas. Let's do some numerical integration to find out!
mu, sigma = 2, 4.5
for n_sigma in range(1, 4):
    # Integrate the density over [mu - n*sigma, mu + n*sigma] with the trapezoidal rule.
    x = np.linspace(mu - n_sigma * sigma, mu + n_sigma * sigma, num=10000)
    y = normal(x, mu, sigma)
    integral = np.trapz(y, x)
    print(f"integral between mu - {n_sigma} sigma and mu + {n_sigma} sigma: {integral:.4f}")
integral between mu - 1 sigma and mu + 1 sigma: 0.6827
integral between mu - 2 sigma and mu + 2 sigma: 0.9545
integral between mu - 3 sigma and mu + 3 sigma: 0.9973
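These numbers match the closed-form answer: the mass within $n$ sigmas of the mean is $2\Phi(n) - 1$, where $\Phi$ is the standard normal CDF. A quick check (a sketch using scipy.stats, not needed elsewhere in this post):

from scipy.stats import norm

for n_sigma in range(1, 4):
    # Probability mass of a standard normal within n sigmas of the mean.
    print(f"exact mass within {n_sigma} sigma: {2 * norm.cdf(n_sigma) - 1:.4f}")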
If you have a sample following a given normal distribution, you can estimate the parameters of that distribution with the following estimators:

$$ \hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2 $$

(Warning: this sample variance is biased for small $N$, see https://en.wikipedia.org/wiki/Normal_distribution#Statistical_inference!)

Let's work through an example.
for n in [10, 100, 1000, 10000]:
    sample = np.random.normal(loc=10., scale=4.5, size=(n,))
    # Sample mean and (biased) sample variance, as in the estimators above.
    mu_hat = 1/sample.size * sample.sum()
    sigma_square_hat = 1/sample.size * np.square(sample - mu_hat).sum()
    print(f"n={n}, true mu: 10., estimated mu: {mu_hat:.2f}, true sigma: 4.5, estimated sigma: {np.sqrt(sigma_square_hat):.2f}")
n=10, true mu: 10., estimated mu: 9.21, true sigma: 4.5, estimated sigma: 3.50
n=100, true mu: 10., estimated mu: 10.02, true sigma: 4.5, estimated sigma: 3.99
n=1000, true mu: 10., estimated mu: 9.85, true sigma: 4.5, estimated sigma: 4.60
n=10000, true mu: 10., estimated mu: 9.99, true sigma: 4.5, estimated sigma: 4.54
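About the bias warning above: dividing by $N - 1$ instead of $N$ (Bessel's correction) makes the variance estimator unbiased, which matters most for small samples. A small sketch comparing the two, averaged over many repeated small samples (the seed and repetition count are arbitrary):

import numpy as np

rng = np.random.default_rng(42)
n, n_repeats = 10, 100_000
samples = rng.normal(loc=10., scale=4.5, size=(n_repeats, n))
# Average variance estimate over many size-10 samples: ddof=0 divides by N
# (biased), ddof=1 divides by N - 1 (unbiased). True variance is 4.5**2 = 20.25.
print(samples.var(axis=1, ddof=0).mean(), samples.var(axis=1, ddof=1).mean())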
Why are these inference formulae useful? Because you often want to describe a distribution by its mean and standard deviation. If your sample is Gaussian distributed, then reporting the standard deviation means that about 68% of the values are expected to fall in the $\mu \pm \sigma$ interval.
In any case, you can always compute the mean and standard deviation, which serve as a summary of your noisy measurements.
A nice shortcut for this in numpy is:
sample.mean(), sample.std()
(9.991956526467522, 4.535559846165549)
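Note that NumPy's std (and var) default to ddof=0, i.e. the biased estimator used above; pass ddof=1 if you want Bessel's correction:

sample.std(ddof=1)  # uses N - 1 in the denominator of the variance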