Gaussian Probabilities

In [1]:
#format the book
%matplotlib inline
from __future__ import division, print_function
import matplotlib.pyplot as plt
import book_format
book_format.load_style()
Out[1]:

Introduction

The last chapter ended by discussing some of the drawbacks of the Discrete Bayesian filter. For many tracking and filtering problems our desire is to have a filter that is unimodal and continuous. That is, we want to model our system using floating point math (continuous) and to have only one belief represented (unimodal). For example, we want to say an aircraft is at (12.34381, -95.54321,2389.5) where that is latitude, longitude, and altitude. We do not want our filter to tell us "it might be at (1,65,78) or at (34,656,98)" That doesn't match our physical intuition of how the world works, and as we discussed, it is prohibitively expensive to compute.

So we desire a unimodal, continuous way to represent probabilities that models how the real world works, and that is very computationally efficient to calculate. As you might guess from the chapter name, Gaussian distributions provide all of these features.

Before we go into the math, lets just look at a graph of the Gaussian distribution to get a sense of what we are talking about.

In [2]:
from stats import plot_gaussian

plot_gaussian(mean=100, variance=15**2, xlabel='IQ')

Probably this is immediately recognizable to you as a 'bell curve'. This curve is ubiquitous because under real world conditions most observations are distributed in such a manner. In fact, this is the bell curve for IQ (Intelligence Quotient). You've probably seen this before, and understand it. It tells us that the average IQ is 100, and that the number of people that have IQs higher or lower than that drops off as they get further away from 100. It's hard to see the exact number, but we can see that very few people have an IQ over 150 or under 50, but a lot have an IQ of 90 or 110.

This curve is not unique to IQ distributions - a vast amount of natural phenomena exhibits this sort of distribution, including the sensors that we use in filtering problems. As we will see, it also has all the attributes that we are looking for - it represents a unimodal belief or value as a probability, it is continuous, and it is computationally efficient. We will soon discover that it also other desirable qualities that we do not yet recognize we need.

Nomenclature

A bit of nomenclature before we continue - this chart depicts the probability density of of a random variable having any value between ($-\infty..\infty)$. What does that mean? Imagine we take an infinite number of infinitely precise measurements of the speed of automobiles on a section of highway. We could then plot the results by showing the relative number of cars going at any given speed. If the average was 120 kph, it might look like this:

In [3]:
plot_gaussian(mean=120, variance=17**2, xlabel='speed(kph)')

The y-axis depicts the probability density - the relative amount of cars that are going the speed at the corresponding x-axis.

Random variable will be precisely defined later. For now just think of it as a variable that can 'freely' and 'randomly' vary. A dog's position in a hallway, air temperature, and a drone's height above the ground are all random variables. The position of the North Pole is not, nor is a sin wave (a sin wave is anything but 'free').

You may object that human IQs or automobile speeds cannot be less than zero, let alone $-\infty$. This is true, but this is a common limitation of mathematical modeling. "The map is not the territory" is a common expression, and it is true for Bayesian filtering and statistics. The Gaussian distribution above very closely models the distribution of IQ test results, but being a model it is necessarily imperfect. The difference between model and reality will come up again and again in these filters.

You will see these distributions called Gaussian distributions, normal distributions, and bell curves. Bell curve is ambiguous because there are other distributions which also look bell shaped but are not Gaussian distributions, so we will not use it further in this book. But Gaussian and normal both mean the same thing, and are used interchangeably. I will use both throughout this book as different sources will use either term, and so I want you to be used to seeing both. Finally, as in this paragraph, it is typical to shorten the name and just talk about a Gaussian or normal - these are both typical shortcut names for the Gaussian distribution.

Gaussian Distributions

So let us explore how Gaussians work. A Gaussian is a continuous probability distribution that is completely described with two parameters, the mean ($\mu$) and the variance ($\sigma^2$). It is defined as: $$ f(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}{(x-\mu)^2}/\sigma^2 } $$

Don't be dissuaded by the equation if you haven't seen it before; you will not need to memorize or manipulate it. The computation of this function is stored in stats.py with the function gaussian(x, mean, var).

Optional: Let's remind ourselves how to look at a function stored in a file by using the %load magic. If you type %load -s gaussian stats.py into a code cell and then press CTRL-Enter, the notebook will create a new input cell and load the function into it.

%load -s gaussian stats.py

def gaussian(x, mean, var):
    """returns normal distribution for x given a 
    gaussian with the specified mean and variance. 
    """
    return math.exp((-0.5*(x-mean)**2)/var) / \
                     math.sqrt(_two_pi*var)

We will plot a Gaussian with a mean of 22 $(\mu=22)$, with a variance of 4 $(\sigma^2=4)$, and then discuss what this means.

In [4]:
from stats import gaussian, norm_cdf
plot_gaussian(22, 4, mean_line=True, xlabel='$^{\circ}C$')

So what does this curve mean? Assume for a moment that we have a thermometer, which reads 22$\,^{\circ}C$. No thermometer is perfectly accurate, and so we normally expect that thermometer will read $\pm$ that temperature by some amount each time we read it. Furthermore, a theorem called Central Limit Theorem states that if we make many measurements that the measurements will be normally distributed. So, when we look at this chart we can sort of think of it as representing the probability of the thermometer reading a particular value given the actual temperature of 22$^{\circ}C$. However, that is not quite accurate mathematically.

Recall that we said that the distribution is continuous. Think of an infinitly long straight line - what is the probability that a point you pick randomly is at, say, 2.0. Clearly 0%, as there is an infinite number of choices to choose from. The same is true for normal distributions; in the graph above the probability of being exactly 22$^{\circ}C$ is 0% because there are an infinite number of values the reading can take.

So what then is this curve? It is something we call the probability density function. Later we will delve into this in greater detail; for now just understand that the area under the curve at any region gives you the probability of those values. So, for example, if you compute the area under the curve between 20 and 22 the result will be the probability of the temperature reading being between those two temperatures.

So how do you compute the probability, or area under the curve? Well, you integrate the equation for the Gaussian

$$ \int^{x_1}_{x_0} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}{(x-\mu)^2}/\sigma^2 } dx$$

I wrote stats.norm_cdf which computes the integral for you. So, for example, we can compute

In [5]:
print('Probability of value in range 21.5 to 22.5 is {:.2f}%'.format(
      norm_cdf((21.5, 22.5), 22,4)*100))
print('Probability of value in range 23.5 to 24.5 is {:.2f}%'.format(
      norm_cdf((23.5, 24.5), 22,4)*100))
Probability of value in range 21.5 to 22.5 is 19.74%
Probability of value in range 23.5 to 24.5 is 12.10%

So the mean ($\mu$) is what it sounds like - the average of all possible probabilities. Because of the symmetric shape of the curve it is also the tallest part of the curve. The thermometer reads $22^{\circ}C$, so that is what we used for the mean.

Important: I will repeat what I wrote at the top of this section: "A Gaussian...is completely described with two parameters"

The standard notation for a normal distribution for a random variable $X$ is $X \sim\ \mathcal{N}(\mu,\sigma^2)$. This means I can express the temperature reading of our thermometer as

$$temp = \mathcal{N}(22,4)$$

This is an extremely important result. Gaussians allow me to capture an infinite number of possible values with only two numbers! With the values $\mu=22$ and $\sigma^2=4$ I can compute the distribution of measurements for over any range.

The Variance

Since this is a probability density distribution it is required that the area under the curve always equals one. This should be intuitively clear - the area under the curve represents all possible occurences, which must sum to one. We can prove this ourselves with a bit of code. (If you are mathematically inclined, integrate the Gaussian equation from $-\infty$ to $\infty$)

In [6]:
print(norm_cdf((-1e8, 1e8), mu=0, var=4))
1.0

This leads to an important insight. If the variance is small the curve will be narrow. this is because the variance is a measure of how much the samples vary from the mean. To keep the area equal to 1, the curve must also be tall. On the other hand if the variance is large the curve will be wide, and thus it will also have to be short to make the area equal to 1.

Let's look at that graphically:

In [7]:
import numpy as np
import matplotlib.pyplot as plt
    
xs = np.arange(15, 30, 0.05)
plt.plot(xs, [gaussian(x, 23, .2) for x in xs], label='var=0.2')
plt.plot(xs, [gaussian(x, 23, 1) for x in xs], label='var=1')
plt.plot(xs, [gaussian(x, 23, 5) for x in xs], label='var=5')
plt.legend()
plt.show()

So what is this telling us? The blue gaussian is very narrow. It is saying that we believe $x=23$, and that we are very sure about that. In contrast, the red gaussian also believes that $x=23$, but we are much less sure about that. Our believe that $x=23$ is lower, and so our belief about the likely possible values for $x$ is spread out - we think it is quite likely that $x=20$ or $x=26$, for example. The blue gaussian has almost completely eliminated $22$ or $24$ as possible value - their probability is almost $0\%$, whereas the red curve considers them nearly as likely as $23$.

If we think back to the thermometer, we can consider these three curves as representing the readings from three different thermometers. The blue curve represents a very accurate thermometer, and the red one represents a fairly inaccurate one. Green of course represents one in between the two others. Note the very powerful property the Gaussian distribution affords us - we can entirely represent both the reading and the error of a thermometer with only two numbers - the mean and the variance.

The standard notation for a normal distribution for a random variable $X$ is just $X \sim\ \mathcal{N}(\mu,\sigma^2)$ where $\mu$ is the mean and $\sigma^2$ is the variance. It may seem odd to use $\sigma$ squared - why not just $\sigma$? We will not go into great detail about the math at this point, but in statistics $\sigma$ is the standard deviation of a normal distribution. Variance is defined as the square of the standard deviation, hence $\sigma^2$.

It is worth spending a few words on standard deviation now. The standard deviation is a measure of how much variation from the mean exists. For Gaussian distributions, 68% of all the data falls within one standard deviation($1\sigma$) of the mean, 95% falls within two standard deviations ($2\sigma$), and 99.7% within three ($3\sigma$). This is often called the 68-95-99.7 rule. So if you were told that the average test score in a class was 71 with a standard deviation of 9.4, you could conclude that 95% of the students received a score between 52.2 and 89.8 if the distribution is normal (that is calculated with $71 \pm (2 * 9.4)$).

The following graph depicts the relationship between the standard deviation and the normal distribution.

In [8]:
from gaussian_internal import display_stddev_plot
display_stddev_plot()
plt.show()

An equivalent formation for a Gaussian is $\mathcal{N}(\mu,1/\tau)$ where $\mu$ is the mean and $tau$ the precision. Here $1/\tau = \sigma^2$; it is the reciprocal of the variance. While we do not use this formulation in this book, it underscores that the variance is a measure of how precise our data is. A small variance yields large precision - our measurement is very precise. Conversely, a large variance yields low precision - our belief is spread out across a large area. You should become comfortable with thinking about Gaussians in these equivalent forms. Gaussians reflect our belief about a measurement, they express the precision of the measurement, and they express how much variance there is in the measurements. These are all different ways of stating the same fact.</div>

Interactive Gaussians

For those that are reading this in IPython Notebook, here is an interactive version of the Gaussian plots. Use the sliders to modify $\mu$ and $\sigma^2$. Adjusting $\mu$ will move the graph to the left and right because you are adjusting the mean, and adjusting $\sigma^2$ will make the bell curve thicker and thinner.

In [9]:
import math
from IPython.html.widgets import interact, interactive, fixed
import IPython.html.widgets as widgets

def plt_g(mu,variance):
    xs = np.arange(2, 8, 0.1)
    ys = [gaussian (x, mu, variance) for x in xs]
    plt.plot(xs, ys)
    plt.ylim((0, 1))
    plt.show()

interact (plt_g, mu=(0, 10), 
          variance=widgets.FloatSliderWidget(value=0.6, min=0.2, max=4.5))
Out[9]:
<function __main__.plt_g>

Finally, if you are reading this in an IPython Notebook, here is an animation of a Gaussian. First, the mean is being shifted to the right. Then the mean is centered at $\mu=5$ and the variance is modified.

Computational Properties of the Gaussian

Recall how our discrete Bayesian filter worked. We had a vector implemented as a numpy array representing our belief at a certain moment in time. When we performed another measurement using the update() function we had to multiply probabilities together, and when we performed the motion step using the predict() function we had to shift and add probabilities. I've promised you that the Kalman filter uses essentially the same process, and that it uses Gaussians instead of histograms, so you might reasonable expect that we will be multiplying, adding, and shifting Gaussians in the Kalman filter.

A typical textbook would directly launch into a multipage proof of the behavior of Gaussians under these operations, but I don't see the value in that right now. I think the math will be much more intuitive and clear if we just start developing a Kalman filter using Gaussians. I will provide the equations for multiplying and shifting Gaussians at the appropriate time. You will then be able to develop a physical intuition for what these operations do, rather than be forced to digest a lot of fairly abstract math.

The key point, which I will only assert for now, is that all the operations are very simple, and that they preserve the properties of the Gaussian. This is somewhat remarkable, in that the Gaussian is a nonlinear function, and typically if you multiply a nonlinear equation with itself you end up with a different equation. For example, the shape of sin(x)sin(x) is very different from sin(x). But the result of multiplying two Gaussians is yet another Gaussian. This is a fundamental property, and a key reason why Kalman filters are possible.

Computing Probabilities with scipy.stats

In this chapter I have used by custom written code for computing Gaussians, plotting, and so on. I chose to do that to give you a chance to look at the code and see how these functions are implemented. However, Python comes with "batteries included" as the saying goes, and it comes with a wide range of statistics functions in the module scipy.stats. I find the performance of some of the functions rather slow (the scipy.stats documentation contains a warning to this effect), but this is offset by the fact that this is standard code available to everyone, and it is well tested. So let's walk through how to use scipy.stats to compute various things.

The scipy.stats module contains a number of objects which you can use to compute attributes of various probability distributions. The full documentation for this module is here: http://http://docs.scipy.org/doc/scipy/reference/stats.html. However, we will focus on the norm variable, which implements the normal distribution. Let's look at some code that uses scipy.stats.norm to compute a Gaussian, and compare its value to the value returned by the gaussian() function.

In [10]:
from scipy.stats import norm
print(norm(2, 3).pdf(1.5))
print(gaussian(x=1.5, mean=2, var=3*3))
0.131146572034
0.13114657203397997

The call norm(2,3) creates what scipy calls a 'frozen' distribution - it creates and returns an object with a mean of 2 and a standard deviation of 3. You can then use this object multiple times to get the probability density of various values, like so:

In [11]:
n23 = norm(2,3)
print('probability density of 1.5 is %.4f' % n23.pdf(1.5))
print('probability density of 2.5 is also %.4f' % n23.pdf(2.5))
print('whereas probability density of 2 is %.4f' % n23.pdf(2))
probability density of 1.5 is 0.1311
probability density of 2.5 is also 0.1311
whereas probability density of 2 is 0.1330

If we look at the documentation for scipy.stats.norm here[1] we see that there are many other functions that norm provides.

For example, we can generate $n$ samples from the distribution with the rvs() function.

In [12]:
print(n23.rvs(size=15))
[ 5.21312379  4.07998256 -1.88727855  1.8810597   5.83528749  3.17323975
  3.61095259  6.35085461  0.85512558  7.86604454  1.68210737  4.50882311
  6.25585727  7.3637816   2.37546818]

We can get the cumulative distribution function (CDF), which is the probability that a randomly drawn value from the distribution is less than or equal to $x$.

In [13]:
# probability that a random value is less than the mean 2
print(n23.cdf(2))
0.5

We can get various properties of the distribution:

In [14]:
print('variance is', n23.var())
print('standard deviation is', n23.std())
print('mean is', n23.mean())
variance is 9.0
standard deviation is 3.0
mean is 2.0

There are many other functions available, and if you are interested I urge you to peruse the documentation. I find the documentation to be excessively terse, but with a bit of googling you can find out what a function does and some examples of how to use it. Most of this functionality is not of immediate interest to the book, so I will leave the topic in your hands to explore. The SciPy tutorial [2] is quite approachable, and I suggest starting there.

Fat Tails

Earlier I spoke very briefly about the central limit theorem, which states that under certain conditions the arithmetic sum of any independent random variables will be normally distributed, regardless of how the random variables are distributed. This is extremely important for (at least) two reasons. First, nature is full of distributions which are not normal, but when we apply the central limit theorem over large populations we end up with normal distributions. Second, Gaussians are mathematically tractable. We will see this more as we develop the Kalman filter theory, but there are very nice closed form solutions for operations on Gaussians that allow us to use them analytically.

However, a key part of the proof is "under certain conditions". These conditions often do not hold for the physical world. The resulting distributions are called fat tailed. Let's consider a trivial example. We think of things like test scores as being normally distributed. If you have ever had a professor 'grade on a curve' you have been subject to this assumption. But of course test scores cannot follow a normal distribution. This is because the distribution assigns a nonzero probability distribution for any value, no matter how far from the mean. So, for example, say your mean is 90 and the standard deviation is 13. The normal distribution assumes that there is a large chance of somebody getting a 90, and a small chance of somebody getting a 40. However, it also assumes that there is a tiny chance of somebody getting a -10, or of 150. It assumes that there is a infitesimal chance of getting a score of -1e300, or 4e50. In other words, the tails of the distribution are infinite.

But for a test we know this is not true. You cannot get less than 0, or more than 100. Let's plot this using a normal distribution.

In [15]:
xs = np.arange(10,100, 0.05)
plt.plot(xs, [gaussian(x, 90, 13) for x in xs], label='var=0.2')
plt.xlim((60,120))
plt.show()

So the problem with this chart is that the area under the drawn curve cannot equal 1, so it is not a probability distribution. What actually happens is that a few more students get scores nearer the upper end of the range (for example), and that tail becomes 'fat'. Also, the test is probably not able to perfectly distinguish incredibly minute differences in skill in the students, so the distribution to the left of the mean is also probably a bit bunched up in places. The resulting distribution is called a fat tail distribution.

We are not going to be applying Kalman filters to grades of student papers, but we are using physical sensors to measure the world. The errors in sensor's measurements are rarely truly Gaussian. And, of course, the signal itself will have time dependent correlations. For example, consider a sensor mounted on a helicopter. The rotating blades cause vibrations in the air frame, which will cause small deviations in the flight path. This deviation will be related to the frequency of the vibration, hence the values at t and t+1 are not independent. The vibration will also effect sensors such as the INS. No physical object can react instantly to changes, so the INS will somewhat lag the vibrations in a perhaps quite complex way, and so the noise introduced by the vibration will be anything but independent.

It is far too early to be talking about the difficulties that this presents to the Kalman filter designer. It is worth keeping in the back of your mind the fact that the Kalman filter math is based on a somewhat idealized model of the world. For now I will present a bit of code that I will be using later in the book to form fat tail distributions to simulate various processes and sensors. This distribution is called the student's t distribution.

Let's say I want to model a sensor that has some noise in the output. For simplicity, let's say the signal is a constant 10, and the standard deviation of the noise is 2. We can use the function numpy.random.randn() to get a random number with a mean of 0 and a standard deviation of 1. So I could simulate this sensor with

In [16]:
from numpy.random import randn

def sense():
    return 10 + randn()*2

Let's plot that signal and see what it looks like.

In [17]:
zs = [sense() for i in range(5000)]
plt.plot(zs, lw=1)
plt.show()

That looks like I would expect. The signal is centered around 10. A standard deviation of 2 means that 68% of the measurements will be within $\pm$ 2 of 10, and 99% will be within $\pm$ 6 of 10, and that looks like what is happenning.

Now let's look at a fat tailed distribution. There are many choices, I will use the Student's T distribution. I will not go into the math, but just give you the source code for it and then plot a distribution using it.

In [18]:
import random
import math

def rand_student_t(df, mu=0, std=1):
    """return random number distributed by student's t distribution with
    `df` degrees of freedom with the specified mean and standard deviation.
    """
    x = random.gauss(0, std)
    y = 2.0*random.gammavariate(0.5*df, 2.0)
    return x / (math.sqrt(y/df)) + mu  
In [19]:
def sense_t():
    return 10 + rand_student_t(7)*2

zs = [sense_t() for i in range(5000)]
plt.plot(zs, lw=1)
plt.show()

We can see from the plot that while the output is similar to the normal distribution there are outliers that go far more than 3 standard deviations from the mean. This is what causes the 'fat tail'.

It is unlikely that the Student's T distribution is an accurate model of how your sensor (say, a GPS or doppler) performs, and this is not a book on how to model physical systems. However, it does produce reasonable data to test your filter's performance when presented with real world noise. We will be using distributions like these throughout the rest of the book in our simulations and tests.

This is not an idle concern. The Kalman filter equations assume the noise is normally distributed, and perform sub-optimally if this is not true. Designers for mission critical filters, such as the filters on spacecraft, need to master a lot of theory and emperical knowledge about the performance of the sensors on their spacecraft.

The code for rand_student_t is included in filterpy.common. You may use it with

from filterpy.common import rand_student_t

Summary and Key Points

The following points must be understood by you before we continue:

  • Normal distributions occur throughout nature
  • They express a continuous probability distribution
  • They are completely described by two parameters: the mean ($\mu$) and variance ($\sigma^2$)
  • $\mu$ is the average of all possible values
  • The variance $\sigma^2$ represents how much our measurements vary from the mean
  • The standard deviation ($\sigma$) is the square root of the variance ($\sigma^2$)

References