Daniela Huppenkothen and Fernando Perez
This notebook is intended for the newcomer to scientific inference in the classical (frequentist) statistics framework. Below are a number of simple explanations (which by no means replace a good textbook; see the literature suggestions at the end), interspersed with bits of (interactive) python code aimed at illustrating key concepts on the one hand, and at letting the user become familiar with simple statistical computations in python on the other.
Some familiarity with python will be helpful for understanding the code below, but much of it, especially the interactive sections, can be understood without it. Introductions to python abound on the internet; see the list of literature at the end for a few suggestions.
Much of this ipython notebook is built on numpy and scipy, which are crucial to modern statistical analysis and must be installed. For the interactive elements to work, ipython 2.0 or newer is required. Plotting requires matplotlib, and some of the plots are built with the package seaborn. The probability section below additionally uses the third-party html package to build HTML tables of probabilities; install it if you want that interactive figure to render.
Except for the additional plotting functionality (and prettiness!) provided by seaborn, we have attempted to keep dependencies low, and show that many statistical analyses require not much more than numpy and scipy. At the end of the notebook, we also provide a list of potentially useful, more advanced packages for statistical analyses.
## If equations do not render properly, uncomment and run this once to install MathJax locally:
#from IPython.external import mathjax; mathjax.install_mathjax()
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats
import scipy.optimize
import seaborn as sns
sns.set_style("whitegrid")
sns.set_style("ticks")
A statistic is nothing more than a way to encode the information contained in a data set such that it is useful for the question we want to answer. When we talk about statistics, what we really mean is that we wish to find a way to reduce our data (an image, a light curve, ...) to one (or more) numbers that summarise the information we are interested in. In turn, based on this encoded information, we can try to make a decision about the question at hand: this is the basis of scientific inference.
In what follows, we are mainly concerned with classical statistics, as opposed to another approach to scientific inference called Bayesian statistics. The difference between the two lies mainly in the interpretation of what the word probability truly means. To a classical (also frequentist) statistician, probability describes the long-run frequency of an outcome given infinitely many trials. This definition is born out of early games of chance: if I toss a coin 100 times, I can count the number of favourable outcomes (say, heads) and define the probability as the frequency of that outcome compared to all coin tosses: $$P = \frac{\text{number of favourable outcomes}}{\text{total number of trials}}.$$
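As a quick illustration of this frequency definition, the short simulation below (a toy example of ours; the numbers of tosses are arbitrary) flips a fair coin ever more often and prints the observed frequency of heads, which approaches 1/2 as the number of trials grows.
## simulate coin tosses: 1 = heads, 0 = tails; the estimated probability of heads
## is the number of favourable outcomes divided by the total number of trials
for ntrials in [10, 100, 1000, 100000]:
    tosses = np.random.randint(0, 2, size=ntrials)
    print("%6i tosses: frequency of heads = %.4f" % (ntrials, np.mean(tosses)))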
In contrast, for a Bayesian statistician, probability encodes an uncertainty about the value of a quantity. Unlike in classical statistics, where probability is by definition restricted to random processes (such as coin tosses, games of cards, etc.), this allows one to assign probabilities to quantities that are not inherently random. This is an important distinction that accounts for much of the difference between the two approaches.
Below we limit ourselves to classical statistics, but perhaps this tutorial could be extended in the future to include the alternative approaches as well.
Any data set can be represented very simply with just a few numbers: the mean, median and variance are among the most popular ones, but the "best" representation depends on the problem at hand. Most of these are very easy to compute in python using numpy. In general, it is preferable to use the numpy implementations of standard functions rather than writing your own: they usually call underlying compiled C code and will thus be much faster than any pure python implementation.
The mean is the average of the sample: it is calculated by summing up all elements in the sample, and then dividing by the total number of elements. In contrast, the median encodes the value that cuts the sample in half: 50% of values fall below this number, the other 50% above it. In many cases the median may be preferable as a measure of average properties of a sample, because it is less sensitive to strong outliers.
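Before the interactive figure, here is a minimal numerical sketch of this robustness (the sample and the outlier value below are arbitrary choices of ours):
## a small sample plus one extreme outlier
sample_with_outlier = np.array([9.8, 10.1, 10.3, 9.9, 10.0, 500.0])
print("mean   = %.2f" % np.mean(sample_with_outlier))   ## dragged far away by the outlier
print("median = %.2f" % np.median(sample_with_outlier)) ## stays close to the bulk of the sample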
The code and the figure below illustrate this. The figure shows a two-dimensional distribution of two quantities (here $a$ and $b$, but they could be anything you measure!), with a single outlier (in black). We also show the mean and median of the sample. The slider bars allow you to change the number of samples and the value of the outlier (i.e. the distance from the sample).
def outlier_test(samples,outlier):
xsamples = list(np.random.normal(10,10, size=samples))
ysamples = list(np.random.normal(10,10, size=samples))
s = samples
o = outlier
xsamples.append(o)
ysamples.append(o)
#print(xsamples)
mn = np.mean(xsamples)
md = np.median(xsamples)
fig, ax = plt.subplots(1, 1, figsize=(6,6))
plt.subplots_adjust(top=0.9, bottom=0.1, left=0.06, right=0.97, wspace=0.12, hspace=0.1)
cmap = sns.cubehelix_palette(start=0, light=1, as_cmap=True)
ax.hexbin(xsamples, ysamples, cmap=cmap, gridsize=50)
#ax.scatter(xsample_small, ysample_small, s=20)
ax.scatter(o,o, s=100, marker="h", color=sns.cubehelix_palette()[5])
#ax.scatter(a_true, b_true, s=200)
ax.set_xlim(0, o+10)
ax.set_ylim(0, o+10)
ax.set_xlabel("a", fontsize=30)
ax.set_ylabel("b", fontsize=30)
#ax.legend(["mean = " + str(mn)])
ax.text(0.1*o, 0.9*o+10, "mean = %.4f"%mn, fontsize=20)
ax.text(0.1*o, 0.8*o+10, "median = %.4f"%md, fontsize=20)
return
from IPython.html.widgets import interact
interact(outlier_test, samples=(10, 10000, 10), outlier=(10, 1000, 10));
In contrast to the mean and the median, the standard deviation (and, relatedly, the variance) does not measure the average properties of a sample, but rather the scatter around this average. Below is another figure of a sample drawn from a normal distribution; the slider allows you to change the standard deviation of the distribution that produced the sample. We also show the variance and standard deviation of the sample.
def standarddeviation(std):
xsamples = list(np.random.normal(10,std, size=1000))
ysamples = list(np.random.normal(10,std, size=1000))
vr = np.var(xsamples)
st = np.std(xsamples)
fig, ax = plt.subplots(1, 1, figsize=(6,6))
plt.subplots_adjust(top=0.9, bottom=0.1, left=0.06, right=0.97, wspace=0.12, hspace=0.1)
cmap = sns.cubehelix_palette(start=0, light=1, as_cmap=True)
ax.hexbin(xsamples, ysamples, cmap=cmap, gridsize=50)
ax.set_xlim(-1000, 1000)
ax.set_ylim(-1000, 1000)
ax.set_xlabel("a", fontsize=30)
ax.set_ylabel("b", fontsize=30)
    ax.text(-900, 800, "standard deviation = %.4f"%st, fontsize=20)
    ax.text(-900, 650, "variance = %.4f"%vr, fontsize=20)
return
from IPython.html.widgets import interact
interact(standarddeviation, std=(10, 500, 10));
Also popular are quantiles. To compute these, put all data points in order (from smallest to largest), and then find the value below which $q\%$ of your total sample lies. For example, order your data points, then find the data point below which $25\%$ of all the data points you have collected fall: this data point is the $25\%$ quantile.
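Numpy's percentile function gives a quick way to compute these; a minimal sketch (the sample below is drawn just for this illustration; the mquantiles function from scipy used further down gives essentially the same numbers):
q_sample = np.random.normal(2, 2, size=10000)
print("25%% quantile: %.4f" % np.percentile(q_sample, 25))
print("50%% quantile: %.4f" % np.percentile(q_sample, 50))
print("75%% quantile: %.4f" % np.percentile(q_sample, 75))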
The code below draws 10000 numbers from a normal distribution (with a mean of 2 and a standard deviation of 2), from a Chi-square distribution and from a Poisson distribution (see further below for details about these distributions; the sliders set the means of the latter two). It then computes the mean (blue vertical line), median (red vertical line), variance, standard deviation and quantiles (yellow: $25\%$, purple: $50\%$, red: $75\%$) for each sample, prints the results to the screen below and plots all three distributions with their respective properties. Note that all three quantile regions extend from the leftmost edge of each plot, but only for the yellow ($25\%$) region is this fully visible, because the others are partially covered.
## import quantiles from scipy
from scipy.stats.mstats import mquantiles as quantiles
## a popular choice are the 0.25, 0.5 and 0.75 quantiles
## create s random normally distributed numbers with a mean of 2 and a standard deviation of 2
s = 10000
def statistics(chisquare_mean, poisson_mean):
samples = [np.random.normal(2,2, size=s), np.random.chisquare(chisquare_mean, size=s), np.random.poisson(poisson_mean, size=s)]
dists = ["Normal", "Chi-square", "Poisson"]
fig, ax_all = plt.subplots(1, 3, figsize=(16,6))
for i,(samp, dist) in enumerate(zip(samples, dists)):
mn = np.mean(samp)
md = np.median(samp)
vr = np.var(samp)
st = np.std(samp)
qn = quantiles(samp, [0.25, 0.5, 0.75])
if dist == "Poisson":
h1, b1, p1 = ax_all[i].hist(samp, bins=25, range=[0,25],color="black", facecolor="white",
histtype="step", lw=1, zorder=1.)
else:
h1, b1, p1 = ax_all[i].hist(samp, bins=50, color="black", facecolor="white",
histtype="step", lw=1, zorder=1.)
ax_all[i].vlines(mn, 0, np.max(h1), lw=3, color=sns.color_palette("Paired")[1], label="mean")
ax_all[i].vlines(md, 0, np.max(h1), lw=3, color=sns.color_palette("Paired")[5], label="median")
#colours = sns.cubehelix_palette(len(qn), start=2.5, rot=.75)
for c,q in zip(sns.color_palette()[2:len(qn)+2],reversed(qn)):
b_ind = b1.searchsorted(q)
ax_all[i].fill_between(b1[:b_ind]+0.5*(b1[1]-b1[0]), 0, h1[:b_ind], color=c,
alpha=1.0, label="%i quantile"%(q*100.))
ax_all[i].legend(loc="upper right")
ax_all[i].set_ylim(0.0, None)
## print results
print("The mean of the sample is: " + str(mn))
print("The median of the sample is: " + str(md))
print("The variance of the sample is: " + str(vr))
print("The standard deviation of the sample is: " + str(st))
print("The 25% quantile of the sample is: " + str(qn[0]))
print("The 50% quantile of the sample is: " + str(qn[1]))
print("The 75% quantile of the sample is: " + str(qn[2]))
return
interact(statistics, chisquare_mean=(1, 20), poisson_mean=(1, 20));
The mean of the sample is: 10.0535
The median of the sample is: 10.0
The variance of the sample is: 10.13263775
The standard deviation of the sample is: 3.18318044572
The 25% quantile of the sample is: 8.0
The 50% quantile of the sample is: 10.0
The 75% quantile of the sample is: 12.0
To summarise, below is, in a concise form, how to compute simple representations of a data set in python.
## create s random normally distributed numbers with a mean of 2 and a standard deviation of 2
s = 10000
samples = np.random.normal(2,2, size=s)
mn = np.mean(samples)
md = np.median(samples)
vr = np.var(samples)
st = np.std(samples)
qn = quantiles(samples, [0.25, 0.5, 0.75])
print("The mean of the sample is: " + str(mn))
print("The median of the sample is: " + str(md))
print("The variance of the sample is: " + str(vr))
print("The standard deviation of the sample is: " + str(st))
print("The 25% quantile of the sample is: " + str(qn[0]))
print("The 50% quantile of the sample is: " + str(qn[1]))
print("The 75% quantile of the sample is: " + str(qn[2]))
The mean of the sample is: 1.97793262815
The median of the sample is: 1.96617546827
The variance of the sample is: 4.00591529855
The standard deviation of the sample is: 2.00147827831
The 25% quantile of the sample is: 0.655412448749
The 50% quantile of the sample is: 1.96617546827
The 75% quantile of the sample is: 3.3186827989
Measurements taken with any kind of scientific instrument are never perfect. Images observed with a telescope, for example, will be affected by seeing (random jitter in the atmosphere), man-made light sources, the CCD of the camera recording the incoming light, and other effects. All of these add uncertainty to whatever quantity is measured from the data set. In this context, we have to distinguish two types of uncertainty:
On the one hand, your data may be affected by random fluctuations in the atmosphere, your apparatus, or some other source. This leads to a scatter around the mean value we would have measured if these fluctuations didn't exist. This type of uncertainty is often referred to as precision: the larger the scatter introduced by the measurement process (although sometimes these fluctuations can be due to a physical process as well!), the harder it will be to measure the mean quantity, especially for small numbers of samples. Often it is assumed that this type of uncertainty follows a normal distribution (which has nice properties that make the analysis easier), although this need not be true!
The other type of uncertainty comes from a systematic offset in your measurement. Perhaps the calibration of the instrument is wrong, such that all the stars in your image are systematically brighter than you'd expect them to be based on previous measurements. The term accuracy describes how strong these systematic effects are in a data set: the larger the offset is between a measured quantity and its true value, the less accurate the measurement is. Characterising and correcting for the systematic effects in a data set is a very important step in data analysis, and should be taken very seriously; make sure your data set is as accurate as can be!
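A minimal numerical sketch of the two concepts (the true value, offset and scatter below are arbitrary choices of ours): we simulate measurements with a systematic offset and random scatter, and compare the sample mean and standard deviation to the true value.
true_value = 1.0
systematic_offset = 0.3   ## hypothetical calibration error -> limits accuracy
random_scatter = 0.1      ## hypothetical measurement noise -> limits precision
measurements = np.random.normal(true_value + systematic_offset, random_scatter, size=1000)
print("bias (mean - true value): %.3f" % (np.mean(measurements) - true_value))
print("scatter (standard deviation): %.3f" % np.std(measurements))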
The plot below demonstrates the principles of accuracy and precision defined above. Imagine you have two quantities, $a$ and $b$, whose true values are $a=1$ and $b=2$. We collect data with a given accuracy (which encodes your systematic bias, for example due to an incomplete model or an error in the instrument that measured the data) and precision (which encodes the random error on the data, perhaps due to measurement uncertainty or due to the source itself).
The left panel shows a sample of measured data points with a specified accuracy and precision, scattered in parameter space together with the true value. On the right side, we show a histogram of quantity $a$ only, with the true value of $a$ as a black vertical line superimposed. The horizontal arrow to the left shows the distance between the sample mean and the true value of $a$, i.e. the accuracy. The horizontal arrow towards the right is a measure of precision.
import pandas as pd
from matplotlib.patches import FancyArrow
## TODO: put in some labels and arrows for accuracy and precision
## TODO: Add sliders
def plot_ap(accuracy, precision):
# Hardcoded values good for illustration
a_true = 1.0
b_true = 2.0
xlim = (-2, 4)
ylim = (-1, 5)
# So users get an intuitive idea of '0==low', '1==high'
accuracy, precision = np.clip(1-np.array([accuracy, precision]), 0.001, 1)
xsample = np.random.normal(a_true+2*accuracy, precision**2.0, size=10000)
xsample_small = xsample[:5000]
xmean = np.mean(xsample_small)
xstd = np.std(xsample_small)
ysample = np.random.normal(b_true+2*accuracy, precision**2.0, size=10000)
ysample_small = ysample[:5000]
ymean = np.mean(ysample_small)
d = pd.DataFrame(zip(xsample_small, ysample_small), columns=["x", "y"])
fig, (ax, ax2) = plt.subplots(1, 2, figsize=(12,6))
plt.subplots_adjust(top=0.9, bottom=0.1, left=0.06, right=0.97, wspace=0.2, hspace=0.1)
cmap = sns.cubehelix_palette(start=0, light=1, as_cmap=True)
ax.hexbin(xsample_small,ysample_small, cmap=cmap, gridsize=50)
#ax.scatter(xsample_small, ysample_small, s=20)
ax.scatter(a_true, b_true, s=200, marker="h", color=sns.cubehelix_palette()[5])
#ax.scatter(a_true, b_true, s=200)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
ax.set_xlabel("a", fontsize=30)
ax.set_ylabel("b", fontsize=30)
h1, b1, p1 = plt.hist(xsample, bins=100, histtype="stepfilled", color=cmap.colors[150], alpha=1.0, normed=True)
hmax = np.max(h1)
plt.vlines(a_true, 0.0, np.max(h1), lw=6)
ar1 = FancyArrow(xmean,hmax*0.1, a_true-xmean,0.0, length_includes_head=True,
color='k', head_width=0.03*hmax, head_length=0.2, width=0.001, linewidth=3)
ar2 = FancyArrow(xmean,hmax*0.5, xstd,0.0, length_includes_head=True,
color='k', head_width=0.03*hmax, head_length=0.2, width=0.001, linewidth=3)
# Add the arrow to the axes.
ax2 = plt.gca()
ax2.add_artist(ar1)
ax2.add_artist(ar2)
ax2.set_xlim(xlim)
ax2.set_ylim([0.0, hmax*1.1])
ax2.set_xlabel("a", fontsize=24)
ax2.set_ylabel("p(a)", fontsize=24)
return
#accuracy = 0.5
#precision = 0.5
#plot_ap(accuracy, precision)
interact(plot_ap, accuracy=(0, 1.0, 0.1), precision=(0, 1.0, 0.1));
Probability theory provides the basis for manipulating probabilities. It has its roots in measure theory, but only a few key concepts are required to understand the basics of probability calculus. Note that these concepts are independent of your preferred interpretation of probability; they are universally valid. Below, we outline the basic ideas for how to think about probability, and provide an interactive figure that illustrates many of these concepts.
A conditional probability expresses the probability of an event occurring given that another event has already occurred. For example, one may want to compute the probability of rain in the next hour given that it is cloudy, which should differ from the probability of rain in the next hour given that it is sunny. Formally, this is expressed as
$$P(A|B).$$
The conditional probability of an event $A$ given another event $B$ is defined as the probability of the intersection of $A$ and $B$, divided by the probability of $B$: $$P(A|B) = \frac{P(A \cap B)}{P(B)}.$$ Imagine a space of all possible outcomes (say, rain, wind, hail, sun, thunderstorms, ...). Both $P(A)$ (e.g. the probability of rain, where rain could comprise a number of possible outcomes, such as drizzle, sheeting rain, etc.) and $P(B)$ (e.g. the probability of thunderstorms) occupy a subset of this space. We now wish to compute $P(A|B)$, the probability that it is raining given that there is a thunderstorm. For this, we need to find all possible outcomes that obey both restrictions: this is the intersection of $A$ and $B$. However, the probability of the intersection is computed with respect to the entire space of possible outcomes, and we already know that most of this space is irrelevant: since there is a thunderstorm, sunny weather is no longer in our space of possible outcomes. Thus, the probability of the intersection must be renormalised to the relevant space of outcomes, i.e. thunderstorms only. This is $P(B)$.
Note that the probability of $A$ given $B$ is not the same as the probability of $B$ given $A$: $P(A|B) \neq P(B|A)$! This can easily be understood: the probability that it is raining given that it is cloudy is not the same as the probability that it is cloudy given that it is raining. It can be cloudy without raining ($P(A|B) < 1$); however, as soon as we know that it is raining, we also know it must be cloudy ($P(B|A) = 1$)! Thus, the two are not the same.
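A minimal simulation of this asymmetry (the probabilities used to generate the weather below are invented for illustration): we draw many days, decide for each whether it is cloudy, let rain occur only on cloudy days, and then estimate both conditional probabilities from the simulated frequencies.
ndays = 100000
cloudy = np.random.uniform(size=ndays) < 0.4          ## 40% of days are cloudy (made-up number)
rain = cloudy & (np.random.uniform(size=ndays) < 0.5) ## on cloudy days, it rains half the time
p_rain_given_cloudy = np.sum(rain & cloudy)/float(np.sum(cloudy))
p_cloudy_given_rain = np.sum(rain & cloudy)/float(np.sum(rain))
print("P(rain | cloudy) = %.3f" % p_rain_given_cloudy)  ## close to 0.5
print("P(cloudy | rain) = %.3f" % p_cloudy_given_rain)  ## equal to 1: it only rains when it is cloudy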
Below, we provide an interactive figure to visualise key concepts of probability theory. We show a possible space of outcomes in yellow, and two subsets of outcomes (A and B) in red and blue. Think of this as a collection of outcomes (they can be discrete, such as in the weather example above, but they need not be): A could be all possible types of rainy weather you can think of, B could be storms (including, say, thunderstorms, hurricanes, tornados, ...). As you change the number of outcomes in B, and change their position relative to the outcomes in A, observe how the probabilities change.
def circle_intersection(r1, r2, d):
tinner_left = (d**2. + r2**2. - r1**2.)/(2.*d*r2)
tinner_right = (d**2. + r1**2. - r2**2.)/(2.*d*r1)
if d >= (r1 + r2):
area = 0.0
elif r2 <= (r1+d):
area = (r2**2.)*np.arccos(tinner_left) + (r1**2.)*np.arccos(tinner_right) \
- 0.5*np.sqrt((-d+r2+r1)*(d+r2-r1)*(d+r1-r2)*(d+r1+r2))
else:
area = np.pi*r1**2.
return area
import numpy.testing as npt
def check_circle(r1):
r2 = 0.5*r1
npt.assert_allclose(circle_intersection(r1,r2,r1+r2), 0.0)
npt.assert_allclose(circle_intersection(r1,r2, r1-r2), np.pi*r2**2.)
return
def test_circle():
for r in (1,2,3):
check_circle(r)
test_circle()
from IPython.display import display, HTML
def html_figure(fig):
"""Return an HTML <img> tag with the data for a matplotlib figure embedded"""
from IPython.core.pylabtools import print_figure
import base64
fdata64 = base64.b64encode(print_figure(fig))
html_tpl = '<img alt="Figure" src="data:image/png;base64,{}">'
return html_tpl.format(fdata64)
def html_table(dct, title=None, fmt="{:.4f}"):
"""Produce an HTML table from a dict, with an optional title.
"""
import html as H
h = H.HTML()
if title:
h.h2(title)
t = h.table(border='1')
for k, v in dct.items():
r = t.tr
r.td(str(k))
r.td(fmt.format(v))
return str(h)
from collections import OrderedDict
def probabilities(ra, rb, offset):
square_area = 1
aa = np.pi*ra**2.
ab = np.pi*rb**2.
    a_complement = 1.0 - aa  ## complement of A within the total outcome space (area 1)
intersection = circle_intersection(ra, rb, offset)
union = aa + ab - intersection
agivenb = intersection/ab
bgivena = intersection/aa
od = OrderedDict()
od["P(A)"] = aa
od["P(B)"] = ab
od["P(A^c)"] = ab_complement
od["P(A and B)"] = intersection
od["P(A or B)"] = union
od["P(A|B)"] = agivenb
od["P(B|A)"] = bgivena
return od
def draw_circles(r2):
xlim = (-0.25,.75)
ylim = (-0.5,.5)
a1 = 0.05
r1 = np.sqrt(a1/np.pi)
offset = 0.25
d = 2.*offset
colors = sns.color_palette()
fig, ax = plt.subplots(figsize=(6,6), subplot_kw=dict(axisbg=colors[5]))
ax.set_xlim(xlim)
ax.set_ylim(ylim)
ax.set_xticks([])
ax.set_yticks([])
ax.set_aspect('equal')
c1 = plt.Circle((0, 0), r1, facecolor=colors[2], fill=True, alpha=0.8)
ax.text(0-0.02,0-0.02, "A", color='black', fontsize=20)
c2 = plt.Circle((offset, 0), r2, facecolor=colors[1], fill=True, alpha=0.8 )
ax.text(offset-0.01,0-0.02, "B", color='black', fontsize=20)
ax.add_artist(c1)
ax.add_artist(c2)
plt.close(fig)
probtab = probabilities(r1, r2, offset)
# Produce HTML side-by-side display
fh = html_figure(fig)
h = html_table(probtab, title="Probabilities")
html = HTML("""
<div style="float:left;width=50%">
{fh}
</div>
<div style="width=50%">
{h}
</div>
""".format(**locals()))
display(html)
return
#draw_circles(0.1)
interact(draw_circles, r2=(0.001,0.5, 0.01));
The slider adjusts the radius of circle $B$.
Both numpy and scipy have many of the most useful standard distributions already in their code base. You've already seen above how to quickly and easily make random numbers in numpy. If that's all you want (and this can be incredibly powerful for producing simulated data), you're good to go.
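For reference, a small sketch of a few of numpy's random number generators used throughout this notebook (the parameter values are arbitrary):
## draw a handful of random numbers from a few different distributions
print(np.random.uniform(0, 1, size=5))   ## uniformly distributed between 0 and 1
print(np.random.normal(2, 2, size=5))    ## normal with mean 2 and standard deviation 2
print(np.random.poisson(5, size=5))      ## Poisson distributed with rate 5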
However, scipy can do more. In scipy.stats, there are classes for many distributions, with methods that can do a whole lot more than just produce random numbers for you! For example, producing an instance of a normal distribution with "frozen" parameters (mean = 2, standard deviation = 4 here) goes like this:
import scipy.stats
n = scipy.stats.norm(2, 4)
You can now use this distribution in various ways. For example, to plot the pdf and cdf, you can do this:
## define some x coordinates
x = np.linspace(-10, 20, 2000)
## make a figure
fig, (ax,ax2) = plt.subplots(1,2,figsize=(12,6))
ax.plot(x, n.pdf(x))
ax.set_title("PDF")
ax2.plot(x, n.cdf(x))
ax2.set_title("CDF")
Similarly, the frozen distribution object gives you direct access to the mean, median, standard deviation and other moments of the distribution.
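For example, using the frozen distribution n defined above (a quick sketch of some of the available methods):
print("mean: %.4f" % n.mean())
print("median: %.4f" % n.median())
print("standard deviation: %.4f" % n.std())
print("variance: %.4f" % n.var())
print("interval containing 68% of the probability mass: " + str(n.interval(0.68)))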
Something else that might be useful is the capability of fitting a distribution to a data sample:
## let's make another sample
sample = np.random.normal(2,2,size=1000)
## now fit a normal distribution to the sample with the
## fit method of the norm class
f = scipy.stats.norm.fit(sample)
print("The fit parameter for the mean of the distribution is: %.4f"%f[0])
print("The fit parameter for the standard deviation of the distribution is: %.4f"%f[1])
The fit parameter for the mean of the distribution is: 2.0043
The fit parameter for the standard deviation of the distribution is: 1.9015
The normal (or Gaussian) distribution is probably the most important distribution in statistics. It is defined by two parameters, the mean $\mu$ and the standard deviation $\sigma$, and has the probability density $$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$ It is symmetric about the mean, and about $68\%$ of its probability mass lies within one standard deviation of the mean. Quantities that arise as the sum or average of many small, independent contributions are approximately normally distributed (see the section on averaging data further below), which is one reason the normal distribution appears so often in data analysis. The sliders below change the mean $m$ and the standard deviation $s$.
def plot_dist(x,n, continuous=True, distname=""):
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(12,6))
if continuous:
density = n.pdf
title = "Probability Density Function"
plot_symbol = '-'
else:
density = n.pmf
title = "Probability Mass Function"
plot_symbol = 'bo'
distribution = density(x)
ax1.plot(x, distribution, plot_symbol, color="blue", label=distname)
if continuous:
ax1.fill_between(x, 0, distribution, color="blue", alpha=0.6)
else:
ax1.vlines(x, 0, distribution, colors='b', lw=5, alpha=0.5)
ax1.set_title(title)
cdf = n.cdf(x)
ax2.plot(x, cdf, lw=1, color="blue", label=distname)
ax2.fill_between(x, 0, cdf, color="blue", alpha=0.6)
ax2.set_title("CDF")
return ax1, ax2
def normal(m,s):
x = np.linspace(-10, 20, 2000)
n = scipy.stats.norm(m, s)
plot_dist(x,n)
return
interact(normal, m=(-5.,10.), s=(0.001, 10.0));
The Poisson distribution describes the number of events occurring in a fixed interval when the events happen independently and at a constant average rate. It is a discrete distribution with a single rate parameter $\mu$ and probability mass function $$P(k) = \frac{\mu^k e^{-\mu}}{k!}, \quad k = 0, 1, 2, \ldots$$ Both its mean and its variance are equal to $\mu$, and it is the natural distribution for counting data, for example the number of photons arriving at a detector per second. The slider below changes the rate parameter $m$.
def poisson(m):
d = scipy.stats.poisson(m)
x = np.arange(d.ppf(0.01), d.ppf(0.99))
plot_dist(x, d, continuous=False)
return
interact(poisson, m=(1,200,1));
The binomial distribution describes the number of successes $k$ in $n$ independent trials, each of which succeeds with probability $p$: $$P(k) = \binom{n}{k}p^k(1-p)^{n-k}.$$ A classic example is the number of heads in $n$ coin tosses. In scipy.stats it is available as scipy.stats.binom(n, p); the code below plots its probability mass function for $n=100$ trials with success probability $p=0.1$.
#def binomial(n, p):
#    d = scipy.stats.binom(n, p)
#    x = np.arange(d.ppf(0.01), d.ppf(0.99))
#    plot_dist(x, d, continuous=False)
#    return
#
#interact(binomial, n=(1, 200, 1), p=(0.0, 1.0, 0.05));
## binomial distribution with n=100 trials and success probability p=0.1
d = scipy.stats.binom(100, 0.1)
## evaluate the probability mass function between the 1% and 99% quantiles
x = np.arange(d.ppf(0.01), d.ppf(0.99))
print(x)
print(d.pmf(x))
plt.plot(x, d.pmf(x), "o", color="blue", label="Binomial Distribution")
Let's go back to the Poisson distribution. Below is a plot of the distribution where, again, you can adjust the rate parameter. A normal distribution with the same mean and variance is overplotted in red, for comparison.
def poisson2(m):
d = scipy.stats.poisson(m)
x = np.arange(d.ppf(0.01), d.ppf(0.99))
ax1, ax2 = plot_dist(x, d, continuous=False, distname="Poisson distribution")
ax1, ax2 = overplot_gaussian(ax1, ax2, d)
ax2.legend(loc="upper left")
return
def overplot_gaussian(ax1, ax2, d):
x = np.arange(d.ppf(0.01), d.ppf(0.99))
n = scipy.stats.norm(d.args[0], np.sqrt(d.args[0]))
ax1.plot(x, n.pdf(x), lw=3, color="red")
ax2.plot(x, n.cdf(x), lw=3, color="red",label="Normal distribution")
return ax1, ax2
interact(poisson2, m=(1,200,1));
#from IPython.html.widgets import fixed, interact, interactive
#from collections import OrderedDict
#functions = OrderedDict()
#for func in (poisson2, overplot_gaussian):
# functions[func.__name__] = func
#from IPython.display import display
# Create a plotter that takes a function to plot and displays the plot
#def p(x, function):
# p.ax.plot(x, function(x))
# display(p.fig)
# Little hack to attach a persistent figure to the function object
# so we can overplot. Note that we close the figure to avoid
# IPython's auto-display of figures that would create a duplicate
#p.fig, p.ax = plt.subplots()
#plt.close(p.fig)
#from IPython.html.widgets import fixed, interactive
#interactive(p, x=fixed(x), function=functions)
#interact(poisson2, m=(1,100,1));
In the limit of a large rate parameter $\mu$, the Poisson distribution approaches a normal distribution with mean $\mu$ and standard deviation $\sqrt{\mu}$. You can see this in the interactive figure above: for small values of $m$ the red normal curve describes the discrete, asymmetric Poisson probabilities rather poorly, but as you increase $m$ the two become nearly indistinguishable.
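Here is a small non-interactive check of this (the two rate values are arbitrary example choices): we compare the Poisson probability mass function to a normal density with the same mean and variance and print the largest difference between the two.
## compare the Poisson pmf to its normal approximation for a small and a large rate parameter
for mu in (2, 100):
    d = scipy.stats.poisson(mu)
    n_approx = scipy.stats.norm(mu, np.sqrt(mu))
    k = np.arange(d.ppf(0.001), d.ppf(0.999))
    max_diff = np.max(np.abs(d.pmf(k) - n_approx.pdf(k)))
    print("mu = %3i: largest difference between Poisson pmf and normal pdf: %.5f" % (mu, max_diff))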
Averaging data makes a Gaussian distribution, too! This is the content of the central limit theorem: the average of many independent measurements tends towards a normal distribution, almost regardless of the distribution the individual measurements were drawn from. The code below simulates repeated noisy measurements of a sinusoidal signal and averages them; the right-hand panel shows a histogram of the residuals of the averaged signal around the true signal, which is well described by a normal distribution that becomes narrower as the number of averaged simulations grows.
x = np.linspace(0,1000, 1000)
def sine_data(x, period, amplitude):
y = amplitude*np.sin(2.*np.pi*x/period)
return y
def compute_error(y, sigma):
return np.random.normal(0, sigma, size=len(y))
#y = sine_data(x, 200, 10)
#plt.plot(x,add_error(y, 30))
def average_sims(nsims):
period=200
amplitude = 10
noise = 30
x = np.linspace(0,1000, 1000)
y = sine_data(x, period, amplitude)
    ## simulate nsims noisy realisations of the signal and collect them
    yall = []
    for n in range(nsims):
        yerr = compute_error(y, noise)
        yall.append(y+yerr)
    yall = np.array(yall)
    ## average all simulated data sets and compute the residuals around the true signal
    ymean = np.mean(yall, axis=0)
    yres = ymean - amplitude*np.sin(2.*np.pi*x/period)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14,7))
#ax1.plot(x, yres, lw=2, color="blue", linestyle='steps-mid')
ax1.plot(x, ymean, lw=2, color="black", linestyle='steps-mid')
ax1.set_ylim(-100, 100)
ax2.hist(yres, bins=100, range=[-100,100], color="blue", histtype="stepfilled", alpha=0.6)
ax2.set_xlim(-100, 100)
return
interact(average_sims, nsims=(1,500,1));
Estimating the best-fit parameters of a distribution is one of the standard problems in astronomy. numpy and scipy already provide many different tools for this. In the following, we're going to play around with a Poisson likelihood for a while. The Poisson distribution is useful wherever you count discrete, rare events: for example, photons in a detector, car accidents, or similar data sets.
Because likelihood values can become both very large and very small, we'll work with the log-likelihood instead; since the logarithm is a monotonic function, this changes nothing about where the maximum lies, and we lose none of our ability to make inferences from the data.
First, we define the log-likelihood. Then we make some Poisson sample data with a rate parameter of 5. This could be, for example, the count rate in a photon detector registering of the order of 5 photons per second.
import scipy.special

def poisson_loglike(rate, data):
    ## make the rate parameter into an array of the same length as the data:
    r = np.ones(len(data))*rate
    ## now we can compute the log-likelihood:
    llike = -np.sum(r) + np.sum(data*np.log(r))-np.sum(scipy.special.gammaln(data + 1))
    return llike
sample = np.random.poisson(5, size=10000)
I'll leave it to you to verify that the expression in the definition above is indeed the logarithm of the Poisson likelihood for $N$ samples, $$\log L(\lambda) = \sum_{i=1}^{N}\left(x_i \log \lambda - \lambda - \log(x_i!)\right),$$ where $\lambda$ is the rate parameter and the $x_i$ are the observed counts (note that $\log(x_i!) = \log\Gamma(x_i+1)$, which is what scipy.special.gammaln computes).
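If you would rather let the computer do the checking, here is a quick sketch that compares our implementation against the log probability mass function provided by scipy.stats.poisson, summed over all data points:
test_rate = 5.0
print("our log-likelihood:     %.4f" % poisson_loglike(test_rate, sample))
print("scipy's log-likelihood: %.4f" % np.sum(scipy.stats.poisson.logpmf(sample, test_rate)))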
Now, let's have a look at the shape of the likelihood. The slider lets you choose how many events from our sample are included in the computation:
## return likelihood for sample s for a bunch of guesses
def guess_like(s):
    guesses = np.arange(1, 100)  ## start at 1: a rate of exactly 0 would make the logarithm undefined
like_all = [poisson_loglike(g, s) for g in guesses]
return like_all
def plot_like(nsamples):
    fig = plt.figure(figsize=(12,12))
    plt.plot(np.arange(1, 100), np.array(guess_like(sample[:nsamples])), color="black", label="%i samples"%nsamples)
    plt.xlabel("Rate parameter")
    plt.ylabel("log-likelihood")
    plt.legend(loc="lower right")
interact(plot_like, nsamples=(10,10000));
As more data points are included, the likelihood becomes more and more sharply peaked around the value from which the data were drawn (here, a rate parameter of 5).
Usually, our models are more complex than a simple distribution with one or two parameters. Say you have a spectrum and want to fit a spectral model. You have pretty good data, so you can approximate the measurement uncertainties with a Gaussian distribution. scipy already gives you the tools you need to do this: scipy.optimize allows you to find the minimum of a given function, and for (weighted) least squares, scipy.optimize.curve_fit will do the trick:
## define a straight line function:
def straight(x, a, b):
return x*a + b
## x-coordinate
x = np.linspace(0, 10, 1000)
## some data
a = -2
b = 3
y = straight(x, a, b)
## add some random Gaussian noise:
data = np.random.normal(0, 2, size=len(x))+y
## fit straight line to data:
popt, pcov = scipy.optimize.curve_fit(straight, x, data)
print("Fit straight line data with a=-2, b=3")
print("Fitted parameter a is %.3f"%popt[0])
print("Fitted parameter b is %.3f"%popt[1])
Fit straight line data with a=-2, b=3
Fitted parameter a is -1.993
Fitted parameter b is 2.924
curve_fit also returns the covariance matrix, which is really handy: taking the square root of its diagonal gives you the estimated standard deviation of each fitted parameter, i.e. an estimate of its error.
errors = np.sqrt(np.diag(pcov))
print("Fitted parameter a with error is %.3f +/- %.3f"%(popt[0], errors[0]))
print("Fitted parameter b with error is %.3f +/- %.3f"%(popt[1], errors[1]))
Fitted parameter a with error is -1.993 +/- 0.023
Fitted parameter b with error is 2.924 +/- 0.130
However, it is also possible to use likelihood functions other than Gaussian ones. In this case, define the log-likelihood explicitly and use one of scipy.optimize's many minimisation routines. Be sure to minimise the negative log-likelihood!
Recall the Poisson likelihood from earlier. Instead of plotting the log-likelihood and finding the maximum on the plot, we can estimate it by minimising the negative log-likelihood:
sample = np.random.poisson(5, size=10000)
## define the -log-likelihood
neg_poisson = lambda r: -poisson_loglike(r, sample)
## minimise using Powell's algorithm, starting value for rate parameter is 10
result = scipy.optimize.fmin_powell(neg_poisson, 10, full_output=True)
print("The fit-parameter for the Poisson rate parameter: %.3f "%result[0])
Optimization terminated successfully.
         Current function value: 22094.062420
         Iterations: 2
         Function evaluations: 24
The fit-parameter for the Poisson rate parameter: 4.996
-c:5: RuntimeWarning: invalid value encountered in log
(The RuntimeWarning above appears because the optimisation routine can try negative rate values during its search, for which the logarithm is undefined; it does not affect the final result here.) There is quite a large range of optimisation routines available in scipy.optimize, all of which have their own strengths and weaknesses. I invite you to have a look at the documentation, and to play around with some of the algorithms to see which works best for a given problem.
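As a starting point, here is a small sketch (reusing the neg_poisson function and sample from above; the choice of algorithms and starting value is arbitrary) that runs the same fit with two of the derivative-free methods available through scipy.optimize.minimize:
## minimise the negative Poisson log-likelihood with two different algorithms
for method in ("Nelder-Mead", "Powell"):
    res = scipy.optimize.minimize(neg_poisson, x0=10.0, method=method)
    print("%11s: fitted rate parameter = %.4f" % (method, np.atleast_1d(res.x)[0]))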
Below are some literature suggestions. This list is by no means exhaustive; we welcome additional suggestions! Books are marked with (b), papers with (p).
Statistics: