Annotated notes from Lecture 21

GOALS:

  • Learn the tools needed for the data fitting homework assignment
  • Learn some generally useful data fitting techniques

TUTORIAL TODAY:
Load up this IPython notebook and change the values of m and b in the "Fitting Lines to Data" section, A and B in "Curve Fitting", and a and b in "Fitting a Power Law".

In [1]:
# You can ignore this, it's just for aesthetic purposes
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8,5)
matplotlib.rcParams['savefig.dpi'] = 100

Fitting Lines to Data

We'll cover very basic line fitting, largely ignoring the subtleties of the statistics in favor of showing you how to perform simple fits of models to data.

In [2]:
# These import commands set up the environment so we have access to numpy and pylab functions
import numpy as np
import pylab as pl

# Data Fitting
# First, we'll generate some fake data to use
x = np.linspace(0,10,50) # 50 x points from 0 to 10

# Remember, you can look at the help for linspace too:
# help(np.linspace)
In [3]:
# y = m x + b
y = 2.5 * x + 1.2
In [4]:
# let's plot that
pl.clf()
pl.plot(x,y)
Out[4]:
[<matplotlib.lines.Line2D at 0x105959150>]
In [5]:
# looks like a simple line.  But we want to see the individual data points
pl.plot(x,y,marker='s')
Out[5]:
[<matplotlib.lines.Line2D at 0x1031a3290>]
In [6]:
# We need to add noise first
noise = pl.randn(y.size)
# Like IDL, python has a 'randn' function; it draws values from a normal
# distribution with mean 0 and standard deviation 1.
# IDL's 'randomu' is 'pl.rand' instead
# What's y.size?
print y.size
print len(y)
50
50

y.size is the number of elements in y, just like len(y) or, in IDL, n_elements(y)

In [7]:
# We can add arrays in python just like in IDL
noisy_flux = y + noise
# We'll plot it too, but this time without any lines
# between the points, and we'll use black dots
# ('k' is a shortcut for 'black', '.' means 'point')
pl.clf() # clear the figure
pl.plot(x,noisy_flux,'k.')
# We need labels, of course
pl.xlabel("Time")
pl.ylabel("Flux")
Out[7]:
<matplotlib.text.Text at 0x1059705d0>

Now we're onto the fitting stage. We're going to fit a function of the form $$y = m*x + b$$ which is the same as $$f(x) = p[0]*x + p[1]$$ to the data (numpy's polyfit returns the coefficients highest power first). This is called "linear regression", but it is also a special case of a more general concept: this is a first-order polynomial. "First Order" means that the highest exponent of x in the equation is 1.
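
As an aside, the same function handles higher orders. Here's a minimal sketch (not one of the tutorial's cells) fitting a second-order polynomial to made-up quadratic data; note the coefficients come back highest power first:

xx = np.linspace(0, 10, 50)
yy = 3.0 * xx**2 + 2.5 * xx + 1.2
print np.polyfit(xx, yy, 2)   # approximately [ 3.   2.5  1.2]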

In [8]:
# We'll use polyfit to find the values of the coefficients.  The third
# parameter is the "order"
p = np.polyfit(x,noisy_flux,1)
# help(polyfit) if you want to find out more
In [9]:
# print our fit parameters.  They are not exact because there's noise in the data!
# note that this is an array!
print p
print type(p) # you can ask python to tell you what type a variable is
[ 2.59289715  0.71443764]
<type 'numpy.ndarray'>

In [10]:
# Great!  We've got our fit.  Let's overplot the data and the fit now
pl.clf() # clear the figure
pl.plot(x,noisy_flux,'k.') # repeated from above
pl.plot(x,p[0]*x+p[1],'r-') # A red solid line
pl.xlabel("Time") # labels again
pl.ylabel("Flux")
Out[10]:
<matplotlib.text.Text at 0x107b2ca50>
In [11]:
# Cool, but there's another (better) way to do this.  We'll use the polyval
# function instead of writing out the m x + b equation ourselves
pl.clf() # clear the figure
pl.plot(x,noisy_flux,'k.') # repeated from above
pl.plot(x,np.polyval(p,x),'r-') # A red solid line
pl.xlabel("Time") # labels again
pl.ylabel("Flux")
Out[11]:
<matplotlib.text.Text at 0x105978dd0>
In [12]:
# help(polyval) if you want to find out more

Let's do the same thing with a noisier data set. I'm going to leave out most of the comments this time.

In [13]:
noisy_flux = y+noise*10
p = np.polyfit(x,noisy_flux,1)
print p
[ 3.4289715  -3.65562359]

In [14]:
# plot it
pl.clf() # clear the figure
pl.plot(x,noisy_flux,'k.') # repeated from above
pl.plot(x,np.polyval(p,x),'r-',label="Best fit") # A red solid line
pl.plot(x,2.5*x+1.2,'b--',label="Input") # a blue dashed line showing the REAL line
pl.legend(loc='best') # make a legend in the best location
pl.xlabel("Time") # labels again
pl.ylabel("Flux")
Out[14]:
<matplotlib.text.Text at 0x107617750>

Despite the noisy data, our fit is still pretty good! One last plotting trick, then we'll move on.

In [15]:
pl.clf() # clear the figure
pl.errorbar(x,noisy_flux,yerr=10,marker='.',color='k',linestyle='none') # errorbar requires some extras to look nice
pl.plot(x,np.polyval(p,x),'r-',label="Best fit") # A red solid line
pl.plot(x,2.5*x+1.2,'b--',label="Input") # a blue dashed line showing the REAL line
pl.legend(loc='best') # make a legend in the best location
pl.xlabel("Time") # labels again
pl.ylabel("Flux")
Out[15]:
<matplotlib.text.Text at 0x1036977d0>

Curve Fitting

We'll now move on to more complicated curves. What if the data looks more like a sine curve? We'll create "fake data" in basically the same way as above.

In [16]:
# this time we want our "independent variable" to be in radians
x = np.linspace(0,2*np.pi,50)
y = np.sin(x)
pl.clf()
pl.plot(x,y)
Out[16]:
[<matplotlib.lines.Line2D at 0x105f919d0>]
In [17]:
# We'll make it noisy again
noise = pl.randn(y.size)
noisy_flux = y + noise
pl.plot(x,noisy_flux,'k.') # no clear this time
Out[17]:
[<matplotlib.lines.Line2D at 0x107b1cbd0>]

That looks like kind of a mess. Let's see how well we can fit it. The function we're trying to fit has the form: $$f(x) = A \sin(x - B)$$ where $A$ is a "scale" parameter and $B$ is the side-to-side offset (or the "delay" if the x-axis is time). For our data, they are $A=1$ and $B=0$ respectively, because we made $y=\sin(x)$.

In [18]:
# curve_fit is the function we need for this, but it's in another package called scipy
from scipy.optimize import curve_fit
# we need to know what it does:
help(curve_fit)
Help on function curve_fit in module scipy.optimize.minpack:

curve_fit(f, xdata, ydata, p0=None, sigma=None, **kw)
    Use non-linear least squares to fit a function, f, to data.
    
    Assumes ``ydata = f(xdata, *params) + eps``
    
    Parameters
    ----------
    f : callable
        The model function, f(x, ...).  It must take the independent
        variable as the first argument and the parameters to fit as
        separate remaining arguments.
    xdata : An N-length sequence or an (k,N)-shaped array
        for functions with k predictors.
        The independent variable where the data is measured.
    ydata : N-length sequence
        The dependent data --- nominally f(xdata, ...)
    p0 : None, scalar, or M-length sequence
        Initial guess for the parameters.  If None, then the initial
        values will all be 1 (if the number of parameters for the function
        can be determined using introspection, otherwise a ValueError
        is raised).
    sigma : None or N-length sequence
        If not None, it represents the standard-deviation of ydata.
        This vector, if given, will be used as weights in the
        least-squares problem.
    
    Returns
    -------
    popt : array
        Optimal values for the parameters so that the sum of the squared error
        of ``f(xdata, *popt) - ydata`` is minimized
    pcov : 2d array
        The estimated covariance of popt.  The diagonals provide the variance
        of the parameter estimate.
    
    See Also
    --------
    leastsq
    
    Notes
    -----
    The algorithm uses the Levenburg-Marquardt algorithm through `leastsq`.
    Additional keyword arguments are passed directly to that algorithm.
    
    Examples
    --------
    >>> import numpy as np
    >>> from scipy.optimize import curve_fit
    >>> def func(x, a, b, c):
    ...     return a*np.exp(-b*x) + c
    
    >>> x = np.linspace(0,4,50)
    >>> y = func(x, 2.5, 1.3, 0.5)
    >>> yn = y + 0.2*np.random.normal(size=len(x))
    
    >>> popt, pcov = curve_fit(func, x, yn)


Look at the returns:

Returns
-------
popt : array
    Optimal values for the parameters so that the sum of the squared error
    of ``f(xdata, *popt) - ydata`` is minimized
pcov : 2d array
    The estimated covariance of popt.  The diagonals provide the variance
    of the parameter estimate.

So the first return value holds the "best-fit parameters", while the second is the "covariance matrix"

In [19]:
def sinfunc(x,a,b):
    return a*np.sin(x-b)
fitpars, covmat = curve_fit(sinfunc,x,noisy_flux)
# The diagonals of the covariance matrix are variances
# variance = standard deviation squared, so we'll take the square roots to get the standard deviations!
# You can get the diagonals of a 2D array easily:
variances = covmat.diagonal()
std_devs = np.sqrt(variances)
print fitpars,std_devs
[ 1.24208994  0.15896795] [ 0.2237797   0.17677306]
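
A note from the help above: a sinusoid fit can converge to a wrong local minimum, and you can steer curve_fit by passing a starting guess via the p0 keyword. A quick sketch (here the guess is the true values we built the data from):

fitpars, covmat = curve_fit(sinfunc, x, noisy_flux, p0=[1.0, 0.0])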

In [20]:
# Let's plot our best fit, see how well we did
# These two lines are equivalent:
pl.plot(x, sinfunc(x, fitpars[0], fitpars[1]), 'r-')
pl.plot(x, sinfunc(x, *fitpars), 'r-')
Out[20]:
[<matplotlib.lines.Line2D at 0x105978110>]

Again, this is pretty good despite the noisiness.

Fitting a Power Law

Power laws occur all the time in physics, so it's a good idea to learn how to use them.

What's a power law? Any function of the form: $$f(t) = a t^b$$ where $t$ is your independent variable, $a$ is a scale parameter, and $b$ is the exponent (the power).

When fitting power laws, it's very useful to take advantage of the fact that "a power law is linear in log-space". That means, if you take the log of both sides of the equation (which is allowed, as long as both sides are positive) and change variables, you get a linear equation! $$\ln(f(t)) = \ln(a t^b) = \ln(a) + b \ln(t)$$ We'll use the substitutions $y=\ln(f(t))$, $A=\ln(a)$, and $x=\ln(t)$, so that $$y = A + b x$$ which looks just like our linear equation from before (albeit with different letters for the fit parameters).
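
If you want to convince yourself the substitution works, here's a quick numeric sanity check (a sketch using made-up values $a=1.5$, $b=2.5$):

tt = np.array([1.0, 2.0, 4.0])
print np.log(1.5 * tt**2.5)           # the log of the power law, computed directly
print np.log(1.5) + 2.5 * np.log(tt)  # the linearized right-hand side
# both lines print identical numbers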

We'll now go through the same fitting exercise as before, but using power laws instead of lines.

In [21]:
t = np.linspace(0.1,10)  
a = 1.5
b = 2.5
z = a*t**b 
In [22]:
pl.clf()
pl.plot(t,z)
Out[22]:
[<matplotlib.lines.Line2D at 0x1077799d0>]
In [23]:
# Change the variables
# np.log is the natural log
y = np.log(z)
x = np.log(t)
pl.clf()
pl.plot(x,y)
pl.ylabel("log(z)")
pl.xlabel("log(t)")
Out[23]:
<matplotlib.text.Text at 0x10779f190>

It's a straight line. Now, for our "fake data", we'll add the noise before transforming from "linear" to "log" space.

In [24]:
noisy_z = z + pl.randn(z.size)*10
pl.clf()
pl.plot(t,z)
pl.plot(t,noisy_z,'k.')
Out[24]:
[<matplotlib.lines.Line2D at 0x107785c10>]
In [25]:
noisy_y = np.log(noisy_z)
pl.clf()
pl.plot(x,y)
pl.plot(x,noisy_y,'k.')
pl.ylabel("log(z)")
pl.xlabel("log(t)")
Out[25]:
<matplotlib.text.Text at 0x105a9aed0>

Note how different this looks from the "noisy line" we plotted earlier. Power laws are much more sensitive to noise! In fact, there are some data points that don't even show up on this plot because you can't take the log of a negative number. Any points where the random noise was negative enough that the curve dropped below zero ended up as "NaN", or "Not a Number". Luckily, our plotter knows to ignore those values, but polyfit doesn't.
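
You can see this for yourself with a one-line sketch:

print np.log(-1.0)   # nan (numpy may also print a warning, depending on your error settings)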

In [26]:
print noisy_y
[ 2.64567168  1.77226366         nan  1.49412348         nan  1.28855427
  0.58257979 -0.50719229  1.84461177  2.72918637         nan -1.02747374
 -0.65048639  3.4827123   2.60910913  3.48086587  3.84495668  3.90751514
  4.10072678  3.76322421  4.06432005  4.23511474  4.26996586  4.25153639
  4.4608377   4.5945862   4.66004888  4.8084364   4.81659305  4.97216304
  5.05915226  5.02661678  5.05308181  5.26753675  5.23242563  5.36407125
  5.41827503  5.486903    5.50263105  5.59005234  5.6588601   5.7546482
  5.77382726  5.82299095  5.92653466  5.93264584  5.99879722  6.07711596
  6.13466015  6.12248799]

In [27]:
# try to polyfit a line
pars = np.polyfit(x,noisy_y,1)
print pars
[ nan  nan]

In order to get around this problem, we need to mask the data. That means we have to tell the code to ignore all the data points where noisy_y is nan.

My favorite way to do this is to take advantage of a curious fact: any number equals itself ($1=1$), but NaN does not equal NaN (nan != nan).

In [28]:
print 1 == 1
print np.nan == np.nan
True
False

So if we find all the places where noisy_y != noisy_y, we can get rid of them. Or we can just use the places where noisy_y equals itself.

In [29]:
OK = noisy_y == noisy_y
print OK
[ True  True False  True False  True  True  True  True  True False  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True]

This OK array is a "boolean mask". We can use it as an "index array", which is pretty neat.
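
Two quick sketches as an aside: numpy has a dedicated function for the NaN test, and a boolean mask can come from any elementwise comparison, not just the NaN trick.

OK = ~np.isnan(noisy_y)    # the same mask; ~ negates a boolean array elementwise

arr = np.array([1.0, -2.0, 3.0, -4.0])
print arr[arr > 0]         # prints [ 1.  3.]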

In [30]:
print "There are %i OK values" % (OK.sum())
masked_noisy_y = noisy_y[OK]
masked_x = x[OK]
print "masked_noisy_y has length",len(masked_noisy_y)
There are 47 OK values
masked_noisy_y has length 47

In [31]:
# now polyfit again
pars = np.polyfit(masked_x,masked_noisy_y,1)
print pars
[ 1.50239281  1.9907672 ]

In [32]:
# cool, it worked.  But the fit looks a little weird!
fitted_y = np.polyval(pars,x)
pl.plot(x, fitted_y, 'r--')
Out[32]:
[<matplotlib.lines.Line2D at 0x10368e1d0>]

The noise has thrown off our fit: in log space, the points at small $t$ (where the noise is huge compared to the signal) carry as much weight as everything else.

In [33]:
# Convert back to linear space to see what it "really" looks like
fitted_z = np.exp(fitted_y)
pl.clf()
pl.plot(t,z)
pl.plot(t,noisy_z,'k.')
pl.plot(t,fitted_z,'r--')
pl.xlabel('t')
pl.ylabel('z')
Out[33]:
<matplotlib.text.Text at 0x10761fb50>

That's pretty bad. A "least-squares" fit in linear space, as with curve_fit, is probably going to be the better choice. However, in the absence of noise (i.e., on your homework), the log-space approach should work.

In [34]:
def powerlaw(x,a,b):
    return a*(x**b)
pars,covar =  curve_fit(powerlaw,t,noisy_z)
pl.clf()
pl.plot(t,z)
pl.plot(t,noisy_z,'k.')
pl.plot(t,powerlaw(t,*pars),'r--')
pl.xlabel('t')
pl.ylabel('z')
Out[34]:
<matplotlib.text.Text at 0x107451e10>

Tricks with Arrays

We need to cover a few syntactic differences between IDL and python.

In IDL, if you wanted the maximum value in an array, you would do:
maxval = max(array, location_of_max)

In python, it's more straightforward:
location_of_max = array.argmax()
or
location_of_max = np.argmax(array)
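
A quick sketch of the difference between the maximum and its location:

arr = np.array([3, 7, 2])
print arr.argmax()          # prints 1, the *index* of the maximum
print arr[arr.argmax()]     # prints 7, the maximum itself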

Now, say we want to determine the location of the maximum of a number of different functions. The functions we'll use are:
$\sin(x)$
$\sin^2(x)$
$\sin^3(x)$
$\sin(x)\cos(x)$

We'll define these functions, then loop over them.

In [35]:
# sin(x) is already defined
def sin2x(x):
    """ sin^2 of x """
    return np.sin(x)**2
def sin3x(x):
    """ sin^3 of x """
    return np.sin(x)**3
def sincos(x):
    """ sin(x)*cos(x) """
    return np.sin(x)*np.cos(x)
list_of_functions = [np.sin, sin2x, sin3x, sincos]
In [36]:
# we want 0-2pi for these functions
t = np.linspace(0,2*np.pi)
# this is the cool part: we can make a variable function
for fun in list_of_functions:
    # the functions know their own names (in a "secret hidden variable" called __name__)
    print "The maximum of ",fun.__name__," is ", fun(t).max()
The maximum of  sin  is  0.999486216201
The maximum of  sin2x  is  0.998972696375
The maximum of  sin3x  is  0.998459440388
The maximum of  sincos  is  0.4997431081

In [37]:
# OK, but we wanted the location of the maximum....
for fun in list_of_functions:
    print "The location of the maximum of ",fun.__name__," is ", fun(t).argmax()
The location of the maximum of  sin  is  12
The location of the maximum of  sin2x  is  12
The location of the maximum of  sin3x  is  12
The location of the maximum of  sincos  is  6

In [38]:
# well, that's not QUITE what we want, but it's close
# We want to know the value of t, not the index!
for fun in list_of_functions:
    print "The location of the maximum of ",fun.__name__," is ", t[fun(t).argmax()]
The location of the maximum of  sin  is  1.5387392589
The location of the maximum of  sin2x  is  1.5387392589
The location of the maximum of  sin3x  is  1.5387392589
The location of the maximum of  sincos  is  0.769369629451

In [39]:
# Finally, what if we want to store all that in an array?
# Well, here's a cool trick: you can sort of invert the for loop
# This is called a "list comprehension":
maxlocs = [ t[fun(t).argmax()] for fun in list_of_functions ]
print maxlocs
[1.5387392589011231, 1.5387392589011231, 1.5387392589011231, 0.76936962945056153]

In [40]:
# Confused?  OK.  Try this one:
print range(6)
print [ii**2 for ii in range(6)]
[0, 1, 2, 3, 4, 5]
[0, 1, 4, 9, 16, 25]
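
You can also filter inside a list comprehension; a small sketch:

print [ii**2 for ii in range(6) if ii % 2 == 0]   # squares of the even numbers: [0, 4, 16]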

Recursion

The next big programming topic is called recursion.

For better or worse, you must visit the wiki page on recursion.

Recursion is when a function calls itself. But a recursive function must have an "end condition" to avoid infinite recursion. If you try to recurse forever, you will hit python's recursion limit and get a "maximum recursion depth exceeded" error, the python equivalent of a "stack overflow".

In [41]:
def fail(x):
    """ an infinite recursion. Will fail """
    return fail(x)
# WARNING: this will definitely fail, almost instantly, with a "maximum recursion depth exceeded" error!
# print fail(1)
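
For contrast, here's a minimal sketch of a recursion done right, with an end condition (this is deliberately not the Fibonacci solution):

def countdown(n):
    """ recursively print n, n-1, ..., 1 """
    if n <= 0:          # the end condition: stop recursing
        return
    print n
    countdown(n - 1)

countdown(3)   # prints 3, then 2, then 1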

Recursion is a useful mathematical tool. There are some problems that are best solved via recursion. The most commonly used example is the Fibonacci sequence:
0 1 1 2 3 5 8 13 21 ....
This sequence is defined by the recursive relation
$$F_n = F_{n-1} + F_{n-2}$$
or $$f(n) = f(n-1) + f(n-2)$$

It has the base conditions: $$f(0) = 0$$ $$f(1) = 1$$

This one is pretty easy to define in code. But I'll leave it as an exercise.

The following code shows the hierarchy of calls that results if you call fibonacci(5). Don't worry about trying to understand the 'digraph' code; it's just a graph visualization tool.

In [42]:
%install_ext https://raw.github.com/tkf/ipython-hierarchymagic/master/hierarchymagic.py
Installed hierarchymagic.py. To use it, type:
  %load_ext hierarchymagic

In [43]:
%load_ext hierarchymagic
In [44]:
%%dot --format svg -- 
digraph G { 
    f5 [label="f(5)"];
    Lf1 [label="f(1)"];
LLf0 [label="f(0)"];
Lf2 [label="f(2)"];
Lf3 [label="f(3)"];
RRf1  [label="f(1)"];
RRMf1 [label="f(1)"];
RRf0 [label="f(0)"];
LLf1 [label="f(1)"];
RMf0 [label="f(0)"];
RMf1 [label="f(1)"];
Rf2 [label="f(2)"];
RMf2 [label="f(2)"];
Rf3 [label="f(3)"];
Rf4 [label="f(4)"];
    f5->Lf3; f5->Rf4;  
    Lf3->Lf2; Lf3->Lf1;
    Rf4->Rf3; Rf4->Rf2;
    Rf2->RRf1; Rf2->RRf0;
    Lf2->LLf1; Lf2->LLf0;
    Rf3->RMf2; Rf3->RMf1;
    RMf2->RMf0; RMf2->RRMf1;
}
[Output: an SVG rendering of the fibonacci(5) call tree defined above]
In [45]:
# EXERCISE: Write a recursive Fibonacci function!
def fibonacci(n):
    return
    # put your code here!
In [46]:
# EXERCISE: Write a recursive factorial function
# Recall: factorial(x) = factorial(x-1) * x
# factorial(1) = 1
def factorial(x):
    return
In [47]:
# This is the test for the fibonacci code.  
# Run this after you've written your fibonacci sequence code
# it will fail if your code is wrong!
for ii,ff in zip(range(9),[0,1,1,2,3,5,8,13,21]):
    assert fibonacci(ii) == ff
print "Success!"
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-47-02b5a0d0ea36> in <module>()
      3 # it will fail if your code is wrong!
      4 for ii,ff in zip(range(9),[0,1,1,2,3,5,8,13,21]):
----> 5     assert fibonacci(ii) == ff
      6 print "Success!"

AssertionError: 
In []:
# This is the test for the factorial code.  
# Run this after you've written your factorial code
# it will fail if your code is wrong!
for ii,ff in zip(range(1,9),[1,2,6,24,120,720,5040,40320]):
    assert factorial(ii) == ff
print "Success!"