Notebook

Exploratory Computing with Python¶

Developed by Mark Bakker

Statistics Notebook 4: Linear regression and curve fitting¶

In this notebook we will perform linear regression on some of the data in the data set on wooden beams, and we perform do curve fitting on a data set of groundwater head observations

In [2]:

%pylab inline

Populating the interactive namespace from numpy and matplotlib

Linear regression¶

We apply linear regression to fit a straight line through a set of data. The function polyfit fits a polynomial of arbitrary degree through a set of data (polyfit is part of the numpy package). The input arguments are x,y,degree. When the degree of the polynomial is 1, it fits a straight line of the form $y=p[0]*x+p[1]$ and it returns the array of parameters p. The parameters are obtained by polyfit by minimizing the sum of the squares of the errors between the data (the $y$-values) and the fitted polynomial. For example, consider the xdata and ydata below. The slope and $y$-intercept of the best-fit line are computed and both the data and best-fit line are drawn.

In [9]:

xdata = array([0.0,1.0,2.0,3.0,4.0,5.0])  # Observed value of x
ydata = array([1.0,3.0,2.0,5.0,5.0,6.0])  # Observed value of y
a,b = polyfit(xdata,ydata,1)
print 'fitted slope: ',a
print 'fitted y-intercept: ',b
plot(xdata, ydata, 'bo', label='observed')
yfit = a*xdata + b
error = ydata - yfit  # Error
plot(xdata, yfit, 'r', label='fit')
xlabel('xdata')
ylabel('ydata')
legend(loc='best')

fitted slope:  0.971428571429
fitted y-intercept:  1.2380952381

Out[9]:

<matplotlib.legend.Legend at 0x9273190>

Exercise 1. Straight line fit between `Edyn` and `Estat`¶

The data set of experiments on wooden beams contains two measurements of the elasticity modulus. The column labeled Estat contains measurements of the elasticity modulus using a standard static bending experiment. The column labeled Edyn contains measurements of the elasticity modulus using a dynamic mechanical analysis where an oscillatory force is applied. The two experiments don't give exactly the same value. You are asked to determine the linear relationship between the two measurements. Let's first assume that the measurement of Estat is much more accurate than the measurement of Edyn (we will consider the reverse some other time).

Plot the Edyn data on the $y$-axis vs. the Estat data on the $x$-axis using blue markers. Use polyfit to determine the parameters of the best-fit straight line. Add the best-fit straight line as a red line to the graph. Label the axes and add a legend.

In [3]:

from pandas import read_csv
w = read_csv('douglas_data.csv',skiprows=[1],skipinitialspace=True)

In [5]:

a,b = polyfit(w.Estat,w.Edyn,1)
xfit = linspace(w.Estat.min(),w.Estat.max(),100)
yfit = a * xfit + b
plot(w.Estat,w.Edyn,'bo',label='observed')
plot(xfit,yfit,'r',label='best-fit line')
xlabel('Estat')
ylabel('Edyn')
legend(loc='best')

Out[5]:

<matplotlib.legend.Legend at 0x8c6bbf0>

Exercise 2. Error analysis of linear regression of `Edyn` and `Estat`¶

Compute the difference between the fitted and observed values of Edyn; these are called the errors. Compute the mean and standard deviation of the error (if you have done the fit correctly, the mean should be very close to zero). Create a histogram of the errors. Add to the same graph the Normal distribution using the sample mean and sample standard deviation you just computed. On the same graph, add vertical lines for the 2.5 and 97.5 percentiles according to the Normal distribution.

In [6]:

from scipy.stats import norm
error = w.Edyn - (a*w.Estat + b)
mu = mean(error)
sig = std(error)
hist(error,normed=True)
x = linspace(-6000,6000,100)
y = norm.pdf(x,loc=mu,scale=sig)
plot(x,y,'r')
x025 = norm.ppf(0.025,loc=mu,scale=sig)
x975 = norm.ppf(0.975,loc=mu,scale=sig)
axvline(x025,color='k')
axvline(x975,color='k')

Out[6]:

<matplotlib.lines.Line2D at 0x8f18b90>

Count how many data points fall outside the 95% interval according to the corresponding Normal distribution. The data points outside the 95% interval are potential outliers. Recreate the plot you made in Exercise 3, but now plot the data points inside the 95% interval with black circles and the data points outside the 95% interval with red circles (refer to Notebook 4 of quarter 1 if you forgot how to do that).

In [7]:

print 'number of points outside 95 percentile: ',sum(abs(error)>x975)
outside = abs(error)>x975
plot(w.Estat[~outside],w.Edyn[~outside],'ko')
plot(w.Estat[outside],w.Edyn[outside],'ro')

number of points outside 95 percentile:  19

Out[7]:

[<matplotlib.lines.Line2D at 0x9131650>]

Exercise 3. Fit `Estat` vs `Edyn` or the other way around?¶

In the previous two exercises, we assumed that Estat is more accurate than Edyn so we fit: Edyn = a1 * Estat + b1; let's call this line 1. Next, we assume Edyn is more accurate than Estat so we fit Estat = a2 * Edyn + b2; let's call this line 2. Plot the Edyn data on the $y$-axis vs. the Estat data on the $x$-axis using blue markers. Plot the two best-fit lines you computed using red (line 1) and green (line 2), label the axes and add a legend. Report the slope and intercept of the best fit lines as they are shown on the graph (Note: that requires a bit of algebra for line 2 as it needs to be reworked in the form Edyn = slope * Estat + intercept).

In [4]:

a1,b1 = polyfit(w.Estat,w.Edyn,1)
a2,b2 = polyfit(w.Edyn,w.Estat,1)
print 'a1,b1: ',a1,b1
print 'a2,b2: ',1.0/a2,-b2/a2
plot(w.Estat,w.Edyn,'bo',label='observed')
x1 = array([w.Estat.min(),w.Estat.max()])
y1 = a1*x1 + b1
plot(x1,y1,'r',label='line 1')
y2 = 1.0/a2 * x1 - b2/a2
plot(x1,y2,'g',label='line 2')
xlabel('Estat')
ylabel('Edyn')
legend(loc='best')

a1,b1:  0.836674348143 3010.55026869
a2,b2:  1.11078518985 -625.844196038

Out[4]:

<matplotlib.legend.Legend at 0x39c5490>

Exercise 4. Error analysis of `Estat` vs `Edyn` and the other way around.¶

In the previous exercise, two straight lines were fit of the form Edyn = slope * Estat + intercept. Compute and report the mean error and the square root of the mean squared error of both straight lines, where the error is defined as the measured Edyn minus the fitted Edyn. Plot the errors vs. Estat for the two fitted lines with red and green dots, respectively. Does either of the errors show a trend?

In [5]:

error1 = w.Edyn - a1 * w.Estat - b1
error2 = w.Edyn - 1/a2 * w.Estat + b2/a2
print 'mean error line 1: ',mean(error1)
print 'mean error line 2: ',mean(error2)
print 'mean squared error line 1: ',sqrt(mean(error1**2))
print 'mean squared error line 2: ',sqrt(mean(error2**2))
#
plot(w.Estat,error1,'ro')
plot(w.Estat,error2,'go')
title('Line 2 (green dots) shows a trend')

mean error line 1:  8.57377589649e-12
mean error line 2:  7.05113869914e-12
mean squared error line 1:  1536.36115
mean squared error line 2:  1770.23176695

Out[5]:

<matplotlib.text.Text at 0x3bb6370>

Curve fitting¶

To be added

In [ ]:

Exploratory Computing with Python¶

Statistics Notebook 4: Linear regression and curve fitting¶

Linear regression¶

Exercise 1. Straight line fit between Edyn and Estat¶

Exercise 2. Error analysis of linear regression of Edyn and Estat¶

Exercise 3. Fit Estat vs Edyn or the other way around?¶

Exercise 4. Error analysis of Estat vs Edyn and the other way around.¶

Curve fitting¶

Exercise 1. Straight line fit between `Edyn` and `Estat`¶

Exercise 2. Error analysis of linear regression of `Edyn` and `Estat`¶

Exercise 3. Fit `Estat` vs `Edyn` or the other way around?¶

Exercise 4. Error analysis of `Estat` vs `Edyn` and the other way around.¶