In this notebook we will perform linear regression on some of the data in the data set on wooden beams, and we perform do curve fitting on a data set of groundwater head observations
%pylab inline
Populating the interactive namespace from numpy and matplotlib
We apply linear regression to fit a straight line through a set of data. The function polyfit
fits a polynomial of arbitrary degree through a set of data (polyfit
is part of the numpy
package). The input arguments are x,y,degree
. When the degree of the polynomial is 1, it fits a straight line of the form $y=p[0]*x+p[1]$ and it returns the array of parameters p
. The parameters are obtained by polyfit
by minimizing the sum of the squares of the errors between the data (the $y$-values) and the fitted polynomial. For example, consider the xdata
and ydata
below. The slope and $y$-intercept of the best-fit line are computed and both the data and best-fit line are drawn.
xdata = array([0.0,1.0,2.0,3.0,4.0,5.0]) # Observed value of x
ydata = array([1.0,3.0,2.0,5.0,5.0,6.0]) # Observed value of y
a,b = polyfit(xdata,ydata,1)
print 'fitted slope: ',a
print 'fitted y-intercept: ',b
plot(xdata, ydata, 'bo', label='observed')
yfit = a*xdata + b
error = ydata - yfit # Error
plot(xdata, yfit, 'r', label='fit')
xlabel('xdata')
ylabel('ydata')
legend(loc='best')
fitted slope: 0.971428571429 fitted y-intercept: 1.2380952381
<matplotlib.legend.Legend at 0x9273190>
Edyn
and Estat
¶The data set of experiments on wooden beams contains two measurements of the elasticity modulus. The column labeled Estat
contains measurements of the elasticity modulus using a standard static bending experiment. The column labeled Edyn
contains measurements of the elasticity modulus using a dynamic mechanical analysis where an oscillatory force is applied. The two experiments don't give exactly the same value. You are asked to determine the linear relationship between the two measurements. Let's first assume that the measurement of Estat
is much more accurate than the measurement of Edyn
(we will consider the reverse some other time).
Plot the Edyn
data on the $y$-axis vs. the Estat
data on the $x$-axis using blue markers. Use polyfit
to determine the parameters of the best-fit straight line. Add the best-fit straight line as a red line to the graph. Label the axes and add a legend.
from pandas import read_csv
w = read_csv('douglas_data.csv',skiprows=[1],skipinitialspace=True)
a,b = polyfit(w.Estat,w.Edyn,1)
xfit = linspace(w.Estat.min(),w.Estat.max(),100)
yfit = a * xfit + b
plot(w.Estat,w.Edyn,'bo',label='observed')
plot(xfit,yfit,'r',label='best-fit line')
xlabel('Estat')
ylabel('Edyn')
legend(loc='best')
<matplotlib.legend.Legend at 0x8c6bbf0>
Edyn
and Estat
¶Compute the difference between the fitted and observed values of Edyn
; these are called the errors. Compute the mean and standard deviation of the error (if you have done the fit correctly, the mean should be very close to zero). Create a histogram of the errors. Add to the same graph the Normal distribution using the sample mean and sample standard deviation you just computed. On the same graph, add vertical lines for the 2.5 and 97.5 percentiles according to the Normal distribution.
from scipy.stats import norm
error = w.Edyn - (a*w.Estat + b)
mu = mean(error)
sig = std(error)
hist(error,normed=True)
x = linspace(-6000,6000,100)
y = norm.pdf(x,loc=mu,scale=sig)
plot(x,y,'r')
x025 = norm.ppf(0.025,loc=mu,scale=sig)
x975 = norm.ppf(0.975,loc=mu,scale=sig)
axvline(x025,color='k')
axvline(x975,color='k')
<matplotlib.lines.Line2D at 0x8f18b90>
Count how many data points fall outside the 95% interval according to the corresponding Normal distribution. The data points outside the 95% interval are potential outliers. Recreate the plot you made in Exercise 3, but now plot the data points inside the 95% interval with black circles and the data points outside the 95% interval with red circles (refer to Notebook 4 of quarter 1 if you forgot how to do that).
print 'number of points outside 95 percentile: ',sum(abs(error)>x975)
outside = abs(error)>x975
plot(w.Estat[~outside],w.Edyn[~outside],'ko')
plot(w.Estat[outside],w.Edyn[outside],'ro')
number of points outside 95 percentile: 19
[<matplotlib.lines.Line2D at 0x9131650>]
Estat
vs Edyn
or the other way around?¶In the previous two exercises, we assumed that Estat
is more accurate than Edyn
so we fit: Edyn = a1 * Estat + b1
; let's call this line 1. Next, we assume Edyn
is more accurate than Estat
so we fit Estat = a2 * Edyn + b2
; let's call this line 2. Plot the Edyn
data on the $y$-axis vs. the Estat
data on the $x$-axis using blue markers. Plot the two best-fit lines you computed using red (line 1) and green (line 2), label the axes and add a legend. Report the slope and intercept of the best fit lines as they are shown on the graph (Note: that requires a bit of algebra for line 2 as it needs to be reworked in the form Edyn = slope * Estat + intercept
).
a1,b1 = polyfit(w.Estat,w.Edyn,1)
a2,b2 = polyfit(w.Edyn,w.Estat,1)
print 'a1,b1: ',a1,b1
print 'a2,b2: ',1.0/a2,-b2/a2
plot(w.Estat,w.Edyn,'bo',label='observed')
x1 = array([w.Estat.min(),w.Estat.max()])
y1 = a1*x1 + b1
plot(x1,y1,'r',label='line 1')
y2 = 1.0/a2 * x1 - b2/a2
plot(x1,y2,'g',label='line 2')
xlabel('Estat')
ylabel('Edyn')
legend(loc='best')
a1,b1: 0.836674348143 3010.55026869 a2,b2: 1.11078518985 -625.844196038
<matplotlib.legend.Legend at 0x39c5490>
Estat
vs Edyn
and the other way around.¶In the previous exercise, two straight lines were fit of the form Edyn = slope * Estat + intercept
. Compute and report the mean error and the square root of the mean squared error of both straight lines, where the error is defined as the measured Edyn
minus the fitted Edyn
. Plot the errors vs. Estat
for the two fitted lines with red and green dots, respectively. Does either of the errors show a trend?
error1 = w.Edyn - a1 * w.Estat - b1
error2 = w.Edyn - 1/a2 * w.Estat + b2/a2
print 'mean error line 1: ',mean(error1)
print 'mean error line 2: ',mean(error2)
print 'mean squared error line 1: ',sqrt(mean(error1**2))
print 'mean squared error line 2: ',sqrt(mean(error2**2))
#
plot(w.Estat,error1,'ro')
plot(w.Estat,error2,'go')
title('Line 2 (green dots) shows a trend')
mean error line 1: 8.57377589649e-12 mean error line 2: 7.05113869914e-12 mean squared error line 1: 1536.36115 mean squared error line 2: 1770.23176695
<matplotlib.text.Text at 0x3bb6370>
To be added