linfit.py
¶This notebook demonstrates the function linfit
, which I propose adding to the SciPy library.
linfit
is designed to be a fast, lightweight function, written entirely in Python, that only calculates only as much as the user desires, and no more. It can handle arbitrarily large data sets.
linfit
and what does it do?¶linfit
is a function that performs least squares fitting of a straight line to an (x,y) data set. It has the following features:
linfit
can fit a straight line ax+b to unweighted (x,y) data where the output is the slope a and y-intercept b that minimizes the square of the residuals E=n∑i=1[yi−(axi+b)]2
Optional outputs:
Alternatively, linfit
can fit a straight line ax+b to (x,y) data that is weighted using estimates of the uncertainties of the data provided as an optional keyword argument. The uncertanties can be expressed as either as (1) a single number σ or (2) as an array of uncertainties σi. In the first case, the output is the slope a and y-intercept b that minimizes χ2, which is defined as χ2=1σ2n∑i=0[yi−(axi+b)]2
Optional outputs:
χ2/(n−2), the reduced value of χ2, which should be approximately equal to 1 if a straight line is a good model of the data and good error estimates σi are provided.
Δa and Δb, the uncertainties Δa and Δb in the fitted values of a and b. By default, these estimates of Δa and Δb use the residuals |yi−(axi+b)| as estimates of the uncertainties σi of the data.
yi−(axi+b), the residuals.
linfit
can also be used for nonlinear fitting for functions that can be transformed to be linear in the fitting parameters. In this case, using weighting is essential to get the fit right.
linfit.py
¶Import libraries we will need.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec # for unequal plot boxes
from linfit import linfit
Perform a simple linear fit without any weighting.
x = np.array([0, 1, 2, 3])
y = np.array([-1, 0.2, 0.9, 2.1])
fit, cvm = linfit(x, y)
print("slope = {0:0.2f}, y-intercept = {1:0.2f}".format(fit[0], fit[1]))
slope = 1.00, y-intercept = -0.95
Plot the above data and fit.
xfit = np.array([-0.2, 3.2])
yfit = fit[0]*xfit + fit[1]
plt.plot(x, y, 'oC3', label="data")
plt.plot(xfit, yfit, zorder=-1, label="ax+b")
plt.text(-0.3, 1.1, "a={0:0.2f}\nb={1:0.2f}"
.format(fit[0], fit[1]), fontsize=12)
plt.legend(loc="upper left")
plt.xlabel('x')
plt.show()
Perform same fit without weighting but get estimates for uncertainties in slope and y-intercept from covariance matrix.
fit, cvm = linfit(x, y, relsigma=True)
dfit = [np.sqrt(cvm[i,i]) for i in range(2)]
print(u"slope = {0:0.2f} \xb1 {1:0.2f}".format(fit[0], dfit[0]))
print(u"y-intercept = {0:0.2f} \xb1 {1:0.2f}".format(fit[1], dfit[1]))
slope = 1.00 ± 0.07 y-intercept = -0.95 ± 0.13
# data set for linear fitting
x = np.array([2.3, 4.7, 7.1, 9.6, 11.7, 14.1, 16.4, 18.8, 21.1, 23.0])
y = np.array([-25., 3., 110., 110., 230., 300., 270., 320., 450., 400.])
sigmay = np.array([15., 30., 30., 40., 40., 50., 40., 30., 50., 30.])
# Fit linear data set with weighting
fit, cvm, info = linfit(x, y, sigmay=sigmay, relsigma=False, return_all=True)
dfit = [np.sqrt(cvm[i,i]) for i in range(2)]
print(u"slope = {0:0.1f} \xb1 {1:0.1f}".format(fit[0], dfit[0]))
print(u"y-intercept = {0:0.0f} \xb1 {1:0.0f}".format(fit[1], dfit[1]))
slope = 21.7 ± 1.2 y-intercept = -72 ± 15
Plot the data with error bars together with the fit. Plot the residuals in a separate graph above the data with fit.
# Open figure window for plotting data with linear fit
fig = plt.figure(1, figsize=(8, 8))
gs = gridspec.GridSpec(2, 1, height_ratios=[2.5, 6])
# Bottom plot: data and fit
ax1 = fig.add_subplot(gs[1])
# Plot data with error bars
ax1.errorbar(x, y, yerr=sigmay, ecolor='k', mec='k', fmt='oC3', ms=6)
# Plot fit (behind data)
endx = 0.05 * (x.max() - x.min())
xFit = np.array([x.min() - endx, x.max() + endx])
yFit = fit[0] * xFit + fit[1]
ax1.plot(xFit, yFit, '-', zorder=-1)
# Print out results of fit on plot
ax1.text(0.05, 0.9, # slope of fit
u'slope = {0:0.1f} \xb1 {1:0.1f}'.format(fit[0], dfit[0]),
ha='left', va='center', transform=ax1.transAxes)
ax1.text(0.05, 0.83, # y-intercept of fit
u'y-intercept = {0:0.1f} \xb1 {1:0.1f}'.format(fit[1], dfit[1]),
ha='left', va='center', transform=ax1.transAxes)
ax1.text(0.05, 0.76, # reduced chi-squared of fit
'redchisq = {0:0.2f}'.format(info.rchisq),
ha='left', va='center', transform=ax1.transAxes)
ax1.text(0.05, 0.69, # correlation coefficient of fitted slope & y-intercept
'rcov = {0:0.2f}'.format(cvm[0, 1] / (dfit[0] * dfit[1])),
ha='left', va='center', transform=ax1.transAxes)
# Label axes
ax1.set_xlabel('time')
ax1.set_ylabel('velocity')
# Top plot: residuals
ax2 = fig.add_subplot(gs[0])
ax2.axhline(color='gray', lw=0.5, zorder=-1)
ax2.errorbar(x, info.resids, yerr=sigmay, ecolor='k', mec='k', fmt='oC3', ms=6)
ax2.set_ylabel('residuals')
ax2.set_ylim(-100, 150)
ax2.set_yticks((-100, 0, 100))
plt.show()
Fit the same (x,y) data set but with a single value of σ for the entire data set.
sigmay0 = 34.9
fit, cvm, info = linfit(x, y, sigmay=sigmay0, relsigma=False, return_all=True)
dfit = [np.sqrt(cvm[i,i]) for i in range(2)]
print(u"slope = {0:0.1f} \xb1 {1:0.1f}".format(fit[0], dfit[0]))
print(u"y-intercept = {0:0.0f} \xb1 {1:0.0f}".format(fit[1], dfit[1]))
print(u"redchisq = {0:0.2f}".format(info.rchisq))
slope = 22.4 ± 1.7 y-intercept = -72 ± 24 redchisq = 1.34
Create data set with 100000 data points.
def randomData(xmax, npts):
x = np.random.uniform(-xmax, xmax, npts)
scale = np.sqrt(xmax)
a, b = scale * (np.random.rand(2)-0.5)
y = a*x + b + a * scale * np.random.randn(npts)
dy = a * scale * (1.0 + np.random.rand(npts))
return x, y, dy
npts = 100000
x, y, dy = randomData(100., npts)
Fit straight line to the data.
fit, cvm = linfit(x, y)
slope, yint = fit
Plot the data together with the fit.
xm = 0.05*(x.max()-x.min())
xfit = np.array([x.min()-xm, x.max()+xm])
yfit = slope*xfit + yint
plt.plot(xfit, yfit, '-k')
plt.plot(x, y, ".C3", ms=1, zorder=-1)
[<matplotlib.lines.Line2D at 0x7f8173057410>]
Compare execution times of linfit and polyfit to fit a randomly generated data set of 10000 (x,y) data points. On my computers, linfit is about 6 times faster than polyfit for unweighted data and about 3 times faster for weighted data.
import timeit
setup = '''
from linfit import linfit
import numpy as np
def randomData(xmax, npts):
x = np.random.uniform(-xmax, xmax, npts)
scale = np.sqrt(xmax)
a, b = scale * (np.random.rand(2)-0.5)
y = a*x + b + a * scale * np.random.randn(npts)
dy = a * scale * (1.0 + np.random.rand(npts))
return x, y, dy
npts = 100000
x, y, dy = randomData(100., npts)
'''
nreps = 7
nruns = 100
linfitNOwt = min(timeit.Timer('fit, cvm = linfit(x, y)', setup=setup).repeat(nreps, nruns))
polyfitNOwt = min(timeit.Timer('slope, yint = np.polyfit(x, y, 1)', setup=setup).repeat(nreps, nruns))
print("TIME COMPARISON WITH NO WEIGHTING OF DATA")
print(" linfit time = {}\n polyfit time = {}\n ratio = {}"
.format(linfitNOwt, polyfitNOwt, polyfitNOwt/linfitNOwt))
linfitWT = min(timeit.Timer('slope, yint = linfit(x, y, sigmay=dy)', setup=setup).repeat(nreps, nruns))
polyfitWT = min(timeit.Timer('slope, yint = np.polyfit(x, y, 1, w=dy)', setup=setup).repeat(nreps, nruns))
print("TIME COMPARISON WITH WEIGHTING OF DATA")
print(" linfit time = {}\n polyfit time = {}\n ratio = {}"
.format(linfitWT, polyfitWT, polyfitWT/linfitWT))
TIME COMPARISON WITH NO WEIGHTING OF DATA linfit time = 0.04812828300000005 polyfit time = 0.3567316079999996 ratio = 7.412099201627434 TIME COMPARISON WITH WEIGHTING OF DATA linfit time = 0.08441347999999937 polyfit time = 0.4123130230000003 ratio = 4.884445268694091
Linear fitting with weighting can be used to fit functions that are nonlinear in the fitting parameters, provided the fitting function can be transformed into one that is linear in the fitting paramters. This can be done for exponential functions and power-law functions. This approach is illusutrated in the next two examples.
Nuclear decay provides a convenient example of an exponential fitting function: N(t)=N0e−t/τ.
Here are the N vs t data together with the uncertainties ΔN.
t = np.array([0., 32.8, 65.6, 98.4, 131.2, 164., 196.8, 229.6, 262.4,
295.2, 328., 360.8, 393.6, 426.4, 459.2, 492.])
N = np.array([5.08, 3.29, 2.23, 1.48, 1.11, 0.644, 0.476, 0.273, 0.188,
0.141, 0.0942, 0.0768, 0.0322, 0.0322, 0.0198, 0.0198])
dN = np.array([0.11, 0.09, 0.07, 0.06, 0.05, 0.04, 0.03, 0.03,
0.02, 0.02, 0.015, 0.014, 0.009, 0.009, 0.007, 0.007])
Linear and semi-log plots of the data with error bars:
fig = plt.figure(1, figsize=(10, 3.5))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
ax2.set_yscale("log")
for ax in [ax1, ax2]:
ax.errorbar(t, N, yerr=dN, xerr=None, fmt='oC3', ecolor='k', ms=3)
ax.set_xlim(-10, 500)
ax.set_xlabel('t')
ax.set_ylabel('N')
The semi-log plot suggests we can use linfit to fit the data by taking the logarithm of the y data. Taking the logarithm of the exponential fitting function gives lnN=−tτ+lnN0.
The uncertainties Δy are related to ΔN by taking the differential of the tranformation y=lnN: Δy=(∂y∂N)ΔN=ΔNN
y = np.log(N)
dy = dN/N
Next we perform the fit on the tranformed data:
fit, cvm, info = linfit(t, y, sigmay=dy, relsigma=False, return_all=True)
Extract τ and N0 from fit of transformed data
a, b = fit[0], fit[1]
tau = -1.0/a
N0 = np.exp(b)
Extract the uncertainties in the fitting parameters a and b.
dfit = [np.sqrt(cvm[i,i]) for i in range(2)]
da, db = dfit[0], dfit[1]
Get the uncertainties in τ and N0 from the transformation equations: τ=−1/a⇒Δτ=|∂τ∂a|Δa⇒Δτ=Δaa2N0=eb⇒ΔN0=|∂N0∂b|Δb⇒ΔN0=ebΔb
dtau = da/(a*a)
dN0 = np.exp(b)*db
Plot the data and fit on linear and semi-log plots
tm = 0.05*(t.max()-t.min())
tfit = np.linspace(t.min()-tm, t.max()+tm, 50)
Nfit = N0*np.exp(-tfit/tau)
fig = plt.figure(1, figsize=(10, 3.5))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
ax2.set_yscale("log")
ax2.set_ylim(0.01, 10.)
for ax in [ax1, ax2]:
ax.errorbar(t, N, yerr=dN, xerr=None, fmt='oC3', ecolor='k', ms=4)
ax.plot(tfit, Nfit, '-', color="gray", zorder=-1)
ax.set_xlim(-10, 500)
ax.set_xlabel('t')
ax.set_ylabel('N')
ax.text(0.95, 0.95,"$\\tau = {0:0.1f}\pm{1:0.1f}$\n$N_0 = {2:0.2f}\pm{3:0.2f}$"
.format(tau, dtau, N0, dN0), fontsize=12,
ha='right', va='top', transform=ax.transAxes)
A similar procedure can be used to fit data to a power-law y=Axp with A and p as the fitting parameters.
Currently, there is no function available in numpy or scipy expressly designed to perform chi-squared least square fitting of a straight line to a single data set with weighting (error bars). There are two functions available in numpy and scipy that can be adapted to fit single straight lines to data with weighting: numpy.polyfit
and scipy.linalg.lstsq
. Both of these have drawbacks if the desired task is to fit a straight line to a single data set. In addition to these two functions, scipy.stats.linregress
fits a straight line but without any provision for weighting.
numpy.polyfit and scipy.linalg.lstsq are both slower than linfit
: numpy.polyfit
is generally a few times slower; scipy.linalg.lstsq is also a few times slower but can be several hundreds of times slower depending on the weighting used and the size of the data set. This is because, in part, both numpy.polyfit
and scipy.linalg.lstsq
involve matrix inversion while linfit does not. More generally, both numpy.polyfit
and scipy.linalg.lstsq
have significant overhead associated with the more general fitting problems they are designed to address.
In its current configuration, numpy.polyfit
does not allow absolute weighting of the data; only relative weighting is implemented. linfit
allows either relative (the default) or absolute weighting (by setting the keyword argument relsigma=False)