This is a series of notebooks (in progress) to document my learning, and hopefully to help others learn machine learning. I would love suggestions / corrections / feedback for these notebooks.
Email me: email.ryan.kelly@gmail.com
I'd love for you to share if you liked this post.
The goal of any curve fitting is to determine the underlying trend in the data in the presence of noise. This tutorial gets us going with the basics.
import scipy as sp
import numpy as np
import pandas as pd
from ggplot import *
# Enable inline plotting
%matplotlib inline
Create some fake data
x = sp.linspace(-5, 5, 100) # 100 values from -5 to 5
# Quadratic function
y = 2*x**2 + 5*x + 0
Load data into pandas data frame
df = pd.DataFrame(zip(x,y), columns = ['x','y'])
Let's plot it quickly to take a look.
ggplot(df, aes("x", "y"))+geom_line()+geom_point()
<ggplot: (279439225)>
Next we will add some error (noise) to make the data more "life-like", and then plot it without the line.
noisy_y = y + np.random.randn(y.size) # Random normal noise
df["noisy_y"] = noisy_y
ggplot(df, aes("x", "noisy_y")) + geom_point()
First we will define an error function, which will guide us in choosing the correct model. The error is simply the squared distance between the model's predictions and the observed data. To fit the models themselves we can utilize SciPy's polyfit() function: we will fit several polynomial shapes (starting with a straight line) and aim to find the order of the polynomial that minimizes this error.
def error(f, x, y):
    return sp.sum((f(x) - y)**2)
Where f will be our fitted model and x / y contain our original data.
The most basic model fit is a straight line. This is actually "linear regression", but it is also a special case of a more general function: a first-order polynomial. First-order means that the highest exponent of x in the equation is 1.
pf1, residuals, rank, sv, rcond = sp.polyfit(x, noisy_y, 1, full =True)
By setting the full parameter to True we get additional diagnostic output from the fitting process, but we only really care about the residuals, which measure the error of the estimation.
print "Model parameters: %s" % pf1
print "Residuals: %s" % residuals
Model parameters: [ 4.88882894 16.42093795]
Residuals: [ 28290.55679858]
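As a quick sanity check, polyfit's reported residual is exactly the sum of squared errors that our error function computes. A minimal sketch, on freshly generated data with an assumed seed (so the numbers differ from the notebook's):

```python
import numpy as np

# Regenerate comparable noisy quadratic data (assumed seed, for reproducibility)
rng = np.random.RandomState(1)
x = np.linspace(-5, 5, 100)
noisy_y = 2 * x**2 + 5 * x + rng.randn(x.size)

# full=True makes polyfit return diagnostics alongside the coefficients
coeffs, residuals, rank, sv, rcond = np.polyfit(x, noisy_y, 1, full=True)
f1 = np.poly1d(coeffs)

# The reported residual equals the sum of squared errors computed directly
sse = np.sum((f1(x) - noisy_y)**2)
print(residuals[0], sse)
```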
Therefore the minimized straight line fits the following function:
print "f(x) = {} * x + {}".format(pf1[0], pf1[1])
f(x) = 4.88882893728 * x + 16.4209379538
Now we use poly1d ( ) to create a model function object from the model parameters derived above.
f1 = sp.poly1d(pf1)
print(error(f1, x, noisy_y))
28290.5567986
Let's take a look at what we've fit.
df["noisy_y"] = noisy_y
df["f1"] = f1(x)
ggplot(df, aes("x", "noisy_y")) + geom_point() + \
    geom_line(aes("x", "f1"), color="red")
This clearly is not the best we can do.
Let's do the same thing, but increase the variance.
noisy_y = y + np.random.randn(y.size)*10
pf1 = sp.polyfit(x, noisy_y, 1)
f1 = sp.poly1d(pf1)
df["noisy_y"] = noisy_y
df["f1"] = f1(x)
ggplot(df, aes("x", "noisy_y")) + geom_point() + \
    geom_line(aes("x", "f1"), color="red") + \
    geom_line(aes("x", "y"), color="blue", size=2)
There are some small bugs with the ggplot python module, which cut off the legends.
The blue line is the actual data behind the noise.
The red line is our current estimate.
We can see that the red line captures some of the trend, but not all of it. Let's try higher-order polynomials.
Recall the error term from our first-order polynomial model:
print error(f1, x, noisy_y)
29972.3791058
On its own this error means little, but by comparing the errors of competing models, we can use it to judge which model is most appropriate.
# Second order Polynomial
pf2 = sp.polyfit(x, noisy_y, 2)
f2 = sp.poly1d(pf2)
print f2
       2
1.909 x + 5.027 x + 1.714
print error(f2, x, noisy_y)
8901.42029545
print "The error was reduced from {} to {}, a difference of {}".format(
error(f1, x, noisy_y), error(f2, x, noisy_y), error(f1, x, noisy_y)-error(f2, x, noisy_y))
The error was reduced from 29972.3791058 to 8901.42029545, a difference of 21070.9588103
df["noisy_y"] = noisy_y
df["f1"] = f1(x)
df["f2"] = f2(x)
ggplot(df, aes("x", "noisy_y")) + geom_point(size=4) + \
    geom_line(aes("x", "f1"), color="red") + \
    geom_line(aes("x", "y"), color="blue", linetype="dashed") + \
    geom_line(aes("x", "f2"), color="green", size=2)
Here we go: we can clearly see that the green line (the second-order polynomial) almost exactly matches the actual underlying trend.
How about we test several polynomial orders at once, to be a bit more efficient.
def poly_test(orders, x, y):
    func_l = []
    error_l = []
    for order in orders:
        pf = sp.polyfit(x, y, order)
        func = sp.poly1d(pf)
        err = error(func, x, y)
        print "Order: {}, Error = {}".format(order, err)
        func_l.append(func)
        error_l.append([order, err])
    error_l = pd.DataFrame(error_l, columns=["Order", "Error"])
    return error_l, func_l
error_tests, poly_tests = poly_test(xrange(1,20), x, noisy_y)
Order: 1, Error = 29972.3791058
Order: 2, Error = 8901.42029545
Order: 3, Error = 8889.43020048
Order: 4, Error = 8827.05812909
Order: 5, Error = 8785.33129955
Order: 6, Error = 8733.05890881
Order: 7, Error = 8498.3744727
Order: 8, Error = 8407.08751693
Order: 9, Error = 8384.44871846
Order: 10, Error = 8074.70017756
Order: 11, Error = 8035.84966396
Order: 12, Error = 7942.28166492
Order: 13, Error = 7698.98413081
Order: 14, Error = 7612.94832123
Order: 15, Error = 7602.49546217
Order: 16, Error = 7567.91985168
Order: 17, Error = 7368.93508242
Order: 18, Error = 7089.18637868
Order: 19, Error = 6979.14519774
Plot the best fit
df["noisy_y"] = noisy_y
df["f19"] = poly_tests[18](x)
ggplot(df, aes("x", "noisy_y")) + geom_point(size=4) + \
    geom_line(aes("x", "y"), color="blue", linetype="dashed") + \
    geom_line(aes("x", "f19"), color="green")
Woah! Hang on... are we actually capturing the underlying process in this data? By using this high-order polynomial (order 19), we are capturing both the underlying process and the noise. This is called overfitting.
A simple way to tell whether we are overfitting is to examine how the error improvement decays as the polynomial order increases. In these data, there isn't really much error reduction after order 2, so that would be our chosen model fit. There are more sophisticated ways of doing this, which will be discussed later.
breaks = xrange(0,21)
ggplot(error_tests, aes(x='Order', y='Error'))+geom_line() + \
scale_x_continuous(breaks = breaks, labels = breaks)
<ggplot: (280480369)>
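The elbow in this plot can also be found programmatically. Here is a minimal sketch (regenerating similar data with an assumed seed, and using relative error improvement per added order as the selection rule, which is one heuristic among many):

```python
import numpy as np

# Recreate the noisy quadratic data from the tutorial (assumed seed)
rng = np.random.RandomState(0)
x = np.linspace(-5, 5, 100)
noisy_y = 2 * x**2 + 5 * x + rng.randn(x.size) * 10

def error(f, x, y):
    return np.sum((f(x) - y)**2)

# Fit polynomials of increasing order and record the error at each step
errors = []
for order in range(1, 10):
    f = np.poly1d(np.polyfit(x, noisy_y, order))
    errors.append(error(f, x, noisy_y))

# Relative improvement gained by each added order; a sharp drop followed
# by a plateau suggests the order where the drop ends is sufficient
improvements = [(errors[i - 1] - errors[i]) / errors[i - 1]
                for i in range(1, len(errors))]
best_order = int(np.argmax(improvements)) + 2  # +2: improvements[0] is order 1 -> 2
print(best_order)
```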
Ok, so say we have chosen a second-order polynomial to fit our model. Let's use our estimate of the underlying trend to predict future data. Generally, more sophisticated statistics would be used here to analyze the variance when predicting further and further into the future or the unknown.
To find the x at which the curve reaches a desired y value, we subtract that value from the polynomial and solve for the root of the result. We use fsolve from SciPy's optimize module. Say we want to know what x will be when y = 150.
print f2
from scipy.optimize import fsolve
converged = fsolve(f2 - 150, 0)
print ("y = 150 is expected when x = {}". format(converged[0]))
       2
1.909 x + 5.027 x + 1.714
y = 150 is expected when x = 7.59426937669
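Since the model is a polynomial, an iterative solver isn't strictly necessary: poly1d objects support arithmetic, so subtracting the target value produces another polynomial whose roots can be read off directly. A small sketch, with the coefficients assumed from the fitted output above:

```python
import numpy as np

# The fitted second-order model (coefficients assumed from the output above)
f2 = np.poly1d([1.909, 5.027, 1.714])

# f2 - 150 is itself a polynomial; its roots are the x values where f2(x) == 150
roots = (f2 - 150).roots

# Keep the positive root, matching the fsolve answer
x_at_150 = roots[roots > 0][0]
print(x_at_150)
```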
We can take advantage of the fact that a power law is linear in log-space.
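To see why: if z = a * t**b, then log(z) = log(a) + b * log(t), which is a straight line in log-log space with slope b and intercept log(a). A quick numeric check of this identity, with a and b assumed for illustration:

```python
import numpy as np

# A noise-free power law (a and b chosen arbitrarily for the check)
a, b = 1.5, 2.0
t = np.linspace(0.1, 5, 1000)
z = a * t**b

# A first-order (straight-line) fit in log-log space recovers the parameters:
# the slope is the exponent b, and exp(intercept) is the coefficient a
slope, intercept = np.polyfit(np.log(t), np.log(z), 1)
print(slope, np.exp(intercept))
```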
Lets make some new data.
t = sp.linspace(0.1,5, 1000)
# Power function
a = 1.5
b = 2
z = a * t**b
noisy_y = z + sp.random.randn(z.size) / 2 # Random normal noise, scaled down
df = pd.DataFrame(zip(t,noisy_y), columns = ["x", "y"])
# Plot data points and actual data as a line
ggplot(df, aes("x", "y"))+geom_point(size = 4)+ \
geom_line(df,aes(x="x", y=z), color = "blue")
<ggplot: (279250721)>
First we need to compute the error for a polynomial fit on the untransformed data.
error_tests, poly_tests = poly_test(xrange(1,10), t, noisy_y)
Order: 1, Error = 7582.21726058
Order: 2, Error = 239.360725199
Order: 3, Error = 239.229365939
Order: 4, Error = 238.469422373
Order: 5, Error = 238.424914846
Order: 6, Error = 238.396983769
Order: 7, Error = 238.3955276
Order: 8, Error = 237.987503728
Order: 9, Error = 237.975969779
Again, a second-order polynomial is most appropriate.
Next, let's try linearizing the data with a natural log.
# Change the variables using a natural log
x = np.log(t)
y = np.log(noisy_y) # Log-transform (undefined for negative values)
df = pd.DataFrame(zip(x, y), columns = ["x", "y"])
ggplot(df, aes("x", "y"))+geom_point(size = 4)+ \
geom_line(aes(x="x", y=np.log(a * t**b)), color = "blue")
<ggplot: (279196401)>
The blue line is our underlying trend, without noise. We can see that power laws are very sensitive to variation and noise. Some data points are not even plotted here because they became NaN after we logged them (the log of a negative number is undefined). We could either remove these values or impute them; to keep things simple, we will just remove them.
# First, make sure we didn't create any NaN values by logging a negative number
print "# of missing values in y: {}".format(len(y[pd.isnull(y)]))
# of missing values in y: 39
To remove these values, we subset the data to those that are not NaN by applying a boolean mask. There are several ways to build one, but an easy option is pandas' isnull(), which checks for all the different kinds of missing values (NA, NaN, NULL).
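Here is a tiny illustration of the masking pattern on made-up values, using ~ (NumPy's logical NOT) to invert the mask:

```python
import numpy as np
import pandas as pd

# A small array with some missing values (made up for illustration)
y = np.array([0.5, np.nan, 1.2, np.nan, 3.0])

bad = pd.isnull(y)   # True wherever y is missing
clean = y[~bad]      # ~ negates the boolean mask, keeping only the good values
print(bad.tolist(), clean.tolist())
```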
bad = pd.isnull(y)
bad[0:5]
array([False, True, False, False, False], dtype=bool)
Once we have the indices where noisy_y is NaN, we use the mask as a boolean index array. We prefix the mask with ~ (logical NOT) because we want everything that is not bad.
clean_y = y[~bad]
clean_x = x[~bad]
Now finally we can fit our first-order polynomial.
power_law_fit = sp.polyfit(clean_x, clean_y, 1)
power_law_func = sp.poly1d(power_law_fit)
# Report the error improvement
print "Error improvement is {}".format(error_tests[1:2]['Error'].values - \
(error(power_law_func, clean_x, clean_y)))
fitted_y = sp.polyval(power_law_fit, clean_x)
Error improvement is [ 38.86933803]
df = pd.DataFrame(zip(clean_x, clean_y), columns = ["x", "y"])
expected_y = np.log(a* t**b)
# Red line is new fit
# Blue line is actual data
ggplot(df, aes("x", "y")) + geom_point(size=4) + \
    geom_line(aes(x="x", y=fitted_y), color="red") + \
    geom_line(aes(x=np.log(t), y=np.log(z)), color="blue")
<ggplot: (279260945)>
By combining the error results and the plot, we can see that this new fit (red line) is slightly more accurate than our regular second-order polynomial fit.
However, to achieve this result, I had to reduce the noise quite significantly.
Let's create some new fake data.
x = np.linspace(0,2*np.pi, 50)
def sinfun(x,a,b):
return a*np.sin(x-b)
y = sinfun(x,1,0)
df = pd.DataFrame(zip(x,y), columns = ['x','y'])
ggplot(df, aes("x", "y")) + geom_line(color="blue") + geom_point()
Let's add some error, just like before.
noisy_y = y + np.random.randn(y.size)
df["noisy_y"] = noisy_y
ggplot(df, aes("x", "noisy_y")) + geom_point() + geom_line(aes("x", "y"), color="blue")
Sometimes it's best to let SciPy handle more complex curves using the curve_fit ( ) function from scipy.optimize. The function returns the optimized fitted parameters and a covariance matrix.
The diagonal of the covariance matrix holds the variances of the parameter estimates; we take their square roots to get standard deviations.
from scipy.optimize import curve_fit
fitpars, covm = curve_fit(sinfun, x, noisy_y)
std_dev = np.sqrt(covm.diagonal())
print fitpars, std_dev
[ 1.49563761 -0.01659767] [ 0.18290973 0.11987486]
df["noisy_y"] = noisy_y
df["fit"] = sinfun(x, *fitpars)
ggplot(df, aes("x", "noisy_y")) + geom_point() + \
    geom_line(aes("x", "y"), color="blue") + \
    geom_line(aes("x", "fit"), color="red")
This is a decent fit, but you can see that it is still catching quite a bit of the noise. Granted, there is a lot of noise present in this graph.
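One way to read the standard deviations is as rough uncertainty bars on the estimates. A sketch on regenerated data with an assumed seed: check whether the true parameters fall within about two standard deviations of the fitted ones.

```python
import numpy as np
from scipy.optimize import curve_fit

def sinfun(x, a, b):
    return a * np.sin(x - b)

# Regenerate comparable noisy data (assumed seed, for reproducibility)
rng = np.random.RandomState(42)
x = np.linspace(0, 2 * np.pi, 50)
true_a, true_b = 1.0, 0.0
noisy_y = sinfun(x, true_a, true_b) + rng.randn(x.size)

fitpars, covm = curve_fit(sinfun, x, noisy_y)
std_devs = np.sqrt(np.diag(covm))

# Rough sanity check: do the true parameters sit within ~2 standard
# deviations of the estimates? (Expected most of the time, not guaranteed.)
within = np.abs(fitpars - np.array([true_a, true_b])) < 2 * std_devs
print(fitpars, std_devs, within)
```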
Visit my webpage for more. Email me: ryan@rmdk.ca