A tutorial adventure with Python 3, cytoolz, NumPy, and matplotlib
This tutorial was adapted from the Linear Regression tutorial for the Clojure scientific computing platform Incanter. It has been significantly modified, however, to take advantage of Python 3 and its scientific computing ecosystem, in particular cytoolz, NumPy, and matplotlib, all driven from the Hy language.
Okay, to be honest, we don't really use cytoolz in the linear regression exercise, but we really wanted to :-) The opportunity just didn't arise. Seriously, though, cytoolz does some pretty sweet stuff, and provides fast utility functions you may be missing if you're coming (or returning!) to Python after living in the land of functional programming. In particular, it borrows heavily from the Clojure API (which is, quite bluntly, awesome).
From Wikipedia:
Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints. Curve fitting can involve either interpolation, where an exact fit to the data is required, or smoothing, in which a "smooth" function is constructed that approximately fits the data.
A related topic is regression analysis, which focuses more on questions of statistical inference such as how much uncertainty is present in a curve that is fit to data observed with random errors. Fitted curves can be used as an aid for data visualization, to infer values of a function where no data are available, and to summarize the relationships among two or more variables. Extrapolation refers to the use of a fitted curve beyond the range of the observed data, and is subject to a degree of uncertainty since it may reflect the method used to construct the curve as much as it reflects the observed data.
The suite of tools available for scientific computing in Python is pretty stunning. Here's an abridged list of just what this tutorial touches: NumPy, matplotlib, seaborn, cytoolz, and IPython. There are so many more, so much to enjoy.
Hy is a Lisp dialect that converts its structure into Python's abstract syntax tree. It is to Python what LFE is to Erlang. This gives developers from many backgrounds a familiar Lisp with full access to the Python ecosystem.
To support different languages in IPython notebooks, one either needs to create an IPython kernel for the desired language, or create a cell magic command. To the best of our knowledge, there is as yet no Hy IPython kernel; however, yardsale8 has created HyMagic for hylang, and this entire notebook depends upon it. Thanks, @yardsale8!
If you are reading this with the IPython nbviewer and you want to run the examples on your machine, just do the following:
$ git clone https://github.com/oubiwann/linear-regression-tutorial.git
$ cd linear-regression-tutorial
$ make
That will set up a custom IPython profile for the notebook (via a make include file) and then launch the IPython notebook server.
Note that since this tutorial is written in Hy and there is no Hy IPython kernel yet, we use the %%hylang cell magic (provided by the hymagic extension) at the top of each cell, which will look like this:
%%hylang
(* 2 (reduce + (range 1 7) 0))
42
Let's get started with the notebook prep, now.
This notebook is started with a custom IPython profile (defined in a make include file), and that profile has already loaded the hymagic cell magic extension, so we don't need to load it here.
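If you are running outside that profile, you can load the extension yourself using IPython's standard mechanism. This assumes the extension module is named hymagic; adjust to match wherever you installed it:
%load_ext hymagic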
We do need to set up matplotlib for IPython, though:
import matplotlib
matplotlib.use('nbagg')
%matplotlib inline
Next, switching to Hy, we can do all our imports. (Hy's bracketed import forms map straight onto Python's: [functools [reduce]] is from functools import reduce, and [numpy :as np] is import numpy as np.)
%%hylang
(import [functools [reduce]]
[pprint [pprint]]
[cytoolz [itertoolz]]
[numpy :as np]
[matplotlib :as mpl]
[matplotlib.pyplot :as plt]
[seaborn :as sns])
Now we can configure the color map we'll use in our plots:
%%hylang
(def palette_name "husl")
(def colors (sns.color_palette palette_name 8))
(colors.reverse)
(def cmap (mpl.colors.LinearSegmentedColormap.from_list palette_name colors))
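If you'd like to eyeball the palette we just built and reversed, seaborn's palplot helper renders a palette as a row of color swatches:
%%hylang
;; Preview the eight reversed palette colors as swatches
(sns.palplot colors)
(plt.show)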
In the case of this tutorial, the "observed data" is the same data set as that used in the Incanter linear regression tutorial: the NIST Filip.dat file. In the ./notebooks directory we've saved a local copy of Filip.dat as well as a CSV conversion of the data (filip.csv).
Let's read it in:
%%hylang
(def data (apply np.genfromtxt ["filip.csv"] {"dtype" float "delimiter" "," "names" True}))
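Because we passed "names" True, np.genfromtxt built a structured array whose columns are addressable by name. As a quick sanity check (this assumes the CSV header row supplies x and y as the column names):
%%hylang
;; The field names are taken from the CSV header row
data.dtype.names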
Let's set some of that data to variables for use later on:
%%hylang
(def x (get data "x"))
(def y (get data "y"))
(def [x-min x-max] [(x.min) (x.max)])
[-8.7814644949999998, -3.1320024900000001]
Here's what it looks like:
%%hylang
x
array([-6.86012091, -4.32413005, -4.35862506, -4.35842675, -6.95585238, -6.66114525, -6.35546294, -6.11810203, -7.11514802, -6.81530857, -6.51999306, -6.20411998, -5.85387196, -6.10952309, -5.79832982, -5.48267212, -5.17179139, -4.8517059 , -4.51712642, -4.14357323, -3.70907544, -3.49948909, -6.3007695 , -5.95350484, -5.64206515, -5.03137698, -4.6806857 , -4.32984695, -3.9284862 , -8.56735134, -8.36321131, -8.10768274, -7.82390874, -7.52287874, -7.21881928, -6.92081875, -6.62893214, -6.32394687, -5.99139983, -8.78146449, -8.66314018, -8.47353149, -8.24733706, -7.97142875, -7.67612939, -7.3528127 , -7.07206532, -6.77417401, -6.47886192, -6.15951751, -6.83564714, -6.53165267, -6.22409842, -5.91009489, -5.59859946, -5.29064522, -4.97428462, -4.64454848, -4.29056043, -3.88505558, -3.40837896, -3.13200249, -8.72676717, -8.66695597, -8.51102647, -8.16538858, -7.88605665, -7.58804376, -7.28341242, -6.99567863, -6.69186262, -6.39254498, -6.06737406, -6.68402965, -6.37871983, -6.06585519, -5.75227217, -5.13241467, -4.8113527 , -4.09826931, -3.66174277, -3.2644011 ])
%%hylang
y
array([ 0.8116, 0.9072, 0.9052, 0.9039, 0.8053, 0.8377, 0.8667, 0.8809, 0.7975, 0.8162, 0.8515, 0.8766, 0.8885, 0.8859, 0.8959, 0.8913, 0.8959, 0.8971, 0.9021, 0.909 , 0.9139, 0.9199, 0.8692, 0.8872, 0.89 , 0.891 , 0.8977, 0.9035, 0.9078, 0.7675, 0.7705, 0.7713, 0.7736, 0.7775, 0.7841, 0.7971, 0.8329, 0.8641, 0.8804, 0.7668, 0.7633, 0.7678, 0.7697, 0.77 , 0.7749, 0.7796, 0.7897, 0.8131, 0.8498, 0.8741, 0.8061, 0.846 , 0.8751, 0.8856, 0.8919, 0.8934, 0.894 , 0.8957, 0.9047, 0.9129, 0.9209, 0.9219, 0.7739, 0.7681, 0.7665, 0.7703, 0.7702, 0.7761, 0.7809, 0.7961, 0.8253, 0.8602, 0.8809, 0.8301, 0.8664, 0.8834, 0.8898, 0.8964, 0.8963, 0.9074, 0.9119, 0.9228])
Our preparations are complete -- let's get started!
Let's take a look at our data with a quick plot:
%%hylang
(apply plt.scatter [x y])
(plt.show)
That's not the most beautiful graph in the world, so perhaps we can make it a little better. Let's do the following: size the points, enlarge the figure and the label fonts, add a title and axis labels, and color the points using our color map.
Let's start with sizing the points. We'll create an array that's the same size and shape as one of the dimensions from our data set, and then we'll calculate an arbitrary area (π · (20)² ≈ 1257 square points) to display for each point:
%%hylang
(def ones (np.ones_like x))
(def size (* np.pi (** (* 20 ones) 2)))
Let's do some more sizes: one for the figure, and another for the label fonts:
%%hylang
(def figsize {"figsize" [18 18]})
(def fontsize {"fontsize" 24})
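These little dicts exist to be splatted as keyword arguments: in the Hy used here, (apply f [args] kwargs) is the spelling of Python's f(*args, **kwargs). We already used this pattern for np.genfromtxt above, and every plotting call below uses it too. A quick illustration, with np.linspace standing in as a harmless example:
%%hylang
;; Equivalent to np.linspace(0, 1, num=5) in Python
(apply np.linspace [0 1] {"num" 5})
array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])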
We'll use the x-axis values as the basis for color value differences. For the palette, we'll use our previously-defined cmap variable:
%%hylang
(def plot-options {"s" size "c" x "alpha" 0.5 "cmap" cmap})
Let's set up our labels, create our plot, and then display it:
%%hylang
(apply plt.figure [] figsize)
(apply plt.title ["NIST Data Set for Linear Regression Demo"] fontsize)
(apply plt.xlabel ["$x$ values"] fontsize)
(apply plt.ylabel ["$y$ values"] fontsize)
(apply plt.scatter [x y] plot-options)
(plt.show)
Our graph is looking much better. Now let's get mathematical!
The NIST data set provided a polynomial describing this data:
y = B0 + B1*x + B2*(x**2) + ... + B9*(x**9) + B10*(x**10) + e
Or, if you prefer LaTeX:
$y = B_0 + B_1 x + B_2 x^2 + \dots + B_9 x^9 + B_{10} x^{10} + e$
Using NumPy, we can easily fit a 10th-degree polynomial curve to this data. We will use numpy.polyfit for finding a least squares polynomial fit, passing it the x and y values for the data to fit as well as the degree of our polynomial:
%%hylang
(def coeffs (np.polyfit x y 10))
coeffs
array([ -4.02962521e-05, -2.46781075e-03, -6.70191147e-02, -1.06221497e+00, -1.08753179e+01, -7.51242009e+01, -3.54478230e+02, -1.12797393e+03, -2.31637105e+03, -2.77217956e+03, -1.46748960e+03])
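Note that np.polyfit orders the coefficients from the highest degree down to the constant term, so the first entry above is B10 and the last is B0. For example, the intercept:
%%hylang
;; The constant term B0 is the last coefficient returned
(get coeffs -1)
-1467.4895971641686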
np.polyfit can return more data, if you are so inclined; with "full" True it also gives you the sum of squared residuals, the rank and singular values of the scaled Vandermonde coefficient matrix, and the rcond value used:
%%hylang
(def [coeffs residuals rank singular-values rcond]
(apply np.polyfit [x y 10] {"full" true}))
[array([ -4.02962521e-05, -2.46781075e-03, -6.70191147e-02, -1.06221497e+00, -1.08753179e+01, -7.51242009e+01, -3.54478230e+02, -1.12797393e+03, -2.31637105e+03, -2.77217956e+03, -1.46748960e+03]), array([ 0.00079585]), 11, array([ 3.12889471e+00, 1.06454867e+00, 2.71803240e-01, 5.29682154e-02, 8.38710833e-03, 1.01575660e-03, 9.58303055e-05, 7.60511579e-06, 4.64910447e-07, 1.98714214e-08, 6.00922228e-10]), 1.8207657603852567e-14]
There is a convenience class in NumPy, numpy.poly1d, that, once instantiated with our fit coefficients, we can use to evaluate the polynomial at any given point. Let's try it out:
%%hylang
(def model (np.poly1d coeffs))
Let's call this function against several values as a sanity check:
%%hylang
[(model -9) (model -7) (model -6) (model -5) (model -4) (model -3)]
[0.77668860985022548, 0.79905917870519261, 0.88604832190185334, 0.89263439047817883, 0.90943486799233142, 0.88930227733135325]
Looking back at our graph, we can see that these check out fine. (Note that -9 lies just outside our observed x range, whose minimum is about -8.78, so that first value is a mild extrapolation, subject to the caveat from the Wikipedia quote above.)
Next, let's see if our linear model matches up with what NIST provided. We're going to need to calculate the coefficient of determination, or $R^2$, a value that indicates how well a statistical model fits measured data. We'll start by feeding our x values into our model:
%%hylang
(def y-predicted (model x))
We will also need several other values in order to calculate $R^2$, per the equations given on the Wikipedia page linked above: the mean of the observed data, $\bar{y}$; the total sum of squares, $SS_{\rm tot} = \sum_i (y_i - \bar{y})^2$; the regression sum of squares, $SS_{\rm reg} = \sum_i (\hat{y}_i - \bar{y})^2$; and the residual sum of squares, $SS_{\rm res} = \sum_i (y_i - \hat{y}_i)^2$.
We already have the observed values $y$ and the predicted values $\hat{y}$ (our y-predicted). With these, we will be able to calculate $R^2$:
$R^2 \equiv 1 - \frac{SS_{\rm res}}{SS_{\rm tot}}$
%%hylang
(def y-mean (/ (np.sum y) y.size))
(def sstot (np.sum (** (- y y-mean) 2)))
(def ssreg (np.sum (** (- y-predicted y-mean) 2)))
(def ssres (np.sum (** (- y y-predicted) 2)))
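As a sanity check: for an ordinary least-squares fit that includes a constant term, the total sum of squares decomposes as SS_tot = SS_reg + SS_res, and the residuals value np.polyfit reported with "full" True is itself the sum of squared residuals. Both of the following should therefore come back True:
%%hylang
;; SS_tot = SS_reg + SS_res holds for a least-squares fit with an intercept;
;; np.polyfit's residuals entry is the sum of squared residuals
[(np.allclose (+ ssreg ssres) sstot)
 (np.allclose ssres (get residuals 0))]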
Now we're ready to get the $R^2$ value for our model:
%%hylang
(def rsquared (- 1 (/ ssres sstot)))
rsquared
0.99672741617239946
If we compare this to the value from NIST, 0.996727416185620, we see that our model did pretty well:
%%hylang
(- 0.99672741617239946 0.996727416185620)
-1.3220535777236364e-11
That's a tiny number ...
Mathematical!
This is great for an IPython notebook, but kind of awkward for reuse. So let's create a linear model class that gives us everything we need in one shot:
%%hylang
(defclass PolynomialLinearModel []
"A pretty sweet utility Python-Lisp class for creating
a polynomial curve fitting model")
(defun PolynomialLinearModel.--init-- [self x y degree]
(setv self.x x)
(setv self.y y)
(setv self.degree degree)
(setv self.results None)
(setv self.model None)
(setv [self.coeffs self.residuals self.rank self.singular-values self.rcond]
[None None None None None])
(self.polyfit)
None)
(defun PolynomialLinearModel.get-y-mean [self]
"Get the mean value of the observed data"
(/ (np.sum self.y) self.y.size))
(defun PolynomialLinearModel.get-ss-tot [self]
"Get total sum of the squares"
(np.sum (** (- self.y (self.get-y-mean)) 2)))
(defun PolynomialLinearModel.get-ss-reg [self]
"Get the regression sum of squares"
(np.sum (** (- self.y-predicted (self.get-y-mean)) 2)))
(defun PolynomialLinearModel.get-ss-res [self]
"Get the sum of squares of residuals"
(np.sum (** (- self.y self.y-predicted) 2)))
(defun PolynomialLinearModel.get-r-squared [self]
"Get the R^2 value for the polynomial fit"
(- 1 (/ (self.get-ss-res) (self.get-ss-tot))))
(defun PolynomialLinearModel.polyfit [self]
"Do all the business"
(setv [self.coeffs self.residuals self.rank self.singular-values self.rcond]
(apply np.polyfit [self.x self.y self.degree] {"full" true}))
(setv self.model (np.poly1d self.coeffs))
(setv self.y-predicted (self.model self.x))
(setv self.r-squared (self.get-r-squared))
(setv self.results {
"coeffs" (self.coeffs.tolist)
"residuals" (self.residuals.tolist)
"rank" self.rank
"singular-values" (self.singular-values.tolist)
"rcond" self.rcond
"r-squared" self.r-squared}))
(defun PolynomialLinearModel.--str-- [self]
"Provide a string representation of the data"
(str self.results))
(defun PolynomialLinearModel.--repr-- [self]
"Provide a representation of the data"
(self.--str--))
(defun PolynomialLinearModel.predict [self xs]
"Given a set of input values, produce outputs using the model"
(self.model xs))
<function __main__._hy_anon_fn_10>
This approach to adding methods to a class is different from the standard Hy approach (which mirrors the standard Python approach). I chose this form as it more closely matches that used by Common Lisp and, to a certain extent, Clojure. In fact, this led me to ponder the creation of a defmethod macro for Hy (as I'm sure has happened to many who use Hy). Instead of cluttering up this notebook, though, I've moved that over to a sibling notebook. But we digress ...
Let's take it for a spin!
%%hylang
(def model (PolynomialLinearModel x y 10))
When we created our first model, we ran it against several values to see if the outputs fit our measured data. One of those was the value -9, which returned 0.77668860985022548. Let's try that again with our new object:
%%hylang
(model.predict -9)
0.77668860985022548
Looking good!
Let's see what our results data look like:
%%hylang
(pprint model.results)
{'coeffs': [-4.029625205186532e-05, -0.0024678107546401437, -0.06701911469215643, -1.0622149736864719, -10.875317910262362, -75.12420087227511, -354.47822960532113, -1127.973927925715, -2316.371054759451, -2772.1795597594755, -1467.4895971641686], 'r-squared': 0.99672741617239946, 'rank': 11, 'rcond': 1.8207657603852567e-14, 'residuals': [0.0007958513839371895], 'singular-values': [3.128894711145785, 1.064548669029962, 0.27180324022363517, 0.05296821542551952, 0.008387108325776571, 0.0010157565988992792, 9.583030547029836e-05, 7.605115790256685e-06, 4.6491044714423815e-07, 1.9871421381342612e-08, 6.009222284310632e-10]}
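One more check: the r-squared stored on the model should agree with the value we computed by hand earlier, since the class applies the same formula to the same data. The following should evaluate to True:
%%hylang
;; Same formula, same data
(= model.r-squared rsquared)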
We're going to need some data to feed to our model's predict method so that it can produce the smooth polynomial curve that we will overlay on our scatter plot. Let's create a linear space between our minimum and maximum x values (200 points should give us a nice, smooth curve), and then use model.predict to generate the y values:
%%hylang
(def fitted-xs (np.linspace x-min x-max 200))
(def fitted-ys (model.predict fitted-xs))
Now we're ready to put them together:
%%hylang
(def [figure axes] (apply plt.subplots [] figsize))
(plt.hold True)
(apply axes.set_title ["NIST Data Set and Polynomial Fit"] fontsize)
(apply axes.set_xlabel ["$x$ values"] fontsize)
(apply axes.set_ylabel ["$y$ values"] fontsize)
(apply axes.scatter [x y] plot-options)
(axes.plot fitted-xs fitted-ys)
(plt.show)