%pylab inline
rcParams['figure.figsize'] = 14, 7
While calculating CSD, we often have to deal with noisy measurements.
Finding a set of optimal parameters for our model, we have to solve the bias–variance dilemma: the problem of simultaneously minimizing the bias (how far the model's average prediction lies from the true function) and the variance of the model error (how sensitive the model is to small changes in the training set).
The error may be decomposed into:
$Err(x) = \left(E[\hat{f}(x)]-f(x)\right)^2 + E\left[\left(\hat{f}(x)-E[\hat{f}(x)]\right)^2\right] + \sigma_e^2$
$Err(x) = \mathrm{Bias}^2 + \mathrm{Variance} + \mathrm{Irreducible\ Error}$
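To make the decomposition concrete, here is a small illustrative sketch (an addition, not part of the original analysis): we repeatedly simulate noisy training sets from a known "true" function, fit a deliberately flexible polynomial to each (np.polyfit is used here only for brevity), and estimate the bias term from the average fit and the variance term from the spread of the fits around that average.
import numpy as np

def f(x):
    return 0.08*x**2 + 1.0            # a known "true" function (same one as in the example below)

x = np.arange(0, 10, dtype=float)
n_repeats = 200
degree = 7                            # deliberately flexible (high-variance) model

fits = np.empty((n_repeats, x.size))
for i in range(n_repeats):
    y_noisy = f(x) + np.random.random_sample(x.size) - 0.5   # zero-mean noise
    coeffs = np.polyfit(x, y_noisy, degree)
    fits[i] = np.polyval(coeffs, x)

mean_fit = fits.mean(axis=0)
bias_sq = np.mean((mean_fit - f(x))**2)     # (E[f_hat(x)] - f(x))^2, averaged over x
variance = np.mean(fits.var(axis=0))        # E[(f_hat(x) - E[f_hat(x)])^2], averaged over x
print("bias^2   ~ %.4f" % bias_sq)
print("variance ~ %.4f" % variance)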
Suppose we took some noisy measurements of the function $f(x) = 0.08 x^2 + 1.0$ for $x \in \lbrace 0,1,2,3,4,5,6,7,8,9 \rbrace$. The noise will add variance, but not much bias: it is drawn from a zero-mean uniform distribution, so $E[\mathrm{noise}] = 0$.
import numpy as np
X = np.array(range(0,10))
Y = 0.08*X**2 + 1.0
fig, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=True)
ax1.plot(X,Y)
ax1.set_title("model function")
Y2 = Y + np.random.random_sample(X.size) - 0.5   # uniform noise on [-0.5, 0.5)
ax2.scatter(X,Y2)
ax2.plot(X,Y)
ax2.set_title("noisy measurements")
Now let's try to fit some functions.
from scipy.optimize import curve_fit
def poly_7(x, a, b, c, d, e, f, g, h):
    return h*x**7 + g*x**6 + f*x**5 + e*x**4 + d*x**3 + c*x**2 + b*x + a
popt, pcov = curve_fit(poly_7, X, Y2, p0=(10.0, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0))
a, b, c, d, e, f, g, h = popt
residuals = Y2 - poly_7(X, a, b, c, d, e, f, g, h)
fres = np.linalg.norm(residuals)
print "7-degree polynomial error: ", fres
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharex=True, sharey=True)
ax1.scatter(X,Y2)
xlin = np.linspace(0,9, 50)
ax1.plot(xlin, poly_7(xlin,a,b,c,d,e,f,g,h))
ax1.set_title("Overfit - high variance")
def poly_1(x, a, b):
    return a*x + b
popt, pcov = curve_fit(poly_1, X, Y2, p0=(10.0, 0.2))
a, b = popt
residuals = Y2 - poly_1(X, a, b)
fres = np.linalg.norm(residuals)
print "linear approximation error: ", fres
ax2.scatter(X,Y2)
ax2.plot(xlin, poly_1(xlin, a, b))
ax2.set_title("Underfit - high bias")
def poly_2(x, a, b, c):
    return a*x**2 + b*x + c
popt, pcov = curve_fit(poly_2, X, Y2, p0=(10.0, 0.2, 0.0))
a, b, c = popt
residuals = Y2 - poly_2(X, a, b, c)
fres = np.linalg.norm(residuals)
print "error of the correct model: ", fres
ax3.scatter(X,Y2)
ax3.plot(xlin, poly_2(xlin, a,b,c))
ax3.set_title("Correct model")
7-degree polynomial error:  0.161557406365
linear approximation error:  1.48952709068
error of the correct model:  0.568171379783
We can see that the higher the polynomial order, the smaller the fitting error. However, if we added a few more points without recalculating the polynomial coefficients, the incorrect models would fail miserably.
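The sketch below (an added illustration reusing poly_7, poly_2, X and Y2 from the cells above; the evaluation range $x = 10 \ldots 12$ is an arbitrary choice) refits both models and evaluates them just outside the measured range, where the overfitted polynomial diverges quickly from the true curve.
# Illustrative: evaluate the fitted models outside the measured range.
x_new = np.linspace(10, 12, 5)
y_true_new = 0.08*x_new**2 + 1.0

popt7, _ = curve_fit(poly_7, X, Y2, p0=(10.0, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0))
popt2, _ = curve_fit(poly_2, X, Y2, p0=(10.0, 0.2, 0.0))

err7 = np.linalg.norm(poly_7(x_new, *popt7) - y_true_new)
err2 = np.linalg.norm(poly_2(x_new, *popt2) - y_true_new)
print("extrapolation error, 7th-degree model: %.3f" % err7)
print("extrapolation error, 2nd-degree model: %.3f" % err2)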
In order to give preference to some particular solution, we can introduce a regularization parameter $\lambda$.
The least-squares problem then turns into Tikhonov regularization (regularized least squares):
$$\min_{f \in H} \sum_{i = 1}^n\left(f(X_i) - Y_i\right)^2 + \lambda ~ ||{f}||^2_K$$
The regularization parameter $\lambda$ determines how much we distrust our data.
Too low a $\lambda$ will leave the noise in the fit, so the modelled function may not extrapolate well.
Too high a $\lambda$ will effectively ignore the data.
To validate the model we can only use our own measurements, so we split them into a training set and a test set, fit the function on the training set, and check how well it predicts the test samples. This procedure is called cross-validation.
Repeating the procedure for different $\lambda$, we can find which regularization parameter gives the best results (smallest error).
A very high $\lambda$ suggests that our measurements are too noisy or that our model may be invalid. We would expect $\lambda$ to lie somewhere between 0 and 1.
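As an illustration of how this can look in practice, here is a sketch (an addition to the original text; the polynomial design matrix, the degree and the $\lambda$ grid are all assumptions made for this example) that solves the regularized least-squares problem in closed form and picks $\lambda$ by leave-one-out cross-validation.
# Illustrative sketch: Tikhonov-regularized (ridge) polynomial fit,
# with lambda chosen by leave-one-out cross validation.
import numpy as np

def ridge_fit(x, y, degree, lam):
    # closed-form regularized least squares: w = (A^T A + lam*I)^-1 A^T y
    A = np.vander(x, degree + 1)
    return np.linalg.solve(A.T.dot(A) + lam*np.identity(degree + 1), A.T.dot(y))

def ridge_predict(x, w, degree):
    return np.vander(x, degree + 1).dot(w)

x = np.arange(0, 10, dtype=float)
y = 0.08*x**2 + 1.0 + np.random.random_sample(x.size) - 0.5

degree = 5                                   # intentionally too flexible a model
for lam in [0.0, 0.001, 0.01, 0.1, 1.0]:
    cv_error = 0.0
    for i in range(x.size):                  # leave-one-out split
        train = np.arange(x.size) != i
        w = ridge_fit(x[train], y[train], degree, lam)
        cv_error += abs(ridge_predict(x[i:i+1], w, degree)[0] - y[i])
    print("lambda = %-6g  LOO error = %.3f" % (lam, cv_error))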
In scikit-learn we have some handy procedures for cross-validation. For example, there are index generators for splitting strategies such as Leave-One-Out, Leave-P-Out, K-Fold, and ShuffleSplit.
Some examples:
from sklearn.cross_validation import LeaveOneOut
#for example
#X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
Y = np.array([0, 1, 0, 1])
loo = LeaveOneOut(len(Y), indices=True)
print(loo)
print "train test"
for train_ind, test_ind in loo:
    # do something with the indices and data
    print("%s %s" % (train_ind, test_ind))
sklearn.cross_validation.LeaveOneOut(n=4)
train test
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]
Leave-P-Out divides the samples into a test subset of size $p$ and a training set of size $n-p$.
from sklearn.cross_validation import LeavePOut
lpo = LeavePOut(n=len(Y), p=2, indices=True)
print(lpo)
print "train test"
for train_ind, test_ind in lpo:
    # do something with the indices and data
    print("%s %s" % (train_ind, test_ind))
sklearn.cross_validation.LeavePOut(n=4, p=2)
train test
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]
An advantage of K-Fold cross-validation is that it reduces the number of iterations required to complete the task: the model is refitted only $K$ times, instead of once per sample (or once per subset, as with Leave-P-Out).
from sklearn.cross_validation import KFold
#for example:
#X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
Y = np.array([0, 1, 0, 1])
kf = KFold(len(Y), n_folds=2, indices=True)
print(kf)
print "train test"
for train_ind, test_ind in kf:
    print("%s %s" % (train_ind, test_ind))
sklearn.cross_validation.KFold(n=4, n_folds=2)
train test
[2 3] [0 1]
[0 1] [2 3]
from sklearn.cross_validation import ShuffleSplit
Y = np.array([0, 1, 0, 1])
ss = ShuffleSplit(len(Y), n_iter=2, test_size = 0.5, indices=True)
print(ss)
print "train test"
for train_ind, test_ind in ss:
    print("%s %s" % (train_ind, test_ind))
ShuffleSplit(4, n_iter=2, test_size=0.5, indices=True, random_state=None)
train test
[3 2] [1 0]
[0 1] [2 3]
And more.
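There is also, for instance, StratifiedKFold, which keeps the class proportions roughly equal in every fold. The snippet below assumes the same old sklearn.cross_validation interface (including the indices flag) used in the examples above.
from sklearn.cross_validation import StratifiedKFold
Y = np.array([0, 1, 0, 1])
# stratified folds: each fold contains roughly the same proportion of 0s and 1s
skf = StratifiedKFold(Y, n_folds=2, indices=True)
print(skf)
print "train test"
for train_ind, test_ind in skf:
    print("%s %s" % (train_ind, test_ind))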
We could recalculate our model with different $\lambda$ to find out which value works best. Here, however, we use the previous functions to check how the models extrapolate: this is pure cross-validation and curve fitting, with no regularization parameter introduced in this example.
def error_7(x_train, y_train, x_test, y_test):
    popt, pcov = curve_fit(poly_7, x_train, y_train, p0=(10.0, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0))
    a, b, c, d, e, f, g, h = popt
    y_estimated = poly_7(x_test, a, b, c, d, e, f, g, h)
    err = np.linalg.norm(y_test - y_estimated)
    return err
X = np.array(range(0,10))
Y = 0.04*X**2 + 1.0
Y2 = Y + np.random.random_sample(X.size) - 0.5
loo = LeaveOneOut(len(Y), indices=True)
error = 0.0
for ind_train, ind_test in loo:
    error += error_7(X[ind_train], Y2[ind_train], X[ind_test], Y2[ind_test])
print "Total error:", error
Total error: 27.2241098604
Now we can see how badly this model extrapolates.
def error_1(x_train, y_train, x_test, y_test):
    popt, pcov = curve_fit(poly_1, x_train, y_train, p0=(10.0, 0.2))
    a, b = popt
    y_estimated = poly_1(x_test, a, b)
    err = np.linalg.norm(y_test - y_estimated)
    return err
error = 0.0
for ind_train, ind_test in loo:
    error += error_1(X[ind_train], Y2[ind_train], X[ind_test], Y2[ind_test])
print "Total error:", error
Total error: 3.68014736226
Now this looks better.
def error_2(x_train, y_train, x_test, y_test):
    popt, pcov = curve_fit(poly_2, x_train, y_train, p0=(10.0, 0.2, 0.0))
    a, b, c = popt
    y_estimated = poly_2(x_test, a, b, c)
    err = np.linalg.norm(y_test - y_estimated)
    return err
error = 0.0
for ind_train, ind_test in loo:
    error += error_2(X[ind_train], Y2[ind_train], X[ind_test], Y2[ind_test])
print "Total error:", error
Total error: 1.90230633991
Here we can see that the 2nd-degree polynomial extrapolates best.
This is how we keep the error caused by excessive variance under control.
Let's now solve a linear equation $Ax = b$ and regularize the solution with a parameter $\lambda$:
$x = (A+\lambda I)^{-1} b$
A = np.array([
[1.0, 0.5, 0.2, 0.0, 0.0],
[0.5, 1.0, 0.5, 0.2, 0.0],
[0.2, 0.5, 1.0, 0.5, 0.2],
[0.0, 0.2, 0.5, 1.0, 0.5],
[0.0, 0.0, 0.2, 0.5, 1.0],
])
b = np.array([[1],[2],[3],[4],[5]])
fig, (ax1, ax2) = plt.subplots(1,2)
ax1.imshow(A, interpolation="none")
ax1.set_title("A")
ax2.imshow(b, interpolation="none")
ax2.set_title("b")
fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4)
# unregularized solution
x = np.dot(np.linalg.inv(A), b)
ax1.imshow(x, interpolation="none")
ax1.set_title("x for lambda=0.0")
ax4.plot(x)
# mild regularization
lambd = 1.0
x = np.dot(np.linalg.inv(A + lambd*np.identity(A.shape[0])), b)
ax2.imshow(x, interpolation="none")
ax2.set_title("x for lambda=1.0")
ax4.plot(x)
# strong regularization
lambd = 20.0
x = np.dot(np.linalg.inv(A + lambd*np.identity(A.shape[0])), b)
ax3.imshow(x, interpolation="none")
ax3.set_title("x for lambda=20.0")
ax4.plot(x)
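To put numbers on what the three panels show, the sketch below (an added illustration, reusing A and b from above; the $\lambda$ grid is arbitrary) tracks how the residual norm $||Ax - b||$ grows and the solution norm $||x||$ shrinks as $\lambda$ increases: too small a $\lambda$ reproduces b, noise and all, while too large a $\lambda$ shrinks x towards zero and effectively ignores the data.
# Illustrative: the trade-off controlled by lambda, reusing A and b from above.
for lambd in [0.0, 0.1, 1.0, 5.0, 20.0]:
    x = np.dot(np.linalg.inv(A + lambd*np.identity(A.shape[0])), b)
    residual = np.linalg.norm(np.dot(A, x) - b)
    print("lambda = %-4g  ||Ax - b|| = %6.3f  ||x|| = %6.3f" % (lambd, residual, np.linalg.norm(x)))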