This presentation is an IPython Notebook, which you can get at github.com/luispedro/PenalizedRegression or view using the NBViewer.
To run this, you need NumPy, matplotlib, scikit-learn, and IPython.
We will import `numpy` using the `np` abbreviation and `matplotlib.pyplot` using the `plt` abbreviation:
import numpy as np
from matplotlib import pyplot as plt
from IPython.display import display
We also need to perform some magic to get plots inline:
%matplotlib inline
Regression can be used generically to mean "any type of learning from data."
More commonly, though, it is used to mean learning to predict a *numeric* output from input variables, as opposed to *classification*, which learns to predict a categorical output.
The goal is to predict house prices in Boston based on variables such as the crime rate or the number of rooms.
The `boston` dataset comes built in with scikit-learn:
from sklearn.datasets import load_boston
boston = load_boston()
You can use `print(boston.DESCR)` to see more information on the dataset.
For all our analyses, it will be important to have split training & testing data.
Scikit-learn makes this easy:
from sklearn.cross_validation import train_test_split
train_data, test_data, train_target, test_target = \
        train_test_split(boston.data, boston.target, train_size=.8)
`train_data` is our training data, with the corresponding target variable in `train_target`; `test_data` and `test_target` are the equivalent testing variables.
It is always a good idea to peek around and get a feeling for the data.
Extremely important to look out for anomalies (real data is rarely clean).
print(train_data.shape)
(404, 13)
_=plt.hist(train_target, bins=100)
We can look at specific elements in the input data, such as the number of rooms:
RoomNr_Index = 5  # column 5 of boston.data is RM, the average number of rooms
fig,ax = plt.subplots()
ax.scatter(train_data[:,RoomNr_Index], train_target)
ax.set_xlabel("Number of rooms")
ax.set_ylabel("House price")
pass
from IPython.html.widgets import interact
from scipy import stats
We will use the same plotting code, just generalized to take an `index` argument and wrapped with `interact`:
@interact(index=(0, train_data.shape[1] - 1))
def plot_scatter(index):
    fig,ax = plt.subplots()
    x, y = train_data[:,index], train_target
    ax.scatter(x, y)
    ax.set_xlabel(boston.feature_names[index])
    ax.set_ylabel("House price")
    print("Correlation: {0[0]:.1} (p-value: {0[1]:.1})".format(stats.pearsonr(x, y)))
    return fig
Correlation: -0.4 (p-value: 6e-16)
The `interact` features are the killer app for the IPython notebook.
We have now poked around the data; can we build a predictive model?
Instead of a single input variable, we can have multiple inputs, $x_1$, $x_2$, ... $x_n$:
$$y = \sum_j \beta_j x_j + c + \epsilon$$

We can write the same using vector notation as:

$$y = {\boldsymbol \beta}^T {\bf x} + c + \epsilon$$

We can measure the error as the sum of squared errors:

$$ E = \sum_i \epsilon^2_i $$

The criterion is thus

$$ \beta^{*} = \arg\min_{\boldsymbol \beta} \sum_i ({\boldsymbol \beta}^T {\bf x}_i + c - y_i)^2 $$

which we can write in short form as:

$$ \beta^{*} = \arg\min_{\boldsymbol \beta} \sum_i \epsilon^2_i $$

Pretty obvious, but it'll come in handy later too:
E = np.linspace(0, 2, 100)
squared_error_figure, ax = plt.subplots(figsize=(6,6))
ax.plot(E, E**2)
ax.plot([0,1], [1,1], 'k-')
ax.plot([1,1], [0,1], 'k-')
ax.set_xlim(0,4.)
ax.set_xlabel('$E$')
ax.set_ylabel('$E^2$')
pass
display(squared_error_figure)
This method of least squares was first developed by Gauss, who was officially an astronomer, to estimate the motion of celestial bodies. He also proved that least squares is optimal under Gaussian noise.
Today, we would call this the *criterion* of least squares, as "method" seems to imply that we have an algorithm to solve the problem.
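For ordinary least squares, the algorithm part is in fact easy: there is a closed-form solution (the normal equations). A minimal sketch with numpy's `lstsq`, fitting the room counts (the column of ones provides the intercept $c$; below we will use scikit-learn instead):

# Closed-form least squares via numpy (a sketch, for illustration only).
# The column of ones lets lstsq fit the intercept as a second coefficient.
X = np.column_stack([train_data[:, RoomNr_Index], np.ones(len(train_data))])
slope, intercept = np.linalg.lstsq(X, train_target)[0]
print("slope: {:.2f}, intercept: {:.2f}".format(slope, intercept))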
from sklearn import linear_model
linreg = linear_model.LinearRegression()
_=linreg.fit(train_data[:,RoomNr_Index:RoomNr_Index+1], train_target)
fig,ax = plt.subplots()
ax.scatter(train_data[:,RoomNr_Index], train_target)
ax.plot([4, 9], linreg.predict([[4],[9]]), 'k-')
ax.set_xlabel("Number of rooms")
_=ax.set_ylabel("House price")
We can use the same `linreg.fit` method:
linreg.fit(train_data, train_target)
prediction = linreg.predict(test_data)
We cannot easily see the fit in high dimensions, but we can plot the measured values against the predictions:
linreg.fit(train_data, train_target)
fig,ax = plt.subplots()
ax.scatter(test_target, linreg.predict(test_data))
ax.plot([0,50], [0,50], 'k-')
ax.set_xlabel('Target (test)')
ax.set_ylabel('Predicted (test)')
pass
from sklearn import metrics
mse = metrics.mean_squared_error(test_target, linreg.predict(test_data))
print("MSE is {}".format(mse))
MSE is 22.1886932331
Problem: it is not trivial to interpret the value obtained (its units are the square of the target's units).
rmse = np.sqrt(mse)
print("RMSE is {}".format(rmse))
RMSE is 4.71048757912
There is another value that is often very useful: the coefficient of determination.
Idea: compare the MSE with a baseline.
Consider the null predictor, which ignores the input and always returns a constant:
def predict(features):
    return constant  # some fixed value, chosen beforehand
The best constant is the mean of the training targets:

mean_of_training_target = train_target.mean()

def predict(features):
    return mean_of_training_target
Now, compare the error of your predictor (with predictions $\hat{y}_i$) with the error of this null predictor:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $\bar{y}$ is the average value of $y$, i.e.:

$$\bar{y} = \frac{\sum_i y_i}{N}$$

cod = metrics.r2_score(test_target, linreg.predict(test_data))
print("COD is {}".format(cod))
COD is 0.732540216176
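As a sanity check, here is a minimal sketch computing the coefficient of determination by hand (note that `r2_score` uses the mean of the targets it is given, here the test targets):

pred = linreg.predict(test_data)
ss_model = np.sum((test_target - pred) ** 2)               # error of our predictor
ss_null = np.sum((test_target - test_target.mean()) ** 2)  # error of the null predictor
print("COD by hand: {}".format(1.0 - ss_model / ss_null))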
This measure is also called $R^2$, which is confusing because it is not the square of anything.
It is also called $Q^2$ sometimes, which is also confusing.
This measure is the default measure for regression in scikit-learn, and we can just use the `score` method:
print(linreg.score(test_data, test_target))
0.732540216176
Naturally, the results on the training data are better than those obtained on the testing data:
linreg.fit(train_data, train_target)
r2_train = metrics.r2_score(train_target, linreg.predict(train_data))
r2_test = metrics.r2_score(test_target, linreg.predict(test_data))
print("R2 on training: {:.1%}".format(r2_train))
print("R2 on testing: {:.1%}".format(r2_test))
R2 on training: 73.7%
R2 on testing: 73.3%
The criterion for least squared regression was:

$$\beta^{*} = \arg\min_{\boldsymbol \beta} \sum_i \epsilon^2_i$$

That is, we minimize the fit on the training data.
The general expression for penalized regression is

$$\beta^{*} = \arg\min_{\boldsymbol \beta} \frac{1}{2N} \sum_i \epsilon^2_i + \alpha R({\boldsymbol \beta}),$$

where $R$ is the penalty (regularization) term and $\alpha$ is a positive weight.
That is, we minimize the sum of the fit plus a regularization term.
This is an instance of the bias-variance tradeoff.
$L_1$ penalty means that we use the sum of absolute values:

$$P_1({\boldsymbol \beta}) = \sum_j |\beta_j|$$

$L_2$ penalty means that we use the sum of squares:

$$P_2({\boldsymbol \beta}) = \sum_j \beta_j^2$$

(The names come from the generic concept of an $L_p$ norm; see Wikipedia for details.)
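As a quick sketch, the penalized criterion can be written directly in numpy (the helper names here are ours, not scikit-learn's):

def l1_penalty(beta):
    return np.sum(np.abs(beta))

def l2_penalty(beta):
    return np.sum(beta ** 2)

def penalized_objective(beta, c, X, y, alpha, penalty):
    # fit term (1/2N times the sum of squared errors) plus weighted penalty
    errors = X.dot(beta) + c - y
    return np.dot(errors, errors) / (2. * len(y)) + alpha * penalty(beta)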
Using the $L_1$ norm leads us to the Lasso!
Lasso stands for least absolute shrinkage and selection operator, but nobody uses that long name.
lasso = linear_model.Lasso()
lasso.fit(train_data, train_target)
pass
Setting the $\alpha$ parameter:
lasso = linear_model.Lasso(alpha=0.1)
lasso.fit(train_data, train_target)
pass
(We will discuss how to set the $\alpha$ parameter in a smarter way later in the talk.)
Some features span much larger ranges of values than others. Thus, when our penalty adds all the weights together, it implicitly penalizes the low-varying features more (they need larger weights to have the same effect on the prediction).
fig,ax = plt.subplots()
ax.plot(train_data.ptp(axis=0))
ax.set_xticks(np.arange(train_data.shape[1]))
ax.set_xticklabels(boston.feature_names)
ax.set_ylabel('Point to point (max-min)')
pass
Normalization ensures that all the inputs are on the same scale.
lasso = linear_model.Lasso(normalize=True, alpha=.1)
lasso.fit(train_data, train_target)
pass
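As an aside, we can also put the features on a comparable scale explicitly before fitting; a sketch with `StandardScaler` (standardizing to unit variance, which is related to, but not exactly the same as, what `normalize=True` does):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_train = scaler.fit_transform(train_data)  # learn the scaling on training data only
scaled_test = scaler.transform(test_data)        # apply the same scaling to the test data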
linreg.fit(train_data, train_target)
r2_ols_train = linreg.score(train_data, train_target)
r2_ols = linreg.score(test_data, test_target)
lasso.fit(train_data, train_target)
r2_lasso_train = lasso.score(train_data, train_target)
r2_lasso = lasso.score(test_data, test_target)
results = """\
| TRAINING | TESTING
------+----------+---------
OLS | {:.2%} | {:.2%}
------+----------+---------
Lasso | {:.2%} | {:.2%}
---------------------------
""".format(r2_ols_train, r2_ols, r2_lasso_train, r2_lasso)
print(results)
      | TRAINING | TESTING
------+----------+---------
OLS   |  73.66%  | 73.25%
------+----------+---------
Lasso |  60.06%  | 57.75%
---------------------------
In a bit, we will see another dataset, where Lasso shines brighter.
Using the $L_2$ norm leads us to Ridge regression (a.k.a. Tikhonov regularization).
Unlike Lasso, which was invented only recently (1996), Ridge goes back a few decades.
ridge = linear_model.Ridge(normalize=True, alpha=.1)
ridge.fit(train_data, train_target)
r2_ridge_train = ridge.score(train_data, train_target)
r2_ridge = ridge.score(test_data, test_target)
results = """\
| TRAINING | TESTING
------+----------+---------
OLS | {:.2%} | {:.2%}
------+----------+---------
Lasso | {:.2%} | {:.2%}
------+----------+---------
Ridge | {:.2%} | {:.2%}
---------------------------
""".format(r2_ols_train, r2_ols,
r2_lasso_train, r2_lasso,
r2_ridge_train, r2_ridge)
print(results)
      | TRAINING | TESTING
------+----------+---------
OLS   |  73.66%  | 73.25%
------+----------+---------
Lasso |  60.06%  | 57.75%
------+----------+---------
Ridge |  72.54%  | 73.51%
---------------------------
The Lasso leads to sparse solutions; this means that some coefficients are exactly zero:
print(lasso.coef_)
[-0. 0. -0. 0. -0. 3.07533624 -0. 0. -0. -0. -0.37561765 0. -0.42028728]
We can use Lasso to select important features!
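For example, a quick sketch that maps the non-zero coefficients back to their feature names:

# which features did the Lasso keep? (boolean indexing on the names)
print(boston.feature_names[lasso.coef_ != 0])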
Higher penalties (higher $\alpha$) mean more sparsity:
lasso.alpha = .2
lasso.fit(train_data, train_target)
print(lasso.coef_)
[-0. 0. -0. 0. -0. 1.53655857 -0. 0. -0. -0. -0. 0. -0.28228527]
Too high a value for $\alpha$ results in zeros everywhere:
lasso.alpha = 1.
lasso.fit(train_data, train_target)
print(lasso.coef_)
[-0. 0. -0. 0. -0. 0. -0. 0. -0. -0. -0. 0. -0.]
Plotting the regularization path is a way to visualize what happens as we increase or decrease regularization:
alphas = np.linspace(.01, 1000., 1000)
alphas, coefs, _= lasso.path(train_data, train_target, alphas=alphas)
fig,ax = plt.subplots()
ax.plot(alphas, coefs.T)
ax.set_xscale('log')
ax.set_xlim(alphas.max(), alphas.min())
ax.set_xlabel(r'$\alpha$')
ax.set_ylabel('Coefficient value')
pass
Instead of the values, we can also just count how many coefficients are non-zero:
fig,ax = plt.subplots()
ax.plot(alphas, np.sum(coefs != 0.0, axis=0))
ax.set_xscale('log')
ax.set_xlim(alphas.max(), alphas.min())
ax.set_ylim(-.1, 13.1)
ax.set_xlabel(r'$\alpha$')
ax.set_ylabel('Nr of non-zero coefficients')
pass
Unlike the Lasso, Ridge does not set any coefficients to zero!
ridge.fit(train_data, train_target)
print(ridge.coef_)
[ -7.16516257e-02 2.61968547e-02 -6.48444922e-02 2.39713486e+00 -1.09193751e+01 4.03200954e+00 -9.44894418e-03 -1.06942801e+00 1.31698631e-01 -6.45883648e-03 -8.87770177e-01 7.47187438e-03 -4.17026739e-01]
lasso.alpha = 0.1
lasso.fit(train_data, train_target)
ridge.alpha = 0.1
ridge.fit(train_data, train_target)
Ridge(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=None, normalize=True, solver='auto', tol=0.001)
fig,ax = plt.subplots()
ax.plot(linreg.coef_, label='OLS')
ax.plot(lasso.coef_, label=r'Lasso ($\alpha$=0.1)')
ax.plot(ridge.coef_, label=r'Ridge ($\alpha$=0.1)')
ax.set_xticks(np.arange(train_data.shape[1]))
ax.set_xticklabels(boston.feature_names)
ax.legend(loc='best')
pass
Consider the following pseudo-algorithm:

1. Start with all coefficients at zero: $\beta_j = 0$.
2. Find the coefficient $\beta_j$ such that changing it by a small step $\epsilon$ most improves the penalized objective, and apply that change.
3. Repeat from step 2.
Because Ridge uses squared penalties, $\beta$s that are still stuck at zero are very cheap in step 2, so we might choose them even if they are not very good.
With Lasso, however, they always cost the same. Therefore, the algorithm will typically keep choosing the same index repeatedly.
Note: An improved version of this idea forms the basis of the famous Least-Angle Regression method to solve the Lasso problem.
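A toy sketch of this stagewise idea (for intuition only; this is not scikit-learn's solver, and the step size and number of steps are arbitrary):

def stagewise(X, y, n_steps=1000, eps=.01):
    beta = np.zeros(X.shape[1])           # step 1: start with all betas at zero
    for _ in range(n_steps):
        residual = y - X.dot(beta)
        corr = X.T.dot(residual)
        j = np.argmax(np.abs(corr))       # step 2: best single coordinate to nudge
        beta[j] += eps * np.sign(corr[j])
    return beta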
(This is Figure 2 from the paper Regression shrinkage and selection via the lasso, by R. Tibshirani [JSTOR Version].)
We can combine the two types of penalty to obtain elastic nets:

$$P({\boldsymbol \beta}) = \alpha_1 P_1({\boldsymbol \beta}) + \alpha_2 P_2({\boldsymbol \beta}) = \alpha_1 \sum_j |\beta_j| + \alpha_2 \sum_j \beta_j^2$$

We now have two penalties, with different weights, $\alpha_1$ and $\alpha_2$.
en = linear_model.ElasticNet(normalize=True)
Instead of $\alpha_1$ and $\alpha_2$, in scikit-learn one specifies two parameters, $\alpha$ and $\rho$ (a.k.a. `l1_ratio`):
en = linear_model.ElasticNet(normalize=True, alpha=0.1, l1_ratio=.5)
This way, `alpha` defines the total amount of penalty and `l1_ratio` defines how the penalty is split between L1 and L2.
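Concretely, per the scikit-learn documentation, the objective being minimized is:

$$\frac{1}{2N} \sum_i \epsilon^2_i + \alpha \rho \sum_j |\beta_j| + \frac{\alpha (1 - \rho)}{2} \sum_j \beta_j^2$$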
lasso = linear_model.ElasticNet(normalize=True, alpha=0.1, l1_ratio=1.)
ridge = linear_model.ElasticNet(normalize=True, alpha=0.1, l1_ratio=0.)
half_way = linear_model.ElasticNet(normalize=True, alpha=0.1, l1_ratio=.5)
One issue with the pure Lasso is that it is greedy and unstable:
almost_lasso = linear_model.ElasticNet(normalize=True, alpha=0.1, l1_ratio=.95)
almost_lasso.fit(train_data, train_target)
lasso.fit(train_data, train_target)
print(almost_lasso.coef_)
print(lasso.coef_)
[ -1.86363805e-02 0.00000000e+00 -4.24226376e-02 0.00000000e+00 -1.35017945e+00 1.50491971e+00 -4.61639434e-04 0.00000000e+00 -3.20320213e-03 -1.68314941e-03 -2.77118152e-01 1.22517256e-04 -1.55662223e-01]
[-0. 0. -0. 0. -0. 3.07533624 -0. 0. -0. -0. -0.37561765 0. -0.42028728]
The boston dataset is small, so it does not fully do justice to the power of penalized regression.
Quoting from the online page for this dataset:
The corpus contains 10-K reports from many US companies during years 1996-2006, as well as measured volatility of stock returns for the twelve-month periods preceding and following each report
This was generated and used in this paper:
Predicting Risk from Financial Reports with Regression
Shimon Kogan, Dimitry Levin, Bryan R. Routledge, Jacob S. Sagi, and Noah A. Smith. NAACL-HLT 2009, Boulder, CO, May–June 2009. Online version of paper.
If you are reading this at home, you need to download the E2006 dataset. You can use the following code block (unfortunately, this is Python 2 only code).
source_location = 'http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2'
from os import path
if not path.exists('E2006.train.bz2'):
    import urllib
    urllib.urlretrieve(source_location, 'E2006.train.bz2')
from sklearn.datasets import load_svmlight_file
E2006_data, E2006_target = load_svmlight_file('E2006.train.bz2')
print(E2006_data.shape)
(16087, 150360)
We have 16k examples and 150k features!
print(type(E2006_data))
<class 'scipy.sparse.csr.csr_matrix'>
The dataset is stored in memory as a sparse matrix as most entries are zero.
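A quick sketch of just how sparse it is (`nnz` counts the non-zero entries):

# fraction of entries that are non-zero
density = E2006_data.nnz / float(E2006_data.shape[0] * E2006_data.shape[1])
print("Density: {:.3%}".format(density))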
E2_train_data, E2_test_data, E2_train_target, E2_test_target = \
        train_test_split(E2006_data, E2006_target, train_size=.8)
linreg.fit(E2_train_data, E2_train_target)
r2_ols_train = linreg.score(E2_train_data, E2_train_target)
r2_ols = linreg.score(E2_test_data, E2_test_target)
print("R2 Training (OLS): {:.1%}".format(r2_ols_train))
print("R2 Testing (OLS): {:.1%}".format(r2_ols))
R2 Training (OLS): 100.0%
R2 Testing (OLS): 33.7%
(Note that you may get different values as the train/test split was random)
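(If you want a reproducible split, you can pass a fixed seed; the value 0 here is arbitrary:)

# optional: a reproducible split
E2_train_data, E2_test_data, E2_train_target, E2_test_target = \
        train_test_split(E2006_data, E2006_target, train_size=.8, random_state=0)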
lasso = linear_model.Lasso(normalize=True, alpha=.01)
lasso.fit(E2_train_data, E2_train_target)
r2_lasso_train = lasso.score(E2_train_data, E2_train_target)
r2_lasso = lasso.score(E2_test_data, E2_test_target)
ridge = linear_model.Ridge(normalize=True, alpha=.01)
ridge.fit(E2_train_data, E2_train_target)
r2_ridge_train = ridge.score(E2_train_data, E2_train_target)
r2_ridge = ridge.score(E2_test_data, E2_test_target)
results_p_gt_n = """\
| TRAINING | TESTING
------+----------+---------
OLS | {:.2%} | {:.2%}
------+----------+---------
Lasso | {:.2%} | {:.2%}
------+----------+---------
Ridge | {:.2%} | {:.2%}
---------------------------
""".format(r2_ols_train, r2_ols,
r2_lasso_train, r2_lasso,
r2_ridge_train, r2_ridge)
print(results_p_gt_n)
      | TRAINING | TESTING
------+----------+---------
OLS   | 100.00%  | 33.68%
------+----------+---------
Lasso |   0.00%  | -0.12%
------+----------+---------
Ridge |  68.18%  | 58.20%
---------------------------
Those $\alpha$ values I set worked well for the examples, but why did I use them and not others?
So far, we have been cheating and should be sent to machine learning re-education camp.
Validation dataset or cross-validation!
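A minimal sketch of the validation-set approach (the candidate $\alpha$ values here are arbitrary): split the training data once more and keep the $\alpha$ that scores best on the held-out part.

# split the training data again; the test data stays untouched
tr_data, val_data, tr_target, val_target = \
        train_test_split(train_data, train_target, train_size=.8)
best_alpha, best_score = None, -np.inf
for a in [.01, .05, .1, .5, 1.]:
    candidate = linear_model.Lasso(normalize=True, alpha=a)
    candidate.fit(tr_data, tr_target)
    score = candidate.score(val_data, val_target)
    if score > best_score:
        best_alpha, best_score = a, score
print("Best alpha on the validation set: {}".format(best_alpha))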
Scikit-learn can do this for us automatically: `ElasticNetCV` chooses its parameters by internal cross-validation, efficiently sharing computation across the candidate values (using the `path` method):

enCV = linear_model.ElasticNetCV(normalize=True,
                l1_ratio=[.1, .5, .7, .9, .95, .99, 1],
                )
Use multiple processors:
enCV = linear_model.ElasticNetCV(normalize=True,
                l1_ratio=[.1, .5, .7, .9, .95, .99, 1],
                n_jobs=-1,
                )
from sklearn import linear_model

enCV = linear_model.ElasticNetCV(normalize=True,
                l1_ratio=[.1, .5, .7, .9, .95, .99, 1],
                n_jobs=-1,
                )
This can be your default object for regression.
Note: only on very up-to-date versions of scikit-learn (the github version, right now) does this work well on my computer.
Older versions take too much memory, so the problem is then better tackled with a single CPU:
enCV.n_jobs = 1
The full example shown above takes too long for a demo, so let us run a scaled down version (which takes a few minutes):
enCV = linear_model.ElasticNetCV(normalize=True,
                l1_ratio=[.1, .5, .9, .99],
                alphas=[.001, .01, .05, .1, .25, .5, 1.],
                n_jobs=-1,
                )
enCV.fit(E2_train_data, E2_train_target)
r2_enCV_train = enCV.score(E2_train_data, E2_train_target)
r2_enCV = enCV.score(E2_test_data, E2_test_target)
When you can run analyses offline, you may gain a bit from exploring more of the parameter space.
result_en_cv = """\
| TRAINING | TESTING
------+----------+---------
OLS | {:.2%} | {:.2%}
------+----------+---------
Lasso | {:.2%} | {:.2%}
------+----------+---------
Ridge | {:.2%} | {:.2%}
------+----------+---------
EN-CV | {:.2%} | {:.2%}
---------------------------
""".format(r2_ols_train, r2_ols,
r2_lasso_train, r2_lasso,
r2_ridge_train, r2_ridge,
r2_enCV_train, r2_enCV)
print(result_en_cv)
      | TRAINING | TESTING
------+----------+---------
OLS   | 100.00%  | 33.68%
------+----------+---------
Lasso |   0.00%  | -0.12%
------+----------+---------
Ridge |  68.18%  | 58.20%
------+----------+---------
EN-CV |  58.84%  | 59.92%
---------------------------
There are some other interesting parameters which we did not look at:
print(enCV)
ElasticNetCV(alphas=[0.001, 0.01, 0.05, 0.1, 0.25, 0.5, 1.0], copy_X=True, cv=None, eps=0.001, fit_intercept=True, l1_ratio=[0.1, 0.5, 0.9, 0.99], max_iter=1000, n_alphas=100, n_jobs=-1, normalize=True, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=0)
- `alphas` lets you specify the $\alpha$ values to test (similar to `l1_ratio`). It is fine to leave this unspecified.
- `fit_intercept` can be set to `False` to avoid fitting an intercept.
- `n_jobs` sets the number of jobs (-1 means all CPUs).
- `positive` can be used to force all coefficients to be positive.

All the other ones are algorithm-internal and useful mainly if you are having problems (try updating scikit-learn first, though; newer versions have better implementations for large problems).
Use `ElasticNetCV` with a wide range of parameters as your "go-to regression method".

This presentation was based on Chapter 7 (Regression) of Building Machine Learning Systems with Python.
Find me at luispedro.org or on twitter at @luispedrocoelho
This whole presentation can be obtained at github.com/luispedro/PenalizedRegression or viewed using the NBViewer.