This tutorial is intended to complement lecture #5.
We will show how to:
- Use cross-validation to estimate $R^2$ and reduce overfitting.
- Use cross-validation to estimate $R^2$ for two models and select the more accurate model.
Written by Franco Pestilli, PhD, Stanford University, May 2012
Translated to Python by Michael Waskom, June 2012
%pylab inline
import seaborn
seaborn.set()
colors = seaborn.color_palette()
import utils
Models are attempts to describe data. Ideally a model's description of the data should be accurate and reliable. This would allow for making predictions on data that we have not collected.
Accuracy and reliability are two important descriptors of the quality of a model.
Accuracy: A model is accurate if the predictions it makes are close to measured data. The coefficient of determination ($R^2$) is a standard measure of model accuracy. $R^2$ expresses the percent of variance in the data explained by the model.
Reliability: A model is reliable if the predictions (and the parameters) of the model vary minimally as the sample data used to fit the model changes. The standard deviation (or the confidence intervals) of a model's estimates is a good measure of model reliability.
Model accuracy and reliability can be estimated using cross-validation and bootstrapping, respectively.
In cross-validation, we fit a model to a subset of the total measured data and then we evaluate how well the model predicts the remaining data.
In bootstrapping, we resample (with replacement) from the original data set and then fit the model of interest to each bootstrap sample. This allows us to quantify reliability; how much the model prediction changes as samples change.
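To make the bootstrap idea concrete, here is a minimal sketch (not part of the original tutorial; reliability is the topic of Tutorial 6). It resamples a hypothetical data vector with replacement and measures how much a simple statistic, the mean, varies across bootstrap samples. It uses the pylab namespace loaded above.
# A minimal sketch of bootstrapping. `data` is a hypothetical sample;
# each bootstrap sample is drawn from it with replacement, and we refit
# the statistic of interest (here, the mean) to each one.
data = randn(50)
n_boot = 1000
boot_means = array([mean(data[randint(0, len(data), len(data))])
                    for _ in range(n_boot)])
print("Bootstrap standard error of the mean: %.3f" % boot_means.std())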
The present tutorial focuses on model accuracy; Tutorial 6 will focus on model reliability.
The goal of fitting a model to a sample is to describe an underlying population that we cannot access directly. The only thing we have is the data in the sample.
Unfortunately, every time we fit a model to a sample of data, the model will overfit the sample. This means that the model adjusts "too much" to the specific sample, so that it describes the sample better than the population we are actually interested in.
Hereafter, we will show how to use cross-validation to estimate model accuracy by computing a measure of $R^2$ that is not prone to overfitting.
We will create a population using a straight line as the true model, draw a sample from this population, and then fit a straight-line model to the sample.
We will then estimate the accuracy of our model using the coefficient of determination ($R^2$), which represents the percent of variance in the sample data explained by the model.
Because we are using a good model (a straight line, the same form as the true model), the model will be accurate: its predictions will be close to the sample data. In other words, it will represent the data very well.
But the model we fit is tailored to the specific sample we collected. When fitting a (good) model to a sample of limited size we are prone to overfitting. This means that the model will tend to describe the specific sample better than it describes the true population. In other words, the $R^2$ estimated on the sample will be higher than it should be.
We will show how to use cross-validation to compute a coefficient of determination ($R^2$) that is not prone to overfitting.
# We will first simulate a population using a "true" model of our choice: a straight line.
x = repeat(arange(200), 100)
x = x + randn(len(x)) * 5  # jitter the x values; avoid += so the int array is upcast to float
a, b = 2, 100 # Parameters of the model
poly_order = 1 # A straight line
y = polyval([a, b], x) + randn(len(x)) * 70
# Show the simulated true population
plot(x, y, "o", markersize=3, alpha=.15)
# Show the true model on top of the population
true_m = polyval([a, b], x)
plot(x, true_m, linewidth=4);
# Now sample from the population
sample_size = 20
ind = permutation(arange(len(x)))[:sample_size]
x_sample = x[ind]
y_sample = y[ind]
# Show the sample
plot(x_sample, y_sample, "o")
# Fit the model to the sample with a straight line model
poly_order = 1
ab = polyfit(x_sample, y_sample, poly_order)
# Evaluate the estimated model
y_fit = polyval(ab, x_sample)
# Show the model fit to the sample
plot(x_sample, y_fit, color=colors[4]);
We can now compute the coefficient of determination ($R^2$) of the model fitted to the sample data. This number gives the percent of variance in the data explained by the model. We will use the following formula, where $y_i$ are the measured data points, $\hat{y_i}$ the model predictions, and $\bar{y}$ the mean of the measured data:

$$R^2 = 1 - \frac{\sum{(y_i - \hat{y_i})^2}}{\sum{(y_i - \bar{y})^2}}$$

# Define a function to compute R2 since we're doing it a few times
compute_r2 = lambda true, pred: 1 - sum(square(true - pred)) / sum(square(true - mean(true)))
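As a quick sanity check (the values below are illustrative, not from the tutorial): a perfect prediction should give $R^2 = 1$, and predicting the mean of the data for every point should give $R^2 = 0$.
# Sanity checks on compute_r2 with made-up values
y_demo = array([1., 2., 3., 4.])
print("Perfect prediction: %.2f" % compute_r2(y_demo, y_demo))  # 1.00
print("Mean-only prediction: %.2f" % compute_r2(y_demo, repeat(mean(y_demo), len(y_demo))))  # 0.00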
r2 = compute_r2(y_sample, y_fit)
print "R^2 = %.2f" % r2
R^2 = 0.81
The percent of explained variance seems decent (above 70%). But we know that some overfitting is going on here.
To show this, we will now compute the $R^2$ of the true model on the sample data. This value will be lower than the previous $R^2$, indicating that the fitted model does not describe the true population as well as it describes the given sample. The improved accuracy of the fitted model on the sample data is called overfitting.
y_fit_truem = polyval([a, b], x_sample)
r2_true = compute_r2(y_sample, y_fit_truem)
print "True R^2: %.2f" % r2_true
True R^2: 0.78
We can use cross-validation to reduce overfitting.
To do so, we will use leave-one-out cross-validation and split the sample data into two sets, a training set and a test set.
We will fit the straight-line model to the training set and compute the $R^2$ on the test set. This procedure reduces overfitting because the model prediction is never made on the same samples used for fitting. So any excessive tailoring of the model to the training sample will reduce the quality of the prediction on the test sample.
We will use the cross-validation objects from the scikit-learn package for concise, flexible code. (The sklearn.cross_validation module used below belongs to older scikit-learn versions; it was later replaced by sklearn.model_selection.)
# Use leave-one-out cross validation
from sklearn.cross_validation import LeaveOneOut
y_hat = zeros_like(y_fit)
# Iterate through the folds
for i, (train, test) in enumerate(LeaveOneOut(sample_size)):
    # Split the data
    x_train, x_test = x_sample[train], x_sample[test]
    y_train, y_test = y_sample[train], y_sample[test]
    # Fit the model to the training set
    ab_hat = polyfit(x_train, y_train, 1)
    # Predict the held-out sample and enter it into the y_hat vector
    y_hat[i] = polyval(ab_hat, x_test)
# Compute the cross-validated R^2
r2_cv = compute_r2(y_sample, y_hat)
print "R^2 estimated on the sample: %.2f" % r2
print "R^2 estimated on the population: %.2f" % r2_true
print "Cross-validated R^2: %.2f" % r2_cv
R^2 estimated on the sample: 0.81
R^2 estimated on the population: 0.78
Cross-validated R^2: 0.77
You might have noticed that the cross-validated $R^2$ can be smaller than both the $R^2$ computed on the original sample and the $R^2$ of the true population model on the sample data. This is because each cross-validated prediction comes from a model fit to only part of the data (sample_size - 1 points in our case), so each estimate is noisier than the estimate on the full sample.
Cross-validation can be used to reduce overfitting. But the exact value of the $R^2$ obtained via cross-validation depends on several factors, such as the sample size, the amount of noise in the data, and how the data are split into training and test sets.
So all the cross-validated $R^2$ really tells us is the accuracy of the model in describing the population without overfitting. What we cannot say is whether the exact value of the cross-validated $R^2$ is similar to the population $R^2$.
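To illustrate that the split itself matters, here is a minimal sketch of the same procedure with 5-fold splits, done by hand rather than with scikit-learn so it makes no assumptions about the library version (it assumes sample_size is divisible by the number of folds):
# Hand-rolled 5-fold cross-validation on the same sample
n_folds = 5
fold_ids = permutation(repeat(arange(n_folds), sample_size // n_folds))
y_hat_kf = zeros_like(y_sample)
for k in range(n_folds):
    train, test = fold_ids != k, fold_ids == k
    ab_kf = polyfit(x_sample[train], y_sample[train], 1)
    y_hat_kf[test] = polyval(ab_kf, x_sample[test])
print("5-fold cross-validated R^2: %.2f" % compute_r2(y_sample, y_hat_kf))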
Here we will show how to use cross-validation to select the better of two models.
Ideally, the model with the higher $R^2$ is the more accurate description of the data.
But if we compare the accuracy of two different models and judge it by the non-cross-validated $R^2$, we could erroneously select the wrong model. For example, we could select the model with the higher degree of overfitting: a model that overfits the data more will have a higher non-cross-validated $R^2$ on the sample data set.
So, when selecting the best among several models, we should always use the cross-validated $R^2$, which evaluates the accuracy of a model without overfitting the data.
# Plot the original population
plot(x, y, "o", markersize=3, alpha=.15)
xlabel("X")
ylabel("Y")
# Show the true model on top of the population
plot(x, true_m, linewidth=3, alpha=.7)
# Show the sample
plot(x_sample, y_sample, "o", markersize=8, alpha=.8)
# Show the fit of the linear model from above
plot(x_sample, y_fit, linewidth=3, alpha=.7)
# Now let's fit a more complex model: a polynomial of 4th order.
# This model is not similar to the true model (a straight line), but it will
# represent the specific data in the sample better than a straight line,
# because it can adapt to the small, chance variability in the sample.
params = polyfit(x_sample, y_sample, deg=4)
y_fit_p4 = polyval(params, x_sample)
sorter = argsort(x_sample)
plot(x_sample[sorter], y_fit_p4[sorter], linewidth=3, alpha=.7);
# The 4th-order polynomial (gold) passes closer to each sample data
# point (red) than the straight line (purple) does. It should have a higher
# R2 than the straight-line model.
r2_line = compute_r2(y_sample, y_fit)
r2_poly4 = compute_r2(y_sample, y_fit_p4)
print "Line model: %.2f" % r2_line
print "4th-order model: %.2f" % r2_poly4
Line model: 0.81
4th-order model: 0.84
So if we were to judge the accuracy of these two models from the computed $R^2$, we would select the 4th-order polynomial. But what we are really interested in is choosing the model that best represents the true population, so choosing the 4th-order polynomial would be a mistake. (Remember that we created the population from a straight-line model.)
To choose the best model we need a measure of model accuracy that is not prone to overfitting. We have seen above that cross-validation reduces overfitting, so we will now compute the cross-validated $R^2$ for both models and show that, after taking care of overfitting, the polynomial no longer fits the data better.
yhat_line = zeros_like(y_sample)
yhat_poly4 = zeros_like(y_sample)
for i, (train, test) in enumerate(LeaveOneOut(sample_size)):
    x_train = x_sample[train]
    y_train = y_sample[train]
    x_test = x_sample[test]
    line_model = polyfit(x_train, y_train, 1)
    poly4_model = polyfit(x_train, y_train, 4)
    yhat_line[i] = polyval(line_model, x_test)
    yhat_poly4[i] = polyval(poly4_model, x_test)
r2_line_cv = compute_r2(y_sample, yhat_line)
r2_poly4_cv = compute_r2(y_sample, yhat_poly4)
print "Line model (full fit): %.2f" % r2_line
print "Line model (cross-validated): %.2f" % r2_line_cv
print "4th-order model (full fit): %.2f" % r2_poly4
print "4th-order model (cross-validated): %.2f" % r2_poly4_cv
Line model (full fit): 0.81
Line model (cross-validated): 0.77
4th-order model (full fit): 0.84
4th-order model (cross-validated): 0.70
The 4th-order polynomial had a higher degree of overfitting. This can be seen by comparing each model's full-fit $R^2$ with its cross-validated $R^2$: the larger the difference, the larger the overfitting.
This means that the polynomial is not a more accurate model of the data than the straight line; it simply overfits the sample more.
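The same comparison extends naturally to more than two models. As a sketch (assuming the same leave-one-out scheme, sample, and compute_r2 function as above), we can sweep over several polynomial degrees and let the cross-validated $R^2$ pick the winner; the straight line (degree 1) should come out on top.
# Compare polynomial degrees by their cross-validated R^2
for deg in range(1, 6):
    y_hat_deg = zeros_like(y_sample)
    for i, (train, test) in enumerate(LeaveOneOut(sample_size)):
        fit = polyfit(x_sample[train], y_sample[train], deg)
        y_hat_deg[i] = polyval(fit, x_sample[test])
    print("Degree %d: cross-validated R^2 = %.2f" % (deg, compute_r2(y_sample, y_hat_deg)))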