Adapted from Chapter 3 of An Introduction to Statistical Learning
Why are we learning linear regression?
Lesson goals:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn import metrics
import statsmodels.formula.api as smf
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# read data into a DataFrame
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
data.head()
|   | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 5 | 180.8 | 10.8 | 58.4 | 12.9 |
What are the observations?
What are the features?
What is the response?
You are asked by the company: On the basis of this data, how should we spend our advertising money in the future?
You come up with more specific questions:
Use a scatter plot to visualize the relationship between the features and the response.
# scatter plot in Seaborn
sns.pairplot(data, x_vars=['TV','Radio','Newspaper'], y_vars='Sales', height=6, aspect=0.7)
# include a "regression line"
sns.pairplot(data, x_vars=['TV','Radio','Newspaper'], y_vars='Sales', height=6, aspect=0.7, kind='reg')
# scatter plot in Pandas
fig, axs = plt.subplots(1, 3, sharey=True)
data.plot(kind='scatter', x='TV', y='Sales', ax=axs[0], figsize=(16, 6))
data.plot(kind='scatter', x='Radio', y='Sales', ax=axs[1])
data.plot(kind='scatter', x='Newspaper', y='Sales', ax=axs[2])
Use a scatter matrix to visualize the relationship between all numerical variables.
# scatter matrix in Seaborn
sns.pairplot(data)
Use a correlation matrix to visualize the correlation between all numerical variables.
# compute correlation matrix
data.corr()
|   | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| TV | 1.000000 | 0.054809 | 0.056648 | 0.782224 |
| Radio | 0.054809 | 1.000000 | 0.354104 | 0.576223 |
| Newspaper | 0.056648 | 0.354104 | 1.000000 | 0.228299 |
| Sales | 0.782224 | 0.576223 | 0.228299 | 1.000000 |
# display correlation matrix in Seaborn using a heatmap
sns.heatmap(data.corr())
Correlation quantifies how strongly two variables are linearly related
But it does NOT reveal whether there is any causation between the variables
In general, machine learning is great at revealing when variables are correlated, but not whether one causes the other
Simple linear regression is an approach for predicting a continuous response using a single feature. It takes the following form:
$y = \beta_0 + \beta_1x$
$\beta_0$ and $\beta_1$ are called the model coefficients:
How do the model coefficients relate to the least squares line?
Linear regression is highly parametric, meaning that it relies heavily on the underlying shape of the data. If the data fall along a line, then linear regression will do well. If the data do not fall in line (get it?), linear regression is likely to fail.
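Under the hood, the least squares coefficients have a simple closed form; here's a minimal sketch computing them by hand for Sales ~ TV (using only the data loaded above), which should match the fitted models below:
# closed-form least squares estimates: beta_1 = cov(x, y) / var(x)
x = data.TV
y = data.Sales
beta_1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
beta_0 = y.mean() - beta_1 * x.mean()
print(beta_0, beta_1)  # approximately 7.0326 and 0.0475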
Let's estimate the model coefficients for the advertising data:
### STATSMODELS ###
# create a fitted model
lm = smf.ols(formula='Sales ~ TV', data=data).fit()
# print the coefficients
lm.params
Intercept    7.032594
TV           0.047537
dtype: float64
### SCIKIT-LEARN ###
# create X and y
feature_cols = ['TV']
X = data[feature_cols]
y = data.Sales
# instantiate and fit
linreg = LinearRegression()
linreg.fit(X, y)
# print the coefficients
print(linreg.intercept_)
print(linreg.coef_)
7.03259354913
[ 0.04753664]
How do we interpret the TV coefficient ($\beta_1$)? A $1,000 increase in TV ad spending is associated with an increase in Sales of about 47.5 widgets (0.047537 thousand units).
If an increase in TV ad spending were associated with a decrease in Sales, $\beta_1$ would be negative.
Let's say that there was a new market where the TV advertising spend was $50,000. What would we predict for the Sales in that market?
$$y = \beta_0 + \beta_1x$$
$$y = 7.0326 + 0.0475 \times 50$$
# manually calculate the prediction
7.0326 + 0.0475*50
9.4076
### STATSMODELS ###
# you have to create a DataFrame since the Statsmodels formula interface expects it
X_new = pd.DataFrame({'TV': [50]})
# predict for a new observation
lm.predict(X_new)
array([ 9.40942557])
### SCIKIT-LEARN ###
# predict for a new observation (scikit-learn expects a 2D input)
linreg.predict([[50]])
array([ 9.40942557])
Thus, we would predict Sales of 9,409 widgets in that market.
Let's say that TV was measured in dollars, rather than thousands of dollars. How would that affect the model?
data['TV_dollars'] = data.TV * 1000
data.head()
|   | TV | Radio | Newspaper | Sales | TV_dollars |
|---|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 | 22.1 | 230100.0 |
| 2 | 44.5 | 39.3 | 45.1 | 10.4 | 44500.0 |
| 3 | 17.2 | 45.9 | 69.3 | 9.3 | 17200.0 |
| 4 | 151.5 | 41.3 | 58.5 | 18.5 | 151500.0 |
| 5 | 180.8 | 10.8 | 58.4 | 12.9 | 180800.0 |
### SCIKIT-LEARN ###
# create X and y
feature_cols = ['TV_dollars']
X = data[feature_cols]
y = data.Sales
# instantiate and fit
linreg = LinearRegression()
linreg.fit(X, y)
# print the coefficients
print(linreg.intercept_)
print(linreg.coef_)
7.03259354913
[ 4.75366404e-05]
How do we interpret the TV_dollars coefficient ($\beta_1$)? A $1 increase in TV ad spending is associated with an increase in Sales of 0.0000475 thousand widgets, which is the same relationship as before, expressed in different units.
# predict for a new observation (scikit-learn expects a 2D input)
linreg.predict([[50000]])
array([ 9.40942557])
The scale of the features is irrelevant for linear regression models: it only affects the scale of the coefficients, and we simply adjust our interpretation accordingly.
Linear regression is a low variance/high bias model:
A closely related concept is confidence intervals.
Statsmodels calculates 95% confidence intervals for our model coefficients, which are interpreted as follows: If the population from which this sample was drawn was sampled 100 times, approximately 95 of those confidence intervals would contain the "true" coefficient.
### STATSMODELS ###
# print the confidence intervals for the model coefficients
lm.conf_int()
|   | 0 | 1 |
|---|---|---|
| Intercept | 6.129719 | 7.935468 |
| TV | 0.042231 | 0.052843 |
Note: 95% confidence intervals are just a convention. You can create 90% confidence intervals (which will be narrower), 99% confidence intervals (which will be wider), or whatever interval you like.
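In statsmodels, the interval width is controlled by the alpha parameter of conf_int (alpha=0.05, the 95% interval, is the default):
### STATSMODELS ###
# 90% intervals (narrower) and 99% intervals (wider)
print(lm.conf_int(alpha=0.10))
print(lm.conf_int(alpha=0.01))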
A closely related concept is hypothesis testing.
General process for hypothesis testing:
For model coefficients, here is the conventional hypothesis test: the null hypothesis is that there is no relationship between the feature and the response ($\beta_1 = 0$), and the alternative hypothesis is that there is a relationship ($\beta_1 \neq 0$).
How do we test this hypothesis?
### STATSMODELS ###
# print the p-values for the model coefficients
lm.pvalues
Intercept    1.406300e-35
TV           1.467390e-42
dtype: float64
Thus, a p-value less than 0.05 is one way to decide whether there is likely a relationship between the feature and the response. In this case, the p-value for TV is far less than 0.05, and so we believe that there is a relationship between TV ads and Sales.
Note that we generally ignore the p-value for the intercept.
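Since lm.pvalues is a pandas Series, the 0.05 cutoff can also be applied programmatically; a minimal illustration (not a substitute for judgment):
# flag coefficients with p-values below 0.05, ignoring the intercept
print(lm.pvalues.drop('Intercept') < 0.05)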
R-squared:
Let's calculate the R-squared value for our simple linear model:
### STATSMODELS ###
# print the R-squared value for the model
lm.rsquared
0.61187505085007099
### SCIKIT-LEARN ###
# calculate the R-squared value for the model
y_pred = linreg.predict(X)
metrics.r2_score(y, y_pred)
0.61187505085007099
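The same value can be recovered straight from the definition of R-squared (one minus the ratio of residual to total sum of squares); a minimal check using the X, y, and y_pred defined above:
# R-squared from its definition: 1 - SS_residual / SS_total
ss_res = ((y - y_pred) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)  # matches lm.rsquared and metrics.r2_score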
Simple linear regression can easily be extended to include multiple features, which is called multiple linear regression:
$y = \beta_0 + \beta_1x_1 + ... + \beta_nx_n$
Each $x$ represents a different feature, and each feature has its own coefficient:
$y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$
### SCIKIT-LEARN ###
# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper']
X = data[feature_cols]
y = data.Sales
# instantiate and fit
linreg = LinearRegression()
linreg.fit(X, y)
# print the coefficients
print(linreg.intercept_)
print(linreg.coef_)
2.93888936946
[ 0.04576465  0.18853002 -0.00103749]
# pair the feature names with the coefficients
list(zip(feature_cols, linreg.coef_))
[('TV', 0.045764645455397608), ('Radio', 0.18853001691820462), ('Newspaper', -0.0010374930424762972)]
For a given amount of Radio and Newspaper spending, an increase of $1000 in TV spending is associated with an increase in Sales of 45.8 widgets.
For a given amount of TV and Newspaper spending, an increase of $1000 in Radio spending is associated with an increase in Sales of 188.5 widgets.
For a given amount of TV and Radio spending, an increase of $1000 in Newspaper spending is associated with a decrease in Sales of 1.0 widget. How could that be?
How do I decide which features to include in a linear model?
We could try a model with all features, and only keep features in the model if they have small p-values:
### STATSMODELS ###
# create a fitted model with all three features
lm = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()
# print the p-values for the model coefficients
print(lm.pvalues)
Intercept    1.267295e-17
TV           1.509960e-81
Radio        1.505339e-54
Newspaper    8.599151e-01
dtype: float64
This indicates we would reject the null hypothesis for TV and Radio (that there is no association between those features and Sales), and fail to reject the null hypothesis for Newspaper. Thus, we would keep TV and Radio in the model.
However, this approach has drawbacks:
We could try models with different sets of features, and compare their R-squared values:
# R-squared value for the model with two features
lm = smf.ols(formula='Sales ~ TV + Radio', data=data).fit()
lm.rsquared
0.89719426108289568
# R-squared value for the model with three features
lm = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()
lm.rsquared
0.89721063817895219
This would seem to indicate that the best model includes all three features. Is that right?
Not necessarily: R-squared will always increase (or at least stay the same) as you add more features, even ones unrelated to the response, so a tiny gain like this is not evidence that Newspaper belongs in the model. In addition, R-squared depends on the same assumptions as p-values, and it's less reliable if those assumptions are violated.
A better approach to feature selection!
Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. We need evaluation metrics designed for comparing continuous values.
Let's create some example numeric predictions, and calculate three common evaluation metrics for regression problems:
# define true and predicted response values
y_true = [100, 50, 30, 20]
y_pred = [90, 50, 50, 30]
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
$$\frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|$$
print(metrics.mean_absolute_error(y_true, y_pred))
10.0
Mean Squared Error (MSE) is the mean of the squared errors:
$$\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$$
print(metrics.mean_squared_error(y_true, y_pred))
150.0
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$\sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2}$$
print(np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
12.2474487139
Comparing these metrics:
All of these are loss functions, because we want to minimize them.
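These formulas translate directly into a few lines of NumPy; here's a quick sanity check that recomputes the three metrics by hand against the scikit-learn results above:
# recompute the three metrics directly from their formulas
errors = np.array(y_true) - np.array(y_pred)
print(np.abs(errors).mean())          # MAE: 10.0
print((errors ** 2).mean())           # MSE: 150.0
print(np.sqrt((errors ** 2).mean()))  # RMSE: ~12.25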
Here's an additional example, to demonstrate how MSE/RMSE punish larger errors:
# same true values as above
y_true = [100, 50, 30, 20]
# new set of predicted values
y_pred = [60, 50, 30, 20]
# MAE is the same as before
print(metrics.mean_absolute_error(y_true, y_pred))
# RMSE is larger than before
print(np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
10.0
20.0
Let's use cross-validation with RMSE to decide whether Newspaper should be kept in the model:
# define a function that accepts X and y and computes the cross-validated RMSE
def cross_val_rmse(X, y):
    linreg = LinearRegression()
    scores = cross_val_score(linreg, X, y, cv=5, scoring='neg_mean_squared_error')
    return np.sqrt(-scores).mean()  # return the average RMSE across the folds
# include Newspaper
feature_cols = ['TV', 'Radio', 'Newspaper']
X = data[feature_cols]
cross_val_rmse(X, y)
1.7175247278732086
# exclude Newspaper (lower RMSE, so better)
feature_cols = ['TV', 'Radio']
X = data[feature_cols]
cross_val_rmse(X, y)
1.7026625340333177
# only TV, not good enough
feature_cols = ['TV']
X = data[feature_cols]
cross_val_rmse(X, y)
3.2756686834314559
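If cross-validation feels heavyweight, a single train/test split gives a faster (though higher-variance) estimate of testing RMSE; a minimal sketch, assuming the default 75/25 split and an arbitrary random_state:
from sklearn.model_selection import train_test_split
# single train/test split as a quicker, noisier alternative to cross-validation
X = data[['TV', 'Radio']]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
linreg = LinearRegression().fit(X_train, y_train)
print(np.sqrt(metrics.mean_squared_error(y_test, linreg.predict(X_test))))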
Advantages of linear regression:
Disadvantages of linear regression: