These are notes I wrote for myself while studying for PhD qualifying exams at Stanford.
I wanted to publish them so that others could benefit.
They draw from two sources: Stanford's CS229 lecture notes and An Introduction to Statistical Learning (ISL).
Some of the figures are taken from ISL with permission from the authors.
Linear regression is a parametric linear model that captures the relationship between $y$ and $X$:
$y = h_{\theta}(X) = \sum\limits_{i=0}^n \theta_i x_i = \theta^T X $.
It is discriminative, meaning that it directly models the output $y$ given a set of features, $P(y \mid X)$.
Additive - The effect of changes in a predictor $X_j$ is independent of the values of the other predictors.
Linear - Change in the response Y due to a one-unit change in $X_j$ is constant, regardless of the value of $X_j$.
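As a minimal sketch (my own illustration, not from CS229 or ISL), the hypothesis is just a dot product between the parameter vector and a feature vector whose first entry is fixed at $1$ for the intercept term:
import numpy as np

def h(theta, x):
    # Hypothesis: h_theta(x) = theta^T x, where x[0] = 1 is the intercept feature
    return np.dot(theta, x)

theta = np.array([1.0, 0.5, -2.0])  # illustrative values: [intercept, theta_1, theta_2]
x = np.array([1.0, 3.0, 4.0])       # x_0 = 1, x_1 = 3, x_2 = 4
print(h(theta, x))                  # 1 + 0.5*3 - 2*4 = -5.5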
One idea is to learn parameters ($\theta$) that minimize the squared error between the model ($\theta^T X$) and measured target ($Y$) for our $m$ training examples.
In CS229, the objective is defined below (up to the factor of $\frac{1}{2}$, this is what ISL calls the residual sum of squares, or RSS):
$J(\theta) = \frac{1}{2} \sum\limits_{i=1}^m (h_{\theta} (X^i) - y^i)^2$
This cost function is a convex quadratic, meaning it has a single global minimum.
This is shown below using $\theta = (\beta_0,\beta_1)$.
import os
from IPython.display import Image
i = Image(filename=os.getcwd()+'/Images/RSS.png') # from the ISL text.
i
There is a point (red, above) in parameter space that minimizes the cost function ($RSS$, above).
The challenge is to find the parameters, $\theta = (\beta_0,\beta_1)$, at this minimum.
Because the function is convex, the gradient is nonzero everywhere except at the global minimum, so stepping against the gradient always decreases the cost:
$ \frac{\partial}{\partial \theta_j} J(\theta) = 0 $ only at the minimum.
So, we can use the gradient descent algorithm, repeating until convergence (where $\alpha$ is the learning rate):
$$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) $$
We can see that $\theta_j$ is updated until $\frac{\partial}{\partial \theta_j} J(\theta) = 0$ at the global minimum.
In order to use this, we need to know $\frac{\partial}{\partial \theta_j} J(\theta)$.
Starting with the cost function:
$$J(\theta) = \frac{1}{2} \sum\limits_{i=1}^m (h_{\theta} (x^i) - y^i)^2$$
Let's neglect the summation, so we will show this for one training example:
$$J(\theta) = \frac{1}{2} (h_{\theta} (x) - y)^2$$
We compute the partial with respect to one of our parameters, $\theta_j$:
$$ \frac{\partial J(\theta)}{\partial \theta_j} = \frac{\partial}{\partial \theta_j}\frac{1}{2} (h_{\theta} (x) - y)^2 $$
$$ \frac{\partial J(\theta)}{\partial \theta_j} = (h_{\theta} (x) - y) \frac{\partial}{\partial \theta_j} \left( \sum\limits_{k=0}^n \theta_k x_k - y \right) $$
$$ \frac{\partial J(\theta)}{\partial \theta_j} = (h_{\theta} (x) - y)\, x_j $$
This results in the least mean squares (LMS, or Widrow-Hoff) update rule:
$$ \theta_j := \theta_j + \alpha \left( y^i - h_{\theta}(x^i) \right) x_j^i $$
In practice, we can apply this rule in two ways (a NumPy sketch of both follows below):
Batch - on every step, sum the update over all $m$ training examples:
$$ \theta_j := \theta_j + \alpha \sum\limits_{i=1}^m \left( y^i - h_{\theta}(x^i) \right) x_j^i $$
Stochastic - update the parameters after each individual example. For $i$ in $1 \ldots m$:
$$ \theta_j := \theta_j + \alpha \left( y^i - h_{\theta}(x^i) \right) x_j^i $$
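A minimal NumPy sketch of both variants on mock data (the data, learning rates, and iteration counts are illustrative, not from the notes):
import numpy as np

np.random.seed(0)
m = 100
X = np.column_stack([np.ones(m), np.random.rand(m)])  # first column is x_0 = 1 (intercept)
y = 4.0 + 3.0 * X[:, 1] + 0.1 * np.random.randn(m)    # mock data: y = 4 + 3x + noise

# Batch gradient descent: each step sums the LMS update over all m examples
theta = np.zeros(2)
alpha = 0.5 / m   # learning rate, scaled by 1/m to keep the summed update stable
for _ in range(5000):
    theta = theta + alpha * np.dot(y - np.dot(X, theta), X)

# Stochastic gradient descent: update after every individual example
theta_sgd = np.zeros(2)
for _ in range(200):          # passes over the data
    for i in range(m):
        theta_sgd = theta_sgd + 0.05 * (y[i] - np.dot(theta_sgd, X[i])) * X[i]

print(theta)      # both estimates should land near [4, 3]
print(theta_sgd)
Stochastic updates are noisier but touch only one example per step, which is why they are often preferred on large datasets.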
Our model prediction deviates from the actual value by an error term: $y^i = \theta^T x^i + \epsilon^i$.
Let's assume the errors are IID (independently and identically distributed) according to a Gaussian distribution with mean zero and some variance $\sigma^2$.
Recall the probability density function, which tells us the relative likelihood of this random variable taking on a given value:
$$ P(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^{2}}{2\sigma^2}\right) $$
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
u,s = 0,1 # mean and standard deviation
rv = stats.norm(loc=u,scale=s)
x = np.arange(-5,5,0.1)
plt.plot(x,rv.pdf(x),lw=2,color='b')
Useful reminders for the Normal distribution:
(1) Compute the expected value, $ E[x] = \int\limits^{\infty}_{-\infty} x P(x) dx = \mu $
(2) Compute the z-score for a given value, $ Z=\frac{(x-\mu)}{\sigma} $
(3) Known fractions of the data (e.g., 68%, 95%) fall within specified z-values of the mean (e.g., 1, 2).
(4) Integrate the density to obtain the probability of observing a value within a given interval: $ P[b<x<a] = \int\limits^{a}_{b} P(x) dx $
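A short sketch of reminders (2)-(4) using scipy.stats (the particular values of $x$, $a$, and $b$ are arbitrary):
import scipy.stats as stats

u, s = 0, 1
rv = stats.norm(loc=u, scale=s)

x = 1.5
print((x - u) / s)            # (2) z-score: number of standard deviations from the mean

print(rv.cdf(1) - rv.cdf(-1)) # (3) ~0.68 of the mass lies within one SD of the mean

a, b = 2.0, -1.0
print(rv.cdf(a) - rv.cdf(b))  # (4) P[b < x < a], by integrating the density (via the CDF)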
Thus, we know:
$$ P(\epsilon^i) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(\epsilon^i)^{2}}{2\sigma^2}\right) $$
Substituting for $\epsilon^i$ using our definition ($y^i = \theta^T x^i + \epsilon^i$) above:
$$ P(y^i \mid x^i;\theta) = \frac{1}{\sigma \sqrt{2 \pi}} \exp\left( \frac{ -(y^i - \theta^{T}x^i)^2 }{2 \sigma^2} \right) $$
Because we made the assumption that errors are Gaussian with $\mu=0$, the density is maximized as the squared error approaches zero, $(y^i - \theta^{T}x^i)^2 \rightarrow 0$.
The likelihood of $\theta$ given all the training data is the product of the probabilities of the individual examples, $i$:
$$L(\theta) = P(y_{tar} \mid X;\theta) = \prod\limits_{i=1}^m P(y^{i} \mid x^i;\theta) $$
So, we can simply find the $\theta$ that maximizes $L$ (or, equivalently, $\log L$):
$$ l(\theta) = \log L(\theta) = \log \prod\limits_{i=1}^m P(y^{i} \mid x^i;\theta) $$
$$ l(\theta) = m \log \frac{1}{\sigma \sqrt{2 \pi}} - \sum\limits_{i=1}^m \frac{(y^i - \theta^{T}x^i)^2}{2 \sigma^2} $$
Maximizing $L(\theta)$ is therefore the same as minimizing our initial least squares cost function, $J(\theta) = \frac{1}{2} \sum\limits_{i=1}^m (h_{\theta} (x^i) - y^i)^2$.
The least squares approach is in fact a special case of maximum likelihood.
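A quick numerical check of this equivalence (a sketch on mock data; the noise level and coefficients are made up): minimizing the negative Gaussian log-likelihood over $\theta$ recovers the same coefficients as ordinary least squares.
import numpy as np
from scipy.optimize import minimize

np.random.seed(1)
m = 200
X = np.column_stack([np.ones(m), np.random.rand(m)])
y = 2.0 + 5.0 * X[:, 1] + 0.3 * np.random.randn(m)   # mock data

def neg_log_likelihood(theta, sigma=0.3):
    # Up to constants, the negative Gaussian log-likelihood is RSS / (2 sigma^2)
    resid = y - np.dot(X, theta)
    return np.sum(resid ** 2) / (2.0 * sigma ** 2)

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
theta_ols = np.linalg.lstsq(X, y)[0]
print(theta_mle)   # the two estimates agree (both near [2, 5])
print(theta_ols)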
Rather than using an iterative algorithm, we can also try to find a closed-form solution for $\theta$ that minimizes $J(\theta)$.
After using matrix derivatives and some tricks (see CS229 notes), we end up with a simple expression:
$$ \nabla_{\theta} J(\theta) = X^T X \theta - X^T y_{tar} $$
Setting this to zero gives the value of $\theta$ at which the cost function no longer changes, i.e., the minimum (the normal equation):
$$ \theta = (X^T X)^{-1} X^T y_{tar} $$
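A minimal sketch of the closed-form solution on mock data (again, the data are made up for illustration); np.linalg.solve is used rather than forming the inverse explicitly, which is the numerically preferred route:
import numpy as np

np.random.seed(2)
m = 200
X = np.column_stack([np.ones(m), np.random.rand(m)])  # first column is the intercept
y = 2.0 + 5.0 * X[:, 1] + 0.3 * np.random.randn(m)    # mock data

# Normal equation: theta = (X^T X)^{-1} X^T y, solved as a linear system
theta = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
print(theta)   # close to [2, 5]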
Of course, estimating model parameters from training data is a general statistical approach. Consider the simple case of computing the sample mean as an estimate of a population mean:
$\hat{\mu} = \frac{1}{n} \sum\limits_{i=1}^n y_i $
If we were to average the means ($\hat{\mu}$) from many samples, the result would be close to the population mean.
The standard error of $\hat{\mu}$ tells us the accuracy of a single estimate:
$$ \mathrm{var}(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n} $$
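A small simulation (my own sketch; the population parameters are arbitrary) showing that the spread of sample means matches $\sigma/\sqrt{n}$:
import numpy as np

np.random.seed(3)
sigma, n, n_samples = 2.0, 50, 10000
# Draw many samples of size n and record each sample mean
means = np.random.normal(0.0, sigma, size=(n_samples, n)).mean(axis=1)
print(means.std())          # empirical standard error of the mean
print(sigma / np.sqrt(n))   # theoretical SE = sigma / sqrt(n), ~0.28 here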
Importantly, the $SE$ can be computed for the trained parameters of a linear regression (see the ISL text for the equations).
Note that $SE$ can be used to:
Confidence intervals give a range of values that, with a specified probability, contains the true, unknown value of the parameter.
For example, the $95\%$ confidence interval for the learned parameter $\hat{\beta_1}$ is approximately $\hat{\beta_1} \pm 2\, SE(\hat{\beta_1})$.
We can also perform hypothesis tests.
In this case, the null hypothesis is simply $\beta_1 = 0$, meaning no relationship between the output and feature $1$.
We can use the t-statistic to determine the number of $SE$ away from zero that our sampled parameter is.
$$ t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)} $$
Using a table, we can compute the probability of observing any value at least as extreme as $t$, assuming $\beta_1 = 0$, for the training set size.
As expected, this p-value drops as $t$ gets larger: p of 5% and 1% correspond to t-statistics of 2 and 2.75, respectively, for $n=30$.
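As a sketch, the same numbers can be checked with the t-distribution (degrees of freedom $n-2$ for simple linear regression; the values below are illustrative):
import scipy.stats as stats

n = 30
for t in (2.0, 2.75):
    p = 2 * stats.t.sf(t, df=n - 2)      # two-sided p-value: P(|T| >= t) under the null
    print("t = %.2f, p = %.3f" % (t, p)) # roughly 0.055 and 0.010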
Recall the sum of the squared deviations from the mean $ TSS = \sum\limits (y^i-y_{mean})^2 $.
This is related to variance:
$$ v = \frac{TSS}{n-1} $$
Of course, the variance is simply the square of the standard deviation, $ v = \sigma^2 $.
The variance is a special case of covariance, which is a measure of how two variables change together:
$$ Cov(a,b) = \frac{1}{n-1} \sum\limits_{i=1}^{n} (a_i - \mu_a) (b_i - \mu_b) $$
When the covariance is normalized by the standard deviations, one obtains the Pearson correlation coefficient:
$$ r = Cor(a,b) = \frac{Cov(a,b)}{\sigma_a \sigma_b} $$
This is a measure of the linear dependence between the two random variables.
An alternative measure of linear dependence is $R^2$, the proportion of variability in $Y$ that can be explained using $X$:
$$ R^2 = 1 - \frac{RSS}{TSS} $$
For example, if $R^2=0.9$, then 90% of the variance in $Y$ can be explained by our model, a function of $X$.
RSS measures the amount of variability that is left unexplained after performing the regression: $ RSS = \sum\limits (y^i-\theta^T x^i)^2 $
In the simple (single predictor) linear regression setting, squared Pearson correlation is equivalent to $R^2$.
This breaks if we have more than one predictor, as correlation quantifies association between a single pair of variables only.
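A quick sketch (mock data) confirming the single-predictor equivalence: the squared Pearson correlation matches $R^2 = 1 - RSS/TSS$ from a fitted simple regression.
import numpy as np
from scipy.stats import pearsonr

np.random.seed(4)
x = np.random.rand(100)
y = 1.0 + 2.0 * x + 0.2 * np.random.randn(100)  # mock data

# Fit a one-predictor linear regression and compute R^2 = 1 - RSS/TSS
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
print(1 - rss / tss)      # R^2

r, p = pearsonr(x, y)
print(r ** 2)             # squared Pearson correlation: the same value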
Let's load a dataset to examine this.
titanic = sns.load_dataset("titanic").dropna()
titanic.head(3)
| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | female | 38 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 3 | 1 | 1 | female | 35 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 6 | 0 | 1 | male | 54 | 0 | 0 | 51.8625 | S | First | man | True | E | Southampton | no | True |
Compute Pearson correlation.
from scipy.stats import pearsonr
print pearsonr(titanic.survived,titanic.pclass)
(-0.037697890718650616, 0.61338467913032124)
This returns $r$ along with a p-value.
The p-value indicates the probability of an uncorrelated system producing data with Pearson correlation at least as extreme as the one computed from these datasets.
Similarly, we can compute a correlation matrix with embedded $r$ values along with indications of significance.
sns.set_context(rc={"figure.figsize": (8, 8)})
sns.corrplot(titanic)
In scikit-learn, we use an estimator, a Python object that implements the methods fit(X, y) and predict(T).
In the case of linear regression, this simply implements Ordinary Least Squares (scipy.linalg.lstsq).
In addition, the estimator's score method returns a goodness-of-fit metric, the coefficient of determination $R^2$.
import sys,os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Read Ch. 2 data -
t=pd.read_csv(os.getcwd()+'/Datasets/Advertising.csv',index_col=False)
plt.scatter(t['TV'],t['Sales'],c='b')
# Linear model -
from sklearn.linear_model import LinearRegression
reg = LinearRegression() # estimator that implements the methods, fit and predict
feat=np.array([t['TV']]).transpose()
targ=np.array([t['Sales']]).transpose()
reg.fit(feat,targ)
predicted=reg.predict(feat)
plt.plot(feat,predicted,c='r')
print "The coefficient of determination is: %s"%reg.score(feat, targ)
print "The computed model parameter(s) is(are): %s"%reg.coef_
The coefficient of determination is: 0.61187505085
The computed model parameter(s) is(are): [[ 0.04753664]]
We can also use Seaborn to explore linear relationships.
It displays a scatterplot and regression line (with 95% confidence band to give an impression of the certainty in the model).
sns.lmplot("TV", "Sales", t)
We can also use a residual plot, which plots the model residual ($e^i = y^i - \hat{y}^i$) against the independent variable.
sns.residplot("TV", "Sales", t)
sns.jointplot("TV","Sales",t,kind="reg",color="seagreen") # Regression plot along with the marginal distributions of the two variables
Note that we previously used $R^2$ as a measure of model fit, resulting in $R^2 = 0.61187505085$.
In the case of simple linear regression, the squared Pearson correlation coefficient equals the coefficient of determination, so $r = \sqrt{R^2}$:
np.sqrt(0.61187505085)
0.78222442486156107
titanic = sns.load_dataset("titanic")
titanic.head(3)
| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
First, recall that we can easily use factor plots to split and plot the data by categorical labels.
sns.factorplot("survived",data=titanic,hue="sex",palette="Set1")
sns.factorplot("sex","age",data=titanic,hue="who",palette="Set1")
We can use facet grids to view the underlying distributions.
fg = sns.FacetGrid(titanic,hue="who",aspect=3,palette="Set1")
fg.map(sns.kdeplot, "age", shade=True)
fg.set(xlim=(0, 80))
Or, violin plots.
pruned=titanic.dropna()
sns.violinplot(pruned.age,pruned.who,"points",palette="Set1")
We can then explore linear relationships, using jitter to make the data easier to view.
sns.lmplot("age","survived",titanic,y_jitter=.05,palette="Set1")
We can also bin the data along $x$.
sns.lmplot("age","survived",titanic,y_jitter=.05,palette="Set1",x_bins=4)
age_bins = [15, 30, 45, 60] # Specify bins
sns.lmplot("age", "survived",titanic,hue="sex",palette="Set1",x_bins=age_bins)
Non-linearity: the true relationship between the response and the predictors may not be linear.
Of course, examining the residuals is useful to identify this (below).
# Mock non-linear data
t['non_linear'] = t.TV + 1.5 * t.TV ** 2
sns.lmplot("TV","non_linear",t,color="indianred")
# Residuals
sns.residplot(t.TV,t.non_linear,color="indianred")