These are notes I wrote for myself while studying for PhD qualifying exams at Stanford.
I wanted to publish them so that others could benefit.
They draw from two sources: Stanford's CS229 lecture notes and An Introduction to Statistical Learning (ISL).
Some of the figures are taken from ISL with permission from the authors.
Linear regression is a parametric linear model that captures the relationship between $y$ and $X$:
$y = h_{\theta}(X) = \sum\limits_{i=0}^n \theta_i x_i = \theta^T X $.
It is discriminative, meaning that it directly models the output $y$ given a set of features, $P(y \mid X)$.
Additive - The effect of changes in a predictor $X_j$ is independent of the values of the other predictors.
Linear - Change in the response Y due to a one-unit change in $X_j$ is constant, regardless of the value of $X_j$.
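As a minimal sketch (my own illustration, not from CS229 or ISL), the hypothesis is just a dot product between the parameter vector and a feature vector whose first entry is fixed at $1$ for the intercept term:
import numpy as np

def h(theta, x):
    # Hypothesis: h_theta(x) = theta^T x, where x[0] = 1 is the intercept feature
    return np.dot(theta, x)

theta = np.array([1.0, 0.5, -2.0])  # illustrative values: [intercept, theta_1, theta_2]
x = np.array([1.0, 3.0, 4.0])       # x_0 = 1, x_1 = 3, x_2 = 4
print(h(theta, x))                  # 1 + 0.5*3 - 2*4 = -5.5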
One idea is to learn parameters ($\theta$) that minimize the squared error between the model ($\theta^T X$) and measured target ($Y$) for our $m$ training examples.
In CS229, the objective is defined below (up to the factor of $\frac{1}{2}$, this is what ISL calls the residual sum of squares, or RSS):
$J(\theta) = \frac{1}{2} \sum\limits_{i=1}^m (h_{\theta} (X^i) - y^i)^2$
This cost function is a convex quadratic, meaning it has a single global minimum.
This is shown below using $\theta = (\beta_0,\beta_1)$.
import os
from IPython.display import Image
i = Image(filename=os.getcwd()+'/Images/RSS.png') # from the ISL text.
i
There is a point (red, above) in parameter space that minimizes the cost function ($RSS$, above).
The challenge is to find the parameters, $\theta = (\beta_0,\beta_1)$, at this minimum.
Because the function is convex, the gradient is nonzero everywhere except at the global minimum, so stepping against the gradient always decreases the cost:
$ \frac{\partial}{\partial \theta_j} J(\theta) = 0 $ only at the minimum.
So, we can use the gradient descent algorithm, repeating until convergence (where $\alpha$ is the learning rate):
$$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) $$
We can see that $\theta_j$ is updated until $\frac{\partial}{\partial \theta_j} J(\theta) = 0$ at the global minimum.
In order to use this, we need to know $\frac{\partial}{\partial \theta_j} J(\theta)$.
Starting with the cost function:
$$J(\theta) = \frac{1}{2} \sum\limits_{i=1}^m (h_{\theta} (x^i) - y^i)^2$$
Let's neglect the summation, so we will show this for one training example:
$$J(\theta) = \frac{1}{2} (h_{\theta} (x) - y)^2$$
We compute the partial with respect to one of our parameters, $\theta_j$:
$$ \frac{\partial J(\theta)}{\partial \theta_j} = \frac{\partial}{\partial \theta_j}\frac{1}{2} (h_{\theta} (x) - y)^2 $$
$$ \frac{\partial J(\theta)}{\partial \theta_j} = (h_{\theta} (x) - y) \frac{\partial}{\partial \theta_j} \left( \sum\limits_{k=0}^n \theta_k x_k - y \right) $$
$$ \frac{\partial J(\theta)}{\partial \theta_j} = (h_{\theta} (x) - y)\, x_j $$
This results in the least mean squares (LMS, or Widrow-Hoff) update rule:
$$ \theta_j := \theta_j + \alpha \left( y^i - h_{\theta}(x^i) \right) x_j^i $$
In practice, we can apply this rule in two ways (a NumPy sketch of both follows below):
Batch - on every step, sum the update over all $m$ training examples:
$$ \theta_j := \theta_j + \alpha \sum\limits_{i=1}^m \left( y^i - h_{\theta}(x^i) \right) x_j^i $$
Stochastic - update the parameters after each individual example. For $i$ in $1 \ldots m$:
$$ \theta_j := \theta_j + \alpha \left( y^i - h_{\theta}(x^i) \right) x_j^i $$
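A minimal NumPy sketch of both variants on mock data (the data, learning rates, and iteration counts are illustrative, not from the notes):
import numpy as np

np.random.seed(0)
m = 100
X = np.column_stack([np.ones(m), np.random.rand(m)])  # first column is x_0 = 1 (intercept)
y = 4.0 + 3.0 * X[:, 1] + 0.1 * np.random.randn(m)    # mock data: y = 4 + 3x + noise

# Batch gradient descent: each step sums the LMS update over all m examples
theta = np.zeros(2)
alpha = 0.5 / m   # learning rate, scaled by 1/m to keep the summed update stable
for _ in range(5000):
    theta = theta + alpha * np.dot(y - np.dot(X, theta), X)

# Stochastic gradient descent: update after every individual example
theta_sgd = np.zeros(2)
for _ in range(200):          # passes over the data
    for i in range(m):
        theta_sgd = theta_sgd + 0.05 * (y[i] - np.dot(theta_sgd, X[i])) * X[i]

print(theta)      # both estimates should land near [4, 3]
print(theta_sgd)
Stochastic updates are noisier but touch only one example per step, which is why they are often preferred on large datasets.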
Our model prediction deviates from the actual value by an error term: $y^i = \theta^T x^i + \epsilon^i$.
Let's assume the errors are IID (independently and identically distributed) according to a Gaussian distribution with mean zero and some variance $\sigma^2$.
Recall the probability density function, which tells us the relative likelihood of this random variable taking on a given value:
$$ P(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^{2}}{2\sigma^2}\right) $$
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
u,s = 0,1 # mean and standard deviation
rv = stats.norm(loc=u,scale=s)
x = np.arange(-5,5,0.1)
plt.plot(x,rv.pdf(x),lw=2,color='b')
Useful reminders for the Normal distribution:
(1) Compute the expected value, $ E[x] = \int\limits^{\infty}_{-\infty} x P(x) dx = \mu $
(2) Compute the z-score for a given value, $ Z=\frac{(x-\mu)}{\sigma} $
(3) Known fractions of the data (e.g., 68%, 95%) fall within specified z-values of the mean (e.g., 1, 2).
(4) Integrate the density to obtain the probability of observing a value within a given interval: $ P[b<x<a] = \int\limits^{a}_{b} P(x) dx $
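A short sketch of reminders (2)-(4) using scipy.stats (the particular values of $x$, $a$, and $b$ are arbitrary):
import scipy.stats as stats

u, s = 0, 1
rv = stats.norm(loc=u, scale=s)

x = 1.5
print((x - u) / s)            # (2) z-score: number of standard deviations from the mean

print(rv.cdf(1) - rv.cdf(-1)) # (3) ~0.68 of the mass lies within one SD of the mean

a, b = 2.0, -1.0
print(rv.cdf(a) - rv.cdf(b))  # (4) P[b < x < a], by integrating the density (via the CDF)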
Thus, we know:
$$ P(\epsilon^i) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(\epsilon^i)^{2}}{2\sigma^2}\right) $$
Substituting for $\epsilon^i$ using our definition ($y^i = \theta^T x^i + \epsilon^i$) above:
$$ P(y^i \mid x^i;\theta) = \frac{1}{\sigma \sqrt{2 \pi}} \exp\left( \frac{ -(y^i - \theta^{T}x^i)^2 }{2 \sigma^2} \right) $$
Because we made the assumption that errors are Gaussian with $\mu=0$, the density is maximized as the squared error approaches zero, $(y^i - \theta^{T}x^i)^2 \rightarrow 0$.
The likelihood of $\theta$ given all the training data is the product of the probabilities of the individual examples, $i$:
$$L(\theta) = P(y_{tar} \mid X;\theta) = \prod\limits_{i=1}^m P(y^{i} \mid x^i;\theta) $$
So, we can simply find the $\theta$ that maximizes $L$ (or, equivalently, $\log L$):
$$ l(\theta) = \log L(\theta) = \log \prod\limits_{i=1}^m P(y^{i} \mid x^i;\theta) $$
$$ l(\theta) = m \log \frac{1}{\sigma \sqrt{2 \pi}} - \sum\limits_{i=1}^m \frac{(y^i - \theta^{T}x^i)^2}{2 \sigma^2} $$
Maximizing $L(\theta)$ is therefore the same as minimizing our initial least squares cost function, $J(\theta) = \frac{1}{2} \sum\limits_{i=1}^m (h_{\theta} (x^i) - y^i)^2$.
The least squares approach is in fact a special case of maximum likelihood.
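A quick numerical check of this equivalence (a sketch on mock data; the noise level and coefficients are made up): minimizing the negative Gaussian log-likelihood over $\theta$ recovers the same coefficients as ordinary least squares.
import numpy as np
from scipy.optimize import minimize

np.random.seed(1)
m = 200
X = np.column_stack([np.ones(m), np.random.rand(m)])
y = 2.0 + 5.0 * X[:, 1] + 0.3 * np.random.randn(m)   # mock data

def neg_log_likelihood(theta, sigma=0.3):
    # Up to constants, the negative Gaussian log-likelihood is RSS / (2 sigma^2)
    resid = y - np.dot(X, theta)
    return np.sum(resid ** 2) / (2.0 * sigma ** 2)

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
theta_ols = np.linalg.lstsq(X, y)[0]
print(theta_mle)   # the two estimates agree (both near [2, 5])
print(theta_ols)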
Rather than using an iterative algorithm, we can also try to find a closed-form solution for $\theta$ that minimizes $J(\theta)$.
After using matrix derivatives and some tricks (see CS229 notes), we end up with a simple expression:
$$ \nabla_{\theta} J(\theta) = X^T X \theta - X^T y_{tar} $$
Setting this to zero gives the value of $\theta$ at which the cost function no longer changes, i.e., the minimum (the normal equation):
$$ \theta = (X^T X)^{-1} X^T y_{tar} $$
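A minimal sketch of the closed-form solution on mock data (again, the data are made up for illustration); np.linalg.solve is used rather than forming the inverse explicitly, which is the numerically preferred route:
import numpy as np

np.random.seed(2)
m = 200
X = np.column_stack([np.ones(m), np.random.rand(m)])  # first column is the intercept
y = 2.0 + 5.0 * X[:, 1] + 0.3 * np.random.randn(m)    # mock data

# Normal equation: theta = (X^T X)^{-1} X^T y, solved as a linear system
theta = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
print(theta)   # close to [2, 5]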
Of course, estimating model parameters from training data is a general statistical approach. Consider the simple case of computing the sample mean as an estimate of a population mean:
$\hat{\mu} = \frac{1}{n} \sum\limits_{i=1}^n y_i $
If we were to average the means ($\hat{\mu}$) from many samples, the result would be close to the population mean.
The standard error of $\hat{\mu}$ tells us the accuracy of a single estimate:
$$ \mathrm{var}(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n} $$
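A small simulation (my own sketch; the population parameters are arbitrary) showing that the spread of sample means matches $\sigma/\sqrt{n}$:
import numpy as np

np.random.seed(3)
sigma, n, n_samples = 2.0, 50, 10000
# Draw many samples of size n and record each sample mean
means = np.random.normal(0.0, sigma, size=(n_samples, n)).mean(axis=1)
print(means.std())          # empirical standard error of the mean
print(sigma / np.sqrt(n))   # theoretical SE = sigma / sqrt(n), ~0.28 here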
Importantly, the $SE$ can be computed for the trained parameters of a linear regression (see the ISL text for the equations).
Note that $SE$ can be used to:
Confidence intervals give a range of values that, with a specified probability, contains the true, unknown value of the parameter.
For example, the $95\%$ confidence interval for the learned parameter $\hat{\beta_1}$ is approximately $\hat{\beta_1} \pm 2\, SE(\hat{\beta_1})$.
We can also perform hypothesis tests.
In this case, the null hypothesis is simply $\beta_1 = 0$, meaning no relationship between the output and feature $1$.
We can use the t-statistic to determine the number of $SE$ away from zero that our sampled parameter is.
$$ t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)} $$
Using a table, we can compute the probability of observing any value at least as extreme as $t$, assuming $\beta_1 = 0$, for the training set size.
As expected, this p-value drops as $t$ gets larger: p of 5% and 1% correspond to t-statistics of 2 and 2.75, respectively, for $n=30$.
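As a sketch, the same numbers can be checked with the t-distribution (degrees of freedom $n-2$ for simple linear regression; the values below are illustrative):
import scipy.stats as stats

n = 30
for t in (2.0, 2.75):
    p = 2 * stats.t.sf(t, df=n - 2)      # two-sided p-value: P(|T| >= t) under the null
    print("t = %.2f, p = %.3f" % (t, p)) # roughly 0.055 and 0.010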
Recall the sum of the squared deviations from the mean $ TSS = \sum\limits (y^i-y_{mean})^2 $.
This is related to variance:
$$ v = \frac{TSS}{n-1} $$
Of course, the variance is simply the square of the standard deviation, $ v = \sigma^2 $.
The variance is a special case of covariance, which is a measure of how two variables change together:
$$ Cov(a,b) = \frac{1}{n-1} \sum\limits_{i=1}^{n} (a_i - \mu_a) (b_i - \mu_b) $$
When the covariance is normalized by the standard deviations, one obtains the Pearson correlation coefficient:
$$ r = Cor(a,b) = \frac{Cov(a,b)}{\sigma_a \sigma_b} $$
This is a measure of the linear dependence between the two random variables.
An alternative measure of linear dependence is $R^2$, the proportion of variability in $Y$ that can be explained using $X$:
$$ R^2 = 1 - \frac{RSS}{TSS} $$
For example, if $R^2=0.9$, then 90% of the variance in $Y$ can be explained by our model, a function of $X$.
RSS measures the amount of variability that is left unexplained after performing the regression: $ RSS = \sum\limits (y^i-\theta^T x^i)^2 $
In the simple (single predictor) linear regression setting, squared Pearson correlation is equivalent to $R^2$.
This breaks if we have more than one predictor, as correlation quantifies association between a single pair of variables only.
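A quick sketch (mock data) confirming the single-predictor equivalence: the squared Pearson correlation matches $R^2 = 1 - RSS/TSS$ from a fitted simple regression.
import numpy as np
from scipy.stats import pearsonr

np.random.seed(4)
x = np.random.rand(100)
y = 1.0 + 2.0 * x + 0.2 * np.random.randn(100)  # mock data

# Fit a one-predictor linear regression and compute R^2 = 1 - RSS/TSS
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
print(1 - rss / tss)      # R^2

r, p = pearsonr(x, y)
print(r ** 2)             # squared Pearson correlation: the same value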
Let's load a dataset to examine this.
titanic = sns.load_dataset("titanic").dropna()
titanic.head(3)
| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | female | 38 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 3 | 1 | 1 | female | 35 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 6 | 0 | 1 | male | 54 | 0 | 0 | 51.8625 | S | First | man | True | E | Southampton | no | True |
Compute Pearson correlation.
from scipy.stats import pearsonr
print pearsonr(titanic.survived,titanic.pclass)
(-0.037697890718650616, 0.61338467913032124)
This returns $r$ along with a p-value.
The p-value indicates the probability of an uncorrelated system producing data with Pearson correlation at least as extreme as the one computed from these datasets.
Similarly, we can compute a correlation matrix with embedded $r$ values along with indications of significance.
sns.set_context(rc={"figure.figsize": (8, 8)})
sns.corrplot(titanic)
In scikit-learn, we use an estimator, a Python object that implements the methods fit(X, y) and predict(T).
In the case of linear regression, this simply implements Ordinary Least Squares (scipy.linalg.lstsq).
In addition, the estimator's score method returns a goodness-of-fit metric, the coefficient of determination $R^2$.
import sys,os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Read Ch. 2 data -
t=pd.read_csv(os.getcwd()+'/Datasets/Advertising.csv',index_col=False)
plt.scatter(t['TV'],t['Sales'],c='b')
# Linear model -
from sklearn.linear_model import LinearRegression
reg = LinearRegression() # estimator that implements the methods, fit and predict
feat=np.array([t['TV']]).transpose()
targ=np.array([t['Sales']]).transpose()
reg.fit(feat,targ)
predicted=reg.predict(feat)
plt.plot(feat,predicted,c='r')
print "The coefficient of determination is: %s"%reg.score(feat, targ)
print "The computed model parameter(s) is(are): %s"%reg.coef_
The coefficient of determination is: 0.61187505085
The computed model parameter(s) is(are): [[ 0.04753664]]
We can also use Seaborn to explore linear relationships.
It displays a scatterplot and regression line (with 95% confidence band to give an impression of the certainty in the model).
sns.lmplot("TV", "Sales", t)
We can also use a residual plot, which plots the model residual ($e^i = y^i - \hat{y}^i$) against the independent variable.
sns.residplot("TV", "Sales", t)
sns.jointplot("TV","Sales",t,kind="reg",color="seagreen") # Regression plot along with the marginal distributions of the two variables
Note that we previously used $R^2$ as a measure of model fit, resulting in $R^2 = 0.61187505085$.
In the case of simple linear regression, the squared Pearson correlation coefficient equals the coefficient of determination, so $r = \sqrt{R^2}$:
np.sqrt(0.61187505085)
0.78222442486156107
titanic = sns.load_dataset("titanic")
titanic.head(3)
| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
First, recall that we can easily use factor plots to split and plot the data by categorical labels.
sns.factorplot("survived",data=titanic,hue="sex",palette="Set1")
sns.factorplot("sex","age",data=titanic,hue="who",palette="Set1")
We can use facet grids to view the underlying distributions.
fg = sns.FacetGrid(titanic,hue="who",aspect=3,palette="Set1")
fg.map(sns.kdeplot, "age", shade=True)
fg.set(xlim=(0, 80))
Or, violin plots.
pruned=titanic.dropna()
sns.violinplot(pruned.age,pruned.who,"points",palette="Set1")
We can then explore linear relationships, using jitter to make the data easier to view.
sns.lmplot("age","survived",titanic,y_jitter=.05,palette="Set1")
We can also bin the data along $x$.
sns.lmplot("age","survived",titanic,y_jitter=.05,palette="Set1",x_bins=4)
age_bins = [15, 30, 45, 60] # Specify bins
sns.lmplot("age", "survived",titanic,hue="sex",palette="Set1",x_bins=age_bins)
Non-linearity: the true relationship between the response and the predictors may not be linear.
Of course, examining the residuals is useful to identify this (below).
# Mock non-linear data
t['non_linear'] = t.TV + 1.5 * t.TV ** 2
sns.lmplot("TV","non_linear",t,color="indianred")
# Residuals
sns.residplot(t.TV,t.non_linear,color="indianred")