Simple linear regression

The first type of model, which we will spend a lot of time on, is the simple linear regression model. One simple way to think of it is via scatter plots. Below are heights of mothers and daughters collected by Karl Pearson in the late 19th century.

In [1]:
%%R -o M,D
library(alr3)
data(heights)
M = heights$Mheight
D = heights$Dheight
Loading required package: car
Loading required package: MASS
Loading required package: nnet

Attaching package: ‘alr3’

The following object(s) are masked from ‘package:MASS’:

    forbes


In [2]:
heights_fig = plt.figure(figsize=(10,6))
axes = heights_fig.gca()
axes.scatter(M, D, c='red')
axes.set_xlabel("Mother's height (inches)", size=20)
axes.set_ylabel("Daughter's height (inches)", size=20)
Out[2]:
<matplotlib.text.Text at 0x97e3e10>

A simple linear regression model fits a line through the above scatter plot in a particular way. Specifically, it tries to estimate the height of a new daughter in this population, say $D_{new}$, whose mother has height $M_{new}$. It does this by considering each slice of the data. Here is a slice of the data near $M=66$; the slice is taken over a window of width 1 inch.

In [3]:
X = 66
xf, yf = mlab.poly_between([X-.5,X+.5], [50,50], [75, 75])
selected_points = (M <= X+.5) * (M >= X-.5)
mean_within_slice = D[selected_points].mean()
scatterplot_slice = axes.fill(xf, yf, facecolor='blue', alpha=0.1, hatch='/')[0]
axes.scatter([X],[mean_within_slice], s=130, c='yellow', marker='^')
heights_fig
Out[3]:
In [4]:
print mean_within_slice
65.1733333333

We see that, in our sample, the average height of daughters whose mothers' heights fell within our slice is about 65.2 inches. Of course, this average height varies by slice. For instance, at 60 inches:

In [5]:
X = 60
selected_points = (M <= X+.5) * (M >= X-.5)
mean_within_slice = D[selected_points].mean()
print mean_within_slice
62.4282894737

The regression model puts a line through this scatter plot in an optimal fashion.

In [6]:
%%R -o slope,intercept
parameters = lm(D~M)$coef
print(parameters)
intercept = parameters[1]
slope = parameters[2]
(Intercept)           M 
  29.917437    0.541747 

In [7]:
axes.plot([M.min(), M.max()], [intercept + slope * M.min(), intercept + slope * M.max()],
          linewidth=3)
heights_fig
Out[7]:

What is a "regression" model?

A regression model is a model of the relationships between some covariates (predictors) and an outcome. Specifically, regression is a model of the average outcome given the covariates.

Mathematical formulation

For the mother-daughter height data, a mathematical model is: $$ {\tt Daughter} = f({\tt Mother}) + \varepsilon $$ where $f$ gives the average height of the daughters of mothers of height ${\tt Mother}$ and $\varepsilon$ is the random variation within the slice.

Linear regression models

  • A linear regression model says that the function $f$ is a sum (linear combination) of functions of ${\tt Mother}$.

  • Simple linear regression model: $$f({\tt Mother}) = \beta_0 + \beta_1 \cdot {\tt Mother}$$ for some unknown parameter vector $(\beta_0, \beta_1)$.

  • Could also be a sum (linear combination) of fixed functions of Mother: $$f({\tt Mother}) = \beta_0 + \beta_1 \cdot {\tt Mother} + \beta_2 \cdot {\tt Mother}^2 $$

Simple linear regression model

  • Simple linear regression is the case when there is only one predictor: $$ f({\tt Mother}) = \beta_0 + \beta_1 \cdot {\tt Mother}.$$
  • Let $Y_i$ be the height of the $i$-th daughter in the sample, $X_i$ be the height of the $i$-th mother.

  • Model: $$ Y_i = \underbrace{\beta_0 + \beta_1 X_i}_{\text{regression equation}} + \underbrace{\varepsilon_i}_{\text{error}}$$ where $\varepsilon_i \sim N(0, \sigma^2)$ are independent.

  • This specifies a distribution for the $Y$'s given the $X$'s, i.e. it is a statistical model.
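To see what the model says concretely, here is a minimal simulation sketch that generates data from $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ with independent normal errors; the parameter values (and the variable names) below are made up purely for illustration.

import numpy as np

# made-up parameter values, chosen only for illustration
beta_0, beta_1, sigma = 30.0, 0.5, 2.25

n = 1375                                       # same sample size as the heights data
X_sim = np.random.uniform(55, 71, size=n)      # simulated "mothers' heights"
eps = sigma * np.random.standard_normal(n)     # independent N(0, sigma^2) errors
Y_sim = beta_0 + beta_1 * X_sim + eps          # "daughters' heights" generated by the model

# within a slice near X=66, the average of Y should be close to f(66) = beta_0 + beta_1 * 66
slice_66 = (X_sim >= 65.5) & (X_sim <= 66.5)
print(Y_sim[slice_66].mean())
print(beta_0 + beta_1 * 66)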

Fitting the model

  • We will be using least squares regression. This measures the goodness of fit of a line by the sum of squared errors, $SSE$.

  • Least squares regression chooses the line that minimizes $$ SSE(\beta_0, \beta_1) = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 \cdot X_i)^2.$$

  • In principle, we might measure goodness of fit differently: $$ SAD(\beta_0, \beta_1) = \sum_{i=1}^n |Y_i - \beta_0 - \beta_1 \cdot X_i|.$$

  • For a general loss function $\rho$, we might try to minimize $$ L(\beta_0,\beta_1) = \sum_{i=1}^n \rho(Y_i-\beta_0-\beta_1X_i) $$ (the SSE and SAD criteria are evaluated concretely in the sketch after this list).
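Here is that sketch (a minimal one, assuming the arrays M and D and the intercept and slope exported from R above are still defined), evaluating both criteria at the fitted line and at a deliberately worse candidate:

import numpy as np

def SSE(b0, b1):
    # sum of squared errors for the line y = b0 + b1 * x over the heights data
    return np.sum((D - b0 - b1 * M)**2)

def SAD(b0, b1):
    # sum of absolute deviations for the same line
    return np.sum(np.fabs(D - b0 - b1 * M))

# the least squares line found by R above, then a line shifted up by 5 inches
print(SSE(intercept, slope), SAD(intercept, slope))
print(SSE(intercept + 5, slope), SAD(intercept + 5, slope))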

Why least squares?

  • With least squares, the minimizers have explicit formulae. This matters less with today's computing power, which can minimize many loss functions numerically, especially when the loss is convex.

  • The resulting formulae are linear in the outcome $Y$. This is important for inferential reasons. If we only care about predictive power, this is also not so important.

  • If assumptions are correct, then this is maximum likelihood estimation.

  • Statistical theory tells us the maximum likelihood estimators (MLEs) are generally good estimators.
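To see the connection between least squares and maximum likelihood under the model's normality assumption, note (a one-line sketch) that the log-likelihood of the data under $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, with $\varepsilon_i \sim N(0,\sigma^2)$ independent, is $$ \ell(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2 = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{SSE(\beta_0,\beta_1)}{2\sigma^2}, $$ so, for any fixed $\sigma^2$, maximizing the likelihood over $(\beta_0,\beta_1)$ is the same as minimizing $SSE(\beta_0,\beta_1)$.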

Choice of loss function

The choice of the function we use to measure goodness of fit, or the loss function, has an effect on what sort of estimates we get out of our procedure. For instance, if, instead of fitting a line to a scatterplot, we were estimating the center of a distribution, which we denote by $\mu$, then we might consider minimizing several loss functions.

  • If we choose the sum of squared errors $$ SSE(\mu) = \sum_{i=1}^n (Y_i - \mu)^2, $$ then the minimizer of $SSE(\mu)$ is the sample mean.

  • On the other hand, if we choose the sum of the absolute errors $$ SAD(\mu) = \sum_{i=1}^n |Y_i - \mu|, $$ then the resulting minimizer is the sample median.

  • Both of these minimization problems have population versions as well. For instance, the population mean minimizes, as a function of $\mu$, $$ \mathbb{E}((Y-\mu)^2) $$ while the population median minimizes $$ \mathbb{E}(|Y-\mu|). $$ (A quick numerical check of the sample versions follows this list.)
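Here is that check: a minimal sketch on a simulated sample (the numbers are arbitrary), confirming that the sample mean minimizes $SSE(\mu)$ and the sample median minimizes $SAD(\mu)$.

import numpy as np

Y_samp = np.random.standard_normal(100) * 2 + 1.5         # an arbitrary simulated sample
centers = np.linspace(Y_samp.min(), Y_samp.max(), 2001)    # grid of candidate values of mu

sse = np.array([np.sum((Y_samp - c)**2) for c in centers])
sad = np.array([np.sum(np.fabs(Y_samp - c)) for c in centers])

# the grid minimizer of SSE should be (approximately) the sample mean,
# and the grid minimizer of SAD should be (approximately) the sample median
print(centers[sse.argmin()], Y_samp.mean())
print(centers[sad.argmin()], np.median(Y_samp))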

Visualizing the loss function

Let's generate a random scatter plot and view the loss function.

In [8]:
X = np.random.standard_normal(50)
Y = np.random.standard_normal(50) * 2 + 1.5 + 0.1 * X
plt.scatter(X, Y)
Out[8]:
<matplotlib.collections.PathCollection at 0xa8382d0>

We've briefly seen how to fit a model in R. Let's take a look at how we might fit it in Python directly.

In [9]:
import statsmodels.api as sm
design = sm.add_constant(X)
intercept, slope = sm.OLS(Y,design).fit().params
intercept, slope, np.sum((Y - intercept - slope*X)**2)
Out[9]:
(1.3915332980559882, 0.093151878839468488, 186.94312652152718)

Now let's plot the loss as a function of the parameters. Note that the true intercept is 1.5 while the true slope is 0.1.

In [10]:
squared_error_loss = plt.figure(figsize=(10,6))
grid = np.mgrid[(intercept-2):(intercept+2):100j,(slope-2):(slope+2):100j]
squared_errors = (Y[:,None,None] - grid[0] - X[:,None,None] * grid[1])**2
plt.imshow(-squared_errors.sum(0), origin='lower')
axes = squared_error_loss.gca()
axes.set_xticks([])
axes.set_yticks([])
plt.colorbar()
Out[10]:
<matplotlib.colorbar.Colorbar instance at 0xce74878>
In [11]:
squared_error_loss
Out[11]:

Let's contrast this with the sum of absolute errors.

In [12]:
absolute_error_loss = plt.figure(figsize=(10,6))
absolute_errors = np.fabs(Y[:,None,None] - grid[0] - X[:,None,None] * grid[1])
plt.imshow(-absolute_errors.sum(0), origin='lower')
axes = absolute_error_loss.gca()
axes.set_xticks([])
axes.set_yticks([])
plt.colorbar()
Out[12]:
<matplotlib.colorbar.Colorbar instance at 0xcee7bc0>
In [13]:
absolute_error_loss
Out[13]:

Geometry of least squares

The following picture will be with us, in various guises, throughout much of the course. It depicts the geometry involved in least squares regression.

It requires some imagination, but the picture should be thought of as representing vectors in $n$-dimensional space, where $n$ is the number of points in the scatterplot. In our height data, $n=1375$. The bottom two axes should be thought of as 2-dimensional, while the axis marked "$\perp$" should be thought of as $(n-2)$-dimensional, or 1373-dimensional in this case.

Important lengths

The (squared) lengths of the above vectors are important quantities in what follows.

There are three to note: $$ \begin{aligned} SSE &= \sum_{i=1}^n(Y_i - \widehat{Y}_i)^2 = \sum_{i=1}^n (Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 X_i)^2 \\ SSR &= \sum_{i=1}^n(\overline{Y} - \widehat{Y}_i)^2 = \sum_{i=1}^n (\overline{Y} - \widehat{\beta}_0 - \widehat{\beta}_1 X_i)^2 \\ SST &= \sum_{i=1}^n(Y_i - \overline{Y})^2 = SSE + SSR \\ R^2 &= \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \widehat{Cor}(\pmb{X},\pmb{Y})^2. \end{aligned} $$

An important summary of the fit is the ratio $$ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} $$ which measures how much variability in $Y$ is explained by $X$.
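These identities can be checked numerically on the heights data (a minimal sketch, assuming M, D and the fitted intercept and slope from the cells above are still defined):

import numpy as np

Y_hat = intercept + slope * M                  # fitted values from the least squares line
SSE = np.sum((D - Y_hat)**2)                   # error sum of squares
SSR = np.sum((D.mean() - Y_hat)**2)            # regression sum of squares
SST = np.sum((D - D.mean())**2)                # total sum of squares

print(SST, SSE + SSR)                          # SST = SSE + SSR (up to rounding)
print(SSR / SST, np.corrcoef(M, D)[0,1]**2)    # R^2 equals the squared sample correlation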

Example: wages vs. experience

In this example, we'll look at the output of lm for the wage data and verify that some of the equations we present for the least squares solutions agree with the output. The data was compiled from a study in econometrics, "Learning about Heterogeneity in Returns to Schooling".

In [14]:
%%R
url = 'http://stats191.stanford.edu/data/wage.csv'
wages = read.table(url, sep=',', header=T)
print(head(wages))
mean(logwage)
  education  logwage
1  16.75000 2.845000
2  15.00000 2.446667
3  10.00000 1.560000
4  12.66667 2.099167
5  15.00000 2.490000
6  15.00000 2.330833
Error in mean(logwage) : object 'logwage' not found

In order to access the variables in wages directly, we attach it so that its variables are in the top-level namespace.

In [15]:
%%R
attach(wages)
mean(logwage)
[1] 2.240279

Let's fit the linear regression model.

In [16]:
%%R
wages.lm = lm(logwage ~ education)
print(wages.lm)

Call:
lm(formula = logwage ~ education)

Coefficients:
(Intercept)    education  
     1.2392       0.0786  


As in the mother-daughter data, we might want to plot the data and add the regression line.

In [17]:
%%R -h 800 -w 800
plot(education, logwage, pch=23, bg='red', cex=2)
abline(wages.lm, lwd=4, col='black')

Least squares estimators

There are explicit formulae for the least squares estimators, i.e. the minimizers of the error sum of squares.

For the slope, $\hat{\beta}_1$, it can be shown that $$ \widehat{\beta}_1 = \frac{\sum_{i=1}^n(X_i - \overline{X})(Y_i - \overline{Y} )}{\sum_{i=1}^n (X_i-\overline{X})^2} = \frac{\widehat{Cov}(X,Y)}{\widehat{Var}( X)}.$$

Knowing the slope estimate, the intercept estimate can be found easily: $$ \widehat{\beta}_0 = \overline{Y} - \widehat{\beta}_1 \cdot \overline{ X}.$$
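These formulae are easy to verify in Python on the heights data (a minimal sketch, assuming M and D are still defined); the results should match the coefficients R reported above, roughly 29.92 and 0.54.

import numpy as np

beta_1_hat = np.cov(M, D)[0, 1] / np.var(M, ddof=1)   # Cov(X, Y) / Var(X), both with the n-1 convention
beta_0_hat = D.mean() - beta_1_hat * M.mean()

print(beta_0_hat, beta_1_hat)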

Wages example

In [18]:
%%R
beta.1.hat = cov(education, logwage) / var(education)
beta.0.hat = mean(logwage) - beta.1.hat * mean(education)
print(c(beta.0.hat, beta.1.hat))
print(coef(wages.lm))
[1] 1.23919433 0.07859951
(Intercept)   education 
 1.23919433  0.07859951 

Estimate of $\sigma^2$

There is one final quantity needed to estimate all of the parameters in our (statistical) model for the scatterplot. This is $\sigma^2$, the variance of the random variation within each slice (the regression model assumes this variance is the same in every slice).

The estimate most commonly used is $$ \hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^n (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2 = \frac{SSE}{n-2} = MSE $$

Above, note the practice of replacing the quantity $SSE(\hat{\beta}_0,\hat{\beta}_1)$, i.e. the minimum of this function, with just $SSE$.

The term MSE above refers to mean squared error: a sum of squares divided by what we call its degrees of freedom. The degrees of freedom of $SSE$, the error sum of squares, is therefore $n-2$. Remember that this $n-2$ corresponds to the axis marked "$\perp$" in the picture above.

Using some statistical calculations that we will not dwell on, if our simple linear regression model is correct, then we can see that $$ \frac{\hat{\sigma}^2}{\sigma^2} \sim \frac{\chi^2_{n-2}}{n-2} $$ where the right hand side denotes a chi-squared distribution with $n-2$ degrees of freedom.
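We will not prove this, but a small simulation can make it plausible (a minimal sketch with made-up parameter values): if $\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n-2}/(n-2)$, then it should have mean 1 and variance $2/(n-2)$.

import numpy as np

np.random.seed(0)
beta_0, beta_1, sigma, n = 1.0, 2.0, 3.0, 30           # made-up "true" values
X_fix = np.random.uniform(0, 10, n)                     # hold the X's fixed across replications

ratios = []
for _ in range(5000):
    Y_rep = beta_0 + beta_1 * X_fix + sigma * np.random.standard_normal(n)
    b1 = np.cov(X_fix, Y_rep)[0, 1] / np.var(X_fix, ddof=1)
    b0 = Y_rep.mean() - b1 * X_fix.mean()
    sigma_hat_sq = np.sum((Y_rep - b0 - b1 * X_fix)**2) / (n - 2)
    ratios.append(sigma_hat_sq / sigma**2)
ratios = np.array(ratios)

print(ratios.mean())          # should be close to 1
print(ratios.var())           # should be close to 2 / (n - 2)
print(2.0 / (n - 2))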

Wages example

In [19]:
%%R
sigma.hat = sqrt(sum(resid(wages.lm)^2) / wages.lm$df.resid)
sigma.hat
[1] 0.4037828

The summary from R also contains this estimate of $\sigma$:

In [20]:
%%R
summary(wages.lm)

Call:
lm(formula = logwage ~ education)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.78239 -0.25265  0.01636  0.27965  1.61101 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.239194   0.054974   22.54   <2e-16 ***
education   0.078600   0.004262   18.44   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.4038 on 2176 degrees of freedom
Multiple R-squared: 0.1351,	Adjusted R-squared: 0.1347 
F-statistic:   340 on 1 and 2176 DF,  p-value: < 2.2e-16 


Inference for the simple linear regression model

What do we mean by inference?

  • Generally, by inference, we mean "learning something about the relationship between the sample $(X_1, \dots, X_n)$ and $(Y_1, \dots, Y_n)$."

  • In the simple linear regression model, this often means learning about $\beta_0, \beta_1$. Particular forms of inference are confidence intervals or hypothesis tests. More on these later.

  • Most of the questions of inference in this course can be answered in terms of $t$-statistics or $F$-statistics.

  • First we will talk about $t$-statistics, later $F$-statistics.

Examples of (statistical) hypotheses

  • One sample problem: given an independent sample $\pmb{X}=(X_1, \dots, X_n)$ where $X_i\sim N(\mu,\sigma^2)$, the null hypothesis $H_0:\mu=\mu_0$ says that in fact the population mean is some specified value $\mu_0$.

  • Two sample problem: given two independent samples $\pmb{Z}=(Z_1, \dots, Z_n)$, $\pmb{W}=(W_1, \dots, W_m)$ where $Z_i\sim N(\mu_1,\sigma^2)$ and $W_i \sim N(\mu_2, \sigma^2)$, the null hypothesis $H_0:\mu_1=\mu_2$ says that in fact the population means from which the two samples are drawn are identical.

Testing a hypothesis

We test a null hypothesis, $H_0$, based on some test statistic $T$ whose distribution is fully known when $H_0$ is true.

For example, in the one-sample problem, if $\bar{X}$ is the sample mean of our sample $(X_1, \dots, X_n)$ and $$ S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i-\bar{X})^2 $$ is the sample variance, then $$ T = \frac{\bar{X}-\mu_0}{S/\sqrt{n}} $$ has what is called a Student's $t$ distribution with $n-1$ degrees of freedom when $H_0:\mu=\mu_0$ is true. When the null hypothesis is not true, it does not have this distribution!
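As a concrete sketch on a simulated sample (the data below are made up), we can compute $T$ by hand and compare it with scipy.stats.ttest_1samp:

import numpy as np
from scipy.stats import ttest_1samp

mu_0 = 0                                           # hypothesized mean under H_0
sample = np.random.standard_normal(40) + 0.3       # a made-up sample of size 40
n = sample.shape[0]

X_bar = sample.mean()
S = sample.std(ddof=1)                             # sample standard deviation (divides by n - 1)
T = (X_bar - mu_0) / (S / np.sqrt(n))              # t statistic with n - 1 degrees of freedom

print(T)
print(ttest_1samp(sample, mu_0))                   # should report the same statistic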

General form of a (Student's) $T$ statistic

  • A $t$ statistic with $k$ degrees of freedom has a form that becomes easy to recognize after seeing it several times.

  • It has two main parts: a numerator and a denominator. The numerator is $Z \sim N(0,1)$, while the denominator is $D \sim \sqrt{\chi^2_k/k}$, assumed independent of $Z$.

  • The $t$-statistic has the form $$ T = \frac{Z}{D}. $$

  • Another form of the $t$-statistic is $$ T = \frac{\text{estimate of parameter} - \text{true parameter}}{\text{accuracy of the estimate}}. $$

  • In more formal terms, we write this as $$ T = \frac{\hat{\theta} - \theta}{SE(\hat{\theta})}. $$ Note that the denominator is the accuracy of the estimate and not the true parameter (which is usually assumed fixed, at least for now). The term $SE$, or standard error, will, in this course, usually refer to an estimate of the accuracy of an estimator. Therefore, it is often the square root of an estimate of the variance of an estimator.

  • In our simple linear regression model, a natural $t$-statistic is $$ \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)}. $$ We've seen how to compute $\hat{\beta}_1$, but we never get to see the true $\beta_1$, so the only quantity left to discuss is the standard error $SE(\hat{\beta}_1)$.

  • How many degrees of freedom would this $T$ have?
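As a preview, here is a minimal sketch using the simulated X, Y from the loss-function example above: the fitted statsmodels OLS object already reports standard errors and the corresponding $t$-statistics for testing each coefficient against zero.

import statsmodels.api as sm

design = sm.add_constant(X)
results = sm.OLS(Y, design).fit()

print(results.bse)                      # the standard errors SE(beta_hat)
print(results.tvalues)                  # t statistics: estimate / SE, for H_0: beta_j = 0
print(results.params / results.bse)     # the same thing, computed by hand
print(results.df_resid)                 # residual degrees of freedom: n - 2 here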

Comparison of Student's $t$ to normal distribution

As the degrees of freedom increase, the population histogram, or density, of the $T_k$ distribution looks more and more like that of the standard normal distribution, usually denoted by $N(0,1)$.

In [21]:
from scipy.stats import norm as ndist, t as tdist

density_fig = plt.figure(figsize=(10,6))
density_fig.clf()
df = 10
X = np.linspace(-4,4,101)
axes = plt.gca()
axes.plot(X,ndist.pdf(X), linewidth=2, label='Normal')

axes.legend()
axes.set_xlabel('standardized units')
axes.set_ylabel('density')
Out[21]:
<matplotlib.text.Text at 0xc19c090>

This change in the density has an effect on the rejection rule for hypothesis tests based on the $T_k$ distribution. For instance, for the standard normal, the 5% rejection rule is to reject if the so-called $Z$-score is larger than about 2 in absolute value.

In [22]:
X = np.linspace(ndist.isf(0.025), 4, 101)
Y = ndist.pdf(X)
xR, yR = mlab.poly_between(X, 0*X, Y)
axes.fill(xR, yR, facecolor='red', hatch='\\', alpha=0.5)

X = np.linspace(-4,-ndist.isf(0.025), 101)
Y = ndist.pdf(X)
xL, yL = mlab.poly_between(X, 0*X, Y)
axes.fill(xL, yL, facecolor='red', hatch='\\', alpha=0.5)
density_fig
Out[22]:

For the $T_{10}$ distribution, however, this rule must be modified.

In [36]:
X = np.linspace(-4,4,101)
df = 10
axes.plot(X, tdist.pdf(X, df), linewidth=2, label=r'$T$, df=10')
quantile = tdist.isf(0.025, df)
print('the T_10 quantile is: %0.2f' % quantile)

X = np.linspace(quantile, 4, 101)
Y = tdist.pdf(X,df)
xR, yR = mlab.poly_between(X, 0*X, Y)
axes.fill(xR, yR, facecolor='grey', hatch='\\', alpha=0.5)

X = np.linspace(-4, -quantile, 101)
Y = tdist.pdf(X, df)
xL, yL = mlab.poly_between(X, 0*X, Y)
axes.fill(xL, yL, facecolor='grey', hatch='\\', alpha=0.5)
axes.legend()
density_fig
the T_10 quantile is: 2.23

Out[36]:

One sample problem revisited

Above, we used the one sample problem as an example of a $t$-statistic. Let's be a little more specific.

  • Given an independent sample $\pmb{X}=(X_1, \dots, X_n)$ where $X_i\sim N(\mu,\sigma^2)$ we can test $H_0:\mu=0$ using a $T$-statistic.

  • We can prove that the random variables $$\overline{X} \sim N(\mu, \sigma^2/n), \qquad \frac{S^2_X}{\sigma^2} \sim \frac{\chi^2_{n-1}}{n-1}$$ are independent.

  • Therefore, whatever the true $\mu$ is $$ \frac{\overline{X} - \mu}{S_X / \sqrt{n}} = \frac{ (\overline{X}-\mu) / (\sigma/\sqrt{n})}{S_X / \sigma} \sim t_{n-1}.$$

  • Our null hypothesis specifies a particular value for $\mu$, i.e. 0. Therefore, under $H_0:\mu=0$ (i.e. assuming that $H_0$ is true), $$\overline{X}/(S_X/\sqrt{n}) \sim t_{n-1}.$$
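A quick simulation check of this last claim (a minimal sketch with a made-up sample size): under $H_0:\mu=0$, the statistic should behave like a draw from $t_{n-1}$.

import numpy as np
from scipy.stats import t as tdist

n = 15
T_stats = []
for _ in range(10000):
    sample = np.random.standard_normal(n)                              # data generated under H_0
    T_stats.append(sample.mean() / (sample.std(ddof=1) / np.sqrt(n)))
T_stats = np.array(T_stats)

# the empirical upper 2.5% point should be close to the t_{n-1} quantile
print(np.percentile(T_stats, 97.5))
print(tdist.isf(0.025, n - 1))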

Confidence interval

The following are examples of confidence intervals you may have already seen.

  • One sample problem: instead of deciding whether $\mu=0$, we might want to come up with a (random) interval $[L,U]$ based