Topics covered¶

• Simple linear regression.
• Diagnostics for simple linear regression.
• Multiple linear regression.
• Diagnostics.
• Interactions and ANOVA.
• Weighted Least Squares.
• Autocorrelation.
• Model selection.
• Penalized regression.
• Logistic regression.

Simple linear regression¶

In [3]:
heights_fig

Out[3]:

Least squares¶

• We used "least squares" regression. This measures the goodness of fit of a line by the sum of squared errors, $SSE$.
• Least squares regression chooses the line that minimizes $SSE(\beta_0, \beta_1) = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 \cdot X_i)^2.$

What is a $t$-statistic?¶

• Start with $Z \sim N(0,1)$ is standard normal and $S^2 \sim \chi^2_{\nu}$, independent of $Z$.
• Compute $T = \frac{Z}{\sqrt{\frac{S^2}{\nu}}}.$
• Then, $T \sim t_{\nu}$ has a $t$-distribution with $\nu$ degrees of freedom.
• Generally, a $t$-statistic has the form $$T = \frac{\hat{\theta} - \theta}{SE(\hat{\theta})}$$

Interval for $\beta_1$¶

A $(1-\alpha) \cdot 100 \%$ confidence interval: $\widehat{\beta}_1 \pm SE(\widehat{\beta}_1) \cdot t_{n-2, 1-\alpha/2}.$ Interval for regression line $\beta_0 + \beta_1 \cdot X$

• $(1-\alpha) \cdot 100 \%$ confidence interval: $\widehat{\beta}_0 + \widehat{\beta}_1 X \pm SE(\widehat{\beta}_0 + \widehat{\beta}_1 X) \cdot t_{n-2, 1-\alpha/2}$ where $SE(a_0\widehat{\beta}_0 + a_1\widehat{\beta}_1) = \widehat{\sigma} \sqrt{\frac{a_0^2}{n} + \frac{(a_0\overline{X} - a_1)^2}{\sum_{i=1}^n \left(X_i-\overline{X}\right)^2}}$

Prediction intervals¶

• $SE(\widehat{\beta}_0 + \widehat{\beta}_1 X_{\text{new}} + \varepsilon_{\text{new}}) = \widehat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(\overline{X} - X_{\text{new}})^2}{\sum_{i=1}^n \left(X_i-\overline{X}\right)^2}}.$
• Prediction interval is $\widehat{\beta}_0 + \widehat{\beta}_1 X_{\text{new}} \pm t_{n-2, 1-\alpha/2} \cdot SE(\widehat{\beta}_0 + \widehat{\beta}_1 X_{\text{new}} + \varepsilon_{\text{new}})$

Sums of squares¶

\begin{aligned} SSE &= \sum_{i=1}^n(Y_i - \widehat{Y}_i)^2 = \sum_{i=1}^n (Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 X_i)^2 \\ SSR &= \sum_{i=1}^n(\overline{Y} - \widehat{Y}_i)^2 = \sum_{i=1}^n (\overline{Y} - \widehat{\beta}_0 - \widehat{\beta}_1 X_i)^2 \\ SST &= \sum_{i=1}^n(Y_i - \overline{Y})^2 = SSE + SSR \\ R^2 &= \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \widehat{Cor}(\pmb{X},\pmb{Y})^2. \end{aligned}

$F$-test in simple linear regression¶

• Full (bigger) model : $FM: \qquad Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
• Reduced (smaller) model: $RM: \qquad Y_i = \beta_0 + \varepsilon_i$
• The $F$-statistic has the form $F=\frac{(SSE(RM) - SSE(FM)) / (df_{RM} - df_{FM})}{SSE(FM) / df_{FM}}.$
• Reject $H_0: RM$ is correct, if $F > F_{1-\alpha, 1, n-2}$.

Assumptions in the simple linear regression model¶

• $Y_i = \beta_0 + \beta_1 X_{i} + \varepsilon_i$
• Errors $\varepsilon_i$ are assumed independent $N(0,\sigma^2)$.

Diagnostic plots¶

In [4]:
%%R -h 800 -w 800
simple.lm = lm(y2 ~ x2, data=anscombe)
plot(anscombe$x2, resid(simple.lm), ylab='Residual', xlab='X', pch=23, bg='orange', cex=2) abline(h=0, lwd=2, col='red', lty=2)  In [5]: %%R par(mfrow=c(2,2)) plot(simple.lm)  Quadratic model¶ In [6]: %%R -h 800 -w 800 quadratic.lm <- lm(y2 ~ poly(x2, 2), data=anscombe) Xsort <- sort(anscombe$x2)
plot(anscombe$x2, anscombe$y2, pch=23, bg='orange', cex=2, ylab='Y', xlab='X')
lines(Xsort, predict(quadratic.lm, list(x2=Xsort)), col='red', lty=2, lwd=2)

In [7]:
%%R -h 800 -w 800
par(mfrow=c(2,2))


Simple linear diagnostics¶

• Outliers

• Nonconstant variance

In [8]:
%%R -h 800 -w 800
url = 'http://stats191.stanford.edu/data/HIV.VL.table'

plot(GSS, VL, pch=23, bg='orange', cex=2)
viral.lm = lm(VL ~ GSS)
abline(viral.lm, col='red', lwd=2)


Multiple linear regression model¶

• Rather than one predictor, we have $p=6$ predictors.
• $Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_p X_{ip} + \varepsilon_i$
• Errors $\varepsilon$ are assumed independent $N(0,\sigma^2)$, as in simple linear regression.
• Coefficients are called (partial) regression coefficients because they "allow" for the effect of other variables.

Overall $F$-test¶

• Full (bigger) model : $$Y_i = \beta_0 + \beta_1 X_{i1} + \dots \beta_p X_{ip} + \varepsilon_i$$
• Reduced (smaller) model: $$Y_i = \beta_0 + \varepsilon_i$$
• The $F$-statistic has the form $F=\frac{(SSE(R) - SSE(F)) / (df_R - df_F)}{SSE(F) / df_F}.$

Matrix formulation¶

• ${ Y}_{n \times 1} = {X}_{n \times (p + 1)} {\beta}_{(p+1) \times 1} + {\varepsilon}_{n \times 1}$
• ${X}$ is called the design matrix of the model
• ${\varepsilon} \sim N(0, \sigma^2 I_{n \times n})$ is multivariate normal $SSE$ in matrix form $$SSE(\beta) = ({Y} - {X} {\beta})'({Y} - {X} {\beta})$$

OLS estimators¶

• Normal equations yield $$\widehat{\beta} = ({X}^t{X})^{-1}{X}^t{Y}$$
• Properties: $$\hat{\beta} \sim N(\beta, \sigma^2 (X^TX)^{-1} )$$

Confidence interval for $\sum_{j=0}^p a_j \beta_j$¶

• Suppose we want a $(1-\alpha)\cdot 100\%$ CI for $\sum_{j=0}^p a_j\beta_j$.
• Just as in simple linear regression:
• $\sum_{j=0}^p a_j \widehat{\beta}_j \pm t_{1-\alpha/2, n-p-1} \cdot SE\left(\sum_{j=0}^p a_j\widehat{\beta}_j\right).$

General $F$-tests¶

• Given two models $R \subset F$ (i.e. $R$ is a subspace of $F$), we can consider testing $$H_0: H_0: \text{R is adequate (i.e. \mathbb{E}(Y) \in R)}$$ vs. $$H_a: \text{F is adequate (i.e. \mathbb{E}(Y) \in F)}$$

• The test statistic is $$F = F = \frac{(SSE(R) - SSE(F)) / (df_R - df_F)}{SSE(F)/df_F}$$
• If $H_0$ is true, $F \sim F_{df_R-df_F, df_F}$ so we reject $H_0$ at level $\alpha$ if $F > F_{df_R-df_F, df_F, 1-\alpha}$.

Diagnostics: What can go wrong?¶

• Regression function can be wrong: maybe regression function should have some other form (see diagnostics for simple linear regression).

• Model for the errors may be incorrect:

• may not be normally distributed.

• may not be independent.

• may not have the same variance.

• Detecting problems is more art then science, i.e. we cannot test for all possible problems in a regression model.

• Basic idea of diagnostic measures: if model is correct then residuals $e_i = Y_i -\widehat{Y}_i, 1 \leq i \leq n$ should look like a sample of (not quite independent) $N(0, \sigma^2)$ random variables.

Diagnostics¶

In [9]:
%%R -h 800 -w 800
url = 'http://stats191.stanford.edu/data/scottish_races.table'
attach(races.table)
races.lm = lm(Time ~ Distance + Climb)
par(mfrow=c(2,2))
plot(races.lm, pch=23 ,bg='orange',cex=2)


Diagnostics measures¶

• DFFITS: $$DFFITS_i = \frac{\widehat{Y}_i - \widehat{Y}_{i(i)}}{\widehat{\sigma}_{(i)} \sqrt{H_{ii}}}$$

• Cook's Distance: $$D_i = \frac{\sum_{j=1}^n(\widehat{Y}_j - \widehat{Y}_{j(i)})^2}{(p+1) \, \widehat{\sigma}^2}$$

• DFBETAS: $$DFBETAS_{j(i)} = \frac{\widehat{\beta}_j - \widehat{\beta}_{j(i)}}{\sqrt{\widehat{\sigma}^2_{(i)} (X^TX)^{-1}_{jj}}}.$$

In [10]:
%%R
influence.measures(races.lm)

Influence measures of
lm(formula = Time ~ Distance + Climb) :

dfb.1_  dfb.Dstn  dfb.Clmb    dffit  cov.r   cook.d    hat inf
1   0.03781 -0.016613 -0.004743  0.03861 1.1595 5.13e-04 0.0538
2  -0.05959  0.067223 -0.073404 -0.11957 1.1269 4.88e-03 0.0495
3  -0.04858 -0.006707  0.028036 -0.06310 1.1329 1.37e-03 0.0384
4  -0.00767 -0.005677  0.008766 -0.01368 1.1556 6.44e-05 0.0485
5  -0.05047  0.084718 -0.145019 -0.20949 1.0837 1.47e-02 0.0553
6   0.00348 -0.004311  0.007567  0.01219 1.1536 5.12e-05 0.0468
7  -0.89062 -0.712743  2.364517  2.69897 0.8179 1.89e+00 0.4204   *
8  -0.00845 -0.001650  0.005567 -0.01116 1.1467 4.29e-05 0.0410
9  -0.01437  0.000913  0.006163 -0.01664 1.1453 9.52e-05 0.0403
10  0.04703  0.013056 -0.036517  0.06399 1.1431 1.41e-03 0.0457
11 -0.30124  0.768854 -0.479935  0.78583 3.4524 2.11e-01 0.6898   *
12 -0.01150  0.009662 -0.007493 -0.01673 1.1492 9.62e-05 0.0435
13 -0.03173 -0.029912 -0.000707 -0.11771 1.0922 4.70e-03 0.0323
14  0.11803  0.042034 -0.104886  0.16610 1.1039 9.34e-03 0.0513
15 -0.10038  0.057704 -0.022318 -0.11921 1.1062 4.83e-03 0.0388
16 -0.01852  0.006789 -0.099867 -0.21136 1.0501 1.49e-02 0.0444
17  0.01197 -0.066512  0.034459 -0.08338 1.1908 2.39e-03 0.0831
18  1.75830 -0.406551 -0.655944  1.84240 0.0493 4.07e-01 0.0554   *
19 -0.15890  0.044315  0.029416 -0.17485 1.0634 1.03e-02 0.0385
20  0.00865  0.001423 -0.005941  0.01101 1.1526 4.17e-05 0.0459
21  0.04776 -0.010017 -0.019196  0.05031 1.1611 8.70e-04 0.0566
22 -0.01890  0.013863 -0.006469 -0.02235 1.1546 1.72e-04 0.0483
23 -0.04117  0.033985 -0.032914 -0.06939 1.1327 1.65e-03 0.0398
24  0.07483 -0.046382  0.006427  0.07839 1.1571 2.11e-03 0.0584
25  0.03691 -0.012632 -0.008256  0.03808 1.1557 4.99e-04 0.0507
26 -0.13774  0.136139 -0.101317 -0.19784 1.0914 1.32e-02 0.0550
27 -0.02921 -0.005702  0.019240 -0.03857 1.1431 5.11e-04 0.0410
28 -0.04765  0.006938  0.014993 -0.05447 1.1345 1.02e-03 0.0376
29 -0.00214  0.000648 -0.000329 -0.00310 1.1338 3.30e-06 0.0299
30 -0.08532 -0.007706  0.054842 -0.10363 1.1323 3.67e-03 0.0482
31  0.02099  0.170132 -0.373652 -0.44140 1.0960 6.41e-02 0.1216
32 -0.02858 -0.008694  0.023276 -0.03931 1.1513 5.31e-04 0.0475
33 -0.15822  0.097011  0.155697  0.33383 1.2609 3.77e-02 0.1716
34 -0.00356  0.000706  0.001056 -0.00393 1.1461 5.31e-06 0.0403
35  0.20872 -0.199049 -0.100907 -0.39445 1.2764 5.24e-02 0.1910


Outliers¶

• Observations $(Y, X_1, \dots, X_p)$ that do not follow the model, while most other observations seem to follow the model.
• One solution: Bonferroni correction, threshold at $t_{1 - \alpha/(2*n), n-p-2}$.
• Bonferroni: if we are doing many $t$ (or other) tests, say $m >>1$ we can control overall false positive rate at $\alpha$ by testing each one at level $\alpha/m$.
In [11]:
%%R
library(car)
outlierTest(races.lm)

   rstudent unadjusted p-value Bonferonni p
18 7.610958         1.3968e-08    4.889e-07


Qualitative variables and interactions¶

In [12]:
%%R -h 800 -w 800
url = 'http://stats191.stanford.edu/data/jobtest.table'
jobtest.table$ETHN <- factor(jobtest.table$ETHN)
attach(jobtest.table)
plot(TEST, JPERF, type='n')
points(TEST[(ETHN == 0)], JPERF[(ETHN == 0)], pch=21, cex=2, bg='purple')
points(TEST[(ETHN == 1)], JPERF[(ETHN == 1)], pch=25, cex=2, bg='green')

In [13]:
%%R -h 800 -w 800
jobtest.lm1 <- lm(JPERF ~ TEST, jobtest.table)
print(summary(jobtest.lm1))
plot(TEST, JPERF, type='n')
points(TEST[(ETHN == 0)], JPERF[(ETHN == 0)], pch=21, cex=2, bg='purple')
points(TEST[(ETHN == 1)], JPERF[(ETHN == 1)], pch=25, cex=2, bg='green')


Model selection criteria¶

• Mallow's $C_p$: $$C_p({\cal M}) = \frac{SSE({\cal M})}{\widehat{\sigma}^2} + 2 \cdot p({\cal M}) - n.$$
• Akaike (AIC) defined as $$AIC({\cal M}) = - 2 \log L({\cal M}) + 2 p({\cal M})$$ where $L({\cal M})$ is the maximized likelihood of the model.
• Bayes (BIC) defined as $$BIC({\cal M}) = - 2 \log L({\cal M}) + \log n \cdot p({\cal M})$$
• Adjusted $R^2$
• Stepwise (step ) vs. best subsets (leaps ).

$K$-fold cross-validation¶

• Fix a model ${\cal M}$. Break data set into $K$ approximately equal sized groups $(G_1, \dots, G_K)$.
• for (i in 1:K) Use all groups except $G_i$ to fit model, predict outcome in group $G_i$ based on this model $\widehat{Y}_{j,{\cal M}, G_i}, j \in G_i$.
• Estimate $$CV({\cal M}) = \frac{1}{n}\sum_{i=1}^K \sum_{j \in G_i} (Y_j - \widehat{Y}_{j,{\cal M},G_i})^2.$$

Shrinkage estimator¶

• In one sample problem, when trying to estimate $\mu$ from $Y_i \sim N(\mu, \sigma^2)$ we looked at the estimator $$\hat{Y}_{\alpha} = \alpha \cdot \bar{Y}.$$

• The "quality" of the estimator decomposed as $$E((\hat{Y}_{\alpha}-\mu)^2) = \text{Bias}(\hat{Y}_{\alpha})^2 + \text{Var}(\hat{Y}_{\alpha})$$

In [21]:
%%R -h 800 -w 800
nsample = 40
ntrial = 500
mu = 0.5
sigma = 2.5
MSE <- function(mu.hat, mu) {
return(sum((mu.hat - mu)^2) / length(mu))
}

alpha <- seq(0.0,1,length=20)

mse <- numeric(length(alpha))

for (i in 1:ntrial) {
Z = rnorm(nsample) * sigma + mu
for (j in 1:length(alpha)) {
mse[j] = mse[j] + MSE(alpha[j] * mean(Z) * rep(1, nsample), mu * rep(1, nsample)) / ntrial
}
}

plot(alpha, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
xlab=expression(paste('Penalty parameter,', alpha)),
ylab=expression(paste('MSE(', alpha, ')')),
cex.lab=1.5)


Ridge regression¶

$$\hat{\beta}_{\lambda} = \text{argmin}_{\beta} \|Y-X\beta\|^2_2 + \lambda \|\beta\|_2^2$$
In [22]:
%%R -h 600 -w 600
library(lars)
data(diabetes)
library(MASS)
diabetes.ridge <- lm.ridge(diabetes$y ~ diabetes$x, lambda=seq(0,10,0.05))
plot(diabetes.ridge, lwd=3)

Loaded lars 1.2

Attaching package: â€˜MASSâ€™

The following object is masked from â€˜package:alr3â€™:

forbes


In [23]:
%%R -h 600 -w 600
par(cex.lab=2)
plot(diabetes.ridge$lambda, diabetes.ridge$GCV, xlab='Lambda', ylab='GCV', type='l', lwd=3, col='orange')
select(diabetes.ridge)

modified HKB estimator is 5.462251
modified L-W estimator is 7.641667
smallest value of GCV  at 3.25


LASSO¶

$$\hat{\beta}_{\lambda} = \text{argmin}_{\beta} \|Y-X\beta\|^2_2 + \lambda \|\beta\|_1$$
In [24]:
%%R
library(lars)
data(diabetes)
diabetes.lasso = lars(diabetes$x, diabetes$y, type='lasso')
plot(diabetes.lasso)

In [25]:
%%R -h 600 -w 600
par(cex.lab=2)
cv.lars(diabetes$x, diabetes$y, K=10, type='lasso')