Topics covered

  • Simple linear regression.
  • Diagnostics for simple linear regression.
  • Multiple linear regression.
  • Diagnostics for multiple linear regression.
  • Interactions and ANOVA.
  • Weighted Least Squares.
  • Autocorrelation.
  • Model selection.
  • Logistic regression.

Simple linear regression

In [4]:
%%R -o M,D,slope,intercept
library(alr3)
data(heights)
M = heights$Mheight
D = heights$Dheight
parameters = lm(D~M)$coef
print(parameters)
intercept = parameters[1]
slope = parameters[2]
(Intercept)           M 
  29.917437    0.541747 

In [5]:
heights_fig = plt.figure(figsize=(10,6))
axes = heights_fig.gca()
axes.scatter(M, D, c='red')
axes.set_xlabel("Mother's height (inches)", size=20)
axes.set_ylabel("Daughter's height (inches)", size=20)
X = 66
xf, yf = mlab.poly_between([X-.5,X+.5], [50,50], [75, 75])
selected_points = (M <= X+.5) * (M >= X-.5)
mean_within_slice = D[selected_points].mean()
scatterplot_slice = axes.fill(xf, yf, facecolor='blue', alpha=0.1, hatch='/')[0]
axes.scatter([X],[mean_within_slice], s=130, c='yellow', marker='^')
axes.plot([M.min(), M.max()], [intercept + slope * M.min(), intercept + slope * M.max()],
          linewidth=3)
Out[5]:
[<matplotlib.lines.Line2D at 0x106e61710>]
In [6]:
heights_fig
Out[6]:

Least squares

  • We used "least squares" regression. This measures the goodness of fit of a line by the sum of squared errors, $SSE$.
  • Least squares regression chooses the line that minimizes $SSE(\beta_0, \beta_1) = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 \cdot X_i)^2.$
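
As a sanity check, the closed-form least squares solution is easy to compute directly and compare with lm(). A minimal R sketch on simulated data (the names x and y and the chosen coefficients are purely illustrative):

# Closed-form least squares estimates on simulated data
set.seed(1)
x <- rnorm(100, mean = 64, sd = 3)
y <- 30 + 0.5 * x + rnorm(100, sd = 2)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
c(b0, b1)
coef(lm(y ~ x))   # should agree with the closed-form estimates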

Geometry of Least Squares: Simple Linear Model

What is a $t$-statistic?

  • Start with $Z \sim N(0,1)$, a standard normal, and $S^2 \sim \chi^2_{\nu}$, independent of $Z$.
  • Compute $T = \frac{Z}{\sqrt{\frac{S^2}{\nu}}}.$
  • Then, $T \sim t_{\nu}$ has a $t$-distribution with $\nu$ degrees of freedom.
  • Generally, a $t$-statistic has the form $$ T = \frac{\hat{\theta} - \theta}{SE(\hat{\theta})}$$

Interval for $\beta_1$

A $(1-\alpha) \cdot 100 \%$ confidence interval: $\widehat{\beta}_1 \pm SE(\widehat{\beta}_1) \cdot t_{n-2, 1-\alpha/2}.$

Interval for regression line $\beta_0 + \beta_1 \cdot X$

  • $(1-\alpha) \cdot 100 \%$ confidence interval: $\widehat{\beta}_0 + \widehat{\beta}_1 X \pm SE(\widehat{\beta}_0 + \widehat{\beta}_1 X) \cdot t_{n-2, 1-\alpha/2}$ where $SE(a_0\widehat{\beta}_0 + a_1\widehat{\beta}_1) = \widehat{\sigma} \sqrt{\frac{a_0^2}{n} + \frac{(a_0\overline{X} - a_1)^2}{\sum_{i=1}^n \left(X_i-\overline{X}\right)^2}}$
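
In R, confint() and predict(..., interval='confidence') give these intervals directly. A sketch, assuming the vectors M and D from the heights example above are still in the R workspace (D.lm is an illustrative name):

D.lm <- lm(D ~ M)
confint(D.lm, level = 0.95)                  # intervals for intercept and slope
predict(D.lm, newdata = data.frame(M = 66),
        interval = 'confidence')             # interval for the regression line at X = 66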

Prediction intervals

  • $SE(\widehat{\beta}_0 + \widehat{\beta}_1 X_{\text{new}} + \varepsilon_{\text{new}}) = \widehat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(\overline{X} - X_{\text{new}})^2}{\sum_{i=1}^n \left(X_i-\overline{X}\right)^2}}.$
  • Prediction interval is $\widehat{\beta}_0 + \widehat{\beta}_1 X_{\text{new}} \pm t_{n-2, 1-\alpha/2} \cdot SE(\widehat{\beta}_0 + \widehat{\beta}_1 X_{\text{new}} + \varepsilon_{\text{new}})$
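
A prediction interval for a new response at the same X uses interval='prediction' and is wider than the confidence interval for the line. Continuing the sketch above (D.lm assumed):

predict(D.lm, newdata = data.frame(M = 66),
        interval = 'prediction')             # accounts for the extra variance of eps_new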

Sums of squares

$$\begin{aligned} SSE &= \sum_{i=1}^n(Y_i - \widehat{Y}_i)^2 = \sum_{i=1}^n (Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 X_i)^2 \\ SSR &= \sum_{i=1}^n(\overline{Y} - \widehat{Y}_i)^2 = \sum_{i=1}^n (\overline{Y} - \widehat{\beta}_0 - \widehat{\beta}_1 X_i)^2 \\ SST &= \sum_{i=1}^n(Y_i - \overline{Y})^2 = SSE + SSR \\ R^2 &= \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \widehat{Cor}(\pmb{X},\pmb{Y})^2. \end{aligned}$$
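
These identities are easy to verify numerically; a sketch, again assuming the D.lm fit from above:

Y.hat <- fitted(D.lm)
SSE <- sum((D - Y.hat)^2)
SSR <- sum((mean(D) - Y.hat)^2)
SST <- sum((D - mean(D))^2)
c(SSE + SSR, SST)                                    # decomposition: SST = SSE + SSR
c(SSR / SST, summary(D.lm)$r.squared, cor(M, D)^2)   # three ways to compute R^2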

$F$-test in simple linear regression

  • Full (bigger) model : $FM: \qquad Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
  • Reduced (smaller) model: $RM: \qquad Y_i = \beta_0 + \varepsilon_i$
  • The $F$-statistic has the form $F=\frac{(SSE(RM) - SSE(FM)) / (df_{RM} - df_{FM})}{SSE(FM) / df_{FM}}.$
  • Reject $H_0$ ($RM$ is correct) if $F > F_{1-\alpha, 1, n-2}$.
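
The same F-statistic comes out of anova() applied to the reduced (intercept-only) and full fits; a sketch on the heights data, assuming D.lm from above:

reduced.lm <- lm(D ~ 1)        # reduced model: intercept only
anova(reduced.lm, D.lm)        # F-statistic and p-value for H_0: beta_1 = 0
summary(D.lm)$fstatistic       # agrees with the overall F reported by summary()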

Assumptions in the simple linear regression model

  • $Y_i = \beta_0 + \beta_1 X_{i} + \varepsilon_i$
  • Errors $\varepsilon_i$ are assumed independent $N(0,\sigma^2)$.

Diagnostic plots

In [7]:
%%R -h 800 -w 800
simple.lm = lm(y2 ~ x2, data=anscombe)
plot(anscombe$x2, resid(simple.lm), ylab='Residual', xlab='X',
     pch=23, bg='orange', cex=2)
abline(h=0, lwd=2, col='red', lty=2)
In [8]:
%%R
par(mfrow=c(2,2))
plot(simple.lm)

Quadratic model

In [9]:
%%R -h 800 -w 800
quadratic.lm <- lm(y2 ~ poly(x2, 2), data=anscombe)
Xsort <- sort(anscombe$x2)
plot(anscombe$x2, anscombe$y2, pch=23, bg='orange', cex=2, ylab='Y', xlab='X')
lines(Xsort, predict(quadratic.lm, list(x2=Xsort)), col='red', lty=2, lwd=2)
In [10]:
%%R -h 800 -w 800
par(mfrow=c(2,2))
plot(quadratic.lm)

Simple linear diagnostics

  • Outliers

  • Nonconstant variance

In [11]:
%%R -h 800 -w 800
url = 'http://stats191.stanford.edu/data/HIV.VL.table'

viral.load = read.table(url, header=T)
attach(viral.load)

plot(GSS, VL, pch=23, bg='orange', cex=2)
viral.lm = lm(VL ~ GSS)
abline(viral.lm, col='red', lwd=2)

Multiple linear regression model

  • Rather than one predictor, we now have $p$ predictors (e.g. $p=6$).
  • $Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_p X_{ip} + \varepsilon_i$
  • Errors $\varepsilon$ are assumed independent $N(0,\sigma^2)$, as in simple linear regression.
  • Coefficients are called (partial) regression coefficients because they "allow" for the effect of other variables.

Geometry of Least Squares: Multiple Regression

Overall $F$-test

  • Full (bigger) model : $$Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_p X_{ip} + \varepsilon_i$$
  • Reduced (smaller) model: $$Y_i = \beta_0 + \varepsilon_i$$
  • The $F$-statistic has the form $F=\frac{(SSE(R) - SSE(F)) / (df_R - df_F)}{SSE(F) / df_F}.$

Matrix formulation

  • ${ Y}_{n \times 1} = {X}_{n \times (p + 1)} {\beta}_{(p+1) \times 1} + {\varepsilon}_{n \times 1}$
  • ${X}$ is called the design matrix of the model
  • ${\varepsilon} \sim N(0, \sigma^2 I_{n \times n})$ is multivariate normal
  • $SSE$ in matrix form: $$SSE(\beta) = ({Y} - {X} {\beta})'({Y} - {X} {\beta})$$

OLS estimators

  • Normal equations yield $$\widehat{\beta} = ({X}^t{X})^{-1}{X}^t{Y} $$
  • Properties: $$\hat{\beta} \sim N(\beta, \sigma^2 (X^TX)^{-1} )$$

Confidence interval for $\sum_{j=0}^p a_j \beta_j$

  • Suppose we want a $(1-\alpha)\cdot 100\%$ CI for $\sum_{j=0}^p a_j\beta_j$.
  • Just as in simple linear regression:
  • $\sum_{j=0}^p a_j \widehat{\beta}_j \pm t_{1-\alpha/2, n-p-1} \cdot SE\left(\sum_{j=0}^p a_j\widehat{\beta}_j\right).$
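
A sketch of such an interval computed from vcov(); the simulated data, the fit object, and the contrast vector a (here picking out $\beta_0 + \beta_1$) are all illustrative:

set.seed(2)
X1 <- rnorm(50); X2 <- rnorm(50)
Y <- 1 + 2 * X1 - X2 + rnorm(50)
fit <- lm(Y ~ X1 + X2)
a <- c(1, 1, 0)                                     # contrast: beta_0 + beta_1
est <- sum(a * coef(fit))
se  <- drop(sqrt(t(a) %*% vcov(fit) %*% a))         # SE of the linear combination
est + c(-1, 1) * qt(0.975, df = fit$df.residual) * se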

General $F$-tests

  • Given two models $R \subset F$ (i.e. $R$ is a subspace of $F$), we can consider testing $$ H_0: \text{$R$ is adequate (i.e. $\mathbb{E}(Y) \in R$)} $$ vs. $$ H_a: \text{$F$ is adequate (i.e. $\mathbb{E}(Y) \in F$)} $$

    • The test statistic is $$ F = \frac{(SSE(R) - SSE(F)) / (df_R - df_F)}{SSE(F)/df_F} $$
  • If $H_0$ is true, $F \sim F_{df_R-df_F, df_F}$ so we reject $H_0$ at level $\alpha$ if $F > F_{df_R-df_F, df_F, 1-\alpha}$.
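
With two nested lm fits, anova() computes exactly this F-statistic. Continuing the simulated example above (fit assumed):

fit.R <- lm(Y ~ X1)            # reduced model: drop X2
anova(fit.R, fit)              # F-test of H_0: the reduced model is adequate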

Diagnostics: What can go wrong?

  • Regression function can be wrong: maybe it should have some other form (see diagnostics for simple linear regression).

  • Model for the errors may be incorrect:

    • may not be normally distributed.

    • may not be independent.

    • may not have the same variance.

  • Detecting problems is more art than science, i.e. we cannot test for all possible problems in a regression model.

  • Basic idea of diagnostic measures: if model is correct then residuals $e_i = Y_i -\widehat{Y}_i, 1 \leq i \leq n$ should look like a sample of (not quite independent) $N(0, \sigma^2)$ random variables.

Diagnostics

In [12]:
%%R -h 800 -w 800
url = 'http://stats191.stanford.edu/data/scottish_races.table'
races.table = read.table(url, header=T)
attach(races.table)
races.lm = lm(Time ~ Distance + Climb)
par(mfrow=c(2,2))
plot(races.lm, pch=23 ,bg='orange',cex=2)

Diagnostics measures

  • DFFITS: $$DFFITS_i = \frac{\widehat{Y}_i - \widehat{Y}_{i(i)}}{\widehat{\sigma}_{(i)} \sqrt{H_{ii}}}$$

  • Cook's Distance: $$D_i = \frac{\sum_{j=1}^n(\widehat{Y}_j - \widehat{Y}_{j(i)})^2}{(p+1) \, \widehat{\sigma}^2}$$

  • DFBETAS: $$DFBETAS_{j(i)} = \frac{\widehat{\beta}_j - \widehat{\beta}_{j(i)}}{\sqrt{\widehat{\sigma}^2_{(i)} (X^TX)^{-1}_{jj}}}.$$
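
R computes all three with dffits(), cooks.distance() and dfbetas(); a sketch using the races.lm fit from the Diagnostics section above:

head(dffits(races.lm))           # leave-one-out change in fitted value, standardized
head(cooks.distance(races.lm))   # overall influence of each observation
head(dfbetas(races.lm))          # one column per coefficient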

In [24]:
%%R
influence.measures(races.lm)
Influence measures of
	 lm(formula = Time ~ Distance + Climb) :

     dfb.1_  dfb.Dstn  dfb.Clmb    dffit  cov.r   cook.d    hat inf
1   0.03781 -0.016613 -0.004743  0.03861 1.1595 5.13e-04 0.0538    
2  -0.05959  0.067223 -0.073404 -0.11957 1.1269 4.88e-03 0.0495    
3  -0.04858 -0.006707  0.028036 -0.06310 1.1329 1.37e-03 0.0384    
4  -0.00767 -0.005677  0.008766 -0.01368 1.1556 6.44e-05 0.0485    
5  -0.05047  0.084718 -0.145019 -0.20949 1.0837 1.47e-02 0.0553    
6   0.00348 -0.004311  0.007567  0.01219 1.1536 5.12e-05 0.0468    
7  -0.89062 -0.712743  2.364517  2.69897 0.8179 1.89e+00 0.4204   *
8  -0.00845 -0.001650  0.005567 -0.01116 1.1467 4.29e-05 0.0410    
9  -0.01437  0.000913  0.006163 -0.01664 1.1453 9.52e-05 0.0403    
10  0.04703  0.013056 -0.036517  0.06399 1.1431 1.41e-03 0.0457    
11 -0.30124  0.768854 -0.479935  0.78583 3.4524 2.11e-01 0.6898   *
12 -0.01150  0.009662 -0.007493 -0.01673 1.1492 9.62e-05 0.0435    
13 -0.03173 -0.029912 -0.000707 -0.11771 1.0922 4.70e-03 0.0323    
14  0.11803  0.042034 -0.104886  0.16610 1.1039 9.34e-03 0.0513    
15 -0.10038  0.057704 -0.022318 -0.11921 1.1062 4.83e-03 0.0388    
16 -0.01852  0.006789 -0.099867 -0.21136 1.0501 1.49e-02 0.0444    
17  0.01197 -0.066512  0.034459 -0.08338 1.1908 2.39e-03 0.0831    
18  1.75830 -0.406551 -0.655944  1.84240 0.0493 4.07e-01 0.0554   *
19 -0.15890  0.044315  0.029416 -0.17485 1.0634 1.03e-02 0.0385    
20  0.00865  0.001423 -0.005941  0.01101 1.1526 4.17e-05 0.0459    
21  0.04776 -0.010017 -0.019196  0.05031 1.1611 8.70e-04 0.0566    
22 -0.01890  0.013863 -0.006469 -0.02235 1.1546 1.72e-04 0.0483    
23 -0.04117  0.033985 -0.032914 -0.06939 1.1327 1.65e-03 0.0398    
24  0.07483 -0.046382  0.006427  0.07839 1.1571 2.11e-03 0.0584    
25  0.03691 -0.012632 -0.008256  0.03808 1.1557 4.99e-04 0.0507    
26 -0.13774  0.136139 -0.101317 -0.19784 1.0914 1.32e-02 0.0550    
27 -0.02921 -0.005702  0.019240 -0.03857 1.1431 5.11e-04 0.0410    
28 -0.04765  0.006938  0.014993 -0.05447 1.1345 1.02e-03 0.0376    
29 -0.00214  0.000648 -0.000329 -0.00310 1.1338 3.30e-06 0.0299    
30 -0.08532 -0.007706  0.054842 -0.10363 1.1323 3.67e-03 0.0482    
31  0.02099  0.170132 -0.373652 -0.44140 1.0960 6.41e-02 0.1216    
32 -0.02858 -0.008694  0.023276 -0.03931 1.1513 5.31e-04 0.0475    
33 -0.15822  0.097011  0.155697  0.33383 1.2609 3.77e-02 0.1716    
34 -0.00356  0.000706  0.001056 -0.00393 1.1461 5.31e-06 0.0403    
35  0.20872 -0.199049 -0.100907 -0.39445 1.2764 5.24e-02 0.1910    

Outliers

  • Observations $(Y, X_1, \dots, X_p)$ that do not follow the model, while most other observations seem to follow the model.
  • One solution: Bonferroni correction, threshold at $t_{1 - \alpha/(2n),\, n-p-2}$.
  • Bonferroni: if we are doing many $t$ (or other) tests, say $m \gg 1$, we can control the overall false positive rate at $\alpha$ by testing each one at level $\alpha/m$.
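
The Bonferroni cutoff can be computed by hand and compared with the studentized residuals; a sketch using races.lm from above (outlierTest() in the next cell does essentially the same calculation):

n <- nrow(races.table)
p <- 2                                    # number of predictors (Distance, Climb)
cutoff <- qt(1 - 0.05 / (2 * n), df = n - p - 2)
which(abs(rstudent(races.lm)) > cutoff)   # matches outlierTest(), which flags observation 18
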
In [13]:
%%R
library(car)
outlierTest(races.lm)
   rstudent unadjusted p-value Bonferonni p
18 7.610958         1.3968e-08    4.889e-07

Qualitative variables and interactions

In [14]:
%%R -h 800 -w 800
url = 'http://stats191.stanford.edu/data/jobtest.table'
jobtest.table <- read.table(url, header=T)
jobtest.table$ETHN <- factor(jobtest.table$ETHN)
attach(jobtest.table)
plot(TEST, JPERF, type='n')
points(TEST[(ETHN == 0)], JPERF[(ETHN == 0)], pch=21, cex=2, bg='purple')
points(TEST[(ETHN == 1)], JPERF[(ETHN == 1)], pch=25, cex=2, bg='green')
In [15]:
%%R -h 800 -w 800
jobtest.lm1 <- lm(JPERF ~ TEST, jobtest.table)
print(summary(jobtest.lm1))
plot(TEST, JPERF, type='n')
points(TEST[(ETHN == 0)], JPERF[(ETHN == 0)], pch=21, cex=2, bg='purple')
points(TEST[(ETHN == 1)], JPERF[(ETHN == 1)], pch=25, cex=2, bg='green')
abline(jobtest.lm1$coef, lwd=3, col='blue')

Call:
lm(formula = JPERF ~ TEST, data = jobtest.table)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3558 -0.8798 -0.1897  1.2735  2.3312 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.0350     0.8680   1.192 0.248617    
TEST          2.3605     0.5381   4.387 0.000356 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.591 on 18 degrees of freedom
Multiple R-squared:  0.5167,	Adjusted R-squared:  0.4899 
F-statistic: 19.25 on 1 and 18 DF,  p-value: 0.0003555


In [16]:
%%R -h 800 -w 800
jobtest.lm4 = lm(JPERF ~ TEST * ETHN)
print(summary(jobtest.lm4))
plot(TEST, JPERF, type='n')
points(TEST[(ETHN == 0)], JPERF[(ETHN == 0)], pch=21, cex=2, bg='purple')
points(TEST[(ETHN == 1)], JPERF[(ETHN == 1)], pch=25, cex=2, bg='green')
abline(jobtest.lm4$coef['(Intercept)'], jobtest.lm4$coef['TEST'], lwd=3, col='purple')
abline(jobtest.lm4$coef['(Intercept)'] + jobtest.lm4$coef['ETHN1'],
      jobtest.lm4$coef['TEST'] + jobtest.lm4$coef['TEST:ETHN1'], lwd=3, col='green')

Call:
lm(formula = JPERF ~ TEST * ETHN)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0734 -1.0594 -0.2548  1.2830  2.1980 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   2.0103     1.0501   1.914   0.0736 .
TEST          1.3134     0.6704   1.959   0.0677 .
ETHN1        -1.9132     1.5403  -1.242   0.2321  
TEST:ETHN1    1.9975     0.9544   2.093   0.0527 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.407 on 16 degrees of freedom
Multiple R-squared:  0.6643,	Adjusted R-squared:  0.6013 
F-statistic: 10.55 on 3 and 16 DF,  p-value: 0.0004511


ANOVA models

In [17]:
from ipy_table import *
anova_oneway = [('Source', 'SS', 'df', r'$\mathbb{E}(MS)$'), 
                ('Treatment', r'$SSTR=\sum_{i=1}^r n_i \left(\overline{Y}_{i\cdot} - \overline{Y}_{\cdot\cdot}\right)^2$', 
                'r-1', r'$\sigma^2 + \frac{\sum_{i=1}^r n_i \alpha_i^2}{r-1}$'),
                ('Error', r'$SSE=\sum_{i=1}^r \sum_{j=1}^{n_i}(Y_{ij} - \overline{Y}_{i\cdot})^2$',
                 r'$\sum_{i=1}^r (n_i - 1)$', r'$\sigma^2$')]
  
In [18]:
make_table(anova_oneway)   
apply_theme('basic')
Out[18]:

ANOVA models

In [19]:
anova_twoway = [('Source', 'SS', '  df  ', r'$\mathbb{E}(MS)$'), 
                ('A', r'$SSA=nm\sum_{i=1}^r  \left(\overline{Y}_{i\cdot\cdot} - \overline{Y}_{\cdot\cdot\cdot}\right)^2$',
                 'r-1', r'$\sigma^2 + nm\frac{\sum_{i=1}^r \alpha_i^2}{r-1}$'),
                ('B', r'$SSB=nr\sum_{j=1}^m  \left(\overline{Y}_{\cdot j\cdot} - \overline{Y}_{\cdot\cdot\cdot}\right)^2$',
                 'm-1', r'$\sigma^2 + nr\frac{\sum_{j=1}^m \beta_j^2}{m-1}$'),
                ('A:B', 
                 r'$SSAB = n\sum_{i=1}^r \sum_{j=1}^m  \left(\overline{Y}_{ij\cdot} - \overline{Y}_{i\cdot\cdot} - \overline{Y}_{\cdot j\cdot} + \overline{Y}_{\cdot\cdot\cdot}\right)^2$',
                r'(m-1)(r-1)', r'$\sigma^2 + n\frac{\sum_{i=1}^r\sum_{j=1}^m (\alpha\beta)_{ij}^2}{(r-1)(m-1)}$'),
                ('Error', r'$SSE = \sum_{i=1}^r \sum_{j=1}^m \sum_{k=1}^{n}(Y_{ijk} - \overline{Y}_{ij\cdot})^2$',
                 '(n-1)mr', r'$\sigma^2$')]
In [20]:
make_table(anova_twoway)   
apply_theme('basic')
Out[20]:

Weighted Least Squares

  • Weighted Least Squares $$SSE(\beta, w) = \sum_{i=1}^n w_i \left(Y_i - \beta_0 - \beta_1 X_i\right)^2.$$
  • In general, weights should be inversely proportional to the error variance: $$w_i = \frac{1}{\text{Var}(\varepsilon_i)}.$$

  • WLS estimator: $$ \hat{\beta}_W = (X^TWX)^{-1}(X^TWY). $$

  • Briefly talked about efficiency of estimators.
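
In R, the weights are passed straight to lm(); a minimal sketch on simulated heteroscedastic data (variable names and the variance model are illustrative):

set.seed(3)
x <- runif(100, 1, 10)
y <- 2 + 3 * x + rnorm(100, sd = x)       # error variance grows with x
w <- 1 / x^2                              # weights proportional to 1 / Var(eps_i)
fit.wls <- lm(y ~ x, weights = w)
summary(fit.wls)$coef                     # WLS estimates and standard errors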

Correlated errors: NASDAQ daily close 2011

In [21]:
%%R
url = 'http://stats191.stanford.edu/data/nasdaq_2011.csv'
nasdaq.data = read.table(url, header=TRUE, sep=',')

plot(nasdaq.data$Date, nasdaq.data$Close, xlab='Date', ylab='NASDAQ close',
     pch=23, bg='red', cex=2)

AR(1) noise

  • Suppose that, instead of being independent, the errors in our model were $\varepsilon_t = \rho \cdot \varepsilon_{t-1} + \omega_t, \qquad -1 < \rho < 1$ with $\omega_t \sim N(0,\sigma^2)$ independent.
  • If $\rho$ is close to 1, the errors are highly correlated; $\rho=0$ corresponds to independence.
  • This is "Auto-Regressive Order (1)" noise (AR(1)). Many other models of correlation exist: ARMA, ARIMA, ARCH, GARCH, etc.

Correcting for AR(1)

  • Suppose we know $\rho$. If we "whiten" the data and regressors $$\begin{aligned} \tilde{Y}_{t+1} &= Y_{t+1} - \rho Y_t \\ \tilde{X}_{(t+1)j} &= X_{(t+1)j} - \rho X_{tj} \end{aligned}$$ for $1 \leq t \leq n-1$, the transformed model satisfies the "usual" assumptions, i.e. the errors $\tilde{\varepsilon}_t = \omega_{t+1} = \varepsilon_{t+1} - \rho \cdot \varepsilon_t$ are independent $N(0,\sigma^2)$.
  • For coefficients in new model $\tilde{\beta}$, $\beta_0 = \tilde{\beta}_0 / (1 - \rho)$, $\beta_j = \tilde{\beta}_j.$
  • Problem: in general, we don't know $\rho$, so it must be estimated from the data.
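
A sketch of the whitening step, with $\rho$ estimated from the lag-1 autocorrelation of the OLS residuals; the simulated data and variable names are illustrative:

set.seed(5)
n <- 200
x <- rnorm(n)
eps <- as.numeric(arima.sim(model = list(ar = 0.6), n = n))
y <- 1 + 2 * x + eps
rho.hat <- acf(resid(lm(y ~ x)), plot = FALSE)$acf[2]   # lag-1 autocorrelation of residuals
y.w <- y[-1] - rho.hat * y[-n]                          # whitened response
x.w <- x[-1] - rho.hat * x[-n]                          # whitened predictor
coef(lm(y.w ~ x.w))   # intercept estimates (1 - rho) * beta_0; slope estimates beta_1
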
In [22]:
%%R
acf(nasdaq.data$Close)

Model selection criteria

  • Mallows' $C_p$: $$C_p({\cal M}) = \frac{SSE({\cal M})}{\widehat{\sigma}^2} + 2 \cdot p({\cal M}) - n.$$
  • Akaike (AIC) defined as $$AIC({\cal M}) = - 2 \log L({\cal M}) + 2 p({\cal M})$$ where $L({\cal M})$ is the maximized likelihood of the model.
  • Bayes (BIC) defined as $$BIC({\cal M}) = - 2 \log L({\cal M}) + \log n \cdot p({\cal M})$$
  • Adjusted $R^2$
  • Stepwise (step) vs. best subsets (leaps).
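
A sketch of stepwise selection with step(), using AIC by default and BIC via k = log(n); the simulated data frame dat is illustrative (only x1 and x2 actually matter):

set.seed(6)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)
full <- lm(y ~ ., data = dat)
step(full, direction = 'backward')               # AIC-based backward elimination
step(full, direction = 'backward', k = log(n))   # BIC penalty instead of AIC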

$K$-fold cross-validation

  • Fix a model ${\cal M}$. Break data set into $K$ approximately equal sized groups $(G_1, \dots, G_K)$.
  • For each $i$ in $1, \dots, K$: use all groups except $G_i$ to fit the model, then predict the outcome in group $G_i$ based on this fit, $\widehat{Y}_{j,{\cal M}, G_i}, j \in G_i$.
  • Estimate $$CV({\cal M}) = \frac{1}{n}\sum_{i=1}^K \sum_{j \in G_i} (Y_j - \widehat{Y}_{j,{\cal M},G_i})^2.$$
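
A bare-bones K-fold loop, assuming the data frame dat from the model-selection sketch above and K = 5 folds:

K <- 5
folds <- sample(rep(1:K, length.out = nrow(dat)))   # random fold assignment
cv.err <- 0
for (i in 1:K) {
  fit.i <- lm(y ~ x1 + x2, data = dat[folds != i, ])      # fit without group G_i
  pred.i <- predict(fit.i, newdata = dat[folds == i, ])   # predict group G_i
  cv.err <- cv.err + sum((dat$y[folds == i] - pred.i)^2)
}
cv.err / nrow(dat)     # CV estimate of prediction error for this model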

Shrinkage estimator

In [23]:
%%R -h 800 -w 800
nsample = 40
ntrial = 500
mu = 0.5
sigma = 2.5
MSE <- function(mu.hat, mu) {
  return(sum((mu.hat - mu)^2) / length(mu))
}

alpha <- seq(0.0,1,length=20)

mse <- numeric(length(alpha))

for (i in 1:ntrial) {
  Z = rnorm(nsample) * sigma + mu
  for (j in 1:length(alpha)) {
    mse[j] = mse[j] + MSE(alpha[j] * mean(Z) * rep(1, nsample), mu * rep(1, nsample)) / ntrial
  }
}

plot(alpha, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
     xlab=expression(paste('Penalty parameter,', alpha)), 
     ylab=expression(paste('MSE(', alpha, ')')), 
     cex.lab=1.5)

Logistic regression model

  • Logistic model $$E(Y|X) = \pi(X) = \frac{\exp(X^T\beta)}{1 + \exp(X^T\beta)}$$
  • This automatically ensures $0 \leq E(Y|X) = \pi(X) \leq 1$.
  • The logistic transform: $\text{logit}(\pi(X)) = \log\left(\frac{\pi(X)}{1 - \pi(X)}\right) = X^T\beta$

  • An example of a generalized linear model

    • link function $\text{logit}(\pi(X)) = X^T\beta$
    • Variance function: $\text{Var}(Y|X) = \pi(X)(1 - \pi(X))$
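
In R this is glm() with family=binomial; a minimal sketch on simulated data (names and true coefficients are illustrative):

set.seed(7)
x <- rnorm(200)
p <- exp(-1 + 2 * x) / (1 + exp(-1 + 2 * x))   # true success probability
y <- rbinom(200, size = 1, prob = p)
logit.fit <- glm(y ~ x, family = binomial)
summary(logit.fit)$coef                        # estimates of beta_0 = -1, beta_1 = 2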

Odds Ratios

  • One reason logistic models are popular is that the parameters have simple interpretations in terms of odds.
  • Logistic model: $$OR_{X_j} = \frac{ODDS(Y=1|\dots, X_j=x_j+1, \dots)}{ODDS(Y=1|\dots, X_j=x_j, \dots)} = e^{\beta_j}$$
  • If $X_j \in \{0, 1\}$ is dichotomous, then the odds for the group with $X_j = 1$ are $e^{\beta_j}$ times the odds for the group with $X_j = 0$, other parameters being equal.
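
Continuing the sketch above, the fitted odds ratios are just the exponentiated coefficients (Wald intervals via confint.default()):

exp(coef(logit.fit))             # odds ratio per unit increase in x
exp(confint.default(logit.fit))  # Wald confidence interval on the odds-ratio scale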

Deviance

  • For logistic regression model ${\cal M}$, $DEV({\cal M})$ replaces $SSE({\cal M})/\sigma^2$.
  • In least squares regression, we use $$\frac{SSE({\cal M}_R) - SSE({\cal M}_F)}{\sigma^2} \sim \chi^2_{df_R-df_F}$$
  • This is replaced with $DEV({\cal M}_R) - DEV({\cal M}_F) \overset{n \rightarrow \infty}{\sim} \chi^2_{df_R-df_F}$
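
For nested logistic fits, anova(..., test='Chisq') performs exactly this deviance comparison; a sketch continuing the simulated example above, where x2 is an extra noise predictor added just for illustration:

x2 <- rnorm(200)
fit.R <- glm(y ~ x,      family = binomial)   # reduced model
fit.F <- glm(y ~ x + x2, family = binomial)   # full model
anova(fit.R, fit.F, test = 'Chisq')           # DEV(R) - DEV(F) referred to chi-squared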