Model selection

In a given regression situation, there are often many choices to be made. Recall our usual setup $$ Y_{n \times 1} = X_{n \times p} \beta_{p \times 1} + \epsilon_{n \times 1}. $$

Any subset $A \subset \{1, \dots, p\}$ yields a new regression model $$ {\cal M}(A): Y_{n \times 1} = X[,A] \beta[A] + \epsilon_{n \times 1} $$ by setting $\beta[A^c]=0$.

Model selection is, roughly speaking, how to choose $A$ among the $2^p$ possible choices.

Election data

Here is a dataset from the book that we will use to explore model selection.

Variable Description
$V$ votes for a presidential candidate
$I$ are they incumbent?
$D$ Democrat or Republican incumbent?
$W$ wartime election?
$G$ GDP growth rate in election year
$P$ (absolute) GDP deflator growth rate
$N$ number of quarters in which GDP growth rate $> 3.2\%$
In [2]:
%%R
url = 'http://stats191.stanford.edu/data/election.table'
election.table = read.table(url, header=T)
pairs(election.table[,2:ncol(election.table)], cex.labels=3, pch=23,
      bg='orange', cex=2)

Problem & Goals

  • When we have many predictors (with many possible interactions), it can be difficult to find a good model.
  • Which main effects do we include?
  • Which interactions do we include?
  • Model selection procedures try to simplify / automate this task.
  • Election data has $2^6=64$ different models with just main effects!
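
Just to make the count concrete, the $2^6$ on/off patterns for the six main effects can be enumerated directly; a small sketch (the names vars and subsets are chosen here just for illustration):

In [ ]:
%%R
# Enumerate all 2^6 = 64 on/off patterns for the six main effects; this is the
# space that an exhaustive search (e.g. leaps, used below) has to explore.
vars = c("I", "D", "W", "G", "P", "N")
subsets = expand.grid(rep(list(c(FALSE, TRUE)), length(vars)))
colnames(subsets) = vars
nrow(subsets)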

General comments

  • This is generally an "unsolved" problem in statistics: there are no magic procedures to get you the "best model."

  • In some sense, model selection is "data mining."

  • Data miners / machine learners often work with very many predictors.

  • Our model selection problem is generally at a much smaller scale than "data mining" problems.

  • Still, it is a hard problem.

  • Inference after selection is full of pitfalls!

Hypothetical example

  • Suppose we fit a model $F: \quad Y_{n \times 1} = X_{n \times p} \beta_{p \times 1} + \epsilon_{n \times 1}$ with predictors ${ X}_1, \dots, { X}_p$.
  • In reality, some of the $\beta$’s may be zero. Let’s suppose that $\beta_{j+1}= \dots= \beta_{p}=0$.
  • Then, any model that includes $\beta_0, \dots, \beta_j$ is correct: which model gives the best estimates of $\beta_0, \dots, \beta_j$?
  • Principle of parsimony (i.e. Occam’s razor) says that the model with only ${X}_1, \dots, {X}_j$ is "best".

Justifying parsimony

  • For simplicity, let’s assume that $j=1$ so there is only one coefficient to estimate.
  • Then, because each model gives an unbiased estimate of $\beta_1$ we can compare models based on $\text{Var}(\widehat{\beta}_1).$
  • The best model, in terms of this variance, is the one containing only ${ X}_1$; the simulation sketch below illustrates this.
  • What if we didn’t know that only $\beta_1$ was non-zero (which we don’t know in general)?
  • In this situation, we must choose a set of variables.
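
To see this concretely, here is a small simulation sketch (the data and the names Xsim, Ysim, and so on are made up for illustration): only the first predictor truly matters, and we compare the sampling variance of $\widehat{\beta}_1$ from the model containing only $X_1$ with that from a model that also includes four irrelevant predictors.

In [ ]:
%%R
# Simulation sketch: variance of the estimate of beta1 in a small vs. a big model.
set.seed(1)
nrep = 1000
n.obs = 50
Xsim = matrix(rnorm(n.obs * 5), n.obs, 5)   # column 1 is X1; columns 2-5 are irrelevant
est.small = est.big = numeric(nrep)
for (r in 1:nrep) {
    Ysim = 2 * Xsim[, 1] + rnorm(n.obs)     # only beta1 = 2 is non-zero
    est.small[r] = coef(lm(Ysim ~ Xsim[, 1]))[2]
    est.big[r] = coef(lm(Ysim ~ Xsim))[2]
}
c(var(est.small), var(est.big))             # the small model typically has the smaller variance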

Model selection: choosing a subset of variables

  • To "implement" a model selection procedure, we first need a criterion or benchmark to compare two models.
  • Given a criterion, we also need a search strategy.
  • With a limited number of predictors, it is possible to search all possible models (leaps in R).

Candidate criteria

Possible criteria

  • $R^2$: not a good criterion. It always increases with model size, so the "optimum" is to take the biggest model.
  • Adjusted $R^2$: better. It "penalizes" bigger models, following the principle of parsimony / Occam’s razor.
  • Mallow’s $C_p$ – attempts to estimate a model’s predictive power, i.e. the power to predict a new observation.

Best subsets, $R^2$

The leaps function takes a design matrix as an argument: throw away the intercept column or leaps will complain.

In [3]:
%%R
election.lm = lm(V ~ I + D + W +G:I + P + N, election.table)
election.lm

Call:
lm(formula = V ~ I + D + W + G:I + P + N, data = election.table)

Coefficients:
(Intercept)            I            D            W            P            N  
  0.5111627   -0.0201077    0.0546159    0.0133905   -0.0007224   -0.0051822  
        I:G  
  0.0096901  


In [4]:
%%R
X = model.matrix(election.lm)[,-1]
library(leaps)
election.leaps = leaps(X, election.table$V, nbest=3, method='r2')
best.model.r2 = election.leaps$which[which((election.leaps$r2 == max(election.leaps$r2))),]
best.model.r2
   1    2    3    4    5    6 
TRUE TRUE TRUE TRUE TRUE TRUE 

Let's plot $R^2$ as a function of model size. We see that the model with the highest $R^2$ is the full model, i.e. it includes all of the variables.

In [5]:
%%R
plot(election.leaps$size, election.leaps$r2, pch=23, bg='orange', cex=2)

Best subsets, adjusted $R^2$

  • As we add more and more variables to the model (even random ones), $R^2$ will increase to 1.

  • Adjusted $R^2$ tries to take this into account by replacing sums of squares by mean squares $$R^2_a = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - \frac{MSE}{MST}.$$
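
As a quick check of this formula, adjusted $R^2$ can be computed by hand from the full election fit above; it should agree with the "Adjusted R-squared" that summary(election.lm) reports.

In [ ]:
%%R
# Adjusted R^2 "by hand" for the full election model.
n = nrow(election.table)
p = length(coef(election.lm)) - 1           # number of predictors, excluding the intercept
SSE = sum(resid(election.lm)^2)
SST = sum((election.table$V - mean(election.table$V))^2)
c(1 - (SSE / (n - p - 1)) / (SST / (n - 1)), summary(election.lm)$adj.r.squared)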

In [6]:
%%R
election.leaps = leaps(X, election.table$V, nbest=3, method='adjr2')
plot(election.leaps$size, election.leaps$adjr2, pch=23, bg='orange', cex=2)
best.model.adjr2 = election.leaps$which[which((election.leaps$adjr2 == max(election.leaps$adjr2))),]
best.model.adjr2
    1     2     3     4     5     6 
 TRUE  TRUE FALSE FALSE  TRUE  TRUE 

Mallow’s $C_p$

  • $C_p({\cal M}) = \frac{SSE({\cal M})}{\widehat{\sigma}^2} + 2 \cdot p({\cal M}) - n.$
  • $\widehat{\sigma}^2=SSE(F)/df_F$ is the "best" estimate of $\sigma^2$ we have (use the fullest model); in the election data this is the full model with all 6 predictors ($I, D, W, P, N$ and the $I:G$ interaction).
  • $SSE({\cal M})$ is the $SSE$ of the model ${\cal M}$.
  • $p({\cal M})$ is the number of predictors in ${\cal M}$.
  • This is an estimate of the expected mean-squared error of $\widehat{Y}({\cal M})$; it takes both bias and variance into account. A hand computation for one submodel is sketched below.
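
To make the formula concrete, here is a hand computation of $C_p$ for one submodel (the submodel V ~ D + P and the name sub.lm are chosen just for illustration). Here $p$ counts the intercept, which should match the convention leaps uses below; counting differently only shifts every $C_p$ by the same constant, so comparisons between models are unaffected.

In [ ]:
%%R
# Mallow's C_p "by hand" for the submodel V ~ D + P, using the full model
# election.lm to estimate sigma^2.
sub.lm = lm(V ~ D + P, election.table)
sigma.hat2 = sum(resid(election.lm)^2) / election.lm$df.residual
SSE.sub = sum(resid(sub.lm)^2)
p.sub = length(coef(sub.lm))                # number of coefficients, including the intercept
SSE.sub / sigma.hat2 + 2 * p.sub - nrow(election.table)
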
In [7]:
%%R
election.leaps = leaps(X, election.table$V, nbest=3, method='Cp')
plot(election.leaps$size, election.leaps$Cp, pch=23, bg='orange', cex=2)
best.model.Cp = election.leaps$which[which((election.leaps$Cp == min(election.leaps$Cp))),]
best.model.Cp
    1     2     3     4     5     6 
FALSE  TRUE FALSE FALSE  TRUE  TRUE 

Search strategies

  • Given a criterion, we now have to decide how we are going to search through the possible models.

  • "Best subset": search all possible models and take the one with highest $R^2_a$ or lowest $C_p$ leaps. Such searches are typically feasible only up to $p=30$ or $40$.

  • Stepwise (forward, backward or both): useful when the number of predictors is large. Choose an initial model and be "greedy".

  • "Greedy" means always take the biggest jump (up or down) in your selected criterion.

Implementations in R

  • "Best subset": use the function leaps. Works only for multiple linear regression models.
  • Stepwise: use the function step. Works for any model with Akaike Information Criterion (AIC). In multiple linear regression, AIC is (almost) a linear function of $C_p$.

Akaike / Bayes Information Criterion

  • The Akaike Information Criterion (AIC) is defined as $$AIC({\cal M}) = - 2 \log L({\cal M}) + 2 p({\cal M})$$ where $L({\cal M})$ is the maximized likelihood of the model.
  • The Bayes Information Criterion (BIC) is defined as $$BIC({\cal M}) = - 2 \log L({\cal M}) + \log n \cdot p({\cal M}).$$
  • This strategy can be used whenever we have a likelihood, so it generalizes to many statistical models.

AIC for regression

  • In linear regression with unknown $\sigma^2$, $-2 \log L({\cal M}) = n \log(2\pi \widehat{\sigma}^2_{MLE}) + n$, where $\widehat{\sigma}^2_{MLE} = \frac{1}{n} SSE(\widehat{\beta})$.
  • In linear regression with known $\sigma^2$, $-2 \log L({\cal M}) = n \log(2\pi \sigma^2) + \frac{1}{\sigma^2} SSE(\widehat{\beta})$, so AIC is very much like Mallow’s $C_p$.
In [8]:
%%R
n = nrow(X)
p = 7 + 1   # 7 regression coefficients plus 1 for sigma^2
c(n * log(2*pi*sum(resid(election.lm)^2)/n) + n + 2*p, AIC(election.lm))
[1] -66.94026 -66.94026
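
The same sanity check should work for BIC: the log-likelihood term is unchanged, and (as with AIC above) R counts $\sigma^2$ as one of the parameters.

In [ ]:
%%R
# BIC by hand for the full election model: same log-likelihood term, but each
# of the 8 parameters (7 coefficients plus sigma^2) now costs log(n) instead of 2.
n = nrow(election.table)
p = length(coef(election.lm)) + 1
c(n * log(2*pi*sum(resid(election.lm)^2)/n) + n + log(n) * p, BIC(election.lm))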

Properties of AIC / BIC

  • BIC will always choose a model as small as or smaller than the one chosen by AIC (if using the same search direction).

  • As our sample size grows, under some assumptions, it can be shown that

    • AIC will (asymptotically) always choose a model that contains the true model, i.e. it won’t leave any variables out.
    • BIC will (asymptotically) choose exactly the right model.

Election example

Let's take a look at step in action. Probably the simplest strategy is forward stepwise, which tries to add one variable at a time, as long as it can find a resulting model whose AIC is better than that of its current model.

When it can make no further additions, it terminates.

In [9]:
%%R
election.step.forward = step(lm(V ~ 1, election.table), 
                             list(upper = ~ I + D + W + G:I + P + N), 
                             direction='forward')
summary(election.step.forward)
Start:  AIC=-107.78
V ~ 1

       Df Sum of Sq      RSS     AIC
+ D     1 0.0280805 0.084616 -111.80
+ I     1 0.0135288 0.099167 -108.47
+ P     1 0.0124463 0.100250 -108.24
<none>              0.112696 -107.78
+ N     1 0.0024246 0.110271 -106.24
+ W     1 0.0009518 0.111744 -105.96

Step:  AIC=-111.8
V ~ D

       Df Sum of Sq      RSS     AIC
+ P     1 0.0099223 0.074693 -112.42
<none>              0.084616 -111.80
+ W     1 0.0068141 0.077801 -111.56
+ I     1 0.0012874 0.083328 -110.12
+ N     1 0.0000033 0.084612 -109.80

Step:  AIC=-112.42
V ~ D + P

       Df  Sum of Sq      RSS     AIC
<none>               0.074693 -112.42
+ W     1 0.00031940 0.074374 -110.51
+ N     1 0.00018496 0.074508 -110.47
+ I     1 0.00002633 0.074667 -110.42

Call:
lm(formula = V ~ D + P, data = election.table)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.101121 -0.036838 -0.006987  0.019029  0.163250 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.514022   0.022793  22.552  1.2e-14 ***
D            0.043134   0.017381   2.482   0.0232 *  
P           -0.006017   0.003891  -1.546   0.1394    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.06442 on 18 degrees of freedom
Multiple R-squared: 0.3372,	Adjusted R-squared: 0.2636 
F-statistic: 4.579 on 2 and 18 DF,  p-value: 0.02468 


We notice that although the upper scope we gave it included the interaction I:G, the function step never tried to use it. This is because of rules implemented in step that do not add an interaction unless both of its main effects are already in the model. In this case, because neither $I$ nor $G$ was ever added (indeed, $G$ does not even appear in the scope as a main effect), the interaction was never considered.

In the leaps example, we gave the function the design matrix and it did not have to consider interactions: they were already encoded in the design matrix.
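
If we did want forward stepwise to be able to pick up the interaction, one option (a sketch; the names election2 and IG are made up here) is to mimic the leaps approach and encode the product $I \cdot G$ as its own column:

In [ ]:
%%R
# Treat the interaction as an ordinary column so forward stepwise can add it
# without first requiring both main effects.
election2 = transform(election.table, IG = I * G)
election.step.forward.IG = step(lm(V ~ 1, election2),
                                list(upper = ~ I + D + W + IG + P + N),
                                direction='forward', trace=0)
election.step.forward.IG                    # inspect which terms were selected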

BIC example

The only difference between AIC and BIC is the price paid per variable. This is the argument k to step. By default k=2 and for BIC we set k=log(n). If we set k=0 it will always add variables.

In [10]:
%%R
election.step.forward.BIC = step(lm(V ~ 1, election.table), 
                                 list(upper = ~ I + D + W +G:I + P + N), 
                                 direction='forward', k=log(nrow(X)))
summary(election.step.forward.BIC)
Start:  AIC=-106.73
V ~ 1

       Df Sum of Sq      RSS     AIC
+ D     1 0.0280805 0.084616 -109.71
<none>              0.112696 -106.73
+ I     1 0.0135288 0.099167 -106.38
+ P     1 0.0124463 0.100250 -106.15
+ N     1 0.0024246 0.110271 -104.15
+ W     1 0.0009518 0.111744 -103.87

Step:  AIC=-109.71
V ~ D

       Df Sum of Sq      RSS     AIC
<none>              0.084616 -109.71
+ P     1 0.0099223 0.074693 -109.28
+ W     1 0.0068141 0.077801 -108.43
+ I     1 0.0012874 0.083328 -106.99
+ N     1 0.0000033 0.084612 -106.67

(Note that the trace above labels its criterion "AIC" even though we set k=log(n), so the numbers shown are BIC values. With this stiffer penalty, forward selection stops after adding only D: adding P would increase the criterion (-109.28 versus -109.71 for the current model), so the selected model is V ~ D rather than the V ~ D + P chosen under AIC.)


Backward selection

Just for fun, let's consider backwards stepwise. This starts at a full model and tries to delete variables.

There is also a direction="both" option.

In [11]:
%%R
election.step.backward = step(election.lm, direction='backward')
summary(election.step.backward)
Start:  AIC=-128.54
V ~ I + D + W + G:I + P + N

       Df Sum of Sq      RSS     AIC
- P     1  0.000055 0.023741 -130.49
- W     1  0.000170 0.023855 -130.39
<none>              0.023686 -128.54
- N     1  0.003133 0.026818 -127.93
- D     1  0.011926 0.035612 -121.97
- I:G   1  0.050640 0.074325 -106.52

Step:  AIC=-130.49
V ~ I + D + W + N + I:G

       Df Sum of Sq      RSS     AIC
- W     1  0.000120 0.023860 -132.38
<none>              0.023741 -130.49
- N     1  0.003281 0.027021 -129.77
- D     1  0.013983 0.037724 -122.76
- I:G   1  0.053507 0.077248 -107.71

Step:  AIC=-132.38
V ~ I + D + N + I:G

       Df Sum of Sq      RSS     AIC
<none>              0.023860 -132.38
- N     1  0.003199 0.027059 -131.74
- D     1  0.013867 0.037727 -124.76
- I:G   1  0.059452 0.083312 -108.12

Call:
lm(formula = V ~ I + D + N + I:G, data = election.table)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.043509 -0.019208 -0.004912  0.009626  0.090627 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.506530   0.020689  24.483 4.15e-14 ***
I           -0.019417   0.014701  -1.321  0.20515    
D            0.055436   0.018180   3.049  0.00765 ** 
N           -0.004653   0.003177  -1.465  0.16241    
I:G          0.009588   0.001519   6.314 1.03e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.03862 on 16 degrees of freedom
Multiple R-squared: 0.7883,	Adjusted R-squared: 0.7353 
F-statistic: 14.89 on 4 and 16 DF,  p-value: 2.95e-05 
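
Returning to the direction="both" option mentioned above, here is a quick sketch (the name election.step.both is made up). When no scope is supplied, step uses the starting model's terms as the upper scope, so this search can both drop and re-add terms of the full model.

In [ ]:
%%R
# Stepwise search in both directions, starting from the full model.
election.step.both = step(election.lm, direction='both', trace=0)
summary(election.step.both)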


Cross-validation

Yet another model selection criterion is $K$-fold cross-validation.

  • Fix a model ${\cal M}$. Break data set into $K$ approximately equal sized groups $(G_1, \dots, G_K)$.
  • For each $i$ in $1, \dots, K$: use all groups except $G_i$ to fit the model, then predict the outcomes in group $G_i$ from that fit, $\widehat{Y}_{j,{\cal M}, G_i}, j \in G_i$.
  • This is similar to what we saw with Cook's distance, which is related to $n$-fold (leave-one-out) cross-validation.
  • Estimate $CV({\cal M}) = \frac{1}{n}\sum_{i=1}^K \sum_{j \in G_i} (Y_j - \widehat{Y}_{j,{\cal M},G_i})^2.$ A hand-rolled version of this recipe is sketched after this list.
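
Here is a minimal hand-rolled version of this recipe for the model V ~ D + P (a sketch; the cv.glm function used below packages essentially the same computation):

In [ ]:
%%R
# 5-fold cross-validation "by hand" for the model V ~ D + P.
set.seed(1)
K = 5
n = nrow(election.table)
folds = sample(rep(1:K, length.out = n))    # randomly assign each case to one of K groups
press = 0
for (k in 1:K) {
    fit = lm(V ~ D + P, data = election.table[folds != k, ])
    pred = predict(fit, newdata = election.table[folds == k, ])
    press = press + sum((election.table$V[folds == k] - pred)^2)
}
press / n                                   # CV estimate of prediction error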

Comments about cross-validation.

  • It is a general principle that can be used in other situations to "choose parameters."
  • Pros (partial list): "objective" measure of a model.
  • Cons (partial list): inference is usually "out the window" (also true for other model selection procedures).
  • If goal is not really inference about certain specific parameters, it is a reasonable way to compare models.

$C_p$ versus 5-fold cross-validation

Let's plot our $C_p$ versus the $CV$ score. Keep in mind that there is additional randomness in the $CV$ score due to the random assignments to groups. Below, I have set the random seed so the results stay the same each time I run this chunk of code.

In [12]:
%%R
library(boot)
set.seed(0)
election.leaps = leaps(X, election.table$V, nbest=3, method='Cp')
V = election.table$V
election.leaps$CV = 0 * election.leaps$Cp   # placeholder for the CV scores
for (i in 1:nrow(election.leaps$which)) {
    # columns of X used in the i-th model considered by leaps
    subset = c(1:ncol(X))[election.leaps$which[i,]]
    Xw = X[, subset]
    wlm = glm(V ~ Xw)
    # 5-fold cross-validated estimate of prediction error for this model
    election.leaps$CV[i] = cv.glm(model.frame(wlm), wlm, K=5)$delta[1]
}
In [13]:
%%R
plot(election.leaps$Cp, election.leaps$CV, pch=23, bg='orange', cex=2)
In [14]:
%%R
plot(election.leaps$size, election.leaps$CV, pch=23, bg='orange', cex=2)
best.model.Cv = election.leaps$which[which((election.leaps$CV == min(election.leaps$CV))),]
best.model.Cv
    1     2     3     4     5     6 
FALSE  TRUE FALSE  TRUE  TRUE  TRUE 

Conclusions

The model selected depends on the criterion used.

Criterion Model
$R^2$ ~ $I + D + W + P + N + I:G$ (the full model)
$R^2_a$ ~ $I + D + N + I:G$
$C_p$ ~ $D + N + I:G$
AIC forward ~ $D + P$
BIC forward ~ $D$
AIC backward ~ $I + D + N + I:G$
5-fold CV ~ $D + P + N + I:G$
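
These subsets can be read off by matching the logical vectors printed above against the columns of X, which (from the coefficient listing of election.lm) are ordered I, D, W, P, N, I:G. A small convenience snippet:

In [ ]:
%%R
# Translate the logical "which" vectors from leaps into variable names.
colnames(X)[best.model.adjr2]
colnames(X)[best.model.Cp]
colnames(X)[best.model.Cv]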

The selected model is random!

Where we are so far

  • Many other "criteria" have been proposed.
  • Some work well for some types of data, others for different data.
  • Check diagnostics!
  • These criteria (except cross-validation) are not "direct measures" of predictive power, though Mallow’s $C_p$ is a step in this direction.
  • $C_p$ measures the quality of a model based on both bias and variance of the model. Why is this important?
  • Bias-variance tradeoff is ubiquitous in statistics. More next class.

Inference after selection

Each of the above criteria returns a model, and the summary of that model provides $p$-values.

In [15]:
%%R
summary(election.step.forward)

Call:
lm(formula = V ~ D + P, data = election.table)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.101121 -0.036838 -0.006987  0.019029  0.163250 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.514022   0.022793  22.552  1.2e-14 ***
D            0.043134   0.017381   2.482   0.0232 *  
P           -0.006017   0.003891  -1.546   0.1394    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.06442 on 18 degrees of freedom
Multiple R-squared: 0.3372,	Adjusted R-squared: 0.2636 
F-statistic: 4.579 on 2 and 18 DF,  p-value: 0.02468 


We can also form confidence intervals. But can we trust these intervals or tests?

In [16]:
%%R
confint(election.step.forward)
                   2.5 %      97.5 %
(Intercept)  0.466136288 0.561907650
D            0.006618001 0.079649847
P           -0.014193177 0.002158189

A variant of forward stepwise

To illustrate the "dangers" of trusting the above $p$-values, I will use the lars package which we will see later.

This algorithm, LAR (least angle regression), adds variables to the model one at a time, just as forward stepwise does. It can be viewed as a modification of forward stepwise, sometimes described as a more democratic version of it.

In [17]:
%%R
library(lars)
plot(lars(X, V, type='lar'))
Loaded lars 1.1


The path can be parameterized by a positive parameter called $\lambda$. The first value $\lambda_1$ is, up to a multiplicative factor, the largest absolute correlation between any column of $X$ and $Y$.

In this sense, the first step of LARS and forward stepwise are the same.

In [18]:
%%R
nsim = 5000
exactP = c()
naiveP = c()
for (i in 1:nsim) {
    # response with no true signal: all coefficients are zero
    Z = rnorm(nrow(X))
    L0 = lars(X, Z, type='lar', max.steps=3)
    lam1 = L0$lambda[1]   # first lambda value along the LAR path
    lam2 = L0$lambda[2]   # second lambda value along the LAR path
    # naive p-value: treat lam1 as an ordinary standard normal statistic
    naiveP = c(naiveP, 2 * (1 - pnorm(lam1)))
    # adjusted p-value that accounts for the selection of the first variable
    exactP = c(exactP, (1 - pnorm(lam1)) / (1 - pnorm(lam2)))
}
In [19]:
%%R
plot(ecdf(naiveP), lwd=3, col='red')
abline(0,1, lwd=2, lty=2)

What is the type I error of the first test?

In [20]:
%%R
ecdf(naiveP)(0.05)
[1] 0.2296

It turns out that it is possible to modify the usual $t$-statistic so that we still get valid tests after the first step. This is what I called exactP above.
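
As with naiveP, we can look at the empirical type I error of the exact test at level 0.05; if these adjusted p-values are valid, the result should be close to the nominal 0.05.

In [ ]:
%%R
ecdf(exactP)(0.05)   # empirical type I error of the exact test at level 0.05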

In [21]:
%%R
plot(ecdf(exactP), lwd=3, col='blue')
plot(ecdf(naiveP), lwd=3, col='red', add=TRUE)
abline(0,1, lwd=2, lty=2)

An exciting area of research

This exactP was only discovered recently!

For at least 10 years, I have been saying things along the lines of

Inference after model selection is basically out the window.
Forget all we taught you about t and F distributions as it
is no longer true...

It turns out that inference after selection is possible, and it doesn't force us to throw away all of our tools for inference.

But, it is a little more complicated to describe.
