Model selection¶

In a given regression situation, there are often many choices to be made. Recall our usual setup $$Y_{n \times 1} = X_{n \times p} \beta_{p \times 1} + \epsilon_{n \times 1}.$$

Any subset $A \subset \{1, \dots, p\}$ yields a new regression model $${\cal M}(A): Y_{n \times 1} = X[,A] \beta[A] + \epsilon_{n \times 1}$$ by setting $\beta[A^c]=0$.

Model selection is, roughly speaking, how to choose $A$ among the $2^p$ possible choices.
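With $p$ predictors there are $2^p$ candidate models, and for small $p$ these can simply be enumerated. A minimal sketch (on a hypothetical simulated design, not the election data):

```r
# Enumerate all 2^p submodels for a hypothetical design with p = 3 predictors
# and record each model's residual sum of squares.
set.seed(1)
n = 50
p = 3
X = matrix(rnorm(n * p), n, p, dimnames=list(NULL, c('X1', 'X2', 'X3')))
Y = 2 * X[, 1] + rnorm(n)

subsets = as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), p)))  # 2^3 = 8 subsets A
rss = apply(subsets, 1, function(A) {
    fit = if (any(A)) lm(Y ~ X[, A, drop=FALSE]) else lm(Y ~ 1)
    sum(resid(fit)^2)
})
length(rss)   # 8 candidate models
```

Note that the residual sum of squares alone cannot pick a model: the full model always has the smallest RSS, which is exactly why the criteria below are needed.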

Election data¶

Here is a dataset from the book that we will use to explore different model selection approaches.

Variable Description
$V$ votes for a presidential candidate
$I$ are they incumbent?
$D$ Democrat or Republican incumbent?
$W$ wartime election?
$G$ GDP growth rate in election year
$P$ (absolute) GDP deflator growth rate
$N$ number of quarters in which GDP growth rate $> 3.2\%$
In [1]:
%%R
url = 'http://stats191.stanford.edu/data/election.table'
election.table = read.table(url, header=TRUE)
pairs(election.table[,2:ncol(election.table)], cex.labels=3, pch=23,
      bg='orange', cex=2)


Problem & Goals¶

• When we have many predictors (with many possible interactions), it can be difficult to find a good model.
• Which main effects do we include?
• Which interactions do we include?
• Model selection procedures try to simplify / automate this task.
• Election data has $2^6=64$ different models with just main effects!

• This is generally an "unsolved" problem in statistics: there are no magic procedures to get you the "best model."

• In some sense, model selection is "data mining."

• Data miners / machine learners often work with very many predictors.

• Our model selection problem is generally at a much smaller scale than "data mining" problems.

• Still, it is a hard problem.

• Inference after selection is full of pitfalls!

Hypothetical example¶

• Suppose we fit a model $F: \quad Y_{n \times 1} = X_{n \times p} \beta_{p \times 1} + \varepsilon_{n \times 1}$ with predictors ${ X}_1, \dots, { X}_p$.
• In reality, some of the $\beta$'s may be zero. Let's suppose that $\beta_{j+1}= \dots= \beta_{p}=0$.
• Then, any model that includes $\beta_0, \dots, \beta_j$ is correct: which model gives the best estimates of $\beta_0, \dots, \beta_j$?
• The principle of parsimony (i.e. Occam's razor) says that the model with only ${X}_1, \dots, {X}_j$ is "best".

Justifying parsimony¶

• For simplicity, let's assume that $j=1$ so there is only one coefficient to estimate.
• Then, because each correct model gives an unbiased estimate of $\beta_1$, we can compare models based on $\text{Var}(\widehat{\beta}_1).$
• The best model, in terms of this variance, is the one containing only ${ X}_1$.
• What if we didn't know that only $\beta_1$ was non-zero (which we don't know in general)?
• In this situation, we must choose a set of variables.
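The variance argument is easy to check by simulation. A sketch with a hypothetical pair of correlated predictors, where only $X_1$ actually matters:

```r
# Simulation: beta1-hat is unbiased in both models, but adding an irrelevant,
# correlated predictor X2 inflates its variance.
set.seed(1)
n = 100
nsim = 2000
b1.small = b1.big = numeric(nsim)
for (i in 1:nsim) {
    X1 = rnorm(n)
    X2 = 0.9 * X1 + sqrt(1 - 0.9^2) * rnorm(n)   # correlated with X1, but irrelevant
    Y = 1 + 2 * X1 + rnorm(n)
    b1.small[i] = coef(lm(Y ~ X1))['X1']
    b1.big[i]   = coef(lm(Y ~ X1 + X2))['X1']
}
c(mean(b1.small), mean(b1.big))   # both close to 2 (unbiased)
c(var(b1.small), var(b1.big))     # the smaller model wins on variance
```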

Model selection: choosing a subset of variables¶

• To "implement" a model selection procedure, we first need a criterion or benchmark to compare two models.
• Given a criterion, we also need a search strategy.
• With a limited number of predictors, it is possible to search all possible models (leaps in R).

Candidate criteria¶

Possible criteria:

• $R^2$: not a good criterion. It always increases with model size, so the "optimum" is to take the biggest model.
• Adjusted $R^2$: better. It "penalizes" bigger models. Follows the principle of parsimony / Occam's razor.
• Mallow's $C_p$: attempts to estimate a model's predictive power, i.e. the power to predict a new observation.

Best subsets, $R^2$¶

leaps takes a design matrix as its argument: drop the intercept column, or leaps will complain.

In [2]:
%%R
election.lm = lm(V ~ I + D + W +G:I + P + N, election.table)
election.lm

Call:
lm(formula = V ~ I + D + W + G:I + P + N, data = election.table)

Coefficients:
(Intercept)            I            D            W            P            N
0.5111627   -0.0201077    0.0546159    0.0133905   -0.0007224   -0.0051822
I:G
0.0096901


In [3]:
%%R
X = model.matrix(election.lm)[,-1]
library(leaps)
election.leaps = leaps(X, election.table$V, nbest=3, method='r2')
best.model.r2 = election.leaps$which[which((election.leaps$r2 == max(election.leaps$r2))),]
best.model.r2

   1    2    3    4    5    6
TRUE TRUE TRUE TRUE TRUE TRUE


Let's plot the $R^2$ as a function of the model size. As expected, the model that maximizes $R^2$ is the full model, which includes all variables.

In [4]:
%%R
plot(election.leaps$size, election.leaps$r2, pch=23, bg='orange', cex=2)


Best subsets, adjusted $R^2$¶

• As we add more and more variables to the model (even random ones), $R^2$ will increase to 1.

• Adjusted $R^2$ tries to take this into account by replacing sums of squares by mean squares $$R^2_a = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - \frac{MSE}{MST}.$$
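As a sanity check, the formula can be evaluated by hand and compared against R's own computation (a sketch on simulated data):

```r
# Compute adjusted R^2 from the formula and compare with summary()$adj.r.squared.
set.seed(1)
n = 40
X1 = rnorm(n); X2 = rnorm(n)
Y = 1 + X1 + rnorm(n)
fit = lm(Y ~ X1 + X2)
p = 2                                  # number of predictors, not counting the intercept
SSE = sum(resid(fit)^2)
SST = sum((Y - mean(Y))^2)
adjr2 = 1 - (SSE / (n - p - 1)) / (SST / (n - 1))
c(adjr2, summary(fit)$adj.r.squared)   # the two should agree exactly
```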

In [5]:
%%R
election.leaps = leaps(X, election.table$V, nbest=3, method='adjr2')
plot(election.leaps$size, election.leaps$adjr2, pch=23, bg='orange', cex=2)
best.model.adjr2 = election.leaps$which[which((election.leaps$adjr2 == max(election.leaps$adjr2))),]
best.model.adjr2

    1     2     3     4     5     6
TRUE  TRUE FALSE FALSE  TRUE  TRUE


Mallow’s $C_p$¶

• $C_p({\cal M}) = \frac{SSE({\cal M})}{\widehat{\sigma}^2} + 2 \cdot p({\cal M}) - n.$
• $\widehat{\sigma}^2=SSE(F)/df_F$ is the "best" estimate of $\sigma^2$ we have, based on the fullest model $F$; in the election data this is the model with all 6 predictors.
• $SSE({\cal M})$ is the $SSE$ of the model ${\cal M}$.
• $p({\cal M})$ is the number of predictors in ${\cal M}$.
• This is an estimate of the expected mean-squared error of $\widehat{Y}({\cal M})$; it takes both bias and variance into account.
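The formula is simple to compute directly. A sketch on simulated data; here $p({\cal M})$ is taken to count the intercept, the convention under which the full model's $C_p$ equals its own $p$:

```r
# Mallows' Cp computed directly, with sigma^2 estimated from the full model F.
set.seed(1)
n = 40
X1 = rnorm(n); X2 = rnorm(n); X3 = rnorm(n)
Y = 1 + X1 + rnorm(n)
full = lm(Y ~ X1 + X2 + X3)
sigma.hat2 = sum(resid(full)^2) / full$df.residual   # SSE(F) / df_F

Cp = function(fit, p) sum(resid(fit)^2) / sigma.hat2 + 2 * p - n
c(Cp(lm(Y ~ X1), 2),   # candidate model with intercept and X1
  Cp(full, 4))         # the full model's Cp is exactly its p
```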
In [6]:
%%R
election.leaps = leaps(X, election.table$V, nbest=3, method='Cp')
plot(election.leaps$size, election.leaps$Cp, pch=23, bg='orange', cex=2)
best.model.Cp = election.leaps$which[which((election.leaps$Cp == min(election.leaps$Cp))),]
best.model.Cp

    1     2     3     4     5     6
FALSE  TRUE FALSE FALSE  TRUE  TRUE


Search strategies¶

• Given a criterion, we now have to decide how we are going to search through the possible models.

• "Best subset": search all possible models and take the one with highest $R^2_a$ or lowest $C_p$ (this is what leaps does). Such searches are typically feasible only up to $p=30$ or $40$.

• Stepwise (forward, backward or both): useful when the number of predictors is large. Choose an initial model and be "greedy".

• "Greedy" means always take the biggest jump (up or down) in your selected criterion.
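The greedy forward strategy can be sketched in a few lines of base R. A minimal illustration on simulated data, using adjusted $R^2$ as the criterion (step itself uses AIC):

```r
# Greedy forward selection: at each step, add the single predictor that most
# improves adjusted R^2; stop when no addition improves the criterion.
set.seed(1)
n = 60
X = matrix(rnorm(n * 4), n, 4, dimnames=list(NULL, paste0('X', 1:4)))
Y = 1 + 2 * X[, 1] + X[, 2] + rnorm(n)

adjr2 = function(vars) {
    fit = if (length(vars)) lm(Y ~ X[, vars, drop=FALSE]) else lm(Y ~ 1)
    summary(fit)$adj.r.squared
}
current = character(0)
repeat {
    candidates = setdiff(colnames(X), current)
    if (!length(candidates)) break
    scores = sapply(candidates, function(v) adjr2(c(current, v)))
    if (max(scores) <= adjr2(current)) break
    current = c(current, candidates[which.max(scores)])   # greedy: biggest jump
}
current   # the true signal variables X1 and X2 should be among those selected
```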

Implementations in R¶

• "Best subset": use the function leaps. Works only for multiple linear regression models.
• Stepwise: use the function step. Works for any model that has an Akaike Information Criterion (AIC). In multiple linear regression, AIC is (almost) a linear function of $C_p$.

Akaike / Bayes Information Criterion¶

• Akaike (AIC) defined as $$AIC({\cal M}) = - 2 \log L({\cal M}) + 2 \cdot p({\cal M})$$ where $L({\cal M})$ is the maximized likelihood of the model.
• Bayes (BIC) defined as $$BIC({\cal M}) = - 2 \log L({\cal M}) + \log n \cdot p({\cal M})$$
• This strategy can be used whenever we have a likelihood, so it generalizes to many statistical models.
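In R, BIC differs from AIC only in the per-parameter price ($\log n$ instead of 2), which is easy to verify on a toy fit:

```r
# BIC equals AIC with penalty log(n) per parameter instead of 2.
set.seed(1)
n = 30
x = rnorm(n)
y = 1 + x + rnorm(n)
fit = lm(y ~ x)
k = length(coef(fit)) + 1                  # parameters: coefficients plus sigma^2
c(BIC(fit), AIC(fit) + (log(n) - 2) * k)   # the two should agree
```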

AIC for regression¶

• In linear regression with unknown $\sigma^2$ $$-2 \log L({\cal M}) = n \log(2\pi \widehat{\sigma}^2_{MLE}) + n$$ where $\widehat{\sigma}^2_{MLE} = \frac{1}{n} SSE(\widehat{\beta})$
• In linear regression with known $\sigma^2$ $$-2 \log L({\cal M}) = n \log(2\pi \sigma^2) + \frac{1}{\sigma^2} SSE(\widehat{\beta})$$ so AIC is very much like Mallow's $C_p$.
In [7]:
%%R
n = nrow(X)
p = 7 + 1   # 7 coefficients (intercept + 6 terms) plus one for sigma^2
c(n * log(2*pi*sum(resid(election.lm)^2)/n) + n + 2*p, AIC(election.lm))

[1] -66.94026 -66.94026


Properties of AIC / BIC¶

• BIC will always choose a model as small or smaller than AIC (if using the same search direction, since $\log n > 2$ once $n \geq 8$).

• As our sample size grows, under some assumptions, it can be shown that

• AIC will (asymptotically) always choose a model that contains the true model, i.e. it won't leave any variables out.
• BIC will (asymptotically) choose exactly the right model.

Election example¶

Let's take a look at step in action. Probably the simplest strategy is forward stepwise, which tries to add one variable at a time, as long as it can find a resulting model whose AIC is better than that of its current model.

When it can make no further additions, it terminates.

In [8]:
%%R
election.step.forward = step(lm(V ~ 1, election.table),
list(upper = ~ I + D + W + G:I + P + N),
direction='forward')
summary(election.step.forward)

Start:  AIC=-107.78
V ~ 1

Df Sum of Sq      RSS     AIC
+ D     1 0.0280805 0.084616 -111.80
+ I     1 0.0135288 0.099167 -108.47
+ P     1 0.0124463 0.100250 -108.24
<none>              0.112696 -107.78
+ N     1 0.0024246 0.110271 -106.24
+ W     1 0.0009518 0.111744 -105.96

Step:  AIC=-111.8
V ~ D

Df Sum of Sq      RSS     AIC
+ P     1 0.0099223 0.074693 -112.42
<none>              0.084616 -111.80
+ W     1 0.0068141 0.077801 -111.56
+ I     1 0.0012874 0.083328 -110.12
+ N     1 0.0000033 0.084612 -109.80

Step:  AIC=-112.42
V ~ D + P

Df  Sum of Sq      RSS     AIC
<none>               0.074693 -112.42
+ W     1 0.00031940 0.074374 -110.51
+ N     1 0.00018496 0.074508 -110.47
+ I     1 0.00002633 0.074667 -110.42

Call:
lm(formula = V ~ D + P, data = election.table)

Residuals:
Min        1Q    Median        3Q       Max
-0.101121 -0.036838 -0.006987  0.019029  0.163250

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.514022   0.022793  22.552  1.2e-14 ***
D            0.043134   0.017381   2.482   0.0232 *
P           -0.006017   0.003891  -1.546   0.1394
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06442 on 18 degrees of freedom
Multiple R-squared:  0.3372,	Adjusted R-squared:  0.2636
F-statistic: 4.579 on 2 and 18 DF,  p-value: 0.02468



We notice that although the full model we gave it had the interaction I:G, the function step never tried to use it. This is because step follows rules that will not include an interaction unless both main effects are already in the model. Here, because neither $I$ nor $G$ was added, the interaction was never considered.

In the leaps example, we gave the function the design matrix and it did not have to consider interactions: they were already encoded in the design matrix.

BIC example¶

The only difference between AIC and BIC is the price paid per variable. This is the argument k to step. By default k=2 and for BIC we set k=log(n). If we set k=0 it will always add variables.

In [9]:
%%R
election.step.forward.BIC = step(lm(V ~ 1, election.table),
list(upper = ~ I + D + W +G:I + P + N),
direction='forward', k=log(nrow(X)))
summary(election.step.forward.BIC)

Start:  AIC=-106.73
V ~ 1

Df Sum of Sq      RSS     AIC
+ D     1 0.0280805 0.084616 -109.71
<none>              0.112696 -106.73
+ I     1 0.0135288 0.099167 -106.38
+ P     1 0.0124463 0.100250 -106.15
+ N     1 0.0024246 0.110271 -104.15
+ W     1 0.0009518 0.111744 -103.87

Step:  AIC=-109.71
V ~ D

Df Sum of Sq      RSS     AIC
<none>              0.084616 -109.71
+ P     1 0.0099223 0.074693 -109.28
+ W     1 0.0068141 0.077801 -108.43
+ I     1 0.0012874 0.083328 -106.99
+ N     1 0.0000033 0.084612 -106.67

Call:
lm(formula = V ~ D, data = election.table)

Residuals:
Min        1Q    Median        3Q       Max
-0.125196 -0.033002 -0.007789  0.018511  0.150298

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.48640    0.01466  33.172   <2e-16 ***
D            0.04509    0.01796   2.511   0.0212 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06673 on 19 degrees of freedom
Multiple R-squared:  0.2492,	Adjusted R-squared:  0.2097
F-statistic: 6.305 on 1 and 19 DF,  p-value: 0.02124



Backward selection¶

Just for fun, let's consider backwards stepwise. This starts at a full model and tries to delete variables.

There is also a direction="both" option.

In [10]:
%%R
election.step.backward = step(election.lm, direction='backward')
summary(election.step.backward)

Start:  AIC=-128.54
V ~ I + D + W + G:I + P + N

Df Sum of Sq      RSS     AIC
- P     1  0.000055 0.023741 -130.49
- W     1  0.000170 0.023855 -130.39
<none>              0.023686 -128.54
- N     1  0.003133 0.026818 -127.93
- D     1  0.011926 0.035612 -121.97
- I:G   1  0.050640 0.074325 -106.52

Step:  AIC=-130.49
V ~ I + D + W + N + I:G

Df Sum of Sq      RSS     AIC
- W     1  0.000120 0.023860 -132.38
<none>              0.023741 -130.49
- N     1  0.003281 0.027021 -129.77
- D     1  0.013983 0.037724 -122.76
- I:G   1  0.053507 0.077248 -107.71

Step:  AIC=-132.38
V ~ I + D + N + I:G

Df Sum of Sq      RSS     AIC
<none>              0.023860 -132.38
- N     1  0.003199 0.027059 -131.74
- D     1  0.013867 0.037727 -124.76
- I:G   1  0.059452 0.083312 -108.12

Call:
lm(formula = V ~ I + D + N + I:G, data = election.table)

Residuals:
Min        1Q    Median        3Q       Max
-0.043509 -0.019208 -0.004912  0.009626  0.090627

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.506530   0.020689  24.483 4.15e-14 ***
I           -0.019417   0.014701  -1.321  0.20515
D            0.055436   0.018180   3.049  0.00765 **
N           -0.004653   0.003177  -1.465  0.16241
I:G          0.009588   0.001519   6.314 1.03e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03862 on 16 degrees of freedom
Multiple R-squared:  0.7883,	Adjusted R-squared:  0.7353
F-statistic: 14.89 on 4 and 16 DF,  p-value: 2.95e-05



Cross-validation¶

Yet another model selection criterion is $K$-fold cross-validation.

• Fix a model ${\cal M}$. Break data set into $K$ approximately equal sized groups $(G_1, \dots, G_K)$.
• For each $i = 1, \dots, K$: use all groups except $G_i$ to fit the model, then predict the outcomes in group $G_i$ from that fit: $\widehat{Y}_{j,{\cal M}, G_i}, j \in G_i$.
• This is similar to what we saw with Cook's distance, which is related to $n$-fold (leave-one-out) cross-validation.
• Estimate $CV({\cal M}) = \frac{1}{n}\sum_{i=1}^K \sum_{j \in G_i} (Y_j - \widehat{Y}_{j,{\cal M},G_i})^2.$

• It is a general principle that can be used in other situations to "choose parameters."
• Pros (partial list): "objective" measure of a model.
• Cons (partial list): inference is usually "out the window" (also true for other model selection procedures).
• If goal is not really inference about certain specific parameters, it is a reasonable way to compare models.
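The recipe above can be implemented from scratch in a few lines of base R (a sketch on simulated data; no leaps helper is assumed):

```r
# 5-fold cross-validation score CV(M) for a linear model, from scratch.
set.seed(1)
n = 50
K = 5
x = rnorm(n)
y = 1 + 2 * x + rnorm(n)
fold = sample(rep(1:K, length.out=n))   # random assignment into K groups
sq.err = numeric(n)
for (i in 1:K) {
    test = (fold == i)
    fit = lm(y ~ x, subset=!test)       # fit on all groups except G_i
    sq.err[test] = (y[test] - predict(fit, data.frame(x=x[test])))^2
}
CV = mean(sq.err)
CV   # estimate of the prediction error of this model
```

Rerunning without the seed gives a slightly different $CV$ each time, which is the extra randomness mentioned below.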

$C_p$ versus 5-fold cross-validation¶

Let's plot our $C_p$ versus the $CV$ score for each candidate model (the cell computing the CV scores and storing them as election.leaps$CV is not shown here). Keep in mind that there is additional randomness in the $CV$ score due to the random assignment to groups; the random seed was set so the results stay the same each time this chunk of code is run.

In [12]:
%%R
plot(election.leaps$Cp, election.leaps$CV, pch=23, bg='orange', cex=2)

In [13]:
%%R
plot(election.leaps$size, election.leaps$CV, pch=23, bg='orange', cex=2)
best.model.Cv = election.leaps$which[which((election.leaps$CV == min(election.leaps$CV))),] best.model.Cv   1 2 3 4 5 6 FALSE TRUE FALSE TRUE TRUE TRUE  Conclusions¶ The model selected depends on the criterion used. Criterion Model$R^2$~$ I + D + W +G:I + P + NR^2_a$~$ I + D + P + NC_p$~$D+P+N$AIC forward ~$D+P$BIC forward ~$D$AIC backward ~$I + D + N + I:G$5-fold CV ~$ I+W$The selected model is random! Where we are so far¶ • Many other "criteria" have been proposed. • Some work well for some types of data, others for different data. • Check diagnostics! • These criteria (except cross-validation) are not "direct measures" of predictive power, though Mallowâ€™s$C_p$is a step in this direction. •$C_p$measures the quality of a model based on both bias and variance of the model. Why is this important? • Bias-variance tradeoff is ubiquitous in statistics. More next class. Inference after selection¶ Each of the above criteria return a model. The summary provides$p$-values. In [14]: %%R summary(election.step.forward)  Call: lm(formula = V ~ D + P, data = election.table) Residuals: Min 1Q Median 3Q Max -0.101121 -0.036838 -0.006987 0.019029 0.163250 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.514022 0.022793 22.552 1.2e-14 *** D 0.043134 0.017381 2.482 0.0232 * P -0.006017 0.003891 -1.546 0.1394 --- Signif. codes: 0 â€˜***â€™ 0.001 â€˜**â€™ 0.01 â€˜*â€™ 0.05 â€˜.â€™ 0.1 â€˜ â€™ 1 Residual standard error: 0.06442 on 18 degrees of freedom Multiple R-squared: 0.3372, Adjusted R-squared: 0.2636 F-statistic: 4.579 on 2 and 18 DF, p-value: 0.02468  We can also form confidence intervals. But, can we trust these intervals or tests! In [15]: %%R confint(election.step.forward)   2.5 % 97.5 % (Intercept) 0.466136288 0.561907650 D 0.006618001 0.079649847 P -0.014193177 0.002158189  LARS: a variant of forward stepwise¶ To illustrate the "dangers" of trusting the above$p$-values, I will use the lars package which we will see later. 
This algorithm adds variables to the model one at a time, as does forward stepwise. It can be viewed as a modification of forward stepwise, sometimes called a more "democratic" version of forward stepwise.

In [16]:
%%R
library(lars)
plot(lars(X, election.table$V, type='lar'))

Loaded lars 1.2


The path can be parameterized by a positive parameter called $\lambda$. The first value $\lambda_1$ is, up to a multiplicative factor, the largest correlation of any column of $X$ with $Y$. In this sense, the first step of LARS and forward stepwise are the same.

In [17]:
%%R
nsim = 5000
exactP = c()
naiveP = c()
for (i in 1:nsim) {
    Z = rnorm(nrow(X))
    L0 = lars(X, Z, type='lar', max.steps=3)
    lam1 = L0$lambda[1]
    lam2 = L0$lambda[2]
    naiveP = c(naiveP, 2 * (1 - pnorm(lam1)))
    exactP = c(exactP, (1 - pnorm(lam1)) / (1 - pnorm(lam2)))
}


In [18]:
%%R
plot(ecdf(naiveP), lwd=3, col='red')
abline(0,1, lwd=2, lty=2)


What is the type I error of the first test?

In [19]:
%%R
ecdf(naiveP)(0.05)

[1] 0.2296


It turns out that it is possible to modify the usual $t$-statistic so that we still get valid tests after the first step. This is what I called exactP above.

In [20]:
%%R
plot(ecdf(exactP), lwd=3, col='blue')
abline(0,1, lwd=2, lty=2)


An exciting area of research¶

This exactP was only discovered recently!

For at least 10 years, I have been saying things along the lines of

Inference after model selection is basically out the window.
Forget all we taught you about t and F distributions as it
is no longer true...



It turns out that inference after selection is possible, and it doesn't force us to throw away all of our tools for inference.

But, it is a little more complicated to describe.