In a given regression situation, there are often many choices to be made. Recall our usual setup $$ Y_{n \times 1} = X_{n \times p} \beta_{p \times 1} + \epsilon_{n \times 1}. $$

Any *subset $A \subset \{1, \dots, p\}$* yields a new regression model
$$
{\cal M}(A): Y_{n \times 1} = X[,A] \beta[A] + \epsilon_{n \times 1}
$$
by setting $\beta[A^c]=0$.

**Model selection** is, roughly speaking, how to choose $A$ among the
$2^p$ possible choices.

Here is a dataset from the book that we will use to explore model selection.

Variable | Description |
---|---|

$V$ | votes for a presidential candidate |

$I$ | are they incumbent? |

$D$ | Democrat or Republican incumbent? |

$W$ | wartime election? |

$G$ | GDP growth rate in election year |

$P$ | (absolute) GDP deflator growth rate |

$N$ | number of quarters in which GDP growth rate $> 3.2\%$ |

In [2]:

```
%%R
url = 'http://stats191.stanford.edu/data/election.table'
election.table = read.table(url, header=T)
pairs(election.table[,2:ncol(election.table)], cex.labels=3, pch=23,
bg='orange', cex=2)
```

- When we have many predictors (with many possible interactions), it can be difficult to find a good model.
- Which main effects do we include?
- Which interactions do we include?
- Model selection procedures try to
*simplify / automate*this task. - Election data has $2^6=64$ different models with just main effects!

This is generally an "unsolved" problem in statistics: there are no magic procedures to get you the "best model."

In some sense, model selection is "data mining."

Data miners / machine learners often work with very many predictors.

Our model selection problem is generally at a much smaller scale than "data mining" problems.

Still, it is a hard problem.

**Inference after selection is full of pitfalls!**

- Suppose we fit a a model $F: \quad Y_{n \times 1} = X_{n \times p} \beta_{p \times 1} + \varepsilon_{n \times 1}$ with predictors ${ X}_1, \dots, { X}_p$.
- In reality, some of the $\beta$â€™s may be zero. Letâ€™s suppose that $\beta_{j+1}= \dots= \beta_{p}=0$.
- Then, any model that includes $\beta_0, \dots, \beta_j$ is
*correct*: which model gives the*best*estimates of $\beta_0, \dots, \beta_j$? - Principle of
*parsimony*(i.e. Occamâ€™s razor) says that the model with*only*${X}_1, \dots, {X}_j$ is "best".

- For simplicity, letâ€™s assume that $j=1$ so there is only one coefficient to estimate.
- Then, because each model gives an
*unbiased*estimate of $\beta_1$ we can compare models based on $\text{Var}(\widehat{\beta}_1).$ - The best model, in terms of this variance, is the one containing only ${ X}_1$.
- What if we didnâ€™t know that only $\beta_1$ was non-zero (which we donâ€™t know in general)?
- In this situation, we must choose a set of variables.

- To "implement" a model selection procedure, we first need a criterion or benchmark to compare two models.
- Given a criterion, we also need a search strategy.
- With a limited number of predictors, it is possible to search all possible models (
`leaps`

in`R`

).

Possible criteria

- $R^2$: not a good criterion. Always increase with model size $\implies$ "optimum" is to take the biggest model.
- Adjusted $R^2$: better. It "penalized" bigger models. Follows principle of parsimony / Occamâ€™s razor.
- Mallowâ€™s $C_p$ â€“ attempts to estimate a modelâ€™s predictive power, i.e. the power to predict a new observation.

Leaps takes a design matrix as argument: throw away the intercept column or leaps will complain.

In [3]:

```
%%R
election.lm = lm(V ~ I + D + W +G:I + P + N, election.table)
election.lm
```

In [4]:

```
%%R
X = model.matrix(election.lm)[,-1]
library(leaps)
election.leaps = leaps(X, election.table$V, nbest=3, method='r2')
best.model.r2 = election.leaps$which[which((election.leaps$r2 == max(election.leaps$r2))),]
best.model.r2
```

In [5]:

```
%%R
plot(election.leaps$size, election.leaps$r2, pch=23, bg='orange', cex=2)
```

As we add more and more variables to the model â€“ even random ones, $R^2$ will increase to 1.

Adjusted $R^2$ tries to take this into account by replacing sums of squares by

*mean squares*$$R^2_a = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - \frac{MSE}{MST}.$$

In [6]:

```
%%R
election.leaps = leaps(X, election.table$V, nbest=3, method='adjr2')
plot(election.leaps$size, election.leaps$adjr2, pch=23, bg='orange', cex=2)
best.model.adjr2 = election.leaps$which[which((election.leaps$adjr2 == max(election.leaps$adjr2))),]
best.model.adjr2
```

- $C_p({\cal M}) = \frac{SSE({\cal M})}{\widehat{\sigma}^2} + 2 \cdot p({\cal M}) - n.$
- $\widehat{\sigma}^2=SSE(F)/df_F$ is the "best" estimate of $\sigma^2$ we have (use the fullest model), i.e. in the election data it uses all 6 main effects.
- $SSE({\cal M})$ is the $SSE$ of the model ${\cal M}$.
- $p({\cal M})$ is the number of predictors in ${\cal M}$.
- This is an estimate of the expected mean-squared error of $\widehat{Y}({\cal M})$, it takes
*bias*and*variance*into account.

In [7]:

```
%%R
election.leaps = leaps(X, election.table$V, nbest=3, method='Cp')
plot(election.leaps$size, election.leaps$Cp, pch=23, bg='orange', cex=2)
best.model.Cp = election.leaps$which[which((election.leaps$Cp == min(election.leaps$Cp))),]
best.model.Cp
```

Given a criterion, we now have to decide how we are going to search through the possible models.

"Best subset": search all possible models and take the one with highest $R^2_a$ or lowest $C_p$ leaps. Such searches are typically feasible only up to $p=30$ or $40$.

Stepwise (forward, backward or both): useful when the number of predictors is large. Choose an initial model and be "greedy".

"Greedy" means always take the biggest jump (up or down) in your selected criterion.

`R`

- "Best subset": use the function
`leaps`

. Works only for multiple linear regression models. - Stepwise: use the function step. Works for any model with Akaike Information Criterion (AIC). In multiple linear regression, AIC is (almost) a linear function of $C_p$.

- Akaike (AIC) defined as $$AIC({\cal M}) = - 2 \log L({\cal M}) + 2 p({\cal M})$$ where $L({\cal M})$ is the maximized likelihood of the model.
- Bayes (BIC) defined as $$BIC({\cal M}) = - 2 \log L({\cal M}) + \log n \cdot p({\cal M})$$
- Strategy can be used for whenever we have a likelihood, so this generalizes to many statistical models.

- In linear regression with unknown $\sigma^2$ $-2 \log L({\cal M}) = n \log(2\pi \widehat{\sigma}^2_{MLE}) + n$ where $\widehat{\sigma}^2_{MLE} = \frac{1}{n} SSE(\widehat{\beta})$
- In linear regression with known $\sigma^2$ $-2 \log L({\cal M}) = n \log(2\pi \sigma^2) + \frac{1}{\sigma^2} SSE(\widehat{\beta})$ so AIC is very much like Mallowâ€™s $C_p$.

In [8]:

```
%%R
n = nrow(X)
p = 7 + 1
c(n * log(2*pi*sum(resid(election.lm)^2)/n) + n + 2*p, AIC(election.lm))
```

BIC will always choose a model as small or smaller than AIC (if using the same search direction).

As our sample size grows, under some assumptions, it can be shown that

- AIC will (asymptotically) always choose a model that contains the true model, i.e. it wonâ€™t leave any variables out.
- BIC will (asymptotically) choose exactly the right model.

Let's take a look at `step`

in action. Probably the simplest
strategy is *forward stepwise* which tries to add one variable at a time,
as long as it can find a resulting model whose AIC is better than
its current position.

When it can make no further additions, it terminates.

In [9]:

```
%%R
election.step.forward = step(lm(V ~ 1, election.table),
list(upper = ~ I + D + W + G:I + P + N),
direction='forward')
summary(election.step.forward)
```

We notice that although the *full* model we gave it had the interaction `I:G`

, the function `step`

never tried to use it. This is
due to some rules implemented in `step`

that do not include an interaction unless both main effects are already in the model. In this case, because neither $I$ nor $G$ were added, the interaction was never considered.

In the `leaps`

example, we gave the function the design matrix
and it did not have to consider interactions: they were already encoded in the design matrix.

The only difference between AIC and BIC is the price paid
per variable. This is the argument `k`

to `step`

. By default `k=2`

and for BIC
we set `k=log(n)`

. If we set `k=0`

it will always add variables.

In [10]:

```
%%R
election.step.forward.BIC = step(lm(V ~ 1, election.table),
list(upper = ~ I + D + W +G:I + P + N),
direction='forward', k=log(nrow(X)))
summary(election.step.forward)
```

Just for fun, let's consider backwards stepwise. This starts at a full model and tries to delete variables.

There is also a `direction="both"`

option.

In [11]:

```
%%R
election.step.backward = step(election.lm, direction='backward')
summary(election.step.backward)
```

Yet another model selection criterion is $K$-fold cross-validation.

- Fix a model ${\cal M}$. Break data set into $K$ approximately equal sized groups $(G_1, \dots, G_K)$.
- For (i in 1:K) Use all groups except $G_i$ to fit model, predict outcome in group $G_i$ based on this model $\widehat{Y}_{j,{\cal M}, G_i}, j \in G_i$.
- Similar to what we saw in Cook's distance which is $n$-fold cross validation.
- Estimate $CV({\cal M}) = \frac{1}{n}\sum_{i=1}^K \sum_{j \in G_i} (Y_j - \widehat{Y}_{j,{\cal M},G_i})^2.$

- It is a general principle that can be used in other situations to "choose parameters."
- Pros (partial list): "objective" measure of a model.
- Cons (partial list): inference is
*usually*"out the window" (also true for other model selection procedures). - If goal is not really inference about certain specific parameters, it is a reasonable way to compare models.

Let's plot our $C_p$ versus the $CV$ score. Keep in mind that there is additional randomness in the $CV$ score due to the random assignments to groups. Below, I have set the random seed so the results stay the same each time I run this chunk of code.

In [12]:

```
%%R
library(boot)
set.seed(0)
election.leaps = leaps(X, election.table$V, nbest=3, method='Cp')
V = election.table$V
election.leaps$cv = 0 * election.leaps$Cp
for (i in 1:nrow(election.leaps$which)) {
subset = c(1:ncol(X))[election.leaps$which[i,]]
if (length(subset) > 1) {
Xw = X[,subset]
wlm = glm(V ~ Xw)
election.leaps$CV[i] = cv.glm(model.frame(wlm), wlm, K=5)$delta[1]
}
else {
Xw = X[,subset[1]]
wlm = glm(V ~ Xw)
election.leaps$CV[i] = cv.glm(model.frame(wlm), wlm, K=5)$delta[1]
}
}
```

In [13]:

```
%%R
plot(election.leaps$Cp, election.leaps$CV, pch=23, bg='orange', cex=2)
```

In [14]:

```
%%R
plot(election.leaps$size, election.leaps$CV, pch=23, bg='orange', cex=2)
best.model.Cv = election.leaps$which[which((election.leaps$CV == min(election.leaps$CV))),]
best.model.Cv
```

The model selected depends on the criterion used.

Criterion | Model |
---|---|

$R^2$ | ~ $ I + D + W +G:I + P + N$ |

$R^2_a$ | ~ $ I + D + P + N$ |

$C_p$ | ~ $D+P+N$ |

AIC forward | ~ $D+P$ |

BIC forward | ~ $D+P$ |

AIC backward | ~ $I + D + N + I:G$ |

5-fold CV | ~ $ I+W$ |

**The selected model is random!**

- Many other "criteria" have been proposed.
- Some work well for some types of data, others for different data.
- Check diagnostics!
- These criteria (except cross-validation) are not "direct measures" of predictive power, though Mallowâ€™s $C_p$ is a step in this direction.
- $C_p$ measures the quality of a model based on both
*bias*and*variance*of the model. Why is this important? *Bias-variance*tradeoff is ubiquitous in statistics. More next class.

Each of the above criteria return a model. The `summary`

provides
$p$-values.

In [15]:

```
%%R
summary(election.step.forward)
```

We can also form confidence intervals. **But, can we trust these intervals or tests!**

In [16]:

```
%%R
confint(election.step.forward)
```

To illustrate the "dangers" of trusting the above $p$-values, I will
use the `lars`

package which we will see later.

This algorithm adds variables to a model one at a time as does
forward stepwise. The algorithm can be viewed as a modification of
forward stepwise, sometimes called a more `democratic`

version of forward stepwise.

In [17]:

```
%%R
library(lars)
plot(lars(X, V, type='lar'))
```

The path can be parameterized by a positive parameter called $\lambda$. The first value $\lambda_1$ is, up to some multiplicative factor, the largest correlation of any columns of $X$ with $Y$.

In this sense, the first step of `LARS`

and forward stepwise are the same.

In [18]:

```
%%R
nsim = 5000
exactP = c()
naiveP = c()
for (i in 1:nsim) {
Z = rnorm(nrow(X))
L0 = lars(X, Z, type='lar', max.steps=3)
lam1 = L0$lambda[1]
lam2 = L0$lambda[2]
naiveP = c(naiveP, 2 * (1 - pnorm(lam1)))
exactP = c(exactP, (1 - pnorm(lam1)) / (1 - pnorm(lam2)))
}
```

In [19]:

```
%%R
plot(ecdf(naiveP), lwd=3, col='red')
abline(0,1, lwd=2, lty=2)
```

What is the type I error of the first test?

In [20]:

```
%%R
ecdf(naiveP)(0.05)
```

`exactP`

above.

In [21]:

```
%%R
plot(ecdf(exactP), lwd=3, col='blue')
plot(ecdf(naiveP), lwd=3, col='red', add=TRUE)
abline(0,1, lwd=2, lty=2)
```

This `exactP`

was only discovered recently!

For at least 10 years, I have been saying things along the lines of

```
Inference after model selection is basically out the window.
Forget all we taught you about t and F distributions as it
is no longer true...
```

It turns out that inference after selection is possible, and it doesn't force us to throw away all of our tools for inference.

*But, it is a little more complicated to describe.*

In []:

```
```