One goal of a regression analysis is to "build" a model that predicts well.

This is slightly different than the goal of making inferences about $\beta$ that we've focused on so far.

What does "predict well" mean? $$ \begin{aligned} MSE_{pop}({{\cal M}}) &= {\mathbb{E}}\left((Y_{new} - \widehat{Y}_{new,{\cal M}})^2\right) \\ &= {\text{Var}}(Y_{new}) + {\text{Var}}(\widehat{Y}_{new,{\cal M}}) + \\ & \qquad \quad \text{Bias}(\widehat{Y}_{new,{\cal M}})^2. \end{aligned}$$

With $\sigma^2$ known, the quantity $C_p({\cal M})$ is an unbiased estimator of this quantity.

- Generate $Y_{100 \times 1} \sim N(\mu \cdot 1, 5^2 I_{100 \times 100})$, with $\mu=0.5$.
- For $0 \leq \alpha \leq 1$, set $\hat{Y}(\alpha) = \alpha \bar{Y}.$
- Compute $MSE(\hat{Y}(\alpha)) = \sum_{i=1}^{100} (\hat{Y}_{\alpha} - 0.5)^2$
- Repeat 500 times, plot average of $MSE(\hat{Y}(\alpha))$.

**For what value of $\alpha$ is $\hat{Y}(\alpha)$ unbiased?**

**Is this the best estimate of $\mu$ in terms of MSE?**

In [1]:

```
%%R -h 600 -w 600
par(cex.lab=5)
nsample = 100
ntrial = 500
mu = 0.5
sigma = 5
MSE <- function(mu.hat, mu) {
return(sum((mu.hat - mu)^2) / length(mu))
}
alpha <- seq(0.0,1,length=20)
mse <- numeric(length(alpha))
for (i in 1:ntrial) {
Z = rnorm(nsample) * sigma + mu
for (j in 1:length(alpha)) {
mse[j] = mse[j] + MSE(alpha[j] * mean(Z) * rep(1, nsample), mu * rep(1, nsample)) / ntrial
}
}
plot(alpha, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
xlab=expression(paste('Penalty parameter,', alpha)),
ylab=expression(paste('MSE(', alpha, ')')),
cex.lab=1.5)
```

Shrinkage can be thought of as "constrained" or "penalized" minimization.

Constrained form: $$\text{minimize}_{\mu} \sum_{i=1}^n (Y_i - \mu)^2 \quad \text{subject to $\mu^2 \leq C$}$$

Lagrange multiplier form: equivalent to $$\widehat{\mu}_{\lambda} = \text{argmin}_{\mu} (Y_i - \mu)^2 + \lambda \cdot \mu^2$$ for some $\lambda=\lambda_C$.

As we vary $\lambda$ we solve all versions of the constrained form.

- Differentiating: $- 2 \sum_{i=1}^n (Y_i - \widehat{\mu}_{\lambda}) + 2 \lambda \widehat{\mu}_{\lambda} = 0$
- Solving $\widehat{\mu}_{\lambda} = \frac{\sum_{i=1}^n Y_i}{n + \lambda} = \frac{n}{n+\lambda} \overline{Y}.$
- As $\lambda \rightarrow 0$, $\widehat{\mu}_{\lambda} \rightarrow {\overline{Y}}.$
- As $\lambda \rightarrow \infty$ $\widehat{\mu}_{\lambda} \rightarrow 0.$

** We see that $\widehat{\mu}_{\lambda} = \bar{Y} \cdot \left(\frac{n}{n+\lambda}\right).$ **

** In other words, considering all penalized estimators traces out the MSE curve above.**

In [2]:

```
%%R -h 600 -w 600
lam = nsample / alpha - nsample
plot(lam, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
xlab=expression(paste('Penalty parameter,', lambda)),
ylab=expression(paste('MSE(', lambda, ')')))
```

- In our one-sample example,
- $$\begin{aligned} MSE_{pop}(\alpha) &= {\text{Var}}( \alpha \bar{Y}) + \text{Bias}(\alpha \bar{Y})^2 \\ &= \frac{\alpha^2 \sigma^2}{n} + \mu^2 (1 - \alpha)^2 \end{aligned}$$
- Differentiating and solving: $$\begin{aligned} 0 &= -2 \mu^2(1 - \lambda^*) + 2 \frac{\lambda^* \sigma^2}{n} \\ \lambda^* & = \frac{\mu^2}{\mu^2+\sigma^2/n} \\ &= \frac{0.5^2}{0.5^2+1/10} = 0.71 \end{aligned}$$

** We see that the optimal $\alpha$ depends on the unknown $SNR=\mu/(\sigma/\sqrt{n})$.**

** In practice we might hope to estimate MSE with cross-validation.**

Let's see how our theoretical choice matches the MSE on our 100 sample.

In [3]:

```
%%R -h 600 -w 600
plot(alpha, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
xlab=expression(paste('Shrinkage parameter ', alpha)),
ylab=expression(paste('MSE(', alpha, ')')))
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)
```

Minimizing $\sum_{i=1}^n (Y_i - \mu)^2 + \lambda \mu^2$ is similar to computing "MLE" of $\mu$ if the likelihood was proportional to $$\exp \left(-\frac{1}{2\sigma^2}\left( \|Y-\mu\|^2_2 + \lambda \mu^2\right) \right).$$

If $\lambda=m$, an integer, then $\widehat{\mu}_{\lambda}$ is the sample mean of $(Y_1, \dots, Y_n,0 ,\dots, 0) \in \mathbb{R}^{n+m}$.

This is equivalent to adding some data with $Y=0$.

To a Bayesian, this extra data is a

*prior distribution*and we are computing the so-called*MAP*or posterior mode.

Model selection with $C_p$ (or AIC with $\sigma^2$ assumed known) is a version of penalized regression.

The best subsets version of AIC (which is not exactly equivalent to

*step*) $$ \hat{\beta}_{AIC} = \text{argmin}_{\beta} \frac{1}{\sigma^2}\|Y-X\beta\|^2_2 + 2 \|\beta\|_0 $$ where $$ \|\beta\|_0 = \#\left\{j : \beta_j \neq 0 \right\} $$ is called the $\ell_0$ norm.The $\ell_0$ penalty can be thought of as a measure of

*complexity*of the model. Most penalties are similar versions of*complexity*.

Not all biased models are better â€“ we need a way to find "good" biased models.

Inference ($F$, $\chi^2$ tests, etc) is not quite exact for biased models. Though, there has been some recent work to address the issue of post-selection inference, at least for some penalized regression problems.

Heuristically, "large $\beta$" (measured by some norm) is interpreted as "complex model". Goal is really to penalize "complex" models, i.e. Occamâ€™s razor.

- If truth really is complex, this may not work! (But, it will then be hard to build a good model anyways ... (statistical lore))

Assume that columns $(X_j)_{1 \leq j \leq p}$ have zero mean, and length 1 and $Y$ has zero mean.

This is called the

*standardized model*.The ridge estimator is $$ \hat{\beta}_{\lambda} = \text{argmin}_{\beta} = \|Y-X\beta\|^2_2 + \lambda \|\beta\|^2_2$$

Corresponds (through Lagrange multiplier) to a quadratic constraint on ${\beta_{}}$â€™s.

This is the natural generalization of the penalized version of our shrinkage estimator.

- Normal equations $$\frac{\partial}{\partial {\beta_{l}}} SSE_{\lambda}({\beta_{}}) = -2 \langle Y - X{\beta_{}}, X_l \rangle + 2 \lambda {\beta_{l}}$$
- $$-2 \langle Y - X{\widehat{\beta}_{\lambda}}, X_l \rangle + 2 \lambda {\widehat{\beta}_{l,\lambda}} = 0, \qquad 1 \leq l \leq p$$
- In matrix form $$-X^TY + (X^TX + \lambda I) {\widehat{\beta}_{\lambda}} = 0.$$
- Or $${\widehat{\beta}_{\lambda}} = (X^TX + \lambda I)^{-1} X^TY.$$

In [4]:

```
%%R -h 600 -w 600
library(lars)
data(diabetes)
library(MASS)
diabetes.ridge <- lm.ridge(diabetes$y ~ diabetes$x, lambda=seq(0,10,0.05))
plot(diabetes.ridge, lwd=3)
```

- If we knew $MSE$ as a function of $\lambda$ then we would simply choose the $\lambda$ that minimizes $MSE$.
- To do this, we need to estimate $MSE$.
- A popular method is cross-validation as a function of $\lambda$. Breaks the data up into smaller groups and uses part of the data to predict the rest.
- We saw this in diagnostics (Cookâ€™s distance measured the fit with and without each point in the data set) and model selection.

- Fix a model (i.e. fix $\lambda$). Break data set into $K$ approximately equal sized groups $(G_1, \dots, G_K)$.
- for (i in 1:K) Use all groups except $G_i$ to fit model, predict outcome in group $G_i$ based on this model $\widehat{Y}_{j(i),\lambda}, j \in G_i$.
- Estimate $CV(\lambda) = \frac{1}{n}\sum_{i=1}^K \sum_{j \in G_i} (Y_j - \widehat{Y}_{j(i),\lambda})^2.$

In [5]:

```
%%R
K = 5
CV = function(Z, alpha) {
cve = numeric(K)
l = length(Z)
for (i in 1:K) {
g = c(((i-1)*l/K+1):(i*l/K))
mu.hat = mean(Z[-g]) * alpha
cve[i] = sum((Z[g]-mu.hat)^2)
}
return(c(cve, sd(cve)))
}
```

Let's see how the parameter chosen by 5-fold CV compares to our theoretical choice.

In [6]:

```
%%R -h 600 -w 600
alpha = seq(0.0,1,length=20)
mse = numeric(length(alpha))
avg.cv = numeric(length(alpha))
for (i in 1:ntrial) {
Z = rnorm(nsample) * sigma + mu
for (j in 1:length(alpha)) {
current_cv = CV(Z, alpha[j])
avg.cv[j] = avg.cv[j] + current_cv[1] / ntrial
}
}
avg.cv = avg.cv / nsample
plot(alpha, avg.cv, type='l', lwd=2, col='green',
xlab='Shrinkage parameter, alpha', ylab='Average CV(alpha)')
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)
abline(v=min(alpha[avg.cv == min(avg.cv)]), col='red', lty=2)
```

The curve above shows what would happen if we could repeat this and averaeg over many samples.

Let's see what one curve looks like on our sample. This is the result we might get in practice on a given data set.

In [7]:

```
%%R
set.seed(0)
cv = numeric(length(alpha))
cv.sd = numeric(length(alpha))
Z = rnorm(nsample) * sigma + mu
for (j in 1:length(alpha)) {
current_cv = CV(Z, alpha[j])
cv[j] = current_cv[1]
cv.sd[j] = current_cv[2]
}
plot(alpha, cv, type='l', lwd=2, col='green',
xlab='Shrinkage parameter, alpha', ylab='CV(alpha)')
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)
abline(v=min(alpha[cv == min(cv)]), col='red', lty=2)
```

- A computational shortcut for $n$-fold cross-validation (also known as leave-one out cross-validation).
- Let $S_{\lambda} = S(X^TX + \lambda I)^{-1} X^T$ be the matrix in ridge regression that computes $\hat{Y}_{\lambda}$
- Then $GCV(\lambda) = \frac{\|Y - S_{\lambda}Y\|^2}{n - {\text{Tr}}(S_{\lambda})}.$
- The quantity ${\text{Tr}}(S_{\lambda})$ can be thought of as the
*effective degrees of freedom*for this choice of $\lambda$.

In [8]:

```
%%R -h 600 -w 600
par(cex.lab=2)
plot(diabetes.ridge$lambda, diabetes.ridge$GCV, xlab='Lambda', ylab='GCV', type='l', lwd=3, col='orange')
select(diabetes.ridge)
```

- Another popular penalized regression technique.
- Use the standardized model.
The LASSO estimate is $$ \hat{\beta}_{\lambda} = \text{argmin}_{\beta} \|Y-X\beta\|^2_2 + \lambda \|\beta\|_1$$ where $$ \|\beta\|_1 = \sum_{j=1}^p |\beta_j| $$ is the $\ell_1$ norm.

Corresponds (through Lagrange multiplier) to an $\ell^1$ constraint on ${\beta_{}}$â€™s.

In theory and practice, it works well when many ${\beta_{j}}$â€™s are 0 and gives "sparse" solutions unlike ridge.

It is a (computable) approximation to the best subsets AIC model.

It is computable because the minimization problem is a convex problem.

Why do we get sparse solutions with the LASSO?

In [9]:

```
%%R -h 600 -w 600
library(lars)
data(diabetes)
diabetes.lasso = lars(diabetes$x, diabetes$y, type='lasso')
plot(diabetes.lasso)
```

The `lars`

package has a built in function to estimate CV.

In [10]:

```
%%R -h 600 -w 600
par(cex.lab=2)
cv.lars(diabetes$x, diabetes$y, K=10, type='lasso')
```

- Mix between LASSO and ridge regression.
- The ENET estimator is $$ \hat{\beta}_{\lambda_1, \lambda_2} = \text{argmin}_{\beta} \|Y-X\beta\|^2_2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|^2_2.$$

In [11]:

```
%%R -h 600 -w 600
par(cex.lab=2)
library(genlasso)
GL = genlasso(diabetes$y, diabetes$x, diag(rep(1,ncol(diabetes$x))))
plot(GL)
```

The events $\lambda_i=\lambda_i(X,y)$ can be thought of as event "times".

They are random variables themselves, maybe we can say something about their distribution?

Suppose $\text{diag}(X^TX)=1$ (i.e. we rescale columns of design matrix to have constant length)

Then, under $H_0:\beta=0$ $$ \frac{1 - {\tt pnorm}(\lambda_1/\sigma)}{1 - {\tt pnorm}(\lambda_2/\sigma)} \sim \text{Unif}(0,1). $$

Can also describe what happens under other alternatives. Take a look at the paper.

Code coming!