• One goal of a regression analysis is to "build" a model that predicts well.

• This is slightly different from the goal of making inferences about $\beta$ that we've focused on so far.

• What does "predict well" mean? Assuming $Y_{new}$ is drawn independently of the data used to fit ${\cal M}$, \begin{aligned} MSE_{pop}({{\cal M}}) &= {\mathbb{E}}\left((Y_{new} - \widehat{Y}_{new,{\cal M}})^2\right) \\ &= {\text{Var}}(Y_{new}) + {\text{Var}}(\widehat{Y}_{new,{\cal M}}) + \\ & \qquad \quad \text{Bias}(\widehat{Y}_{new,{\cal M}})^2. \end{aligned}

• With $\sigma^2$ known, $C_p({\cal M})$ is an unbiased estimator of $MSE_{pop}({\cal M})$.
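The decomposition above can be checked by simulation. Below is a minimal sketch (not from the notes) using a one-sample model with $\widehat{Y} = \alpha \bar{Y}$; the choice $\alpha = 0.7$ is arbitrary and purely illustrative:

```r
# Monte Carlo check of MSE_pop = Var(Y_new) + Var(Y_hat) + Bias(Y_hat)^2
# for the simple estimator Y_hat = alpha * mean(Y).
# alpha = 0.7 is an arbitrary illustrative choice, not from the notes.
set.seed(1)
n = 40; mu = 0.5; sigma = 2.5; alpha = 0.7
nsim = 200000
Ybar = rnorm(nsim, mean=mu, sd=sigma/sqrt(n))  # sampling dist. of mean(Y)
Ynew = rnorm(nsim, mean=mu, sd=sigma)          # independent new observation
mc = mean((Ynew - alpha * Ybar)^2)             # simulated MSE_pop
theory = sigma^2 + alpha^2 * sigma^2 / n + (1 - alpha)^2 * mu^2
c(mc, theory)  # the two agree up to Monte Carlo error
```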

Shrinkage estimators

1. Generate $Y_{40 \times 1} \sim N(\mu \cdot 1, 2.5^2 I_{40 \times 40})$, with $\mu=0.5$.
2. For $0 \leq \alpha \leq 1$, set $\hat{Y}(\alpha) = \alpha \bar{Y}.$
3. Compute $MSE(\hat{Y}(\alpha)) = \frac{1}{40}\sum_{i=1}^{40} (\hat{Y}(\alpha) - 0.5)^2.$
4. Repeat 500 times, plot the average of $MSE(\hat{Y}(\alpha))$.

For what value of $\alpha$ is $\hat{Y}(\alpha)$ unbiased?

Is this the best estimate of $\mu$ in terms of MSE?

In [13]:
%%R -h 800 -w 800
nsample = 40
ntrial = 500
mu = 0.5
sigma = 2.5

# average squared error of an estimate of the mean vector
MSE <- function(mu.hat, mu) {
  return(sum((mu.hat - mu)^2) / length(mu))
}

alpha <- seq(0.0, 1, length=20)
mse <- numeric(length(alpha))

for (i in 1:ntrial) {
  Z = rnorm(nsample) * sigma + mu
  for (j in 1:length(alpha)) {
    mse[j] = mse[j] + MSE(alpha[j] * mean(Z) * rep(1, nsample), mu * rep(1, nsample)) / ntrial
  }
}

plot(alpha, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
     xlab=expression(paste('Shrinkage parameter, ', alpha)),
     ylab=expression(paste('MSE(', alpha, ')')),
     cex.lab=1.5)


Shrinkage & Penalties

• Shrinkage can be thought of as "constrained" or "penalized" minimization.

• Constrained form: $$\text{minimize}_{\mu} \sum_{i=1}^n (Y_i - \mu)^2 \quad \text{subject to } \mu^2 \leq C$$

• Lagrange multiplier form: equivalent to $$\widehat{\mu}_{\lambda} = \text{argmin}_{\mu} \sum_{i=1}^n (Y_i - \mu)^2 + \lambda \cdot \mu^2$$ for some $\lambda=\lambda_C$.

• As we vary $\lambda$ we solve all versions of the constrained form.

Solving for $\widehat{\mu}_{\lambda}$

• Differentiating: $- 2 \sum_{i=1}^n (Y_i - \widehat{\mu}_{\lambda}) + 2 \lambda \widehat{\mu}_{\lambda} = 0$
• Solving $\widehat{\mu}_{\lambda} = \frac{\sum_{i=1}^n Y_i}{n + \lambda} = \frac{n}{n+\lambda} \overline{Y}.$
• As $\lambda \rightarrow 0$, $\widehat{\mu}_{\lambda} \rightarrow {\overline{Y}}.$
• As $\lambda \rightarrow \infty$, $\widehat{\mu}_{\lambda} \rightarrow 0.$

We see that $\widehat{\mu}_{\lambda} = \hat{Y}\left(\frac{n}{n+\lambda}\right).$

In other words, considering all penalized estimators traces out the MSE curve above.
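As a quick numerical check (a sketch, not from the notes), the closed form $\widehat{\mu}_{\lambda} = \frac{n}{n+\lambda}\overline{Y}$ can be compared against a direct minimization of the penalized criterion; $\lambda = 10$ is an arbitrary illustrative value:

```r
# Check mu_hat(lambda) = n/(n+lambda) * mean(Y) against direct numerical
# minimization of sum((Y - m)^2) + lambda * m^2.
set.seed(2)
n = 40; lambda = 10
Y = rnorm(n, mean=0.5, sd=2.5)
penalized = function(m) sum((Y - m)^2) + lambda * m^2
closed.form = n / (n + lambda) * mean(Y)
numeric.min = optimize(penalized, interval=c(-10, 10))$minimum
c(closed.form, numeric.min)  # agree to optimize()'s tolerance
```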

In [14]:
%%R -h 800 -w 800
# lambda corresponding to each alpha: alpha = n/(n+lambda), so lambda = n/alpha - n
# (alpha = 0 maps to lambda = Inf, which plot() drops with a warning)
lam = nsample / alpha - nsample
plot(lam, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
     xlab=expression(paste('Penalty parameter, ', lambda)),
     ylab=expression(paste('MSE(', lambda, ')')), cex.lab=1.5)


How much to shrink?

• In our one-sample example,
• \begin{aligned} MSE_{pop}(\alpha) &= {\text{Var}}( \alpha \bar{Y}) + \text{Bias}(\alpha \bar{Y})^2 \\ &= \frac{\alpha^2 \sigma^2}{n} + \mu^2 (1 - \alpha)^2 \end{aligned}
• Differentiating and solving: \begin{aligned} 0 &= -2 \mu^2(1 - \alpha^*) + 2 \frac{\alpha^* \sigma^2}{n} \\ \alpha^* & = \frac{\mu^2}{\mu^2+\sigma^2/n} \\ &= \frac{0.5^2}{0.5^2+2.5^2/40} \approx 0.62 \end{aligned}

We see that the optimal $\alpha$ depends on the unknown signal-to-noise ratio $SNR=\mu/(\sigma/\sqrt{n})$: in fact $\alpha^* = SNR^2/(1+SNR^2)$.

In practice we might hope to estimate MSE with cross-validation.
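As a quick check of the formula (a sketch, using the simulation values from above):

```r
# Optimal shrinkage alpha* = mu^2 / (mu^2 + sigma^2/n) for mu=0.5,
# sigma=2.5, n=40, and the equivalent SNR form alpha* = SNR^2/(1+SNR^2).
mu = 0.5; sigma = 2.5; n = 40
snr = mu / (sigma / sqrt(n))
alpha.star = mu^2 / (mu^2 + sigma^2 / n)
c(alpha.star, snr^2 / (1 + snr^2))  # two equivalent forms, both ~0.615
```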

Let's see how our theoretical choice matches the random MSE.

In [3]:
%%R -h 800 -w 800
plot(alpha, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
xlab=expression(paste('Shrinkage parameter ', alpha)),
ylab=expression(paste('MSE(', alpha, ')')), cex.lab=1.5)
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)


Penalties & Priors

• Minimizing $\sum_{i=1}^n (Y_i - \mu)^2 + \lambda \mu^2$ is similar to computing the "MLE" of $\mu$ if the likelihood were proportional to $$\exp \left(-\frac{1}{2\sigma^2}\left( \|Y-\mu\|^2_2 + \lambda \mu^2\right) \right).$$

• If $\lambda=m$, an integer, then $\widehat{\mu}_{\lambda}$ is the sample mean of $(Y_1, \dots, Y_n,0 ,\dots, 0) \in \mathbb{R}^{n+m}$.

• This is equivalent to adding some data with $Y=0$. To a Bayesian, this extra data is a prior distribution and we are computing the so-called MAP or posterior mode.
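The augmented-data interpretation above is easy to verify directly (a sketch; the values of $n$ and $m$ are illustrative):

```r
# With lambda = m zeros appended, the ordinary sample mean of the augmented
# data equals the ridge-shrunken mean n/(n+m) * mean(Y).
set.seed(3)
n = 40; m = 10
Y = rnorm(n, mean=0.5, sd=2.5)
augmented = mean(c(Y, rep(0, m)))
shrunken = n / (n + m) * mean(Y)
c(augmented, shrunken)  # identical
```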

AIC as penalized regression

• Model selection with $C_p$ (or AIC with $\sigma^2$ assumed known) is a version of penalized regression.

• The best subsets version of AIC (which is not exactly what step computes) is $$\hat{\beta}_{AIC} = \text{argmin}_{\beta} \frac{1}{\sigma^2}\|Y-X\beta\|^2_2 + 2 \|\beta\|_0$$ where $$\|\beta\|_0 = \#\left\{j : \beta_j \neq 0 \right\}$$ is called the $\ell_0$ norm.

• The $\ell_0$ penalty can be thought of as a measure of complexity of the model. Most penalties are similar versions of complexity.
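The best subsets criterion can be computed by exhaustive search when $p$ is small. A sketch (the data, true coefficients, and $\sigma$ below are illustrative, not from the notes):

```r
# Exhaustive best-subsets search minimizing SSE/sigma^2 + 2 * |support|,
# i.e. the l0-penalized criterion above, with sigma^2 treated as known.
set.seed(4)
n = 50; p = 4; sigma = 1
X = matrix(rnorm(n * p), n, p)
beta.true = c(2, 0, 0, 1.5)              # illustrative sparse truth
Y = as.vector(X %*% beta.true) + rnorm(n) * sigma
best = NULL
best.crit = sum(Y^2) / sigma^2           # start from the empty model
for (k in 1:p) {
  for (S in combn(p, k, simplify=FALSE)) {
    fit = lm(Y ~ X[, S, drop=FALSE] - 1)
    crit = sum(resid(fit)^2) / sigma^2 + 2 * k
    if (crit < best.crit) { best.crit = crit; best = S }
  }
}
best  # indices of the selected submodel
```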

Penalized regression in general

• Not all biased models are better; we need a way to find "good" biased models.

• Inference ($F$, $\chi^2$ tests, etc.) is not quite exact for biased models. Though, there has been some recent work to address the issue of post-selection inference.

• Heuristically, "large $\beta$" (measured by some norm) is interpreted as "complex model". The goal is really to penalize "complex" models, i.e. Occam's razor.

• If truth really is complex, this may not work! (But, it will then be hard to build a good model anyways ... (statistical lore))

Ridge regression

• Assume that the columns $(X_j)_{1 \leq j \leq p}$ have mean zero and length 1, and that $Y$ has mean zero.

• This is called the standardized model.

• The ridge estimator is $$\hat{\beta}_{\lambda} = \text{argmin}_{\beta} \|Y-X\beta\|^2_2 + \lambda \|\beta\|^2_2$$

• Corresponds (through a Lagrange multiplier) to a quadratic constraint on the $\beta_j$'s.

• This is the natural generalization of the penalized version of our shrinkage estimator.
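A sketch of putting data into the standardized form (simulated data; note that `scale()` alone divides by the sample standard deviation, giving columns of length $\sqrt{n-1}$, so the unit-length rescaling is done explicitly):

```r
# Standardize: center Y, then center each column of X and scale it to unit
# Euclidean length, as assumed by the standardized model.
set.seed(5)
n = 30; p = 3
X = matrix(rnorm(n * p, mean=2, sd=3), n, p)
Y = rnorm(n, mean=1)
Yc = Y - mean(Y)                              # centered response
Xc = sweep(X, 2, colMeans(X))                 # center columns
Xs = sweep(Xc, 2, sqrt(colSums(Xc^2)), "/")   # unit-length columns
round(c(colMeans(Xs), colSums(Xs^2)), 10)     # means 0, lengths 1
```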

Solving the normal equations

• Normal equations $\frac{\partial}{\partial {\beta_{l}}} SSE_{\lambda}({\beta_{}}) = -2 \langle Y - X{\beta_{}}, X_l \rangle + 2 \lambda {\beta_{l}}$
• $-2 \langle Y - X{\widehat{\beta}_{\lambda}}, X_l \rangle + 2 \lambda {\widehat{\beta}_{l,\lambda}} = 0, \qquad 1 \leq l \leq p$
• In matrix form $-X^TY + (X^TX + \lambda I) {\widehat{\beta}_{\lambda}} = 0$
• Or ${\widehat{\beta}_{\lambda}} = (X^TX + \lambda I)^{-1} X^TY.$
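The closed form can be verified by checking that it satisfies the ridge normal equations, i.e. that the gradient of $SSE_{\lambda}$ vanishes at $\widehat{\beta}_{\lambda}$. A sketch on simulated data ($\lambda = 2$ is an illustrative value):

```r
# Verify that (X^T X + lambda I)^{-1} X^T Y zeroes the gradient of the
# ridge criterion ||Y - X beta||^2 + lambda ||beta||^2.
set.seed(6)
n = 30; p = 3; lambda = 2
X = matrix(rnorm(n * p), n, p)
Y = rnorm(n)
beta.ridge = solve(t(X) %*% X + lambda * diag(p), t(X) %*% Y)
gradient = -2 * t(X) %*% (Y - X %*% beta.ridge) + 2 * lambda * beta.ridge
max(abs(gradient))  # numerically zero
```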

Ridge regression

In [4]:
%%R -h 800 -w 800
library(lars)
data(diabetes)
library(MASS)
diabetes.ridge <- lm.ridge(diabetes$y ~ diabetes$x, lambda=seq(0,10,0.05))
plot(diabetes.ridge, cex.lab=3, lwd=3)

Loaded lars 1.1



Choosing $\lambda$

• If we knew $MSE$ as a function of $\lambda$ then we would simply choose the $\lambda$ that minimizes $MSE$.
• To do this, we need to estimate $MSE$.
• A popular method is cross-validation, computed as a function of $\lambda$: break the data up into smaller groups and use part of the data to predict the rest.
• We saw this in diagnostics (Cook's distance measured the fit with and without each point in the data set) and model selection.

$K$-fold cross-validation for penalized model

• Fix a model (i.e. fix $\lambda$). Break data set into $K$ approximately equal sized groups $(G_1, \dots, G_K)$.
• For each $i$ in $1, \dots, K$: use all groups except $G_i$ to fit the model, then predict the outcomes in group $G_i$ from that fit, $\widehat{Y}_{j(i),\lambda}, j \in G_i$.
• Estimate $CV(\lambda) = \frac{1}{n}\sum_{i=1}^K \sum_{j \in G_i} (Y_j - \widehat{Y}_{j(i),\lambda})^2.$

Here is a function to estimate the CV for our one parameter example. In practice, we only have one sample to form the CV curve. In this example below, I will compute the average CV error for 500 trials to show that it is roughly comparable in shape to the MSE curve.

In [15]:
%%R
K = 10
CV = function(Z, alpha) {
  cve = numeric(K)
  l = length(Z)
  for (i in 1:K) {
    g = ((i-1)*l/K + 1):(i*l/K)       # indices of the i-th fold
    mu.hat = mean(Z[-g]) * alpha      # fit on the remaining K-1 folds
    cve[i] = sum((Z[g] - mu.hat)^2)   # squared prediction error on fold i
  }
  return(c(sum(cve), sd(cve)))
}


Let's see how the parameter chosen by 10-fold CV compares to our theoretical choice.

In [17]:
%%R -h 800 -w 800
alpha = seq(0.0, 1, length=20)
avg.cv = numeric(length(alpha))

for (i in 1:ntrial) {
  Z = rnorm(nsample) * sigma + mu
  for (j in 1:length(alpha)) {
    current_cv = CV(Z, alpha[j])
    avg.cv[j] = avg.cv[j] + current_cv[1] / ntrial
  }
}

avg.cv = avg.cv / nsample

plot(alpha, avg.cv, type='l', lwd=2, col='green',
     xlab='Shrinkage parameter, alpha', ylab='Average CV(alpha)')
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)
abline(v=min(alpha[avg.cv == min(avg.cv)]), col='red', lty=2)


Let's see what one curve looks like on our smallish sample.

In [7]:
%%R
set.seed(0)
cv = numeric(length(alpha))
cv.sd = numeric(length(alpha))
Z = rnorm(nsample) * sigma + mu
for (j in 1:length(alpha)) {
  current_cv = CV(Z, alpha[j])
  cv[j] = current_cv[1]
  cv.sd[j] = current_cv[2]
}

plot(alpha, cv, type='l', lwd=2, col='green',
     xlab='Shrinkage parameter, alpha', ylab='CV(alpha)')
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)
abline(v=min(alpha[cv == min(cv)]), col='red', lty=2)


$K$-fold cross-validation (continued)

• It is a general principle that can be used in other situations, not just for Ridge.
• Pros (partial list): "objective" measure of a model.
• Cons (partial list): inference is, strictly speaking, "out the window" (also true for other model selection procedures in theory).
• If goal is not really inference about certain specific parameters, it is a reasonable way to compare models.

Generalized Cross Validation

• A computational shortcut for $n$-fold cross-validation (also known as leave-one out cross-validation).
• Let $S_{\lambda} = X(X^TX + \lambda I)^{-1} X^T$ be the matrix producing the ridge fitted values, $\widehat{Y} = S_{\lambda} Y$.
• Then $GCV(\lambda) = \frac{n \, \|Y - S_{\lambda}Y\|^2}{(n - {\text{Tr}}(S_{\lambda}))^2}.$
• The quantity ${\text{Tr}}(S_{\lambda})$ can be thought of as the effective degrees of freedom for this choice of $\lambda$.
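Both quantities are easy to compute directly. A sketch on simulated data, using the standard GCV formula $n\|Y - S_{\lambda}Y\|^2 / (n - \text{Tr}(S_{\lambda}))^2$ (the $\lambda$ values below are illustrative):

```r
# Effective degrees of freedom Tr(S_lambda) and GCV, computed from scratch.
set.seed(7)
n = 50; p = 5
X = matrix(rnorm(n * p), n, p)
Y = rnorm(n)
gcv = function(lam) {
  S = X %*% solve(t(X) %*% X + lam * diag(p), t(X))  # ridge hat matrix
  df = sum(diag(S))                                  # effective df
  n * sum((Y - S %*% Y)^2) / (n - df)^2
}
# at lambda = 0 the effective df equals p, the number of columns
df0 = sum(diag(X %*% solve(t(X) %*% X, t(X))))
c(df0, gcv(0), gcv(5))
```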

Ridge regression

In [8]:
%%R
plot(diabetes.ridge$lambda, diabetes.ridge$GCV, xlab='Lambda', ylab='GCV', type='l', lwd=3, col='orange')
select(diabetes.ridge)

modified HKB estimator is 5.462251
modified L-W estimator is 7.641667
smallest value of GCV  at 3.25



LASSO

• Another popular penalized regression technique.
• Use the standardized model.
• The LASSO estimate is $$\hat{\beta}_{\lambda} = \text{argmin}_{\beta} \|Y-X\beta\|^2_2 + \lambda \|\beta\|_1$$ where $$\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$$ is the $\ell_1$ norm.

• Corresponds (through a Lagrange multiplier) to an $\ell_1$ constraint on the $\beta_j$'s.

• In theory and practice, it works well when many $\beta_j$'s are 0, and unlike ridge it gives "sparse" solutions.

• It is a (computable) approximation to the best subsets AIC model.

• It is computable because the minimization problem is a convex problem.

Why do we get sparse solutions with the LASSO?
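One way to see it: with a single coordinate (or an orthogonal design) the LASSO problem separates, and each coordinate solves $\text{minimize}_{\beta} \, (z - \beta)^2 + \lambda |\beta|$, whose solution is the soft-thresholding rule $\hat{\beta} = \text{sign}(z)\max(|z| - \lambda/2, 0)$: any coordinate with $|z| \leq \lambda/2$ is set exactly to zero. A minimal sketch (values of $z$ and $\lambda$ are illustrative):

```r
# Soft-thresholding: closed-form 1-D LASSO solution, checked against a
# grid minimization of the penalized criterion.
soft = function(z, lam) sign(z) * pmax(abs(z) - lam / 2, 0)
z = 0.8; lam = 1
betas = seq(-2, 2, by=0.0001)
grid.min = betas[which.min((z - betas)^2 + lam * abs(betas))]
c(soft(z, lam), grid.min)   # both ~0.3
soft(0.4, lam)              # |z| <= lam/2: exactly zero, hence sparsity
```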

LASSO

In [9]:
%%R -h 800 -w 800
library(lars)
data(diabetes)
diabetes.lasso = lars(diabetes$x, diabetes$y, type='lasso')
plot(diabetes.lasso)


Cross-validation for the LASSO

The lars package has a built in function to estimate CV.

In [10]:
%%R
cv.lars(diabetes$x, diabetes$y, K=10, type='lasso')


Elastic Net

• Mix between LASSO and ridge regression.
• The ENET estimator is $$\hat{\beta}_{\lambda_1, \lambda_2} = \text{argmin}_{\beta} \|Y-X\beta\|^2_2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|^2_2.$$
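In the one-dimensional case (or for an orthogonal design) the ENET solution combines the LASSO's soft-thresholding with ridge-style proportional shrinkage. A sketch, not from the notes (the values of $z$, $\lambda_1$, $\lambda_2$ are illustrative):

```r
# 1-D elastic net: minimize (z - b)^2 + lambda1*|b| + lambda2*b^2.
# Closed form: soft-threshold at lambda1/2, then shrink by 1/(1 + lambda2).
enet1d = function(z, lam1, lam2) {
  sign(z) * pmax(abs(z) - lam1 / 2, 0) / (1 + lam2)
}
# check against a grid minimization of the criterion
z = 1.2; lam1 = 0.6; lam2 = 0.5
betas = seq(-2, 2, by=0.0001)
grid.min = betas[which.min((z - betas)^2 + lam1 * abs(betas) + lam2 * betas^2)]
c(enet1d(z, lam1, lam2), grid.min)  # both 0.6
```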

The beginnings of inference after LASSO

In [11]:
%%R -h 800 -w 800
library(genlasso)
GL = genlasso(diabetes$y, diabetes$x, diag(rep(1, ncol(diabetes$x))))
plot(GL, cex.lab=1.5)

Loading required package: Matrix