Bias-variance tradeoff

  • One goal of a regression analysis is to "build" a model that predicts well.

  • This is slightly different than the goal of making inferences about $\beta$ that we've focused on so far.

  • What does "predict well" mean? $$\begin{aligned} MSE_{pop}({{\cal M}}) &= {\mathbb{E}}\left((Y_{new} - \widehat{Y}_{new,{\cal M}})^2\right) \\ &= {\text{Var}}(Y_{new}) + {\text{Var}}(\widehat{Y}_{new,{\cal M}}) + \\ & \qquad \quad \text{Bias}(\widehat{Y}_{new,{\cal M}})^2. \end{aligned}$$

  • With $\sigma^2$ known, the quantity $C_p({\cal M})$ is an unbiased estimator of this quantity.
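
As a concrete illustration (this sketch is my addition, not part of the original notes), the simulation below uses one common form of the criterion, $C_p = RSS + 2p\sigma^2$, and checks that with $\sigma^2$ known it is unbiased for the summed in-sample prediction error of a fixed (possibly biased) submodel; other scalings of $C_p$ differ only by constants.

# Simulation sketch (my addition): with sigma^2 known, RSS + 2*p*sigma^2 is an
# unbiased estimate of E sum_i (Y_new_i - Yhat_i)^2 for a fixed submodel.
set.seed(1)
n <- 50; sigma <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
mu.true <- 1 + 2 * x1 + 0.5 * x2          # the true mean uses both predictors
nsim <- 2000
cp <- pred.err <- numeric(nsim)
for (s in 1:nsim) {
  Y <- mu.true + sigma * rnorm(n)
  Y.new <- mu.true + sigma * rnorm(n)     # independent copy at the same design points
  fit <- lm(Y ~ x1)                       # a biased submodel with p = 2 coefficients
  cp[s] <- sum(resid(fit)^2) + 2 * 2 * sigma^2
  pred.err[s] <- sum((Y.new - fitted(fit))^2)
}
c(mean(cp), mean(pred.err))               # the two averages should be close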

Shrinkage estimators

  1. Generate $Y_{40 \times 1} \sim N(\mu \cdot 1, 2.5^2 I_{40 \times 40})$, with $\mu=0.5$.
  2. For $0 \leq \alpha \leq 1$, set $\hat{Y}(\alpha) = \alpha \bar{Y}.$
  3. Compute $MSE(\hat{Y}(\alpha)) = \frac{1}{40}\sum_{i=1}^{40} (\hat{Y}(\alpha)_i - 0.5)^2$.
  4. Repeat 500 times and plot the average of $MSE(\hat{Y}(\alpha))$ against $\alpha$.

For what value of $\alpha$ is $\hat{Y}(\alpha)$ unbiased?

Is this the best estimate of $\mu$ in terms of MSE?

In [13]:
%%R -h 800 -w 800
nsample = 40
ntrial = 500
mu = 0.5
sigma = 2.5

# average squared error of an estimate of the mean vector
MSE <- function(mu.hat, mu) {
  return(sum((mu.hat - mu)^2) / length(mu))
}

alpha <- seq(0.0, 1, length=20)
mse <- numeric(length(alpha))

for (i in 1:ntrial) {
  Z = rnorm(nsample) * sigma + mu        # one sample of size nsample
  for (j in 1:length(alpha)) {
    # accumulate the average (over trials) MSE of the shrunken estimate alpha * mean(Z)
    mse[j] = mse[j] + MSE(alpha[j] * mean(Z) * rep(1, nsample), mu * rep(1, nsample)) / ntrial
  }
}

plot(alpha, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
     xlab=expression(paste('Shrinkage parameter, ', alpha)), 
     ylab=expression(paste('MSE(', alpha, ')')), 
     cex.lab=1.5)

Shrinkage & Penalties

  • Shrinkage can be thought of as "constrained" or "penalized" minimization.

  • Constrained form: $$\text{minimize}_{\mu} \sum_{i=1}^n (Y_i - \mu)^2 \quad \text{subject to $\mu^2 \leq C$}$$

  • Lagrange multiplier form: equivalent to $$\widehat{\mu}_{\lambda} = \text{argmin}_{\mu} \sum_{i=1}^n (Y_i - \mu)^2 + \lambda \cdot \mu^2$$ for some $\lambda=\lambda_C$.

  • As we vary $\lambda$ we solve all versions of the constrained form.

Solving for $\widehat{\mu}_{\lambda}$

  • Differentiating: $- 2 \sum_{i=1}^n (Y_i - \widehat{\mu}_{\lambda}) + 2 \lambda \widehat{\mu}_{\lambda} = 0$
  • Solving $\widehat{\mu}_{\lambda} = \frac{\sum_{i=1}^n Y_i}{n + \lambda} = \frac{n}{n+\lambda} \overline{Y}.$
  • As $\lambda \rightarrow 0$, $\widehat{\mu}_{\lambda} \rightarrow {\overline{Y}}.$
  • As $\lambda \rightarrow \infty$, $\widehat{\mu}_{\lambda} \rightarrow 0.$

We see that $\widehat{\mu}_{\lambda} = \hat{Y}\left(\frac{n}{n+\lambda}\right).$

In other words, considering all penalized estimators traces out the MSE curve above.
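
As a quick numerical check (my addition, using the same simulation setup as above), the snippet below minimizes the penalized criterion directly with optimize() and compares the result to the closed form $\frac{n}{n+\lambda}\bar{Y}$.

# Numerical check (my addition): the penalized minimizer agrees with n/(n+lambda)*mean(Y).
set.seed(2)
Y <- rnorm(40, mean=0.5, sd=2.5)
penalized <- function(mu, lambda) sum((Y - mu)^2) + lambda * mu^2
for (lambda in c(0, 5, 50)) {
  numerical <- optimize(penalized, interval=c(-10, 10), lambda=lambda)$minimum
  closed.form <- length(Y) / (length(Y) + lambda) * mean(Y)
  print(c(lambda=lambda, numerical=numerical, closed.form=closed.form))
}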

In [14]:
%%R -h 800 -w 800
lam = nsample / alpha - nsample     # lambda corresponding to each alpha (alpha = 0 maps to lambda = Inf)
finite = is.finite(lam)             # drop the infinite value so plot() has finite x-limits
plot(lam[finite], mse[finite], type='l', lwd=2, col='red', ylim=c(0, max(mse)),
     xlab=expression(paste('Penalty parameter, ', lambda)), 
     ylab=expression(paste('MSE(', lambda, ')')), cex.lab=1.5)

How much to shrink?

  • In our one-sample example,
  • $$\begin{aligned} MSE_{pop}(\alpha) &= {\text{Var}}( \alpha \bar{Y}) + \text{Bias}(\alpha \bar{Y})^2 \\ &= \frac{\alpha^2 \sigma^2}{n} + \mu^2 (1 - \alpha)^2 \end{aligned}$$
  • Differentiating and solving: $$\begin{aligned} 0 &= -2 \mu^2(1 - \alpha^*) + 2 \frac{\alpha^* \sigma^2}{n} \\ \alpha^* & = \frac{\mu^2}{\mu^2+\sigma^2/n} \\ &= \frac{0.5^2}{0.5^2+2.5^2/40} \approx 0.62 \end{aligned}$$

We see that the optimal $\alpha$ depends on the unknown signal-to-noise ratio $SNR=\mu/(\sigma/\sqrt{n})$: in fact, $\alpha^* = SNR^2/(1+SNR^2)$.

In practice we might hope to estimate MSE with cross-validation.

Let's see how our theoretical choice matches the random MSE.

In [3]:
%%R -h 800 -w 800
plot(alpha, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
     xlab=expression(paste('Shrinkage parameter ', alpha)), 
     ylab=expression(paste('MSE(', alpha, ')')), cex.lab=1.5)
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)

Penalties & Priors

  • Minimizing $\sum_{i=1}^n (Y_i - \mu)^2 + \lambda \mu^2$ is similar to computing the "MLE" of $\mu$ if the likelihood were proportional to $$\exp \left(-\frac{1}{2\sigma^2}\left( \|Y-\mu\|^2_2 + \lambda \mu^2\right) \right).$$

  • If $\lambda=m$, an integer, then $\widehat{\mu}_{\lambda}$ is the sample mean of $(Y_1, \dots, Y_n,0 ,\dots, 0) \in \mathbb{R}^{n+m}$.

  • This is equivalent to adding some data with $Y=0$. To a Bayesian, this extra data is a prior distribution and we are computing the so-called MAP or posterior mode.
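
A tiny illustration of the augmented-data view (my addition; the numbers are arbitrary):

# Augmented-data check (my addition): with lambda = m an integer, the penalized
# estimate equals the plain sample mean after appending m zeros to the data.
set.seed(3)
Y <- rnorm(10, mean=1)
n <- length(Y)
m <- 5
c(augmented = mean(c(Y, rep(0, m))),
  closed.form = n / (n + m) * mean(Y))    # identical values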

AIC as penalized regression

  • Model selection with $C_p$ (or AIC with $\sigma^2$ assumed known) is a version of penalized regression.

  • The best subsets version of AIC (which is not exactly equivalent to step) $$ \hat{\beta}_{AIC} = \argmin_{\beta} \frac{1}{\sigma^2}\|Y-X\beta\|^2_2 + 2 \|\beta\|_0 $$ where $$ \|\beta\|_0 = \#\left\{j : \beta_j \neq 0 \right\} $$ is called the $\ell_0$ norm.

  • The $\ell_0$ penalty can be thought of as a measure of the complexity of the model. Most penalties used in practice are similar measures of complexity; see the brute-force sketch below.
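
For a small number of predictors the $\ell_0$ problem can be solved by brute force. Below is a hedged sketch (my addition; simulated data, $\sigma^2$ treated as known, and the always-included intercept left unpenalized) that enumerates all subsets and minimizes $RSS/\sigma^2 + 2\|\beta\|_0$.

# Brute-force best subsets under an l0 penalty (my addition, assuming sigma^2 known).
set.seed(4)
n <- 100; sigma <- 1
X <- matrix(rnorm(n * 4), n, 4)
Y <- drop(X %*% c(2, 0, 0, 1)) + sigma * rnorm(n)   # only predictors 1 and 4 matter
subsets <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), 4)))
crit <- apply(subsets, 1, function(S) {
  fit <- if (any(S)) lm(Y ~ X[, S, drop=FALSE]) else lm(Y ~ 1)
  sum(resid(fit)^2) / sigma^2 + 2 * sum(S)          # RSS/sigma^2 + 2 * ||beta||_0
})
subsets[which.min(crit), ]                          # should pick predictors 1 and 4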

Penalized regression in general

  • Not all biased models are better – we need a way to find "good" biased models.

  • Inference ($F$, $\chi^2$ tests, etc) is not quite exact for biased models. Though, there has been some recent work to address the issue of post-selection inference.

  • Heuristically, "large $\beta$" (measured by some norm) is interpreted as "complex model". Goal is really to penalize "complex" models, i.e. Occam’s razor.

  • If the truth really is complex, this may not work! (But then it will be hard to build a good model anyway ... statistical lore.)

Ridge regression

  • Assume that the columns $(X_j)_{1 \leq j \leq p}$ have zero mean and length 1, and that $Y$ has zero mean.

  • This is called the standardized model.

  • The ridge estimator is $$ \hat{\beta}_{\lambda} = \argmin_{\beta} \|Y-X\beta\|^2_2 + \lambda \|\beta\|^2_2.$$

  • Corresponds (through a Lagrange multiplier) to a quadratic constraint on the $\beta_j$'s.

  • This is the natural generalization of the penalized version of our shrinkage estimator.

Solving the normal equations

  • Normal equations: $\frac{\partial}{\partial \beta_l} SSE_{\lambda}(\beta) = -2 \langle Y - X\beta, X_l \rangle + 2 \lambda \beta_l$
  • Setting these to zero: $-2 \langle Y - X\widehat{\beta}_{\lambda}, X_l \rangle + 2 \lambda \widehat{\beta}_{l,\lambda} = 0, \qquad 1 \leq l \leq p$
  • In matrix form: $-X^TY + (X^TX + \lambda I) \widehat{\beta}_{\lambda} = 0$
  • Or $\widehat{\beta}_{\lambda} = (X^TX + \lambda I)^{-1} X^TY.$
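
A minimal sketch of this closed form (my addition; simulated data, with columns standardized by scale() to unit standard deviation rather than exact length 1):

# Ridge coefficients directly from the normal equations (my addition).
set.seed(5)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))      # centered columns, unit standard deviation
Y <- drop(X %*% rep(1, p)) + rnorm(n)
Y <- Y - mean(Y)                            # centered response, so no intercept is needed
ridge_fit <- function(X, Y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% Y)
}
cbind(ols   = drop(ridge_fit(X, Y, 0)),     # lambda = 0 recovers least squares
      ridge = drop(ridge_fit(X, Y, 10)))    # lambda = 10 shrinks the coefficients toward 0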

Ridge regression

In [4]:
%%R -h 800 -w 800
library(lars)
data(diabetes)
library(MASS)
diabetes.ridge <- lm.ridge(diabetes$y ~ diabetes$x, lambda=seq(0,10,0.05))
plot(diabetes.ridge, cex.lab=3, lwd=3)
Loaded lars 1.1


Choosing $\lambda$

  • If we knew $MSE$ as a function of $\lambda$ then we would simply choose the $\lambda$ that minimizes $MSE$.
  • To do this, we need to estimate $MSE$.
  • A popular method is cross-validation as a function of $\lambda$: break the data into smaller groups and use part of the data to predict the rest.
  • We saw this in diagnostics (Cook’s distance measured the fit with and without each point in the data set) and model selection.

$K$-fold cross-validation for penalized model

  • Fix a model (i.e. fix $\lambda$). Break data set into $K$ approximately equal sized groups $(G_1, \dots, G_K)$.
  • For each $i$ from 1 to $K$: fit the model using all groups except $G_i$, then predict the outcomes in group $G_i$ from this fit, yielding $\widehat{Y}_{j(i),\lambda}$ for $j \in G_i$.
  • Estimate $CV(\lambda) = \frac{1}{n}\sum_{i=1}^K \sum_{j \in G_i} (Y_j - \widehat{Y}_{j(i),\lambda})^2.$

Here is a function to estimate the CV for our one parameter example. In practice, we only have one sample to form the CV curve. In this example below, I will compute the average CV error for 500 trials to show that it is roughly comparable in shape to the MSE curve.

In [15]:
%%R
K = 10
CV = function(Z, alpha) {
    cve = numeric(K)
    l = length(Z)
    for (i in 1:K) {
        # indices of the i-th fold
        g = ((i-1)*l/K + 1):(i*l/K)
        # fit (here: the shrunken mean) on everything except fold i
        mu.hat = mean(Z[-g]) * alpha
        # squared prediction error on fold i
        cve[i] = sum((Z[g] - mu.hat)^2)
    }
    # total CV error and its standard deviation across folds
    return(c(sum(cve), sd(cve)))
}

Let's see how the parameter chosen by 10-fold CV compares to our theoretical choice.

In [17]:
%%R -h 800 -w 800
alpha = seq(0.0, 1, length=20)
avg.cv = numeric(length(alpha))

for (i in 1:ntrial) {
  Z = rnorm(nsample) * sigma + mu
  for (j in 1:length(alpha)) {
    current_cv = CV(Z, alpha[j])
    # accumulate the average (over trials) CV error for this alpha
    avg.cv[j] = avg.cv[j] + current_cv[1] / ntrial
  }
}
avg.cv = avg.cv / nsample   # scale by n to match the definition of CV(lambda)

plot(alpha, avg.cv, type='l', lwd=2, col='green',
     xlab='Shrinkage parameter, alpha', ylab='Average CV(alpha)')
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)      # theoretical optimum
abline(v=min(alpha[avg.cv == min(avg.cv)]), col='red', lty=2) # CV-chosen alpha

Let's see what one curve looks like on our smallish sample.

In [7]:
%%R
set.seed(0)
cv = numeric(length(alpha))
cv.sd = numeric(length(alpha))
Z = rnorm(nsample) * sigma + mu
for (j in 1:length(alpha)) {
    current_cv = CV(Z, alpha[j])
    cv[j] = current_cv[1]
    cv.sd[j] = current_cv[2]
}

plot(alpha, cv, type='l', lwd=2, col='green',
     xlab='Shrinkage parameter, alpha', ylab='CV(alpha)')
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)
abline(v=min(alpha[cv == min(cv)]), col='red', lty=2)

$K$-fold cross-validation (continued)

  • It is a general principle that can be used in other situations, not just for Ridge.
  • Pros (partial list): "objective" measure of a model.
  • Cons (partial list): inference is, strictly speaking, "out the window" (also true for other model selection procedures in theory).
  • If goal is not really inference about certain specific parameters, it is a reasonable way to compare models.

Generalized Cross Validation

  • A computational approximation to $n$-fold cross-validation (also known as leave-one-out cross-validation).
  • Let $S_{\lambda} = X(X^TX + \lambda I)^{-1} X^T$ be the matrix producing the ridge fitted values, $\widehat{Y}_{\lambda} = S_{\lambda} Y$.
  • Then $GCV(\lambda) = \frac{\|Y - S_{\lambda}Y\|^2}{(n - {\text{Tr}}(S_{\lambda}))^2}.$
  • The quantity ${\text{Tr}}(S_{\lambda})$ can be thought of as the effective degrees of freedom for this choice of $\lambda$.
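
A short sketch of the effective degrees of freedom (my addition): writing the SVD $X = UDV^T$, ${\text{Tr}}(S_\lambda) = \sum_j d_j^2/(d_j^2 + \lambda)$, which decreases from $p$ at $\lambda = 0$ toward 0 as $\lambda \to \infty$. The code below evaluates this on the centered diabetes design loaded in the earlier cells.

# Effective degrees of freedom for ridge (my addition), via the singular values of X.
d <- svd(scale(as.matrix(diabetes$x), center=TRUE, scale=FALSE))$d
ridge_df <- function(lambda) sum(d^2 / (d^2 + lambda))
sapply(c(0, 1, 10, 100, 1000), ridge_df)   # decreases from p = 10 toward 0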

Ridge regression

In [8]:
%%R
plot(diabetes.ridge$lambda, diabetes.ridge$GCV, xlab='Lambda', ylab='GCV', type='l', lwd=3, col='orange')
select(diabetes.ridge)
modified HKB estimator is 5.462251 
modified L-W estimator is 7.641667 
smallest value of GCV  at 3.25 

LASSO

  • Another popular penalized regression technique.
  • Use the standardized model.
  • The LASSO estimate is $$ \hat{\beta}_{\lambda} = \argmin_{\beta} \|Y-X\beta\|^2_2 + \lambda \|\beta\|_1$$ where $$ \|\beta\|_1 = \sum_{j=1}^p |\beta_j| $$ is the $\ell_1$ norm.

  • Corresponds (through a Lagrange multiplier) to an $\ell_1$ constraint on the $\beta_j$'s.

  • In theory and in practice, it works well when many $\beta_j$'s are 0 and, unlike ridge, it gives "sparse" solutions.

  • It is a (computable) approximation to the best subsets AIC model.

  • It is computable because the minimization problem is a convex problem.

Why do we get sparse solutions with the LASSO? Geometrically, the $\ell_1$ ball has corners on the coordinate axes, so the constrained solution often sits exactly on an axis, forcing some coefficients to be exactly zero; see the one-parameter sketch below.
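
A one-parameter illustration (my addition, mirroring the shrinkage example above): minimizing $\sum_i (Y_i - \mu)^2 + \lambda|\mu|$ soft-thresholds $\bar{Y}$, so the estimate is exactly zero once $\lambda$ is large enough, whereas the ridge-style solution $\frac{n}{n+\lambda}\bar{Y}$ only approaches zero.

# Lasso-type shrinkage in one parameter (my addition): the minimizer is the
# soft-thresholded mean sign(mean(Y)) * max(abs(mean(Y)) - lambda/(2*n), 0).
set.seed(6)
Y <- rnorm(40, mean=0.5, sd=2.5)
n <- length(Y)
lasso_obj <- function(mu, lambda) sum((Y - mu)^2) + lambda * abs(mu)
for (lambda in c(0, 20, 200)) {
  numerical <- optimize(lasso_obj, interval=c(-10, 10), lambda=lambda)$minimum
  soft <- sign(mean(Y)) * max(abs(mean(Y)) - lambda / (2 * n), 0)
  # for large lambda the numerical minimizer is only approximately 0
  print(c(lambda=lambda, numerical=numerical, soft.threshold=soft))
}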

LASSO

In [9]:
%%R -h 800 -w 800
library(lars)
data(diabetes)
diabetes.lasso = lars(diabetes$x, diabetes$y, type='lasso')
plot(diabetes.lasso)

Cross-validation for the LASSO

The lars package has a built in function to estimate CV.

In [10]:
%%R
cv.lars(diabetes$x, diabetes$y, K=10, type='lasso')

Elastic Net

  • Mix between LASSO and ridge regression.
  • The ENET estimator is $$ \hat{\beta}_{\lambda_1, \lambda_2} = \argmin_{\beta} \|Y-X\beta\|^2_2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|^2_2.$$
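
One common implementation is the glmnet package (my addition; this assumes glmnet is installed and uses its parameterization, in which a single $\lambda$ is split between the $\ell_1$ and squared $\ell_2$ penalties by a mixing weight $\alpha$):

# Elastic net coefficient paths with glmnet (my addition, assuming glmnet is installed).
# glmnet's penalty is lambda * ((1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1),
# so alpha = 0.5 is an equal mix of the ridge and lasso penalties.
library(glmnet)
enet <- glmnet(as.matrix(diabetes$x), diabetes$y, alpha=0.5)
plot(enet, xvar='lambda', lwd=2)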

The beginnings of inference after LASSO

In [11]:
%%R -h 800 -w 800
library(genlasso)
GL = genlasso(diabetes$y, diabetes$x, diag(rep(1,ncol(diabetes$x))))
plot(GL, cex.lab=1.5)
Loading required package: Matrix
Loading required package: lattice
Loading required package: igraph
