Bias-variance tradeoff

  • One goal of a regression analysis is to "build" a model that predicts well.

  • This is slightly different than the goal of making inferences about $\beta$ that we've focused on so far.

  • What does "predict well" mean? $$ \begin{aligned} MSE_{pop}({\cal M}) &= {\mathbb{E}}\left((Y_{new} - \widehat{Y}_{new,{\cal M}})^2\right) \\ &= {\text{Var}}(Y_{new}) + {\text{Var}}(\widehat{Y}_{new,{\cal M}}) + \text{Bias}(\widehat{Y}_{new,{\cal M}})^2. \end{aligned}$$

  • With $\sigma^2$ known, the quantity $C_p({\cal M})$ is an unbiased estimator of this quantity.
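
For reference, the conventional form of Mallows' $C_p$ for a model ${\cal M}$ with $p({\cal M})$ coefficients is $$C_p({\cal M}) = \frac{SSE({\cal M})}{\sigma^2} - n + 2 \, p({\cal M})$$ (the notes above only use the fact that it estimates $MSE_{pop}$, so take this as the standard definition rather than a formula stated in these notes). With $\sigma^2$ known, it can be computed for every candidate model, trading off fit against the complexity term $2\,p({\cal M})$.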

Shrinkage estimators

  1. Generate $Y_{100 \times 1} \sim N(\mu \cdot 1, 5^2 I_{100 \times 100})$, with $\mu=0.5$.
  2. For $0 \leq \alpha \leq 1$, set $\hat{Y}(\alpha) = \alpha \bar{Y}.$
  3. Compute $MSE(\hat{Y}(\alpha)) = \frac{1}{100}\sum_{i=1}^{100} (\hat{Y}(\alpha) - 0.5)^2$.
  4. Repeat 500 times, plot average of $MSE(\hat{Y}(\alpha))$.

For what value of $\alpha$ is $\hat{Y}(\alpha)$ unbiased?

Is this the best estimate of $\mu$ in terms of MSE?

In [1]:
%%R -h 600 -w 600
par(cex.lab=5)
nsample = 100
ntrial = 500
mu = 0.5
sigma = 5
# Average squared error per coordinate of the estimate of the mean vector
MSE <- function(mu.hat, mu) {
  return(sum((mu.hat - mu)^2) / length(mu))
}

alpha <- seq(0.0, 1, length=20)

mse <- numeric(length(alpha))

# Average the MSE of the shrunken estimator alpha * mean(Z) over ntrial draws
for (i in 1:ntrial) {
  Z = rnorm(nsample) * sigma + mu
  for (j in 1:length(alpha)) {
    mse[j] = mse[j] + MSE(alpha[j] * mean(Z) * rep(1, nsample), mu * rep(1, nsample)) / ntrial
  }
}

plot(alpha, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
     xlab=expression(paste('Shrinkage parameter, ', alpha)), 
     ylab=expression(paste('MSE(', alpha, ')')), 
     cex.lab=1.5)

Shrinkage & Penalties

  • Shrinkage can be thought of as "constrained" or "penalized" minimization.

  • Constrained form: $$\text{minimize}_{\mu} \sum_{i=1}^n (Y_i - \mu)^2 \quad \text{subject to $\mu^2 \leq C$}$$

  • Lagrange multiplier form: equivalent to $$\widehat{\mu}_{\lambda} = \text{argmin}_{\mu} (Y_i - \mu)^2 + \lambda \cdot \mu^2$$ for some $\lambda=\lambda_C$.

  • As we vary $\lambda$ we solve all versions of the constrained form.

Solving for $\widehat{\mu}_{\lambda}$

  • Differentiating: $- 2 \sum_{i=1}^n (Y_i - \widehat{\mu}_{\lambda}) + 2 \lambda \widehat{\mu}_{\lambda} = 0$
  • Solving $\widehat{\mu}_{\lambda} = \frac{\sum_{i=1}^n Y_i}{n + \lambda} = \frac{n}{n+\lambda} \overline{Y}.$
  • As $\lambda \rightarrow 0$, $\widehat{\mu}_{\lambda} \rightarrow {\overline{Y}}.$
  • As $\lambda \rightarrow \infty$ $\widehat{\mu}_{\lambda} \rightarrow 0.$

We see that $\widehat{\mu}_{\lambda} = \bar{Y} \cdot \left(\frac{n}{n+\lambda}\right).$
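
As a quick numerical check (a minimal sketch on simulated data, not part of the original notes), we can minimize the penalized criterion with optimize and compare to the closed form $\bar{Y} \cdot n/(n+\lambda)$:

In [ ]:
%%R
# Sketch: verify mu_hat_lambda = n / (n + lambda) * mean(Y) by minimizing
# sum((Y - mu)^2) + lambda * mu^2 numerically (illustrative data only).
set.seed(1)
Y = rnorm(100) * 5 + 0.5
penalized.SSE = function(mu, lam) sum((Y - mu)^2) + lam * mu^2
for (lam in c(0, 10, 100, 1000)) {
    numerical = optimize(penalized.SSE, interval=c(-10, 10), lam=lam)$minimum
    closed.form = length(Y) / (length(Y) + lam) * mean(Y)
    cat('lambda =', lam, ': numerical =', numerical, ', closed form =', closed.form, '\n')
}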

**In other words, considering all penalized estimators traces out the MSE curve above.**

In [2]:
%%R -h 600 -w 600
lam = nsample / alpha - nsample
plot(lam, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
     xlab=expression(paste('Penalty parameter,',  lambda)), 
     ylab=expression(paste('MSE(', lambda, ')')))

How much to shrink?

  • In our one-sample example,
  • $$\begin{aligned} MSE_{pop}(\alpha) &= {\text{Var}}( \alpha \bar{Y}) + \text{Bias}(\alpha \bar{Y})^2 \\ &= \frac{\alpha^2 \sigma^2}{n} + \mu^2 (1 - \alpha)^2. \end{aligned}$$
  • Differentiating with respect to $\alpha$ and solving: $$\begin{aligned} 0 &= -2 \mu^2(1 - \alpha^*) + 2 \frac{\alpha^* \sigma^2}{n} \\ \alpha^* &= \frac{\mu^2}{\mu^2+\sigma^2/n} \\ &= \frac{0.5^2}{0.5^2+25/100} = 0.5. \end{aligned}$$

We see that the optimal $\alpha$ depends on the unknown $SNR=\mu/(\sigma/\sqrt{n})$.
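
Explicitly, rewriting the formula above, $$\alpha^* = \frac{\mu^2}{\mu^2 + \sigma^2/n} = \frac{SNR^2}{1 + SNR^2},$$ so a large signal-to-noise ratio calls for little shrinkage and a small one for a lot.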

In practice we might hope to estimate MSE with cross-validation.

Let's see how our theoretical choice matches the MSE curve estimated from our simulation.

In [3]:
%%R -h 600 -w 600
plot(alpha, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
     xlab=expression(paste('Shrinkage parameter ', alpha)), 
     ylab=expression(paste('MSE(', alpha, ')')))
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)

Penalties & Priors

  • Minimizing $\sum_{i=1}^n (Y_i - \mu)^2 + \lambda \mu^2$ is similar to computing the "MLE" of $\mu$ if the likelihood were proportional to $$\exp \left(-\frac{1}{2\sigma^2}\left( \|Y-\mu\|^2_2 + \lambda \mu^2\right) \right).$$

  • If $\lambda=m$, an integer, then $\widehat{\mu}_{\lambda}$ is the sample mean of $(Y_1, \dots, Y_n, 0, \dots, 0) \in \mathbb{R}^{n+m}$.

  • This is equivalent to adding some data with $Y=0$.

  • To a Bayesian, this extra data is a prior distribution and we are computing the so-called MAP or posterior mode.
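
To make the Bayesian connection concrete (a short added derivation, assuming a $N(0, \sigma^2/\lambda)$ prior, which is one natural way to formalize the statement above): if $\mu \sim N(0, \sigma^2/\lambda)$ and $Y_i \mid \mu \sim N(\mu, \sigma^2)$ independently, the log posterior is, up to constants, $$-\frac{1}{2\sigma^2} \left( \sum_{i=1}^n (Y_i - \mu)^2 + \lambda \mu^2 \right),$$ so the posterior mode (MAP) is exactly the shrinkage estimator $\widehat{\mu}_{\lambda} = \frac{n}{n+\lambda} \bar{Y}$.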

AIC as penalized regression

  • Model selection with $C_p$ (or AIC with $\sigma^2$ assumed known) is a version of penalized regression.

  • The best subsets version of AIC (which is not exactly equivalent to step) $$ \hat{\beta}_{AIC} = \text{argmin}_{\beta} \frac{1}{\sigma^2}\|Y-X\beta\|^2_2 + 2 \|\beta\|_0 $$ where $$ \|\beta\|_0 = \#\left\{j : \beta_j \neq 0 \right\} $$ is called the $\ell_0$ norm.

  • The $\ell_0$ penalty can be thought of as a measure of complexity of the model. Most penalties are similar versions of complexity.
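
As an illustration (a minimal sketch on simulated data, not part of the original notes), when $p$ is small the $\ell_0$-penalized criterion can be minimized by brute force, simply enumerating all $2^p$ subsets:

In [ ]:
%%R
# Sketch: brute-force "best subsets" AIC with sigma^2 known, on toy data.
# Minimize |Y - X beta|^2 / sigma^2 + 2 * #{j : beta_j != 0} over all subsets.
set.seed(1)
n = 50; p = 5; sigma = 2
X = matrix(rnorm(n * p), n, p)
beta = c(3, 1.5, 0, 0, 0)
Y = drop(X %*% beta) + rnorm(n) * sigma
subsets = as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), p)))
crit = apply(subsets, 1, function(keep) {
    rss = if (sum(keep) == 0) sum(Y^2) else sum(resid(lm(Y ~ X[, keep, drop=FALSE] - 1))^2)
    rss / sigma^2 + 2 * sum(keep)
})
subsets[which.min(crit), ]   # which columns the l0-penalized criterion keeps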

Penalized regression in general

  • Not all biased models are better – we need a way to find "good" biased models.

  • Inference ($F$, $\chi^2$ tests, etc.) is not quite exact for biased models. However, there has been some recent work addressing post-selection inference, at least for some penalized regression problems.

  • Heuristically, "large $\beta$" (measured by some norm) is interpreted as "complex model". The goal is really to penalize "complex" models, i.e. to apply Occam’s razor.

  • If the truth really is complex, this may not work! (But it will then be hard to build a good model anyway ... (statistical lore))

Ridge regression

  • Assume that the columns $(X_j)_{1 \leq j \leq p}$ have zero mean and length 1, and that $Y$ has zero mean.

  • This is called the standardized model.

  • The ridge estimator is $$ \hat{\beta}_{\lambda} = \text{argmin}_{\beta} \|Y-X\beta\|^2_2 + \lambda \|\beta\|^2_2.$$

  • Corresponds (through Lagrange multiplier) to a quadratic constraint on the $\beta_j$’s.

  • This is the natural generalization of the penalized version of our shrinkage estimator.

Solving the normal equations

  • Normal equations: $$\frac{\partial}{\partial \beta_l} SSE_{\lambda}(\beta) = -2 \langle Y - X\beta, X_l \rangle + 2 \lambda \beta_l$$
  • $$-2 \langle Y - X\widehat{\beta}_{\lambda}, X_l \rangle + 2 \lambda \widehat{\beta}_{l,\lambda} = 0, \qquad 1 \leq l \leq p$$
  • In matrix form: $$-X^TY + (X^TX + \lambda I) \widehat{\beta}_{\lambda} = 0.$$
  • Or $$\widehat{\beta}_{\lambda} = (X^TX + \lambda I)^{-1} X^TY.$$
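
As a quick check (a sketch on simulated data, not from the original notes), the closed-form ridge solution agrees with what a general-purpose optimizer finds for the penalized criterion:

In [ ]:
%%R
# Sketch: compare the closed-form ridge estimator with direct numerical
# minimization of |Y - X beta|^2 + lambda * |beta|^2 (illustrative only).
set.seed(2)
n = 100; p = 4; lam = 10
X = scale(matrix(rnorm(n * p), n, p))            # roughly standardized design
Y = drop(X %*% c(2, -1, 0, 0.5)) + rnorm(n)
Y = Y - mean(Y)
ridge.closed = solve(t(X) %*% X + lam * diag(p), t(X) %*% Y)
ridge.obj = function(b) sum((Y - X %*% b)^2) + lam * sum(b^2)
ridge.numeric = optim(rep(0, p), ridge.obj, method='BFGS')$par
cbind(ridge.closed, ridge.numeric)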

Ridge regression

In [4]:
%%R -h 600 -w 600
library(lars)
data(diabetes)
library(MASS)
diabetes.ridge <- lm.ridge(diabetes$y ~ diabetes$x, lambda=seq(0,10,0.05))
plot(diabetes.ridge, lwd=3)
Loaded lars 1.2

Choosing $\lambda$

  • If we knew $MSE$ as a function of $\lambda$ then we would simply choose the $\lambda$ that minimizes $MSE$.
  • To do this, we need to estimate $MSE$.
  • A popular method is cross-validation: estimate the MSE as a function of $\lambda$ by breaking the data up into smaller groups and using part of the data to predict the rest.
  • We saw this in diagnostics (Cook’s distance measured the fit with and without each point in the data set) and model selection.

$K$-fold cross-validation for penalized model

  • Fix a model (i.e. fix $\lambda$). Break data set into $K$ approximately equal sized groups $(G_1, \dots, G_K)$.
  • For each $i = 1, \dots, K$: use all groups except $G_i$ to fit the model, then predict the outcome in group $G_i$ based on this fit, giving $\widehat{Y}_{j(i),\lambda}, j \in G_i$.
  • Estimate $CV(\lambda) = \frac{1}{n}\sum_{i=1}^K \sum_{j \in G_i} (Y_j - \widehat{Y}_{j(i),\lambda})^2.$

Here is a function to estimate the CV for our one parameter example. In practice, we only have one sample to form the CV curve. In this example below, I will compute the average CV error for 500 trials to show that it is roughly comparable in shape to the MSE curve.

In [5]:
%%R
K = 5
# K-fold CV for the shrinkage estimator alpha * mean(Z): returns the total
# CV error summed over the folds, followed by the SD of the per-fold errors.
CV = function(Z, alpha) {
    cve = numeric(K)
    l = length(Z)
    for (i in 1:K) {
        g = c(((i-1)*l/K+1):(i*l/K))    # indices of fold i
        mu.hat = mean(Z[-g]) * alpha    # estimate using the other folds
        cve[i] = sum((Z[g]-mu.hat)^2)   # prediction error on fold i
    }
    return(c(sum(cve), sd(cve)))
}

Let's see how the parameter chosen by 5-fold CV compares to our theoretical choice.

In [6]:
%%R -h 600 -w 600

alpha = seq(0.0,1,length=20)

mse = numeric(length(alpha))
avg.cv = numeric(length(alpha)) 

# Average the 5-fold CV curve over ntrial independent samples
for (i in 1:ntrial) {
    Z = rnorm(nsample) * sigma + mu
    for (j in 1:length(alpha)) {
        current_cv = CV(Z, alpha[j])
        avg.cv[j] = avg.cv[j] + current_cv[1] / ntrial
    }
}

avg.cv = avg.cv / nsample

plot(alpha, avg.cv, type='l', lwd=2, col='green',
     xlab='Shrinkage parameter, alpha', ylab='Average CV(alpha)')
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)        # theoretical optimum
abline(v=min(alpha[avg.cv == min(avg.cv)]), col='red', lty=2)   # CV minimizer

The curve above shows what would happen if we could repeat this and average over many samples.

Let's see what one curve looks like on our sample. This is the result we might get in practice on a given data set.

In [7]:
%%R
set.seed(0)

cv = numeric(length(alpha))
cv.sd = numeric(length(alpha))
Z = rnorm(nsample) * sigma + mu
for (j in 1:length(alpha)) {
    current_cv = CV(Z, alpha[j])
    cv[j] = current_cv[1]
    cv.sd[j] = current_cv[2]
}

plot(alpha, cv, type='l', lwd=2, col='green',
     xlab='Shrinkage parameter, alpha', ylab='CV(alpha)')
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)   # theoretical optimum
abline(v=min(alpha[cv == min(cv)]), col='red', lty=2)      # CV minimizer

Generalized Cross Validation

  • A computational shortcut for $n$-fold cross-validation (also known as leave-one-out cross-validation).
  • Let $S_{\lambda} = X(X^TX + \lambda I)^{-1} X^T$ be the matrix producing the ridge fitted values, so that $\widehat{Y}_{\lambda} = S_{\lambda}Y$.
  • Then $GCV(\lambda) = \frac{\|Y - S_{\lambda}Y\|^2}{n - {\text{Tr}}(S_{\lambda})}.$
  • The quantity ${\text{Tr}}(S_{\lambda})$ can be thought of as the effective degrees of freedom for this choice of $\lambda$.
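
As a sketch (following the formula above with a centered response and scaled design; its values need not match diabetes.ridge$GCV exactly, since lm.ridge standardizes variables and may normalize differently, but the shape as a function of $\lambda$ is the point):

In [ ]:
%%R
# Sketch: compute GCV(lambda) by hand with S_lambda = X (X^T X + lambda I)^{-1} X^T.
X = scale(as.matrix(diabetes$x))       # center and scale the predictors
Yc = diabetes$y - mean(diabetes$y)     # center the response
n = nrow(X)
gcv = function(lam) {
    S = X %*% solve(t(X) %*% X + lam * diag(ncol(X))) %*% t(X)
    sum((Yc - S %*% Yc)^2) / (n - sum(diag(S)))
}
lams = seq(0.05, 10, 0.25)
plot(lams, sapply(lams, gcv), type='l', lwd=3, col='orange',
     xlab='Lambda', ylab='GCV (by hand)')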

Ridge regression

In [8]:
%%R -h 600 -w 600
par(cex.lab=2)
plot(diabetes.ridge$lambda, diabetes.ridge$GCV, xlab='Lambda', ylab='GCV', type='l', lwd=3, col='orange')
select(diabetes.ridge)
modified HKB estimator is 5.462251 
modified L-W estimator is 7.641667 
smallest value of GCV  at 3.25 

LASSO

  • Another popular penalized regression technique.
  • Use the standardized model.
  • The LASSO estimate is $$ \hat{\beta}_{\lambda} = \text{argmin}_{\beta} \|Y-X\beta\|^2_2 + \lambda \|\beta\|_1$$ where $$ \|\beta\|_1 = \sum_{j=1}^p |\beta_j| $$ is the $\ell_1$ norm.

  • Corresponds (through Lagrange multiplier) to an $\ell_1$ constraint on the $\beta_j$’s.

  • In theory and practice, it works well when many $\beta_j$’s are 0 and gives "sparse" solutions, unlike ridge.

  • It is a (computable) approximation to the best subsets AIC model.

  • It is computable because the minimization problem is a convex problem.

Why do we get sparse solutions with the LASSO?
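
One way to see it in the spirit of the one-sample example (a short added derivation; the usual geometric picture is that the $\ell_1$ ball has corners on the coordinate axes): replace the ridge penalty with an $\ell_1$ penalty and minimize $$\sum_{i=1}^n (Y_i - \mu)^2 + \lambda |\mu|.$$ Setting the derivative to zero away from $\mu = 0$ gives $2n\mu - 2n\bar{Y} + \lambda \, \text{sign}(\mu) = 0$, so $$\widehat{\mu}_{\lambda} = \text{sign}(\bar{Y}) \left( |\bar{Y}| - \frac{\lambda}{2n} \right)_+,$$ which is exactly zero whenever $|\bar{Y}| \leq \lambda/(2n)$. The ridge-type penalty $\lambda \mu^2$ only multiplies $\bar{Y}$ by $n/(n+\lambda)$: it shrinks toward zero but never reaches it.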

LASSO

In [9]:
%%R -h 600 -w 600

library(lars)
data(diabetes)
diabetes.lasso = lars(diabetes$x, diabetes$y, type='lasso')
plot(diabetes.lasso)

Cross-validation for the LASSO

The lars package has a built-in function to estimate CV.

In [10]:
%%R -h 600 -w 600
par(cex.lab=2)
cv.lars(diabetes$x, diabetes$y, K=10, type='lasso')

Elastic Net

  • Mix between LASSO and ridge regression.
  • The ENET estimator is $$ \hat{\beta}_{\lambda_1, \lambda_2} = \text{argmin}_{\beta} \|Y-X\beta\|^2_2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|^2_2.$$
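
For reference, a sketch using the glmnet package (an assumption: glmnet is not used elsewhere in these notes; it parametrizes the penalty as $\lambda \left[ (1-\alpha) \|\beta\|_2^2 / 2 + \alpha \|\beta\|_1 \right]$, so its single $\lambda$ and mixing weight $\alpha$ play the roles of $\lambda_1$ and $\lambda_2$ above):

In [ ]:
%%R
# Sketch: an elastic net fit on the diabetes data with glmnet,
# mixing the l1 and l2 penalties (mixing weight alpha = 0.5).
library(glmnet)
x = as.matrix(diabetes$x)
diabetes.enet = glmnet(x, diabetes$y, alpha=0.5)
plot(diabetes.enet)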

The beginnings of inference after LASSO

In [11]:
%%R -h 600 -w 600
par(cex.lab=2)
library(genlasso)
GL = genlasso(diabetes$y, diabetes$x, diag(rep(1,ncol(diabetes$x))))
plot(GL)
Loading required package: Matrix
Loading required package: igraph

Covariance test for LASSO path

  • The knots $\lambda_i=\lambda_i(X,y)$ at which variables enter the LASSO path can be thought of as event "times".

  • They are random variables themselves; maybe we can say something about their distribution?

  • Suppose $\text{diag}(X^TX)=1$ (i.e. we rescale the columns of the design matrix to have length 1).

  • Then, under $H_0:\beta=0$ $$ \frac{1 - {\tt pnorm}(\lambda_1/\sigma)}{1 - {\tt pnorm}(\lambda_2/\sigma)} \sim \text{Unif}(0,1). $$

  • Can also describe what happens under other alternatives. Take a look at the paper.

  • Code coming!

Back to top