Linear regression

  • The bread and butter of applied statistics.

  • Dependent (random) variable $Y \in \mathbb{R}$, independent (random) variable $X \in \mathbb{R}^p$.

  • Goal: to produce a regression function $$ \mu(X) = E(Y|X). $$

  • Linear regression model just says $$ \mu(X) = X\beta, \qquad \beta \in \mathbb{R}^p. $$

Diabetes data

  • In this study, the outcome of interest (dependent variable) is a measure of the progression of diabetes one year after diagnosis.

  • The independent variables included bmi, age, sex, blood pressure, and six blood serum measurements.

In [1]:
%load_ext rmagic
In [2]:
%%R
library(lars)
data(diabetes)
plot(diabetes$x[,"bmi"], diabetes$y, pch=23, bg='red')
Loaded lars 1.1


Least squares regression

  • A parametric model based on observing an IID sample $(Y_i, X_i)_{1 \leq i \leq n}$.

  • Model: $$ Y_{n \times 1} | X_{n \times p} \sim N\left(X_{n \times p}\beta_{p \times 1}, \sigma^2 I_{n \times n} \right) $$

  • Least squares estimator $$ \hat{\beta} = \text{argmin}_{\beta \in \mathbb{R}^p} \frac{1}{2} \|Y-X\beta\|^2_2. $$

  • Under our model, if $n > p$ and $\text{rank}(X)=p$, then $$ \hat{\beta} = (X^TX)^{-1}X^TY \sim N\left(\beta, \sigma^2 (X^TX)^{-1} \right). $$
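
As a sketch (not part of the original analysis), the closed form can be checked directly against lm for the simple bmi regression used below:

# sketch: compute (X^T X)^{-1} X^T Y directly and compare with lm()
library(lars)
data(diabetes)
Xd = cbind(1, diabetes$x[, "bmi"])                 # design matrix with intercept
Yd = diabetes$y
print(solve(t(Xd) %*% Xd, t(Xd) %*% Yd))           # hand-computed beta-hat
print(coef(lm(diabetes$y ~ diabetes$x[, "bmi"])))  # should agree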

Gaussian facts

  • We write $\mathbb{R}^q \ni Z \sim N(\mu, \Sigma)$ for the multivariate Gaussian distribution.

  • This means that for any $a \in \mathbb{R}^q$ we have $$ a^TZ \sim N(a^T\mu, a^T\Sigma a). $$

  • For $A_{p \times q}$ fixed, we then have $$ AZ \sim N(A\mu, A\Sigma A^T). $$

  • The distribution of $\hat{\beta}$ follows directly by taking $A=(X^TX)^{-1}X^T$ and $Z = Y \sim N(X\beta, \sigma^2 I)$.
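
A small Monte Carlo sketch (my own illustration, with a simulated design) of this last fact: the sampling covariance of $\hat{\beta}$ should approach $\sigma^2 (X^TX)^{-1}$.

# sketch: check Cov(beta.hat) = sigma^2 (X^T X)^{-1} by simulation
set.seed(1)
n = 100; sigma = 2
Xsim = matrix(rnorm(n * 3), n, 3)
beta = c(1, -2, 0.5)
betahats = replicate(2000, {
    Ysim = Xsim %*% beta + sigma * rnorm(n)
    drop(solve(t(Xsim) %*% Xsim, t(Xsim) %*% Ysim))
})
print(cov(t(betahats)))                    # empirical covariance of beta-hat
print(sigma^2 * solve(t(Xsim) %*% Xsim))   # theoretical covariance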

Example

In [3]:
%%R
library(lars)
data(diabetes)
plot(diabetes$x[,"bmi"], diabetes$y, pch=23, bg='red')
In [4]:
%%R
print(summary(lm(diabetes$y ~ diabetes$x[,'bmi'])))

Call:
lm(formula = diabetes$y ~ diabetes$x[, "bmi"])

Residuals:
     Min       1Q   Median       3Q      Max 
-164.920  -43.572   -8.649   46.344  154.878 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          152.133      2.974   51.16   <2e-16 ***
diabetes$x[, "bmi"]  949.435     62.515   15.19   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 62.52 on 440 degrees of freedom
Multiple R-squared: 0.3439,	Adjusted R-squared: 0.3424 
F-statistic: 230.7 on 1 and 440 DF,  p-value: < 2.2e-16 


Pivots: a basic tool for inference

  • Over and over again in statistics, we talk about $t$-statistics $$ T=T(\theta_0) \overset{H_0:\theta=\theta_0}{=} \frac{\hat{\theta}-\theta_0}{SE(\hat{\theta})}. $$

  • Often, $\theta_0=0$, as in the above table.

  • But, whatever $\theta_0$ is, the above quantity has the same distribution: Student's $t$ distribution with 440 degrees of freedom. We call such a quantity a pivot.

  • Pivots generally lead to confidence intervals $$ \hat{\theta} \pm t_{440,0.975} SE(\hat{\theta}) $$ with the property $$ \mathbb{P}\left(\theta_0 \in \hat{\theta} \pm t_{440,0.975} SE(\hat{\theta}) \right)=0.95. $$

  • By the above rule, the confidence interval for the slope is $949 \pm 125$.

  • Correct interpretation: "if we redrew a sample of the same size from this population, 95% of the time the intervals would cover the true parameter (assuming the model is correct)."

  • Incorrect interpretation: "we are 95% certain that the slope is in the interval $949 \pm 125$."

  • Pivots also lead to testing rules: $$ \text{Reject $H_0:\theta=\theta_0$ if $|T(\theta_0)| > t_{440,0.975} \approx 2$.} $$

  • The $t$-statistics in the table correspond to $T(0)$ -- tests that a parameter is 0 (the arithmetic for the bmi row is sketched below).
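
As a sketch of the arithmetic behind the table above, the $t$-statistic and 95% interval for the bmi slope can be recomputed from the estimate and its standard error:

# sketch: reproduce the bmi t-statistic and 95% confidence interval by hand
fit = lm(diabetes$y ~ diabetes$x[, "bmi"])
est = coef(summary(fit))[2, "Estimate"]
se  = coef(summary(fit))[2, "Std. Error"]
print(est / se)                                # the t-statistic T(0)
print(est + c(-1, 1) * qt(0.975, 440) * se)    # 95% confidence interval for the slope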

In [5]:
%%R
qt(0.975,440)
pt(qt(0.975,440),440)
[1] 0.975

In [6]:
%%R
print(summary(diabetes$x))
      age                 sex                bmi                 map           
 Min.   :-0.107226   Min.   :-0.04464   Min.   :-0.090275   Min.   :-0.112400  
 1st Qu.:-0.037299   1st Qu.:-0.04464   1st Qu.:-0.034229   1st Qu.:-0.036656  
 Median : 0.005383   Median :-0.04464   Median :-0.007284   Median :-0.005671  
 Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.000000  
 3rd Qu.: 0.038076   3rd Qu.: 0.05068   3rd Qu.: 0.031248   3rd Qu.: 0.035644  
 Max.   : 0.110727   Max.   : 0.05068   Max.   : 0.170555   Max.   : 0.132044  
       tc                 ldl                 hdl           
 Min.   :-0.126781   Min.   :-0.115613   Min.   :-0.102307  
 1st Qu.:-0.034248   1st Qu.:-0.030358   1st Qu.:-0.035117  
 Median :-0.004321   Median :-0.003819   Median :-0.006584  
 Mean   : 0.000000   Mean   : 0.000000   Mean   : 0.000000  
 3rd Qu.: 0.028358   3rd Qu.: 0.029844   3rd Qu.: 0.029312  
 Max.   : 0.153914   Max.   : 0.198788   Max.   : 0.181179  
      tch                 ltg                 glu           
 Min.   :-0.076395   Min.   :-0.126097   Min.   :-0.137767  
 1st Qu.:-0.039493   1st Qu.:-0.033249   1st Qu.:-0.033179  
 Median :-0.002592   Median :-0.001948   Median :-0.001078  
 Mean   : 0.000000   Mean   : 0.000000   Mean   : 0.000000  
 3rd Qu.: 0.034309   3rd Qu.: 0.032433   3rd Qu.: 0.027917  
 Max.   : 0.185234   Max.   : 0.133599   Max.   : 0.135612  

In [7]:
%%R
diabetes.lm = lm(y ~ x, data=diabetes)
print(summary(diabetes.lm))

Call:
lm(formula = y ~ x, data = diabetes)

Residuals:
     Min       1Q   Median       3Q      Max 
-155.829  -38.534   -0.227   37.806  151.355 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  152.133      2.576  59.061  < 2e-16 ***
xage         -10.012     59.749  -0.168 0.867000    
xsex        -239.819     61.222  -3.917 0.000104 ***
xbmi         519.840     66.534   7.813 4.30e-14 ***
xmap         324.390     65.422   4.958 1.02e-06 ***
xtc         -792.184    416.684  -1.901 0.057947 .  
xldl         476.746    339.035   1.406 0.160389    
xhdl         101.045    212.533   0.475 0.634721    
xtch         177.064    161.476   1.097 0.273456    
xltg         751.279    171.902   4.370 1.56e-05 ***
xglu          67.625     65.984   1.025 0.305998    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 54.15 on 431 degrees of freedom
Multiple R-squared: 0.5177,	Adjusted R-squared: 0.5066 
F-statistic: 46.27 on 10 and 431 DF,  p-value: < 2.2e-16 


  • Notice that the coefficient of bmi changed when we added the other variables. We say that the effect of bmi, with the other variables included, allows for (adjusts for) all the other variables.

  • Notice that there is more than one linear regression model for each pair $Y \in \mathbb{R}^n, X \in \mathbb{R}^{n \times p}$. This raises the issue of model selection.

  • The confidence intervals above are intervals for parameters like $\beta_{{\tt bmi}}$. Each interval has 95% coverage. What can we say about the 11 intervals?

  • Alternatively, in the summary above, there are 11 $p$-values. What can we say about these 11 random variables? This raises the issue of simultaneous inference.

In [8]:
%%R
confint(diabetes.lm)
                  2.5 %     97.5 %
(Intercept)   147.07069  157.19628
xage         -127.44823  107.42383
xsex         -360.15053 -119.48765
xbmi          389.06918  650.61039
xmap          195.80469  452.97616
xtc         -1611.16947   26.80114
xldl         -189.62083 1143.11251
xhdl         -316.68469  518.77383
xtch         -140.31347  494.44183
xltg          413.40886 1089.14978
xglu          -62.06548  197.31626

Simultaneous inference

  • We performed 11 hypothesis tests above. Let's call the null hypotheses $$H_{0,{\tt v}}: \beta_{{\tt v}}=0$$ where v is one of our independent (random) variables. The alternatives are $H_{a, {\tt v}}$.
  • If we make a decision about each hypothesis, this can be represented as a map $$ \mathbb{R}^n \times \mathbb{R}^{n \times p} \overset{\delta}{\rightarrow} \{0,1\}^p. $$

  • In reality, some of these null hypotheses are true, and some are false. The outcomes of our decision can be represented in a table (source: wikipedia)

                             Null hypothesis is true (H0)   Alternative is true (H1)   Total
  Declared significant       V                              S                          R
  Declared non-significant   U                              T                          m - R
  Total                      m_0                            m - m_0                    m
  • Above, each entry is a function of our decision rule $\delta$.

Two different error rates (of many)

  • Family Wise Error Rate (FWER) $$ \mathbb{P}(V(\delta) > 0). $$

  • A simple rule to control the FWER is the Bonferroni procedure: with 11 tests, test each at level $0.05 / 11$. The FWER of this procedure is bounded by 0.05.

  • False Discovery Rate (FDR) $$ \mathbb{E} \left(\frac{V(\delta)}{R(\delta)} 1_{\{R(\delta) > 0\}} \right). $$

  • When the test statistics are independent (or nearly so), there are simple procedures that bound the FDR at some level $q$: for example, the Benjamini-Hochberg (BH) procedure (sketched in code after this list).

  • For some other types of dependence, or for many tests, a modification of the BH procedure can still control FDR.
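
For concreteness, here is a sketch of the BH step-up rule applied to the 11 $p$-values from the full model above (R's built-in p.adjust gives the same decisions):

# sketch: Benjamini-Hochberg step-up rule at level q = 0.05
p = summary(diabetes.lm)$coef[, 4]
q = 0.05
m = length(p)
ord = order(p)
k = max(c(0, which(sort(p) <= q * (1:m) / m)))   # largest i with p_(i) <= q*i/m
rejected = rep(FALSE, m)
names(rejected) = names(p)
if (k > 0) rejected[ord[1:k]] = TRUE
print(rejected)
print(p.adjust(p, method = "BH") <= q)           # same decisions via p.adjust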

Bonferroni

  • This simple procedure is often pretty accurate.

  • This is one source of a $\sqrt{\log(p)}$ term that we heard about earlier.

  • This $\sqrt{\log(p)}$ comes from the fact that for independent $Z_i \sim N(0,1)$ $$ \lim_{p \to \infty} \frac{\max_{1 \leq i \leq p} |Z_i|}{\sqrt{2 \log p}} = 1. $$

  • To see this is the correct order, just look at the Bonferroni (union) bound $$ \mathbb{P} \left(\max_{1 \leq i \leq p} |Z_i| > t \right) \leq p \cdot \mathbb{P}(|Z_1| > t) \approx p \cdot \frac{2e^{-t^2/2}}{\sqrt{2\pi} \, t}. $$

  • It is a simple way to bound the distribution of the smallest $p$-value (or largest $t$-statistic).
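
A quick simulation sketch of this growth rate: the average of $\max_i |Z_i|$ over repeated draws tracks $\sqrt{2 \log p}$ as $p$ grows.

# sketch: max_i |Z_i| grows like sqrt(2 log p)
set.seed(2)
for (pp in c(10, 100, 1000, 10000)) {
    m = mean(replicate(500, max(abs(rnorm(pp)))))
    print(c(p = pp, mean.max = m, sqrt.2logp = sqrt(2 * log(pp))))
}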

In [9]:
%%R
p = summary(diabetes.lm)$coef[,4]
print(p <= 0.05 / 11)
(Intercept)        xage        xsex        xbmi        xmap         xtc 
       TRUE       FALSE        TRUE        TRUE        TRUE       FALSE 
       xldl        xhdl        xtch        xltg        xglu 
      FALSE       FALSE       FALSE        TRUE       FALSE 

False Discovery Rate

  • A natural estimate of FDR for a thresholding procedure $$ \delta_t(X,Y)_i = 1_{\{p_i(X,Y) \leq t\}} $$ is $$ \hat{FDR}(t) = \frac{11 \cdot t}{R(t)} 1_{\{R(t) > 0\}} = \frac{\mathbb{E}_0(V(t))}{R(t)} 1_{\{R(t) > 0\}}. $$

  • Above, $\mathbb{E}_0$ denotes the distribution under the global (intersection) null hypothesis $\beta=0$.

  • The Benjamini-Hochberg procedure can be described as thresholding at $$ t_q = t_q(X,Y) = \sup\left\{t: \hat{FDR}(t) \leq q \right\}. $$

  • Is it valid for our diabetes data? I don't know...

  • One way to check is to see if there is (strictly) positive dependence...

In [10]:
%%R

# Estimated FDR at threshold t: expected false discoveries under the
# global null (11 * t) divided by the number of rejections R(t).
fdr = function(t) {
    if (sum(p <= t) > 0) {
        v = 11 * t / sum(p <= t)
    } else {
        v = 0
    }
    return(v)
}

t = seq(0,1,length=101)
R = c()
FDR = c()
for (v in t) {
    FDR = c(FDR, fdr(v))
    R = c(R, sum(p <= v))
}
In [11]:
%%R
plot(t, FDR, type='l', ylab='FDR(t)', col='red', lwd=3)
In [12]:
%%R
C = cor(diabetes$x)
print(sum(C < 0))
image(C)
[1] 16

In [13]:
%%R
print(fdr(0.0227))
[1] 0.04994

In [14]:
%%R
print(p <= 0.0227)
(Intercept)        xage        xsex        xbmi        xmap         xtc 
       TRUE       FALSE        TRUE        TRUE        TRUE       FALSE 
       xldl        xhdl        xtch        xltg        xglu 
      FALSE       FALSE       FALSE        TRUE       FALSE 

This seems to agree with Bonferroni. Good.

Model selection

  • There are many choices of models: $2^{10}$ in this example.

  • Each model has different coefficients (with different interpretations).

  • How do we choose a model?

  • Let's call the collection of models ${\cal V}$.

Information criteria

  • For each model ${\cal M} \in {\cal V}$ let $k({\cal M})$ denote the number of parameters in the model and $L({\cal M})$ be the maximized likelihood $$ L({\cal M}) = \sup_{\theta \in \Theta_{\cal M}} L(\theta). $$

  • For a collection of linear models, we can think of each ${\cal M}$ as a subspace in $\mathbb{R}^n$. Typically this subspace will be the linear span of some subset of the columns of $X$ and the $\pmb{1}$ vector. We can also parameterize ${\cal M}$ by this subspace.

  • Akaike Information Criterion $$ AIC({\cal M}) = -2 \log L({\cal M}) + 2 \cdot k({\cal M}) $$

  • Define the selected model as $$ \hat{{\cal M}}_{AIC} = \text{argmin}_{{\cal M} \in {\cal V}} AIC({\cal M}) $$

  • (Schwarz) Bayesian Information Criterion $$ BIC({\cal M}) = -2 \log L({\cal M}) + \log(n) \cdot k({\cal M}) $$ with $\hat{{\cal M}}_{BIC}$ defined analogously.
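
As a sketch of what step() computes below: for a Gaussian linear model, extractAIC (which step uses) reports $n \log(RSS/n) + 2 k$, which differs from the AIC above only by an additive constant that cancels when comparing models.

# sketch: AIC/BIC as used by step() for the full diabetes model
full = lm(y ~ x, data = diabetes)
n = length(diabetes$y)
rss = sum(resid(full)^2)
npar = length(coef(full))                 # 11 parameters (intercept + 10 slopes)
print(n * log(rss / n) + 2 * npar)        # matches the AIC in the step() output below
print(extractAIC(full))                   # (edf, AIC)
print(extractAIC(full, k = log(n)))       # BIC version, as in step(..., k=log(n))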

In [15]:
%%R
X = diabetes$x
y = diabetes$y
map = X[,'map']
age = X[,'age']
tc = X[,'tc']
ldl = X[,'ldl']
sex = X[,'sex']
bmi = X[,'bmi']
hdl = X[,'hdl']
tch = X[,'tch']
ltg = X[,'ltg']
glu = X[,'glu']
In [16]:
%%R
diabetes.lm = lm(y ~ age + map + tc + ldl + sex + bmi + hdl + tch + ltg + glu)
AIC.model = step(diabetes.lm, direction='both')
Start:  AIC=3539.64
y ~ age + map + tc + ldl + sex + bmi + hdl + tch + ltg + glu

       Df Sum of Sq     RSS    AIC
- age   1        82 1264066 3537.7
- hdl   1       663 1264646 3537.9
- glu   1      3080 1267064 3538.7
- tch   1      3526 1267509 3538.9
<none>              1263983 3539.6
- ldl   1      5799 1269782 3539.7
- tc    1     10600 1274583 3541.3
- sex   1     45000 1308983 3553.1
- ltg   1     56015 1319998 3556.8
- map   1     72103 1336086 3562.2
- bmi   1    179028 1443011 3596.2

Step:  AIC=3537.67
y ~ map + tc + ldl + sex + bmi + hdl + tch + ltg + glu

       Df Sum of Sq     RSS    AIC
- hdl   1       646 1264712 3535.9
- glu   1      3001 1267067 3536.7
- tch   1      3543 1267608 3536.9
<none>              1264066 3537.7
- ldl   1      5751 1269817 3537.7
- tc    1     10569 1274635 3539.4
+ age   1        82 1263983 3539.6
- sex   1     45831 1309896 3551.4
- ltg   1     55963 1320029 3554.8
- map   1     73850 1337915 3560.8
- bmi   1    179079 1443144 3594.2

Step:  AIC=3535.9
y ~ map + tc + ldl + sex + bmi + tch + ltg + glu

       Df Sum of Sq     RSS    AIC
- glu   1      3093 1267805 3535.0
- tch   1      3247 1267959 3535.0
<none>              1264712 3535.9
- ldl   1      7505 1272217 3536.5
+ hdl   1       646 1264066 3537.7
+ age   1        66 1264646 3537.9
- tc    1     26840 1291552 3543.2
- sex   1     46382 1311094 3549.8
- map   1     73536 1338248 3558.9
- ltg   1     97509 1362221 3566.7
- bmi   1    178537 1443249 3592.3

Step:  AIC=3534.98
y ~ map + tc + ldl + sex + bmi + tch + ltg

       Df Sum of Sq     RSS    AIC
- tch   1      3686 1271491 3534.3
<none>              1267805 3535.0
- ldl   1      7472 1275277 3535.6
+ glu   1      3093 1264712 3535.9
+ hdl   1       738 1267067 3536.7
+ age   1         0 1267805 3537.0
- tc    1     26378 1294183 3542.1
- sex   1     44686 1312491 3548.3
- map   1     82154 1349959 3560.7
- ltg   1    102520 1370325 3567.3
- bmi   1    189970 1457775 3594.7

Step:  AIC=3534.26
y ~ map + tc + ldl + sex + bmi + ltg

       Df Sum of Sq     RSS    AIC
<none>              1271491 3534.3
+ tch   1      3686 1267805 3535.0
+ glu   1      3532 1267959 3535.0
+ hdl   1       395 1271097 3536.1
+ age   1        11 1271480 3536.3
- ldl   1     39378 1310869 3545.7
- sex   1     41858 1313349 3546.6
- tc    1     65237 1336728 3554.4
- map   1     79627 1351119 3559.1
- bmi   1    190586 1462077 3594.0
- ltg   1    294094 1565585 3624.2

In [17]:
%%R
BIC.model = step(diabetes.lm, direction='both', k=log(length(y)))
Start:  AIC=3584.65
y ~ age + map + tc + ldl + sex + bmi + hdl + tch + ltg + glu

       Df Sum of Sq     RSS    AIC
- age   1        82 1264066 3578.6
- hdl   1       663 1264646 3578.8
- glu   1      3080 1267064 3579.6
- tch   1      3526 1267509 3579.8
- ldl   1      5799 1269782 3580.6
- tc    1     10600 1274583 3582.2
<none>              1263983 3584.6
- sex   1     45000 1308983 3594.0
- ltg   1     56015 1319998 3597.7
- map   1     72103 1336086 3603.1
- bmi   1    179028 1443011 3637.1

Step:  AIC=3578.59
y ~ map + tc + ldl + sex + bmi + hdl + tch + ltg + glu

       Df Sum of Sq     RSS    AIC
- hdl   1       646 1264712 3572.7
- glu   1      3001 1267067 3573.5
- tch   1      3543 1267608 3573.7
- ldl   1      5751 1269817 3574.5
- tc    1     10569 1274635 3576.2
<none>              1264066 3578.6
+ age   1        82 1263983 3584.6
- sex   1     45831 1309896 3588.2
- ltg   1     55963 1320029 3591.6
- map   1     73850 1337915 3597.6
- bmi   1    179079 1443144 3631.1

Step:  AIC=3572.72
y ~ map + tc + ldl + sex + bmi + tch + ltg + glu

       Df Sum of Sq     RSS    AIC
- glu   1      3093 1267805 3567.7
- tch   1      3247 1267959 3567.8
- ldl   1      7505 1272217 3569.2
<none>              1264712 3572.7
- tc    1     26840 1291552 3575.9
+ hdl   1       646 1264066 3578.6
+ age   1        66 1264646 3578.8
- sex   1     46382 1311094 3582.5
- map   1     73536 1338248 3591.6
- ltg   1     97509 1362221 3599.5
- bmi   1    178537 1443249 3625.0

Step:  AIC=3567.71
y ~ map + tc + ldl + sex + bmi + tch + ltg

       Df Sum of Sq     RSS    AIC
- tch   1      3686 1271491 3562.9
- ldl   1      7472 1275277 3564.2
<none>              1267805 3567.7
- tc    1     26378 1294183 3570.7
+ glu   1      3093 1264712 3572.7
+ hdl   1       738 1267067 3573.5
+ age   1         0 1267805 3573.8
- sex   1     44686 1312491 3576.9
- map   1     82154 1349959 3589.4
- ltg   1    102520 1370325 3596.0
- bmi   1    189970 1457775 3623.3

Step:  AIC=3562.9
y ~ map + tc + ldl + sex + bmi + ltg

       Df Sum of Sq     RSS    AIC
<none>              1271491 3562.9
+ tch   1      3686 1267805 3567.7
+ glu   1      3532 1267959 3567.8
+ hdl   1       395 1271097 3568.9
+ age   1        11 1271480 3569.0
- ldl   1     39378 1310869 3570.3
- sex   1     41858 1313349 3571.1
- tc    1     65237 1336728 3578.9
- map   1     79627 1351119 3583.7
- bmi   1    190586 1462077 3618.5
- ltg   1    294094 1565585 3648.8

Our selected model

In [18]:
%%R
summary(AIC.model)

Call:
lm(formula = y ~ map + tc + ldl + sex + bmi + ltg)

Residuals:
     Min       1Q   Median       3Q      Max 
-158.277  -39.476   -2.068   37.221  148.693 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  152.133      2.572  59.159  < 2e-16 ***
map          327.220     62.693   5.219 2.79e-07 ***
tc          -757.938    160.435  -4.724 3.12e-06 ***
ldl          538.586    146.738   3.670 0.000272 ***
sex         -226.511     59.857  -3.784 0.000176 ***
bmi          529.873     65.620   8.075 6.69e-15 ***
ltg          804.192     80.173  10.031  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 54.06 on 435 degrees of freedom
Multiple R-squared: 0.5149,	Adjusted R-squared: 0.5082 
F-statistic: 76.95 on 6 and 435 DF,  p-value: < 2.2e-16 


Consistency

  • Let ${\cal M}_0$ denote the true model.

  • Classical asymptotics ($n \to \infty$ with ${\cal M}_0$ fixed) generally provide guarantees like $$ \mathbb{P}\left(\hat{{\cal M}}_{AIC} \supset {\cal M}_0 \right) \overset{n \to \infty}{\to} 1 $$ and $$ \mathbb{P}\left(\hat{{\cal M}}_{BIC} = {\cal M}_0 \right) \overset{n \to \infty}{\to} 1. $$
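
A small simulation sketch (my own illustration, not from the diabetes data) of the BIC property: generate data from a known sparse model and record how often BIC-penalized stepwise selection recovers exactly the true variable set.

# sketch: BIC-based stepwise selection recovering a true sparse model
set.seed(3)
n = 500
recovered = replicate(200, {
    Xs = matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("x", 1:5)))
    ys = 1 + 2 * Xs[, 1] - 1.5 * Xs[, 2] + rnorm(n)     # true model uses x1, x2
    dat = data.frame(ys = ys, Xs)
    fit = step(lm(ys ~ ., data = dat), k = log(n), trace = 0)
    setequal(attr(terms(fit), "term.labels"), c("x1", "x2"))
})
print(mean(recovered))      # should be close to 1 for large n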

Inference after selection

  • Notice that our model chosen by AIC does not agree with our test-based procedure above. Hmm....

  • Also, you should be careful with the $p$-values above... they were selected.

  • Let's see what happens with pure noise with just one step of model selection starting from no variables: this just picks the independent variable most correlated with the dependent variable.

  • But, remember the outcome is just noise. It is statistically independent of our features (independent variables).

In [19]:
%%R
# Simulate a pure-noise outcome: the diabetes x columns are standardized
# (mean zero, unit norm), so each entry of Z = X^T noise is N(0,1).
# Return the naive two-sided p-value for the most correlated variable.
sim_noise = function() {
    Z = t(X) %*% rnorm(nrow(X))
    return(2 * (1 - pnorm(max(abs(Z)))))
}
selectedP = c()
for (i in 1:2000) {
    selectedP = c(selectedP, sim_noise())
}
In [20]:
%%R
plot(ecdf(selectedP), lwd=3, col='red')
abline(0,1, lwd=2, lty=2)

Maximum of $N(0,1)$ random variables

  • Even just one step of model selection screws up inference.

  • The naive $p$-value attached to the highest correlation under the global null $\beta=0$ is much smaller than it would be for a single pre-specified variable, even with $p=10$.

  • If the features were orthogonal, we would be maximizing IID $N(0,1)$ random variables: $$ \max_{1 \leq i \leq p} |Z_i| = \sqrt{2 \log p} + O \left( 1 / \sqrt{2 \log p} \right). $$

High dimensional inference

  • In many applications today, $n < p$.

  • Generically, this means a perfect fit exists for our sample. Clearly this can't be correct...

  • Penalized estimators allow for consistent estimation (under some conditions) even in this scenario.

Sparsity

  • The key assumption is that many coefficients can be taken to be 0.

  • This means that if we can find a good sparse model, we can still achieve consistency.

  • If there is a good sparse model with $s$ non-zero coefficients, the best rates are of order $$ \|\hat{\beta}_{\lambda} - \beta_0\|_2 = O \left(\sqrt{\frac{s \cdot \log p}{n}}\right) $$

LASSO

The Lasso is an estimator defined by the following optimization problem (Tibshirani 1996, Chen et al. 1998): $$ \hat{\beta}_{\lambda} = \text{argmin}_{\beta \in \mathbb{R}^p} \frac{1}{2} \|y-X\beta\|^2_2 + \lambda \|\beta\|_1. $$

  • The penalty induces sparsity: large $\lambda$ means many coefficients are exactly 0 (see the soft-thresholding sketch after this list).

  • Convex optimization problem: makes it computable and fairly easy (?) to analyze.

  • "Performs model selection and estimation simultaneously."

In [21]:
%%R
L = lars(X, y, type='lasso')
plot(L)

On the LHS, $\lambda=\infty$, decreasing to 0 on the RHS.

Why does the $\ell_1$ penalty give sparse $\hat{\beta}_{\lambda}$?

Let's look at the bound form: $$ \hat{\beta}_{\text{bound},s} = \text{argmin}_{\|\beta\|_1 \leq s} \frac{1}{2} \|y-X\beta\|^2_2. $$

The usual picture (contours of the loss expanding until they touch the constraint set $\{\beta: \|\beta\|_1 \leq s\}$, typically at a corner where some coordinates are 0) holds for other convex losses besides squared error; the contours will just not generally be elliptical.

How do we choose $\lambda$?

Cross validation

  • Break data into $K$ groups.

  • For each group $g$, use the remaining $K-1$ groups to estimate parameters $\hat{\beta}_{\lambda,-g}$ (this loop is sketched in code after the list).

  • Measure goodness of fit $$ \sum_{g=1}^K \|Y[g] - X[g]\hat{\beta}_{\lambda,-g}\|^2_2. $$

  • This is an estimate of $$ E((\tilde{Y}-\tilde{X}\hat{\beta}_{\lambda})^2) $$ where $(\tilde{Y},\tilde{X})$ are drawn from the same distribution as the $(X_i,Y_i)$ pairs.

  • Typically assumes $(X_i,Y_i)$ are drawn IID from some distribution on $\mathbb{R}^p \times \mathbb{R}$.
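
Here is a sketch of that loop, written for the lars LASSO path with the fit evaluated at a fixed fraction $s$ of the $\ell_1$ norm (cv.lars, used below, does essentially this over a grid of $s$ values):

# sketch: 5-fold cross-validation of prediction error at a fixed fraction s
set.seed(5)
K = 5
s = 0.4
folds = sample(rep(1:K, length.out = nrow(X)))
cv.err = 0
for (g in 1:K) {
    train = folds != g
    Lg = lars(X[train, ], y[train], type = "lasso")
    pred = predict(Lg, X[!train, ], s = s, mode = "fraction")$fit
    cv.err = cv.err + sum((y[!train] - pred)^2)
}
print(cv.err / nrow(X))    # estimated mean squared prediction error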

In [22]:
%%R
cv.lars(X,y,type='lasso',K=5)
In [23]:
%%R
print(predict(L, type='coefficients', mode='fraction', s=0.4))
$s
[1] 0.4

$fraction
[1] 0.4

$mode
[1] "fraction"

$coefficients
       age        sex        bmi        map         tc        ldl        hdl 
   0.00000  -52.53407  509.64854  221.34215    0.00000    0.00000 -153.09694 
       tch        ltg        glu 
   0.00000  447.38028    0.00000 


In [24]:
%%R
summary(AIC.model)

Call:
lm(formula = y ~ map + tc + ldl + sex + bmi + ltg)

Residuals:
     Min       1Q   Median       3Q      Max 
-158.277  -39.476   -2.068   37.221  148.693 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  152.133      2.572  59.159  < 2e-16 ***
map          327.220     62.693   5.219 2.79e-07 ***
tc          -757.938    160.435  -4.724 3.12e-06 ***
ldl          538.586    146.738   3.670 0.000272 ***
sex         -226.511     59.857  -3.784 0.000176 ***
bmi          529.873     65.620   8.075 6.69e-15 ***
ltg          804.192     80.173  10.031  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 54.06 on 435 degrees of freedom
Multiple R-squared: 0.5149,	Adjusted R-squared: 0.5082 
F-statistic: 76.95 on 6 and 435 DF,  p-value: < 2.2e-16 


In [25]:
%%R
library(covTest)
nsim = 2000

covtestP = c()
exactP = c()
for (i in 1:nsim) {
    Z = rnorm(nrow(X))
    L0 = lars(X, Z, type='lasso', max.steps=3)
    exactP = c(exactP, (1 - pnorm(L0$lambda[1])) / (1 - pnorm(L0$lambda[2])))
    covtestP = c(covtestP, covTest(L0,X,Z)$results[1,3])
}
Loading required package: glmnet
Loading required package: Matrix
Loading required package: lattice
Loaded glmnet 1.9-3

Loading required package: glmpath
Loading required package: survival
Loading required package: splines
Loading required package: MASS
Warning message:
package ‘covTest’ was built under R version 2.15.3 

A selection $p$-value

  • The covariance test.

  • A generalization of the first null spacing of the LASSO knots has a limiting $\text{Exp}(1)$ distribution (under some conditions).

  • Later spacings are stochastically smaller than $\text{Exp}(1)$.

In [26]:
%%R
plot(ecdf(covtestP), lwd=3, col='red')
abline(0,1, lwd=2, lty=2)

An exact $p$-value

  • Here is an exact test that takes selection into account: $$ p\text{-value} = \frac{1 - \Phi(\lambda_1/\sigma)}{1 - \Phi(\lambda_2/\sigma)} $$ with no assumptions on $X$.
In [27]:
%%R
plot(ecdf(exactP), lwd=3, col='red')
abline(0,1, lwd=2, lty=2)

Moral of the story

  • Maximizing correlation as in model selection quickly leads you away from the standard theory.

  • Equivalently, classical pivots are no longer useful after selection.

  • But, in some cases, it is possible to describe the effect of selection.

  • Other approaches include data splitting and post-selection inference.

And now for something completely different...

  • Where does this exact $p$-value come from?

  • It is a test based on the maximum of a set of Gaussian variables.

  • We could say that this is a discrete Gaussian process.

  • In fact, a discrete version of Kac-Rice yields this test...
