## Linear regression

• The bread and butter of applied statistics.

• Dependent (random) variable $Y \in \mathbb{R}$, independent (random) variable $X \in \mathbb{R}^p$.

• Goal: to produce a regression function $$\mu(X) = E(Y|X).$$

• Linear regression model just says $$\mu(X) = X\beta, \qquad \beta \in \mathbb{R}^p.$$

### Diabetes data

• In this study, the outcome of interest (dependent variable) is a measure of the progression of diabetes one year after diagnosis.

• The independent variables included bmi, age, sex, blood pressure, and six blood measurements.

In [1]:
%load_ext rmagic

In [2]:
%%R
library(lars)
data(diabetes)
plot(diabetes$x[,"bmi"], diabetes$y, pch=23, bg='red')

Loaded lars 1.1



## Least squares regression

• A parametric model based on observing an IID sample $(Y_i, X_i)_{1 \leq i \leq n}$.

• Model: $$Y_{n \times 1} | X_{n \times p} \sim N\left(X_{n \times p}\beta_{p \times 1}, \sigma^2 I_{n \times n} \right)$$

• Least squares estimator $$\hat{\beta} = \text{argmin}_{\beta \in \mathbb{R}^p} \frac{1}{2} \|Y-X\beta\|^2_2.$$

• Under our model, if $n > p$ and $\text{rank}(X)=p$ $$\hat{\beta} = (X^TX)^{-1}X^TY \sim N\left(\beta, \sigma^2 (X^TX)^{-1} \right).$$
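
As a quick sanity check (a sketch added here, not a cell from the original notebook), we can compare the closed-form estimator with R's lm on the diabetes data:

X1 = cbind(1, diabetes$x)                               # design matrix with an intercept column
beta.hat = solve(t(X1) %*% X1, t(X1) %*% diabetes$y)    # (X^T X)^{-1} X^T Y
max(abs(beta.hat - coef(lm(y ~ x, data=diabetes))))     # essentially 0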

## Gaussian facts

• We write $\mathbb{R}^q \ni Z \sim N(\mu, \Sigma)$ for the multivariate Gaussian distribution.

• This means that for any $a \in \mathbb{R}^q$ we have $$a^TZ \sim N(a^T\mu, a^T\Sigma a).$$

• For $A_{p \times q}$ fixed, we then have $$AZ \sim N(A\mu, A\Sigma A^T).$$

• Distribution of $\hat{\beta}$ follows directly with $A=(X^TX)^{-1}X^T, Z \sim N(X\beta, \sigma^2 I)$.
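
A minimal simulation sketch of the linear-transformation fact (the matrices here are arbitrary illustrations):

A = matrix(c(1, 2, 0, 1), 2, 2)
Sigma = diag(c(1, 4))
Z = matrix(rnorm(2 * 10000), ncol=2) %*% chol(Sigma)  # rows are draws from N(0, Sigma)
round(cov(Z %*% t(A)) - A %*% Sigma %*% t(A), 1)      # approximately the zero matrix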

## Example

In [3]:
%%R
library(lars)
data(diabetes)
plot(diabetes$x[,"bmi"], diabetes$y, pch=23, bg='red')

In [4]:
%%R
print(summary(lm(diabetes$y ~ diabetes$x[,'bmi'])))


Call:
lm(formula = diabetes$y ~ diabetes$x[, "bmi"])

Residuals:
Min       1Q   Median       3Q      Max
-164.920  -43.572   -8.649   46.344  154.878

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)          152.133      2.974   51.16   <2e-16 ***
diabetes$x[, "bmi"] 949.435 62.515 15.19 <2e-16 *** --- Signif. codes: 0 â€˜***â€™ 0.001 â€˜**â€™ 0.01 â€˜*â€™ 0.05 â€˜.â€™ 0.1 â€˜ â€™ 1 Residual standard error: 62.52 on 440 degrees of freedom Multiple R-squared: 0.3439, Adjusted R-squared: 0.3424 F-statistic: 230.7 on 1 and 440 DF, p-value: < 2.2e-16  ## Pivots: a basic tool for inference • Over and over again in statistics, we talk about$t$-statistics $$T=T(\theta_0) \overset{H_0:\theta=\theta_0}{=} \frac{\hat{\theta}-\theta_0}{SE(\hat{\theta})}.$$ • Often,$\theta_0=0$, as in the above table. • But, whatever$\theta_0$the above quantity has the same distribution: Student's distribution with 440 degrees of freedom. We call this a pivot. • Pivots generally lead to confidence intervals $$\hat{\theta} \pm t_{440,0.975} SE(\hat{\theta})$$ with the property $$\mathbb{P}\left(\theta_0 \in \hat{\theta} \pm t_{440,0.975} SE(\hat{\theta}) \right)=0.95.$$ • By the above rule, the confidence interval for the slope is$949 \pm 125$. • Correct interpretation: "if we redrew a sample of the same size from this population, 95% of the time the intervals would cover the true parameter (assuming the model is correct)." • Incorrect interpretation: "we are 95% certain that the slope is in the interval$949 \pm 125$." • Pivots also leads to the testing rules $$\text{Reject H_0:\theta=\theta_0 if |T(\theta_0)| > t_{440,0.975} \approx 2.}$$ • The$t$-statistics in the table correspond to$T(0)$-- tests that a parameter is 0. In [5]: %%R qt(0.975,440) pt(qt(0.975,440),440)  [1] 0.975  In [6]: %%R print(summary(diabetes$x))

      age                 sex                bmi                 map
Min.   :-0.107226   Min.   :-0.04464   Min.   :-0.090275   Min.   :-0.112400
1st Qu.:-0.037299   1st Qu.:-0.04464   1st Qu.:-0.034229   1st Qu.:-0.036656
Median : 0.005383   Median :-0.04464   Median :-0.007284   Median :-0.005671
Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.000000
3rd Qu.: 0.038076   3rd Qu.: 0.05068   3rd Qu.: 0.031248   3rd Qu.: 0.035644
Max.   : 0.110727   Max.   : 0.05068   Max.   : 0.170555   Max.   : 0.132044
tc                 ldl                 hdl
Min.   :-0.126781   Min.   :-0.115613   Min.   :-0.102307
1st Qu.:-0.034248   1st Qu.:-0.030358   1st Qu.:-0.035117
Median :-0.004321   Median :-0.003819   Median :-0.006584
Mean   : 0.000000   Mean   : 0.000000   Mean   : 0.000000
3rd Qu.: 0.028358   3rd Qu.: 0.029844   3rd Qu.: 0.029312
Max.   : 0.153914   Max.   : 0.198788   Max.   : 0.181179
tch                 ltg                 glu
Min.   :-0.076395   Min.   :-0.126097   Min.   :-0.137767
1st Qu.:-0.039493   1st Qu.:-0.033249   1st Qu.:-0.033179
Median :-0.002592   Median :-0.001948   Median :-0.001078
Mean   : 0.000000   Mean   : 0.000000   Mean   : 0.000000
3rd Qu.: 0.034309   3rd Qu.: 0.032433   3rd Qu.: 0.027917
Max.   : 0.185234   Max.   : 0.133599   Max.   : 0.135612


In [7]:
%%R
diabetes.lm = lm(y ~ x, data=diabetes)
print(summary(diabetes.lm))


Call:
lm(formula = y ~ x, data = diabetes)

Residuals:
Min       1Q   Median       3Q      Max
-155.829  -38.534   -0.227   37.806  151.355

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  152.133      2.576  59.061  < 2e-16 ***
xage         -10.012     59.749  -0.168 0.867000
xsex        -239.819     61.222  -3.917 0.000104 ***
xbmi         519.840     66.534   7.813 4.30e-14 ***
xmap         324.390     65.422   4.958 1.02e-06 ***
xtc         -792.184    416.684  -1.901 0.057947 .
xldl         476.746    339.035   1.406 0.160389
xhdl         101.045    212.533   0.475 0.634721
xtch         177.064    161.476   1.097 0.273456
xltg         751.279    171.902   4.370 1.56e-05 ***
xglu          67.625     65.984   1.025 0.305998
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 54.15 on 431 degrees of freedom
Multiple R-squared: 0.5177,	Adjusted R-squared: 0.5066
F-statistic: 46.27 on 10 and 431 DF,  p-value: < 2.2e-16


• Notice that the coefficient of bmi changed when we added the other variables: with the other variables included, the effect of bmi now adjusts (allows) for all of them.

• Notice that there is more than one linear regression model for each pair $Y \in \mathbb{R}^n, X \in \mathbb{R}^{n \times p}$. This raises the issue of model selection.

• The confidence intervals above are intervals for parameters like $\beta_{{\tt bmi}}$. Each interval has 95% coverage. What can we say about the 11 intervals?

• Alternatively, in the summary above, there are 11 $p$-values. What can we say about these 11 random variables? This raises the issue of simultaneous inference.

In [8]:
%%R
confint(diabetes.lm)

                  2.5 %     97.5 %
(Intercept)   147.07069  157.19628
xage         -127.44823  107.42383
xsex         -360.15053 -119.48765
xbmi          389.06918  650.61039
xmap          195.80469  452.97616
xtc         -1611.16947   26.80114
xldl         -189.62083 1143.11251
xhdl         -316.68469  518.77383
xtch         -140.31347  494.44183
xltg          413.40886 1089.14978
xglu          -62.06548  197.31626
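
As a sanity check (a sketch, not part of the original notebook), these intervals are just $\hat{\beta}_j \pm t_{431, 0.975} \cdot SE(\hat{\beta}_j)$, which we can reproduce by hand:

est = coef(summary(diabetes.lm))[, "Estimate"]
se = coef(summary(diabetes.lm))[, "Std. Error"]
cbind(est - qt(0.975, 431) * se, est + qt(0.975, 431) * se)  # matches confint(diabetes.lm)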



## Simultaneous inference

• We performed 11 hypothesis tests above. Let's call the null hypotheses $$H_{0,{\tt v}}: \beta_{{\tt v}}=0$$ where v is one of our independent (random) variables. The alternatives are $H_{a, {\tt v}}$.
• If we make a decision about each hypothesis, this can be represented as a map $$\mathbb{R}^n \times \mathbb{R}^{n \times p} \overset{\delta}{\rightarrow} \{0,1\}^p.$$

• In reality, some of these null hypotheses are true, and some are false. The outcomes of our decision can be represented in a table (source: wikipedia)

|                          | Null hypothesis is True (H0) | Alternative hypothesis is True (H1) | Total   |
|--------------------------|------------------------------|-------------------------------------|---------|
| Declared significant     | $V$                          | $S$                                 | $R$     |
| Declared non-significant | $U$                          | $T$                                 | $m - R$ |
| Total                    | $m_0$                        | $m - m_0$                           | $m$     |
• Above, each entry is a function of our decision rule $\delta$.

## Two different error rates (of many)

• Family Wise Error Rate (FWER) $$\mathbb{P}(V(\delta) > 0).$$

• A simple rule to control FWER is the Bonferroni procedure: with 11 tests, test each at level $0.05 / 11$. The FWER of this procedure is bounded by 0.05.

• False Discovery Rate (FDR) $$\mathbb{E} \left(\frac{V(\delta)}{R(\delta)} 1_{\{R(\delta) > 0\}} \right).$$

• When the test statistics are independent (or nearly so), there are simple procedures that bound the FDR at a given level $q$: for example, the Benjamini-Hochberg (BH) procedure (a sketch using R's p.adjust follows this list).

• For some other types of dependence, or for many tests, a modification of the BH procedure can still control FDR.
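
Here is a sketch of the BH procedure on our 11 $p$-values using base R's p.adjust; comparing BH-adjusted $p$-values to $q$ is equivalent to the BH thresholding rule:

p = summary(diabetes.lm)$coef[, 4]          # the 11 p-values from the full model
print(p.adjust(p, method = "BH") <= 0.05)   # BH discoveries at level q = 0.05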

## Bonferroni

• This simple procedure is often pretty accurate.

• This is one source of a $\sqrt{\log(p)}$ term that we heard about earlier.

• This $\sqrt{\log(p)}$ comes from the fact that for independent $Z_i \sim N(0,1)$ $$\lim_{p \to \infty} \frac{\max_{1 \leq i \leq p} |Z_i|}{\sqrt{2 \log p}} = 1.$$

• To see this is the correct order, just look at the Bonferroni (union) bound $$\mathbb{P} \left(\max_{1 \leq i \leq p} |Z_i| > t \right) \leq p \cdot \mathbb{P}(|Z_1| > t) \approx p \cdot \frac{e^{-t^2/2}}{\sqrt{2\pi} \cdot t}$$ (a quick numerical check follows this list).

• It is a simple way to bound the distribution of the smallest $p$-value (or largest $t$-statistic).
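
A quick numerical check of the union bound (a sketch; the choices $p = 100$, $t = 3$ are arbitrary):

p = 100; t = 3
mean(replicate(20000, max(abs(rnorm(p))) > t))  # simulated P(max |Z_i| > t), about 0.24
p * 2 * (1 - pnorm(t))                          # Bonferroni bound, about 0.27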

In [9]:
%%R
p = summary(diabetes.lm)$coef[,4]
print(p <= 0.05 / 11)

(Intercept)        xage        xsex        xbmi        xmap         xtc
       TRUE       FALSE        TRUE        TRUE        TRUE       FALSE
       xldl        xhdl        xtch        xltg        xglu
      FALSE       FALSE       FALSE        TRUE       FALSE

## False Discovery Rate

• A natural estimate of FDR for a thresholding procedure $$\delta_t(X,Y)_i = 1_{\{p_i(X,Y) \leq t\}}$$ is $$\widehat{FDR}(t) = \frac{11 \cdot t}{R(t)} 1_{\{R(t) > 0\}} = \frac{\mathbb{E}_0(V(t))}{R(t)} 1_{\{R(t) > 0\}}.$$

• Above, $\mathbb{E}_0$ denotes expectation under the global (intersection) null hypothesis $\beta=0$.

• The Benjamini-Hochberg procedure can be described as thresholding at $$t_q = t_q(X,Y) = \sup\left\{t: \widehat{FDR}(t) \leq q \right\}.$$

• Is it valid for our diabetes data? I don't know...

• One way to check is to see if there is (strictly) positive dependence...

In [10]:
%%R
fdr = function(t) {
    if (sum(p <= t) > 0) {
        v = 11 * t / sum(p <= t)
    } else {
        v = 0
    }
    return(v)
}
t = seq(0, 1, length=101)
R = c()
FDR = c()
for (v in t) {
    FDR = c(FDR, fdr(v))
    R = c(R, sum(p <= v))
}

In [11]:
%%R
plot(t, FDR, type='l', ylab='FDR(t)', col='red', lwd=3)

In [12]:
%%R
C = cor(diabetes$x)
print(sum(C < 0))
image(C)

[1] 16


In [13]:
%%R
print(fdr(0.0227))

[1] 0.04994


In [14]:
%%R
print(p <= 0.0227)

(Intercept)        xage        xsex        xbmi        xmap         xtc
TRUE       FALSE        TRUE        TRUE        TRUE       FALSE
xldl        xhdl        xtch        xltg        xglu
FALSE       FALSE       FALSE        TRUE       FALSE



This seems to agree with Bonferroni. Good.

## Model selection

• There are many choices of models: $2^{10}$ in this example.

• Each model has different coefficients (with different interpretations).

• How do we choose a model?

• Let's call the collection of models ${\cal V}$.

## Information criteria

• For each model ${\cal M} \in {\cal V}$ let $k({\cal M})$ denote the number of parameters in the model and $L({\cal M})$ be the maximized likelihood $$L({\cal M}) = \sup_{\theta \in \Theta_{\cal M}} L(\theta).$$

• For a collection of linear models, we can think of each ${\cal M}$ as a subspace in $\mathbb{R}^n$. Typically this subspace will be the linear span of some subset of columns of $X$ and the $\pmb{1}$ vector. We can also parameterize ${\cal M}$ by this subspace.

• Akaike Information Criterion $$AIC({\cal M}) = -2 \log L({\cal M}) + 2 \cdot k({\cal M})$$

• Define the selected model as $$\hat{{\cal M}}_{AIC} = \text{argmin}_{{\cal M} \in {\cal V}} AIC({\cal M})$$

• (Schwarz) Bayesian Information Criterion $$BIC({\cal M}) = -2 \log L({\cal M}) + \log(n) \cdot k({\cal M})$$ with $\hat{{\cal M}}_{BIC}$ defined analogously.
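
A sketch comparing R's built-in AIC and BIC with the definitions above, for the full diabetes model (note that step() below reports an RSS-based variant of AIC that differs from this by an additive constant):

fit = lm(y ~ x, data=diabetes)
k = length(coef(fit)) + 1                                                # 11 coefficients plus sigma^2
c(AIC(fit), -2 * as.numeric(logLik(fit)) + 2 * k)                        # identical values
c(BIC(fit), -2 * as.numeric(logLik(fit)) + log(nrow(diabetes$x)) * k)    # identical values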

In [15]:
%%R
X = diabetes$x
y = diabetes$y
map = X[,'map']
age = X[,'age']
tc = X[,'tc']
ldl = X[,'ldl']
sex = X[,'sex']
bmi = X[,'bmi']
hdl = X[,'hdl']
tch = X[,'tch']
ltg = X[,'ltg']
glu = X[,'glu']

In [16]:
%%R
diabetes.lm = lm(y ~ age + map + tc + ldl + sex + bmi + hdl + tch + ltg + glu)
AIC.model = step(diabetes.lm, direction='both')

Start:  AIC=3539.64
y ~ age + map + tc + ldl + sex + bmi + hdl + tch + ltg + glu

Df Sum of Sq     RSS    AIC
- age   1        82 1264066 3537.7
- hdl   1       663 1264646 3537.9
- glu   1      3080 1267064 3538.7
- tch   1      3526 1267509 3538.9
<none>              1263983 3539.6
- ldl   1      5799 1269782 3539.7
- tc    1     10600 1274583 3541.3
- sex   1     45000 1308983 3553.1
- ltg   1     56015 1319998 3556.8
- map   1     72103 1336086 3562.2
- bmi   1    179028 1443011 3596.2

Step:  AIC=3537.67
y ~ map + tc + ldl + sex + bmi + hdl + tch + ltg + glu

Df Sum of Sq     RSS    AIC
- hdl   1       646 1264712 3535.9
- glu   1      3001 1267067 3536.7
- tch   1      3543 1267608 3536.9
<none>              1264066 3537.7
- ldl   1      5751 1269817 3537.7
- tc    1     10569 1274635 3539.4
+ age   1        82 1263983 3539.6
- sex   1     45831 1309896 3551.4
- ltg   1     55963 1320029 3554.8
- map   1     73850 1337915 3560.8
- bmi   1    179079 1443144 3594.2

Step:  AIC=3535.9
y ~ map + tc + ldl + sex + bmi + tch + ltg + glu

Df Sum of Sq     RSS    AIC
- glu   1      3093 1267805 3535.0
- tch   1      3247 1267959 3535.0
<none>              1264712 3535.9
- ldl   1      7505 1272217 3536.5
+ hdl   1       646 1264066 3537.7
+ age   1        66 1264646 3537.9
- tc    1     26840 1291552 3543.2
- sex   1     46382 1311094 3549.8
- map   1     73536 1338248 3558.9
- ltg   1     97509 1362221 3566.7
- bmi   1    178537 1443249 3592.3

Step:  AIC=3534.98
y ~ map + tc + ldl + sex + bmi + tch + ltg

Df Sum of Sq     RSS    AIC
- tch   1      3686 1271491 3534.3
<none>              1267805 3535.0
- ldl   1      7472 1275277 3535.6
+ glu   1      3093 1264712 3535.9
+ hdl   1       738 1267067 3536.7
+ age   1         0 1267805 3537.0
- tc    1     26378 1294183 3542.1
- sex   1     44686 1312491 3548.3
- map   1     82154 1349959 3560.7
- ltg   1    102520 1370325 3567.3
- bmi   1    189970 1457775 3594.7

Step:  AIC=3534.26
y ~ map + tc + ldl + sex + bmi + ltg

Df Sum of Sq     RSS    AIC
<none>              1271491 3534.3
+ tch   1      3686 1267805 3535.0
+ glu   1      3532 1267959 3535.0
+ hdl   1       395 1271097 3536.1
+ age   1        11 1271480 3536.3
- ldl   1     39378 1310869 3545.7
- sex   1     41858 1313349 3546.6
- tc    1     65237 1336728 3554.4
- map   1     79627 1351119 3559.1
- bmi   1    190586 1462077 3594.0
- ltg   1    294094 1565585 3624.2


In [17]:
%%R
BIC.model = step(diabetes.lm, direction='both', k=log(length(y)))

Start:  AIC=3584.65
y ~ age + map + tc + ldl + sex + bmi + hdl + tch + ltg + glu

Df Sum of Sq     RSS    AIC
- age   1        82 1264066 3578.6
- hdl   1       663 1264646 3578.8
- glu   1      3080 1267064 3579.6
- tch   1      3526 1267509 3579.8
- ldl   1      5799 1269782 3580.6
- tc    1     10600 1274583 3582.2
<none>              1263983 3584.6
- sex   1     45000 1308983 3594.0
- ltg   1     56015 1319998 3597.7
- map   1     72103 1336086 3603.1
- bmi   1    179028 1443011 3637.1

Step:  AIC=3578.59
y ~ map + tc + ldl + sex + bmi + hdl + tch + ltg + glu

Df Sum of Sq     RSS    AIC
- hdl   1       646 1264712 3572.7
- glu   1      3001 1267067 3573.5
- tch   1      3543 1267608 3573.7
- ldl   1      5751 1269817 3574.5
- tc    1     10569 1274635 3576.2
<none>              1264066 3578.6
+ age   1        82 1263983 3584.6
- sex   1     45831 1309896 3588.2
- ltg   1     55963 1320029 3591.6
- map   1     73850 1337915 3597.6
- bmi   1    179079 1443144 3631.1

Step:  AIC=3572.72
y ~ map + tc + ldl + sex + bmi + tch + ltg + glu

Df Sum of Sq     RSS    AIC
- glu   1      3093 1267805 3567.7
- tch   1      3247 1267959 3567.8
- ldl   1      7505 1272217 3569.2
<none>              1264712 3572.7
- tc    1     26840 1291552 3575.9
+ hdl   1       646 1264066 3578.6
+ age   1        66 1264646 3578.8
- sex   1     46382 1311094 3582.5
- map   1     73536 1338248 3591.6
- ltg   1     97509 1362221 3599.5
- bmi   1    178537 1443249 3625.0

Step:  AIC=3567.71
y ~ map + tc + ldl + sex + bmi + tch + ltg

Df Sum of Sq     RSS    AIC
- tch   1      3686 1271491 3562.9
- ldl   1      7472 1275277 3564.2
<none>              1267805 3567.7
- tc    1     26378 1294183 3570.7
+ glu   1      3093 1264712 3572.7
+ hdl   1       738 1267067 3573.5
+ age   1         0 1267805 3573.8
- sex   1     44686 1312491 3576.9
- map   1     82154 1349959 3589.4
- ltg   1    102520 1370325 3596.0
- bmi   1    189970 1457775 3623.3

Step:  AIC=3562.9
y ~ map + tc + ldl + sex + bmi + ltg

Df Sum of Sq     RSS    AIC
<none>              1271491 3562.9
+ tch   1      3686 1267805 3567.7
+ glu   1      3532 1267959 3567.8
+ hdl   1       395 1271097 3568.9
+ age   1        11 1271480 3569.0
- ldl   1     39378 1310869 3570.3
- sex   1     41858 1313349 3571.1
- tc    1     65237 1336728 3578.9
- map   1     79627 1351119 3583.7
- bmi   1    190586 1462077 3618.5
- ltg   1    294094 1565585 3648.8



## Our selected model

In [18]:
%%R
summary(AIC.model)


Call:
lm(formula = y ~ map + tc + ldl + sex + bmi + ltg)

Residuals:
Min       1Q   Median       3Q      Max
-158.277  -39.476   -2.068   37.221  148.693

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  152.133      2.572  59.159  < 2e-16 ***
map          327.220     62.693   5.219 2.79e-07 ***
tc          -757.938    160.435  -4.724 3.12e-06 ***
ldl          538.586    146.738   3.670 0.000272 ***
sex         -226.511     59.857  -3.784 0.000176 ***
bmi          529.873     65.620   8.075 6.69e-15 ***
ltg          804.192     80.173  10.031  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 54.06 on 435 degrees of freedom
Multiple R-squared: 0.5149,	Adjusted R-squared: 0.5082
F-statistic: 76.95 on 6 and 435 DF,  p-value: < 2.2e-16



### Consistency

• Let ${\cal M}_0$ denote the true model.

• Classical asymptotics ($n \to \infty$ with ${\cal M}_0$ fixed) generally provide guarantees like $$\mathbb{P}\left(\hat{{\cal M}}_{AIC} \supset {\cal M}_0 \right) \overset{n \to \infty}{\to} 1$$ and $$\mathbb{P}\left(\hat{{\cal M}}_{BIC} = {\cal M}_0 \right) \overset{n \to \infty}{\to} 1.$$

## Inference after selection

• Notice that our model chosen by AIC does not agree with our test-based procedure above. Hmm....

• Also, you should be careful with the $p$-values above... they were selected.

• Let's see what happens with pure noise after just one step of model selection starting from no variables: this step simply picks the independent variable most correlated with the dependent variable.

• But, remember the outcome is just noise. It is statistically independent of our features (independent variables).

In [19]:
%%R
sim_noise = function() {
    # the columns of X have unit L2 norm, so the entries of Z are N(0,1);
    # return the naive two-sided p-value attached to the most correlated variable
    Z = t(X) %*% rnorm(nrow(X))
    return(2 * (1 - pnorm(max(abs(Z)))))
}
selectedP = c()
for (i in 1:2000) {
    selectedP = c(selectedP, sim_noise())
}

In [20]:
%%R
plot(ecdf(selectedP), lwd=3, col='red')
abline(0,1, lwd=2, lty=2)


## Maximum of $N(0,1)$ random variables

• Even just one step of model selection screws up inference.

• The naive $p$-value attached to the highest correlation under $\beta=0$ is much smaller than that of a fixed, pre-specified correlation, even for $p=10$.

• If the features were orthogonal, we would be maximizing IID $N(0,1)$ random variables: $$\max_{1 \leq i \leq p} |Z_i| = \sqrt{2 \log p} + O \left( 1 / \sqrt{2 \log p} \right).$$
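
A small simulation sketch of this maximum (with $p=10$, matching our example):

sim = replicate(10000, max(abs(rnorm(10))))
c(mean(sim), sqrt(2 * log(10)))  # the maximum is already of order sqrt(2 log p)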

## High dimensional inference

• In many applications today, $n < p$.

• Generically, this means a perfect fit to our sample exists (see the sketch after this list). Clearly this can't be the correct model...

• Penalized estimators allow for consistent estimation (under some conditions) even in this scenario.
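
A tiny sketch of the $n < p$ perfect-fit phenomenon (the dimensions here are illustrative):

n = 5; p = 10
Xn = matrix(rnorm(n * p), n, p)
yn = rnorm(n)
sum(resid(lm(yn ~ Xn))^2)  # essentially zero: pure noise is fit perfectly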

### Sparsity

• The key assumption is that many coefficients can be taken to be 0.

• This means that if we can find a good sparse model, we can still achieve consistency.

• If there is a good sparse model with $s$ non-zero coefficients, the best rates are of order $$\|\hat{\beta}_{\lambda} - \beta_0\|_2 = O \left(\sqrt{\frac{s \cdot \log p}{n}}\right)$$

## LASSO

The Lasso is an estimator defined by the following optimization problem (Tibshirani 1996, Chen et al. 1998): $$\hat{\beta}_{\lambda} = \text{argmin}_{\beta \in \mathbb{R}^p} \frac{1}{2} \|y-X\beta\|^2_2 + \lambda \|\beta\|_1.$$

• The penalty induces sparsity: the larger $\lambda$, the more coefficients are exactly 0 (see the soft-thresholding sketch after this list).

• Convex optimization problem: makes it computable and fairly easy (?) to analyze.

• "Performs model selection and estimation simultaneously."

In [21]:
%%R
L = lars(X, y, type='lasso')
plot(L)


On the LHS, $\lambda=\infty$, decreasing to 0 on the RHS.

## Why does $\ell_1$-penalty give sparse $\hat{\beta}_{\lambda}$?

Let's look at the bound form: $$\hat{\beta}_{\text{bound},s} = \text{argmin}_{\|\beta\|_1 \leq s} \frac{1}{2} \|y-X\beta\|^2_2.$$

This picture holds for other convex losses besides squared error, though the contours will not generally be elliptical. The key geometric point is that the constraint set $\{\beta: \|\beta\|_1 \leq s\}$ has corners on the coordinate axes, so the loss contours typically first touch it at a corner, where some coordinates of $\hat{\beta}_{\text{bound},s}$ are exactly 0.

## How do we choose $\lambda$?

### Cross validation

• Break data into $K$ groups.

• For each group $g$, use remaining $K-1$ groups to estimate parameters $\hat{\beta}_{\lambda,-g}$.

• Measure goodness of fit $$\sum_{g=1}^K \|Y[g] - X[g]\hat{\beta}_{\lambda,-g}\|^2_2.$$

• This is an estimate of $$E((\tilde{Y}-\tilde{X}\hat{\beta}_{\lambda})^2)$$ where $(\tilde{Y},\tilde{X})$ are drawn from the same distribution as the $(X_i,Y_i)$ pairs.

• Typically assumes $(X_i,Y_i)$ are drawn IID from some distribution on $\mathbb{R}^p \times \mathbb{R}$.

In [22]:
%%R
cv.lars(X,y,type='lasso',K=5)
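
To pull out the CV-minimizing fraction programmatically (a sketch assuming the $index and $cv components returned by cv.lars):

cv = cv.lars(X, y, type='lasso', K=5, plot.it=FALSE)
cv$index[which.min(cv$cv)]  # fraction with the smallest estimated prediction error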

In [23]:
%%R
print(predict(L, type='coefficients', mode='fraction', s=0.4))

$s
[1] 0.4

$fraction
[1] 0.4

$mode
[1] "fraction"

$coefficients
age        sex        bmi        map         tc        ldl        hdl
0.00000  -52.53407  509.64854  221.34215    0.00000    0.00000 -153.09694
tch        ltg        glu
0.00000  447.38028    0.00000


In [24]:
%%R
summary(AIC.model)


Call:
lm(formula = y ~ map + tc + ldl + sex + bmi + ltg)

Residuals:
Min       1Q   Median       3Q      Max
-158.277  -39.476   -2.068   37.221  148.693

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  152.133      2.572  59.159  < 2e-16 ***
map          327.220     62.693   5.219 2.79e-07 ***
tc          -757.938    160.435  -4.724 3.12e-06 ***
ldl          538.586    146.738   3.670 0.000272 ***
sex         -226.511     59.857  -3.784 0.000176 ***
bmi          529.873     65.620   8.075 6.69e-15 ***
ltg          804.192     80.173  10.031  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 54.06 on 435 degrees of freedom
Multiple R-squared: 0.5149,	Adjusted R-squared: 0.5082
F-statistic: 76.95 on 6 and 435 DF,  p-value: < 2.2e-16


In [25]:
%%R
library(covTest)
nsim = 2000

covtestP = c()
exactP = c()
for (i in 1:nsim) {
    Z = rnorm(nrow(X))
    L0 = lars(X, Z, type='lasso', max.steps=3)
    exactP = c(exactP, (1 - pnorm(L0$lambda[1])) / (1 - pnorm(L0$lambda[2])))
    covtestP = c(covtestP, covTest(L0, X, Z)$results[1,3])
}

Loading required package: glmnet
Loading required package: Matrix
Loading required package: lattice
Loaded glmnet 1.9-3
Loading required package: glmpath
Loading required package: survival
Loading required package: splines
Loading required package: MASS
Warning message:
package 'covTest' was built under R version 2.15.3

## A selection $p$-value

• The covariance test.

• A generalization of the first null spacing of the LASSO knots has a limiting $\text{Exp}(1)$ distribution (under some conditions).

• Later spacings are stochastically smaller than $\text{Exp}(1)$.

In [26]:
%%R
plot(ecdf(covtestP), lwd=3, col='red')
abline(0,1, lwd=2, lty=2)

### An exact $p$-value

• Here is an exact test that takes selection into account: $$p\text{-value} = \frac{1 - \Phi(\lambda_1/\sigma)}{1 - \Phi(\lambda_2/\sigma)}$$ with no assumptions on $X$.

In [27]:
%%R
plot(ecdf(exactP), lwd=3, col='red')
abline(0,1, lwd=2, lty=2)

### Moral of the story

• Maximizing correlation as in model selection quickly leads you away from the standard theory.

• Equivalently, classical pivots are no longer useful after selection.

• But, in some cases, it is possible to describe the effect of selection.

• Other approaches include data-splitting and post-selection inference.

## And now for something completely different...

• Where does this exact $p$-value come from?

• It is a test based on the maximum of a set of Gaussian variables.

• We could say that this is a discrete Gaussian process.

• In fact, a discrete version of Kac-Rice yields this test...
