
Multiple linear regression

Outline

  • Specifying the model.

  • Fitting the model: least squares.

  • Interpretation of the coefficients.

  • More on $F$-statistics.

  • Matrix approach to linear regression.

  • $T$-statistics revisited.

  • More $F$ statistics.

  • Tests involving more than one $\beta$.

Prostate data

More information on the Gleason score is available elsewhere.

In [1]:
from ipy_table import *
Description = ([
        ['Variable','Description'],
        ['lcavol', '(log) Cancer Volume'],
        ['lweight', '(log) Weight'],
        ['age', 'Patient age'],
        ['lbph', '(log) Benign Prostatic Hyperplasia'],
        ['svi', 'Seminal Vesicle Invasion'],
        ['lcp', '(log) Capsular Penetration'],
        ['gleason', 'Gleason score'],
        ['pgg45', 'Percent of Gleason score 4 or 5'],
        ['lpsa', '(log) Prostate Specific Antigen'],
        ['train', 'Label for test / training split']])
In [2]:
make_table(Description)
apply_theme('basic')
Out[2]:
Variable   Description
lcavol     (log) Cancer Volume
lweight    (log) Weight
age        Patient age
lbph       (log) Benign Prostatic Hyperplasia
svi        Seminal Vesicle Invasion
lcp        (log) Capsular Penetration
gleason    Gleason score
pgg45      Percent of Gleason score 4 or 5
lpsa       (log) Prostate Specific Antigen
train      Label for test / training split
In [3]:
%%R -h 800 -w 800
library(ElemStatLearn)
data(prostate)
pairs(prostate, pch=23, bg='orange',                                          
      cex.labels=1.5)

Specifying the model

  • We will use variables lcavol, lweight, age, lbph, svi, lcp and pgg45.

  • Rather than one predictor, we have $p=7$ predictors.

  • $$Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_p X_{ip} + \varepsilon_i$$

  • Errors $\varepsilon$ are assumed independent $N(0,\sigma^2)$, as in simple linear regression.

  • Coefficients are called (partial) regression coefficients because they “allow” for the effect of other variables.

Geometry of Least Squares

Fitting the model

  • Just as in simple linear regression, model is fit by minimizing $$\begin{aligned} SSE(\beta_0, \dots, \beta_p) &= \sum_{i=1}^n(Y_i - (\beta_0 + \sum_{j=1}^p \beta_j \ X_{ij}))^2 \\ &= \|Y - \widehat{Y}(\beta)\|^2 \end{aligned}$$

  • Minimizers: $\widehat{\beta} = (\widehat{\beta}_0, \dots, \widehat{\beta}_p)$ are the “least squares estimates”; they are also normally distributed, as in simple linear regression.

In [4]:
%%R
prostate.lm = lm(lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45, data=prostate)
prostate.lm

Call:
lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + lcp + 
    pgg45, data = prostate)

Coefficients:
(Intercept)       lcavol      lweight          age         lbph          svi  
   0.494155     0.569546     0.614420    -0.020913     0.097353     0.752397  
        lcp        pgg45  
  -0.104959     0.005324  


Estimating $\sigma^2$

  • As in simple regression $$\widehat{\sigma}^2 = \frac{SSE}{n-p-1} \sim \sigma^2 \cdot \frac{\chi^2_{n-p-1}}{n-p-1}$$ independent of $\widehat{\beta}$.

  • Why $\chi^2_{n-p-1}$? Typically, the degrees of freedom in the estimate of $\sigma^2$ is $n - \text{(number of parameters in the regression function)}$.

In [5]:
%%R
print(prostate.lm$df.resid)
sigma.hat = sqrt(sum(resid(prostate.lm)^2) / prostate.lm$df.resid)
sigma.hat
[1] 89
[1] 0.6959559

Interpretation of $\beta_j$’s

  • Take $\beta_1=\beta_{\tt{lcavol}}$ for example. This is the amount the lpsa rating increases for one “unit” of lcavol, keeping everything else constant.

  • We refer to this as the effect of lcavol allowing for or controlling for the other variables.

  • For example, let's take the 10th case in our data and change lcavol by 1 unit.

In [6]:
%%R
case1 = prostate[10,]
case2 = case1
case2['lcavol'] = case2['lcavol'] + 1
Yhat = predict(prostate.lm, rbind(case1, case2))
Yhat
      10      101 
1.307754 1.877300 

Our regression model says that this difference should be $\hat{\beta}_{\tt lcavol}$.

In [7]:
%%R
c(Yhat[2]-Yhat[1], coef(prostate.lm)['lcavol'])
     101   lcavol 
0.569546 0.569546 

Partial regression coefficients

  • The term partial refers to the fact that the coefficient $\beta_j$ represents the partial effect of ${X}_j$ on ${Y}$, i.e. after the effects of all other variables have been removed.

  • Specifically, $$Y_i - \sum_{l=1, l \neq j}^p X_{il} \beta_l = \beta_0 + \beta_j X_{ij} + \varepsilon_i.$$

  • Let $e_{i,(j)}$ be the residuals from regressing ${Y}$ onto all ${X}_{\cdot}$’s except ${X}_j$, and let $X_{i,(j)}$ be the residuals from regressing ${X}_j$ onto all ${X}_{\cdot}$’s except ${X}_j$.

  • If we regress $e_{i,(j)}$ against $X_{i,(j)}$, the coefficient is exactly the same as in the original model.

Let's verify this interpretation of regression coefficients.

In [8]:
%%R
partial_resid_lcavol = resid(lm(lcavol ~  lweight + age + lbph + svi + lcp + pgg45, data=prostate))
partial_resid_lpsa = resid(lm(lpsa ~  lweight + age + lbph + svi + lcp + pgg45, data=prostate))
summary(lm(partial_resid_lpsa ~ partial_resid_lcavol))

Call:
lm(formula = partial_resid_lpsa ~ partial_resid_lcavol)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.76395 -0.35764 -0.02143  0.37762  1.58178 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -1.226e-16  6.840e-02   0.000        1    
partial_resid_lcavol  5.695e-01  8.309e-02   6.854 7.15e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6736 on 95 degrees of freedom
Multiple R-squared: 0.3309,	Adjusted R-squared: 0.3239 
F-statistic: 46.98 on 1 and 95 DF,  p-value: 7.147e-10 


Goodness of fit for multiple regression

$$\begin{aligned} SSE &= \sum_{i=1}^n(Y_i - \widehat{Y}_i)^2 \\ SSR &= \sum_{i=1}^n(\overline{Y} - \widehat{Y}_i)^2 \\ SST &= \sum_{i=1}^n(Y_i - \overline{Y})^2 \\ R^2 &= \frac{SSR}{SST} \end{aligned}$$

$R^2$ is called the multiple correlation coefficient of the model, or the coefficient of multiple determination.

The sums of squares and $R^2$ are defined analogously to those in simple linear regression.

In [9]:
%%R
Y = prostate$lpsa
n = length(Y)
SST = sum((Y - mean(Y))^2)
MST = SST / (n - 1)
SSE = sum(resid(prostate.lm)^2)
MSE = SSE / prostate.lm$df.residual
SSR = SST - SSE
MSR = SSR / (n - 1 - prostate.lm$df.residual)
print(c(MST,MSE,MSR))
[1]  1.3324756  0.4843546 12.1157287
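
As a quick check, we can compute $R^2$ directly from these sums of squares and compare it with the value reported by summary. A short sketch reusing the SSR and SST computed in the cell above:

R2 = SSR / SST                                  # equivalently 1 - SSE / SST
print(c(R2, summary(prostate.lm)$r.squared))    # should agree with the summary output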

Adjusted $R^2$

  • As we add more and more variables to the model (even random ones), $R^2$ will increase to 1.

  • Adjusted $R^2$ tries to take this into account by replacing sums of squares by mean squares $$R^2_a = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - \frac{MSE}{MST}.$$
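
A minimal sketch of this adjustment, reusing MSE and MST from the earlier cell and comparing with the adjusted value R reports:

R2.adj = 1 - MSE / MST
print(c(R2.adj, summary(prostate.lm)$adj.r.squared))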

Goodness of fit test

  • As in simple linear regression, we measure the goodness of fit of the regression model by $$F = \frac{MSR}{MSE} = \frac{\|\overline{Y}\cdot {1} - \widehat{{Y}}\|^2/p}{\|Y - \widehat{{Y}}\|^2/(n-p-1)}.$$

  • Under $H_0:\beta_1 = \dots = \beta_p=0$, $$F \sim F_{p, n-p-1}$$ so reject $H_0$ at level $\alpha$ if $F > F_{p,n-p-1,1-\alpha}.$

In [10]:
%%R
print(summary(prostate.lm))
F = MSR / MSE
F

Call:
lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + lcp + 
    pgg45, data = prostate)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.76395 -0.35764 -0.02143  0.37762  1.58178 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.494155   0.873567   0.566  0.57304    
lcavol       0.569546   0.085847   6.634 2.46e-09 ***
lweight      0.614420   0.198449   3.096  0.00262 ** 
age         -0.020913   0.010978  -1.905  0.06000 .  
lbph         0.097353   0.057584   1.691  0.09441 .  
svi          0.752397   0.238180   3.159  0.00216 ** 
lcp         -0.104959   0.089347  -1.175  0.24323    
pgg45        0.005324   0.003385   1.573  0.11923    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.696 on 89 degrees of freedom
Multiple R-squared: 0.663,	Adjusted R-squared: 0.6365 
F-statistic: 25.01 on 7 and 89 DF,  p-value: < 2.2e-16 

[1] 25.01417

Geometry of Least Squares

Intuition behind the $F$ test (?)

  • The $F$ statistic is a ratio of lengths of orthogonal vectors (divided by degrees of freedom).

  • Let $\mu=E(Y)=X\beta$ be the true mean vector for $Y$.

  • We can prove that our model implies (whether $H_0$ is true or not) $$\begin{aligned} \mathbb{E}\left(MSR\right) &= \sigma^2 + \underbrace{\|{\mu} - \overline{\mu} \cdot {1}\|^2 / p}_{(*)} \\ \mathbb{E}\left(MSE\right) &= \sigma^2 \\ \mu_i &= \mathbb{E}(Y_i) = \beta_0 + X_{i1} \beta_1 + \dots + X_{ip} \beta_p \end{aligned}$$

  • If $H_0$ is true, then $(*)=0$ and $\mathbb{E}(MSR)=\mathbb{E}(MSE)=\sigma^2$ so the $F$ should not be too different from 1.

  • If $F$ is large, it is evidence that $\mathbb{E}\left(MSR\right) \neq \sigma^2$, i.e. $H_0$ is false.

Where does (*) come from?

Least squares regression can be expressed in terms of orthogonal projections. That is, $$ \hat{Y} = PY $$ for some $P_{n \times n}$ where $P=P^T$ and $P^2=P$ (this makes it an orthogonal projection matrix). We will call this $P_F$ where $F$ stands for "full". Recall that for any projection matrix and any vector $y$ $$ \begin{aligned} \|Py\|^2 &= (Py)^T(Py) \\ &= y^TP^TPy \\ & = y^TP^2y \\ &= y^TPy. \end{aligned} $$

Let $P_R$ denote projection onto the 1-dimensional model determined by the 1 vector. Note that $P_F-P_R$ is again a projection: it projects onto the orthogonal complement of the 1 vector within the $(p+1)$-dimensional full model. So, it is a projection onto a $p$ dimensional space.

We see that $$ \begin{aligned} SSR &= \|(P_F-P_R)Y\|^2 \\ &= Y^T(P_F-P_R)Y \\ &= (\mu+\epsilon)^T(P_F-P_R)(\mu+\epsilon) \\ &= \mu^T(P_F-P_R)\mu + 2 \mu^T(P_F-P_R)\epsilon + \epsilon^T(P_F-P_R)\epsilon \\ &= \|\mu - \bar{\mu} \cdot 1\|^2 + 2 \mu^T(P_F-P_R)\epsilon + \epsilon^T(P_F-P_R)\epsilon \\ \end{aligned} $$

Now, let's take expectations. The first term is a constant, the cross term has expected value zero and the expected value of the final term is $$ p \cdot \sigma^2. $$ This comes from the fact that $$ \epsilon^T(P_F-P_R)\epsilon = \|(P_F-P_R)\epsilon\|^2 \sim \sigma^2 \chi^2_p. $$
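
As a sanity check on this projection argument, here is a small sketch (the names P.F and P.R are introduced here only for illustration) that forms the two projections explicitly and verifies that $\|(P_F - P_R)Y\|^2$ reproduces the SSR computed earlier:

X.full = model.matrix(prostate.lm)                            # design matrix of the full model
P.F = X.full %*% solve(t(X.full) %*% X.full) %*% t(X.full)    # projection onto the full model
one = rep(1, nrow(X.full))
P.R = one %*% t(one) / nrow(X.full)                           # projection onto the span of the 1 vector
print(max(abs(P.F %*% P.F - P.F)))                            # idempotent, up to rounding error
print(c(sum(((P.F - P.R) %*% Y)^2), SSR))                     # agrees with SSR from before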

$F$-test revisited

The $F$ test can be thought of as comparing two models:

  • Full (bigger) model :

    $$Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_p X_{ip} + \varepsilon_i$$

  • Reduced (smaller) model:

    $$Y_i = \beta_0 + \varepsilon_i$$

  • The $F$-statistic has the form $$F=\frac{(SSE(R) - SSE(F)) / (df_R - df_F)}{SSE(F) / df_F}.$$

  • Note: the smaller model should be nested within the bigger model.
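
For example, the overall goodness-of-fit $F$ from summary(prostate.lm) is exactly this comparison, with the intercept-only model playing the role of the reduced model. A quick sketch using anova (discussed further below):

anova(lm(lpsa ~ 1, data=prostate), prostate.lm)   # reproduces the overall F-statistic of about 25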

Geometry of Least Squares

Matrix formulation

$${ Y}_{n \times 1} = {X}_{n \times (p + 1)} {\beta}_{(p+1) \times 1} + {\varepsilon}_{n \times 1}$$

  • ${X}$ is called the design matrix of the model

  • ${\varepsilon} \sim N(0, \sigma^2 I_{n \times n})$ is multivariate normal

$SSE$ in matrix form

$$SSE(\beta) = ({Y} - {X} {\beta})'({Y} - {X} {\beta}) = \|Y-X\beta\|^2_2$$

Design matrix

  • The design matrix is the $n \times (p+1)$ matrix with entries $$X = \begin{pmatrix} 1 & X_{11} & X_{12} & \dots & X_{1,p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \dots & X_{n,p} \\ \end{pmatrix}$$
In [11]:
%%R
n = length(Y)
attach(prostate)
X = cbind(rep(1,n), lcavol, lweight, age, lbph, svi, lcp, pgg45)
detach(prostate)
colnames(X)[1] = '(Intercept)'
head(X)
     (Intercept)     lcavol  lweight age      lbph svi       lcp pgg45
[1,]           1 -0.5798185 2.769459  50 -1.386294   0 -1.386294     0
[2,]           1 -0.9942523 3.319626  58 -1.386294   0 -1.386294     0
[3,]           1 -0.5108256 2.691243  74 -1.386294   0 -1.386294    20
[4,]           1 -1.2039728 3.282789  58 -1.386294   0 -1.386294     0
[5,]           1  0.7514161 3.432373  62 -1.386294   0 -1.386294     0
[6,]           1 -1.0498221 3.228826  50 -1.386294   0 -1.386294     0

The matrix X is the same as the one R forms internally:

In [12]:
%%R
head(model.matrix(prostate.lm))
  (Intercept)     lcavol  lweight age      lbph svi       lcp pgg45
1           1 -0.5798185 2.769459  50 -1.386294   0 -1.386294     0
2           1 -0.9942523 3.319626  58 -1.386294   0 -1.386294     0
3           1 -0.5108256 2.691243  74 -1.386294   0 -1.386294    20
4           1 -1.2039728 3.282789  58 -1.386294   0 -1.386294     0
5           1  0.7514161 3.432373  62 -1.386294   0 -1.386294     0
6           1 -1.0498221 3.228826  50 -1.386294   0 -1.386294     0

Least squares solution

  • Normal equations $$\frac{\partial}{\partial \beta_j} SSE \biggl|_{\beta = \widehat{\beta}} = -2 \left({Y} - {X} \widehat{\beta} \right)^T {X}_j = 0, \qquad 0 \leq j \leq p.$$

  • Equivalent to $$\begin{aligned} ({Y} - {X}{\widehat{\beta}})^T{X} &= 0 \\ {\widehat{\beta}} &= ({X}^T{X})^{-1}{X}^T{Y} \end{aligned}$$

  • Properties: $$\widehat{\beta} \sim N(\beta, \sigma^2 (X^TX)^{-1}).$$

Multivariate Normal

To obtain the distribution of $\hat{\beta}$ we used the following fact about the multivariate Normal.

Suppose $Z \sim N(\mu,\Sigma)$. Then, for any fixed matrix $A$ $$ AZ \sim N(A\mu, A\Sigma A^T). $$

(It goes without saying that the dimensions of the matrix must agree with those of $Z$.)

How did we derive the distribution of $\hat{\beta}$?

Above, we saw that $\hat{\beta}$ is equal to a matrix times $Y$. The matrix form of our model is $$ Y \sim N(X\beta, \sigma^2 I). $$

Therefore, $$ \begin{aligned} \hat{\beta} &\sim N\left((X^TX)^{-1}X^T (X\beta), (X^TX)^{-1}X^T X (\sigma^2 I) (X^TX)^{-1}\right) \\ &\sim N(\beta, \sigma^2 (X^TX)^{-1}). \end{aligned} $$
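
A small simulation sketch illustrating this result, treating the fitted coefficients and sigma.hat as the true values (the seed and the number of replications are arbitrary choices):

set.seed(1)
nsim = 2000
B = matrix(NA, nsim, ncol(X))
for (i in 1:nsim) {
    Y.sim = X %*% coef(prostate.lm) + rnorm(n, 0, sigma.hat)   # simulate from the fitted model
    B[i,] = solve(t(X) %*% X, t(X) %*% Y.sim)                  # least squares estimate
}
print(round(cov(B)[1:3, 1:3], 4))                              # empirical covariance of the estimates
print(round((sigma.hat^2 * solve(t(X) %*% X))[1:3, 1:3], 4))   # theoretical covariance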

Least squares solution

Let's verify our equations for $\hat{\beta}$.

In [13]:
%%R
beta = as.numeric(solve(t(X) %*% X) %*% t(X) %*% Y)
names(beta) = colnames(X)
print(beta)
print(coef(prostate.lm))
 (Intercept)       lcavol      lweight          age         lbph          svi 
 0.494154754  0.569546032  0.614419817 -0.020913467  0.097352535  0.752397342 
         lcp        pgg45 
-0.104959408  0.005324465 
 (Intercept)       lcavol      lweight          age         lbph          svi 
 0.494154754  0.569546032  0.614419817 -0.020913467  0.097352535  0.752397342 
         lcp        pgg45 
-0.104959408  0.005324465 

Inference for multiple regression

Regression function at one point

One thing we might want to learn about the regression function in the prostate example is its value at some fixed values of ${X}_{1}, \dots, {X}_{7}$, i.e. what can be said about $$ \begin{aligned} \beta_0 + 1.3 \cdot \beta_1 &+ 3.6 \cdot \beta_2 + 64 \cdot \beta_3 + \\ 0.1 \cdot \beta_4 &+ 0.2 \cdot \beta_5 - 0.2 \cdot \beta_6 + 25 \cdot \beta_7 \tag{*}\end{aligned}$$ roughly the regression function at “typical” values of the predictors.

The expression above is equivalent to $$\sum_{j=0}^7 a_j \beta_j, \qquad a=(1,1.3,3.6,64,0.1,0.2,-0.2,25).$$

Confidence interval for $\sum_{j=0}^p a_j \beta_j$

  • Suppose we want a $(1-\alpha)\cdot 100\%$ CI for $\sum_{j=0}^p a_j\beta_j$.

  • Just as in simple linear regression:

    $$\sum_{j=0}^p a_j \widehat{\beta}_j \pm t_{1-\alpha/2, n-p-1} \cdot SE\left(\sum_{j=0}^p a_j\widehat{\beta}_j\right).$$

R forms these intervals for each coefficient separately when using the confint function. These linear combinations are of the form $$ a_{\tt lcavol} = (0,1,0,0,0,0,0,0) $$ so that $$ a_{\tt lcavol}^T\widehat{\beta} = \widehat{\beta}_1 = {\tt coef(prostate.lm)[2]} $$

In [14]:
%%R
confint(prostate.lm, level=0.90)
                      5 %         95 %
(Intercept) -0.9578488958  1.946158404
lcavol       0.4268548240  0.712237239
lweight      0.2845659251  0.944273708
age         -0.0391601782 -0.002666755
lbph         0.0016386253  0.193066445
svi          0.3565053323  1.148289353
lcp         -0.2534678904  0.043549074
pgg45       -0.0003011464  0.010950077
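
To connect this output with the formula above, here is a short sketch reproducing the 90% interval for lcavol by hand from its estimate, its standard error, and the appropriate $t$ quantile:

est = coef(prostate.lm)['lcavol']
se = sqrt(vcov(prostate.lm)['lcavol', 'lcavol'])
q = qt(0.95, prostate.lm$df.residual)
print(c(est - q * se, est + q * se))   # matches the lcavol row of confint above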

$T$-statistics revisited

Of course, these confidence intervals are based on the standard ingredients of a $T$-statistic.

  • Suppose we want to test $$H_0:\sum_{j=0}^p a_j\beta_j= h.$$ As in simple linear regression, it is based on $$T = \frac{\sum_{j=0}^p a_j \widehat{\beta}_j - h}{SE(\sum_{j=0}^p a_j \widehat{\beta}_j)}.$$

  • If $H_0$ is true, then $T \sim t_{n-p-1}$, so we reject $H_0$ at level $\alpha$ if $$\begin{aligned} |T| &\geq t_{1-\alpha/2,n-p-1}, \qquad \text{ OR} \\ p-\text{value} &= {\tt 2*(1-pt(|T|, n-p-1))} \leq \alpha. \end{aligned}$$

R produces these in the coef table summary of the linear regression model. Again, each of these linear combinations is a vector $a$ with only one non-zero entry like $a_{\tt lcavol}$ above.

In [15]:
%%R 
summary(prostate.lm)$coef
                Estimate  Std. Error    t value     Pr(>|t|)
(Intercept)  0.494154754 0.873566764  0.5656749 5.730382e-01
lcavol       0.569546032 0.085847096  6.6344240 2.461450e-09
lweight      0.614419817 0.198449499  3.0961016 2.622121e-03
age         -0.020913467 0.010977742 -1.9050792 5.999883e-02
lbph         0.097352535 0.057584215  1.6906115 9.441057e-02
svi          0.752397342 0.238179913  3.1589454 2.163265e-03
lcp         -0.104959408 0.089346934 -1.1747399 2.432323e-01
pgg45        0.005324465 0.003384528  1.5731780 1.192260e-01

Let's do a quick calculation to remind ourselves of the relationships among the quantities in the table above.

In [16]:
%%R
T1 = 0.56954 / 0.08584
P1 = 2 * (1 - pt(abs(T1), 89))
print(c(T1,P1))
[1] 6.634902e+00 2.456119e-09

These were indeed the values for lcavol in the summary table.

One-sided tests

  • Suppose, instead, we wanted to test the one-sided hypothesis $$H_0:\sum_{j=0}^p a_j\beta_j \leq h, \ \text{vs.} \ H_a: \sum_{j=0}^p a_j\beta_j > h$$

  • If $H_0$ is true, then $T$ is no longer exactly $t_{n-p-1}$ but we still have $$\mathbb{P}\left(T > t_{1-\alpha, n-p-1}\right) \leq \alpha$$ so we reject $H_0$ at level $\alpha$ if $$\begin{aligned} T &\geq t_{1-\alpha,n-p-1}, \qquad \text{ OR} \\ p-\text{value} &= {\tt (1-pt(T, n-p-1))} \leq \alpha. \end{aligned}$$
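
As an illustration (not part of the original analysis), the one-sided p-value for $H_0: \beta_{\tt lcavol} \leq 0$ versus $H_a: \beta_{\tt lcavol} > 0$ is just the upper tail probability of the same $T$ statistic:

T.lcavol = summary(prostate.lm)$coef['lcavol', 't value']
print(pt(T.lcavol, prostate.lm$df.residual, lower.tail=FALSE))   # half the two-sided p-value here, since T > 0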

Standard error of $\sum_{j=0}^p a_j \widehat{\beta}_j$

  • In order to form these $T$ statistics, we need the $SE$ of our estimate $\sum_{j=0}^p a_j \widehat{\beta}_j$.

  • Based on the matrix approach to regression, $$SE\left(\sum_{j=0}^p a_j\widehat{\beta}_j \right) = \sqrt{\widehat{\sigma}^2 a (X^TX)^{-1} a^T}.$$

  • Don’t worry too much about the specific implementation: for most of the quantities we need, R will compute this for us.
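
Still, the formula is straightforward to apply with vcov. A short sketch for the vector $a$ of "typical values" introduced earlier:

a = c(1, 1.3, 3.6, 64, 0.1, 0.2, -0.2, 25)
se.a = sqrt(sum((a %*% vcov(prostate.lm)) * a))   # same as sqrt(t(a) %*% vcov(prostate.lm) %*% a)
print(se.a)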

In [17]:
%%R
Y.hat = X %*% beta
sigma.hat = sqrt(sum((Y - Y.hat)^2) / (n - ncol(X)))
cov.beta = sigma.hat^2 * solve(t(X) %*% X)
cov.beta
             (Intercept)        lcavol       lweight           age
(Intercept)  0.763118892  9.968185e-03 -1.127145e-01 -5.757711e-03
lcavol       0.009968185  7.369724e-03 -3.267921e-03 -1.303787e-04
lweight     -0.112714488 -3.267921e-03  3.938220e-02 -4.088331e-04
age         -0.005757711 -1.303787e-04 -4.088331e-04  1.205108e-04
lbph         0.024821341  3.981644e-04 -4.540916e-03 -1.471652e-04
svi          0.010342177 -2.643164e-03 -4.105065e-03 -7.340636e-05
lcp          0.002077485 -3.502258e-03  6.198759e-05  1.429198e-04
pgg45        0.000033477  7.077893e-06  6.876251e-05 -9.141775e-06
                     lbph           svi           lcp         pgg45
(Intercept)  2.482134e-02  1.034218e-02  2.077485e-03  3.347700e-05
lcavol       3.981644e-04 -2.643164e-03 -3.502258e-03  7.077893e-06
lweight     -4.540916e-03 -4.105065e-03  6.198759e-05  6.876251e-05
age         -1.471652e-04 -7.340636e-05  1.429198e-04 -9.141775e-06
lbph         3.315942e-03  2.136331e-03 -1.416010e-04 -1.238514e-05
svi          2.136331e-03  5.672967e-02 -8.863127e-03 -5.227255e-05
lcp         -1.416010e-04 -8.863127e-03  7.982875e-03 -1.368737e-04
pgg45       -1.238514e-05 -5.227255e-05 -1.368737e-04  1.145503e-05

The standard errors of each coefficient estimate are the square root of the diagonal entries. They appear as the Std. Error column in the coef table.

In [18]:
%%R
sqrt(diag(cov.beta))
(Intercept)      lcavol     lweight         age        lbph         svi 
0.873566764 0.085847096 0.198449499 0.010977742 0.057584215 0.238179913 
        lcp       pgg45 
0.089346934 0.003384528 

Generally, we can obtain our estimate of the covariance matrix of $\widehat{\beta}$ as follows:

In [19]:
%%R
vcov(prostate.lm)
             (Intercept)        lcavol       lweight           age
(Intercept)  0.763118892  9.968185e-03 -1.127145e-01 -5.757711e-03
lcavol       0.009968185  7.369724e-03 -3.267921e-03 -1.303787e-04
lweight     -0.112714488 -3.267921e-03  3.938220e-02 -4.088331e-04
age         -0.005757711 -1.303787e-04 -4.088331e-04  1.205108e-04
lbph         0.024821341  3.981644e-04 -4.540916e-03 -1.471652e-04
svi          0.010342177 -2.643164e-03 -4.105065e-03 -7.340636e-05
lcp          0.002077485 -3.502258e-03  6.198759e-05  1.429198e-04
pgg45        0.000033477  7.077893e-06  6.876251e-05 -9.141775e-06
                     lbph           svi           lcp         pgg45
(Intercept)  2.482134e-02  1.034218e-02  2.077485e-03  3.347700e-05
lcavol       3.981644e-04 -2.643164e-03 -3.502258e-03  7.077893e-06
lweight     -4.540916e-03 -4.105065e-03  6.198759e-05  6.876251e-05
age         -1.471652e-04 -7.340636e-05  1.429198e-04 -9.141775e-06
lbph         3.315942e-03  2.136331e-03 -1.416010e-04 -1.238514e-05
svi          2.136331e-03  5.672967e-02 -8.863127e-03 -5.227255e-05
lcp         -1.416010e-04 -8.863127e-03  7.982875e-03 -1.368737e-04
pgg45       -1.238514e-05 -5.227255e-05 -1.368737e-04  1.145503e-05

Prediction interval

  • Basically identical to simple linear regression.

  • Prediction interval at $X_{1,new}, \dots, X_{p,new}$: $$\begin{aligned} \widehat{\beta}_0 + \sum_{j=1}^p X_{j,new} \widehat{\beta}_j\pm t_{1-\alpha/2, n-p-1} \times \sqrt{\widehat{\sigma}^2 + SE\left(\widehat{\beta}_0 + \sum_{j=1}^p X_{j,new}\widehat{\beta}_j\right)^2}. \end{aligned}$$

Forming intervals by hand

While R computes most of the intervals we need, we could write a function that explicitly computes a confidence interval (and can be used for prediction intervals with the "extra" argument).

This exercise shows the calculations that R is doing under the hood: the function predict is generally going to be fine for our purposes.

In [20]:
%%R
CI.lm = function(cur.lm, a, level=0.95, extra=0) {

     # the center of the confidence interval
     center = sum(a*cur.lm$coef)

     # the estimate of sigma^2
     sigma.sq = sum(resid(cur.lm)^2) / cur.lm$df.resid

     # the standard error of sum(a*cur.lm$coef)
     se = sqrt(extra * sigma.sq + sum((a %*% vcov(cur.lm)) * a))

     # the degrees of freedom for the t-statistic
     df = cur.lm$df
     
     # the quantile used in the confidence interval

     q = qt((1 - level)/2, df, lower.tail=FALSE)

     # upper, lower limits
     upr = center + se * q
     lwr = center - se * q
     return(data.frame(center, lwr, upr))
}
In [21]:
%%R
print(CI.lm(prostate.lm, c(1, 1.3, 3.6, 64, 0.1, 0.2, -0.2, 25)))
predict(prostate.lm,list(lcavol=1.3,lweight=3.6,age=64,lbph=0.1,svi=0.2,lcp=-0.2,pgg45=25), interval='confidence')
    center      lwr      upr
1 2.422332 2.281218 2.563447
       fit      lwr      upr
1 2.422332 2.281218 2.563447

Prediction intervals

By using the extra argument, we can make prediction intervals.

In [22]:
%%R
print(CI.lm(prostate.lm, c(1, 1.3, 3.6, 64, 0.1, 0.2, -0.2, 25), extra=1))
predict(prostate.lm,list(lcavol=1.3,lweight=3.6,age=64,lbph=0.1,svi=0.2,lcp=-0.2,pgg45=25), interval='prediction')
    center      lwr      upr
1 2.422332 1.032301 3.812363
       fit      lwr      upr
1 2.422332 1.032301 3.812363

Arbitrary contrasts

If we want, we can set the entry of $a$ corresponding to the intercept to 0. This allows us to construct a confidence interval for, say, how much the lpsa score will change if we increase age by 2 years and svi by 0.5 units, leaving everything else unchanged.

Therefore, what we want is a confidence interval for 2 times the coefficient of age plus 0.5 times the coefficient of svi: $$ 2 \cdot \beta_{\tt age} + 0.5 \cdot \beta_{\tt svi} $$

Most of the time, predict will do what you want so this won't be used too often.

In [23]:
%%R
CI.lm(prostate.lm, c(0,0,0,2,0,0.5,0,0))
     center        lwr       upr
1 0.3343717 0.09496226 0.5737812

Questions about many (combinations) of $\beta_j$’s

  • In multiple regression we can ask more complicated questions than in simple regression.

  • For instance, we could ask whether lcp and pgg45 explain little of the variability in the data, and might be dropped from the regression model.

  • These questions can be answered by $F$-statistics.

  • Note: This hypothesis should really be formed before looking at the output of summary.

  • Later we'll see some examples of the messiness when forming a hypothesis after seeing the summary...

Dropping one or more variables

  • Suppose we wanted to test the above hypothesis. Formally, the null hypothesis is: $$ H_0: \beta_{\tt lcp} (=\beta_6) =\beta_{\tt pgg45} (=\beta_7) =0$$ and the alternative is $$ H_a: \text{one of $ \beta_{\tt lcp},\beta_{\tt pgg45}$ is not 0}. $$

  • This test is an $F$-test based on two models $$\begin{aligned} Full: Y_i &= \beta_0 + \sum_{j=1}^7 \beta_j X_{ij} + \varepsilon_i \\ Reduced: Y_i &= \beta_0 + \sum_{j=1}^5 \beta_j X_{ij} + \varepsilon_i \\ \end{aligned}$$

  • Note: The reduced model $R$ must be a special case of the full model $F$ to use the $F$-test.

Geometry of Least Squares

$SSE$ of a model

  • In the graphic, a “model”, ${\cal M}$ is a subspace of $\mathbb{R}^n$ = column space of ${X}$.

  • Least squares fit = projection onto the subspace ${\cal M}$, yielding predicted values $\widehat{Y}_{{\cal M}}$

  • Error sum of squares: $$SSE({\cal M}) = \|Y - \widehat{Y}_{{\cal M}}\|^2.$$

$F$-statistic for $H_0:\beta_{\tt lcp}=\beta_{\tt pgg45}=0$

  • We compute the $F$ statistic the same way we would to compare any two nested models $$\begin{aligned} F &=\frac{\frac{SSE(R) - SSE(F)}{2}}{\frac{SSE(F)}{n-1-p}} \\ & \sim F_{2, n-p-1} \qquad (\text{if $H_0$ is true}) \end{aligned}$$

  • Reject $H_0$ at level $\alpha$ if $F > F_{1-\alpha, 2, n-1-p}$.

When comparing two models, one a special case of the other (i.e. one nested in the other), we can test if the smaller model (the special case) is roughly as good as the larger model in describing our outcome. This is typically tested using an F test based on comparing the two models. The following function does this.

In [24]:
%%R
f.test.lm = function(R.lm, F.lm) {
    SSE.R = sum(resid(R.lm)^2)
    SSE.F = sum(resid(F.lm)^2)
    df.num = R.lm$df - F.lm$df
    df.den = F.lm$df
    F = ((SSE.R - SSE.F) / df.num) / (SSE.F / df.den)
    p.value = 1 - pf(F, df.num, df.den)
    return(data.frame(F, df.num, df.den, p.value))
}

R has a function that does essentially the same thing as f.test.lm: the function is anova. It can be used several ways, but it can be used to compare two models.

In [25]:
%%R
reduced.lm = lm(lpsa ~ lcavol + lweight + age + lbph + svi, data=prostate)
print(f.test.lm(reduced.lm, prostate.lm))
anova(reduced.lm, prostate.lm)
         F df.num df.den   p.value
1 1.372057      2     89 0.2588958
Analysis of Variance Table

Model 1: lpsa ~ lcavol + lweight + age + lbph + svi
Model 2: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     91 44.437                           
2     89 43.108  2    1.3291 1.3721 0.2589

Dropping an arbitrary subset

  • For an arbitrary model, suppose we want to test $$ \begin{aligned} H_0:&\beta_{i_1}=\dots=\beta_{i_j}=0 \\ H_a:& \text{one of $\beta_{i_1}, \dots \beta_{i_j} \neq 0$} \end{aligned} $$ for some subset $\{i_1, \dots, i_j\} \subset \{0, \dots, p\}$.

  • You guessed it: it is based on the two models: $$\begin{aligned} R: Y_i &= \sum_{l=0,\ l \not \in \{i_1, \dots, i_j\}}^p \beta_l X_{il} + \varepsilon_i \\ F: Y_i &= \sum_{j=0}^p \beta_j X_{ij} + \varepsilon_i \\ \end{aligned}$$ where $X_{i0}=1$ for all $i$.

  • $$ \begin{aligned} F &=\frac{\frac{SSE(R) - SSE(F)}{j}}{\frac{SSE(F)}{n-p-1}} \\ & \sim F_{j, n-p-1} \qquad (\text{if $H_0$ is true}) \end{aligned} $$

  • Reject $H_0$ at level $\alpha$ if $F > F_{1-\alpha, j, n-1-p}$.
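
As another illustration (this particular subset is chosen only for demonstration), we could test $H_0: \beta_{\tt age} = \beta_{\tt lbph} = 0$ in the same way, by fitting the model without those two variables and comparing it with the full fit:

no.age.lbph.lm = lm(lpsa ~ lcavol + lweight + svi + lcp + pgg45, data=prostate)
anova(no.age.lbph.lm, prostate.lm)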

Geometry of Least Squares

General $F$-tests

  • Given two models $R \subset F$ (i.e. $R$ is a subspace of $F$), we can consider testing $$ H_0: \text{$R$ is adequate (i.e. $\mathbb{E}(Y) \in R$)} $$ vs. $$ H_a: \text{$F$ is adequate (i.e. $\mathbb{E}(Y) \in F$)} $$

    • The test statistic is $$ F = \frac{(SSE(R) - SSE(F)) / (df_R - df_F)}{SSE(F)/df_F} $$
  • If $H_0$ is true, $F \sim F_{df_R-df_F, df_F}$ so we reject $H_0$ at level $\alpha$ if $F > F_{df_R-df_F, df_F, 1-\alpha}$.

Constraining $\beta_{\tt lcavol}=\beta_{\tt svi}$

In this example, we might suppose that the coefficients for lcavol and svi are the same and want to test this. We do this, again, by comparing a "full model" and a "reduced model".

  • Full model: $$\begin{aligned} Y_i &= \beta_0 + \beta_1 X_{i,{\tt lcavol}} + \beta_2 X_{i,{\tt lweight}} + \beta_3 X_{i, {\tt age}} \\ & \qquad+ \beta_4 X_{i,{\tt lbph}} + \beta_5 X_{i, {\tt svi}} + \beta_6 X_{i, {\tt lcp}} + \beta_7 X_{i, {\tt pgg45}} + \varepsilon_i \end{aligned} $$

  • Reduced model: $$\begin{aligned} Y_i &= \beta_0 + \tilde{\beta}_1 X_{i,{\tt lcavol}} + \beta_2 X_{i,{\tt lweight}} + \beta_3 X_{i, {\tt age}} \\ & \qquad+ \beta_4 X_{i,{\tt lbph}} + \tilde{\beta}_1 X_{i, {\tt svi}} + \beta_6 X_{i, {\tt lcp}} + \beta_7 X_{i, {\tt pgg45}} + \varepsilon_i \end{aligned} $$

In [26]:
%%R
prostate$Z = prostate$lcavol + prostate$svi
equal.lm = lm(lpsa ~ Z + lweight + age + lbph + lcp + pgg45, data=prostate)
f.test.lm(equal.lm, prostate.lm)
          F df.num df.den   p.value
1 0.4818657      1     89 0.4893864

Constraining $\beta_{\tt lcavol}+\beta_{\tt svi}=1$

  • Full model: $$\begin{aligned} Y_i &= \beta_0 + \beta_1 X_{i,{\tt lcavol}} + \beta_2 X_{i,{\tt lweight}} + \beta_3 X_{i, {\tt age}} \\ & \qquad+ \beta_4 X_{i,{\tt lbph}} + \beta_5 X_{i, {\tt svi}} + \beta_6 X_{i, {\tt lcp}} + \beta_7 X_{i, {\tt pgg45}} + \varepsilon_i \end{aligned} $$

  • Reduced model: $$\begin{aligned} Y_i &= \beta_0 + \tilde{\beta}_1 X_{i,{\tt lcavol}} + \beta_2 X_{i,{\tt lweight}} + \beta_3 X_{i, {\tt age}} \\ & \qquad+ \beta_4 X_{i,{\tt lbph}} + (1 - \tilde{\beta}_1) X_{i, {\tt svi}} + \beta_6 X_{i, {\tt lcp}} + \beta_7 X_{i, {\tt pgg45}} + \varepsilon_i \end{aligned} $$

In [27]:
%%R
prostate$Z2 = prostate$lcavol - prostate$svi
constrained.lm = lm(lpsa ~ Z2 + lweight + age + lbph + lcp + pgg45, data=prostate, offset=svi)
anova(constrained.lm, prostate.lm)
Analysis of Variance Table

Model 1: lpsa ~ Z2 + lweight + age + lbph + lcp + pgg45
Model 2: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     90 43.961                           
2     89 43.108  1   0.85359 1.7623 0.1877

What we did above was include svi in the model with its coefficient fixed at 1. R has a way to do this, called an offset: the offset vector is added to the linear predictor with a fixed coefficient of 1, which is equivalent to subtracting it from the response before fitting.
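
A short sketch confirming that the offset fit is just a re-expression of moving svi to the left-hand side (the name shifted.lm is introduced here only for illustration):

shifted.lm = lm(I(lpsa - svi) ~ Z2 + lweight + age + lbph + lcp + pgg45, data=prostate)
print(max(abs(coef(shifted.lm) - coef(constrained.lm))))   # essentially zero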

General linear hypothesis

An alternative version of the $F$ test can be derived that does not require refitting a model.

Suppose we want to test $$H_0:C_{q \times (p+1)}\beta_{(p+1) \times 1} = h$$ vs. $$ H_a :C_{q \times (p+1)}\beta_{(p+1) \times 1} \neq h. $$

This can be tested via an $F$ test: $$ F = \frac{(C\hat{\beta}-h)^T \left(C(X^TX)^{-1}C^T \right)^{-1} (C\hat{\beta}-h) / q}{SSE(F) / df_F} \overset{H_0}{\sim} F_{q, n-p-1}. $$

Note: we are assuming that $\text{rank}(C(X^TX)^{-1}C^T)=q$.

Here's a function that implements this computation.

In [28]:
%%R
general.linear = function(model.lm, linear_part, null_value=0) {
    # shorthand
    A = linear_part
    b = null_value
    beta.hat = coef(model.lm)
    V = as.numeric(A %*% beta.hat - null_value)
    invcovV = solve(A %*% vcov(model.lm) %*% t(A)) # the MSE is included in vcov
    
    df.num = nrow(A)
    df.den = model.lm$df.resid
    F = t(V) %*% invcovV %*% V / df.num
    p.value = 1 - pf(F, df.num, df.den)
    return(data.frame(F, df.num, df.den, p.value))
}

Let's verify this works by repeating our test of $\beta_{\tt lcp}=\beta_{\tt pgg45}=0$.

In [29]:
%%R
A = matrix(0, 2, 8)
A[1,7] = 1
A[2,8] = 1
print(A)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    0    0    0    0    0    0    1    0
[2,]    0    0    0    0    0    0    0    1

In [30]:
%%R
print(general.linear(prostate.lm, A))
f.test.lm(reduced.lm, prostate.lm)
         F df.num df.den   p.value
1 1.372057      2     89 0.2588958
         F df.num df.den   p.value
1 1.372057      2     89 0.2588958
