Goodness of fit of regression: analysis of variance.

$F$-statistics.

Residuals.

Diagnostic plots.

Here are three pictures that help to describe different models we might fit.

This picture is meant to depict the regression model $$ Y = \gamma \cdot 1 + X \beta + \epsilon. $$

The $\gamma$ coefficient represents movement along the horizontal axis above, labelled $\pmb{1}$.

The $\beta$ coefficient represents movement along the axis $X$ above.

The vector $\hat{Y}$ is the vector of fitted values in the above model.

This picture is meant to depict the regression model $$ Y = \gamma \cdot 1 + \epsilon. $$

The $\gamma$ coefficient represents movement along the horizontal axis above, labelled $\pmb{1}$.

Since $\beta=0$, we have assumed there is no movement along the $X$ axis.

The vector $\hat{Y}$ is the vector of fitted values in the above model.

The above picture tries to capture both models in one image.

There is a new vector: $\hat{Y} - \bar{Y} \cdot \pmb{1}$. This vector is the difference in fits between the two previous models.

The closer $\hat{Y}$ is to the ${1}$ axis, the less "variation" there is along the $X$ axis.

This closeness can be measured by the length of the vector $\hat{Y}-\bar{Y} \cdot 1$.

The square of a vector's length is the sum of its elements squared. These quantities are usually referred to as

*sums of squares*.

$$ \begin{aligned} SSE &= \sum_{i=1}^n(Y_i - \widehat{Y}_i)^2 = \sum_{i=1}^n (Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 X_i)^2 \\ SSR &= \sum_{i=1}^n(\overline{Y} - \widehat{Y}_i)^2 = \sum_{i=1}^n (\overline{Y} - \widehat{\beta}_0 - \widehat{\beta}_1 X_i)^2 \\ SST &= \sum_{i=1}^n(Y_i - \overline{Y})^2 = SSE + SSR \\ R^2 &= \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \widehat{Cor}(\pmb{X},\pmb{Y})^2. \end{aligned} $$

The quantity $SSE$, or

*error sum of squares*, is the squared length of the vector $Y - \hat{Y}$ which protrudes perpendicular to the plane.The quantity $SSR$, or

*regression sum of squares*, is the length of the vector $\hat{Y} - \bar{Y} \cdot 1$.The quantity $SST$, or

*total sum of squares*, is the length of the vector $Y - \bar{Y} \cdot 1$.The quantity $R^2$ is a measure of the goodness of fit of the simple linear regression model. Values near 1 indicate much of the total variability in $Y$ is explained by the regression model.

Each sum of squares gets an extra bit of information associated to them, called their

*degrees of freedom*.Roughly speaking, the

*degrees of freedom*can be determined by dimension counting.The $SSE$ has $n-2$ degrees of freedom because it is the squared length of a vector that lies in $n-2$ dimensions. To see this, note that it is perpendicular to the 2-dimensional plane formed by the $X$ axis and the $1$ axis.

The $SST$ has $n-1$ degrees of freedom because it is the squared length of a vector that lies in $n-1$ dimensions. In this case, this vector is perpendicular to the $1$ axis.

The $SSR$ has 1 degree of freedom because it is the squared length of a vector that lies in the 2-dimensional plane but is perpendicular to the $1$ axis.

$$ \begin{aligned} MSE &= \frac{1}{n-2}\sum_{i=1}^n(Y_i - \widehat{Y}_i)^2 \\ MSR &= \sum_{i=1}^n(\overline{Y} - \widehat{Y}_i)^2 \\ MST &= \frac{1}{n-1}\sum_{i=1}^n(Y_i - \overline{Y})^2 \\ \end{aligned} $$

These sums of squares can be visualized by other means as well. We will illustrate with a synthetic dataset.

In [1]:

```
%%R -h 600 -w 600
X = seq(0, 20, length = 21)
Y = 0.5 * X + 1 + rnorm(21)
Y.lm = lm(Y ~ X)
meanY = mean(Y)
Yhat = predict(Y.lm)
plot(X,Y, pch=23, bg='red', cex=2)
```

*Y* values and the sample mean of the *Y*
values.

In [2]:

```
%%R -h 600 -w 600
plot(X, Y, pch = 23, bg = "red", main='Total sum of squares', cex=2)
abline(Y.lm, pch=23, col='green', lwd=2)
abline(h = meanY, col = "yellow", lwd = 2)
for (i in 1:21) {
points(X[i], meanY, pch = 23, bg = "yellow")
lines(c(X[i], X[i]), c(Y[i], meanY))
}
```

In [3]:

```
%%R -h 600 -w 600
plot(X, Y, pch = 23, bg = "red", main="Error sum of squares", cex=2)
abline(Y.lm, col = "green", lwd = 2)
for (i in 1:21) {
points(X[i], Yhat[i], pch = 23, bg = "green")
lines(c(X[i], X[i]), c(Y[i], Yhat[i]))
}
abline(h = meanY, col = "yellow", lwd = 2)
```

In [4]:

```
%%R -h 600 -w 600
plot(X, Y, pch = 23, bg = "red", main="Regression sum of squares", cex=2)
abline(Y.lm, col = "green", lwd = 2)
abline(h = meanY, col = "yellow", lwd = 2)
for (i in 1:21) {
points(X[i], Yhat[i], pch = 23, bg = "green")
points(X[i], meanY, pch = 23, bg = "yellow")
lines(c(X[i], X[i]), c(meanY, Yhat[i]))
}
```

As noted above, if the regression model fits very well, then $SSR$ will be large relative to $SST$. The $R^2$ score is just the ratio of these sums of squares.

We'll verify this on the `wages`

data.

In [5]:

```
%%R
url = "http://stats191.stanford.edu/data/wage.csv"
wages = read.table(url, sep = ",", header = T)
wages.lm = lm(logwage ~ education, data=wages)
```

Let's verify our claim $SST=SSE+SSR$:

In [6]:

```
%%R
SSE = sum(resid(wages.lm)^2)
SST = sum((wages$logwage - mean(wages$logwage))^2)
SSR = sum((mean(wages$logwage) - predict(wages.lm))^2)
data.frame(SST, SSE + SSR)
```

*summary* of *lm*.

In [7]:

```
%%R
F = (SSR / 1) / (SSE / wages.lm$df)
print(F)
summary(wages.lm)
```

Compare this to the *F-statistic* output here:

In [8]:

```
%%R
summary(wages.lm)
```

`F-statistic`

is
$$
F = \frac{(n-2) \cdot R^2}{1-R^2}
$$
where $n-2$ is `wages.lm$df`

.

In [9]:

```
2176*0.1351 / (1 - 0.1351)
```

Out[9]:

*correlation coefficient* because it is equal
to the absolute value of sample correlation coefficient of $X$ and $Y$.

In [10]:

```
%%R
cor(wages$education, wages$logwage)^2
```

After a $t$-statistic, the next most commonly encountered statistic is a $\chi^2$ statistic, or its closely related cousin, the $F$ statistic.

Roughly speaking, an $F$-statistic is a ratio of

*sample variances*: it has a numerator, $N$, and a denominator, $D$ that are independent.Let $$N \sim \frac{\chi^2_{\rm num} }{ df_{{\rm num}}}, \qquad D \sim \frac{\chi^2_{\rm den} }{ df_{{\rm den}}}$$ and define $$ F = \frac{N}{D}. $$

We say $F$ has an $F$ distribution with parameters $df_{{\rm num}}, df_{{\rm den}}$ and write $F \sim F_{df_{{\rm num}}, df_{{\rm den}}}$

The ratio $$ F=\frac{SSR/1}{SSE/(n-2)} = \frac{MSR}{MSE}$$ can be thought of as a

*ratio of variances*.In fact, under $H_0:\beta_1=0$, $$ F \sim F_{1, n-2} $$ because $$ \begin{aligned} SSR &= \|\hat{Y} - \bar{Y} \cdot 1\|^2 \\ SSE &= \|Y - \hat{Y}\|^2 \end{aligned} $$ and from our picture, these vectors are orthogonal.

The null hypothesis $H_0:\beta_1=0$ implies that $SSR \sim \chi^2_1 \cdot \sigma^2$.

If $T \sim t_{\nu}$, then $$ T^2 \sim \frac{N(0,1)^2}{\chi^2_{\nu}/\nu} \sim \frac{\chi^2_1/1}{\chi^2_{\nu}/\nu}.$$

In other words, the square of a $t$-statistic is an $F$-statistic. Because it is always positive, an $F$-statistic has no

*direction*associated with it.In fact $$ F = \frac{MSR}{MSE} = \frac{\widehat{\beta}_1^2}{SE(\widehat{\beta}_1)^2}.$$ Let's check this in our example.

In [11]:

```
%%R
summary(wages.lm)
```

*education* is the $t$-statistic for the parameter $\beta_1$ under $H_0:\beta_1=0$. Its value
is 18.4 above. If we square it, we should get about the same as the *F-statistic*.

In [12]:

```
18.44**2
```

Out[12]:

In regression, the numerator is usually a difference in

*goodness of fit*of two (nested) models.The denominator is $\hat{\sigma}^2$ -- an estimate of $\sigma^2$.

In our example today: the bigger model is the simple linear regression model, the smaller is the model with constant mean (one sample model).

If the $F$ is large, it says that the

*bigger*model explains a lot more variability in $Y$ (relative to $\sigma^2$) than the smaller one.

If the null hypothesis is true, $\hat{Y}$ will be close to $\bar{Y} 1$.

How close? We must compare to the size of the noise, i.e. $\sigma^2$.

Not knowing $\sigma^2$, we substitute our estimate $\hat{\sigma}^2$.

The $F$ statistic should compare two models. What are these models?

The

*full model*would be $$ (FM) \qquad Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i $$The

*reduced model*would be $$ (RM) \qquad Y_i = \beta_0 + \varepsilon_i $$The $F$-statistic then has the form $$ F=\frac{(SSE(RM) - SSE(FM)) / (df_{RM} - df_{FM})}{SSE(FM) / df_{FM}} $$

The

*null hypothesis*is $$ H_0: \text{model (RM) is correct}. $$The usual $\alpha$ rejection rule would be to reject $H_0$ if the $F_{\text{obs}}$ the observed $F$ statistic is greater than $F_{1,n-2,1-\alpha}$.

In our case, the observed $F$ was 340, $n-2=2176$ and the appropriate threshold is computed below to be 3.85. Therefore, we strongly reject $H_0$.

In [13]:

```
%%R
qf(0.95, 1, 2176)
```

Using a linear regression function can be wrong: maybe regression function should be quadratic.

We assumed independent Gaussian errors with the same variance. This may be incorrect.

- The errors may not be normally distributed.
- The errors may not be independent.
- The errors may not have the same variance.

Detecting problems is more

*art*then*science*, i.e. we cannot*test*for all possible problems in a regression model.

The basic idea of most diagnostic measures is the following. *If the model is correct then
residuals $e_i = Y_i -\widehat{Y}_i, 1 \leq i \leq n$ should look like a sample of
(not quite independent) $N(0, \sigma^2)$ random variables.*

In [14]:

```
%%R -h 600 -w 600
y = anscombe$y2
x = anscombe$x2
y = y + rnorm(length(y)) * 0.45
plot(x, y, pch = 23, bg = "orange", cex = 2, ylab = "Y",
xlab = "X")
simple.lm = lm(y ~ x)
abline(simple.lm, lwd = 2, col = "red", lty = 2)
```

In [15]:

```
%%R -h 600 -w 600
plot(x, resid(simple.lm), ylab = "Residual", xlab = "X", pch = 23,
bg = "orange", cex = 2)
abline(h = 0, lwd = 2, col = "red", lty = 2)
```

*multiple linear regression model*.

In [16]:

```
%%R -h 600 -w 600
quadratic.lm = lm(y ~ poly(x, 2))
Xsort = sort(x)
plot(x, y, pch = 23, bg = "orange", cex = 2, ylab = "Y",
xlab = "X")
lines(Xsort, predict(quadratic.lm, list(x = Xsort)), col = "red", lty = 2,
lwd = 2)
```

In [17]:

```
%%R -h 600 -w 600
plot(x, resid(quadratic.lm), ylab = "Residual", xlab = "X", pch = 23,
bg = "orange", cex = 2)
abline(h = 0, lwd = 2, col = "red", lty = 2)
```

Another common diagnostic plot is the *qqplot* where *qq* stands for *Quantile-Quantile*. Roughly speaking, a qqplot is designed to see if the quantiles of two distributions match.

The function

*qqnorm*can be used to ascertain if a sample of numbers are roughly normally distributed. If the points lie on the diagonal line, this is evidence that the sample is normally distributed. Various departures from the diagonal indicate skewness, asymmetry, etc.If $e_i, 1\leq i \leq n$ were really a sample of $N(0, \sigma^2)$ then their sample quantiles should be close to the sample quantiles of the $N(0, \sigma^2)$ distribution.

In [18]:

```
%%R -h 600 -w 600
qqnorm(resid(simple.lm), pch = 23, bg = "orange", cex = 2)
```

In [19]:

```
%%R -h 600 -w 600
qqnorm(resid(quadratic.lm), pch = 23, bg = "orange", cex = 2)
```

One plot that is sometimes used to determine whether the variance is constant or not is a plot of $X$ against $e=Y-\hat{Y}$. If there is a pattern to the spread in this plot, it may indicate that the variance changes as a function of $X$. In our earlier plots, we noticed a trend in this plot, not necessarily evidence of changing variance.

The dataset below, taken from some work done with Dr. Robert Shafer here at Stanford http://hivdb.stanford.edu, plots HIV virus load against a score related to the the genetic makeup of a patientâ€™s virus shows clear non-constant variance. It also provides a clear example of an outlier, or a point that is a clear departure from the model.

In [20]:

```
%%R -h 800 -w 800
url = 'http://stats191.stanford.edu/data/HIV.VL.table'
viral.load = read.table(url, header=T)
attach(viral.load)
plot(GSS, VL, pch=23, bg='orange', cex=2)
viral.lm = lm(VL ~ GSS)
abline(viral.lm, col='red', lwd=2)
```

In [21]:

```
%%R -h 800 -w 800
good = (VL < 200000)
plot(GSS, VL, pch=23, bg='orange', cex=2)
viral.lm.good = lm(VL ~ GSS, subset=good)
abline(viral.lm.good, col='green', lwd=2)
```

In [22]:

```
%%R -h 800 -w 800
plot(GSS[good], resid(viral.lm.good), pch=23,
bg='orange', cex=2, xlab='GSS', ylab='Residual')
abline(h=0, lwd=2, col='red', lty=2)
```

Outliers can be obvious to spot (or not) but very difficult to define rigorously. Roughly speaking, they points where the model really does not fit.

They might correspond to mistakes in data transcription, lab errors, who knows? If possible, they should be identified and (hopefully) explained.