A regression model is a model of the relationships between some
*covariates (predictors)* and an *outcome*.

Specifically, regression is a model of the *average* outcome *given or having fixed* the covariates.

We will consider the heights of mothers and daughters collected by Karl Pearson in the late 19th century.

One of our goals is to understand the height of the daughter, `D`, knowing the height of the mother, `M`.

A mathematical model might look like $$ D = f(M) + \varepsilon$$ where $f$ gives the average height of the daughter of a mother of height `M`, and $\varepsilon$ is *error*: not *every* daughter has the same height.

A statistical question: is there *any* relationship between covariates and outcomes -- is $f$ just a constant?
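To see what "regression is a model of the average outcome given the covariates" means, we can simulate from one concrete instance of the model in plain Python. The linear $f$ and the noise level below are made up for illustration, not estimated from the Pearson data:

```python
import random

random.seed(0)

def f(M):
    # a hypothetical "true" average daughter height, given mother's height M
    return 30 + 0.55 * M

# simulate many daughters of 65-inch mothers: D = f(M) + error
M = 65.0
draws = [f(M) + random.gauss(0, 2.0) for _ in range(100_000)]
avg_D = sum(draws) / len(draws)

# averaging D over daughters with the same M recovers f(M),
# because the errors average out
print(round(f(M), 2), round(avg_D, 2))
```

Individual daughters differ because of $\varepsilon$, but the *average* height of daughters of 65-inch mothers is $f(65)$.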

Let's create a plot of the heights of the mother/daughter pairs. The data is in an `R` package that can be downloaded from CRAN with the command:

```
install.packages("alr3")
```

If the package is not installed, you will get an error message when calling `library(alr3)`.

In [1]:

```
%%R -h 600 -w 600
library(alr3)
data(heights)
M = heights$Mheight
D = heights$Dheight
plot(M, D, pch = 23, bg = "red", cex = 2)
```

In [2]:

```
%%R -h 600 -w 600
plot(M, D, pch = 23, bg = "red", cex = 2)
height.lm = lm(D ~ M)
abline(height.lm, lwd = 3, col = "yellow")
```

How do we find this line? With a model.

We might model the data as $$ D = \beta_0+ \beta_1 M + \varepsilon. $$

This model is *linear* in $\beta_1$, the coefficient of `M` (the mother's height); it is a *simple linear regression model*.

Another model: $$ D = \beta_0 + \beta_1 M + \beta_2 M^2 + \beta_3 F + \varepsilon $$ where $F$ is the height of the daughter's father.

Also linear (in the coefficients of $M,M^2,F$).

Which model is better? We will need a tool to compare models... more to come later.
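Behind `lm(D ~ M)` above, the simple linear model is fit by least squares, which has a closed form. Here is a sketch in plain Python, using made-up heights (not the Pearson data):

```python
# closed-form least squares estimates for D = b0 + b1*M + error
M = [62.0, 64.0, 65.0, 66.0, 68.0, 70.0]   # hypothetical mothers' heights
D = [63.0, 64.5, 65.0, 66.5, 67.0, 69.0]   # hypothetical daughters' heights

n = len(M)
M_bar = sum(M) / n
D_bar = sum(D) / n

# slope: sum of cross-deviations over sum of squared deviations of M
b1 = sum((m - M_bar) * (d - D_bar) for m, d in zip(M, D)) / \
     sum((m - M_bar) ** 2 for m in M)
# intercept: forces the fitted line through the point of means
b0 = D_bar - b1 * M_bar

print(b0, b1)
```

These are exactly the coefficients `lm` reports for a simple linear regression; `abline(height.lm, ...)` draws the line $D = \hat\beta_0 + \hat\beta_1 M$.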

Our example here was rather simple: we only had one independent variable. Independent variables are sometimes called *features* or *covariates*. In practice, we often have many more than one independent variable.

This example considers the effect of right-to-work legislation (which varies by state) on various factors. A description of the data can be found here.

The variables are:

- `Income`: income for a four-person family
- `COL`: cost of living for a four-person family
- `PD`: population density
- `URate`: rate of unionization in 1978
- `Pop`: population
- `Taxes`: property taxes in 1972
- `RTWL`: right-to-work indicator

In this example, we are primarily interested in the relationship between `RTWL` and `Income`. However, we should recognize that other variables have an effect on `Income`. Let's look at some of these relationships.

In [3]:

```
%%R
url = "http://www.ilr.cornell.edu/~hadi/RABE4/Data4/P005.txt"
rtw.table <- read.table(url, header=T, sep='\t')
attach(rtw.table)
print(head(rtw.table))
```

A graphical way to visualize the relationship between `Income` and `RTWL` is the *boxplot*.

In [4]:

```
%%R -h 600 -w 600
boxplot(Income ~ RTWL, col='orange', pch=23, bg='red')
```

Another variable that affects `Income` is the cost of living, `COL`. It also varies between right-to-work states.

In [5]:

```
%%R -h 600 -w 600
boxplot(COL ~ RTWL, col='orange', pch=23, bg='red')
```

In [6]:

```
%%R -h 600 -w 600
par(mfrow=c(2,2))
plot(URate, COL, pch=23, bg='red')
plot(URate, Income, pch=23, bg='red')
plot(URate, Pop, pch=23, bg='red')
plot(COL, Income, pch=23, bg='red')
```

`R` has a built-in function, `pairs`, that will try to display all pairwise relationships in a given dataset.

In [7]:

```
%%R -h 600 -w 600
pairs(rtw.table, pch=23, bg='red')
```

One observation stands out in these plots; let's print it and redo the pairwise plots without it. (Note that `R` uses 1-based instead of 0-based indexing for rows and columns of arrays.)

In [8]:

```
%%R
print(rtw.table[27,])
pairs(rtw.table[-27,], pch=23, bg='red')
```

Some of the main goals of this course:

- Build a statistical model describing the *effect* of `RTWL` on `Income`.
- This model should recognize that other variables also affect `Income`.
- What sort of *statistical confidence* do we have in our conclusion about `RTWL` and `Income`?
- Is the model adequate to describe this dataset?
- Are there other (simpler, more complicated) models?

In [9]:

```
%%R
X = c(1,3,5,7,8,12,19)
print(X)
print(mean(X))
print((X[1]+X[2]+X[3]+X[4]+X[5]+X[6]+X[7])/7)
print(sum(X)/length(X))
```

In [10]:

```
%%R -h 600 -w 600
url = 'http://lib.stat.cmu.edu/DASL/Datafiles/Calcium.html'
calcium.table = read.table(url, header=T, skip=28, nrow=21)
attach(calcium.table)
treated = Decrease[(Treatment == 'Calcium')]
placebo = Decrease[(Treatment == 'Placebo')]
treated
```

Given a sample of numbers $X=(X_1, \dots, X_n)$ the sample standard deviation $S_X$ is $$ S^2_X = \frac{1}{n-1} \sum_{i=1}^n (X_i-\overline{X})^2.$$
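The same formula in plain Python, checked against the standard library's `statistics.stdev` (which also uses the $n-1$ divisor):

```python
import statistics

X = [1, 3, 5, 7, 8, 12, 19]   # the small sample used above
n = len(X)
X_bar = sum(X) / n

# sample variance with the n-1 divisor, then its square root
S2 = sum((x - X_bar) ** 2 for x in X) / (n - 1)
S = S2 ** 0.5

print(S, statistics.stdev(X))
```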

In [11]:

```
%%R
S2 = sum((treated - mean(treated))^2) / (length(treated)-1)
print(sqrt(S2))
print(sd(treated))
```

Given a sample of numbers $X=(X_1, \dots, X_n)$ the sample median is the *middle* of the sample: if $n$ is odd, it is the middle point of the ordered sample; if $n$ is even, it is the average of the middle two points.
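A minimal Python version of this rule, checked against the standard library's `statistics.median`:

```python
import statistics

def sample_median(xs):
    s = sorted(xs)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]                       # odd n: the middle point
    return (s[n // 2 - 1] + s[n // 2]) / 2     # even n: average of middle two

print(sample_median([1, 3, 5, 7, 8, 12, 19]))  # odd n → 7
print(sample_median([1, 3, 5, 7]))             # even n → 4.0
```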

In [12]:

```
%%R
print(X)
median(X)
```

In [13]:

```
%%R
quantile(X,0.25)
```

We've already seen a boxplot. Another common statistical summary is a histogram.

In [14]:

```
%%R -h 600 -w 600
hist(treated, main='Treated group', xlab='Decrease', col='orange')
```

Suppose we want to determine the efficacy of a new drug on blood pressure.

Our study design is: we will treat a large patient population (maybe not so large: budget constraints limit it to $n=20$) with the drug and measure their blood pressure before and after taking the drug.

One way to conclude that the drug is effective is if the blood pressure has decreased, that is, if the average difference between before and after is positive.

The *null hypothesis*, $H_0$, is: *the average difference is less than zero.* The *alternative hypothesis*, $H_a$, is: *the average difference is greater than zero.*

Sometimes (actually, often), people will test the alternative, $H_a$: *the average difference is not zero*, vs. $H_0$: *the average difference is zero.* The test is performed by estimating the average difference and converting to standardized units.

Formally, we could set up the above test as drawing from a box of *differences in blood pressure*.

A box model is a useful theoretical device that describes the experiment under consideration. In our example, we can think of the sample of decreases drawn 20 patients at random from a large population (box) containing all the possible decreases in blood pressure.

In our box model, we will assume that the decrease is an integer drawn at random from $-3$ to 6.

We will draw 20 random integers from -3 to 6 with replacement and test whether the mean of our "box" is 0 or not.

In [15]:

```
%%R
mysample = as.integer(10*runif(20))-3
mysample
```

The test is usually a $T$ test that has the form $$ T = \frac{\overline{X}-0}{S_X/\sqrt{n}} $$
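The same statistic in plain Python, on a hypothetical sample of 20 blood-pressure decreases (the hypothesized mean here is 0):

```python
import statistics

# hypothetical sample of 20 decreases in blood pressure
sample = [2, -1, 3, 0, 4, 1, -2, 5, 2, 1, 3, -3, 6, 0, 2, 4, -1, 1, 2, 3]
n = len(sample)

X_bar = statistics.mean(sample)
S_X = statistics.stdev(sample)          # n-1 divisor, as defined above

# standardized distance of the sample mean from the hypothesized mean 0
T = (X_bar - 0) / (S_X / n ** 0.5)
print(T)
```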

In [16]:

```
%%R
T = (mean(mysample) - 0) / (sd(mysample) / sqrt(20))
T
```

If $H_0$ is true, $T$ follows a $T$ distribution with $n-1=19$ *degrees of freedom*, and the 5% cutoff is

In [17]:

```
%%R
cutoff = qt(0.975, 19)
cutoff
```

The result of the test is

In [18]:

```
%%R
result = (abs(T) > cutoff)
result
```

If `result` is `TRUE`, then we reject $H_0$: the mean is 0; if it is `FALSE`, we do not reject. Of course, in this example we know the mean in our "box" is not 0, it is 1.5.

This rule can be visualized with the $T$ density. The total grey area is 0.05=5%, and the cutoff is chosen to be symmetric around zero and such that this area is exactly 5%.

For a test of size $\alpha$ we write this cutoff $t_{n-1,1-\alpha/2}$.

In [20]:

```
density_fig
```


Instead of testing a particular hypothesis, we might be interested in coming up with a reasonable range for the mean of our "box".

Statistically, this is done via a *confidence interval*. If the 5% cutoff is $q$ for our test, then the 95% confidence interval is $$ [\bar{X} - q S_X / \sqrt{n}, \bar{X} + q S_X / \sqrt{n}] $$ where we recall $q=t_{n-1,0.975}$ with $n=20$.

If we wanted a 90% confidence interval, we would use $q=t_{19,0.95}$. Why?
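A sketch of the interval in plain Python, on the same hypothetical sample as before. The standard library has no $t$ quantile function, so the cutoff $t_{19,0.975} \approx 2.093$ is hard-coded here rather than computed (it is `qt(0.975, 19)` in R):

```python
import statistics

# hypothetical sample of 20 decreases in blood pressure
sample = [2, -1, 3, 0, 4, 1, -2, 5, 2, 1, 3, -3, 6, 0, 2, 4, -1, 1, 2, 3]
n = len(sample)

X_bar = statistics.mean(sample)
S_X = statistics.stdev(sample)
q = 2.093                      # approx. t_{19, 0.975}

# 95% confidence interval for the mean of the "box"
L = X_bar - q * S_X / n ** 0.5
U = X_bar + q * S_X / n ** 0.5
print(L, U)
```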

In [25]:

```
%%R
L = mean(mysample) - cutoff*sd(mysample)/sqrt(20)
U = mean(mysample) + cutoff*sd(mysample)/sqrt(20)
data.frame(L,U)
```

In [26]:

```
%%R
cutoff = qt(0.95, 19)
L = c()
U = c()
covered = c()
for (i in 1:100) {
    mysample = as.integer(10*runif(20))-3
    l = mean(mysample) - cutoff*sd(mysample)/sqrt(20)
    u = mean(mysample) + cutoff*sd(mysample)/sqrt(20)
    L = c(L, l)
    U = c(U, u)
    covered = c(covered, (l < 1.5) * (u > 1.5))
}
sum(covered)
```

In [27]:

```
%%R -h 600 -w 600
mu = 1.5
plot(c(1, 100), c(-2.5+mu,2.5+mu), type='n', ylab='Confidence Intervals', xlab='Sample')
for (i in 1:100) {
    if (covered[i] == TRUE) {
        lines(c(i,i), c(L[i],U[i]), col='green', lwd=2)
    } else {
        lines(c(i,i), c(L[i],U[i]), col='red', lwd=2)
    }
}
abline(h=mu, lty=2, lwd=4)
```