Notebook

Multiple Regression¶

This section is based partly on Freedman, D.A., 2009. Statistical Models: Theory and Practice, Revised Edition, Cambridge University Press.

Statistical models, and regression in particular, are used primarily for three purposes:

Description: to summarize data
Prediction: to predict future data
Causal Inference: to predict what would happen in response to an intervention

It is straightforward to check whether a regression model is a good summary of existing data, although there is some subtlety in determining whether the summary is good enough. How to measure goodness of fit appropriately is not always obvious, and adequacy of fit depends on the use of the summary.

Prediction is harder than description because it involves extrapolation: how can one tell what the future will bring? Why should the future be like the past? Is the system under study stable (i.e., stationary) or are its properties changing with time?

However, the hardest of these tasks is causal inference. The biggest difficulty in drawing causal inferences is confounding, especially when the data arise from observational studies rather than randomized, controlled experiments. (Natural experiments lie somewhere in between; there are few that approach the utility of a randomized controlled experiment, but John Snow's study of the communication of cholera is a notable exception.)

Confounding happens when one factor or variable manifests in the outcome in a way that cannot be distinguished from the treatment.

Stratification (e.g., cross tabulation) can help reduce confounding. So can modeling—in some cases, but not in others. For modeling to help, it is generally necessary for the structure of the model to correspond to how the data were actually generated. Unfortunately, most models in science, especially social science, are chosen out of habit or computational convenience, not because they have a basis in the science itself. This often produces misleading results, and the misleading impression that those results have small uncertainties.

Some notation¶

If $\{x_i\}_{i=1}^n$ is a list of numbers, $$ \bar{x} \equiv \frac{1}{n} \sum_{i=1}^n $$ is the mean of the list; $$ \mbox{var } x \equiv \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 $$ is the variance of the list; $$ s_x \equiv \sqrt{\mbox{var } x} $$ is the SD or standard deviation of the list; and $$ z_i \equiv \frac{x_i - \bar{x}}{s_x} $$ is $x_i$ in standard units. With $\bar{y}$ and $s_y$ defined analogously for the list $\{y_i\}_{i=1}^n$, and if $s_x$ and $s_y$ are nonzero,

$$ r_{xy} \equiv \frac{1}{n} \sum_{i=1} \frac{x_i - \bar{x}}{s_x} \cdot \frac{y_i-\bar{y}}{s_y} $$

is the correlation of $x$ and $y$.

Bivariate regression¶

See SticiGui: Regression

Suppose we observe pairs $\{(x_i, y_i)\}_{i=1}^n$. What straight line $y = a + bx$ comes closest to fitting these data, in the least-squares sense? That is, what values $a$ and $b$ minimize $$ \mbox{mean squared residual} = \mbox{MSR} = \equiv \frac{1}{n}\sum_{i=1}^n \left ( y_i - (a + bx_i) \right )^2? $$ The solution is $b = r_{xy} \frac{s_y}{s_x}$ and $a = \bar{y} - b \bar{x}$.

Proof. The mean squared residual is $$ \mbox{MSR} = \frac{1}{n}\sum_{i=1}^n \left ( y_i - (a + bx_i) \right )^2 = \frac{1}{n}\sum_{i=1}^n \left ( y_i^2 - 2y_i(a + bx_i) + (a+bx_i)^2 \right ) = \frac{1}{n}\sum_{i=1}^n \left ( y_i^2 - 2y_i(a + bx_i) + a^2+2abx_i + b^2x_i^2 \right ). $$

This is quadratic in both $a$ and $b$, and has positive leading coefficients in both. Hence, its minimum occurs at a stationary point with respect to both $a$ and $b$. We can differentiate inside the sum and solve for the stationary point:

$$ 0 = \frac{\partial \mbox{MSR}}{\partial a} = \frac{1}{n} \sum_{i=1}^n 2(y_i - (a+b x_i)) = 2 (\bar{y} - a - b \bar{x}). $$

Hence, the optimal $a$ solves $a = \bar{y} - b \bar{x}$. Substituting this into the expression for mean squared residual gives

$$ \mbox{MSR} = \frac{1}{n} \sum_{i=1}^n \left ( y_i - (\bar{y} - b \bar{x} + bx_i) \right )^2 = \frac{1}{n} \sum_{i=1}^n \left ( y_i - (\bar{y} - b (\bar{x} - x_i)) \right )^2. $$

Differentiating with respect to $b$ and finding the stationary point gives $$ 0 = \frac{\partial \mbox{MSR}}{\partial b} = \frac{1}{n} \sum_{i=1}^n 2 \left ( y_i - (\bar{y} - b (\bar{x} - x_i) ) \right )(\bar{x} - x_i) = \frac{1}{n} \sum_{i=1}^n ( y_i - \bar{y})(\bar{x} - x_i) + \frac{b}{n} \sum_{i=1}^n (\bar{x}-x_i)^2 = s_x s_y r_{xy} - b s_x^2, $$

i.e.,

$$ b = \frac{s_x s_y r_{xy}}{s_x^2} = r_{xy}\frac{s_y}{s_x}. $$

The mean of the squared residuals from the regression line has a simple expression:

$$\frac{1}{n} \sum_{i=1}^n e_i^2 = \frac{1}{n}\sum_{i=1}^n \left ( y_i - \bar{y} + r_{xy} \frac{s_y}{s_x} \bar{x} - r_{xy} \frac{s_y}{s_x} x_i) \right )^2 $$$$ = \frac{1}{n} \sum_{i=1}^n ( y_i - \bar{y})^2 + 2r_{xy} \frac{s_y}{s_x} \frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})(\bar{x} - x_i) + r_{xy}^2 \frac{s_y^2}{s_x^2} \frac{1}{n} \sum_{i=1}^n (\bar{x} - x_i)^2 $$$$ = s_y^2 + 2r_{xy} \frac{s_y}{s_x} (-r_{xy}s_x s_y) + r_{xy}^2 \frac{s_y^2}{s_x^2} s_x^2 $$$$ = s_y^2 - 2 r^2 s_y^2 + r^2 s_y^2 $$$$ = s_y^2 (1 - r_{xy}^2). $$

Some statistical terminology: measuring error¶

Suppose that $\theta \in \Re^p$ is a parameter, and let $\hat{\theta}$ be an estimator of $\theta$. The bias of $\hat{\theta}$ is

$$ \mbox{bias } \hat{\theta} = {\mathbb E}(\hat{\theta} - \theta) = {\mathbb E} \hat{\theta} - \theta. $$

An estimate $\hat{\theta}$ of a parameter $\theta \in \Theta$ is unbiased if ${\mathbb E}_\eta \hat{\theta} = \eta$ for all $\eta \in \Theta$.

The mean squared error (MSE) of an estimate $\hat{\theta}$ of $\theta$ is

$$ \mbox{MSE } \hat{\theta} = {\mathbb E} \| \hat{\theta} - \theta \|^2. $$

The variance of $\hat{\theta}$ is

$$ \mbox{var } \hat{\theta} = {\mathbb E} \| \hat{\theta} - {\mathbb E} \hat{\theta} \|^2. $$

Proposition.

$$ \mbox{MSE } \hat{\theta} = \mbox{var } \hat{\theta} + \| \mbox{bias} \hat{\theta} \|^2. $$

Proof.

$$ \| \hat{\theta} - \theta \|^2 = \| \hat{\theta} - {\mathbb E} \hat{\theta} - (\theta - {\mathbb E} \hat{\theta}) \|^2 = (\hat{\theta} - {\mathbb E} \hat{\theta} - (\theta - {\mathbb E} \hat{\theta}))' (\hat{\theta} - {\mathbb E} \hat{\theta} - (\theta - {\mathbb E} \hat{\theta})) $$$$ = (\hat{\theta} - {\mathbb E} \hat{\theta})'(\hat{\theta} - {\mathbb E} \hat{\theta}) + 2 (\hat{\theta} - {\mathbb E} \hat{\theta})'(\theta - {\mathbb E} \hat{\theta}) + (\theta - {\mathbb E} \hat{\theta})'(\theta - {\mathbb E} \hat{\theta}) $$$$ = \| \hat{\theta} - {\mathbb E} \hat{\theta} \|^2 + 2(\hat{\theta} - {\mathbb E} \hat{\theta})'(\theta - {\mathbb E} \hat{\theta}) + \| \theta - {\mathbb E} \hat{\theta} \|^2. $$

Hence,

$$ \mbox{MSE } \hat{\theta} = {\mathbb E} \| \hat{\theta} - {\mathbb E} \hat{\theta} \|^2 + {\mathbb E} \left ( 2(\hat{\theta} - {\mathbb E} \hat{\theta})'(\theta - {\mathbb E} \hat{\theta}) \right ) + {\mathbb E} \| \theta - {\mathbb E} \hat{\theta} \|^2 $$$$ = {\mathbb E} \| \hat{\theta} - {\mathbb E} \hat{\theta} \|^2 + 2 \left ({\mathbb E} \hat{\theta} - {\mathbb E} \hat{\theta})'(\theta - {\mathbb E} \hat{\theta}) \right ) + {\mathbb E} \| \theta - {\mathbb E} \hat{\theta} \|^2 $$$$ = \mbox{var } \hat{\theta} + \| \mbox{bias } \hat{\theta} \|^2. $$

Multiple Regression¶

Multiple linear regression is a commonly used tool in most branches of science, as well as economics. It is frequently misinterpreted.

We will start with an introduction to the linear algebra of constructing the least-squares estimate for linear regression, then discuss the features and limitations of regression, especially the limitations of drawing causal inferences from regression models.

Notation¶

The basic relationship for linear regression is the equation

$$ Y = X \beta + \epsilon. $$

Here, $Y$ is an $n$-vector of data, called the dependent variable, the response, the explained variables, the modeled variables, or the left hand side. The matrix $X \in {\mathcal M}(n,p)$ with $n \ge p$, $\mbox{rank}(X)=p$ (full rank) so that the columns of $X$ are linearly independent.

The vector $\beta$ is a $p$-vector of parameters, model parameters, or coefficients.
The usual goal of regression is to estimate $\beta$.

The vector $\epsilon$ is an $n$-vector of error, disturbance, or noise.

We will usually assume that $\epsilon$ is random, which makes $Y$ random as well. We will usually treat $X$ as fixed rather than random.

There is an observation $Y_i$ for each unit of observation, a row of $X$ for each observation, and a column of $X$ for each parameter (element of $\beta$). The columns correspond to explanatory variables, independent variables, covariates, control variables, or right-hand side variables.

We have observaations of $Y$, which are assumed to be values of $X\beta + \epsilon$. The value of $\beta$ is unknown. The value of $\epsilon$ cannot be observed.

The standard assumption in multiple regression is that the noise terms $\epsilon$ are random, with

$\{\epsilon_i\}$ iid (independent and identically distributed)
${\mathbb E}\epsilon_i = 0$
$\mbox{var}\epsilon_i = \sigma^2$, generally a known constant

It is also standard to assume that if $X$ is random, $X$ and $\epsilon$ are independent. Regardless, the value of $X$ is observed$.

Since $X$ is observed, we can find $\mbox{rank}(X)$. But there is no way to test whether the main assumptions are true, that is, whether

$Y = X\beta + \epsilon$
$\{ \epsilon_i \}$ are iid with mean 0 and finite variance
$X$ and $\epsilon$ are independent.

It is common to condition on $X$; that is, to treat it as fixed erather than random.

Ordinary least squares¶

The ordinary least squares (OLS) estimate of $\beta$ is

$$ \hat{\beta} \equiv \left ( X'X \right )^{-1}X' Y. $$

The estimate $\hat{\beta}$ is a $p$-vector, just like $\beta$.

The residuals or fitting errors are $$ e \equiv Y - X \hat{\beta}. $$

__Theorem.__

$e \perp X$ (i.e., $X'e = {\mathbf 0}$: the fitting errors are orthogonal to $X$)
$\min_{\gamma \in \Re^p} \| Y - X\gamma\|^2 = \| Y - X \hat{\beta}\|^2$ ($\hat{\beta}$ solves the least squares fitting problem)

This is a special case of the Projection Theorem.

Proof.

We prove the first part by direct calculation: $ X'e = X' (Y - X \hat{\beta})$, so

$$ \left ( X'X \right )^{-1} X' e = \left ( X'X \right )^{-1} X' Y - \left ( X'X \right )^{-1} X'X \hat{\beta} = \hat{\beta} - \hat{\beta} = 0. $$

For the second part, write $\gamma = \hat{\beta} + (\gamma - \hat{\beta})$. Then

$$ \| Y - X \gamma \|^2 = \| Y - X \hat{\beta} - X(\gamma - \hat{\beta}) \|^2 = \| e - X (\gamma - \hat{\beta})\|^2 $$$$ = \| e \|^2 + 2 e' X(\gamma - \hat{\beta}) + \| X(\gamma - \hat{\beta}) \|^2 \ge \| e \|^2 $$

since $e'X = {\mathbf 0}$.

What does that orthogonality mean? Finding $\hat{\beta}$ amounts to finding the coefficients in a linear combination of columns of $X$ that give the best approximation to $Y$, in a least-squares sense. That is, we are approximating $Y \in \Re^n$ by an element of the $p$-dimensional subspace spanned by the columns of $X$. If the residuals had a non-zero component in that subspace, there would be a better approximation to $Y$ from that subspace.

For instance, if any column of $X$ is constant, then the residuals should have zero mean: $\sum_{i=1}^n e_j = 0$. Otherwise, the dot product of the residuals with that column of $X$ would not be zero: the residuals would not be orthogonal to $X$, and there would be an element of that subspace that is a better approximation to $Y$.

Let's consider statistical properties of $\hat{\beta}$. We start with (conditional) bias; later we will get to the MSE.

__Theorem.__ The ordinary least squares estimate $\hat{\beta}$ is _conditionally unbiased_: ${\mathbb E} (\hat{\beta} \mid X) = \beta$.

__Proof. __ By calculation:

$$ \hat{\beta} = (X'X)^{-1} X'Y = (X'X)^{-1} X'(X\beta + \epsilon) $$$$ = (X'X)^{-1}X'X\beta + (X'X)^{-1}X'\epsilon = \beta + (X'X)^{-1}X' \epsilon. $$

Now, $$ {\mathbb E} \left ( (X'X)^{-1} X' \epsilon \mid X \right ) = (X'X)^{-1} X' E(\epsilon \mid X). $$

Since $\epsilon$ and $X$ are independent, $$ {\mathbb E} (\epsilon \mid X) = {\mathbb E}\epsilon = 0. $$

Hence, $$ {\mathbb E} (\hat{\beta} \vert X) = {\mathbb E}(\beta + (X'X)^{-1}X'\epsilon \mid X) = \beta + {\mathbf 0} = \beta. $$

Computational example¶

We drop an object from an unknown height $h$, with an unknown velocity $v$, subject to an unknown constant acceleration $a$.

We measure the height $Y_i$ of the object at times 0, 1, 2, and 3 seconds:

$t_i$	$Y_i$
0	2.043
1	6.856
2	22.565
3	47.709

The height at time $t$ is

$$ Y_i = h + v t_i + (1/2) a t_i^2 + \epsilon_i. $$

This has three unknown parameters: the initial height $h$, the initial velocity $v$, and the acceleration $a$. It is an (unverifiable) assumption that the noise terms $\{\epsilon_i\}$ are independent and have zero mean and finite variance.

We can expand the data relationship as follows:

$$Y_1 = h + vt_1 + 1/2 a t_1^2 + \epsilon_1 = h \times 1 + v \times 0 + a \times 0 + \epsilon_1$$$$Y_2 = h + vt_2 + 1/2 a t_2^2 + \epsilon_2 = h \times 1 + v \times 1 + a \times (1/2)1^2 + \epsilon_2$$$$Y_3 = h + vt_3 + 1/2 a t_3^2 + \epsilon_3 = h \times 1 + v \times 2 + a \times (1/2)2^2 + \epsilon_3$$$$Y_4 = h + vt_4 + 1/2 a t_3^4 + \epsilon_4 = h \times 1 + v \times 3 + a \times (1/2)3^2 + \epsilon_3.$$

We can write these relations in matrix form, as follows.

Define $$ Y \equiv \begin{pmatrix} 2.043 \\ 6.856 \\ 22.565 \\ 47.709 \end{pmatrix}, $$

$$ X \equiv \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0.5 \\ 1 & 2 & 2 \\ 1 & 3 & 4.5 \end{pmatrix}, $$$$ \beta \equiv \begin{pmatrix} h \\ v \\ a \end{pmatrix}, $$

and

$$ \epsilon \equiv \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \end{pmatrix} . $$

Then the data relationship can be written

$$ Y = X \beta + \epsilon. $$

The least squares estimate $\hat{\beta}$ of $\beta$ is $$ \hat{\beta} = (X'X)^{-1}X'Y. $$ If the Newtonian gravity data relations are correct and the observational noise $\epsilon$ really has zero mean, then formally $\hat{\beta}$ is an unbiased estimate of $\beta$. However, we should not calculate $\hat{\beta}$ by inverting $X'X$, as discussed previously. Rather, we should use matrix factorization or back-substitution.

In [49]:

# Least squares example in R
Y <- c(2.043, 6.856, 22.565, 47.709);
X <- matrix(c(1,0,0,1,1,0.5,1,2,2,1,3,9/2), nrow = 4, ncol=3, byrow = T);
X
betahat <- solve(t(X) %*% X, t(X) %*% Y);
betahat

Out[49]:

1	0	0
1.0	1.0	0.5
1	2	2
1.0	3.0	4.5

Out[49]:

1.96995

0.02245

10.1655

That is, we would estimate that the object was dropped from a height of 1.970 m with an initial velocity of 0.022 m/s, under gravitational acceleration of 10.166 m/s^2.

Let's find the residual vector and verify that it is orthogonal to $X$:

In [50]:

e <- Y - X %*% betahat
e
t(e) %*% X

Out[50]:

0.07305

-0.21915

0.21915

-0.07305

Out[50]:

-9.992007e-15

-2.753353e-14

-4.218847e-14

In [51]:

# R also has a function to fit linear models: lm.
# Here is its help function:
help(lm)

Out[51]:

lm {stats}

R Documentation

Fitting Linear Models

Description

lm is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although aov may provide a more convenient interface for these).

Usage

lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)

Arguments

`formula`	an object of class `"formula"` (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.
`data`	an optional data frame, list or environment (or object coercible by `as.data.frame` to a data frame) containing the variables in the model. If not found in `data`, the variables are taken from `environment(formula)`, typically the environment from which `lm` is called.
`subset`	an optional vector specifying a subset of observations to be used in the fitting process.
`weights`	an optional vector of weights to be used in the fitting process. Should be `NULL` or a numeric vector. If non-NULL, weighted least squares is used with weights `weights` (that is, minimizing `sum(w*e^2)`); otherwise ordinary least squares is used. See also ‘Details’,
`na.action`	a function which indicates what should happen when the data contain `NA`s. The default is set by the `na.action` setting of `options`, and is `na.fail` if that is unset. The ‘factory-fresh’ default is `na.omit`. Another possible value is `NULL`, no action. Value `na.exclude` can be useful.
`method`	the method to be used; for fitting, currently only `method = "qr"` is supported; `method = "model.frame"` returns the model frame (the same as with `model = TRUE`, see below).
`model, x, y, qr`	logicals. If `TRUE` the corresponding components of the fit (the model frame, the model matrix, the response, the QR decomposition) are returned.
`singular.ok`	logical. If `FALSE` (the default in S but not in R) a singular fit is an error.
`contrasts`	an optional list. See the `contrasts.arg` of `model.matrix.default`.
`offset`	this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be `NULL` or a numeric vector of length equal to the number of cases. One or more `offset` terms can be included in the formula instead or as well, and if more than one are specified their sum is used. See `model.offset`.
`...`	additional arguments to be passed to the low level regression fitting functions (see below).

Details

Models for lm are specified symbolically. A typical model has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. A terms specification of the form first + second indicates all the terms in first together with all the terms in second with duplicates removed. A specification of the form first:second indicates the set of terms obtained by taking the interactions of all terms in first with all terms in second. The specification first*second indicates the cross of first and second. This is the same as first + second + first:second.

If the formula includes an offset, this is evaluated and subtracted from the response.

If response is a matrix a linear model is fitted separately by least-squares to each column of the matrix.

See model.matrix for some further details. The terms in the formula will be re-ordered so that main effects come first, followed by the interactions, all second-order, all third-order and so on: to avoid this pass a terms object as the formula (see aov and demo(glm.vr) for an example).

A formula has an implied intercept term. To remove this use either y ~ x - 1 or y ~ 0 + x. See formula for more details of allowed formulae.

Non-NULL weights can be used to indicate that different observations have different variances (with the values in weights being inversely proportional to the variances); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations (including the case that there are w_i observations equal to y_i and the data have been summarized).

lm calls the lower level functions lm.fit, etc, see below, for the actual numerical computations. For programming only, you may consider doing likewise.

All of weights, subset and offset are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.

Value

lm returns an object of class "lm" or for multiple responses of class c("mlm", "lm").

The functions summary and anova are used to obtain and print a summary and analysis of variance table of the results. The generic accessor functions coefficients, effects, fitted.values and residuals extract various useful features of the value returned by lm.

An object of class "lm" is a list containing at least the following components:

`coefficients`	a named vector of coefficients
`residuals`	the residuals, that is response minus fitted values.
`fitted.values`	the fitted mean values.
`rank`	the numeric rank of the fitted linear model.
`weights`	(only for weighted fits) the specified weights.
`df.residual`	the residual degrees of freedom.
`call`	the matched call.
`terms`	the `terms` object used.
`contrasts`	(only where relevant) the contrasts used.
`xlevels`	(only where relevant) a record of the levels of the factors used in fitting.
`offset`	the offset used (missing if none were used).
`y`	if requested, the response used.
`x`	if requested, the model matrix used.
`model`	if requested (the default), the model frame used.
`na.action`	(where relevant) information returned by `model.frame` on the special handling of `NA`s.

In addition, non-null fits will have components assign, effects and (unless not requested) qr relating to the linear fit, for use by extractor functions such as summary and effects.

Using time series

Considerable care is needed when using lm with time series.

Unless na.action = NULL, the time series attributes are stripped from the variables before the regression is done. (This is necessary as omitting NAs would invalidate the time series attributes, and if NAs are omitted in the middle of the series the result would no longer be a regular time series.)

Even if the time series attributes are retained, they are not used to line up series, so that the time shift of a lagged or differenced regressor would be ignored. It is good practice to prepare a data argument by ts.intersect(..., dframe = TRUE), then apply a suitable na.action to that data frame and call lm with na.action = NULL so that residuals and fitted values are time series.

Note

Offsets specified by offset will not be included in predictions by predict.lm, whereas those specified by an offset term in the formula will be.

Author(s)

The design was inspired by the S function of the same name described in Chambers (1992). The implementation of model formula by Ross Ihaka was based on Wilkinson & Rogers (1973).

References

Chambers, J. M. (1992) Linear models. Chapter 4 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.

Wilkinson, G. N. and Rogers, C. E. (1973) Symbolic descriptions of factorial models for analysis of variance. Applied Statistics, 22, 392–9.

Examples

require(graphics)

## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
lm.D90 <- lm(weight ~ group - 1) # omitting intercept

anova(lm.D9)
summary(lm.D90)

opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
plot(lm.D9, las = 1)      # Residuals, Fitted, ...
par(opar)

### less simple examples in "See Also" above

[Package stats version 3.1.3 ]

In [52]:

# It will help to be familiar with "formula" objects in R:
help(formula)

Out[52]:

formula {stats}

R Documentation

Model Formulae

Description

The generic function formula and its specific methods provide a way of extracting formulae which have been included in other objects.

as.formula is almost identical, additionally preserving attributes when object already inherits from "formula".

Usage

formula(x, ...)
as.formula(object, env = parent.frame())

## S3 method for class 'formula'
print(x, showEnv = !identical(e, .GlobalEnv), ...)

Arguments

`x, object`	R object.
`...`	further arguments passed to or from other methods.
`env`	the environment to associate with the result, if not already a formula.
`showEnv`	logical indicating if the environment should be printed as well.

Details

The models fit by, e.g., the lm and glm functions are specified in a compact symbolic form. The ~ operator is basic in the formation of such models. An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names separated by : operators. Such a term is interpreted as the interaction of all the variables and factors appearing in the term.

In addition to + and :, a number of other operators are useful in model formulae. The * operator denotes factor crossing: a*b interpreted as a+b+a:b. The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c) which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions. The %in% operator indicates that the terms on its left are nested within those on the right. For example a + b %in% a expands to the formula a + a:b. The - operator removes the specified terms, so that (a+b+c)^2 - a:b is identical to a + b + c + b:c + a:c. It can also used to remove the intercept term: when fitting a linear model y ~ x - 1 specifies a line through the origin. A model with no intercept can be also specified as y ~ x + 0 or y ~ 0 + x.

While formulae usually involve just variable and factor names, they can also involve arithmetic expressions. The formula log(y) ~ a + log(x) is quite legal. When such arithmetic expressions involve operators which are also used symbolically in model formulae, there can be confusion between arithmetic and symbolic operator use.

To avoid this confusion, the function I() can be used to bracket those portions of a model formula where the operators are used in their arithmetic sense. For example, in the formula y ~ a + I(b+c), the term b+c is to be interpreted as the sum of b and c.

Variable names can be quoted by backticks `like this` in formulae, although there is no guarantee that all code using formulae will accept such non-syntactic names.

Most model-fitting functions accept formulae with right-hand-side including the function offset to indicate terms with a fixed coefficient of one. Some functions accept other ‘specials’ such as strata or cluster (see the specials argument of terms.formula).

There are two special interpretations of . in a formula. The usual one is in the context of a data argument of model fitting functions and means ‘all columns not otherwise in the formula’: see terms.formula. In the context of update.formula, only, it means ‘what was previously in this part of the formula’.

When formula is called on a fitted model object, either a specific method is used (such as that for class "nls") or the default method. The default first looks for a "formula" component of the object (and evaluates it), then a "terms" component, then a formula parameter of the call (and evaluates its value) and finally a "formula" attribute.

There is a formula method for data frames. If there is only one column this forms the RHS with an empty LHS. For more columns, the first column is the LHS of the formula and the remaining columns separated by + form the RHS.

Value

All the functions above produce an object of class "formula" which contains a symbolic model formula.

Environments

A formula object has an associated environment, and this environment (rather than the parent environment) is used by model.frame to evaluate variables that are not found in the supplied data argument.

Formulas created with the ~ operator use the environment in which they were created. Formulas created with as.formula will use the env argument for their environment.

References

Chambers, J. M. and Hastie, T. J. (1992) Statistical models. Chapter 2 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.

Examples

class(fo <- y ~ x1*x2) # "formula"
fo
typeof(fo)  # R internal : "language"
terms(fo)

environment(fo)
environment(as.formula("y ~ x"))
environment(as.formula("y ~ x", env = new.env()))


## Create a formula for a model with a large number of variables:
xnam <- paste0("x", 1:25)
(fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+"))))

[Package stats version 3.1.3 ]

In [53]:

# In R, there are always several ways to do anything!

# Here are several calls to lm() to fit the regression model. 
# Note that lm() uses QR factorization by default to solve least-squares problems (rather than inverting matrices)
# R uses a fairly standard "formula" notation.  
# "Y ~ X" models Y as a linear function of X.
# Y ~ X + I(X^2) models Y as a linear function of X and X^2.
# By default, R includes an intercept (constant) term. You can alter this, or take advantage of it.

lm(Y ~ X)  # the constant term (intercept) corresponding to h is redundant: lm() automatically inserts an intercept.
           # Since the model has a constant, X1 won't be estimated.

lm(Y ~ X[,2:3]) # this omits X1, the intercept column of X, from the model, since lm() automatically includes one.

lm(Y ~ X - 1)  # this tells lm() not to add an intercept term (we don't need it, since X already has a constant term).

time <- 0:3;  # set just the "time" variable
lm(Y ~ time + I(time^2/2))  # Model the position (Y) as an affine function of t and t^2. 
                            # The intercept (h) term will be included automatically.

Out[53]:

Call:
lm(formula = Y ~ X)

Coefficients:
(Intercept)           X1           X2           X3  
    1.96995           NA      0.02245     10.16550

Out[53]:

Call:
lm(formula = Y ~ X[, 2:3])

Coefficients:
(Intercept)    X[, 2:3]1    X[, 2:3]2  
    1.96995      0.02245     10.16550

Out[53]:

Call:
lm(formula = Y ~ X - 1)

Coefficients:
      X1        X2        X3  
 1.96995   0.02245  10.16550

Out[53]:

Call:
lm(formula = Y ~ time + I(time^2/2))

Coefficients:
(Intercept)         time  I(time^2/2)  
    1.96995      0.02245     10.16550

Standard errors¶

Recall that there is in general no way to test whether the regression model is true; in most applications, that the form of the model is correct is simply an assumption. If the model is not correct, the parameters in the model may be meaningless; and it is not obvious what uncertainties in estimates of the model parameters mean when the form of the model is wrong.

However, suppose that the regression model is true: $Y = X\beta + \epsilon$, where the components of $\epsilon$ are uncorrelated, with mean zero and variance $\sigma^2$, and $\epsilon$ and $X$ are independent. Then, as we have seen, $\hat{\beta}$ is an unbiased estimate of $\beta$, i.e., ${\mathbb E}\hat{\beta} = \beta$.

Moreover, we can find the covariance of the parameter estimates as follows:

__Theorem (variance of $\hat{\beta}$).__ $$ \mbox{cov}(\hat{\beta} \mid X) = \sigma^2 (X'X)^{-1}. $$

Proof. Since $\hat{\beta}$ is conditionally unbiased under these assumptions,

$$ \hat{\beta} - {\mathbb E}(\hat{\beta} \mid X) = (X'X)^{-1}X'(X\beta + \epsilon) - \beta = \beta + (X'X)^{-1}X'\epsilon - \beta = (X'X)^{-1}X'\epsilon. $$

Hence,

$$ \mbox{cov}(\hat{\beta} \mid X) = \mbox{cov}((X'X)^{-1}X'\epsilon \mid X) $$$$ = {\mathbb E} \left ( ((X'X)^{-1}X' \epsilon) ((X'X)^{-1}X' \epsilon)' \mid X \right ) $$$$ = {\mathbb E} \left ( (X'X)^{-1}X' \epsilon \epsilon' X (X'X)^{-1}) \mid X \right ) $$

(since $(X'X)^{-1} = ((X'X)')^{-1}$) $$ = (X'X)^{-1} X' {\mathbb E}(\epsilon \epsilon' \mid X) X (X'X)^{-1} $$ (by the linearity of expectation) $$ = (X'X)^{-1}X' \sigma^2 {\mathbf 1}_n X (X'X)^{-1} $$ $$ = \sigma^2 (X'X)^{-1}X'X(X'X)^{-1} = \sigma^2(X'X)^{-1}. $$

In typical applications, $\sigma^2$ is unknown. If we knew the disturbance term $\epsilon$, we could estimate $\sigma^2$ by $\frac{1}{n} \sum_{i=1}^n \epsilon_i^2$. But the disturbance terms $\epsilon$ are not observable. We do observe the residuals $e = Y - X\hat{\beta}$. If the model is true, we can use them to estimate $\sigma^2$.

The quantity $\frac{1}{n} \sum_{i=1}^n e_i^2$ tends to be an underestimate of $\sigma^2$ because the model is opimized precisely to minimize this quantity over all possible values of $\beta$. To compensate for that, we can divide the sum of squared residuals by $n-p$ rather than by $n$. The quantity

$$ \hat{\sigma}^2 \equiv \frac{1}{n-p} \sum_{i=1}^n e_i^2 $$

is a conditionally unbiased estimate of $\sigma^2$ given $X$, if the model is true. The number $n-p$ is called the degrees of freedom.

Define $$ \hat{\mbox{cov}}(\hat{\beta} | X) \equiv \hat{\sigma}^2 (X'X)^{-1}. $$

This is a conditionally unbiased estimate of the covariance of $\hat{\beta}$ given $X$. (Note that we do have to invert a matrix to find this!)

Let's find this quantity for our toy problem.

In [54]:

# estimate the uncertainty of the model estimate.
# first estimate sigma^2
e <- Y - X %*% betahat;  # the residuals
dims <- dim(X);  # the number of rows is n; the number of columns is p
sigmahat2 <- sum(e^2)/(dims[1]-dims[2]);
sigmahat2

Out[54]:

0.10672605

In [55]:

# now use this to find the estimate of cov(betahat)
covbhat <- sigmahat2 * solve(t(X) %*% X)
covbhat

Out[55]:

0.10138975	-0.11206235	0.05336302
-0.1120624	0.2614788	-0.1600891
0.05336302	-0.16008907	0.10672605

The estimated covariance matrix of $\hat{\beta}$ gives us standard errors for the individual components of $\hat{\beta}$. The estimated uncertainty of the $j$th parameter is the square-root of the $(j,j)$ element of $\hat{\mbox{cov}}(\hat{\beta} | X)$.

In [56]:

# find the square-root of the diagonal elements of the estimated covariance matrix
sqrt(diag(covbhat))

Out[56]:

0.318417567825646
0.511349999999999
0.326689531512719

So, for instance, the estimated standard error of the estimate of $h$ is 0.3184...

Our estimates, with estimated uncertainties, are now

$h = 1.97 \pm 0.318$

$v = 0.02 \pm 0.511$

$a = 10.16 \pm 0.327$

In "reality," the data for this problem were generated using $h=3$, $v=0$, and $a = 9.8$, and the components of $\epsilon$ were iid standard normal (pseudo-)random variables: $\epsilon_i \sim N(0,1)$. That is, $\sigma = 1$, while we estimated $\hat{\sigma} = 0.327$!

Let's find the actual errors and compare their magnitude to the estimated standard errors of their estimates. Student's t statistic is the true error divided by the estimated standard error.

In [57]:

# find betahat - beta
beta <- c(3, 0, 9.8);
betahat - beta
tstat <- (betahat - beta)/sqrt(diag(covbhat));  # this is Student's t statistic for the estimates
tstat

Out[57]:

-1.03005

0.02245

0.3655

Out[57]:

-3.234903

0.04390339

1.118799

The estimated value of $h$ was off by more than 3.23 estimated standard errors.

We haven't discussed this yet, but if the components of $\epsilon$ are iid normal, then the $t$-statistics would have Student's $t$ distribution with $n-p$ degrees of freedom. In this case, $n=4$ and $p=3$, so the $t$-statistics have $4-3=1$ degree of freedom.

The chance that such a variable would be larger in magnitude than 3.234903 can be found using the R function pt(), which finds probabilities for Student's $t$-distribution.

In [58]:

help(pt)

Out[58]:

TDist {stats}

R Documentation

The Student t Distribution

Description

Density, distribution function, quantile function and random generation for the t distribution with df degrees of freedom (and optional non-centrality parameter ncp).

Usage

dt(x, df, ncp, log = FALSE)
pt(q, df, ncp, lower.tail = TRUE, log.p = FALSE)
qt(p, df, ncp, lower.tail = TRUE, log.p = FALSE)
rt(n, df, ncp)

Arguments

`x, q`	vector of quantiles.
`p`	vector of probabilities.
`n`	number of observations. If `length(n) > 1`, the length is taken to be the number required.
`df`	degrees of freedom (> 0, maybe non-integer). `df = Inf` is allowed.
`ncp`	non-centrality parameter delta; currently except for `rt()`, only for `abs(ncp) <= 37.62`. If omitted, use the central t distribution.
`log, log.p`	logical; if TRUE, probabilities p are given as log(p).
`lower.tail`	logical; if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x].

Details

The t distribution with df = n degrees of freedom has density

f(x) = Γ((n+1)/2) / (√(n π) Γ(n/2)) (1 + x^2/n)^-((n+1)/2)

for all real x. It has mean 0 (for n > 1) and variance n/(n-2) (for n > 2).

The general non-central t with parameters (df, Del) = (df, ncp) is defined as the distribution of T(df, Del) := (U + Del) / √(V/df) where U and V are independent random variables, U ~ N(0,1) and V ~ χ^2(df) (see Chisquare).

The most used applications are power calculations for t-tests:
Let T= (mX - m0) / (S/sqrt(n)) where mX is the mean and S the sample standard deviation (sd) of X_1, X_2, …, X_n which are i.i.d. N(μ, σ^2) Then T is distributed as non-central t with df= n - 1 degrees of freedom and non-centrality parameter ncp = (μ - m0) * sqrt(n)/σ.

Value

dt gives the density, pt gives the distribution function, qt gives the quantile function, and rt generates random deviates.

Invalid arguments will result in return value NaN, with a warning.

The length of the result is determined by n for rt, and is the maximum of the lengths of the numerical arguments for the other functions.

The numerical arguments other than n are recycled to the length of the result. Only the first elements of the logical arguments are used.

Note

Supplying ncp = 0 uses the algorithm for the non-central distribution, which is not the same algorithm used if ncp is omitted. This is to give consistent behaviour in extreme cases with values of ncp very near zero.

The code for non-zero ncp is principally intended to be used for moderate values of ncp: it will not be highly accurate, especially in the tails, for large values.

Source

The central dt is computed via an accurate formula provided by Catherine Loader (see the reference in dbinom).

For the non-central case of dt, C code contributed by Claus Ekstrøm based on the relationship (for x != 0) to the cumulative distribution.

For the central case of pt, a normal approximation in the tails, otherwise via pbeta.

For the non-central case of pt based on a C translation of

Lenth, R. V. (1989). Algorithm AS 243 — Cumulative distribution function of the non-central t distribution, Applied Statistics 38, 185–189.

This computes the lower tail only, so the upper tail suffers from cancellation and a warning will be given when this is likely to be significant.

For central qt, a C translation of

Hill, G. W. (1970) Algorithm 396: Student's t-quantiles. Communications of the ACM, 13(10), 619–620.

altered to take account of

Hill, G. W. (1981) Remark on Algorithm 396, ACM Transactions on Mathematical Software, 7, 250–1.

The non-central case is done by inversion.

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole. (Except non-central versions.)

Johnson, N. L., Kotz, S. and Balakrishnan, N. (1995) Continuous Univariate Distributions, volume 2, chapters 28 and 31. Wiley, New York.

Examples

require(graphics)

1 - pt(1:5, df = 1)
qt(.975, df = c(1:10,20,50,100,1000))

tt <- seq(0, 10, len = 21)
ncp <- seq(0, 6, len = 31)
ptn <- outer(tt, ncp, function(t, d) pt(t, df = 3, ncp = d))
t.tit <- "Non-central t - Probabilities"
image(tt, ncp, ptn, zlim = c(0,1), main = t.tit)
persp(tt, ncp, ptn, zlim = 0:1, r = 2, phi = 20, theta = 200, main = t.tit,
      xlab = "t", ylab = "non-centrality parameter",
      zlab = "Pr(T <= t)")

plot(function(x) dt(x, df = 3, ncp = 2), -3, 11, ylim = c(0, 0.32),
     main = "Non-central t - Density", yaxs = "i")

[Package stats version 3.1.3 ]

In [59]:

# how surprising is it for our estimates to be off by so much?
2*pt(abs(tstat),1, lower.tail = F) # estimated p-values: twice the upper tail probability

Out[59]:

0.1908651

0.9720682

0.4643426

That is, there's about a 19% chance that the estimated value of $h$ would differ from the true value of $h$ by as much as it did, assuming the model is true—which it is (ignoring the distinction between truly random and pseudo-random).

Standard errors and the "hat matrix"¶

Below we prove that $\hat{\sigma}^2$ is an unbiased estimate of $\sigma^2$ if the model is true.

The key to the proof is the "hat matrix" $H$, so called because it puts a "hat" on $Y$ (i.e., $HY = \hat{Y} = X\hat{\beta}$).

We have

$$ \hat{Y} = X \hat{\beta} = X(X'X)^{-1}X' Y = HY, $$

where $$ H \equiv X(X'X)^{-1}X'. $$

The hat matrix has many useful properties:

$Y - \hat{Y} = ({\mathbf 1}_n - H)Y$.
$H' = (X(X'X)^{-1}X')' = H$ ($H$ is symmetric)
$H^2 = (X(X'X)^{-1}X')(X(X'X)^{-1}X') = H$ ($H$ is idempotent)
$HX = X(X'X)^{-1}X'X = X$
$e = Y - HY \perp X$ (the residual vector is orthogonal to $X$; i.e., $H$ projects $Y$ onto the subspace spanned by the columns of $X$)
$({\mathbf 1}_n - H)X = 0$
$({\mathbf 1}_n - H)H = {\mathbf 0}_n$

We now prove that $\hat{\sigma}^2 = \frac{1}{n-p} \sum_{i=1}^n e_i^2$ is an unbiased estimate of $\sigma^2$.

__Theorem.__ ${\mathbb E}(\hat{\sigma}^2 \mid X) = \sigma^2$.

Proof.

$$ e = ({\mathbf 1}_n - H)Y = ({\mathbf 1}_n - H)(X\beta + \epsilon $$$$ = ({\mathbf 1}_n - H)X\beta + ({\mathbf 1}_n - H)\epsilon = ({\mathbf 1}_n - H)\epsilon. $$

Hence,

$$ \| e \|^2 = \epsilon'({\mathbf 1}_n - H)'({\mathbf 1}_n - H)\epsilon = \epsilon' \tilde{H} \epsilon, $$

where $$ \tilde{H} = ({\mathbf 1}_n - H)'({\mathbf 1}_n - H) = ({\mathbf 1}_n - H) - H({\mathbf 1}_n - H) = ({\mathbf 1}_n - H). $$

$$ {\mathbb E} \left ( \epsilon'\tilde{H}\epsilon \mid X \right ) = $$

[TO DO: finish]

OLS is BLUE¶

The Gauss-Markov Theorem says that if $Y = X\beta + \epsilon$ with $X$ fixed, and the components of $\epsilon$ are uncorrelated and have mean zero and common variance $\sigma^2$, then the linear estimator of $\beta$ that has smallest variance among all unbiased estimates (the best linear unbiased estimator or BLUE) is the ordinary least squares estimate $$ \hat{\beta}_{\mbox{OLS}} = (X'X)^{-1}X'Y. $$

[TO DO: add proof.]

Generalized least squares¶

The model we have been studying assumes that $\epsilon$ and $X$ are independent and that the components of $\epsilon$ are uncorrelated, have zero mean, and have the same variance $\sigma^2$ (i.e., ${\mathbb E} \epsilon = 0$, and $\mbox{cov } \epsilon = \sigma^2 {\mathbf 1}_n$).

When the covariance matrix is not diagonal but we still have ${\mathbb E} \epsilon = 0$ and $\epsilon$ independent of $X$, the ordinary least squares estimate is still unbiased.

$$ {\mathbb E} (\hat{\beta}_{\mbox{OLS}} \mid X) = {\mathbb E} \left ( (X'X)^{-1}X'Y \mid X \right ) $$$$ = {\mathbb E} \left ( (X'X)^{-1}X'(X\beta + \epsilon) \mid X \right ) = {\mathbb E} \left ( \beta + (X'X)^{-1}X'\epsilon \mid X \right ) $$$$ = \beta + (X'X)^{-1}X'{\mathbb E}(\epsilon \mid X) = \beta + 0 = \beta. $$

Moreover, $$ {\mbox cov } (\hat{\beta}_{\mbox{OLS}} \mid X) = {\mathbb E} \left ( \left [ (X'X)^{-1}X'\epsilon \right ] \left [(X'X)^{-1}X'\epsilon \right ]' \mid X \right ) $$ $$ = {\mathbb E} \left ( (X'X)^{-1} X' \epsilon \epsilon' X (X'X)^{-1} \mid X \right ) $$ $$ = (X'X)^{-1}X'{\mathbb E} (\epsilon \epsilon' \mid X) X (X'X)^{-1} $$ $$ = (X'X)^{-1}X' \Sigma X (X'X)^{-1}. $$

In general, this is not diagonal and the OLS estimate is not longer BLUE unless $\Sigma = \sigma^2 {\mathbf 1}_n$.

If the covariance matrix ${\mbox cov }\epsilon = \Sigma$ is known and positive definite, we can take it into account through generalized least squares (GLS), as follows. Since $\Sigma$ is positive definite, its square root $\Sigma^{1/2}$ exists and is invertible. We change the problem by pre-multiplying by $\Sigma^{-1/2}$:

$$ \Sigma^{-1/2}Y = (\Sigma^{-1/2}X) \beta + \Sigma^{-1/2} \epsilon. $$

In this new problem the unknown parameter is still $\beta$, but there is a different "noise," $\Sigma^{-1/2}\epsilon$, we have new data $\Sigma^{-1/2}Y$, and we have a new design matrix $\Sigma^{-1/2} X$. Since $\Sigma$, $Y$, and $X$ are known, we can set up this regression problem using only things we know.

The components of the new noise terms are uncorrelated:

$$ \mbox{cov } (\Sigma^{-1/2} \epsilon | X) = {\mathbb E} (\Sigma^{-1/2} \epsilon \epsilon' \Sigma^{-1/2} \mid X) $$$$ = \Sigma^{-1/2} {\mathbb E} ( \epsilon \epsilon' \mid X) \Sigma^{-1/2} = \Sigma^{-1/2} \Sigma \Sigma^{-1/2} = {\mathbf 1}_n. $$

This transforms the original problem into a problem where the covariance matrix of the new noise is a diagonal with equal elements. OLS is BLUE for this problem. That leads to the estimator $$ \hat{\beta}_{\mbox{GLS}} \equiv \left [ (\Sigma^{-1/2} X)'(\Sigma^{-1/2} X) \right ]^{-1}(\Sigma^{-1/2} X)'\Sigma^{-1/2}Y $$ $$ = \left [ X'\Sigma^{-1/2}\Sigma^{-1/2}X \right ] ^{-1} X' \Sigma^{-1/2}\Sigma^{-1/2}Y = (X' \Sigma^{-1}X)^{-1}X'\Sigma^{-1} Y. $$

To construct this estimator, we need $X'\Sigma^{-1}X$ to be invertible. Is it? $$ X'\Sigma^{-1}X = (\Sigma^{-1/2}X)'(\Sigma^{-1/2}X). $$ The matrix $\Sigma^{-1/2}X$ is full rank, so we are OK.

It's straightforward to show that $\hat{\beta}_{\mbox{GLS}}$ is conditionally unbiased given $X$. Also,

$$ \mbox{cov } (\hat{\beta}_{\mbox{GLS}} \mid X) = (\tilde{X}'\tilde{X})^{-1}, $$

where $\tilde{X} = \Sigma^{-1/2}X$, i.e., $$ \mbox{cov } (\hat{\beta}_{\mbox{GLS}} \mid X) = \left [ (\Sigma^{-1/2}X)'(\Sigma^{-1/2}X) \right ] ^{-1} = (X' \Sigma^{-1} X)^{-1}. $$

For fixed $X$, $\hat{\beta}_{\mbox{GLS}}$ is BLUE.

Example: weighted least squares¶

Suppose $\Sigma = \lambda \Gamma$, where $\lambda$ is an unknown positive constant and $\Gamma$ is a known, positive definite matrix.

We re-write the regression equation as

$$ (\Gamma^{-1/2} Y) = (\Gamma^{-1/2}X) \beta + (\Gamma^{-1/2} \epsilon). $$

Then we still have ${\mathbb E} (\Gamma^{-1/2}\epsilon) = 0$ and $$ \mbox{cov } (\Gamma^{-1/2} \epsilon) = {\mathbb E} ((\Gamma^{-1/2} \epsilon)(\Gamma^{-1/2} \epsilon)') $$ $$ = {\mathbb E} (\Gamma^{-1/2} \epsilon \epsilon' \Gamma^{-1/2}) = \Gamma^{-1/2} {\mathbb E}( \epsilon \epsilon') \Gamma^{-1/2} $$ $$ = \Gamma^{-1'2} \lambda \Gamma \Gamma^{-1/2} = \lambda {\mathbf 1}_n. $$

Thus, the new problem fits the standard OLS model with $\lambda = \sigma^2$, and we have

$$ \hat{\beta} = \left ( ( \Gamma^{-1/2}X)'(\Gamma^{-1/2}X \right )^{-1} (\Gamma^{-1/2}X)'(\Gamma^{-1/2}Y) = (X'\Gamma^{-1}X)^{-1}X'\Gamma^{-1}Y. $$

Also,

$$ \mbox{cov } \hat{\beta} = \lambda \left ( (\Gamma^{-1/2}X)'(\Gamma^{-1/2}X) \right )^{-1} = \lambda (X'\Gamma^{-1}X)^{-1}. $$

An unbiased estimate of $\lambda$ is

$$ \hat{\lambda} = \frac{1}{n-p} \sum_{i=1}^n e_i^2 = \frac{1}{n-p} \sum_{i=1}^n \left [ (\Gamma^{-1/2} Y)_j - (\Gamma^{-1/2}X\hat{\beta})_j \right ]^2 $$$$ = \frac{1}{n-p} \sum_{i=1}^n \left [ \Gamma^{-1/2}(Y - X \hat{\beta} )\right ]_j^2. $$

The estimated covariance matrix of $\hat{\beta}$ is

$$ \hat{\mbox{cov}} \hat{\beta} = \hat{\lambda} (X'\Gamma^{-1}X)^{-1}. $$

Numerical example¶

Suppose that the measuring instrument the falling weight example has correlated errors with covariance matrix $\lambda \Gamma$, with $\lambda > 0$ unknown and

$$ \Gamma = \begin{pmatrix} 1 & 0.2 & 0.1 & 0.1 \\ 0.2 & 1 & 0.2 & 0.1 \\ 0.1 & 0.2 & 1 & 0.2 \\ 0.1 & 0.1 & 0.2 & 1 \end{pmatrix} . $$

Let's see how this changes the estimate.

In [65]:

# R example.
gam <- matrix(c(1, .2, .1, .1, .2, 1, .2, .1, .1, .2, 1, .2, .1, .1, .2, 1), nrow=4, ncol=4, byrow=T)
gam
decomp <- eigen(gam)
sqGamInv <- decomp$vectors %*% diag(1/sqrt(decomp$values)) %*% t(decomp$vectors);
gY <- sqGamInv %*% Y;
gX <- sqGamInv %*% X;
betahat_GLS <- solve(t(gX) %*% gX, t(gX) %*% gY )
betahat_GLS  # the GLS estimate

Out[65]:

1.0	0.2	0.1	0.1
0.2	1.0	0.2	0.1
0.1	0.2	1.0	0.2
0.1	0.1	0.2	1.0

Out[65]:

1.98456

0.01271

10.1655

In [66]:

# Estimate lambda
lambdahat <- sum((gY - gX %*% betahat_GLS)^2)/(dims[1]-dims[2]);
lambdahat

Out[66]:

0.142301400000001

Feasible GLS¶

Usually $\Sigma$ is unknown: we have to estimate it to take advantage of the covariance of $\epsilon$. But a covariance matrix has $n^2$ unknowns, and we only have $n$ data, so it's impossible to estimate $\Sigma$ without constraints. (There are $(n^2+n)/2$ unknowns, since $\Sigma$ is symmetric.)

Substituting an estimate $\hat{\Sigma}$ for $\Sigma$ in GLS is called feasible GLS or the Aitken method.

The resulting feasible GLS estimate is

$$ \hat{\beta}_{\mbox{FGLS}} \equiv (X'\hat{\Sigma}X)^{-1}X' \hat{\Sigma}^{-1}Y, $$

which has estimated covariance $$ \hat{\mbox{cov}}(\hat{\beta}_{\mbox{FGLS}} \mid X) = (X' \hat{\Sigma}^{-1} X)^{-1}. $$

Sometimes, this is a good estimator; sometimes it has terrible performance. Moreover, FGLS is a nonlinear estimator, and is usually biased.

Dummy variables¶

Imagine a covariate that is binary or categorical, such as gender, or whether a subject in an experiment receives a treatment, or which treatment the subject receives.

Dummy variables are a way to inforporate such a covariate into a regression model. That is not always the right way to incorporate such covariates—especially in the context of an experiment—but it is extremely common in practice.

Suppose there is a covariate that can take any of $k$ levels. Augment the list of variables in the model by including $k-1$ new variables. The $i$th such variable will be equal to 1 if the case in question has the $i$th level of treatment and 0 if not, for $i=1, \ldots, k-1$. The $k$th level corresponds to having all $k-1$ of those variables equal to zero. If we used $k$ such variables, they would be linearly dependent: the $k$th variable would be 1 minus the sum of the first $k-1$ variables.

What does the model say?¶

[To do: discuss constancy assumptions, additive effects, independence of $\epsilon$ and $X$, equal variances, uncorrelated, etc.]

Illustration: Student Evaluations¶

We will deliberately mis-analyze the MacNell et al. (2014) data using a dummy variable for the identified gender of the instructor and a dummy variable for the true gender of the instructor. Note that the assumptions of the regression model are not satisfied here. They are violated in many ways, including the lack of any justification for an underlying linear model. The permutation analysis in the chapter on inference is far preferable, in part because it matches the randomization that was used in the experiment. Regression modeling ignores that randomization.

However, regression can sometimes adjust for the effect of confounding variables. We will illustrate this by example—but the circumstances in which the approach gives reliable inferences are limited, and typically cannot be checked from the data themselves.

The model¶

Here is an example of a regression model for student evaluations:

$$ Y = X \beta + \epsilon, $$

where $Y_i$ is the rating assigned by student $i$; $X_{i1} = 1$ for all $i$ (to account for the fact that a typical rating is about 4), and $X_{i2} = 1$ if student $i$ was assigned to an instructor identified as male, and $X_{i2} = 0$ student $i$ was assigned to an instructor identified as female; where $\epsilon$ and $X$ are independent; and the components of $\epsilon$ are uncorrelated, have zero mean, have common variance $\sigma^2$.

In this model, the rating student $i$ assigns to an instructor is $\beta_1 + \epsilon_i$ if the instructor is apparently female and $\beta_1+ \beta_2 + \epsilon_i$ if the instructor is apparently male. instructor; $\beta_1$ if assigned apparently female.

[Show that $\epsilon$ and $X$ are not independent for reasonable ticket models. Show ${\mathbb E} \epsilon \ne 0$ for reasonable ticket models. Etc. Note that both have non-interference built in.]

In finding $p$-values in the regression model, the randomness is in $\epsilon$, the (invented) randomness of individual students' responses. The $p$-values are with respect to different realizations of the individual "noise" terms. The randomness in assigning students to sections of the course is hidden in $X$, but the regression—and the $p$-value calculation—are conditional on $X$.

In the "ticket" model, the randomness comes from the random assignment of students to sections of the class, and the $p$-value calculations imagine different random assignments.

Both models assume that each student's response does not depend on other students' assignments to sections, the non interference assumption. That assumption might not be accurate if students study in groups, but

The regression model assumes that $X$ and $\epsilon$ are independent. Suppose that the expected rating both for apparently male and apparently female instructors is 4 (corresponding to $\beta_1 = 4$ and $\beta_2 = 0$). Suppose that student $i$ would rate an apparently female instructor 5 and an apparently male instructor 3. Then $X_{i2}$ and $\epsilon_i$ are not independent: $\epsilon_i = -1$ if $X_{i2} = 1$ amd $\epsilon_i = +1$ if $X_{i2} = 0$.

The regression model assumes that $\epsilon_i$ has mean zero, for all $i$, so every student's expected rating of an apparently female instructor is the same, amd every student's expected rating of an apparently male instructor is the same. There is no reason that should be true. Again, suppose that the average rating both for apparently male and for apparently female instructors is 4. Imagine that student $i$ will always rate an instructor 5, regardless of the instructor's apparent gender. Then $\epsilon_i = 1$ and hence ${\mathbb E} \epsilon_i = +1$, violating this assumption of the regression model.

In short, the assumptions of the regression model seem highly contrived.

In [73]:

# Use regression with a dummy variable for the instructor's identified gender 
ratings <- read.csv("Data/Macnell-RatingsData.csv", as.is=T);  # reads a .csv file into a DataFrame

character <- setdiff(names(ratings),c("group","gender","tagender","taidgender","age"));

# for each characteristic, we will model the student's rating of the instructor as
# using regression. The model will include a dummy variable for the instructor's perceived gender.
#
# Then we will try a variety of adjustments--for the instructor's true gender and for the
# student's gender, as well as including an "intercept" or not

for (ch in character) {
    reg_mod <- lm(unlist(ratings[ch]) ~ ratings$taidgender - 1) # omit the intercept
    print(ch, quote=F)
    print(summary(reg_mod))
}

[1] professional

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6087  0.3913  0.3913  4.0000  5.0000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   4.6087     0.6134   7.513 2.71e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.942 on 42 degrees of freedom
Multiple R-squared:  0.5734,	Adjusted R-squared:  0.5632 
F-statistic: 56.45 on 1 and 42 DF,  p-value: 2.706e-09

[1] respect

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6087  0.3913  0.3913  4.0000  5.0000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   4.6087     0.6134   7.513 2.71e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.942 on 42 degrees of freedom
Multiple R-squared:  0.5734,	Adjusted R-squared:  0.5632 
F-statistic: 56.45 on 1 and 42 DF,  p-value: 2.706e-09

[1] caring

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1739 -0.1739  0.8261  4.0000  5.0000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   4.1739     0.5668   7.364  4.4e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.718 on 42 degrees of freedom
Multiple R-squared:  0.5636,	Adjusted R-squared:  0.5532 
F-statistic: 54.23 on 1 and 42 DF,  p-value: 4.397e-09

[1] enthusiastic

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1739 -0.1739  0.8261  4.0000  5.0000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   4.1739     0.5566   7.499 2.84e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.67 on 42 degrees of freedom
Multiple R-squared:  0.5724,	Adjusted R-squared:  0.5622 
F-statistic: 56.23 on 1 and 42 DF,  p-value: 2.839e-09

[1] communicate

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2174 -0.2174  0.7826  4.0000  5.0000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   4.2174     0.5673   7.434 3.51e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.721 on 42 degrees of freedom
Multiple R-squared:  0.5682,	Adjusted R-squared:  0.5579 
F-statistic: 55.26 on 1 and 42 DF,  p-value: 3.505e-09

[1] helpful

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9565  0.0435  1.0435  3.5000  5.0000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   3.9565     0.5488   7.209 7.31e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.632 on 42 degrees of freedom
Multiple R-squared:  0.5531,	Adjusted R-squared:  0.5424 
F-statistic: 51.97 on 1 and 42 DF,  p-value: 7.309e-09

[1] feedback

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1739 -0.1739  0.8261  4.0000  5.0000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   4.1739     0.5803   7.193 7.72e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.783 on 42 degrees of freedom
Multiple R-squared:  0.5519,	Adjusted R-squared:  0.5413 
F-statistic: 51.73 on 1 and 42 DF,  p-value: 7.718e-09

[1] prompt

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3478 -0.3478  0.6522  4.0000  5.0000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   4.3478     0.5519   7.878 8.28e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.647 on 42 degrees of freedom
Multiple R-squared:  0.5964,	Adjusted R-squared:  0.5868 
F-statistic: 62.07 on 1 and 42 DF,  p-value: 8.284e-10

[1] consistent

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9565  0.0435  1.0435  4.0000  5.0000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   3.9565     0.5563   7.112 1.01e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.668 on 42 degrees of freedom
Multiple R-squared:  0.5463,	Adjusted R-squared:  0.5355 
F-statistic: 50.58 on 1 and 42 DF,  p-value: 1.005e-08

[1] fair

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2609 -0.2609  0.7391  3.5000  5.0000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   4.2609     0.5388   7.908 7.52e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.584 on 42 degrees of freedom
Multiple R-squared:  0.5982,	Adjusted R-squared:  0.5887 
F-statistic: 62.54 on 1 and 42 DF,  p-value: 7.523e-10

[1] responsive

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8696  0.1304  1.1304  4.0000  5.0000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   3.8696     0.5643   6.857 2.33e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.706 on 42 degrees of freedom
Multiple R-squared:  0.5282,	Adjusted R-squared:  0.517 
F-statistic: 47.02 on 1 and 42 DF,  p-value: 2.326e-08

[1] praised

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5217  0.4783  0.4783  4.0000  5.0000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   4.5217     0.5834   7.751 1.25e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.798 on 42 degrees of freedom
Multiple R-squared:  0.5886,	Adjusted R-squared:  0.5788 
F-statistic: 60.08 on 1 and 42 DF,  p-value: 1.249e-09

[1] knowledgeable

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3043 -0.3043  0.6957  4.0000  5.0000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   4.3043     0.6035   7.132 9.42e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.894 on 42 degrees of freedom
Multiple R-squared:  0.5477,	Adjusted R-squared:  0.537 
F-statistic: 50.86 on 1 and 42 DF,  p-value: 9.421e-09

[1] clear

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)

Residuals:
   Min     1Q Median     3Q    Max 
-1.913  0.087  1.087  4.000  5.000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   3.9130     0.5663   6.909 1.96e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.716 on 42 degrees of freedom
Multiple R-squared:  0.532,	Adjusted R-squared:  0.5208 
F-statistic: 47.74 on 1 and 42 DF,  p-value: 1.959e-08

[1] overall

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1739 -0.1739  0.8261  4.0000  5.0000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   4.1739     0.5713   7.306 5.33e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.74 on 42 degrees of freedom
Multiple R-squared:  0.5596,	Adjusted R-squared:  0.5491 
F-statistic: 53.37 on 1 and 42 DF,  p-value: 5.327e-09

Note the highly significant results. This is not evidence that gender matters; raterh, it is strong evidence that the average rating is not zero—if there is no intercept, having the coefficient of taidgender = 0 means that the data are just iid noice with zero mean, which is a bogus assumption! The mean of the data is large (about 4), so we are bound to reject the null hypothesis in this case.

In [83]:

# include an intercept NOTE HOW BIG THE CHANGE IS!
for (ch in character) {
    reg_mod <- lm(unlist(ratings[ch]) ~ ratings$taidgender ) # include the intercept
    print(ch, quote=F)
    print(summary(reg_mod))
}

[1] professional

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0000 -0.6087  0.3913  0.3913  1.0000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          4.0000     0.2303  17.371   <2e-16 ***
ratings$taidgender   0.6087     0.3148   1.933   0.0601 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.03 on 41 degrees of freedom
Multiple R-squared:  0.08355,	Adjusted R-squared:  0.06119 
F-statistic: 3.738 on 1 and 41 DF,  p-value: 0.06012

[1] respect

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0000 -0.6087  0.3913  0.3913  1.0000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          4.0000     0.2303  17.371   <2e-16 ***
ratings$taidgender   0.6087     0.3148   1.933   0.0601 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.03 on 41 degrees of freedom
Multiple R-squared:  0.08355,	Adjusted R-squared:  0.06119 
F-statistic: 3.738 on 1 and 41 DF,  p-value: 0.06012

[1] caring

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6500 -0.4120  0.3500  0.8261  1.3500 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          3.6500     0.2313  15.783   <2e-16 ***
ratings$taidgender   0.5239     0.3162   1.657    0.105    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.034 on 41 degrees of freedom
Multiple R-squared:  0.06275,	Adjusted R-squared:  0.0399 
F-statistic: 2.745 on 1 and 41 DF,  p-value: 0.1052

[1] enthusiastic

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6000 -0.1739 -0.1739  0.6130  1.4000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          3.6000     0.2212  16.278   <2e-16 ***
ratings$taidgender   0.5739     0.3024   1.898   0.0648 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.989 on 41 degrees of freedom
Multiple R-squared:  0.08076,	Adjusted R-squared:  0.05834 
F-statistic: 3.602 on 1 and 41 DF,  p-value: 0.06476

[1] communicate

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6500 -0.2174  0.3500  0.7826  1.3500 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          3.6500     0.2329  15.675   <2e-16 ***
ratings$taidgender   0.5674     0.3184   1.782   0.0821 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.041 on 41 degrees of freedom
Multiple R-squared:  0.07189,	Adjusted R-squared:  0.04925 
F-statistic: 3.176 on 1 and 41 DF,  p-value: 0.08215

[1] helpful

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.50000 -0.72826  0.04348  1.04348  1.50000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          3.5000     0.2367   14.78   <2e-16 ***
ratings$taidgender   0.4565     0.3237    1.41    0.166    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.059 on 41 degrees of freedom
Multiple R-squared:  0.04627,	Adjusted R-squared:  0.02301 
F-statistic: 1.989 on 1 and 41 DF,  p-value: 0.166

[1] feedback

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7000 -0.1739  0.3000  0.8261  1.3000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          3.7000     0.2506  14.763   <2e-16 ***
ratings$taidgender   0.4739     0.3427   1.383    0.174    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.121 on 41 degrees of freedom
Multiple R-squared:  0.04457,	Adjusted R-squared:  0.02127 
F-statistic: 1.913 on 1 and 41 DF,  p-value: 0.1742

[1] prompt

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5500 -0.3478  0.4500  0.6522  1.4500 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          3.5500     0.2268  15.655   <2e-16 ***
ratings$taidgender   0.7978     0.3101   2.573   0.0138 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.014 on 41 degrees of freedom
Multiple R-squared:  0.139,	Adjusted R-squared:  0.118 
F-statistic: 6.621 on 1 and 41 DF,  p-value: 0.0138

[1] consistent

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.50000 -0.50000  0.04348  1.04348  1.50000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          3.5000     0.2565  13.644   <2e-16 ***
ratings$taidgender   0.4565     0.3507   1.302      0.2    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.147 on 41 degrees of freedom
Multiple R-squared:  0.03968,	Adjusted R-squared:  0.01626 
F-statistic: 1.694 on 1 and 41 DF,  p-value: 0.2003

[1] fair

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5000 -0.3804 -0.2609  0.7391  1.5000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          3.5000     0.2079  16.837   <2e-16 ***
ratings$taidgender   0.7609     0.2842   2.677   0.0106 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9297 on 41 degrees of freedom
Multiple R-squared:  0.1488,	Adjusted R-squared:  0.128 
F-statistic: 7.166 on 1 and 41 DF,  p-value: 0.01064

[1] responsive

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6500 -0.8696  0.1304  0.7402  1.3500 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          3.6500     0.2240  16.292   <2e-16 ***
ratings$taidgender   0.2196     0.3063   0.717    0.478    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.002 on 41 degrees of freedom
Multiple R-squared:  0.01238,	Adjusted R-squared:  -0.01171 
F-statistic: 0.5137 on 1 and 41 DF,  p-value: 0.4776

[1] praised

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8500 -0.5217  0.1500  0.4783  1.1500 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          3.8500     0.1984  19.402   <2e-16 ***
ratings$taidgender   0.6717     0.2713   2.476   0.0175 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8874 on 41 degrees of freedom
Multiple R-squared:  0.1301,	Adjusted R-squared:  0.1088 
F-statistic: 6.129 on 1 and 41 DF,  p-value: 0.01752

[1] knowledgeable

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9500 -0.3044  0.0500  0.6956  1.0500 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          3.9500     0.2204  17.925   <2e-16 ***
ratings$taidgender   0.3543     0.3013   1.176    0.246    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9855 on 41 degrees of freedom
Multiple R-squared:  0.03263,	Adjusted R-squared:  0.009038 
F-statistic: 1.383 on 1 and 41 DF,  p-value: 0.2464

[1] clear

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.50000 -0.50000  0.08696  1.08696  1.50000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          3.5000     0.2812  12.448 1.64e-15 ***
ratings$taidgender   0.4130     0.3844   1.074    0.289    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.257 on 41 degrees of freedom
Multiple R-squared:  0.02738,	Adjusted R-squared:  0.00366 
F-statistic: 1.154 on 1 and 41 DF,  p-value: 0.2889

[1] overall

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7000 -0.1739  0.3000  0.8261  1.3000 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          3.7000     0.2250  16.446   <2e-16 ***
ratings$taidgender   0.4739     0.3076   1.541    0.131    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.006 on 41 degrees of freedom
Multiple R-squared:  0.05472,	Adjusted R-squared:  0.03167 
F-statistic: 2.373 on 1 and 41 DF,  p-value: 0.1311

In [ ]:

# allow for a real difference between the two instructors by including gender (really a proxy for the two instructors)
for (ch in character) {
    reg_mod <- lm(unlist(ratings[ch]) ~ 
                  ratings$taidgender 
                  + ratings$tagender 
                  - 1 ) # omit the intercept
    print(ch, quote=F)
    print(summary(reg_mod))
}

In [74]:

# adjust for the identified gender, the true gender, and the student's gender
for (ch in character) {
    reg_mod <- lm(unlist(ratings[ch]) ~ 
                  ratings$taidgender 
                  + ratings$tagender 
                  + ratings$gender 
                  ) # include the intercept 
    print(ch, quote=F)
    print(summary(reg_mod))
}

[1] professional

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3897 -0.3897  0.5483  1.1239  3.0620 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   1.4981     0.4110   3.645 0.000761 ***
ratings$tagender     0.5136     0.4458   1.152 0.256055    
ratings$gender       1.9380     0.2389   8.111 5.59e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.463 on 40 degrees of freedom
Multiple R-squared:  0.8995,	Adjusted R-squared:  0.892 
F-statistic: 119.3 on 3 and 40 DF,  p-value: < 2.2e-16

[1] respect

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3897 -0.3897  0.5483  1.1239  3.0620 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   1.4981     0.4110   3.645 0.000761 ***
ratings$tagender     0.5136     0.4458   1.152 0.256055    
ratings$gender       1.9380     0.2389   8.111 5.59e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.463 on 40 degrees of freedom
Multiple R-squared:  0.8995,	Adjusted R-squared:  0.892 
F-statistic: 119.3 on 3 and 40 DF,  p-value: < 2.2e-16

[1] caring

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6121 -0.7091  0.0705  1.1322  3.1940 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   1.3523     0.4055   3.335  0.00185 ** 
ratings$tagender     0.3174     0.4398   0.722  0.47464    
ratings$gender       1.8060     0.2358   7.661 2.29e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.444 on 40 degrees of freedom
Multiple R-squared:  0.8828,	Adjusted R-squared:  0.874 
F-statistic: 100.4 on 3 and 40 DF,  p-value: < 2.2e-16

[1] enthusiastic

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8464 -0.6099  0.1536  0.7849  3.1583 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   1.3734     0.3896   3.525  0.00108 ** 
ratings$tagender     0.1630     0.4226   0.386  0.70182    
ratings$gender       1.8417     0.2265   8.130 5.28e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.387 on 40 degrees of freedom
Multiple R-squared:   0.89,	Adjusted R-squared:  0.8818 
F-statistic: 107.9 on 3 and 40 DF,  p-value: < 2.2e-16

[1] communicate

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9570 -0.5723  0.0430  1.3926  3.1963 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   1.3839     0.4015   3.446  0.00135 ** 
ratings$tagender     0.3496     0.4355   0.803  0.42687    
ratings$gender       1.8037     0.2335   7.726 1.87e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.429 on 40 degrees of freedom
Multiple R-squared:  0.8865,	Adjusted R-squared:  0.878 
F-statistic: 104.1 on 3 and 40 DF,  p-value: < 2.2e-16

[1] helpful

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3328 -0.6125  0.1081  1.1079  3.3336 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   1.2257     0.3855   3.180  0.00284 ** 
ratings$tagender     0.5591     0.4181   1.337  0.18866    
ratings$gender       1.6664     0.2241   7.436 4.67e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.372 on 40 degrees of freedom
Multiple R-squared:  0.8843,	Adjusted R-squared:  0.8756 
F-statistic: 101.9 on 3 and 40 DF,  p-value: < 2.2e-16

[1] feedback

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender - 1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.98771 -0.26829  0.01229  1.06223  3.05609 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender  1.25253    0.40651   3.081  0.00372 ** 
ratings$tagender    0.09989    0.44093   0.227  0.82193    
ratings$gender      1.94391    0.23635   8.225 3.93e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.447 on 40 degrees of freedom
Multiple R-squared:  0.8846,	Adjusted R-squared:  0.876 
F-statistic: 102.2 on 3 and 40 DF,  p-value: < 2.2e-16

[1] prompt

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7624 -0.5702  0.3289  1.4298  3.2653 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   1.6434     0.4115   3.994 0.000271 ***
ratings$tagender     0.2931     0.4463   0.657 0.515176    
ratings$gender       1.7347     0.2392   7.251 8.41e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.465 on 40 degrees of freedom
Multiple R-squared:  0.8823,	Adjusted R-squared:  0.8734 
F-statistic: 99.91 on 3 and 40 DF,  p-value: < 2.2e-16

[1] consistent

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8227 -0.7780  0.1773  1.0973  3.2134 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   1.1961     0.4029   2.969  0.00503 ** 
ratings$tagender     0.2495     0.4370   0.571  0.57131    
ratings$gender       1.7866     0.2342   7.627 2.55e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.434 on 40 degrees of freedom
Multiple R-squared:  0.8751,	Adjusted R-squared:  0.8658 
F-statistic: 93.44 on 3 and 40 DF,  p-value: < 2.2e-16

[1] fair

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8183 -0.3668  0.1817  1.0545  3.1273 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender  1.45767    0.33871   4.304 0.000105 ***
ratings$tagender    0.07282    0.36739   0.198 0.843874    
ratings$gender      1.87272    0.19693   9.510 8.07e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.206 on 40 degrees of freedom
Multiple R-squared:  0.9167,	Adjusted R-squared:  0.9104 
F-statistic: 146.7 on 3 and 40 DF,  p-value: < 2.2e-16

[1] responsive

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7595 -0.7595  0.2405  1.2262  3.1508 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender  1.10675    0.42540   2.602   0.0129 *  
ratings$tagender    0.06112    0.46141   0.132   0.8953    
ratings$gender      1.84919    0.24733   7.477  4.1e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.514 on 40 degrees of freedom
Multiple R-squared:  0.8593,	Adjusted R-squared:  0.8487 
F-statistic: 81.43 on 3 and 40 DF,  p-value: < 2.2e-16

[1] praised

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1464 -0.4829 -0.1464  1.3393  3.0126 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   1.5017     0.3722   4.035 0.000239 ***
ratings$tagender     0.1715     0.4037   0.425 0.673143    
ratings$gender       1.9874     0.2164   9.185 2.11e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.325 on 40 degrees of freedom
Multiple R-squared:  0.9121,	Adjusted R-squared:  0.9055 
F-statistic: 138.4 on 3 and 40 DF,  p-value: < 2.2e-16

[1] knowledgeable

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2501 -0.5044 -0.1543  1.2728  3.0500 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   1.2543     0.4133   3.034  0.00422 ** 
ratings$tagender     0.3501     0.4483   0.781  0.43942    
ratings$gender       1.9500     0.2403   8.114 5.54e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.472 on 40 degrees of freedom
Multiple R-squared:  0.8887,	Adjusted R-squared:  0.8803 
F-statistic: 106.4 on 3 and 40 DF,  p-value: < 2.2e-16

[1] clear

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8037 -0.8037  0.4014  1.2169  3.2929 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   1.2051     0.4503   2.676   0.0107 *  
ratings$tagender    -0.1845     0.4884  -0.378   0.7076    
ratings$gender       1.8915     0.2618   7.225 9.14e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.603 on 40 degrees of freedom
Multiple R-squared:  0.8447,	Adjusted R-squared:  0.8331 
F-statistic: 72.53 on 3 and 40 DF,  p-value: 3.152e-16

[1] overall

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender - 1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.99601 -0.38870  0.00399  0.94702  3.11394 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender   1.2787     0.3904   3.275  0.00219 ** 
ratings$tagender     0.2239     0.4235   0.529  0.59994    
ratings$gender       1.8861     0.2270   8.309 3.03e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.39 on 40 degrees of freedom
Multiple R-squared:  0.8921,	Adjusted R-squared:  0.884 
F-statistic: 110.2 on 3 and 40 DF,  p-value: < 2.2e-16

In [76]:

# model using "group" variable; note how R deals with the fact that the levels are dependent when there is
# an intercept.
# use as.factor() to get R to treat the variable as a dummy
for (ch in character) {
    reg_mod <- lm(unlist(ratings[ch]) ~ as.factor(ratings$group) ) # include the constant term for now
    print(ch, quote=F)
    print(summary(reg_mod))
}

[1] professional

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0833 -0.5833  0.3636  0.4167  1.1250 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                4.58333    0.30401  15.076   <2e-16 ***
as.factor(ratings$group)4  0.05303    0.43960   0.121    0.905    
as.factor(ratings$group)5 -0.70833    0.48068  -1.474    0.149    
as.factor(ratings$group)6 -0.50000    0.42994  -1.163    0.252    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.053 on 39 degrees of freedom
Multiple R-squared:  0.08828,	Adjusted R-squared:  0.01815 
F-statistic: 1.259 on 3 and 39 DF,  p-value: 0.3019

[1] respect

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0833 -0.5833  0.3636  0.4167  1.1250 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                4.58333    0.30401  15.076   <2e-16 ***
as.factor(ratings$group)4  0.05303    0.43960   0.121    0.905    
as.factor(ratings$group)5 -0.70833    0.48068  -1.474    0.149    
as.factor(ratings$group)6 -0.50000    0.42994  -1.163    0.252    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.053 on 39 degrees of freedom
Multiple R-squared:  0.08828,	Adjusted R-squared:  0.01815 
F-statistic: 1.259 on 3 and 39 DF,  p-value: 0.3019

[1] caring

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6250 -0.4375  0.3333  0.7500  1.3750 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 4.2500     0.3056  13.908   <2e-16 ***
as.factor(ratings$group)4  -0.1591     0.4419  -0.360    0.721    
as.factor(ratings$group)5  -0.6250     0.4832  -1.294    0.203    
as.factor(ratings$group)6  -0.5833     0.4322  -1.350    0.185    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.059 on 39 degrees of freedom
Multiple R-squared:  0.06604,	Adjusted R-squared:  -0.005806 
F-statistic: 0.9192 on 3 and 39 DF,  p-value: 0.4406

[1] enthusiastic

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))

Residuals:
     Min       1Q   Median       3Q      Max 
-2.75000 -0.25000 -0.09091  0.62500  1.50000 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 4.2500     0.2911  14.600   <2e-16 ***
as.factor(ratings$group)4  -0.1591     0.4209  -0.378   0.7075    
as.factor(ratings$group)5  -0.5000     0.4603  -1.086   0.2840    
as.factor(ratings$group)6  -0.7500     0.4117  -1.822   0.0762 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.008 on 39 degrees of freedom
Multiple R-squared:  0.09097,	Adjusted R-squared:  0.02104 
F-statistic: 1.301 on 3 and 39 DF,  p-value: 0.2878

[1] communicate

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7500 -0.3333  0.2500  0.6667  1.5000 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 4.3333     0.3060  14.161   <2e-16 ***
as.factor(ratings$group)4  -0.2424     0.4425  -0.548   0.5869    
as.factor(ratings$group)5  -0.8333     0.4839  -1.722   0.0929 .  
as.factor(ratings$group)6  -0.5833     0.4328  -1.348   0.1855    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.06 on 39 degrees of freedom
Multiple R-squared:  0.08519,	Adjusted R-squared:  0.01482 
F-statistic: 1.211 on 3 and 39 DF,  p-value: 0.3187

[1] helpful

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))

Residuals:
     Min       1Q   Median       3Q      Max 
-2.12500 -0.82955  0.09091  1.00000  1.87500 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                4.00000    0.30674  13.040 8.45e-16 ***
as.factor(ratings$group)4 -0.09091    0.44355  -0.205   0.8387    
as.factor(ratings$group)5 -0.87500    0.48500  -1.804   0.0789 .  
as.factor(ratings$group)6 -0.25000    0.43380  -0.576   0.5677    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.063 on 39 degrees of freedom
Multiple R-squared:  0.08617,	Adjusted R-squared:  0.01587 
F-statistic: 1.226 on 3 and 39 DF,  p-value: 0.3133

[1] feedback

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7500 -0.4167  0.2500  0.5833  1.3750 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 4.4167     0.3267  13.519 2.66e-16 ***
as.factor(ratings$group)4  -0.5076     0.4724  -1.074    0.289    
as.factor(ratings$group)5  -0.7917     0.5166  -1.533    0.133    
as.factor(ratings$group)6  -0.6667     0.4620  -1.443    0.157    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.132 on 39 degrees of freedom
Multiple R-squared:  0.07339,	Adjusted R-squared:  0.002112 
F-statistic:  1.03 on 3 and 39 DF,  p-value: 0.3901

[1] prompt

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7500 -0.3636  0.5833  0.6364  1.2500 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 4.3333     0.2983  14.529   <2e-16 ***
as.factor(ratings$group)4   0.0303     0.4313   0.070   0.9443    
as.factor(ratings$group)5  -0.5833     0.4716  -1.237   0.2235    
as.factor(ratings$group)6  -0.9167     0.4218  -2.173   0.0359 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.033 on 39 degrees of freedom
Multiple R-squared:   0.15,	Adjusted R-squared:  0.08465 
F-statistic: 2.295 on 3 and 39 DF,  p-value: 0.09292

[1] consistent

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))

Residuals:
     Min       1Q   Median       3Q      Max 
-2.62500 -0.52083  0.08333  1.00000  1.58333 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                3.91667    0.33876  11.562  3.6e-14 ***
as.factor(ratings$group)4  0.08333    0.48985   0.170    0.866    
as.factor(ratings$group)5 -0.29167    0.53563  -0.545    0.589    
as.factor(ratings$group)6 -0.50000    0.47909  -1.044    0.303    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.174 on 39 degrees of freedom
Multiple R-squared:  0.0441,	Adjusted R-squared:  -0.02943 
F-statistic: 0.5997 on 3 and 39 DF,  p-value: 0.619

[1] fair

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6250 -0.3750 -0.1818  0.6667  1.5833 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 4.3333     0.2738  15.824   <2e-16 ***
as.factor(ratings$group)4  -0.1515     0.3960  -0.383    0.704    
as.factor(ratings$group)5  -0.7083     0.4330  -1.636    0.110    
as.factor(ratings$group)6  -0.9167     0.3873  -2.367    0.023 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9486 on 39 degrees of freedom
Multiple R-squared:  0.1569,	Adjusted R-squared:  0.09209 
F-statistic:  2.42 on 3 and 39 DF,  p-value: 0.08061

[1] responsive

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7500 -0.5644  0.4167  0.6439  1.4546 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 4.1667     0.2880  14.469   <2e-16 ***
as.factor(ratings$group)4  -0.6212     0.4164  -1.492    0.144    
as.factor(ratings$group)5  -0.4167     0.4553  -0.915    0.366    
as.factor(ratings$group)6  -0.5833     0.4073  -1.432    0.160    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9976 on 39 degrees of freedom
Multiple R-squared:  0.06872,	Adjusted R-squared:  -0.002918 
F-statistic: 0.9593 on 3 and 39 DF,  p-value: 0.4216

[1] praised

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0000 -0.4546  0.2500  0.4167  1.2500 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 4.5833     0.2611  17.557   <2e-16 ***
as.factor(ratings$group)4  -0.1288     0.3775  -0.341   0.7348    
as.factor(ratings$group)5  -0.5833     0.4128  -1.413   0.1655    
as.factor(ratings$group)6  -0.8333     0.3692  -2.257   0.0297 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9043 on 39 degrees of freedom
Multiple R-squared:  0.1407,	Adjusted R-squared:  0.0746 
F-statistic: 2.129 on 3 and 39 DF,  p-value: 0.1122

[1] knowledgeable

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1250 -0.3636  0.1667  0.7500  1.1667 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 4.2500     0.2899  14.659   <2e-16 ***
as.factor(ratings$group)4   0.1136     0.4192   0.271    0.788    
as.factor(ratings$group)5  -0.1250     0.4584  -0.273    0.787    
as.factor(ratings$group)6  -0.4167     0.4100  -1.016    0.316    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.004 on 39 degrees of freedom
Multiple R-squared:  0.04435,	Adjusted R-squared:  -0.02916 
F-statistic: 0.6033 on 3 and 39 DF,  p-value: 0.6168

[1] clear

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5000 -0.5000  0.5000  0.6364  1.6364 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 4.4167     0.3534  12.496 3.26e-15 ***
as.factor(ratings$group)4  -1.0530     0.5111  -2.060   0.0461 *  
as.factor(ratings$group)5  -0.9167     0.5588  -1.640   0.1090    
as.factor(ratings$group)6  -0.9167     0.4998  -1.834   0.0743 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.224 on 39 degrees of freedom
Multiple R-squared:  0.1229,	Adjusted R-squared:  0.05539 
F-statistic: 1.821 on 3 and 39 DF,  p-value: 0.1593

[1] overall

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7500 -0.3333  0.2500  0.6667  1.3750 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 4.3333     0.2952  14.678   <2e-16 ***
as.factor(ratings$group)4  -0.3333     0.4269  -0.781    0.440    
as.factor(ratings$group)5  -0.7083     0.4668  -1.517    0.137    
as.factor(ratings$group)6  -0.5833     0.4175  -1.397    0.170    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.023 on 39 degrees of freedom
Multiple R-squared:  0.07095,	Adjusted R-squared:  -0.0005127 
F-statistic: 0.9928 on 3 and 39 DF,  p-value: 0.4063

In [77]:

# model using "group" variable.
# use as.factor() to get R to treat the variable as a dummy
# Omit the intercept
for (ch in character) {
    reg_mod <- lm(unlist(ratings[ch]) ~ as.factor(ratings$group) - 1) # omit the intercept
    print(ch, quote=F)
    print(summary(reg_mod))
}

[1] professional

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - 
    1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0833 -0.5833  0.3636  0.4167  1.1250 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
as.factor(ratings$group)3   4.5833     0.3040   15.08  < 2e-16 ***
as.factor(ratings$group)4   4.6364     0.3175   14.60  < 2e-16 ***
as.factor(ratings$group)5   3.8750     0.3723   10.41 8.16e-13 ***
as.factor(ratings$group)6   4.0833     0.3040   13.43 3.28e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.053 on 39 degrees of freedom
Multiple R-squared:  0.9492,	Adjusted R-squared:  0.944 
F-statistic: 182.3 on 4 and 39 DF,  p-value: < 2.2e-16

[1] respect

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - 
    1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0833 -0.5833  0.3636  0.4167  1.1250 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
as.factor(ratings$group)3   4.5833     0.3040   15.08  < 2e-16 ***
as.factor(ratings$group)4   4.6364     0.3175   14.60  < 2e-16 ***
as.factor(ratings$group)5   3.8750     0.3723   10.41 8.16e-13 ***
as.factor(ratings$group)6   4.0833     0.3040   13.43 3.28e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.053 on 39 degrees of freedom
Multiple R-squared:  0.9492,	Adjusted R-squared:  0.944 
F-statistic: 182.3 on 4 and 39 DF,  p-value: < 2.2e-16

[1] caring

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - 
    1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6250 -0.4375  0.3333  0.7500  1.3750 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
as.factor(ratings$group)3   4.2500     0.3056  13.908  < 2e-16 ***
as.factor(ratings$group)4   4.0909     0.3192  12.818 1.46e-15 ***
as.factor(ratings$group)5   3.6250     0.3743   9.686 6.27e-12 ***
as.factor(ratings$group)6   3.6667     0.3056  11.999 1.15e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.059 on 39 degrees of freedom
Multiple R-squared:  0.9385,	Adjusted R-squared:  0.9322 
F-statistic: 148.9 on 4 and 39 DF,  p-value: < 2.2e-16

[1] enthusiastic

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - 
    1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.75000 -0.25000 -0.09091  0.62500  1.50000 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
as.factor(ratings$group)3   4.2500     0.2911   14.60  < 2e-16 ***
as.factor(ratings$group)4   4.0909     0.3040   13.46 3.10e-16 ***
as.factor(ratings$group)5   3.7500     0.3565   10.52 6.00e-13 ***
as.factor(ratings$group)6   3.5000     0.2911   12.02 1.08e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.008 on 39 degrees of freedom
Multiple R-squared:  0.9433,	Adjusted R-squared:  0.9375 
F-statistic: 162.3 on 4 and 39 DF,  p-value: < 2.2e-16

[1] communicate

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - 
    1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7500 -0.3333  0.2500  0.6667  1.5000 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
as.factor(ratings$group)3   4.3333     0.3060  14.161  < 2e-16 ***
as.factor(ratings$group)4   4.0909     0.3196  12.799 1.53e-15 ***
as.factor(ratings$group)5   3.5000     0.3748   9.339 1.71e-11 ***
as.factor(ratings$group)6   3.7500     0.3060  12.254 6.00e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.06 on 39 degrees of freedom
Multiple R-squared:  0.9391,	Adjusted R-squared:  0.9329 
F-statistic: 150.4 on 4 and 39 DF,  p-value: < 2.2e-16

[1] helpful

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - 
    1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.12500 -0.82955  0.09091  1.00000  1.87500 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
as.factor(ratings$group)3   4.0000     0.3067  13.040 8.45e-16 ***
as.factor(ratings$group)4   3.9091     0.3204  12.201 6.87e-15 ***
as.factor(ratings$group)5   3.1250     0.3757   8.318 3.58e-10 ***
as.factor(ratings$group)6   3.7500     0.3067  12.225 6.46e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.063 on 39 degrees of freedom
Multiple R-squared:  0.9324,	Adjusted R-squared:  0.9254 
F-statistic: 134.4 on 4 and 39 DF,  p-value: < 2.2e-16

[1] feedback

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - 
    1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7500 -0.4167  0.2500  0.5833  1.3750 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
as.factor(ratings$group)3   4.4167     0.3267   13.52 2.66e-16 ***
as.factor(ratings$group)4   3.9091     0.3412   11.46 4.75e-14 ***
as.factor(ratings$group)5   3.6250     0.4001    9.06 3.88e-11 ***
as.factor(ratings$group)6   3.7500     0.3267   11.48 4.48e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.132 on 39 degrees of freedom
Multiple R-squared:  0.9312,	Adjusted R-squared:  0.9241 
F-statistic:   132 on 4 and 39 DF,  p-value: < 2.2e-16

[1] prompt

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - 
    1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7500 -0.3636  0.5833  0.6364  1.2500 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
as.factor(ratings$group)3   4.3333     0.2982   14.53  < 2e-16 ***
as.factor(ratings$group)4   4.3636     0.3115   14.01  < 2e-16 ***
as.factor(ratings$group)5   3.7500     0.3653   10.27 1.21e-12 ***
as.factor(ratings$group)6   3.4167     0.2982   11.46 4.76e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.033 on 39 degrees of freedom
Multiple R-squared:  0.9429,	Adjusted R-squared:  0.937 
F-statistic:   161 on 4 and 39 DF,  p-value: < 2.2e-16

[1] consistent

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - 
    1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.62500 -0.52083  0.08333  1.00000  1.58333 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
as.factor(ratings$group)3   3.9167     0.3388  11.562 3.60e-14 ***
as.factor(ratings$group)4   4.0000     0.3538  11.305 7.10e-14 ***
as.factor(ratings$group)5   3.6250     0.4149   8.737 1.01e-10 ***
as.factor(ratings$group)6   3.4167     0.3388  10.086 2.01e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.174 on 39 degrees of freedom
Multiple R-squared:  0.9185,	Adjusted R-squared:  0.9101 
F-statistic: 109.9 on 4 and 39 DF,  p-value: < 2.2e-16

[1] fair

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - 
    1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6250 -0.3750 -0.1818  0.6667  1.5833 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
as.factor(ratings$group)3   4.3333     0.2738   15.82  < 2e-16 ***
as.factor(ratings$group)4   4.1818     0.2860   14.62  < 2e-16 ***
as.factor(ratings$group)5   3.6250     0.3354   10.81 2.70e-13 ***
as.factor(ratings$group)6   3.4167     0.2738   12.48 3.42e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9486 on 39 degrees of freedom
Multiple R-squared:  0.9497,	Adjusted R-squared:  0.9446 
F-statistic: 184.2 on 4 and 39 DF,  p-value: < 2.2e-16

[1] responsive

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - 
    1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7500 -0.5644  0.4167  0.6439  1.4546 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
as.factor(ratings$group)3   4.1667     0.2880   14.47  < 2e-16 ***
as.factor(ratings$group)4   3.5455     0.3008   11.79 1.99e-14 ***
as.factor(ratings$group)5   3.7500     0.3527   10.63 4.38e-13 ***
as.factor(ratings$group)6   3.5833     0.2880   12.44 3.72e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9976 on 39 degrees of freedom
Multiple R-squared:  0.9405,	Adjusted R-squared:  0.9344 
F-statistic:   154 on 4 and 39 DF,  p-value: < 2.2e-16

[1] praised

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - 
    1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0000 -0.4546  0.2500  0.4167  1.2500 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
as.factor(ratings$group)3   4.5833     0.2611   17.56  < 2e-16 ***
as.factor(ratings$group)4   4.4545     0.2727   16.34  < 2e-16 ***
as.factor(ratings$group)5   4.0000     0.3197   12.51 3.14e-15 ***
as.factor(ratings$group)6   3.7500     0.2611   14.37  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9043 on 39 degrees of freedom
Multiple R-squared:  0.9601,	Adjusted R-squared:  0.956 
F-statistic: 234.5 on 4 and 39 DF,  p-value: < 2.2e-16

[1] knowledgeable

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - 
    1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1250 -0.3636  0.1667  0.7500  1.1667 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
as.factor(ratings$group)3   4.2500     0.2899   14.66  < 2e-16 ***
as.factor(ratings$group)4   4.3636     0.3028   14.41  < 2e-16 ***
as.factor(ratings$group)5   4.1250     0.3551   11.62 3.11e-14 ***
as.factor(ratings$group)6   3.8333     0.2899   13.22 5.43e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.004 on 39 degrees of freedom
Multiple R-squared:  0.9494,	Adjusted R-squared:  0.9443 
F-statistic: 183.1 on 4 and 39 DF,  p-value: < 2.2e-16

[1] clear

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - 
    1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5000 -0.5000  0.5000  0.6364  1.6364 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
as.factor(ratings$group)3   4.4167     0.3534  12.496 3.26e-15 ***
as.factor(ratings$group)4   3.3636     0.3692   9.112 3.33e-11 ***
as.factor(ratings$group)5   3.5000     0.4329   8.086 7.28e-10 ***
as.factor(ratings$group)6   3.5000     0.3534   9.903 3.37e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.224 on 39 degrees of freedom
Multiple R-squared:  0.9117,	Adjusted R-squared:  0.9026 
F-statistic: 100.7 on 4 and 39 DF,  p-value: < 2.2e-16

[1] overall

Call:
lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - 
    1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7500 -0.3333  0.2500  0.6667  1.3750 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
as.factor(ratings$group)3   4.3333     0.2952   14.68  < 2e-16 ***
as.factor(ratings$group)4   4.0000     0.3084   12.97 9.99e-16 ***
as.factor(ratings$group)5   3.6250     0.3616   10.03 2.38e-12 ***
as.factor(ratings$group)6   3.7500     0.2952   12.70 1.95e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.023 on 39 degrees of freedom
Multiple R-squared:  0.943,	Adjusted R-squared:  0.9372 
F-statistic: 161.4 on 4 and 39 DF,  p-value: < 2.2e-16

In [82]:

# model using taidgender variable and adjust for student gender and an interaction between
# student gender and taid gender
# Omit the intercept
# adjust for 
for (ch in character) {
    reg_mod <- lm(unlist(ratings[ch]) ~ 
                  ratings$taidgender + 
                  ratings$tagender + 
                  ratings$gender +
                  ratings$taidgender*ratings$gender
                  - 1 ) # omit intercept
    print(ch, quote=F)
    print(summary(reg_mod))
}

[1] professional

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender + ratings$taidgender * ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6610 -0.5996  0.3390  0.6526  2.8263 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender                  3.9249     0.8741   4.490 6.16e-05 ***
ratings$tagender                    0.3136     0.4103   0.764  0.44927    
ratings$gender                      2.1737     0.2303   9.438 1.28e-11 ***
ratings$taidgender:ratings$gender  -1.8126     0.5903  -3.071  0.00388 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.33 on 39 degrees of freedom
Multiple R-squared:  0.9191,	Adjusted R-squared:  0.9108 
F-statistic: 110.7 on 4 and 39 DF,  p-value: < 2.2e-16

[1] respect

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender + ratings$taidgender * ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6610 -0.5996  0.3390  0.6526  2.8263 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender                  3.9249     0.8741   4.490 6.16e-05 ***
ratings$tagender                    0.3136     0.4103   0.764  0.44927    
ratings$gender                      2.1737     0.2303   9.438 1.28e-11 ***
ratings$taidgender:ratings$gender  -1.8126     0.5903  -3.071  0.00388 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.33 on 39 degrees of freedom
Multiple R-squared:  0.9191,	Adjusted R-squared:  0.9108 
F-statistic: 110.7 on 4 and 39 DF,  p-value: < 2.2e-16

[1] caring

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender + ratings$taidgender * ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9971 -0.5246  0.1475  0.9204  3.0015 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender                  3.3344     0.8946   3.727 0.000613 ***
ratings$tagender                    0.1541     0.4200   0.367 0.715698    
ratings$gender                      1.9985     0.2357   8.478 2.21e-10 ***
ratings$taidgender:ratings$gender  -1.4805     0.6041  -2.451 0.018852 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.361 on 39 degrees of freedom
Multiple R-squared:  0.8984,	Adjusted R-squared:  0.888 
F-statistic: 86.22 on 4 and 39 DF,  p-value: < 2.2e-16

[1] enthusiastic

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender + ratings$taidgender * ratings$gender - 1)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.07312 -0.45347  0.08235  0.54653  2.96581 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender                 3.37947    0.85239   3.965 0.000304 ***
ratings$tagender                  -0.00237    0.40014  -0.006 0.995304    
ratings$gender                     2.03656    0.22460   9.067  3.8e-11 ***
ratings$taidgender:ratings$gender -1.49838    0.57562  -2.603 0.012999 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.297 on 39 degrees of freedom
Multiple R-squared:  0.9063,	Adjusted R-squared:  0.8967 
F-statistic: 94.33 on 4 and 39 DF,  p-value: < 2.2e-16

[1] communicate

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender + ratings$taidgender * ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1930 -0.5344 -0.1024  0.9401  2.9913 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender                  3.4949     0.8751   3.994 0.000279 ***
ratings$tagender                    0.1756     0.4108   0.427 0.671379    
ratings$gender                      2.0087     0.2306   8.711 1.09e-10 ***
ratings$taidgender:ratings$gender  -1.5768     0.5910  -2.668 0.011054 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.331 on 39 degrees of freedom
Multiple R-squared:  0.904,	Adjusted R-squared:  0.8942 
F-statistic: 91.82 on 4 and 39 DF,  p-value: < 2.2e-16

[1] helpful

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender + ratings$taidgender * ratings$gender - 1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.76834 -0.78077 -0.05169  1.03079  3.11583 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender                  3.4680     0.8227   4.215 0.000143 ***
ratings$tagender                    0.3743     0.3862   0.969 0.338415    
ratings$gender                      1.8842     0.2168   8.691 1.16e-10 ***
ratings$taidgender:ratings$gender  -1.6748     0.5556  -3.015 0.004509 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.252 on 39 degrees of freedom
Multiple R-squared:  0.9062,	Adjusted R-squared:  0.8965 
F-statistic: 94.16 on 4 and 39 DF,  p-value: < 2.2e-16

[1] feedback

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender + ratings$taidgender * ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2539 -0.5203  0.1990  0.8016  2.9285 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender                 3.13717    0.90379   3.471  0.00128 ** 
ratings$tagender                  -0.05545    0.42427  -0.131  0.89669    
ratings$gender                     2.12695    0.23815   8.931 5.68e-11 ***
ratings$taidgender:ratings$gender -1.40768    0.61033  -2.306  0.02649 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.375 on 39 degrees of freedom
Multiple R-squared:  0.8985,	Adjusted R-squared:  0.888 
F-statistic: 86.27 on 4 and 39 DF,  p-value: < 2.2e-16

[1] prompt

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender + ratings$taidgender * ratings$gender - 1)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.05733 -0.37746 -0.05733  0.69819  3.00916 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender                 4.28125    0.85603   5.001 1.25e-05 ***
ratings$tagender                   0.07565    0.40185   0.188  0.85166    
ratings$gender                     1.99084    0.22556   8.826 7.76e-11 ***
ratings$taidgender:ratings$gender -1.97028    0.57808  -3.408  0.00153 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.302 on 39 degrees of freedom
Multiple R-squared:  0.9093,	Adjusted R-squared:    0.9 
F-statistic: 97.72 on 4 and 39 DF,  p-value: < 2.2e-16

[1] consistent

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender + ratings$taidgender * ratings$gender - 1)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.02803 -0.47156  0.07008  0.95445  3.03504 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender                 3.03236    0.89781   3.377  0.00167 ** 
ratings$tagender                   0.09811    0.42146   0.233  0.81714    
ratings$gender                     1.96496    0.23657   8.306 3.72e-10 ***
ratings$taidgender:ratings$gender -1.37153    0.60629  -2.262  0.02934 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.366 on 39 degrees of freedom
Multiple R-squared:  0.8896,	Adjusted R-squared:  0.8783 
F-statistic: 78.57 on 4 and 39 DF,  p-value: < 2.2e-16

[1] fair

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender + ratings$taidgender * ratings$gender - 1)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.04457 -0.50273  0.05483  0.86143  2.93072 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender                 3.48160    0.71829   4.847 2.03e-05 ***
ratings$tagender                  -0.09399    0.33719  -0.279  0.78191    
ratings$gender                     2.06928    0.18927  10.933 1.93e-13 ***
ratings$taidgender:ratings$gender -1.51172    0.48506  -3.117  0.00343 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.093 on 39 degrees of freedom
Multiple R-squared:  0.9333,	Adjusted R-squared:  0.9264 
F-statistic: 136.4 on 4 and 39 DF,  p-value: < 2.2e-16

[1] responsive

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender + ratings$taidgender * ratings$gender - 1)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.12009 -0.73658 -0.08528  0.91472  3.04229 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender                  4.3317     0.8308   5.214 6.37e-06 ***
ratings$tagender                   -0.2047     0.3900  -0.525 0.602692    
ratings$gender                      2.1624     0.2189   9.878 3.62e-12 ***
ratings$taidgender:ratings$gender  -2.4088     0.5610  -4.293 0.000113 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.264 on 39 degrees of freedom
Multiple R-squared:  0.9045,	Adjusted R-squared:  0.8947 
F-statistic:  92.3 on 4 and 39 DF,  p-value: < 2.2e-16

[1] praised

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender + ratings$taidgender * ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4200 -0.4200  0.2563  0.6541  2.8051 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender                 3.94806    0.76826   5.139 8.07e-06 ***
ratings$tagender                  -0.03009    0.36065  -0.083  0.93393    
ratings$gender                     2.22503    0.20244  10.991 1.65e-13 ***
ratings$taidgender:ratings$gender -1.82722    0.51881  -3.522  0.00111 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.169 on 39 degrees of freedom
Multiple R-squared:  0.9333,	Adjusted R-squared:  0.9265 
F-statistic: 136.5 on 4 and 39 DF,  p-value: < 2.2e-16

[1] knowledgeable

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender + ratings$taidgender * ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5392 -0.5392 -0.1096  0.7076  2.7990 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender                  3.8393     0.8661   4.433 7.35e-05 ***
ratings$tagender                    0.1371     0.4066   0.337  0.73782    
ratings$gender                      2.2010     0.2282   9.645 7.06e-12 ***
ratings$taidgender:ratings$gender  -1.9308     0.5849  -3.301  0.00206 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.318 on 39 degrees of freedom
Multiple R-squared:  0.913,	Adjusted R-squared:  0.9041 
F-statistic: 102.3 on 4 and 39 DF,  p-value: < 2.2e-16

[1] clear

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender + ratings$taidgender * ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2322 -0.5763  0.1429  0.9817  3.2590 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender                  3.5171     0.9856   3.569 0.000971 ***
ratings$tagender                   -0.3750     0.4627  -0.811 0.422506    
ratings$gender                      2.1161     0.2597   8.148 6.01e-10 ***
ratings$taidgender:ratings$gender  -1.7269     0.6656  -2.595 0.013273 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.499 on 39 degrees of freedom
Multiple R-squared:  0.8676,	Adjusted R-squared:  0.854 
F-statistic: 63.88 on 4 and 39 DF,  p-value: < 2.2e-16

[1] overall

Call:
lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + 
    ratings$gender + ratings$taidgender * ratings$gender - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2292 -0.4522  0.0530  0.7969  2.9114 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
ratings$taidgender                 3.36381    0.84848   3.965 0.000304 ***
ratings$tagender                   0.05205    0.39831   0.131 0.896710    
ratings$gender                     2.08856    0.22357   9.342  1.7e-11 ***
ratings$taidgender:ratings$gender -1.55738    0.57298  -2.718 0.009747 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.291 on 39 degrees of freedom
Multiple R-squared:  0.9093,	Adjusted R-squared:  0.8999 
F-statistic:  97.7 on 4 and 39 DF,  p-value: < 2.2e-16

Summary¶

You might have noticed that we can get an enormous range of $p$-values for the effect of perceived gender, depending on the other terms we choose to keep in the model. In particular, omitting the intercept term will make many things appear significant, since the data have a large mean, which the other variables cannot capture well.

Analyses like these are rather ad hoc—and they don't match the underlying experiment in the first place!

Why should a student's rating of an instructor be a linear combination of these things, with additive "noise"?

Moreover, if we perform a variety of analyses and then just keep the one we like best, the $p$-values and other statistical summaries become meaningless. This can lead to "significance hunting" or "significance mining." Methods that can control for the selection process—deciding which terms to keep in the model on the basis of the data—are a current area of active research in Statistics.

Assignment¶

The directory /Data has a .csv file called "ModuloRimbSpesePrestazOccasGratuiteRisoluz.49E2013_NonRes_Ingl.csv" downloaded from http://nats.sct.gob.mx/english/go-to-tables/table-8-domestic-passenger-travel/table-8-1-domestic-passenger-travel-by-mode/

The data are reported passenger miles annually in the U.S.A. from 1990–2012 for a variety of modes of transportation. The data are time series: they give a sequence of observations of the same variables at different times.

Please provide your code and your results for the following. The data editing to remove variables, transpose tables, etc., should be done using R commands—not using point-and-click tools. Do not use Excel for any part of this assignment.

Data munging. The .csv file has some notes at the bottom; those notes will make it hard to load the data file into R.
1. Make a "clean" version of the table without the notes and with appropriate column headers. It is OK to use a text editor to do this.
2. Read the data into an R dataframe and print summary statistics of the dataframe
3. Transpose the table so that the columns are modes of transportation and the rows are years, using R commands.
4. Use R commands to delete any variables for which data are unavailable.
5. Print the transposed data, after deleting those variables.
6. Visually inspect the data to look for linear dependence. Are any of the variables obviously linearly dependent? If so, which?
7. Given what the table represents, should there be linear dependencies among the rows?
Data modeling
1. Fit a regression model to the data by ordinary least squares using lm(), modeling bus travel as a linear function of rail travel, air travel, (air travel)², and personal vehicle travel.
2. Discuss the R output for the fitted model. What do the numbers mean? What do you conclude?
Testing
1. Because these data are a time series, we expect trends. Variables with trends will tend to be associated with each other, even if there is no real connection between them. A common first step in analyzing time series is to "pre-whiten" the data, for instance, by removing trends. Construct new variables from the old variables by taking year-to-year differences. For example, the first element of the new "rail" variable should be the amount of rail travel in 1991, minus the amount in 1990. See the documentation for the R function diff().
2. Develop a permutation test to check whether year-over-year differences in rail travel are associated with year-over-year differences in motorcycle travel. As a test statistic, use the Pearson correlation coefficient $r_{xy}$.
3. Is the association positive or negative?
4. Use $10^5$ random permutations to estimate the $p$-value of the hypothesis of no association.
5. Find a 95% upper confidence bound on the true $p$-value from the estimated $p$-value.
6. Explain clearly what the null hypothesis means and how to interpret the estimated $p$-value. How strong is the evidence for association? What would association signify?

Logistic regression¶

[TO DO]