This section is based partly on Freedman, D.A., 2009. Statistical Models: Theory and Practice, Revised Edition, Cambridge University Press.
Statistical models, and regression in particular, are used primarily for three purposes: to describe or summarize a set of data, to predict new or future observations, and to draw causal inferences.
It is straightforward to check whether a regression model is a good summary of existing data, although there is some subtlety in determining whether the summary is good enough. How to measure goodness of fit appropriately is not always obvious, and adequacy of fit depends on the use of the summary.
Prediction is harder than description because it involves extrapolation: how can one tell what the future will bring? Why should the future be like the past? Is the system under study stable (i.e., stationary) or are its properties changing with time?
However, the hardest of these tasks is causal inference. The biggest difficulty in drawing causal inferences is confounding, especially when the data arise from observational studies rather than randomized, controlled experiments. (Natural experiments lie somewhere in between; there are few that approach the utility of a randomized controlled experiment, but John Snow's study of the communication of cholera is a notable exception.)
Confounding occurs when some factor or variable affects the outcome in a way that cannot be distinguished from the effect of the treatment.
Stratification (e.g., cross tabulation) can help reduce confounding. So can modeling—in some cases, but not in others. For modeling to help, it is generally necessary for the structure of the model to correspond to how the data were actually generated. Unfortunately, most models in science, especially social science, are chosen out of habit or computational convenience, not because they have a basis in the science itself. This often produces misleading results, and the misleading impression that those results have small uncertainties.
If $\{x_i\}_{i=1}^n$ is a list of numbers, $$ \bar{x} \equiv \frac{1}{n} \sum_{i=1}^n x_i $$ is the mean of the list; $$ \mbox{var } x \equiv \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 $$ is the variance of the list; $$ s_x \equiv \sqrt{\mbox{var } x} $$ is the SD or standard deviation of the list; and $$ z_i \equiv \frac{x_i - \bar{x}}{s_x} $$ is $x_i$ in standard units. With $\bar{y}$ and $s_y$ defined analogously for the list $\{y_i\}_{i=1}^n$, and if $s_x$ and $s_y$ are nonzero,
$$ r_{xy} \equiv \frac{1}{n} \sum_{i=1}^n \frac{x_i - \bar{x}}{s_x} \cdot \frac{y_i-\bar{y}}{s_y} $$is the correlation of $x$ and $y$.
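These definitions are easy to check numerically. Here is a short sketch with made-up numbers; note that R's built-in `var()` and `sd()` divide by $n-1$ rather than $n$, but those factors cancel in the correlation, so `cor()` agrees with $r_{xy}$ as defined here.

```r
# Check the definitions on a small made-up list of numbers
x <- c(1, 2, 4, 7)
y <- c(2, 1, 5, 9)
sd_n <- function(v) sqrt(mean((v - mean(v))^2))  # SD with divisor n, per the definition above
z <- (x - mean(x)) / sd_n(x)                     # x in standard units: mean 0, SD 1
r_xy <- mean(((x - mean(x)) / sd_n(x)) * ((y - mean(y)) / sd_n(y)))
r_xy
cor(x, y)  # agrees with r_xy: the n vs. n-1 factors cancel in the correlation
```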
Suppose we observe pairs $\{(x_i, y_i)\}_{i=1}^n$. What straight line $y = a + bx$ comes closest to fitting these data, in the least-squares sense? That is, what values $a$ and $b$ minimize $$ \mbox{mean squared residual} = \mbox{MSR} \equiv \frac{1}{n}\sum_{i=1}^n \left ( y_i - (a + bx_i) \right )^2? $$ The solution is $b = r_{xy} \frac{s_y}{s_x}$ and $a = \bar{y} - b \bar{x}$.
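Before the proof, a quick numerical check with made-up data: the formulas for $a$ and $b$ should reproduce the intercept and slope that R's `lm()` computes.

```r
# Verify the least-squares formulas against lm() on made-up data
x <- c(1, 2, 3, 4, 5)
y <- c(2.2, 3.9, 6.1, 8.0, 9.7)
sd_n <- function(v) sqrt(mean((v - mean(v))^2))  # SD with divisor n
r <- mean(((x - mean(x)) / sd_n(x)) * ((y - mean(y)) / sd_n(y)))
b <- r * sd_n(y) / sd_n(x)   # slope
a <- mean(y) - b * mean(x)   # intercept
c(a, b)
coef(lm(y ~ x))  # should match (a, b)
```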
Proof. The mean squared residual is $$ \mbox{MSR} = \frac{1}{n}\sum_{i=1}^n \left ( y_i - (a + bx_i) \right )^2 = \frac{1}{n}\sum_{i=1}^n \left ( y_i^2 - 2y_i(a + bx_i) + (a+bx_i)^2 \right ) = \frac{1}{n}\sum_{i=1}^n \left ( y_i^2 - 2y_i(a + bx_i) + a^2+2abx_i + b^2x_i^2 \right ). $$
This is quadratic in both $a$ and $b$, and has positive leading coefficients in both. Hence, its minimum occurs at a stationary point with respect to both $a$ and $b$. We can differentiate inside the sum and solve for the stationary point:
$$ 0 = \frac{\partial \mbox{MSR}}{\partial a} = \frac{1}{n} \sum_{i=1}^n 2(y_i - (a+b x_i)) = 2 (\bar{y} - a - b \bar{x}). $$Hence, the optimal $a$ solves $a = \bar{y} - b \bar{x}$. Substituting this into the expression for mean squared residual gives
$$ \mbox{MSR} = \frac{1}{n} \sum_{i=1}^n \left ( y_i - (\bar{y} - b \bar{x} + bx_i) \right )^2 = \frac{1}{n} \sum_{i=1}^n \left ( y_i - \bar{y} + b (\bar{x} - x_i) \right )^2. $$Differentiating with respect to $b$ and finding the stationary point gives $$ 0 = \frac{\partial \mbox{MSR}}{\partial b} = \frac{1}{n} \sum_{i=1}^n 2 \left ( y_i - \bar{y} + b (\bar{x} - x_i) \right )(\bar{x} - x_i) = 2 \left ( \frac{1}{n} \sum_{i=1}^n ( y_i - \bar{y})(\bar{x} - x_i) + \frac{b}{n} \sum_{i=1}^n (\bar{x}-x_i)^2 \right ) = 2 \left ( b s_x^2 - r_{xy} s_x s_y \right ), $$
i.e.,
$$ b = \frac{s_x s_y r_{xy}}{s_x^2} = r_{xy}\frac{s_y}{s_x}. $$The mean of the squared residuals from the regression line has a simple expression:
$$\frac{1}{n} \sum_{i=1}^n e_i^2 = \frac{1}{n}\sum_{i=1}^n \left ( y_i - \bar{y} + r_{xy} \frac{s_y}{s_x} \bar{x} - r_{xy} \frac{s_y}{s_x} x_i \right )^2 $$$$ = \frac{1}{n} \sum_{i=1}^n ( y_i - \bar{y})^2 + 2r_{xy} \frac{s_y}{s_x} \frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})(\bar{x} - x_i) + r_{xy}^2 \frac{s_y^2}{s_x^2} \frac{1}{n} \sum_{i=1}^n (\bar{x} - x_i)^2 $$$$ = s_y^2 + 2r_{xy} \frac{s_y}{s_x} (-r_{xy}s_x s_y) + r_{xy}^2 \frac{s_y^2}{s_x^2} s_x^2 $$$$ = s_y^2 - 2 r_{xy}^2 s_y^2 + r_{xy}^2 s_y^2 $$$$ = s_y^2 (1 - r_{xy}^2). $$Suppose that $\theta \in \Re^p$ is a parameter, and let $\hat{\theta}$ be an estimator of $\theta$. The bias of $\hat{\theta}$ is
$$ \mbox{bias } \hat{\theta} = {\mathbb E}(\hat{\theta} - \theta) = {\mathbb E} \hat{\theta} - \theta. $$An estimate $\hat{\theta}$ of a parameter $\theta \in \Theta$ is unbiased if ${\mathbb E}_\eta \hat{\theta} = \eta$ for all $\eta \in \Theta$.
The mean squared error (MSE) of an estimate $\hat{\theta}$ of $\theta$ is
$$ \mbox{MSE } \hat{\theta} = {\mathbb E} \| \hat{\theta} - \theta \|^2. $$The variance of $\hat{\theta}$ is
$$ \mbox{var } \hat{\theta} = {\mathbb E} \| \hat{\theta} - {\mathbb E} \hat{\theta} \|^2. $$Proposition.
$$ \mbox{MSE } \hat{\theta} = \mbox{var } \hat{\theta} + \| \mbox{bias } \hat{\theta} \|^2. $$Proof.
$$ \| \hat{\theta} - \theta \|^2 = \| \hat{\theta} - {\mathbb E} \hat{\theta} - (\theta - {\mathbb E} \hat{\theta}) \|^2 = (\hat{\theta} - {\mathbb E} \hat{\theta} - (\theta - {\mathbb E} \hat{\theta}))' (\hat{\theta} - {\mathbb E} \hat{\theta} - (\theta - {\mathbb E} \hat{\theta})) $$$$ = (\hat{\theta} - {\mathbb E} \hat{\theta})'(\hat{\theta} - {\mathbb E} \hat{\theta}) + 2 (\hat{\theta} - {\mathbb E} \hat{\theta})'(\theta - {\mathbb E} \hat{\theta}) + (\theta - {\mathbb E} \hat{\theta})'(\theta - {\mathbb E} \hat{\theta}) $$$$ = \| \hat{\theta} - {\mathbb E} \hat{\theta} \|^2 + 2(\hat{\theta} - {\mathbb E} \hat{\theta})'(\theta - {\mathbb E} \hat{\theta}) + \| \theta - {\mathbb E} \hat{\theta} \|^2. $$Hence,
$$ \mbox{MSE } \hat{\theta} = {\mathbb E} \| \hat{\theta} - {\mathbb E} \hat{\theta} \|^2 + {\mathbb E} \left ( 2(\hat{\theta} - {\mathbb E} \hat{\theta})'(\theta - {\mathbb E} \hat{\theta}) \right ) + {\mathbb E} \| \theta - {\mathbb E} \hat{\theta} \|^2 $$$$ = {\mathbb E} \| \hat{\theta} - {\mathbb E} \hat{\theta} \|^2 + 2 \left ( {\mathbb E} \hat{\theta} - {\mathbb E} \hat{\theta} \right )'(\theta - {\mathbb E} \hat{\theta}) + {\mathbb E} \| \theta - {\mathbb E} \hat{\theta} \|^2 $$$$ = \mbox{var } \hat{\theta} + {\mathbf 0} + \| \mbox{bias } \hat{\theta} \|^2 = \mbox{var } \hat{\theta} + \| \mbox{bias } \hat{\theta} \|^2, $$since $\theta - {\mathbb E} \hat{\theta}$ is constant, so the cross term vanishes. Multiple linear regression is a commonly used tool in most branches of science, as well as economics. It is frequently misinterpreted.
We will start with an introduction to the linear algebra of constructing the least-squares estimate for linear regression, then discuss the features and limitations of regression, especially the limitations of drawing causal inferences from regression models.
The basic relationship for linear regression is the equation
$$ Y = X \beta + \epsilon. $$Here, $Y$ is an $n$-vector of data, called the dependent variable, the response, the explained variable, the modeled variable, or the left hand side. The matrix $X \in {\mathcal M}(n,p)$, the design matrix, has $n \ge p$ and $\mbox{rank}(X)=p$ (full rank), so that the columns of $X$ are linearly independent.
The vector $\beta$ is a $p$-vector of parameters, model parameters, or coefficients.
The usual goal of regression is to estimate $\beta$.
The vector $\epsilon$ is an $n$-vector of error, disturbance, or noise.
We will usually assume that $\epsilon$ is random, which makes $Y$ random as well. We will usually treat $X$ as fixed rather than random.
There is an observation $Y_i$ for each unit of observation, a row of $X$ for each observation, and a column of $X$ for each parameter (element of $\beta$). The columns correspond to explanatory variables, independent variables, covariates, control variables, or right-hand side variables.
We have observations of $Y$, which are assumed to be values of $X\beta + \epsilon$. The value of $\beta$ is unknown. The value of $\epsilon$ cannot be observed.
The standard assumption in multiple regression is that the noise terms $\epsilon$ are random, with $$ {\mathbb E} \epsilon = {\mathbf 0} \mbox{ and } \mbox{cov}(\epsilon) = {\mathbb E} \epsilon \epsilon' = \sigma^2 {\mathbf 1}_n. $$That is, the components of $\epsilon$ have zero mean and common variance $\sigma^2$, and are uncorrelated.
It is also standard to assume that if $X$ is random, $X$ and $\epsilon$ are independent. Regardless, the value of $X$ is observed.
Since $X$ is observed, we can find $\mbox{rank}(X)$. But there is no way to test whether the main assumptions are true, that is, whether ${\mathbb E} \epsilon = {\mathbf 0}$ and $\mbox{cov}(\epsilon) = \sigma^2 {\mathbf 1}_n$.
It is common to condition on $X$; that is, to treat it as fixed rather than random.
The ordinary least squares (OLS) estimate of $\beta$ is
$$ \hat{\beta} \equiv \left ( X'X \right )^{-1}X' Y. $$The estimate $\hat{\beta}$ is a $p$-vector, just like $\beta$.
The residuals or fitting errors are $$ e \equiv Y - X \hat{\beta}. $$
Proposition. The residuals are orthogonal to the columns of $X$: $X'e = {\mathbf 0}$. Moreover, $\hat{\beta}$ minimizes $\| Y - X \gamma \|^2$ over all $\gamma \in \Re^p$. This is a special case of the Projection Theorem.
Proof.
We prove the first part by direct calculation: $ X'e = X' (Y - X \hat{\beta})$, so
$$ \left ( X'X \right )^{-1} X' e = \left ( X'X \right )^{-1} X' Y - \left ( X'X \right )^{-1} X'X \hat{\beta} = \hat{\beta} - \hat{\beta} = 0. $$Since $(X'X)^{-1}$ is invertible, it follows that $X'e = {\mathbf 0}$. For the second part, write $\gamma = \hat{\beta} + (\gamma - \hat{\beta})$. Then
$$ \| Y - X \gamma \|^2 = \| Y - X \hat{\beta} - X(\gamma - \hat{\beta}) \|^2 = \| e - X (\gamma - \hat{\beta})\|^2 $$$$ = \| e \|^2 + 2 e' X(\gamma - \hat{\beta}) + \| X(\gamma - \hat{\beta}) \|^2 \ge \| e \|^2 $$since $e'X = {\mathbf 0}$.
What does that orthogonality mean? Finding $\hat{\beta}$ amounts to finding the coefficients in a linear combination of columns of $X$ that give the best approximation to $Y$, in a least-squares sense. That is, we are approximating $Y \in \Re^n$ by an element of the $p$-dimensional subspace spanned by the columns of $X$. If the residuals had a non-zero component in that subspace, there would be a better approximation to $Y$ from that subspace.
For instance, if any column of $X$ is constant, then the residuals should have zero mean: $\sum_{i=1}^n e_i = 0$. Otherwise, the dot product of the residuals with that column of $X$ would not be zero: the residuals would not be orthogonal to $X$, and there would be an element of that subspace that is a better approximation to $Y$.
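Here is a small numerical illustration with simulated data: the OLS residuals are orthogonal to every column of $X$, and because the first column of $X$ is constant, the residuals also sum to zero.

```r
# Orthogonality of OLS residuals to the columns of X, on simulated data
set.seed(1)
n <- 20
X <- cbind(1, rnorm(n), rnorm(n))         # design matrix with a constant column
Y <- X %*% c(1, 2, 3) + rnorm(n)          # simulated responses from a known model
betahat <- solve(t(X) %*% X, t(X) %*% Y)  # solve the normal equations
e <- Y - X %*% betahat                    # residuals
t(X) %*% e  # numerically zero: residuals are orthogonal to the columns of X
sum(e)      # also numerically zero, because X has a constant column
```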
Let's consider statistical properties of $\hat{\beta}$. We start with (conditional) bias; later we will get to the MSE.
__Proposition. __ If the model is true and ${\mathbb E}(\epsilon \mid X) = {\mathbf 0}$, then $\hat{\beta}$ is conditionally unbiased: ${\mathbb E}(\hat{\beta} \mid X) = \beta$.
__Proof. __ By calculation:
$$ \hat{\beta} = (X'X)^{-1} X'Y = (X'X)^{-1} X'(X\beta + \epsilon) $$$$ = (X'X)^{-1}X'X\beta + (X'X)^{-1}X'\epsilon = \beta + (X'X)^{-1}X' \epsilon. $$Now, $$ {\mathbb E} \left ( (X'X)^{-1} X' \epsilon \mid X \right ) = (X'X)^{-1} X' {\mathbb E}(\epsilon \mid X). $$
Since $\epsilon$ and $X$ are independent, $$ {\mathbb E} (\epsilon \mid X) = {\mathbb E}\epsilon = 0. $$
Hence, $$ {\mathbb E} (\hat{\beta} \vert X) = {\mathbb E}(\beta + (X'X)^{-1}X'\epsilon \mid X) = \beta + {\mathbf 0} = \beta. $$
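A Monte Carlo sketch of conditional unbiasedness, with a made-up design matrix: holding $X$ fixed and redrawing the noise many times, the average of $\hat{\beta}$ across draws should be close to $\beta$.

```r
# Monte Carlo check that E(betahat | X) = beta, holding X fixed
set.seed(2)
n <- 50
beta <- c(1, -2)                    # "true" parameters for the simulation
X <- cbind(1, runif(n))             # fixed design matrix
solver <- solve(t(X) %*% X, t(X))   # the matrix (X'X)^{-1} X'
betahats <- replicate(10000, drop(solver %*% (X %*% beta + rnorm(n))))
rowMeans(betahats)  # close to beta = (1, -2)
```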
We drop an object from an unknown height $h$, with an unknown initial velocity $v$, subject to an unknown constant acceleration $a$.
We measure the height $Y_i$ of the object at times 0, 1, 2, and 3 seconds:
$t_i$ | $Y_i$ |
---|---|
0 | 2.043 |
1 | 6.856 |
2 | 22.565 |
3 | 47.709 |
The height at time $t$ is
$$ Y_i = h + v t_i + (1/2) a t_i^2 + \epsilon_i. $$This has three unknown parameters: the initial height $h$, the initial velocity $v$, and the acceleration $a$. It is an (unverifiable) assumption that the noise terms $\{\epsilon_i\}$ are independent and have zero mean and finite variance.
We can expand the data relationship as follows:
$$Y_1 = h + vt_1 + (1/2) a t_1^2 + \epsilon_1 = h \times 1 + v \times 0 + a \times 0 + \epsilon_1$$$$Y_2 = h + vt_2 + (1/2) a t_2^2 + \epsilon_2 = h \times 1 + v \times 1 + a \times (1/2)1^2 + \epsilon_2$$$$Y_3 = h + vt_3 + (1/2) a t_3^2 + \epsilon_3 = h \times 1 + v \times 2 + a \times (1/2)2^2 + \epsilon_3$$$$Y_4 = h + vt_4 + (1/2) a t_4^2 + \epsilon_4 = h \times 1 + v \times 3 + a \times (1/2)3^2 + \epsilon_4.$$We can write these relations in matrix form, as follows.
Define $$ Y \equiv \begin{pmatrix} 2.043 \\ 6.856 \\ 22.565 \\ 47.709 \end{pmatrix}, $$
$$ X \equiv \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0.5 \\ 1 & 2 & 2 \\ 1 & 3 & 4.5 \end{pmatrix}, $$$$ \beta \equiv \begin{pmatrix} h \\ v \\ a \end{pmatrix}, $$and
$$ \epsilon \equiv \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \end{pmatrix} . $$Then the data relationship can be written
$$ Y = X \beta + \epsilon. $$The least squares estimate $\hat{\beta}$ of $\beta$ is $$ \hat{\beta} = (X'X)^{-1}X'Y. $$ If the Newtonian gravity data relations are correct and the observational noise $\epsilon$ really has zero mean, then formally $\hat{\beta}$ is an unbiased estimate of $\beta$. However, we should not calculate $\hat{\beta}$ by inverting $X'X$, as discussed previously. Rather, we should use a matrix factorization and back-substitution.
# Least squares example in R
Y <- c(2.043, 6.856, 22.565, 47.709)
X <- matrix(c(1, 0, 0,
              1, 1, 0.5,
              1, 2, 2,
              1, 3, 4.5), nrow = 4, ncol = 3, byrow = TRUE)
X
betahat <- solve(t(X) %*% X, t(X) %*% Y)  # solve the normal equations; no explicit inverse
betahat
1 | 0 | 0 |
1.0 | 1.0 | 0.5 |
1 | 2 | 2 |
1.0 | 3.0 | 4.5 |
1.96995 |
0.02245 |
10.1655 |
That is, we would estimate that the object was dropped from a height of 1.970 m with an initial velocity of 0.022 m/s, under gravitational acceleration of 10.166 m/s$^2$.
Let's find the residual vector and verify that it is orthogonal to $X$:
e <- Y - X %*% betahat
e
t(e) %*% X
0.07305 |
-0.21915 |
0.21915 |
-0.07305 |
-9.992007e-15 | -2.753353e-14 | -4.218847e-14 |
# R also has a function to fit linear models: lm.
# Here is its help page:
help(lm)
lm {stats} | R Documentation |

`lm` is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although `aov` may provide a more convenient interface for these).

Usage:

lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, ...)

Models for `lm` are specified symbolically. A typical model has the form `response ~ terms`, where `response` is the (numeric) response vector and `terms` is a series of terms which specifies a linear predictor for `response`. A terms specification of the form `first + second` indicates all the terms in `first` together with all the terms in `second` with duplicates removed. A specification of the form `first:second` indicates the set of terms obtained by taking the interactions of all terms in `first` with all terms in `second`. The specification `first*second` indicates the cross of `first` and `second`; it is the same as `first + second + first:second`.

A formula has an implied intercept term. To remove this use either `y ~ x - 1` or `y ~ 0 + x`.

`lm` returns an object of class `"lm"`: a list whose components include `coefficients` (a named vector of coefficients), `residuals` (the response minus the fitted values), `fitted.values` (the fitted mean values), `rank` (the numeric rank of the fitted linear model), `df.residual` (the residual degrees of freedom), and `call` (the matched call). The functions `summary` and `anova` are used to obtain and print a summary and analysis of variance table of the results; the generic accessor functions `coefficients`, `effects`, `fitted.values` and `residuals` extract various useful features of the value returned by `lm`.
# It will help to be familiar with "formula" objects in R:
help(formula)
formula {stats} | R Documentation |

The generic function `formula` and its specific methods provide a way of extracting formulae which have been included in other objects. The models fit by, e.g., the `lm` and `glm` functions are specified in a compact symbolic form. The `~` operator is basic in the formation of such models: an expression of the form `y ~ model` is interpreted as a specification that the response `y` is modelled by a linear predictor specified symbolically by `model`. Such a model consists of a series of terms separated by `+` operators. The terms themselves consist of variable and factor names separated by `:` operators; such a term is interpreted as the interaction of all the variables and factors appearing in the term.

In addition to `+` and `:`, a number of other operators are useful in model formulae. The `*` operator denotes factor crossing: `a*b` is interpreted as `a + b + a:b`. The `^` operator indicates crossing to the specified degree: for example, `(a+b+c)^2` is identical to `(a+b+c)*(a+b+c)`, which expands to a formula containing the main effects for `a`, `b` and `c` together with their second-order interactions. The `%in%` operator indicates that the terms on its left are nested within those on the right: `a + b %in% a` expands to `a + a:b`. The `-` operator removes the specified terms, and can also be used to remove the intercept term: `y ~ x - 1` specifies a line through the origin. A model with no intercept can also be specified as `y ~ x + 0` or `y ~ 0 + x`.

While formulae usually involve just variable and factor names, they can also involve arithmetic expressions: the formula `log(y) ~ a + log(x)` is quite legal. When such arithmetic expressions involve operators which are also used symbolically in model formulae, there can be confusion between arithmetic and symbolic operator use. To avoid this confusion, the function `I()` can be used to bracket those portions of a model formula where the operators are used in their arithmetic sense: in the formula `y ~ a + I(b+c)`, the term `b+c` is interpreted as the sum of `b` and `c`.
# In R, there are always several ways to do anything!
# Here are several calls to lm() to fit the regression model.
# Note that lm() uses QR factorization by default to solve least-squares problems (rather than inverting matrices)
# R uses a fairly standard "formula" notation.
# "Y ~ X" models Y as a linear function of X.
# Y ~ X + I(X^2) models Y as a linear function of X and X^2.
# By default, R includes an intercept (constant) term. You can alter this, or take advantage of it.
lm(Y ~ X) # the constant term (intercept) corresponding to h is redundant: lm() automatically inserts an intercept.
# Since the model has a constant, X1 won't be estimated.
lm(Y ~ X[,2:3]) # this omits X1, the intercept column of X, from the model, since lm() automatically includes one.
lm(Y ~ X - 1) # this tells lm() not to add an intercept term (we don't need it, since X already has a constant term).
time <- 0:3; # set just the "time" variable
lm(Y ~ time + I(time^2/2)) # Model the position (Y) as an affine function of t and t^2.
# The intercept (h) term will be included automatically.
Call: lm(formula = Y ~ X) Coefficients: (Intercept) X1 X2 X3 1.96995 NA 0.02245 10.16550
Call: lm(formula = Y ~ X[, 2:3]) Coefficients: (Intercept) X[, 2:3]1 X[, 2:3]2 1.96995 0.02245 10.16550
Call: lm(formula = Y ~ X - 1) Coefficients: X1 X2 X3 1.96995 0.02245 10.16550
Call: lm(formula = Y ~ time + I(time^2/2)) Coefficients: (Intercept) time I(time^2/2) 1.96995 0.02245 10.16550
Recall that there is in general no way to test whether the regression model is true; in most applications, that the form of the model is correct is simply an assumption. If the model is not correct, the parameters in the model may be meaningless; and it is not obvious what uncertainties in estimates of the model parameters mean when the form of the model is wrong.
However, suppose that the regression model is true: $Y = X\beta + \epsilon$, where the components of $\epsilon$ are uncorrelated, with mean zero and variance $\sigma^2$, and $\epsilon$ and $X$ are independent. Then, as we have seen, $\hat{\beta}$ is an unbiased estimate of $\beta$, i.e., ${\mathbb E}\hat{\beta} = \beta$.
Moreover, we can find the covariance of the parameter estimates as follows:
Proposition. If the model is true, then $\mbox{cov}(\hat{\beta} \mid X) = \sigma^2 (X'X)^{-1}$.
Proof. Since $\hat{\beta}$ is conditionally unbiased under these assumptions,
$$ \hat{\beta} - {\mathbb E}(\hat{\beta} \mid X) = (X'X)^{-1}X'(X\beta + \epsilon) - \beta = \beta + (X'X)^{-1}X'\epsilon - \beta = (X'X)^{-1}X'\epsilon. $$Hence,
$$ \mbox{cov}(\hat{\beta} \mid X) = \mbox{cov}((X'X)^{-1}X'\epsilon \mid X) $$$$ = {\mathbb E} \left ( ((X'X)^{-1}X' \epsilon) ((X'X)^{-1}X' \epsilon)' \mid X \right ) $$$$ = {\mathbb E} \left ( (X'X)^{-1}X' \epsilon \epsilon' X (X'X)^{-1} \mid X \right ) $$(since $(X'X)^{-1} = ((X'X)')^{-1}$) $$ = (X'X)^{-1} X' {\mathbb E}(\epsilon \epsilon' \mid X) X (X'X)^{-1} $$ (by the linearity of expectation) $$ = (X'X)^{-1}X' \sigma^2 {\mathbf 1}_n X (X'X)^{-1} $$ $$ = \sigma^2 (X'X)^{-1}X'X(X'X)^{-1} = \sigma^2(X'X)^{-1}. $$
In typical applications, $\sigma^2$ is unknown. If we knew the disturbance term $\epsilon$, we could estimate $\sigma^2$ by $\frac{1}{n} \sum_{i=1}^n \epsilon_i^2$. But the disturbance terms $\epsilon$ are not observable. We do observe the residuals $e = Y - X\hat{\beta}$. If the model is true, we can use them to estimate $\sigma^2$.
The quantity $\frac{1}{n} \sum_{i=1}^n e_i^2$ tends to be an underestimate of $\sigma^2$ because $\hat{\beta}$ is chosen precisely to minimize this quantity over all possible values of $\beta$. To compensate for that, we can divide the sum of squared residuals by $n-p$ rather than by $n$. The quantity
$$ \hat{\sigma}^2 \equiv \frac{1}{n-p} \sum_{i=1}^n e_i^2 $$is a conditionally unbiased estimate of $\sigma^2$ given $X$, if the model is true. The number $n-p$ is called the degrees of freedom.
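A Monte Carlo sketch, with a made-up design, of why the divisor is $n-p$: holding $X$ fixed, the average of $\frac{1}{n-p}\sum_{i=1}^n e_i^2$ over many noise draws is close to $\sigma^2$, while dividing by $n$ instead gives an estimate that is systematically too small.

```r
# Monte Carlo check that dividing the sum of squared residuals by n - p,
# not n, gives an (approximately) unbiased estimate of sigma^2 = 1
set.seed(3)
n <- 10
p <- 3
X <- cbind(1, rnorm(n), rnorm(n))
H <- X %*% solve(t(X) %*% X, t(X))  # hat matrix: fitted values are H %*% Y
sse <- replicate(20000, {
  Y <- rnorm(n)       # model with beta = 0 and sigma^2 = 1
  e <- Y - H %*% Y    # residuals
  sum(e^2)
})
mean(sse) / (n - p)   # close to sigma^2 = 1
mean(sse) / n         # too small: close to (n - p)/n = 0.7
```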
Define $$ \hat{\mbox{cov}}(\hat{\beta} | X) \equiv \hat{\sigma}^2 (X'X)^{-1}. $$
This is a conditionally unbiased estimate of the covariance of $\hat{\beta}$ given $X$. (Note that we do have to invert a matrix to find this!)
Let's find this quantity for our toy problem.
# estimate the uncertainty of the model estimate.
# first estimate sigma^2
e <- Y - X %*% betahat; # the residuals
dims <- dim(X); # the number of rows is n; the number of columns is p
sigmahat2 <- sum(e^2)/(dims[1]-dims[2]);
sigmahat2
# now use this to find the estimate of cov(betahat)
covbhat <- sigmahat2 * solve(t(X) %*% X)
covbhat
0.10138975 | -0.11206235 | 0.05336302 |
-0.1120624 | 0.2614788 | -0.1600891 |
0.05336302 | -0.16008907 | 0.10672605 |
The estimated covariance matrix of $\hat{\beta}$ gives us standard errors for the individual components of $\hat{\beta}$. The estimated uncertainty of the $j$th parameter is the square-root of the $(j,j)$ element of $\hat{\mbox{cov}}(\hat{\beta} | X)$.
# find the square-root of the diagonal elements of the estimated covariance matrix
sqrt(diag(covbhat))
So, for instance, the estimated standard error of the estimate of $h$ is 0.3184...
Our estimates, with estimated uncertainties, are now
$h = 1.97 \pm 0.318$ | $v = 0.02 \pm 0.511$ | $a = 10.16 \pm 0.327$ |
In "reality," the data for this problem were generated using $h=3$, $v=0$, and $a = 9.8$, and the components of $\epsilon$ were iid standard normal (pseudo-)random variables: $\epsilon_i \sim N(0,1)$. That is, $\sigma = 1$, while we estimated $\hat{\sigma} = 0.327$!
Let's find the actual errors and compare their magnitude to the estimated standard errors of their estimates. Student's t statistic is the true error divided by the estimated standard error.
# find betahat - beta
beta <- c(3, 0, 9.8);
betahat - beta
tstat <- (betahat - beta)/sqrt(diag(covbhat)); # this is Student's t statistic for the estimates
tstat
-1.03005 |
0.02245 |
0.3655 |
-3.234903 |
0.04390339 |
1.118799 |
The estimated value of $h$ was off by more than 3.23 estimated standard errors.
We haven't discussed this yet, but if the components of $\epsilon$ are iid normal, then the $t$-statistics would have Student's $t$ distribution with $n-p$ degrees of freedom. In this case, $n=4$ and $p=3$, so the $t$-statistics have $4-3=1$ degree of freedom.
The chance that such a variable would be larger in magnitude than 3.234903 can be found using the R function pt(), which finds probabilities for Student's $t$-distribution.
help(pt)
TDist {stats}    R Documentation

Density, distribution function, quantile function and random generation for the t distribution with df degrees of freedom (and optional non-centrality parameter ncp).

    dt(x, df, ncp, log = FALSE)
    pt(q, df, ncp, lower.tail = TRUE, log.p = FALSE)
    qt(p, df, ncp, lower.tail = TRUE, log.p = FALSE)
    rt(n, df, ncp)

    x, q        vector of quantiles.
    p           vector of probabilities.
    n           number of observations.
    df          degrees of freedom (> 0, maybe non-integer).
    ncp         non-centrality parameter delta.
    log, log.p  logical; if TRUE, probabilities p are given as log(p).
    lower.tail  logical; if TRUE (default), probabilities are P[X ≤ x], otherwise P[X > x].

The t distribution with df = n degrees of freedom has density

    f(x) = Γ((n+1)/2) / (√(n π) Γ(n/2)) (1 + x²/n)^(-(n+1)/2)

for all real x. It has mean 0 (for n > 1) and variance n/(n-2) (for n > 2).

dt gives the density, pt gives the distribution function, qt gives the quantile function, and rt generates random deviates. Invalid arguments will result in return value NaN, with a warning.

[remainder of the help page omitted]
# how surprising is it for our estimates to be off by so much?
2*pt(abs(tstat),1, lower.tail = F) # estimated p-values: twice the upper tail probability
0.1908651 |
0.9720682 |
0.4643426 |
That is, there's about a 19% chance that the estimated value of $h$ would differ from the true value of $h$ by as much as it did, assuming the model is true—which it is (ignoring the distinction between truly random and pseudo-random).
Below we prove that $\hat{\sigma}^2$ is an unbiased estimate of $\sigma^2$ if the model is true.
The key to the proof is the "hat matrix" $H$, so called because it puts a "hat" on $Y$ (i.e., $HY = \hat{Y} = X\hat{\beta}$).
We have
$$ \hat{Y} = X \hat{\beta} = X(X'X)^{-1}X' Y = HY, $$where $$ H \equiv X(X'X)^{-1}X'. $$
The hat matrix has many useful properties: it is symmetric ($H' = H$); it is idempotent ($H^2 = H$), so $({\mathbf 1}_n - H)H = 0$; it leaves the columns of $X$ fixed ($HX = X$); and, if $X$ has full column rank, $\mbox{tr } H = p$.
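These properties are easy to verify numerically. A quick sketch, using a made-up full-rank $4 \times 3$ design `Xh` (not the toy problem's $X$):

```r
# Check the hat matrix properties numerically on a hypothetical design.
Xh <- cbind(1, 1:4, (1:4)^2)               # full-rank 4 x 3 design
H <- Xh %*% solve(t(Xh) %*% Xh) %*% t(Xh)  # the hat matrix
max(abs(H - t(H)))       # ~ 0: H is symmetric
max(abs(H %*% H - H))    # ~ 0: H is idempotent
max(abs(H %*% Xh - Xh))  # ~ 0: H leaves the columns of X fixed
sum(diag(H))             # trace(H) = p = 3
```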
We now prove that $\hat{\sigma}^2 = \frac{1}{n-p} \sum_{i=1}^n e_i^2$ is an unbiased estimate of $\sigma^2$.
Proof.
$$ e = ({\mathbf 1}_n - H)Y = ({\mathbf 1}_n - H)(X\beta + \epsilon) $$$$ = ({\mathbf 1}_n - H)X\beta + ({\mathbf 1}_n - H)\epsilon = ({\mathbf 1}_n - H)\epsilon. $$Hence,
$$ \| e \|^2 = \epsilon'({\mathbf 1}_n - H)'({\mathbf 1}_n - H)\epsilon = \epsilon' \tilde{H} \epsilon, $$where $$ \tilde{H} = ({\mathbf 1}_n - H)'({\mathbf 1}_n - H) = ({\mathbf 1}_n - H) - H({\mathbf 1}_n - H) = ({\mathbf 1}_n - H). $$
$$ {\mathbb E} \left ( \epsilon'\tilde{H}\epsilon \mid X \right ) = {\mathbb E} \left ( \mbox{tr} ( \tilde{H} \epsilon \epsilon' ) \mid X \right ) = \mbox{tr} \left ( \tilde{H} \, {\mathbb E} ( \epsilon \epsilon' \mid X ) \right ) = \mbox{tr} \left ( ({\mathbf 1}_n - H) \sigma^2 {\mathbf 1}_n \right ) = \sigma^2 ( n - \mbox{tr } H ) = \sigma^2 (n-p), $$since $\mbox{tr } {\mathbf 1}_n = n$ and $\mbox{tr } H = p$. Hence $$ {\mathbb E} ( \hat{\sigma}^2 \mid X ) = \frac{1}{n-p} {\mathbb E} \left ( \| e \|^2 \mid X \right ) = \sigma^2. $$
The Gauss-Markov Theorem says that if $Y = X\beta + \epsilon$ with $X$ fixed, and the components of $\epsilon$ are uncorrelated and have mean zero and common variance $\sigma^2$, then the linear estimator of $\beta$ that has smallest variance among all unbiased estimates (the best linear unbiased estimator or BLUE) is the ordinary least squares estimate $$ \hat{\beta}_{\mbox{OLS}} = (X'X)^{-1}X'Y. $$
[TO DO: add proof.]
The model we have been studying assumes that $\epsilon$ and $X$ are independent and that the components of $\epsilon$ are uncorrelated, have zero mean, and have the same variance $\sigma^2$ (i.e., ${\mathbb E} \epsilon = 0$, and $\mbox{cov } \epsilon = \sigma^2 {\mathbf 1}_n$).
When the covariance matrix is not diagonal but we still have ${\mathbb E} \epsilon = 0$ and $\epsilon$ independent of $X$, the ordinary least squares estimate is still unbiased.
$$ {\mathbb E} (\hat{\beta}_{\mbox{OLS}} \mid X) = {\mathbb E} \left ( (X'X)^{-1}X'Y \mid X \right ) $$$$ = {\mathbb E} \left ( (X'X)^{-1}X'(X\beta + \epsilon) \mid X \right ) = {\mathbb E} \left ( \beta + (X'X)^{-1}X'\epsilon \mid X \right ) $$$$ = \beta + (X'X)^{-1}X'{\mathbb E}(\epsilon \mid X) = \beta + 0 = \beta. $$Moreover, $$ {\mbox cov } (\hat{\beta}_{\mbox{OLS}} \mid X) = {\mathbb E} \left ( \left [ (X'X)^{-1}X'\epsilon \right ] \left [(X'X)^{-1}X'\epsilon \right ]' \mid X \right ) $$ $$ = {\mathbb E} \left ( (X'X)^{-1} X' \epsilon \epsilon' X (X'X)^{-1} \mid X \right ) $$ $$ = (X'X)^{-1}X'{\mathbb E} (\epsilon \epsilon' \mid X) X (X'X)^{-1} $$ $$ = (X'X)^{-1}X' \Sigma X (X'X)^{-1}. $$
In general, this is not diagonal, and the OLS estimate is no longer BLUE unless $\Sigma = \sigma^2 {\mathbf 1}_n$.
If the covariance matrix ${\mbox cov }\epsilon = \Sigma$ is known and positive definite, we can take it into account through generalized least squares (GLS), as follows. Since $\Sigma$ is positive definite, its square root $\Sigma^{1/2}$ exists and is invertible. We change the problem by pre-multiplying by $\Sigma^{-1/2}$:
$$ \Sigma^{-1/2}Y = (\Sigma^{-1/2}X) \beta + \Sigma^{-1/2} \epsilon. $$In this new problem the unknown parameter is still $\beta$, but there is a different "noise," $\Sigma^{-1/2}\epsilon$, we have new data $\Sigma^{-1/2}Y$, and we have a new design matrix $\Sigma^{-1/2} X$. Since $\Sigma$, $Y$, and $X$ are known, we can set up this regression problem using only things we know.
The components of the new noise terms are uncorrelated:
$$ \mbox{cov } (\Sigma^{-1/2} \epsilon | X) = {\mathbb E} (\Sigma^{-1/2} \epsilon \epsilon' \Sigma^{-1/2} \mid X) $$$$ = \Sigma^{-1/2} {\mathbb E} ( \epsilon \epsilon' \mid X) \Sigma^{-1/2} = \Sigma^{-1/2} \Sigma \Sigma^{-1/2} = {\mathbf 1}_n. $$This transforms the original problem into one in which the covariance matrix of the new noise is the identity, so OLS is BLUE for the transformed problem. That leads to the estimator $$ \hat{\beta}_{\mbox{GLS}} \equiv \left [ (\Sigma^{-1/2} X)'(\Sigma^{-1/2} X) \right ]^{-1}(\Sigma^{-1/2} X)'\Sigma^{-1/2}Y $$ $$ = \left [ X'\Sigma^{-1/2}\Sigma^{-1/2}X \right ] ^{-1} X' \Sigma^{-1/2}\Sigma^{-1/2}Y = (X' \Sigma^{-1}X)^{-1}X'\Sigma^{-1} Y. $$
To construct this estimator, we need $X'\Sigma^{-1}X$ to be invertible. Is it? $$ X'\Sigma^{-1}X = (\Sigma^{-1/2}X)'(\Sigma^{-1/2}X). $$ Since $\Sigma^{-1/2}$ is invertible and $X$ has full column rank, $\Sigma^{-1/2}X$ has full column rank, so $X'\Sigma^{-1}X$ is invertible.
It's straightforward to show that $\hat{\beta}_{\mbox{GLS}}$ is conditionally unbiased given $X$. Also,
$$ \mbox{cov } (\hat{\beta}_{\mbox{GLS}} \mid X) = (\tilde{X}'\tilde{X})^{-1}, $$where $\tilde{X} = \Sigma^{-1/2}X$, i.e., $$ \mbox{cov } (\hat{\beta}_{\mbox{GLS}} \mid X) = \left [ (\Sigma^{-1/2}X)'(\Sigma^{-1/2}X) \right ] ^{-1} = (X' \Sigma^{-1} X)^{-1}. $$
For fixed $X$, $\hat{\beta}_{\mbox{GLS}}$ is BLUE.
Suppose $\Sigma = \lambda \Gamma$, where $\lambda$ is an unknown positive constant and $\Gamma$ is a known, positive definite matrix.
We re-write the regression equation as
$$ (\Gamma^{-1/2} Y) = (\Gamma^{-1/2}X) \beta + (\Gamma^{-1/2} \epsilon). $$Then we still have ${\mathbb E} (\Gamma^{-1/2}\epsilon) = 0$ and $$ \mbox{cov } (\Gamma^{-1/2} \epsilon) = {\mathbb E} ((\Gamma^{-1/2} \epsilon)(\Gamma^{-1/2} \epsilon)') $$ $$ = {\mathbb E} (\Gamma^{-1/2} \epsilon \epsilon' \Gamma^{-1/2}) = \Gamma^{-1/2} {\mathbb E}( \epsilon \epsilon') \Gamma^{-1/2} $$ $$ = \Gamma^{-1/2} \lambda \Gamma \Gamma^{-1/2} = \lambda {\mathbf 1}_n. $$
Thus, the new problem fits the standard OLS model with $\lambda = \sigma^2$, and we have
$$ \hat{\beta} = \left ( (\Gamma^{-1/2}X)'(\Gamma^{-1/2}X) \right )^{-1} (\Gamma^{-1/2}X)'(\Gamma^{-1/2}Y) = (X'\Gamma^{-1}X)^{-1}X'\Gamma^{-1}Y. $$Also,
$$ \mbox{cov } \hat{\beta} = \lambda \left ( (\Gamma^{-1/2}X)'(\Gamma^{-1/2}X) \right )^{-1} = \lambda (X'\Gamma^{-1}X)^{-1}. $$An unbiased estimate of $\lambda$ is
$$ \hat{\lambda} = \frac{1}{n-p} \sum_{i=1}^n e_i^2 = \frac{1}{n-p} \sum_{i=1}^n \left [ (\Gamma^{-1/2} Y)_i - (\Gamma^{-1/2}X\hat{\beta})_i \right ]^2 $$$$ = \frac{1}{n-p} \sum_{i=1}^n \left [ \Gamma^{-1/2}(Y - X \hat{\beta} )\right ]_i^2. $$The estimated covariance matrix of $\hat{\beta}$ is
$$ \hat{\mbox{cov}} \hat{\beta} = \hat{\lambda} (X'\Gamma^{-1}X)^{-1}. $$Suppose that the measuring instrument in the falling-weight example has correlated errors with covariance matrix $\lambda \Gamma$, with $\lambda > 0$ unknown and
$$ \Gamma = \begin{pmatrix} 1 & 0.2 & 0.1 & 0.1 \\ 0.2 & 1 & 0.2 & 0.1 \\ 0.1 & 0.2 & 1 & 0.2 \\ 0.1 & 0.1 & 0.2 & 1 \end{pmatrix} . $$Let's see how this changes the estimate.
# R example.
gam <- matrix(c(1, .2, .1, .1, .2, 1, .2, .1, .1, .2, 1, .2, .1, .1, .2, 1), nrow=4, ncol=4, byrow=T)
gam
decomp <- eigen(gam)
sqGamInv <- decomp$vectors %*% diag(1/sqrt(decomp$values)) %*% t(decomp$vectors);
gY <- sqGamInv %*% Y;
gX <- sqGamInv %*% X;
betahat_GLS <- solve(t(gX) %*% gX, t(gX) %*% gY )
betahat_GLS # the GLS estimate
1.0 | 0.2 | 0.1 | 0.1 |
0.2 | 1.0 | 0.2 | 0.1 |
0.1 | 0.2 | 1.0 | 0.2 |
0.1 | 0.1 | 0.2 | 1.0 |
1.98456 |
0.01271 |
10.1655 |
# Estimate lambda
lambdahat <- sum((gY - gX %*% betahat_GLS)^2)/(dims[1]-dims[2]);
lambdahat
Usually $\Sigma$ is unknown: we have to estimate it to take advantage of the covariance of $\epsilon$. But an $n \times n$ covariance matrix has $(n^2+n)/2$ distinct unknowns (it is symmetric), and we have only $n$ data, so it is impossible to estimate $\Sigma$ without imposing constraints on its form.
Substituting an estimate $\hat{\Sigma}$ for $\Sigma$ in GLS is called feasible GLS or the Aitken method.
The resulting feasible GLS estimate is
$$ \hat{\beta}_{\mbox{FGLS}} \equiv (X'\hat{\Sigma}^{-1}X)^{-1}X' \hat{\Sigma}^{-1}Y, $$which has estimated covariance $$ \hat{\mbox{cov}}(\hat{\beta}_{\mbox{FGLS}} \mid X) = (X' \hat{\Sigma}^{-1} X)^{-1}. $$
Sometimes, this is a good estimator; sometimes it has terrible performance. Moreover, FGLS is a nonlinear estimator, and is usually biased.
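Here is a hedged sketch of feasible GLS (not from the original notes, and the data are simulated for illustration): constrain $\Sigma$ to an assumed AR(1) correlation structure, estimate its single parameter $\rho$ from the OLS residuals, and plug the result into the GLS formula.

```r
# Feasible GLS sketch under an assumed AR(1) error structure.
set.seed(3)
nf <- 50
Xf <- cbind(1, seq_len(nf))                        # simulated design: intercept and trend
epsf <- as.vector(arima.sim(list(ar = 0.6), nf))   # AR(1) noise with true rho = 0.6
Yf <- Xf %*% c(2, 0.5) + epsf
# step 1: OLS residuals
ef <- Yf - Xf %*% solve(t(Xf) %*% Xf, t(Xf) %*% Yf)
# step 2: estimate rho by the lag-1 autocorrelation of the residuals
rhohat <- sum(ef[-1] * ef[-nf]) / sum(ef^2)
# step 3: build the assumed correlation matrix and apply the GLS formula
Sigmahat <- rhohat^abs(outer(seq_len(nf), seq_len(nf), "-"))
betahat_FGLS <- solve(t(Xf) %*% solve(Sigmahat, Xf),
                      t(Xf) %*% solve(Sigmahat, Yf))
betahat_FGLS   # should be near the true (2, 0.5)
```

The one-parameter constraint is what makes estimating $\Sigma$ from $n$ observations feasible here; whether it helps in practice depends entirely on whether the assumed structure is close to the truth.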
Imagine a covariate that is binary or categorical, such as gender, or whether a subject in an experiment receives a treatment, or which treatment the subject receives.
Dummy variables are a way to incorporate such a covariate into a regression model. That is not always the right way to incorporate such covariates, especially in the context of an experiment, but it is extremely common in practice.
Suppose there is a covariate that can take any of $k$ levels. Augment the list of variables in the model by including $k-1$ new variables. The $i$th such variable will be equal to 1 if the case in question has the $i$th level of treatment and 0 if not, for $i=1, \ldots, k-1$. The $k$th level corresponds to having all $k-1$ of those variables equal to zero. If we used $k$ such variables, they would be linearly dependent: the $k$th variable would be 1 minus the sum of the first $k-1$ variables.
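R's `model.matrix()` constructs these indicator columns automatically from a factor. A minimal sketch (the variable name `treatment` and its levels are invented for illustration):

```r
# Dummy coding in R: a categorical covariate with k = 3 levels becomes
# k - 1 = 2 indicator columns when an intercept is present.
treatment <- factor(c("A", "B", "C", "A", "B", "C"))
model.matrix(~ treatment)       # intercept + 2 dummies; level "A" is the baseline
model.matrix(~ treatment - 1)   # with no intercept, all 3 indicators appear
```

Note that including all $k$ indicators together with an intercept would make the design matrix rank-deficient, which is why R drops one level as the baseline.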
[To do: discuss constancy assumptions, additive effects, independence of $\epsilon$ and $X$, equal variances, uncorrelated, etc.]
We will deliberately mis-analyze the MacNell et al. (2014) data using a dummy variable for the identified gender of the instructor and a dummy variable for the true gender of the instructor. Note that the assumptions of the regression model are not satisfied here. They are violated in many ways, including the lack of any justification for an underlying linear model. The permutation analysis in the chapter on inference is far preferable, in part because it matches the randomization that was used in the experiment. Regression modeling ignores that randomization.
However, regression can sometimes adjust for the effect of confounding variables. We will illustrate this by example—but the circumstances in which the approach gives reliable inferences are limited, and typically cannot be checked from the data themselves.
Here is an example of a regression model for student evaluations:
$$ Y = X \beta + \epsilon, $$where $Y_i$ is the rating assigned by student $i$; $X_{i1} = 1$ for all $i$ (to account for the fact that a typical rating is about 4); $X_{i2} = 1$ if student $i$ was assigned to an instructor identified as male, and $X_{i2} = 0$ if student $i$ was assigned to an instructor identified as female; $\epsilon$ and $X$ are independent; and the components of $\epsilon$ are uncorrelated, have zero mean, and have common variance $\sigma^2$.
In this model, the rating student $i$ assigns to an instructor is $\beta_1 + \epsilon_i$ if the instructor is apparently female and $\beta_1 + \beta_2 + \epsilon_i$ if the instructor is apparently male.
[Show that $\epsilon$ and $X$ are not independent for reasonable ticket models. Show ${\mathbb E} \epsilon \ne 0$ for reasonable ticket models. Etc. Note that both have non-interference built in.]
In finding $p$-values in the regression model, the randomness is in $\epsilon$, the (invented) randomness of individual students' responses. The $p$-values are with respect to different realizations of the individual "noise" terms. The randomness in assigning students to sections of the course is hidden in $X$, but the regression—and the $p$-value calculation—are conditional on $X$.
In the "ticket" model, the randomness comes from the random assignment of students to sections of the class, and the $p$-value calculations imagine different random assignments.
Both models assume that each student's response does not depend on other students' assignments to sections: the non-interference assumption. That assumption might not be accurate if, for instance, students study in groups.
The regression model assumes that $X$ and $\epsilon$ are independent. Suppose that the expected rating both for apparently male and apparently female instructors is 4 (corresponding to $\beta_1 = 4$ and $\beta_2 = 0$). Suppose that student $i$ would rate an apparently female instructor 5 and an apparently male instructor 3. Then $X_{i2}$ and $\epsilon_i$ are not independent: $\epsilon_i = -1$ if $X_{i2} = 1$ and $\epsilon_i = +1$ if $X_{i2} = 0$.
The regression model assumes that $\epsilon_i$ has mean zero for all $i$, so every student's expected rating of an apparently female instructor is the same, and every student's expected rating of an apparently male instructor is the same. There is no reason that should be true. Again, suppose that the average rating both for apparently male and for apparently female instructors is 4. Imagine that student $i$ will always rate an instructor 5, regardless of the instructor's apparent gender. Then $\epsilon_i = 1$ and hence ${\mathbb E} \epsilon_i = +1$, violating this assumption of the regression model.
In short, the assumptions of the regression model seem highly contrived.
# Use regression with a dummy variable for the instructor's identified gender
ratings <- read.csv("Data/Macnell-RatingsData.csv", as.is=T); # read a .csv file into a data frame
character <- setdiff(names(ratings),c("group","gender","tagender","taidgender","age"));
# for each characteristic, we will model the student's rating of the instructor as
# using regression. The model will include a dummy variable for the instructor's perceived gender.
#
# Then we will try a variety of adjustments--for the instructor's true gender and for the
# student's gender, as well as including an "intercept" or not
for (ch in character) {
reg_mod <- lm(unlist(ratings[ch]) ~ ratings$taidgender - 1) # omit the intercept
print(ch, quote=F)
print(summary(reg_mod))
}
[1] professional Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1) Residuals: Min 1Q Median 3Q Max -0.6087 0.3913 0.3913 4.0000 5.0000 Coefficients: Estimate Std. Error t value Pr(>|t|) ratings$taidgender 4.6087 0.6134 7.513 2.71e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.942 on 42 degrees of freedom Multiple R-squared: 0.5734, Adjusted R-squared: 0.5632 F-statistic: 56.45 on 1 and 42 DF, p-value: 2.706e-09 [1] respect Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1) Residuals: Min 1Q Median 3Q Max -0.6087 0.3913 0.3913 4.0000 5.0000 Coefficients: Estimate Std. Error t value Pr(>|t|) ratings$taidgender 4.6087 0.6134 7.513 2.71e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.942 on 42 degrees of freedom Multiple R-squared: 0.5734, Adjusted R-squared: 0.5632 F-statistic: 56.45 on 1 and 42 DF, p-value: 2.706e-09 [1] caring Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1) Residuals: Min 1Q Median 3Q Max -2.1739 -0.1739 0.8261 4.0000 5.0000 Coefficients: Estimate Std. Error t value Pr(>|t|) ratings$taidgender 4.1739 0.5668 7.364 4.4e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.718 on 42 degrees of freedom Multiple R-squared: 0.5636, Adjusted R-squared: 0.5532 F-statistic: 54.23 on 1 and 42 DF, p-value: 4.397e-09 [1] enthusiastic Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1) Residuals: Min 1Q Median 3Q Max -1.1739 -0.1739 0.8261 4.0000 5.0000 Coefficients: Estimate Std. Error t value Pr(>|t|) ratings$taidgender 4.1739 0.5566 7.499 2.84e-09 *** --- Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.67 on 42 degrees of freedom Multiple R-squared: 0.5724, Adjusted R-squared: 0.5622 F-statistic: 56.23 on 1 and 42 DF, p-value: 2.839e-09 [1] communicate Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1) Residuals: Min 1Q Median 3Q Max -1.2174 -0.2174 0.7826 4.0000 5.0000 Coefficients: Estimate Std. Error t value Pr(>|t|) ratings$taidgender 4.2174 0.5673 7.434 3.51e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.721 on 42 degrees of freedom Multiple R-squared: 0.5682, Adjusted R-squared: 0.5579 F-statistic: 55.26 on 1 and 42 DF, p-value: 3.505e-09 [1] helpful Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1) Residuals: Min 1Q Median 3Q Max -1.9565 0.0435 1.0435 3.5000 5.0000 Coefficients: Estimate Std. Error t value Pr(>|t|) ratings$taidgender 3.9565 0.5488 7.209 7.31e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.632 on 42 degrees of freedom Multiple R-squared: 0.5531, Adjusted R-squared: 0.5424 F-statistic: 51.97 on 1 and 42 DF, p-value: 7.309e-09 [1] feedback Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1) Residuals: Min 1Q Median 3Q Max -2.1739 -0.1739 0.8261 4.0000 5.0000 Coefficients: Estimate Std. Error t value Pr(>|t|) ratings$taidgender 4.1739 0.5803 7.193 7.72e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.783 on 42 degrees of freedom Multiple R-squared: 0.5519, Adjusted R-squared: 0.5413 F-statistic: 51.73 on 1 and 42 DF, p-value: 7.718e-09 [1] prompt Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1) Residuals: Min 1Q Median 3Q Max -1.3478 -0.3478 0.6522 4.0000 5.0000 Coefficients: Estimate Std. Error t value Pr(>|t|) ratings$taidgender 4.3478 0.5519 7.878 8.28e-10 *** --- Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.647 on 42 degrees of freedom Multiple R-squared: 0.5964, Adjusted R-squared: 0.5868 F-statistic: 62.07 on 1 and 42 DF, p-value: 8.284e-10 [1] consistent Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1) Residuals: Min 1Q Median 3Q Max -1.9565 0.0435 1.0435 4.0000 5.0000 Coefficients: Estimate Std. Error t value Pr(>|t|) ratings$taidgender 3.9565 0.5563 7.112 1.01e-08 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.668 on 42 degrees of freedom Multiple R-squared: 0.5463, Adjusted R-squared: 0.5355 F-statistic: 50.58 on 1 and 42 DF, p-value: 1.005e-08 [1] fair Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1) Residuals: Min 1Q Median 3Q Max -1.2609 -0.2609 0.7391 3.5000 5.0000 Coefficients: Estimate Std. Error t value Pr(>|t|) ratings$taidgender 4.2609 0.5388 7.908 7.52e-10 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.584 on 42 degrees of freedom Multiple R-squared: 0.5982, Adjusted R-squared: 0.5887 F-statistic: 62.54 on 1 and 42 DF, p-value: 7.523e-10 [1] responsive Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1) Residuals: Min 1Q Median 3Q Max -0.8696 0.1304 1.1304 4.0000 5.0000 Coefficients: Estimate Std. Error t value Pr(>|t|) ratings$taidgender 3.8696 0.5643 6.857 2.33e-08 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.706 on 42 degrees of freedom Multiple R-squared: 0.5282, Adjusted R-squared: 0.517 F-statistic: 47.02 on 1 and 42 DF, p-value: 2.326e-08 [1] praised Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1) Residuals: Min 1Q Median 3Q Max -1.5217 0.4783 0.4783 4.0000 5.0000 Coefficients: Estimate Std. Error t value Pr(>|t|) ratings$taidgender 4.5217 0.5834 7.751 1.25e-09 *** --- Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.798 on 42 degrees of freedom Multiple R-squared: 0.5886, Adjusted R-squared: 0.5788 F-statistic: 60.08 on 1 and 42 DF, p-value: 1.249e-09 [1] knowledgeable Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1) Residuals: Min 1Q Median 3Q Max -1.3043 -0.3043 0.6957 4.0000 5.0000 Coefficients: Estimate Std. Error t value Pr(>|t|) ratings$taidgender 4.3043 0.6035 7.132 9.42e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.894 on 42 degrees of freedom Multiple R-squared: 0.5477, Adjusted R-squared: 0.537 F-statistic: 50.86 on 1 and 42 DF, p-value: 9.421e-09 [1] clear Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1) Residuals: Min 1Q Median 3Q Max -1.913 0.087 1.087 4.000 5.000 Coefficients: Estimate Std. Error t value Pr(>|t|) ratings$taidgender 3.9130 0.5663 6.909 1.96e-08 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.716 on 42 degrees of freedom Multiple R-squared: 0.532, Adjusted R-squared: 0.5208 F-statistic: 47.74 on 1 and 42 DF, p-value: 1.959e-08 [1] overall Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1) Residuals: Min 1Q Median 3Q Max -1.1739 -0.1739 0.8261 4.0000 5.0000 Coefficients: Estimate Std. Error t value Pr(>|t|) ratings$taidgender 4.1739 0.5713 7.306 5.33e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.74 on 42 degrees of freedom Multiple R-squared: 0.5596, Adjusted R-squared: 0.5491 F-statistic: 53.37 on 1 and 42 DF, p-value: 5.327e-09
Note the highly significant results. This is not evidence that gender matters; rather, it is strong evidence that the average rating is not zero. With no intercept, setting the coefficient of taidgender to 0 amounts to asserting that the data are just iid noise with zero mean, which is a bogus assumption! The mean of the data is large (about 4), so we are bound to reject the null hypothesis in this case.
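A tiny simulation makes the artifact vivid (the data here are invented: ratings hovering around 4 with no group effect at all):

```r
# Omitting the intercept turns "is there a group effect?" into "is the mean 0?"
set.seed(2)
group <- rep(c(0, 1), each = 20)   # 0/1 dummy with no true effect on y
y <- 4 + rnorm(40)                 # everyone's ratings hover around 4
coef(summary(lm(y ~ group - 1)))   # "group" coefficient near 4, tiny p-value
coef(summary(lm(y ~ group)))       # with an intercept: coefficient near 0
```

Without the intercept, the fitted coefficient is just the mean rating of the `group == 1` students, so its "significance" only says that mean is far from zero.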
# include an intercept. Note how big the change is!
for (ch in character) {
reg_mod <- lm(unlist(ratings[ch]) ~ ratings$taidgender ) # include the intercept
print(ch, quote=F)
print(summary(reg_mod))
}
[1] professional Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender) Residuals: Min 1Q Median 3Q Max -3.0000 -0.6087 0.3913 0.3913 1.0000 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.0000 0.2303 17.371 <2e-16 *** ratings$taidgender 0.6087 0.3148 1.933 0.0601 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.03 on 41 degrees of freedom Multiple R-squared: 0.08355, Adjusted R-squared: 0.06119 F-statistic: 3.738 on 1 and 41 DF, p-value: 0.06012 [1] respect Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender) Residuals: Min 1Q Median 3Q Max -3.0000 -0.6087 0.3913 0.3913 1.0000 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.0000 0.2303 17.371 <2e-16 *** ratings$taidgender 0.6087 0.3148 1.933 0.0601 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.03 on 41 degrees of freedom Multiple R-squared: 0.08355, Adjusted R-squared: 0.06119 F-statistic: 3.738 on 1 and 41 DF, p-value: 0.06012 [1] caring Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender) Residuals: Min 1Q Median 3Q Max -2.6500 -0.4120 0.3500 0.8261 1.3500 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.6500 0.2313 15.783 <2e-16 *** ratings$taidgender 0.5239 0.3162 1.657 0.105 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.034 on 41 degrees of freedom Multiple R-squared: 0.06275, Adjusted R-squared: 0.0399 F-statistic: 2.745 on 1 and 41 DF, p-value: 0.1052 [1] enthusiastic Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender) Residuals: Min 1Q Median 3Q Max -2.6000 -0.1739 -0.1739 0.6130 1.4000 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.6000 0.2212 16.278 <2e-16 *** ratings$taidgender 0.5739 0.3024 1.898 0.0648 . --- Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.989 on 41 degrees of freedom Multiple R-squared: 0.08076, Adjusted R-squared: 0.05834 F-statistic: 3.602 on 1 and 41 DF, p-value: 0.06476 [1] communicate Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender) Residuals: Min 1Q Median 3Q Max -2.6500 -0.2174 0.3500 0.7826 1.3500 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.6500 0.2329 15.675 <2e-16 *** ratings$taidgender 0.5674 0.3184 1.782 0.0821 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.041 on 41 degrees of freedom Multiple R-squared: 0.07189, Adjusted R-squared: 0.04925 F-statistic: 3.176 on 1 and 41 DF, p-value: 0.08215 [1] helpful Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender) Residuals: Min 1Q Median 3Q Max -2.50000 -0.72826 0.04348 1.04348 1.50000 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.5000 0.2367 14.78 <2e-16 *** ratings$taidgender 0.4565 0.3237 1.41 0.166 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.059 on 41 degrees of freedom Multiple R-squared: 0.04627, Adjusted R-squared: 0.02301 F-statistic: 1.989 on 1 and 41 DF, p-value: 0.166 [1] feedback Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender) Residuals: Min 1Q Median 3Q Max -2.7000 -0.1739 0.3000 0.8261 1.3000 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.7000 0.2506 14.763 <2e-16 *** ratings$taidgender 0.4739 0.3427 1.383 0.174 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.121 on 41 degrees of freedom Multiple R-squared: 0.04457, Adjusted R-squared: 0.02127 F-statistic: 1.913 on 1 and 41 DF, p-value: 0.1742 [1] prompt Call: lm(formula = unlist(ratings[ch]) ~ ratings$taidgender) Residuals: Min 1Q Median 3Q Max -2.5500 -0.3478 0.4500 0.6522 1.4500 Coefficients: Estimate Std. 
Summary of the regressions of each rating on the TA's identified gender alone (41 residual degrees of freedom; the first summary is truncated at the start of this excerpt):

characteristic   intercept (SE)     taidgender (SE)    p(taidgender)   R-squared
(truncated)      3.5500 (0.2268)    0.7978 (0.3101)    0.0138          0.139
consistent       3.5000 (0.2565)    0.4565 (0.3507)    0.2003          0.0397
fair             3.5000 (0.2079)    0.7609 (0.2842)    0.0106          0.1488
responsive       3.6500 (0.2240)    0.2196 (0.3063)    0.4776          0.0124
praised          3.8500 (0.1984)    0.6717 (0.2713)    0.0175          0.1301
knowledgeable    3.9500 (0.2204)    0.3543 (0.3013)    0.2464          0.0326
clear            3.5000 (0.2812)    0.4130 (0.3844)    0.2889          0.0274
overall          3.7000 (0.2250)    0.4739 (0.3076)    0.1311          0.0547
# Allow for a real difference between the two instructors by including the
# TA's actual gender (really a proxy for the two instructors).
for (ch in character) {
    reg_mod <- lm(unlist(ratings[ch]) ~
                      ratings$taidgender +
                      ratings$tagender
                  - 1)  # omit the intercept
    print(ch, quote=F)
    print(summary(reg_mod))
}
# Adjust for the identified gender, the true gender, and the student's gender.
for (ch in character) {
    reg_mod <- lm(unlist(ratings[ch]) ~
                      ratings$taidgender +
                      ratings$tagender +
                      ratings$gender
                  )  # include the intercept
    print(ch, quote=F)
    print(summary(reg_mod))
}
Summary of the regressions of each rating on identified gender, actual TA gender, and student gender (the formulas printed in the output include `- 1`, i.e., no intercept; 40 residual degrees of freedom):

characteristic   taidgender (SE), p          tagender (SE), p           gender (SE), p              R-squared
professional     1.4981 (0.4110), 0.00076    0.5136 (0.4458), 0.256     1.9380 (0.2389), 5.59e-10   0.8995
respect          1.4981 (0.4110), 0.00076    0.5136 (0.4458), 0.256     1.9380 (0.2389), 5.59e-10   0.8995
caring           1.3523 (0.4055), 0.00185    0.3174 (0.4398), 0.475     1.8060 (0.2358), 2.29e-09   0.8828
enthusiastic     1.3734 (0.3896), 0.00108    0.1630 (0.4226), 0.702     1.8417 (0.2265), 5.28e-10   0.8900
communicate      1.3839 (0.4015), 0.00135    0.3496 (0.4355), 0.427     1.8037 (0.2335), 1.87e-09   0.8865
helpful          1.2257 (0.3855), 0.00284    0.5591 (0.4181), 0.189     1.6664 (0.2241), 4.67e-09   0.8843
feedback         1.2525 (0.4065), 0.00372    0.0999 (0.4409), 0.822     1.9439 (0.2364), 3.93e-10   0.8846
prompt           1.6434 (0.4115), 0.00027    0.2931 (0.4463), 0.515     1.7347 (0.2392), 8.41e-09   0.8823
consistent       1.1961 (0.4029), 0.00503    0.2495 (0.4370), 0.571     1.7866 (0.2342), 2.55e-09   0.8751
fair             1.4577 (0.3387), 0.00011    0.0728 (0.3674), 0.844     1.8727 (0.1969), 8.07e-12   0.9167
responsive       1.1068 (0.4254), 0.0129     0.0611 (0.4614), 0.895     1.8492 (0.2473), 4.1e-09    0.8593
praised          1.5017 (0.3722), 0.00024    0.1715 (0.4037), 0.673     1.9874 (0.2164), 2.11e-11   0.9121
knowledgeable    1.2543 (0.4133), 0.00422    0.3501 (0.4483), 0.439     1.9500 (0.2403), 5.54e-10   0.8887
clear            1.2051 (0.4503), 0.0107     -0.1845 (0.4884), 0.708    1.8915 (0.2618), 9.14e-09   0.8447
overall          1.2787 (0.3904), 0.00219    0.2239 (0.4235), 0.600     1.8861 (0.2270), 3.03e-10   0.8921
# Model using the "group" variable. Note how R deals with the fact that the
# level indicators are linearly dependent when there is an intercept: it
# drops one level and treats it as the baseline.
# Use as.factor() to get R to treat the variable as categorical, expanded
# into dummy variables.
for (ch in character) {
    reg_mod <- lm(unlist(ratings[ch]) ~ as.factor(ratings$group))  # include the constant term for now
    print(ch, quote=F)
    print(summary(reg_mod))
}
Summary of the regressions of each rating on group as a factor, with an intercept (group 3 is the baseline; the coefficients for groups 4-6 are differences from group 3; 39 residual degrees of freedom; * marks p < 0.05):

characteristic   intercept (SE)     group4      group5      group6      F p-value
professional     4.5833 (0.3040)    0.0530      -0.7083     -0.5000     0.302
respect          4.5833 (0.3040)    0.0530      -0.7083     -0.5000     0.302
caring           4.2500 (0.3056)    -0.1591     -0.6250     -0.5833     0.441
enthusiastic     4.2500 (0.2911)    -0.1591     -0.5000     -0.7500     0.288
communicate      4.3333 (0.3060)    -0.2424     -0.8333     -0.5833     0.319
helpful          4.0000 (0.3067)    -0.0909     -0.8750     -0.2500     0.313
feedback         4.4167 (0.3267)    -0.5076     -0.7917     -0.6667     0.390
prompt           4.3333 (0.2983)    0.0303      -0.5833     -0.9167*    0.093
consistent       3.9167 (0.3388)    0.0833      -0.2917     -0.5000     0.619
fair             4.3333 (0.2738)    -0.1515     -0.7083     -0.9167*    0.081
responsive       4.1667 (0.2880)    -0.6212     -0.4167     -0.5833     0.422
praised          4.5833 (0.2611)    -0.1288     -0.5833     -0.8333*    0.112
knowledgeable    4.2500 (0.2899)    0.1136      -0.1250     -0.4167     0.617
clear            4.4167 (0.3534)    -1.0530*    -0.9167     -0.9167     0.159
overall          4.3333 (0.2952)    -0.3333     -0.7083     -0.5833     0.406
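The linear dependence among the level indicators is why R drops one level when the model has an intercept. A minimal sketch with made-up data (not the ratings) shows the parameterization:

```r
# Toy data, not the ratings: six observations in three groups.
y <- c(1, 2, 3, 4, 5, 6)
g <- factor(c(3, 3, 4, 4, 5, 5))

# With an intercept, the three level dummies would sum to the constant
# column, so R drops the first level (group 3) as the baseline:
# (Intercept) is the group-3 mean; g4 and g5 are differences from it.
coef(lm(y ~ g))   # (Intercept) 1.5, g4 2.0, g5 4.0
```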
# Model using the "group" variable, again with as.factor() so R treats it
# as categorical, but omitting the intercept.
for (ch in character) {
    reg_mod <- lm(unlist(ratings[ch]) ~ as.factor(ratings$group) - 1)  # omit the intercept
    print(ch, quote=F)
    print(summary(reg_mod))
}
The same regressions without the intercept: each coefficient is now simply the mean rating for that group (39 residual degrees of freedom; all coefficients are highly significant, as expected for group means):

characteristic   group3    group4    group5    group6
professional     4.5833    4.6364    3.8750    4.0833
respect          4.5833    4.6364    3.8750    4.0833
caring           4.2500    4.0909    3.6250    3.6667
enthusiastic     4.2500    4.0909    3.7500    3.5000
communicate      4.3333    4.0909    3.5000    3.7500
helpful          4.0000    3.9091    3.1250    3.7500
feedback         4.4167    3.9091    3.6250    3.7500
prompt           4.3333    4.3636    3.7500    3.4167
consistent       3.9167    4.0000    3.6250    3.4167
fair             4.3333    4.1818    3.6250    3.4167
responsive       4.1667    3.5455    3.7500    3.5833
praised          4.5833    4.4545    4.0000    3.7500
knowledgeable    4.2500    4.3636    4.1250    3.8333
clear            4.4167    3.3636    3.5000    3.5000
overall          4.3333    4.0000    3.6250    3.7500
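That the no-intercept coefficients are exactly the group means can be checked directly. A sketch with toy data (hypothetical values, not the ratings):

```r
# Toy data, not the ratings: with the intercept omitted, the coefficient
# for each factor level is exactly that group's sample mean.
y <- c(4, 5, 3, 4, 2, 3)
g <- factor(c("A", "A", "B", "B", "C", "C"))
fit <- lm(y ~ g - 1)
all.equal(unname(coef(fit)), as.vector(tapply(y, g, mean)))   # TRUE
```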
# Model using the identified-gender variable, adjusting for the TA's actual
# gender, the student's gender, and an interaction between identified gender
# and student gender. Omit the intercept.
for (ch in character) {
    reg_mod <- lm(unlist(ratings[ch]) ~
                      ratings$taidgender +
                      ratings$tagender +
                      ratings$gender +
                      ratings$taidgender*ratings$gender
                  - 1)  # omit the intercept
    print(ch, quote=F)
    print(summary(reg_mod))
}
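The interaction term lets the identified-gender effect differ by student gender. A toy sketch with hypothetical 0/1 codes (not the ratings data) shows what the expansion estimates:

```r
# Hypothetical 0/1 codes, not the ratings data.
taid <- c(0, 0, 1, 1)
stud <- c(0, 1, 0, 1)
y    <- c(1, 2, 4, 3)   # taid raises y by 3 when stud == 0, by 1 when stud == 1

# taid * stud expands to taid + stud + taid:stud.
coef(lm(y ~ taid * stud))
# (Intercept) 1, taid 3, stud 1, taid:stud -2:
# the taid effect is 3 for stud == 0 and 3 + (-2) = 1 for stud == 1.
```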
The loop fits the same no-intercept model, `lm(unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + ratings$gender + ratings$taidgender*ratings$gender - 1)`, to each rating item. The table below summarizes the coefficient estimates (two-sided $p$-values in parentheses), the residual standard error (RSE, on 39 degrees of freedom in every fit), and the multiple R-squared. Every overall F-test (4 and 39 degrees of freedom) has $p$-value < 2.2e-16.

| item | taidgender | tagender | gender | taidgender:gender | RSE | R² |
|---|---|---|---|---|---|---|
| professional | 3.9249 (6.16e-05) | 0.3136 (0.449) | 2.1737 (1.28e-11) | -1.8126 (0.00388) | 1.330 | 0.9191 |
| respect | 3.9249 (6.16e-05) | 0.3136 (0.449) | 2.1737 (1.28e-11) | -1.8126 (0.00388) | 1.330 | 0.9191 |
| caring | 3.3344 (0.000613) | 0.1541 (0.716) | 1.9985 (2.21e-10) | -1.4805 (0.0189) | 1.361 | 0.8984 |
| enthusiastic | 3.3795 (0.000304) | -0.0024 (0.995) | 2.0366 (3.80e-11) | -1.4984 (0.0130) | 1.297 | 0.9063 |
| communicate | 3.4949 (0.000279) | 0.1756 (0.671) | 2.0087 (1.09e-10) | -1.5768 (0.0111) | 1.331 | 0.9040 |
| helpful | 3.4680 (0.000143) | 0.3743 (0.338) | 1.8842 (1.16e-10) | -1.6748 (0.00451) | 1.252 | 0.9062 |
| feedback | 3.1372 (0.00128) | -0.0555 (0.897) | 2.1270 (5.68e-11) | -1.4077 (0.0265) | 1.375 | 0.8985 |
| prompt | 4.2813 (1.25e-05) | 0.0757 (0.852) | 1.9908 (7.76e-11) | -1.9703 (0.00153) | 1.302 | 0.9093 |
| consistent | 3.0324 (0.00167) | 0.0981 (0.817) | 1.9650 (3.72e-10) | -1.3715 (0.0293) | 1.366 | 0.8896 |
| fair | 3.4816 (2.03e-05) | -0.0940 (0.782) | 2.0693 (1.93e-13) | -1.5117 (0.00343) | 1.093 | 0.9333 |
| responsive | 4.3317 (6.37e-06) | -0.2047 (0.603) | 2.1624 (3.62e-12) | -2.4088 (0.000113) | 1.264 | 0.9045 |
| praised | 3.9481 (8.07e-06) | -0.0301 (0.934) | 2.2250 (1.65e-13) | -1.8272 (0.00111) | 1.169 | 0.9333 |
| knowledgeable | 3.8393 (7.35e-05) | 0.1371 (0.738) | 2.2010 (7.06e-12) | -1.9308 (0.00206) | 1.318 | 0.9130 |
| clear | 3.5171 (0.000971) | -0.3750 (0.423) | 2.1161 (6.01e-10) | -1.7269 (0.0133) | 1.499 | 0.8676 |
| overall | 3.3638 (0.000304) | 0.0521 (0.897) | 2.0886 (1.70e-11) | -1.5574 (0.00975) | 1.291 | 0.9093 |
Notice that we can get an enormous range of $p$-values for the effect of perceived gender, depending on which other terms we choose to keep in the model. In particular, omitting the intercept tends to make many terms appear significant: the data have a large overall mean, and without an intercept the remaining variables must absorb that mean, inflating their apparent effects.
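The intercept effect is easy to demonstrate with a small simulation (hypothetical data, not the ratings data): when the response has a large mean and the model has no intercept, a predictor with a nonzero mean can proxy for the missing intercept and appear highly significant even though it is unrelated to the response.

```r
set.seed(12345)
n <- 100
x <- rnorm(n, mean = 2)   # predictor with a nonzero mean, unrelated to y
y <- 5 + rnorm(n)         # response with a large mean; x has no real effect

with_int <- summary(lm(y ~ x))      # usual fit, intercept included
no_int   <- summary(lm(y ~ x - 1))  # intercept omitted, as in the fits above

coef(with_int)["x", "Pr(>|t|)"]  # typically large: x is rightly "insignificant"
coef(no_int)["x", "Pr(>|t|)"]    # tiny: x soaks up the overall level of y
```

The no-intercept fit reports a minuscule $p$-value for `x` only because `x` is the sole term available to account for the overall level of `y`.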
Analyses like these are rather ad hoc, and they do not match the underlying experiment in the first place. Why should a student's rating of an instructor be a linear combination of these variables, with additive "noise"? Nothing in the design of the experiment makes that functional form plausible.
Moreover, if we perform a variety of analyses and then keep only the one we like best, the resulting $p$-values and other statistical summaries become meaningless. This practice is sometimes called "significance hunting" or "significance mining." Methods that account for the selection process (deciding which terms to keep in the model on the basis of the data) are an active area of research in Statistics.
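A small simulation (hypothetical data) shows why selection invalidates the usual $p$-values: if we fit several candidate specifications to pure noise and keep only the smallest $p$-value, the chance of "finding" significance at the 5% level is far above 5%.

```r
set.seed(12345)
n <- 50
reps <- 1000
min_p <- replicate(reps, {
  y  <- rnorm(n)                    # response is pure noise: no real effects
  xs <- matrix(rnorm(n * 5), n, 5)  # five unrelated candidate predictors
  # fit each one-predictor model and keep only the smallest p-value
  min(sapply(1:5, function(j) {
    summary(lm(y ~ xs[, j]))$coefficients[2, "Pr(>|t|)"]
  }))
})
mean(min_p < 0.05)  # close to 1 - 0.95^5, about 0.23, not the nominal 0.05
```

With five independent candidate predictors, the chance that at least one attains $p < 0.05$ by luck alone is roughly $1 - 0.95^5 \approx 0.23$, and the simulation reflects that.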
The directory /Data contains a .csv file called "ModuloRimbSpesePrestazOccasGratuiteRisoluz.49E2013_NonRes_Ingl.csv", downloaded from http://nats.sct.gob.mx/english/go-to-tables/table-8-domestic-passenger-travel/table-8-1-domestic-passenger-travel-by-mode/
The data are annual reported passenger-miles in the U.S.A. from 1990 through 2012 for a variety of modes of transportation. The data form a time series: a sequence of observations of the same variables at different times.
Please provide your code and your results for the following. All data editing (removing variables, transposing tables, etc.) should be done with R commands, not with point-and-click tools. Do not use Excel for any part of this assignment.
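As an illustration of the kind of R-only editing the assignment calls for, here is a sketch using a toy stand-in for the downloaded file (the column layout and names here are assumptions; the real file will need its own adjustments, e.g. `skip=` or `check.names=` in `read.csv`):

```r
# toy stand-in for the real .csv: one row per mode, one column per year
tmp <- tempfile(fileext = ".csv")
writeLines(c("Mode,Y1990,Y1991,Y1992",
             "Air,340,355,370",
             "Rail,13,13,14"), tmp)

dat <- read.csv(tmp, stringsAsFactors = FALSE)
dat_t <- as.data.frame(t(dat[, -1]))  # transpose so that rows are years
colnames(dat_t) <- dat$Mode           # use the mode names as column headers
dat_t$year <- 1990:1992               # recover the years from the row order
air <- ts(dat_t$Air, start = 1990)    # a time series of (toy) passenger-miles
```

The same steps (drop unwanted columns, transpose, attach the years, convert to `ts`) apply to the actual file once it is read in.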
[TO DO]