You may discuss homework problems with other students, but you have to prepare the written assignments yourself. Late homework will be penalized 10% per day.

Please combine all your answers, the computer code and the figures into one file, and submit a copy in your dropbox on coursework.

Due date: February 17, 2015, 11:59 PM.

# Question 1

Power is an important quantity in many applications of statistics. This question investigates the power of a test in simple linear regression. In a simple linear regression setting, suppose the true slope of the regression line is $\beta_1$ and the true intercept is $\beta_0$. If we assume $\sigma$ is known, then we can test $H_0: \beta_0 + 66 \beta_1 =66$ using $$Z = \frac{\hat{\beta}_0 + 66 \hat{\beta}_1 - 66}{\sigma_{\hat{\beta}_0 + 66 \hat{\beta}_1}}$$ where $\sigma_{\hat{\beta}_0 + 66 \hat{\beta}_1}$ is the standard deviation of $\hat{\beta}_0 + 66 \hat{\beta}_1$.

The power of this test is a function of the true value $(\beta_0 + 66 {\beta}_1)$ as well as the accuracy of our estimate, as measured by $\sigma_{\hat{\beta}_0 + 66 \hat{\beta}_1}$, and is defined as $$P(H_0 \text{ is rejected}).$$

As we change the true $\beta_0 + 66 \beta_1$, the probability we reject $H_0$ changes: if the true value of $\beta_0 + 66 \beta_1$ is much larger than 66 relative to $\sigma_{\hat{\beta}_0 + 66 \hat{\beta}_1}$, then we are very likely to reject $H_0$.
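To make the statistic concrete, the following sketch computes $Z$ for a single simulated data set. It is an illustration only: the sample size, design points, coefficients and $\sigma$ are hypothetical values, not part of the assignment. The one fact it relies on is that, with $\sigma$ known, the standard deviation of $\hat{\beta}_0 + 66 \hat{\beta}_1$ is $\sigma \sqrt{1/n + (66 - \bar{x})^2 / \sum_i (x_i - \bar{x})^2}$.

```r
# Hypothetical simulation settings (assumptions, not from the assignment)
set.seed(1)
n     <- 50
sigma <- 2                      # noise standard deviation, assumed known
beta0 <- 1                      # hypothetical true intercept
beta1 <- 1                      # hypothetical true slope
x     <- runif(n, 55, 75)       # hypothetical design points
y     <- beta0 + beta1 * x + rnorm(n, sd = sigma)

fit <- lm(y ~ x)

# Point estimate of beta0 + 66 * beta1
estimate <- sum(coef(fit) * c(1, 66))

# Standard deviation of that estimate when sigma is known
sd_estimate <- sigma * sqrt(1 / n + (66 - mean(x))^2 / sum((x - mean(x))^2))

# Test statistic for H0: beta0 + 66 * beta1 = 66
Z <- (estimate - 66) / sd_estimate
Z
```

Here the true value of $\beta_0 + 66 \beta_1$ is 67, so $H_0$ is false by construction; rerunning with other seeds shows how $Z$ varies from sample to sample.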

1. What rule would you use to determine whether or not you reject $H_0$ at level $\alpha=0.1$?

2. What is the distribution of our test statistic $Z$? Show that the distribution depends on the values $\beta_0 + 66 \beta_1, \sigma_{\hat{\beta}_0 + 66 \hat{\beta}_1}$ and the 66 in our null $H_0$ via the quantity $(\beta_0 + 66 \beta_1 - 66) / \sigma_{\hat{\beta}_0 + 66 \hat{\beta}_1}$. We call this quantity the non-centrality parameter.

3. Plot the power of your test as a function of the non-centrality parameter.

4. Roughly how large does the non-centrality parameter have to be in order to achieve power of 70%?

# Question 2 (ALSM, 6.18)

A researcher in a scientific foundation wished to evaluate the relation between intermediate and senior level annual salaries of bachelor’s and master’s level mathematicians (Y, in thousand dollars) and an index of work quality (X1), number of years of experience (X2), and an index of publication success (X3). The data for a sample of 24 bachelor’s and master’s level mathematicians can be found at http://www.stanford.edu/class/stats191/data/math-salaries.table.

1. Obtain the scatter plot matrix and the correlation matrix of the table. Summarize the results.

2. Fit a linear regression model for salary based on X1, X2, X3. Report the fitted regression function.

3. Test the overall goodness of fit for the regression model at level $\alpha = 0.05$. Specify the null and alternative hypotheses, as well as the test used.

4. Give Bonferroni-corrected simultaneous 95% confidence intervals for $\beta_1, \beta_2, \beta_3$.

5. What is the $R^2$ of the model? How is the $R^2$ interpreted? What is the adjusted $R^2$?

6. The researcher wishes to find confidence interval estimates at certain levels of the X variables found in http://stats191.stanford.edu/data/salary_levels.table. Construct Bonferroni-corrected simultaneous 90% confidence intervals at each of the columns of the above table.
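As a reminder of the mechanics of a Bonferroni correction, here is a minimal sketch on R's built-in `mtcars` data rather than the salary data. The model, the number of intervals and the new design points are all assumptions made for illustration. With $m$ simultaneous intervals and overall error rate $\alpha$, each individual interval is computed at level $1 - \alpha / m$.

```r
# Illustration on the built-in mtcars data, not the salary data
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

alpha <- 0.05   # overall (family-wise) error rate
m     <- 3      # number of intervals reported simultaneously

# Each coefficient interval is computed at level 1 - alpha / m
ci_bonf <- confint(fit, parm = c("wt", "hp", "disp"), level = 1 - alpha / m)
ci_bonf

# The same idea for mean responses at several (hypothetical) design points
new_points <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150), disp = c(150, 300))
k <- nrow(new_points)
ci_mean <- predict(fit, newdata = new_points,
                   interval = "confidence", level = 1 - 0.10 / k)
ci_mean
```

The corrected intervals are wider than the unadjusted ones, which is the price paid for simultaneous coverage.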

# Question 3

The dataset state.x77 in R contains the following statistics (among others) related to the 50 states of the United States of America:

• Population: population estimate (1975)

• Income: per capita income (1974)

• Illiteracy: illiteracy (1970, percent of population)

• HS.Grad: percent high school graduates (1970)

%%R
state.data = data.frame(state.x77)


We are interested in the relation between Income and the other three variables.

1. Produce a 4 by 4 scatter plot matrix of the variables above.

2. Fit a multiple linear regression model to the data with Income as the dependent variable, and Population, Illiteracy, HS.Grad as the independent variables. Comment on the significance of the variables in the model using the result of summary.

3. Produce standard diagnostic plots of the multiple regression fit in part 2.

4. Plot dffits of the observations and find observations which have high influence, using critical value 0.5.

5. Plot Cook's distance of the observations and find observations which have high influence, using critical value 0.1. Compare with the result of part 4.

6. Find states with outlying predictors by looking at the leverage values. Use critical value 0.3.

7. Find outliers, if any, in the response. Remove them from the data and refit a multiple linear regression model and compare the result with the previous fit.

8. As a summary, find all the influential states using the influence.measures function.
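The diagnostics named above are all available in base R. The sketch below shows how they are called, using the built-in `mtcars` data and a hypothetical model instead of the state data; the cutoffs are placeholders chosen for illustration.

```r
# Illustration on the built-in mtcars data, not state.x77
fit <- lm(mpg ~ wt + hp, data = mtcars)

dff   <- dffits(fit)           # DFFITS for each observation
cook  <- cooks.distance(fit)   # Cook's distance
lev   <- hatvalues(fit)        # leverage (diagonal of the hat matrix)
rstud <- rstudent(fit)         # externally studentized residuals

# Hypothetical cutoffs, for illustration only
cut_dffits <- 0.5
cut_cook   <- 0.1

# Plot DFFITS with the cutoff marked, then list flagged observations
plot(dff, type = "h", ylab = "DFFITS")
abline(h = c(-1, 1) * cut_dffits, lty = 2)
names(dff)[abs(dff) > cut_dffits]
names(cook)[cook > cut_cook]

# influence.measures() collects several diagnostics and flags cases itself
infl <- influence.measures(fit)
summary(infl)
```

Note that `influence.measures` applies its own built-in thresholds, so the observations it flags need not coincide with those found using hand-picked cutoffs.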

# Question 4

The dataset iris in R gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris.

%%R
data(iris)

1. Fit a multiple linear regression model to the data with sepal length as the dependent variable and sepal width, petal length and petal width as the independent variables.

2. Test the reduced model of $H_0: \beta_{\tt sepal width}=\beta_{\tt petal length} = 0$ with an F-test at level $\alpha=0.05$.

3. Test $H_0: \beta_{\tt sepal width} = \beta_{\tt petal length}$ at level $\alpha=0.05$.

4. Test $H_0: \beta_{\tt sepal width} < \beta_{\tt petal length}$ at level $\alpha=0.05$.
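For reference, here is one way such hypotheses can be set up in R, sketched on the built-in `mtcars` data with a hypothetical model rather than on `iris`. Nested models are compared with `anova`, and a hypothesis about a linear combination of coefficients is tested with a $t$-statistic built from `coef` and `vcov`. The variable names and the contrast vector `a` are assumptions made for this illustration.

```r
# Illustration on the built-in mtcars data, not iris
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- lm(mpg ~ qsec, data = mtcars)     # drops wt and hp

# F-test of H0: beta_wt = beta_hp = 0
anova(reduced, full)

# t-test of H0: beta_wt = beta_hp via the contrast beta_wt - beta_hp
a      <- c(0, 1, -1, 0)                     # order: intercept, wt, hp, qsec
est    <- sum(a * coef(full))
se     <- sqrt(drop(t(a) %*% vcov(full) %*% a))
t_stat <- est / se
df     <- df.residual(full)

p_two_sided <- 2 * pt(-abs(t_stat), df = df)
# Upper-tail p-value, appropriate when the alternative is beta_wt > beta_hp
p_upper <- pt(t_stat, df = df, lower.tail = FALSE)
c(t = t_stat, two_sided = p_two_sided, upper = p_upper)
```

The two-sided test is equivalent to an F-test comparing `full` with the constrained model `lm(mpg ~ I(wt + hp) + qsec)`; the one-sided version needs the $t$-statistic because the F-test ignores the sign of the difference.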

# Question 5 (ALSM 19.14)

A research laboratory was developing a new compound for the relief of severe cases of hay fever. In an experiment with 36 volunteers, the amounts of the two active ingredients (factors A and B) in the compound were varied at three levels each. Randomization was used in assigning four volunteers to each of the nine treatments. The data can be found at http://stats191.stanford.edu/data/hayfever.table.

1. Fit the two-way ANOVA model, including interactions. What is the estimated mean when Factor A is 2 and Factor B is 1?

2. Using R’s standard regression plots, produce the Q-Q plot of the residuals. Is there any serious violation of normality?

3. This question asks you to graphically summarize the data. Create a plot with Factor A on the x-axis, and, using 3 different plotting symbols, the mean for each level of Factor B above each level of Factor A (see kidney data example). Do there appear to be any interactions?

4. Test for an interaction at level $\alpha = 0.01$.

5. Test for main effects of Factors A and B at level $\alpha = 0.01$.
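As a reminder of the relevant R functions, the sketch below fits a two-way ANOVA with interaction to the built-in `warpbreaks` data (factors `wool` and `tension`), not to the hay fever data. The choice of data set and of the cell being predicted are assumptions made for illustration.

```r
# Illustration on the built-in warpbreaks data, not the hay fever data
fit <- lm(breaks ~ wool * tension, data = warpbreaks)

# Estimated mean for one cell (here wool = "A", tension = "L")
cell <- data.frame(
  wool    = factor("A", levels = levels(warpbreaks$wool)),
  tension = factor("L", levels = levels(warpbreaks$tension))
)
cell_mean <- predict(fit, newdata = cell)
cell_mean

# Cell means by tension, one trace per level of wool
with(warpbreaks,
     interaction.plot(tension, wool, breaks, type = "b", pch = 1:2))

# Sequential ANOVA table: main effects, then the interaction
anova(fit)
```

With the interaction included, the fitted value for a cell equals the sample mean of that cell. Because `warpbreaks` is balanced, the sequential main-effect tests in the `anova` table do not depend on the order of the factors; roughly parallel traces in the plot suggest little interaction.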