You may discuss homework problems with other students, but you have to prepare the written assignments yourself. Late homework will be penalized 10% per day.

Please combine all your answers, the computer code and the figures into one file, and submit a copy to your dropbox folder.

Grading scheme: 10 points per question, total of 30.

Due date: 11:59 PM January 21, 2014 (Tuesday evening).

Question 1

On Groundhog Day, February 2, a famous groundhog in Punxsutawney, PA (Punxsutawney Phil) is used to predict whether winter will be long, based on whether or not he sees his shadow. I collected data on whether he saw his shadow or not from here. I stored some of this data in this table.

Although Phil is on the East Coast, I wondered if the information says anything about whether or not we will experience a rainy winter out here in California. For this, I found rainfall data, and saved it in a table. To see how this was extracted see this notebook.

  1. Make a boxplot of the average rainfall in Northern California comparing the years Phil sees his shadow versus the years he does not.

  2. Construct a 90% confidence interval for the difference between the mean rainfall in years Phil sees his shadow and years he does not.

  3. Interpret the interval in part 2.

  4. At level $\alpha = 0.05$, would you reject the null hypothesis that the average rainfall in Northern California during the month of February was the same in years Phil sees his shadow versus years he does not?

  5. What assumptions are you making in forming your confidence interval and in your hypothesis test?
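The mechanics of parts 1, 2 and 4 can be sketched on a stand-in data set. The example below uses R's built-in mtcars table (mpg split by transmission type am), which has nothing to do with the groundhog or rainfall data; the choice var.equal = TRUE (the pooled two-sample t procedure) is an assumption made here for illustration, not something the question prescribes.

```r
# Illustration only: built-in mtcars data, NOT the groundhog/rainfall tables.
# Response: mpg; two groups: am (0 = automatic, 1 = manual).
boxplot(mpg ~ am, data = mtcars,
        names = c("automatic", "manual"), ylab = "mpg")

# 90% confidence interval for the difference in group means.
# var.equal = TRUE assumes both groups share a common variance (pooled t).
fit <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE, conf.level = 0.90)
ci <- fit$conf.int
print(ci)

# Two-sided test at level 0.05: reject the null when the p-value is below 0.05.
reject <- fit$p.value < 0.05
print(reject)
```

Note that the 90% interval and the level 0.05 test use different confidence levels, so the interval excluding zero is not equivalent to rejecting at level 0.05.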

Question 2

The data set on supervisor performance contains an overall measure of supervisor performance (Y), a measure of how often supervisors give raises based on performance (X4), and 5 other measures.

  1. Create a boxplot of the supervisor rating Y, splitting the data based on the median of X4.

  2. Compute the sample mean and sample standard deviation of Y in each of the two groups.

  3. Create a histogram of Y within each group.

  4. Compute a 90% confidence interval for the difference in supervisor performance between the two groups. What assumptions are you making?

  5. At level $\alpha=5\%$, test the null hypothesis that the average supervisor performance does not differ between the two groups. What assumptions are you making? What can you conclude?

  6. Repeat the test in 5. using the function lm.
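The link between the pooled two-sample t-test and lm in part 6 can be sketched on a stand-in data set. The example below again uses the built-in mtcars table rather than the supervisor data, splitting mpg at the median of wt; putting values equal to the median in the lower group is an arbitrary choice made here, and the names high, tt and fit are illustrative.

```r
# Illustration only: built-in mtcars data, NOT the supervisor data.
y <- mtcars$mpg
high <- mtcars$wt > median(mtcars$wt)   # TRUE = strictly above the median

# Group summaries (FALSE = at or below the median, TRUE = above)
group_mean <- tapply(y, high, mean)
group_sd <- tapply(y, high, sd)
print(rbind(mean = group_mean, sd = group_sd))

# Pooled (equal-variance) two-sample t-test
tt <- t.test(y[high], y[!high], var.equal = TRUE)

# Same comparison as a regression on the group indicator
fit <- lm(y ~ high)
coefs <- summary(fit)$coefficients
lm_t <- coefs["highTRUE", "t value"]
lm_p <- coefs["highTRUE", "Pr(>|t|)"]

print(c(t_test = unname(tt$statistic), lm = lm_t))
print(c(t_test = tt$p.value, lm = lm_p))
```

The two procedures agree only when the t-test uses var.equal = TRUE; the default Welch test in t.test gives a slightly different statistic and degrees of freedom.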

Question 3 (RABE)

  1. Use the anscombe data in R. Attach the table using the command attach.

  2. Plot the 4 data sets (x1,y1), (x2,y2), (x3,y3), (x4,y4) on a 2-by-2 grid of plots using the commands plot and par(mfrow=c(2,2)). Use the number of the data set as the main title of each plot.

  3. Fit a regression model to the data sets:

    a. y1 ~ x1

    b. y2 ~ x2

    c. y3 ~ x3

    d. y4 ~ x4

    using the command lm. Verify that all the fitted models have approximately the same coefficients (they agree to about two decimal places).

  4. Using the command cor, compute the sample correlation for each data set.

  5. Fit the same models as in part 3, but with x and y reversed. Using the command summary, determine whether anything about the results stays the same when you reverse x and y.

  6. Compute the $SSE$, $SST$ and $R^2$ values for each data set, using the commands mean, sum and predict.

  7. Using the command summary, verify that all 4 models have approximately the same $t$-statistics (they agree to about two decimal places) for testing the hypotheses $H_0:\beta_0=0$ and $H_0:\beta_1=0$.

  8. Using the command abline, replot the data, adding the regression line to each plot.

In [2]:
%%R
anscombe
   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89
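For part 6 of Question 3, the quantities are $SSE = \sum_i (y_i - \hat{y}_i)^2$, $SST = \sum_i (y_i - \bar{y})^2$ and $R^2 = 1 - SSE/SST$. A minimal sketch for the first data set only, using the commands named in the question; the variable names fit1, yhat, SSE, SST and R2 are illustrative choices:

```r
# Sketch for the first Anscombe data set only; the other three follow the same pattern.
fit1 <- lm(y1 ~ x1, data = anscombe)

yhat <- predict(fit1)                             # fitted values
SSE <- sum((anscombe$y1 - yhat)^2)                # error (residual) sum of squares
SST <- sum((anscombe$y1 - mean(anscombe$y1))^2)   # total sum of squares
R2 <- 1 - SSE / SST

print(c(SSE = SSE, SST = SST, R2 = R2))
```

As a check, R2 should match summary(fit1)$r.squared and, for a simple linear regression, the square of cor(anscombe$x1, anscombe$y1).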