You may discuss homework problems with other students, but you have to prepare the written assignments yourself. Late homework will be penalized 10% per day.

Please combine all your answers, the computer code and the figures into one file, and submit a copy in your dropbox on coursework.

Grading scheme: 10 points per question, total of 30.

Due date: March 4, 2014.

Question 1 (RABE, 3.14)

The data set http://stats191.stanford.edu/data/asthma.table contains data on the number of admittances, Y, to an emergency room for asthma-related problems in a hospital over several days. On each day, researchers also recorded the daily high temperature T, and the level of some atmospheric pollutants P.

  1. Fit a linear regression model to the observed counts as a linear function of T and P.

  2. Looking at the usual diagnostics plots, does the constant variance assumption seem justified?

  3. The outcomes are counts, for which a common model is the so-called Poisson model which says that $\text{Var}(Y) = E(Y)$. In words, this says that the variance of the outcome is equal to the expected value of the outcome. Using a two-stage procedure, fit a weighted least squares regression to Y as a function of T and P with weights being inversely proportional to the fitted values of the initial model in 1.

  4. Looking at the usual diagnostics plots of this model (which takes the weights into account), does the constant variance assumption seem more reasonable?

  5. Using the weighted least squares fit, test the hypotheses at level $\alpha = 0.05$ that

    • the number of asthma cases is uncorrelated with the temperature, allowing for pollutants;
    • the number of asthma cases is uncorrelated with the atmospheric pollutants, allowing for temperature.

    Use a Bonferroni correction.
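
The two-stage procedure in parts 3 to 5 can be sketched roughly as follows. This is only an illustration on simulated counts, not the asthma data: the data frame `sim`, its coefficients, and the names `ols_fit`, `wls_fit` and `w` are all made up for the example, and it assumes every first-stage fitted value is positive so that the weights make sense.

```r
# Simulated stand-in data (NOT the asthma data): Poisson counts, so Var(Y) = E(Y).
set.seed(1)
n = 200
sim = data.frame(T = runif(n, 50, 100), P = runif(n, 0, 10))
sim$Y = rpois(n, lambda = 2 + 0.1 * sim$T + 0.5 * sim$P)

# Stage 1: ordinary least squares.
ols_fit = lm(Y ~ T + P, data = sim)

# Stage 2: weights inversely proportional to the stage-1 fitted values.
# Assumption: all fitted values are positive; otherwise the weights are invalid.
stopifnot(all(fitted(ols_fit) > 0))
sim$w = 1 / fitted(ols_fit)
wls_fit = lm(Y ~ T + P, data = sim, weights = w)

# Standardized residuals from the weighted fit take the weights into account.
std_resid = rstandard(wls_fit)
# plot(fitted(wls_fit), std_resid)   # uncomment to draw the diagnostic plot

# Bonferroni: two tests at overall level 0.05, so compare each p-value to 0.05 / 2.
pvals = coef(summary(wls_fit))[c("T", "P"), 4]
reject = pvals < 0.05 / 2
```

Dividing the level by the number of tests is what keeps the overall chance of a false rejection at or below 0.05.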

Question 2 (Based on RABE 8.4-8.6)

The file

http://www.ilr.cornell.edu/~hadi/RABE4/Data4/P217.txt

contains the values of the daily DJIA (Dow Jones Industrial Average) for all the trading days in 1996. The variable Time denotes the trading day of the year. There were 262 trading days in 1996.

  1. Fit a linear regression model connecting DJIA with Time using all 262 trading days in 1996. Is the linear trend model adequate? Examine the residuals for time dependencies, including a plot of the autocorrelation function.

  2. Regress DJIA[t] against its one-day lag DJIA[t-1]. Is this an adequate model? Is there any evidence of autocorrelation in the residuals?

  3. The variability (volatility) of the daily DJIA is large, and to accommodate this phenomenon the analysis is carried out on the logarithm of the DJIA. Repeat 2. above using log(DJIA) instead of DJIA.

  4. A simplified version of the random walk model of stock prices states that the best prediction of the stock index at Time=t is the value of the index at Time=t − 1. Show that this corresponds to a simple linear regression model for 2. with an intercept of 0 and a slope of 1.

  5. Carry out the appropriate tests of significance at level α = 0.05 for 4. Test each coefficient separately ($t$-tests), then test both simultaneously (i.e. an $F$ test).

  6. The random walk theory implies that the first differences of the index (the difference between successive values) should be independently normally distributed with mean zero and constant variance. What kind of plot can be used to visually assess this hypothesis? Provide the plot.
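
The mechanics behind parts 2 to 6 might look something like the sketch below. It runs on a simulated random walk rather than the DJIA file, and the names `log_index`, `lagged` and `lag_fit` are made up for the example. The joint test is computed by hand from the two residual sums of squares, which is one of several ways to get an $F$ statistic for the hypothesis that the intercept is 0 and the slope is 1.

```r
# Simulated stand-in series (NOT the DJIA data): a random walk on the log scale.
set.seed(2)
n = 262
log_index = log(5000) + cumsum(rnorm(n, mean = 0, sd = 0.01))

# Regress the series on its one-step lag.
lagged = data.frame(current = log_index[-1], previous = log_index[-n])
lag_fit = lm(current ~ previous, data = lagged)

# Autocorrelation function of the residuals (drop plot = FALSE to draw it).
resid_acf = acf(resid(lag_fit), plot = FALSE)

# Joint F test of intercept = 0 and slope = 1.
# Under the null there are no free parameters, so the residuals are current - previous.
rss_null = sum((lagged$current - lagged$previous)^2)
rss_full = sum(resid(lag_fit)^2)
df_full = df.residual(lag_fit)
F_stat = ((rss_null - rss_full) / 2) / (rss_full / df_full)
p_joint = pf(F_stat, 2, df_full, lower.tail = FALSE)

# First differences, and the coordinates of a normal quantile-quantile plot
# (drop plot.it = FALSE to draw it, then add qqline(first_diff)).
first_diff = diff(log_index)
qq = qqnorm(first_diff, plot.it = FALSE)
```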

Question 3

In this question we will look at inference ($t$-tests and confidence intervals) after model selection. We will use the prostate data from ElemStatLearn.

In [1]:
%%R
library(ElemStatLearn)
data(prostate)

We will perform 1 step of forward stepwise model selection when the outcome is generated completely independently from the design matrix. Following this, we will look at the corresponding $t$-statistics and confidence intervals.

I've written the following function:

In [2]:
%%R
select_first = function(dataset) {
    dataset$Y = rnorm(nrow(dataset))
    first_model = step(lm(Y~1,data=dataset), list(upper=~lcavol + 
                                                  lweight + age + 
                                                  lbph + svi + 
                                                  lcp + pgg45),
                      trace=FALSE, k=0, steps=1)
    return(list(interval=confint(first_model)[2,], pvalue=summary(first_model)$coef[2,4]))
}
select_first(prostate)
$interval
      2.5 %      97.5 % 
-0.01012249  1.01142252 

$pvalue
[1] 0.05462121


  1. I have written a function above that you can use for this exercise. By reading help(step) if necessary, describe what the function select_first does.

  2. Collect a sample of 1000 realizations of pvalue by running select_first 1000 times and storing only the $pvalue entry. Plot a histogram of these 1000 $p$-values. What do you expect it to look like? What does it look like? Why?

  3. What are the true regression coefficients in the above simulation? What value do you expect each of the above interval realizations to cover? Collect 1000 realizations of interval. How many do you expect to cover the true value? How many actually do cover the true value?

  4. Repeat 2. and 3. but modify select_first to add a strong effect for the variable lcavol.
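
One possible way to collect repeated realizations for parts 2 and 3 is with `replicate`. To keep the sketch self-contained it does not call select_first; instead it uses a made-up stand-in, `no_selection`, that returns a list of the same shape but always regresses on one fixed simulated predictor, so no selection takes place. That gives a baseline to compare your own results against.

```r
# Hypothetical stand-in for select_first: one fixed predictor, no selection step.
set.seed(3)
x = rnorm(100)
no_selection = function() {
    y = rnorm(length(x))    # outcome independent of x, so the true slope is 0
    fit = lm(y ~ x)
    list(interval = confint(fit)[2,], pvalue = coef(summary(fit))[2, 4])
}

# Run the function 1000 times, keeping each returned list.
results = replicate(1000, no_selection(), simplify = FALSE)
pvalues = sapply(results, function(r) r$pvalue)
intervals = t(sapply(results, function(r) r$interval))   # 1000 x 2 matrix

# hist(pvalues)   # uncomment to draw the histogram

# Fraction of intervals that cover the true slope, which is 0 here.
covered = intervals[, 1] <= 0 & 0 <= intervals[, 2]
coverage = mean(covered)
```

Swapping `no_selection()` for `select_first(prostate)` reuses the same collection code, since both return a list with `interval` and `pvalue` entries.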
