You may discuss homework problems with other students, but you have to prepare the written assignments yourself. Late homework will be penalized 10% per day.

Please combine all your answers, the computer code and the figures into one file, and submit a copy in your dropbox on coursework.

Due date: January 30, 2015, 11:59PM.

Grading scheme: 10 points per question, total of 40.

# Question 1

A criminologist studying the relationship between income level and assaults in U.S. cities (among other things) collected the following data for 2215 communities. The dataset can be found on the UCI Machine Learning Repository site.

We are interested in the per capita assault rate and its relation to median income.

In [1]:
%%R
median_income = crime.data[,18]
assault_rate = crime.data[,137]

1. Fit a simple linear regression model to the data with log(pmax(assault_rate,1)) as the dependent variable and log(median_income) as the independent variable. Plot the estimated regression line. (I am suggesting using the maximum of assault_rate and 1 because there are some communities where assault_rate is 0).

2. Add upper and lower 95% prediction bands for the regression line on the plot, using predict. That is, produce one line for the upper limit of each interval over a sequence of median incomes, and one line for the lower limits of the intervals. Interpret these bands at a median_income of 30000.

3. Add upper and lower 95% confidence bands for the regression line on the plot, using predict. That is, produce one line for the upper limit of each interval over a sequence of median incomes, and one line for the lower limits of the intervals. Interpret these bands at a median_income of 30000.

4. Test whether there is a linear relationship between assault_rate and median_income at level $\alpha=0.05$. State the null hypothesis, the alternative, the conclusion and the $p$-value.

5. Give a 95% confidence interval for the slope of the regression line. Interpret your interval.

6. Report the $R^2$ and the adjusted $R^2$ of the model, as well as an estimate of the variance of the errors in the model.
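As a rough illustration of the mechanics behind the parts above, the sketch below fits a log-log regression and draws both kinds of bands with predict. It runs on simulated stand-in data, not the UCI data: the data frame `sim`, its columns `income` and `rate`, and the simulation parameters are all assumptions made up for this example.

```r
# Simulated stand-in data (NOT the UCI data); all names here are hypothetical.
set.seed(1)
n <- 200
income <- exp(rnorm(n, mean = 10.3, sd = 0.4))
rate <- exp(12 - 0.8 * log(income) + rnorm(n, sd = 0.7))
sim <- data.frame(income = income, rate = rate)

# Simple linear regression on the log scale; pmax() guards against log(0).
fit <- lm(log(pmax(rate, 1)) ~ log(income), data = sim)

# Grid of incomes (original scale) at which the bands are evaluated.
grid <- data.frame(income = seq(min(sim$income), max(sim$income),
                                length.out = 100))
pred_band <- predict(fit, newdata = grid, interval = "prediction", level = 0.95)
conf_band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

# Scatterplot, fitted line, and both sets of bands on the log scale.
plot(log(sim$income), log(pmax(sim$rate, 1)),
     xlab = "log(income)", ylab = "log(pmax(rate, 1))")
abline(fit, lwd = 2)
lines(log(grid$income), pred_band[, "lwr"], lty = 2)
lines(log(grid$income), pred_band[, "upr"], lty = 2)
lines(log(grid$income), conf_band[, "lwr"], lty = 3)
lines(log(grid$income), conf_band[, "upr"], lty = 3)

# Slope interval, R^2, adjusted R^2, and the error variance estimate.
slope_ci <- confint(fit, "log(income)", level = 0.95)
s <- summary(fit)
fit_stats <- c(r2 = s$r.squared, adj_r2 = s$adj.r.squared,
               sigma2_hat = s$sigma^2)
```

The grid is given on the original income scale because the formula applies log() itself when predict evaluates `newdata`. The prediction band (dashed) should always lie outside the confidence band (dotted), since it also accounts for the noise in a single new observation.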

# Question 2

Let $Y$ and $X$ denote variables in a simple linear regression of median home prices versus median income in states in the US. Suppose that the model $$Y = \beta_0 + \beta_1 X + \epsilon$$ satisfies the usual regression assumptions.

The table below is similar to the output of anova when passed a simple linear regression model.

Response: Y
          Df Sum Sq Mean Sq F value    Pr(>F)
X          1     NA    4239      NA        NA
Residuals 48 123546      NA

1. Compute the missing values in the above table.

2. Test the null hypothesis $H_0 : \beta_1 = 0$ at level $\alpha = 0.05$ using the above table. Can you test the hypothesis $H_0 : \beta_1 < 0$ using the above table?

3. If $Y$ and $X$ were reversed in the above regression, what would you expect $R^2$ to be?
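The entries of an anova table for a simple linear regression are tied together by a few identities, which the sketch below checks numerically. It uses simulated data rather than the numbers in the table above, and the variables `x` and `y` and their generating model are assumptions made up for this example.

```r
# Simulated data (hypothetical); n = 50 gives 48 residual degrees of freedom.
set.seed(2)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50)

fit_yx <- lm(y ~ x)
tab <- anova(fit_yx)

ss <- tab[["Sum Sq"]]   # regression and residual sums of squares
df <- tab[["Df"]]       # corresponding degrees of freedom

ms <- ss / df                                          # Mean Sq = Sum Sq / Df
f_stat <- ms[1] / ms[2]                                # F = MS(X) / MS(Residuals)
p_val <- pf(f_stat, df[1], df[2], lower.tail = FALSE)  # upper tail of F
r2 <- ss[1] / sum(ss)                                  # R^2 = SS(X) / SS(total)

# In simple regression the F statistic is the square of the slope t statistic.
t_slope <- coef(summary(fit_yx))["x", "t value"]

# Swapping the roles of x and y leaves R^2 unchanged (it equals cor(x, y)^2).
r2_swapped <- summary(lm(x ~ y))$r.squared
```

Because $F = t^2$ only carries the magnitude of the slope estimate, the table alone does not show its sign. That is worth keeping in mind when thinking about one-sided hypotheses.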

# Question 3

The tables below show the regression output of a multiple regression model relating Salary, the beginning salaries in dollars of employees in a given company, to the following predictor variables: Education, Experience and a variable STEM indicating whether or not they have an undergraduate degree in a STEM field. (The units of both Education and Experience are years.)

ANOVA table:

Response: Salary
             Df   Sum Sq   Mean Sq  F value   Pr(>F)
Regression   NA  2416338        NA       NA       NA
Residuals    62  9113079        NA

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3226.4      937.7      NA       NA
Education      850.0         NA   3.646       NA
Experience     923.4      260.1      NA       NA
STEM              NA      330.1   1.675       NA


Below, specify the null and alternative hypotheses, the test used, and your conclusion using $\alpha=0.05$ throughout. You may not necessarily be able to compute everything, but be as explicit as possible.

1. Fill in the missing values in the above table.

2. Test whether or not the linear regression model explains significantly more variability in Salary than a model with no explanatory variables. What assumptions are you making?

3. Is there a positive linear relationship between Salary and Experience, after accounting for the effect of the variables STEM and Education? (Hint: one-sided test)

4. What salary interval would you forecast for an electrical engineer with 10 years of education and 5 years working in a related field?

5. What salary interval would you forecast, on average, for English majors with 10 years of education and 6 years in a related field?
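The sketch below shows how the columns of a coefficient table relate to one another, and how predict separates a forecast for one individual from a forecast for a group average. It is fitted to simulated data, not the salary data behind the tables above: the data frame `d`, its column names, and the coefficients used to simulate it are assumptions made up for this example.

```r
# Simulated data (hypothetical); n = 66 with 3 predictors leaves 62 residual df.
set.seed(3)
n <- 66
d <- data.frame(edu = sample(8:16, n, replace = TRUE),
                exper = runif(n, 0, 10),
                stem = rbinom(n, 1, 0.5))
d$salary <- 3000 + 800 * d$edu + 900 * d$exper + 500 * d$stem +
  rnorm(n, sd = 400)

fit <- lm(salary ~ edu + exper + stem, data = d)
ct <- coef(summary(fit))
df_res <- df.residual(fit)

# t value = Estimate / Std. Error, so any one of the three gives the others.
t_stat <- ct[, "Estimate"] / ct[, "Std. Error"]

# Two-sided p-value, as reported in the Pr(>|t|) column.
p_two <- 2 * pt(abs(t_stat), df_res, lower.tail = FALSE)

# One-sided p-value for H_a: coefficient > 0 (upper tail only).
p_upper <- pt(t_stat[["exper"]], df_res, lower.tail = FALSE)

# Interval for one new individual vs. interval for the mean response.
new <- data.frame(edu = 10, exper = 5, stem = 1)
pi_new <- predict(fit, newdata = new, interval = "prediction", level = 0.95)
ci_new <- predict(fit, newdata = new, interval = "confidence", level = 0.95)
```

Both intervals need the full fitted model, because their widths depend on the design matrix and not only on the coefficient table. A printed summary alone may therefore not be enough to compute them.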

# Question 4 (Based on RABE 3.15)

A national insurance organization wanted to study the consumption pattern of cigarettes in all 50 states and the District of Columbia. The variables chosen for the study are:

• Age: Median age of a person living in a state.

• HS: Percentage of people over 25 years of age in a state who had completed high school.

• Income: Per capita personal income for a state (income in dollars).

• Black: Percentage of blacks living in a state.

• Female: Percentage of females living in a state.

• Price: Weighted average price (in cents) of a pack of cigarettes in a state.

• Sales: Number of packs of cigarettes sold in a state on a per capita basis.

The data can be found at http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt.

Below, specify the null and alternative hypotheses, the test used, and your conclusion using a 5% level of significance.

1. Test the hypothesis that the variable Female is not needed in the regression equation relating Sales to the six predictor variables.

2. Test the hypothesis that the variables Female and HS are not needed in the above regression equation.

3. Compute a 95% confidence interval for the true regression coefficient of the variable Income.

4. What percentage of the variation in Sales can be accounted for when Income is removed from the above regression equation? Which model did you use?
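Tests that one or more predictors are "not needed" can be phrased as comparisons of nested models, and the sketch below shows one way to set that up with anova. It runs on simulated data with generic predictor names `x1` to `x6`, not the cigarette data: the data frame `d` and the simulated coefficients are assumptions made up for this example.

```r
# Simulated data (hypothetical): 51 rows and six generic predictors.
set.seed(4)
n <- 51
d <- as.data.frame(matrix(rnorm(n * 6), n, 6))
names(d) <- paste0("x", 1:6)
d$y <- 100 + 3 * d$x1 - 2 * d$x2 + rnorm(n, sd = 5)

full <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = d)

# Reduced models dropping one predictor and two predictors.
drop_one <- update(full, . ~ . - x6)
drop_two <- update(full, . ~ . - x5 - x6)

# Partial F tests: reduced model first, full model second.
f_one <- anova(drop_one, full)
f_two <- anova(drop_two, full)

# Dropping a single predictor: the partial F equals the squared t statistic.
t_x6 <- coef(summary(full))["x6", "t value"]

# 95% confidence interval for one coefficient in the full model.
ci_x3 <- confint(full, "x3", level = 0.95)

# R^2 of the model refitted without one predictor.
r2_without_x3 <- summary(update(full, . ~ . - x3))$r.squared
```

The second row of each anova comparison holds the F statistic and its p-value. The Df column there gives the number of coefficients being tested: 1 for `f_one` and 2 for `f_two`.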