**You may discuss homework problems with other students, but you have to prepare the written assignments yourself. Late homework will be penalized 10% per day.**

**Please combine all your answers, the computer code and the figures into one file, and submit a copy in your dropbox on coursework.**

**Due date: January 30, 2015, 11:59PM.**

**Grading scheme: 10 points per question, total of 40.**

A criminologist studying the relationship between income level and assaults in U.S. cities (among other things) collected the following data for 2215 communities. The dataset can be found on the UCI Machine Learning Repository site.

We are interested in the per capita assault rate and its relation to median income.

```
%%R
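# Read the unnormalized Communities and Crime data; "?" marks missing values.
crime.data = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/00211/CommViolPredUnnormalizedData.txt", header=FALSE, sep=',', na.strings="?")
# The file has no header row, so the two variables are selected by column position.
median_income = crime.data[,18]
assault_rate = crime.data[,137]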
```

Fit a simple linear regression model to the data with `log(pmax(assault_rate,1))` as the dependent variable and `log(median_income)` as the independent variable. Plot the estimated regression line. (I am suggesting using the maximum of `assault_rate` and 1 because there are some communities where `assault_rate` is 0.)

Add upper and lower 95% prediction bands for the regression line on the plot, using `predict`. That is, produce one line for the upper limit of each interval over a sequence of incomes, and one line for the lower limits of the intervals. Interpret these bands at a `median_income` of 30000.

Add upper and lower 95% confidence bands for the regression line on the plot, using `predict`. That is, produce one line for the upper limit of each interval over a sequence of incomes, and one line for the lower limits of the intervals. Interpret these bands at a `median_income` of 30000.

Test whether there is a linear relationship between `log(assault_rate)` and `log(median_income)` at level $\alpha=0.05$. State the null hypothesis, the alternative, the conclusion and the $p$-value.

Give a 95% confidence interval for the slope of the regression line. Interpret your interval.

Report the $R^2$ and the adjusted $R^2$ of the model, as well as an estimate of the variance of the errors in the model.
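As a hint on the mechanics only, here is a minimal sketch of how `predict`, `confint` and `summary` can be combined for a simple linear regression. It runs on simulated data, not on the crime data: the names `sim`, `x`, `y`, `grid` and all numeric values are hypothetical stand-ins, with `x` playing the role of `log(median_income)` and `y` the role of `log(pmax(assault_rate,1))`.

```r
# Simulated stand-in data (hypothetical values, not the crime data).
set.seed(1)
x <- runif(200, min = 9, max = 11)
y <- 12 - 0.8 * x + rnorm(200, sd = 0.7)
sim <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = sim)

# Grid of predictor values at which the intervals are evaluated.
grid <- data.frame(x = seq(min(sim$x), max(sim$x), length.out = 50))

# Each call returns a matrix with columns fit, lwr and upr.
conf_band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
pred_band <- predict(fit, newdata = grid, interval = "prediction", level = 0.95)

# Scatter plot, fitted line, confidence bands (dashed), prediction bands (dotted).
plot(sim$x, sim$y, xlab = "x", ylab = "y", col = "gray")
abline(fit, lwd = 2)
lines(grid$x, conf_band[, "lwr"], lty = 2)
lines(grid$x, conf_band[, "upr"], lty = 2)
lines(grid$x, pred_band[, "lwr"], lty = 3)
lines(grid$x, pred_band[, "upr"], lty = 3)

# Both intervals at a single predictor value, here log(30000).
at_x0 <- data.frame(x = log(30000))
pred_at_x0 <- predict(fit, newdata = at_x0, interval = "prediction")
conf_at_x0 <- predict(fit, newdata = at_x0, interval = "confidence")

# Slope interval, R^2, adjusted R^2 and the estimated error variance.
slope_ci <- confint(fit, "x", level = 0.95)
s <- summary(fit)
r2 <- s$r.squared
adj_r2 <- s$adj.r.squared
sigma2_hat <- s$sigma^2
```

Because the model is fit on the log scale, the predictor values passed to `predict` must be on the log scale as well, and the resulting intervals are intervals for the logged response.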

Let $Y$ and $X$ denote variables in a simple linear regression of median home prices versus median income for states in the US. Suppose that the model $$ Y = \beta_0 + \beta_1 X + \epsilon $$ satisfies the usual regression assumptions.

The table below is similar to the output of `anova` when passed a simple linear regression model.

```
Response: Y
          Df Sum Sq Mean Sq F value Pr(>F)
X          1     NA    4239      NA     NA
Residuals 48 123546      NA
```

Compute the missing values in the above table.

Test the null hypothesis $H_0 : \beta_1 = 0$ at level $\alpha = 0.05$ using the above table. Can you test the hypothesis $H_0 : \beta_1 < 0$ using the above table?

If $Y$ and $X$ were reversed in the above regression, what would you expect $R^2$ to be?
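For reference, the columns of an `anova` table for a simple linear regression are tied together by a few identities. The sketch below checks them on simulated data; the variables `x`, `y` and every number in it are hypothetical and unrelated to the table above.

```r
# Simulated data (hypothetical), used only to illustrate the table layout.
set.seed(2)
n <- 50
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)

tab <- anova(lm(y ~ x))   # rows: x, Residuals

df <- tab[["Df"]]
ss <- tab[["Sum Sq"]]

# Mean Sq is Sum Sq divided by its degrees of freedom.
ms <- ss / df

# F value is the ratio of the regression mean square to the residual mean square.
f_stat <- ms[1] / ms[2]

# Pr(>F) is the upper tail of the F distribution with (df[1], df[2]) degrees of freedom.
p_val <- pf(f_stat, df1 = df[1], df2 = df[2], lower.tail = FALSE)

# R^2 can be read off the same table as SSR / (SSR + SSE).
r2 <- ss[1] / sum(ss)
```

The same identities can be applied by hand to any partially filled table of this form.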

The tables below show the regression output of a multiple regression model relating `Salary`, the beginning salaries in dollars of employees in a given company, to the following predictor variables: `Education`, `Experience` and a variable `STEM` indicating whether or not they have an undergraduate degree in a STEM field. (The units of both `Education` and `Experience` are years.)

```
ANOVA table:
Response: Salary
           Df  Sum Sq Mean Sq F value Pr(>F)
Regression NA 2416338      NA      NA     NA
Residuals  62 9113079      NA

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3226.4      937.7      NA       NA
Education      850.0         NA   3.646       NA
Experience     923.4      260.1      NA       NA
STEM              NA      330.1   1.675       NA
```

Below, specify the null and alternative hypotheses, the test used, and your conclusion using $\alpha=0.05$ throughout. You may not necessarily be able to compute everything, but be as explicit as possible.

Fill in the missing values in the above table.

Test whether or not the linear regression model explains significantly more variability in `Salary` than a model with no explanatory variables. What assumptions are you making?

Is there a positive linear relationship between `Salary` and `Experience`, after accounting for the effect of the variables `STEM` and `Education`? (Hint: one-sided test)

What salary interval would you forecast for an electrical engineer with 10 years of education and 5 years working in a related field?

What salary interval would you forecast, on average, for English majors with 10 years of education and 6 years in a related field?
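The sketch below illustrates, on simulated data, how the columns of a coefficient table relate to each other, how a one-sided $p$-value is obtained from a $t$ statistic, and how forecasting an individual differs from forecasting an average. The data frame `d`, its columns `edu`, `exper`, `stem`, `salary` and all numeric values are hypothetical and do not reproduce the tables above.

```r
# Simulated data (hypothetical), only to illustrate the calculations.
set.seed(3)
n <- 40
d <- data.frame(
  edu   = sample(8:16, n, replace = TRUE),
  exper = sample(0:10, n, replace = TRUE),
  stem  = rbinom(n, 1, 0.5)
)
d$salary <- 3000 + 800 * d$edu + 900 * d$exper + 500 * d$stem + rnorm(n, sd = 2000)

fit <- lm(salary ~ edu + exper + stem, data = d)

# Matrix with columns Estimate, Std. Error, t value, Pr(>|t|).
coefs <- summary(fit)$coefficients

# The t value is the estimate divided by its standard error.
t_exper <- coefs["exper", "Estimate"] / coefs["exper", "Std. Error"]

# One-sided p-value for H_a: coefficient > 0, on the residual degrees of freedom.
p_one_sided <- pt(t_exper, df = fit$df.residual, lower.tail = FALSE)

# Interval for a single new individual versus interval for the mean response.
new_obs <- data.frame(edu = 10, exper = 5, stem = 1)
individual <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)
average    <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)
```

When only a printed table is available rather than a fitted model object, the first two calculations can still be done by hand, while the intervals additionally require information that a coefficient table alone may not provide.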

A national insurance organization wanted to study the consumption pattern of cigarettes in all 50 states and the District of Columbia. The variables chosen for the study are:

Age: Median age of a person living in a state.

HS: Percentage of people over 25 years of age in a state who had completed high school.

Income: Per capita personal income for a state (income in dollars).

Black: Percentage of blacks living in a state.

Female: Percentage of females living in a state.

Price: Weighted average price (in cents) of a pack of cigarettes in a state.

Sales: Number of packs of cigarettes sold in a state on a per capita basis.

The data can be found at http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt.

Below, specify the null and alternative hypotheses, the test used, and your conclusion using a 5% level of significance.

Test the hypothesis that the variable `Female` is not needed in the regression equation relating `Sales` to the six predictor variables.

Test the hypothesis that the variables `Female` and `HS` are not needed in the above regression equation.

Compute a 95% confidence interval for the true regression coefficient of the variable `Income`.

What percentage of the variation in `Sales` can be accounted for when `Income` is removed from the above regression equation? Which model did you use?
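Questions about whether one or more variables are "needed" are commonly framed as a comparison of nested models. The following minimal sketch shows that comparison with `anova` on simulated data; the data frame `d`, the predictors `x1`, `x2`, `x3` and all numeric values are hypothetical and are not the cigarette data.

```r
# Simulated data (hypothetical): only x1 actually affects y.
set.seed(4)
n <- 51
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 + 1.5 * d$x1 + rnorm(n)

full    <- lm(y ~ x1 + x2 + x3, data = d)
reduced <- lm(y ~ x1, data = d)      # drops x2 and x3 together

# Partial F-test of H_0: the coefficients of x2 and x3 are both zero.
cmp <- anova(reduced, full)

# The same F statistic computed from the residual sums of squares.
rss_r <- sum(resid(reduced)^2)
rss_f <- sum(resid(full)^2)
q <- reduced$df.residual - full$df.residual
f_manual <- ((rss_r - rss_f) / q) / (rss_f / full$df.residual)
p_manual <- pf(f_manual, df1 = q, df2 = full$df.residual, lower.tail = FALSE)

# Confidence interval for one coefficient, and R^2 of the smaller model.
ci_x1 <- confint(full, "x1", level = 0.95)
r2_reduced <- summary(reduced)$r.squared
```

When a single variable is dropped, this $F$ statistic equals the square of that variable's $t$ statistic in the full model, so either test can be reported.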