*Based on the notebook by Michael Granitzer*

Goal: Implement and evaluate a linear model to predict car mileage.

Location:

- http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
- http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names

Tools:

`!curl -o <filename> "<url>"`

`urllib.request.urlretrieve(<url>, <filename>)`

`tempfileName, headers = urllib.request.urlretrieve(<url>)`

Specifications:

- Use `data` as the filename for the data.
- Use `description` as the filename for the names.
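As a sketch of the download step, here is one way to do it with Python 3's `urllib.request` (the `BASE` constant and the `fetch` helper are names chosen for this example, not part of the exercise; downloading requires network access, so the calls are left commented out):

```python
import urllib.request

# base URL of the UCI auto-mpg directory (from the links above)
BASE = "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/"

def fetch(name, filename):
    """Download one file from the repository into the given local filename."""
    urllib.request.urlretrieve(BASE + name, filename)

# Uncomment to download:
# fetch("auto-mpg.data", "data")
# fetch("auto-mpg.names", "description")
```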

In [ ]:

```
import numpy as np
```

In [ ]:

```
# code to get the data
```

In [ ]:

```
#%load solutions/get-data-curl.py
```

In [ ]:

```
#%load solutions/get-data-urllib.py
```

In [ ]:

```
# code to load the data
```

In [ ]:

```
#%load solutions/load-data.py
```

In [ ]:

```
# look at the rows containing missing values
!cat data | grep "?"
```

We don't want the car-name column, and the data has missing values, marked by `"?"`.

Tools:

`auto = np.genfromtxt("<filename>", usecols=(0,1,2,3,4,5,6,7), missing_values='?')`

In [ ]:

```
# code to load the data without strings
```

In [ ]:

```
#%load solutions/load-data2.py
```

In [ ]:

```
auto = np.genfromtxt("data", usecols=(0,1,2,3,4,5,6,7),
                     missing_values='?')
```

Let's now handle the missing values.

In [ ]:

```
# missing values are represented as nan; let's check which columns contain them
np.any(np.isnan(auto), axis=0)
```

In [ ]:

```
# now check which rows contain them
np.any(np.isnan(auto), axis=1)
```

Now print the rows which contain nans, remove them from the `auto` matrix, and store the result back in `auto`.

In [ ]:

```
# code to remove nans
nan_rows = ? # fixme
auto = auto[:,:] # fixme
```
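One possible way to fill in the cell above, shown here on a small synthetic matrix standing in for `auto` (the real solution is in `solutions/remove-nans.py`):

```python
import numpy as np

# toy stand-in for `auto`: the second and third rows contain NaNs
auto = np.array([[18.0, 8.0],
                 [np.nan, 4.0],
                 [22.0, np.nan],
                 [30.0, 4.0]])

nan_rows = np.any(np.isnan(auto), axis=1)  # boolean mask, True where a row has any NaN
print(auto[nan_rows])                      # the rows to be dropped
auto = auto[~nan_rows]                     # keep only the complete rows
```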

In [ ]:

```
#%load solutions/remove-nans.py
```

- Let's assume that we are going to predict the mileage of a car from various *features* in the data. We use a *linear model*.
- Denote the data by $X_{n \times p}$, containing $n$ rows and $p$ columns.
- A single *training example* is a single row of $X$, with each element being the value of a feature.
- In a linear model, the target is a linear combination of the feature values.
- We can write the *prediction* for a single row $x_i$ of $X$ as follows:

$$\hat{y}_i = w_0 + w_1 x_{i1} + w_2 x_{i2} + \dots + w_p x_{ip}$$

- If we augment $X$ with a leftmost column of 1's, we can write all predictions in vector form:

$$\hat{y} = Xw$$

- Our predictions from the training examples may differ from the actual values of the target. We define the error as the sum of squared errors:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = (y - Xw)^T (y - Xw)$$

- We want to find the value of $w$ that minimises this error. This is given by the normal equations:

$$w = (X^T X)^{-1} X^T y$$
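The closed-form solution above can be checked on a tiny synthetic data set (the data here is made up for illustration, generated from $y = 1 + 2x$, so the recovered weights should be $[1, 2]$):

```python
import numpy as np

# synthetic data from y = 1 + 2*x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

X = np.column_stack([np.ones_like(x), x])  # augment with a leftmost column of 1's
w = np.linalg.solve(X.T @ X, X.T @ y)      # solve the normal equations X^T X w = X^T y
print(w)
```

Using `np.linalg.solve` on the normal equations avoids explicitly forming the inverse $(X^T X)^{-1}$, which is both cheaper and numerically safer.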

In [ ]:

```
import numpy.linalg as la
```

In [ ]:

```
# code to fit and predict
def linreg(X, y):
    # 1. Add a column of 1's to X; now it has a total of p columns
    # 2. Calculate the weight vector w (should have p columns too)
    # 3. Calculate the predicted values of the target, y_pred
    # 4. Calculate the error, sse
    return w, y_pred, sse

# 1. Split the data into the training examples X (cylinders and displacement)
#    and target column y (miles per gallon)
X = auto[:,:] # fixme
y = auto[:,:] # fixme
print(linreg(X, y))
```
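One possible completion of the skeleton above, following the four numbered steps (the real solution is in `solutions/linreg.py`; the check data below is synthetic, chosen so the fit is exact and `sse` is numerically zero):

```python
import numpy as np
import numpy.linalg as la

def linreg(X, y):
    X1 = np.column_stack([np.ones(X.shape[0]), X])  # 1. add a column of 1's
    w = la.solve(X1.T @ X1, X1.T @ y)               # 2. weights via the normal equations
    y_pred = X1 @ w                                  # 3. predicted targets
    sse = np.sum((y - y_pred) ** 2)                  # 4. sum of squared errors
    return w, y_pred, sse

# sanity check on exactly linear data: y = 0.5 + 1.0*x1 - 2.0*x2
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 2.0]])
y = 0.5 + 1.0 * X[:, 0] - 2.0 * X[:, 1]
w, y_pred, sse = linreg(X, y)
```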

In [ ]:

```
#%load solutions/linreg.py
```

A better error measure: $RMS = \sqrt{SSE / n}$. Can be roughly interpreted as the prediction error in miles-per-gallon.
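The conversion from SSE to RMS is a one-liner; the numbers below are hypothetical stand-ins for the `n` and `sse` you would get from `linreg` on the real data:

```python
from math import sqrt

# hypothetical values: n examples and an sse as returned by the fit
n, sse = 392, 7474.8
rms = sqrt(sse / n)   # RMS error, in the units of the target (miles per gallon)
print(rms)
```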

In [ ]:

```
from math import sqrt
```

In [ ]:

```
# code to print the RMS error
```

In [ ]:

```
#%load solutions/rms.py
```

Let's make predictions using single features and plot the fit.

In [ ]:

```
import matplotlib.pyplot as plt
%matplotlib inline
```

In [ ]:

```
# predict and plot using just cylinders as a feature, print the sse
X = auto[:,:] # fixme
y = auto[:,:] # fixme
w, y_pred, sse = linreg(X, y)
plt.plot(X, y, 'o')
plt.plot(X, y_pred, 'r')
plt.show()
```

In [ ]:

```
#%load solutions/predict-cylinders.py
```

In [ ]:

```
# predict and plot using just weight as a feature, print the sse
```

In [ ]:

```
#%load solutions/predict-and-plot.py
```