# Project: Linear Regression

Based on the notebook by Michael Granitzer

Goal: Implement and evaluate a linear model to predict car mileage.

## 1. Get the data!

Location:

Tools:

• !curl -o <filename> "<url>"
• urllib.request.urlretrieve(<url>, <filename>) (in Python 2: urllib.urlretrieve)
• tempfileName, headers = urllib.request.urlretrieve(<url>)

Specifications:

• Use data as the filename for the data.
• Use description as the filename for the names.
import numpy as np

# code to get the data

#%load solutions/get-data-curl.py

#%load solutions/get-data-urllib.py
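
As a rough illustration of the urllib route, the sketch below wraps the download in a small helper. It assumes Python 3, where urlretrieve lives in urllib.request. The helper name fetch and the stand-in source file are hypothetical, and a local file:// URL replaces the real data location so that the example runs offline.

```python
import os
import tempfile
import urllib.request
from pathlib import Path


def fetch(url, filename):
    """Download url to filename and return the local path (hypothetical helper)."""
    local_path, _headers = urllib.request.urlretrieve(url, filename)
    return local_path


# Stand-in source: a small local file addressed through a file:// URL.
workdir = tempfile.mkdtemp()
source = Path(workdir) / "source.txt"
source.write_text("18.0 8 307.0\n")  # made-up content, not the real dataset

# Save the download under the filename "data", as the specifications ask.
saved = fetch(source.as_uri(), os.path.join(workdir, "data"))
print(saved)
```

For the exercise itself, the same call with the dataset URL and the filenames data and description would follow the specifications above.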


## 2. Load the data

Tools:

• auto = np.loadtxt("<filename>")
# code to load the data

#%load solutions/load-data.py


### 2.1. Inspection

# Look at the examples that contain missing values, marked by "?".
!cat data | grep "?"


### 2.2. Clean the data

We don't need the car name, and the data has missing values, marked by "?".

Tools:

• auto = np.genfromtxt("<filename>", usecols=(0,1,2,3,4,5,6,7), missing_values='?')
# code to load the data without strings

#%load solutions/load-data2.py

auto = np.genfromtxt("data", usecols=(0,1,2,3,4,5,6,7),
                     missing_values='?')
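
To see what this call does with missing entries, here is a small sketch on made-up rows. The sample text, the column layout, and the name demo are assumptions for illustration only; they are not taken from the real data file.

```python
import io
import numpy as np

# Made-up rows in a whitespace-separated layout: numeric columns, then a name.
# The "?" in the second row stands for a missing value.
sample = io.StringIO(
    "18.0 8 307.0 130.0 name_a\n"
    "25.0 4 98.0 ? name_b\n"
    "22.0 6 198.0 95.0 name_c\n"
)

# usecols skips the trailing string column; "?" is read as nan.
demo = np.genfromtxt(sample, usecols=(0, 1, 2, 3), missing_values='?')

print(demo.shape)            # (3, 4)
print(np.isnan(demo[1, 3]))  # True
```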


Let's now handle the missing values.

# Missing values appear as nan. Let's see which columns contain them.
np.any(np.isnan(auto), axis=0)

# Now let's see which rows contain them.
np.any(np.isnan(auto), axis=1)


Now print the rows that contain nans, remove them from the auto matrix, and store the result back in auto.

# code to remove nans
nan_rows = None # fixme: boolean mask of the rows that contain nan
auto = auto[:,:] # fixme

#%load solutions/remove-nans.py
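
One possible approach, sketched on a made-up array rather than the real auto matrix, is to build a boolean row mask and index with its negation. The names demo and demo_clean are hypothetical.

```python
import numpy as np

# Made-up stand-in for the auto matrix; the second row has a missing value.
demo = np.array([
    [18.0, 8.0, 130.0],
    [25.0, 4.0, np.nan],
    [22.0, 6.0, 95.0],
])

# True for every row that contains at least one nan.
nan_rows = np.any(np.isnan(demo), axis=1)

# Print the affected rows, then keep only the complete ones.
print(demo[nan_rows])
demo_clean = demo[~nan_rows]
print(demo_clean.shape)  # (2, 3)
```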


## 3. Fit the data

• Let's assume that we are going to predict the mileage of a car from various features in the data. We use a linear model.
• Denote the data by $X_{n \times p}$, containing $n$ rows and $p$ columns.
• A single training example is a single row of $X$, with each element being the value of a feature.
• In a linear model, the target is a linear combination of the feature values.
• We can write the prediction of a single row of $X$ as follows:
$$\hat{y}(X)_i = w_0 + \sum_{j=1}^p w_jx_{ij}, \quad x_{ij} \in X[i,:]$$
• If we augment $X$ with a leftmost column of 1's and collect the weights in the vector $w = (w_0, w_1, \dots, w_p)^T$, we can write all predictions in vector form:
$$\hat{y}(X) = Xw$$
• Our predictions for the training examples may differ from the actual values of the target. We define the error as:
$$\textrm{SSE} = \sum_{i=1}^n (y_i - \hat{y}(X)_i )^2$$
• We want to find the value of $w$ that minimises this error. This is given by:
$$w = (X^TX)^{-1}X^Ty$$
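
The closed-form solution can be recovered with a short derivation. This is a sketch that assumes $X$ is the augmented matrix, that the predictions are the matrix-vector product $Xw$, and that $X^TX$ is invertible. In matrix form the error is:

$$\textrm{SSE}(w) = (y - Xw)^T(y - Xw)$$

Setting the gradient with respect to $w$ to zero gives the normal equations:

$$\nabla_w \textrm{SSE}(w) = -2X^T(y - Xw) = 0 \quad \Rightarrow \quad X^TXw = X^Ty$$

Multiplying both sides by $(X^TX)^{-1}$ yields $w = (X^TX)^{-1}X^Ty$.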
import numpy.linalg as la

# code to fit and predict
def linreg(X,y):
    # 1. Add a column of 1's to X; it now has a total of p + 1 columns

    # 2. Calculate the weight vector w (it should have p + 1 entries too)

    # 3. Calculate the predicted values of the target, y_pred

    # 4. Calculate the error, sse

    return w, y_pred, sse

# 1. Split the data into the training examples X (cylinders and displacement)
#    and target column y (miles per gallon)
X = auto[:,:] # fixme
y = auto[:,:] # fixme
print(linreg(X, y))

#%load solutions/linreg.py
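
The four steps can be sketched as follows. This is one possible implementation, not necessarily the one in the solution file. The name linreg_sketch and the toy data are hypothetical, and la.solve is used on the normal equations instead of forming the inverse explicitly, which gives the same $w$ when $X^TX$ is invertible.

```python
import numpy as np
import numpy.linalg as la


def linreg_sketch(X, y):
    """Least-squares fit via the normal equations (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, np.newaxis]  # treat a 1-D input as a single feature column
    # 1. Add a leftmost column of 1's, giving p + 1 columns.
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])
    # 2. Solve (X^T X) w = X^T y for the p + 1 weights.
    w = la.solve(Xa.T.dot(Xa), Xa.T.dot(y))
    # 3. Predicted target values.
    y_pred = Xa.dot(w)
    # 4. Sum of squared errors.
    sse = np.sum((y - y_pred) ** 2)
    return w, y_pred, sse


# Toy data generated from y = 1 + 2x, so the fit should recover w = [1, 2].
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
w, y_pred, sse = linreg_sketch(X_toy, y_toy)
print(w, sse)
```

On the real data, slicing with a list of column indices, for example auto[:, [1, 2]], keeps X two-dimensional, whereas auto[:, 0] gives a 1-D target vector; which indices correspond to which features is left to the description file.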


A more interpretable error measure is the root-mean-square error, $\textrm{RMS} = \sqrt{\textrm{SSE} / n}$. It can be roughly interpreted as the typical prediction error in miles per gallon.

from math import sqrt

# code to print the RMS error

#%load solutions/rms.py
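
As a quick check of the formula, here is a sketch on made-up target values and predictions; none of the numbers come from the car data.

```python
from math import sqrt

import numpy as np

# Made-up targets and predictions; each prediction is off by exactly 2.
y = np.array([3.0, 5.0])
y_pred = np.array([1.0, 3.0])

sse = float(np.sum((y - y_pred) ** 2))  # 2^2 + 2^2 = 8
rms = sqrt(sse / len(y))                # sqrt(8 / 2)
print(rms)  # 2.0
```

Because every prediction misses by 2, the RMS error is also 2, which matches the reading of RMS as a typical per-example error in the units of the target.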


## 4. Visualise the data

Let's make predictions using single features and plot the fit.

import matplotlib.pyplot as plt
%matplotlib inline

# predict and plot using just cylinders as a feature, print the sse
X = auto[:,:] # fixme
y = auto[:,:] # fixme
w, y_pred, sse = linreg(X, y)

plt.plot(X, y, 'o')
plt.plot(X, y_pred, 'r')
plt.show()

#%load solutions/predict-cylinders.py
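
A single-feature fit and its plot might look like the sketch below. It uses made-up numbers in place of a column of auto, fits the line directly from the normal equations rather than calling linreg, and sorts by the feature before drawing the line; all names here are assumptions for illustration.

```python
import numpy as np
import numpy.linalg as la
import matplotlib.pyplot as plt

# Made-up stand-ins for one feature column and the target.
x = np.array([4.0, 8.0, 6.0, 4.0, 8.0])
y = np.array([30.0, 14.0, 20.0, 28.0, 12.0])

# Least-squares fit of y ~ w0 + w1 * x via the normal equations.
Xa = np.column_stack([np.ones_like(x), x])
w = la.solve(Xa.T.dot(Xa), Xa.T.dot(y))
y_pred = Xa.dot(w)
print(np.sum((y - y_pred) ** 2))  # the sse of this toy fit

# Sort by the feature so the fitted line is drawn from left to right.
order = np.argsort(x)
fig, ax = plt.subplots()
ax.plot(x, y, 'o')                     # observations
ax.plot(x[order], y_pred[order], 'r')  # fitted line
ax.set_xlabel('feature')
ax.set_ylabel('target')
```

In a notebook with %matplotlib inline the figure renders at the end of the cell; in a plain script, a call to plt.show() would be needed.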

# predict and plot using just weight as a feature, print the sse

#%load solutions/predict-and-plot.py