Project: Linear Regression

Based on the notebook by Michael Granitzer

Goal: Implement and evaluate a linear model to predict car mileage.

1. Get the data!

Location:

Tools:

  • !curl -o <filename> "<url>"
  • urllib.request.urlretrieve(<url>, <filename>)
  • tempfileName, headers = urllib.request.urlretrieve(<url>)

Specifications:

  • Use data as the filename for the data.
  • Use description as the filename for the names.
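As a minimal, self-contained sketch of the `urlretrieve` pattern (using a local `file://` URL and temp-directory paths as stand-ins, since the real dataset URL is not repeated here):

```python
import os
import tempfile
import urllib.request

# Create a small local file to stand in for the remote dataset
# (a real download would use the dataset URL instead).
src_path = os.path.join(tempfile.gettempdir(), "fake_remote.data")
with open(src_path, "w") as f:
    f.write("18.0 8 307.0\n15.0 8 350.0\n")

# urlretrieve accepts any scheme urllib supports, including file://;
# it returns the local filename and the response headers.
url = "file://" + src_path
dest_path = os.path.join(tempfile.gettempdir(), "data")
filename, headers = urllib.request.urlretrieve(url, dest_path)
print(filename)
```

The same call with the dataset's HTTP URL and `"data"` as the second argument satisfies the specification above.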
In [ ]:
import numpy as np
In [ ]:
# code to get the data
In [ ]:
#%load solutions/get-data-curl.py
In [ ]:
#%load solutions/get-data-urllib.py

2. Load the data

Tools:

  • auto = numpy.loadtxt("<filename>")
In [ ]:
# code to load the data
In [ ]:
#%load solutions/load-data.py
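To see what `np.loadtxt` expects, here is a sketch on an in-memory buffer with purely numeric, whitespace-delimited rows (synthetic values, not the real dataset):

```python
import io
import numpy as np

# loadtxt parses whitespace-delimited numeric text; it raises an error
# on non-numeric tokens such as '?' or quoted car names.
buf = io.StringIO("18.0 8 307.0\n15.0 8 350.0\n")
arr = np.loadtxt(buf)
print(arr.shape)  # (2, 3)
```

This is why loading the raw auto data with `loadtxt` fails: the file contains string columns and `?` markers, which the next section deals with.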

2.1. Inspection

In [ ]:
# look at the first few examples, then search for missing values marked "?"
!head data
!grep "?" data

2.2. Clean the data

We don't want the car-name column, and the data has missing values, marked by "?".

Tools:

  • auto = np.genfromtxt("<filename>", usecols=(0,1,2,3,4,5,6,7), missing_values='?')
In [ ]:
# code to load the data without strings
In [ ]:
#%load solutions/load-data2.py
In [ ]:
auto = np.genfromtxt("data", usecols=(0,1,2,3,4,5,6,7),
                     missing_values='?')
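To see what `missing_values` does, here is a small synthetic sketch: the `?` token is replaced by `nan`, the default filling value for float data:

```python
import io
import numpy as np

# genfromtxt replaces tokens listed in missing_values with the default
# filling value for the column dtype; for floats that is nan.
buf = io.StringIO("18.0 8 130.0\n25.0 4 ?\n")
arr = np.genfromtxt(buf, missing_values='?')
print(arr)
```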

Let's now handle the missing values.

In [ ]:
# missing values are displayed as nan; let's take a look at which columns have them
np.any(np.isnan(auto), axis=0)
In [ ]:
# Now let's take a look at which rows have them
np.any(np.isnan(auto), axis=1)

Now print the rows which contain nans, remove them from the auto matrix and store the result in the same auto matrix.

In [ ]:
# code to remove nans
nan_rows = ? # fixme
auto = auto[:,:] # fixme
In [ ]:
#%load solutions/remove-nans.py
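One common masking idiom for this step can be sketched on a toy array (synthetic values, not the auto data):

```python
import numpy as np

# Toy array with a nan in the middle row
a = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, 6.0]])

# Boolean mask: True for rows containing at least one nan
nan_rows = np.any(np.isnan(a), axis=1)

# Keep only the rows without nans (~ negates the mask)
a = a[~nan_rows]
print(a.shape)  # (2, 2)
```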

3. Fit the data

  • Let's assume that we are going to predict the mileage of a car from various features in the data. We use a linear model.
  • Denote the data by $X_{n \times p}$, containing $n$ rows and $p$ columns.
  • A single training example is a single row of $X$, with each element being the value of a feature.
  • In a linear model, the target is a linear combination of the feature values.
  • We can write the prediction of a single row of $X$ as follows:
$$ \hat{y}(X)_i = w_0 + \sum_{j=1}^p w_jx_{ij}, \quad x_{ij} \in X[i,:] $$
  • If we augment $X$ with a leftmost column of 1's, we can write all $n$ predictions at once in vector form:
$$ \hat{y}(X) = Xw $$
  • Our predictions on the training examples may differ from the actual values of the target. We define the error as:
$$ \textrm{SSE} = \sum_{i=1}^n (y_i - \hat{y}(X)_i )^2 $$
  • We want to find the value of $w$ that minimises this error. This is given by:
$$ w = (X^TX)^{-1}X^Ty $$
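The derivation above can be checked on synthetic data where the true line is known. This sketch applies the normal equation directly (the variable names are illustrative, not part of the exercise):

```python
import numpy as np
import numpy.linalg as la

# Synthetic data generated by y = 2 + 3*x exactly,
# so the fit should recover w = [2, 3] with zero error.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x

# Augment with a leftmost column of 1's for the intercept w_0
X = np.column_stack([np.ones_like(x), x])

# Normal equation: w = (X^T X)^{-1} X^T y
w = la.inv(X.T.dot(X)).dot(X.T).dot(y)

y_pred = X.dot(w)
sse = np.sum((y - y_pred) ** 2)
print(w, sse)  # w ≈ [2, 3], sse ≈ 0
```

In practice `la.solve(X.T.dot(X), X.T.dot(y))` or `la.lstsq` is numerically preferable to forming the inverse; `la.inv` is used here only to mirror the formula.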
In [ ]:
import numpy.linalg as la
In [ ]:
# code to fit and predict
def linreg(X,y):
    # 1. Add a column of 1's to X; the augmented X now has p+1 columns
    
    # 2. Calculate the weight vector w (it has p+1 entries, one per column of the augmented X)
    
    # 3. Calculate the predicted values of the target, y_pred
    
    # 4. Calculate the error, sse
    
    return w, y_pred, sse

# 1. Split the data into the training examples X (cylinders and displacement)
#    and target column y (miles per gallon)
X = auto[:,:] # fixme
y = auto[:,:] # fixme
print(linreg(X, y))
In [ ]:
#%load solutions/linreg.py

A better error measure: $\textrm{RMS} = \sqrt{\textrm{SSE} / n}$. It can be roughly interpreted as the typical prediction error in miles per gallon.
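A quick worked example of the conversion (the numbers are made up):

```python
from math import sqrt

# An SSE of 250 over n = 100 examples gives an RMS error
# in the same units as the target (miles per gallon).
sse = 250.0
n = 100
rms = sqrt(sse / n)
print(rms)  # ≈ 1.58 mpg
```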

In [ ]:
from math import sqrt
In [ ]:
# code to print the RMS error
In [ ]:
#%load solutions/rms.py

4. Visualise the data

Let's make predictions using single features and plot the fit.

In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
# predict and plot using just cylinders as a feature, print the sse
X = auto[:,:] # fixme
y = auto[:,:] # fixme
w, y_pred, sse = linreg(X, y)

plt.plot(X, y, 'o')
plt.plot(X, y_pred, 'r')
plt.show()
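One plotting caveat, sketched below on synthetic values: `plt.plot` connects points in data order, so if the feature column isn't sorted, the red fit line zig-zags back and forth; sorting by the feature first gives a clean line. (The Agg backend is selected only so the sketch runs headless.)

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside the notebook
import matplotlib.pyplot as plt

# Unsorted feature values, with predictions standing in for a fitted model
x = np.array([3.0, 1.0, 2.0])
y_pred = 2.0 * x

# Sort both arrays by x before drawing the fit line
order = np.argsort(x)
plt.plot(x[order], y_pred[order], 'r')
print(x[order])  # [1. 2. 3.]
```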
In [ ]:
#%load solutions/predict-cylinders.py
In [ ]:
# predict and plot using just weight as a feature, print the sse
In [ ]:
#%load solutions/predict-and-plot.py