# special IPython command to prepare the notebook for matplotlib
%matplotlib inline
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
# To load in the baseball data
import requests
import StringIO
import zipfile
# special matplotlib argument for improved plots
from matplotlib import rcParams
#colorbrewer2 Dark2 qualitative color table
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
(0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
(0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
(0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
(0.4, 0.6509803921568628, 0.11764705882352941),
(0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
(0.6509803921568628, 0.4627450980392157, 0.11372549019607843)]
rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'white'
rcParams['patch.facecolor'] = dark2_colors[0]
rcParams['font.family'] = 'StixGeneral'
Previously discussed: numpy, scipy, statsmodels, sklearn

## statsmodels

statsmodels is a Python module specifically for estimating statistical models (with less machine learning than sklearn). It can estimate many types of statistical models, but today we will focus on linear regression.
Last week we learned that linear regression is a method to model the relationship between a set of independent variables $X$ (also known as explanatory variables, features, or predictors) and a dependent variable $Y$. This method assumes that each predictor in $X$ is linearly related to the dependent variable $Y$.
$$ Y = \beta_0 + \beta_1 X + \epsilon $$

On Tuesday in lecture, we learned how to write this model using matrix multiplication:

$$ Y = X \beta + \epsilon $$

where $Y$ has dimensions $n \times 1$, $X$ has dimensions $n \times p$, $\beta$ has dimensions $p \times 1$, and $\epsilon$ has dimensions $n \times 1$. On Tuesday, we also derived the least squares estimates of the coefficients of a linear model. These estimates minimize the sum of squared residuals:

$$ S = \sum_{i=1}^n r_i^2 = \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2 $$

where $n$ is the number of observations.
The least squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ minimize the sum of the squared residuals $r_i = y_i - (\beta_0 + \beta_1 x_i)$ in the model (i.e., they make the difference between the observed $y_i$ and the linear model $\beta_0 + \beta_1 x_i$ as small as possible).
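Setting the partial derivatives of $S$ with respect to $\beta_0$ and $\beta_1$ to zero yields the familiar closed-form estimates:

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} $$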
The Old Faithful Geyser data set is a well-known data set that describes the relationship between the waiting time between eruptions and the duration of each eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA [webcam]. This data set is found in the base installation of the R programming language.
faithful is a data set with 272 observations on 2 variables.
Column name | Description |
---|---|
eruptions | Eruption time (in mins) |
waiting | Waiting time to next eruption (in mins) |
There is a function in statsmodels (or sm for short) called sm.datasets.get_rdataset which will download and return a data set found in R. Let's import the faithful dataset.
faithful = sm.datasets.get_rdataset("faithful")
# Let's look at the help file
# sm.datasets.get_rdataset?
# faithful?
faithful.title
'Old Faithful Geyser Data'
faithful = faithful.data
faithful.head()
eruptions | waiting | |
---|---|---|
0 | 3.600 | 79 |
1 | 1.800 | 54 |
2 | 3.333 | 74 |
3 | 2.283 | 62 |
4 | 4.533 | 85 |
faithful.shape
(272, 2)
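Before plotting, here is a quick numeric summary of both variables (a small sketch using pandas' describe method):

# summary statistics (count, mean, std, quartiles) for both columns
faithful.describe()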
Create a histogram of the time between eruptions. What do you see?
plt.hist(faithful.waiting)
plt.xlabel('Waiting time to next eruption (in mins)')
plt.ylabel('Frequency')
plt.title('Old Faithful Geyser time between eruption')
plt.show()
This histogram indicates Old Faithful isn’t as “faithful” as you might think.
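The waiting times look bimodal rather than clustered around a single typical gap. As a rough check (a sketch; the 68-minute cutoff is just eyeballed from the histogram, not a fitted boundary), we can compare the two clusters:

# split waiting times at an eyeballed cutoff of 68 minutes
short_wait = faithful.waiting[faithful.waiting < 68]
long_wait = faithful.waiting[faithful.waiting >= 68]
print "mean short wait:", short_wait.mean()
print "mean long wait:", long_wait.mean()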
Create a scatter plot with waiting on the x-axis and eruptions on the y-axis.
plt.scatter(faithful.waiting, faithful.eruptions)
plt.xlabel('Waiting time to next eruption (in mins)')
plt.ylabel('Eruption time (in mins)')
plt.title('Old Faithful Geyser')
plt.show()
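The scatter plot suggests a strong positive linear relationship. We can quantify it with the sample correlation (a quick sketch; since the intercept model fit below reports an R-squared of 0.811, the correlation should be about $\sqrt{0.811} \approx 0.90$):

# sample correlation between waiting time and eruption duration
print faithful.waiting.corr(faithful.eruptions)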
## Linear regression using statsmodels

Now let's build a linear regression model for the faithful DataFrame, and estimate the next eruption duration if the waiting time since the last eruption has been 75 minutes.
X = faithful.waiting
y = faithful.eruptions
model = sm.OLS(y, X)
# Let's look at the options in model
# model.<tab>
results = model.fit()
# Let's look at the options in results
# results.<tab>
print results.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:              eruptions   R-squared:                       0.973
Model:                            OLS   Adj. R-squared:                  0.973
Method:                 Least Squares   F-statistic:                     9621.
Date:                Fri, 17 Oct 2014   Prob (F-statistic):          9.97e-214
Time:                        10:34:55   Log-Likelihood:                -250.30
No. Observations:                 272   AIC:                             502.6
Df Residuals:                     271   BIC:                             506.2
Df Model:                           1
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
waiting        0.0501      0.001     98.086      0.000         0.049     0.051
==============================================================================
Omnibus:                       37.012   Durbin-Watson:                   2.835
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               10.965
Skew:                          -0.159   Prob(JB):                      0.00416
Kurtosis:                       2.069   Cond. No.                         1.00
==============================================================================
results.params.values
array([ 0.05012919])
Notice that there is no intercept ($\beta_0$) fit in this linear model. To add one, we can use the function sm.add_constant.
X = sm.add_constant(X)
X.head()
const | waiting | |
---|---|---|
0 | 1 | 79 |
1 | 1 | 54 |
2 | 1 | 74 |
3 | 1 | 62 |
4 | 1 | 85 |
Now let's fit a linear regression model with an intercept.
modelW0 = sm.OLS(y, X)
resultsW0 = modelW0.fit()
print resultsW0.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:              eruptions   R-squared:                       0.811
Model:                            OLS   Adj. R-squared:                  0.811
Method:                 Least Squares   F-statistic:                     1162.
Date:                Fri, 17 Oct 2014   Prob (F-statistic):          8.13e-100
Time:                        10:38:10   Log-Likelihood:                -194.51
No. Observations:                 272   AIC:                             393.0
Df Residuals:                     270   BIC:                             400.2
Df Model:                           1
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -1.8740      0.160    -11.702      0.000        -2.189    -1.559
waiting        0.0756      0.002     34.089      0.000         0.071     0.080
==============================================================================
Omnibus:                        4.133   Durbin-Watson:                   2.561
Prob(Omnibus):                  0.127   Jarque-Bera (JB):                3.173
Skew:                          -0.138   Prob(JB):                        0.205
Kurtosis:                       2.548   Cond. No.                         384.
==============================================================================
If you want to predict the time to the next eruption using a waiting time of 75, you can directly estimate this using the equation

$$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot 75 $$

or you can use resultsW0.predict.
newX = np.array([1,75])
resultsW0.params[0]*newX[0] + resultsW0.params[1] * newX[1]
3.7980801099789772
resultsW0.predict(newX)
3.7980801099789772
Based on this linear regression, if the waiting time since the last eruption has been 75 minutes, we expect the next one to last approximately 3.80 minutes.
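As an aside, we can attach a rough 95% confidence interval to this mean prediction using the estimated covariance of the coefficients. This is a sketch built only from standard attributes of the fitted OLS results (cov_params and df_resid):

# sketch: 95% confidence interval for the mean eruption time at waiting = 75
x0 = np.array([1, 75])
yhat = resultsW0.predict(x0)               # point estimate
V = np.asarray(resultsW0.cov_params())     # covariance matrix of the coefficient estimates
se = np.sqrt(x0.dot(V).dot(x0))            # standard error of x0' beta-hat
tcrit = stats.t.ppf(0.975, resultsW0.df_resid)
print yhat - tcrit*se, yhat + tcrit*se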
Instead of using resultsW0.predict(X), we can use resultsW0.fittedvalues, which are the fitted values $\hat{y}$.
plt.scatter(faithful.waiting, faithful.eruptions)
plt.xlabel('Waiting time to next eruption (in mins)')
plt.ylabel('Eruption time (in mins)')
plt.title('Old Faithful Geyser')
plt.plot(faithful.waiting, resultsW0.fittedvalues, color='blue', linewidth=3)
plt.show()
Recall, we can directly calculate the residuals as

$$ r_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) $$

and the residual sum of squares as

$$ S = \sum_{i=1}^n r_i^2 = \sum_{i=1}^n (y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))^2 $$

where $n$ is the number of observations. Alternatively, we can simply ask for the residuals using resultsW0.resid.
resids = faithful.eruptions - resultsW0.predict(X)  # residuals computed by hand
resids = resultsW0.resid                            # equivalently, the resid attribute
plt.plot(faithful.waiting, resids, 'o')
plt.hlines(y = 0, xmin=40, xmax = 100)
plt.xlabel('Waiting time')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
The residual sum of squares:
print np.sum((faithful.eruptions - resultsW0.predict(X)) ** 2)
66.5617757127
Mean squared error:
print np.mean((faithful.eruptions - resultsW0.predict(X)) ** 2)
0.244712410708
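As a sanity check (a sketch using quantities computed above), the R-squared reported in the summary can be recovered from the residual sum of squares:

rss = np.sum(resids ** 2)                                            # residual sum of squares
tss = np.sum((faithful.eruptions - faithful.eruptions.mean()) ** 2)  # total sum of squares
print 1 - rss/tss   # should match resultsW0.rsquared (about 0.811)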
Now let's build a linear regression model for the faithful DataFrame, but instead of using statsmodels (or sklearn), let's compute the least squares estimates of the coefficients directly using matrix operations.
The numpy function np.dot computes the dot product (or inner product) of two vectors (or arrays in Python). The numpy function np.linalg.inv can be used to compute the inverse of a matrix.
X = sm.add_constant(faithful.waiting)
y = faithful.eruptions
First, compute $X^{\top}X$
np.dot(X.T, X)
array([[    272,   19284],
       [  19284, 1417266]])
Next, compute the inverse of $X^{\top}X$ or $(X^{\top}X)^{-1}$.
np.linalg.inv(np.dot(X.T, X))
array([[  1.04029479e-01,  -1.41547492e-03],
       [ -1.41547492e-03,   1.99652136e-05]])
Finally, compute $\hat{\beta} = (X^{\top}X)^{-1} X^{\top}Y $
beta = np.linalg.inv(np.dot(X.T, X)).dot(X.T).dot(y)
print "Directly estimating beta:", beta
print "Estimating beta using statmodels: ", resultsW0.params.values
Directly estimating beta: [-1.87401599 0.07562795] Estimating beta using statmodels: [-1.87401599 0.07562795]
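A side note: forming $(X^{\top}X)^{-1}$ explicitly can be numerically unstable when $X^{\top}X$ is ill-conditioned. The numpy routine np.linalg.lstsq solves the same least squares problem more robustly (a sketch):

# solve the least squares problem without explicitly inverting X'X
beta_lstsq = np.linalg.lstsq(np.asarray(X), np.asarray(y))[0]
print "Estimating beta using np.linalg.lstsq:", beta_lstsq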
Let's return to the baseball data from Homework 1. In Problem 1(e), we asked all the students registered in 209 to complete the following problem:
Fit a linear regression to the data from each year and obtain the residuals. Plot the residuals against time to detect a competitive advantage in years 2001-2003 for the Oakland baseball team
def getZIP(zipFileName):
    r = requests.get(zipFileName).content  # download the raw bytes of the zip archive
    s = StringIO.StringIO(r)               # wrap the bytes in an in-memory file object
    zf = zipfile.ZipFile(s, 'r')           # read in a list of zipped files
    return zf
url = 'http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip'
zf = getZIP(url)
tablenames = zf.namelist()
salaries = pd.read_csv(zf.open(tablenames[tablenames.index('Salaries.csv')]))
teams = pd.read_csv(zf.open(tablenames[tablenames.index('Teams.csv')]))
teams = teams[['yearID', 'teamID', 'W']]
totSalaries = salaries.groupby(['yearID','teamID'], as_index=False).sum()
joined = pd.merge(totSalaries, teams, how="inner", on=['yearID', 'teamID'])
joined.head()
yearID | teamID | salary | W | |
---|---|---|---|---|
0 | 1985 | ATL | 14807000 | 66 |
1 | 1985 | BAL | 11560712 | 83 |
2 | 1985 | BOS | 10897560 | 81 |
3 | 1985 | CAL | 14427894 | 90 |
4 | 1985 | CHA | 9846178 | 85 |
For each year, we fit a linear regression of wins on total team salary and collect the residuals:
teamName = 'OAK'
years = np.arange(1999, 2005)
residData = pd.DataFrame()
for yr in years:
    df = joined[joined['yearID'] == yr]
    X = df['salary'].values / 1e6
    X = sm.add_constant(X)
    y = df['W'].values
    # least squares estimates
    model = sm.OLS(y, X)
    results = model.fit()             # fit the linear regression model
    beta = results.params             # least squares coefficients
    yhat = beta[0] + beta[1]*X[:, 1]  # regression line
    residData[yr] = results.resid     # residuals = y - yhat
residData.index = df['teamID']
residData = residData.T
residData.index = residData.index.format()
residData.plot(title='Residuals from least squares estimates across years', figsize=(15, 8),
               color=map(lambda x: 'blue' if x == 'OAK' else 'gray', df.teamID))
plt.xlabel('Year')
plt.ylabel('Residuals')
plt.show()
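To read the Oakland result off the plot numerically, we can print the OAK column of the residData frame built above (a quick sketch):

# Oakland's wins above the salary-based prediction, per year
print residData['OAK']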
We previously looked at the mtcars data set in Lab 3. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). This data set is also found in the base installation of the R programming language.
Column name | Description |
---|---|
mpg | Miles/(US) gallon |
cyl | Number of cylinders |
disp | Displacement (cu.in.) |
hp | Gross horsepower |
drat | Rear axle ratio |
wt | Weight (1000 lbs) |
qsec | 1/4 mile time |
vs | Engine (0 = V-shaped, 1 = straight) |
am | Transmission (0 = automatic, 1 = manual) |
gear | Number of forward gears |
carb | Number of carburetors |
First, read in the mtcars data set using the sm.datasets.get_rdataset function to import the dataset from R.
# your turn
mtcars = sm.datasets.get_rdataset("mtcars")
mtcars = mtcars.data
mtcars['mpg'].hist()
plt.title('Distribution of MPG')
plt.xlabel('Miles Per Gallon')
<matplotlib.text.Text at 0x10b7152d0>
Relationship between cyl and mpg:
plt.plot(mtcars.cyl, mtcars.mpg, 'o')
plt.xlim(3, 9)
plt.xlabel('Cylinders')
plt.ylabel('MPG')
plt.title('Relationship between cylinders and MPG')
<matplotlib.text.Text at 0x10b751e90>
Relationship between hp and mpg:
plt.plot(mtcars.hp, mtcars.mpg, 'o')
plt.xlabel('Horsepower')
plt.ylabel('MPG')
plt.title('Relationship between horsepower and MPG')
<matplotlib.text.Text at 0x10b78c590>
## Multiple linear regression using statsmodels

Now let's build a linear regression model for the mtcars DataFrame, and estimate the predicted mpg for a new car that has 6 cylinders and 180 horsepower.
# your turn
y = mtcars.mpg
X = mtcars[['cyl', 'hp']]
X = sm.add_constant(X)
model = sm.OLS(y,X)
results = model.fit()
print results.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.723
Method:                 Least Squares   F-statistic:                     41.42
Date:                Fri, 17 Oct 2014   Prob (F-statistic):           3.16e-09
Time:                        11:08:10   Log-Likelihood:                -80.781
No. Observations:                  32   AIC:                             167.6
Df Residuals:                      29   BIC:                             172.0
Df Model:                           2
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         36.9083      2.191     16.847      0.000        32.428    41.389
cyl           -2.2647      0.576     -3.933      0.000        -3.443    -1.087
hp            -0.0191      0.015     -1.275      0.213        -0.050     0.012
==============================================================================
Omnibus:                        1.178   Durbin-Watson:                   1.667
Prob(Omnibus):                  0.555   Jarque-Bera (JB):                1.092
Skew:                           0.411   Prob(JB):                        0.579
Kurtosis:                       2.623   Cond. No.                         645.
==============================================================================
newX = np.array([1, 6, 180])
results.predict(newX)
19.878263535641754
What if a new car had 4 cylinders and 120 horsepower?
# your turn
newX = np.array([1, 4, 120])
results.predict(newX)
25.554952517097814
Now compute the least squares estimates for $\beta_0$, $\beta_1$ and $\beta_2$ using matrix multiplication and the formula:

$$ \hat{\beta} = (X^{\top}X)^{-1} X^{\top}Y $$

# your turn
beta = np.linalg.inv(np.dot(X.T, X)).dot(X.T).dot(y)
beta
array([ 3.69083305e+01, -2.26469360e+00, -1.91216965e-02])
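As before, these directly computed coefficients should match what statsmodels reported (a quick check):

print results.params.values   # should match the beta computed above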
You will not always have a continuous $y$ variable. Sometimes it may be binary (e.g. 0 or 1); sometimes it may be count data. What do you do? Use other types of regression besides simple linear regression, for example logistic regression for binary outcomes or Poisson regression for counts.
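For the binary case, statsmodels provides sm.Logit. Below is a minimal sketch reusing the mtcars am column (0 = automatic, 1 = manual) as a binary outcome; the choice of mpg as the lone predictor is just for illustration:

# minimal sketch: logistic regression of transmission type on mpg
y = mtcars.am                         # binary outcome: 0 = automatic, 1 = manual
X = sm.add_constant(mtcars[['mpg']])
logit_results = sm.Logit(y, X).fit()  # maximum likelihood fit
print logit_results.summary()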