# special IPython command to prepare the notebook for matplotlib
%matplotlib inline
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
# To load in the baseball data
import requests
import StringIO
import zipfile
# special matplotlib argument for improved plots
from matplotlib import rcParams
#colorbrewer2 Dark2 qualitative color table
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
(0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
(0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
(0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
(0.4, 0.6509803921568628, 0.11764705882352941),
(0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
(0.6509803921568628, 0.4627450980392157, 0.11372549019607843)]
rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'white'
rcParams['patch.facecolor'] = dark2_colors[0]
rcParams['font.family'] = 'StixGeneral'
Previously discussed: numpy, scipy, statsmodels, sklearn

## statsmodels

statsmodels is a Python module specifically for estimating statistical models (with less machine learning than sklearn). It can estimate many types of statistical models, but today we will focus on linear regression.
Last week we learned that linear regression is a method to model the relationship between a set of independent variables $X$ (also known as explanatory variables, features, or predictors) and a dependent variable $Y$. This method assumes that each predictor in $X$ is linearly related to the dependent variable $Y$.
$$ Y = \beta_0 + \beta_1 X + \epsilon $$

On Tuesday in lecture, we learned how to write this model using matrix multiplication:

$$ Y = X \beta + \epsilon $$

where $Y$ has dimensions $n \times 1$, $X$ has dimensions $n \times p$, $\beta$ has dimensions $p \times 1$, and $\epsilon$ has dimensions $n \times 1$. On Tuesday, we also derived the least squares estimates of the coefficients of a linear model. These estimates minimize the sum of squared residuals:

$$ S = \sum_{i=1}^n r_i^2 = \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2 $$

where $n$ is the number of observations.
The least squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ minimize the sum of the squared residuals $r_i = y_i - (\beta_0 + \beta_1 x_i)$ in the model (i.e., they make the difference between the observed $y_i$ and the linear model $\beta_0 + \beta_1 x_i$ as small as possible).
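Setting the partial derivatives of $S$ with respect to $\beta_0$ and $\beta_1$ to zero yields the familiar closed-form estimates:

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} $$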
The Old Faithful Geyser data set is a well-known data set that describes the relationship between the waiting time between eruptions and the duration of each eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA [webcam]. This data set is found in the base installation of the R programming language.
faithful is a data set with 272 observations on 2 variables.
Column name | Description |
---|---|
eruptions | Eruption time (in mins) |
waiting | Waiting time to next eruption (in mins) |
There is a function in statsmodels (or sm for short) called sm.datasets.get_rdataset which will download and return a data set found in R. Let's import the faithful dataset.
faithful = sm.datasets.get_rdataset("faithful")
# Let's look at the help file
# sm.datasets.get_rdataset?
# faithful?
faithful.title
'Old Faithful Geyser Data'
faithful = faithful.data
faithful.head()
eruptions | waiting | |
---|---|---|
0 | 3.600 | 79 |
1 | 1.800 | 54 |
2 | 3.333 | 74 |
3 | 2.283 | 62 |
4 | 4.533 | 85 |
faithful.shape
(272, 2)
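Before plotting, here is a quick numeric summary of both variables (a small sketch using pandas' describe method):

# summary statistics (count, mean, std, quartiles) for both columns
faithful.describe()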
Create a histogram of the time between eruptions. What do you see?
plt.hist(faithful.waiting)
plt.xlabel('Waiting time to next eruption (in mins)')
plt.ylabel('Frequency')
plt.title('Old Faithful Geyser time between eruption')
plt.show()
This histogram indicates Old Faithful isn’t as “faithful” as you might think.
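The waiting times look bimodal rather than clustered around a single typical gap. As a rough check (a sketch; the 68-minute cutoff is just eyeballed from the histogram, not a fitted boundary), we can compare the two clusters:

# split waiting times at an eyeballed cutoff of 68 minutes
short_wait = faithful.waiting[faithful.waiting < 68]
long_wait = faithful.waiting[faithful.waiting >= 68]
print "mean short wait:", short_wait.mean()
print "mean long wait:", long_wait.mean()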
Create a scatter plot with waiting on the x-axis and eruptions on the y-axis.
plt.scatter(faithful.waiting, faithful.eruptions)
plt.xlabel('Waiting time to next eruption (in mins)')
plt.ylabel('Eruption time (in mins)')
plt.title('Old Faithful Geyser')
plt.show()
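The scatter plot suggests a strong positive linear relationship. We can quantify it with the sample correlation (a quick sketch; since the intercept model fit below reports an R-squared of 0.811, the correlation should be about $\sqrt{0.811} \approx 0.90$):

# sample correlation between waiting time and eruption duration
print faithful.waiting.corr(faithful.eruptions)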
## Linear regression using statsmodels

Now let's build a linear regression model for the faithful DataFrame, and estimate the next eruption duration if the waiting time since the last eruption has been 75 minutes.
X = faithful.waiting
y = faithful.eruptions
model = sm.OLS(y, X)
# Let's look at the options in model
# model.<tab>
results = model.fit()
# Let's look at the options in results
# results.<tab>
print results.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:              eruptions   R-squared:                       0.973
Model:                            OLS   Adj. R-squared:                  0.973
Method:                 Least Squares   F-statistic:                     9621.
Date:                Fri, 17 Oct 2014   Prob (F-statistic):          9.97e-214
Time:                        10:34:55   Log-Likelihood:                -250.30
No. Observations:                 272   AIC:                             502.6
Df Residuals:                     271   BIC:                             506.2
Df Model:                           1
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
waiting        0.0501      0.001     98.086      0.000         0.049     0.051
==============================================================================
Omnibus:                       37.012   Durbin-Watson:                   2.835
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               10.965
Skew:                          -0.159   Prob(JB):                      0.00416
Kurtosis:                       2.069   Cond. No.                         1.00
==============================================================================
results.params.values
array([ 0.05012919])
Notice that there is no intercept ($\beta_0$) fit in this linear model. To add one, we can use the function sm.add_constant.
X = sm.add_constant(X)
X.head()
const | waiting | |
---|---|---|
0 | 1 | 79 |
1 | 1 | 54 |
2 | 1 | 74 |
3 | 1 | 62 |
4 | 1 | 85 |
Now let's fit a linear regression model with an intercept.
modelW0 = sm.OLS(y, X)
resultsW0 = modelW0.fit()
print resultsW0.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:              eruptions   R-squared:                       0.811
Model:                            OLS   Adj. R-squared:                  0.811
Method:                 Least Squares   F-statistic:                     1162.
Date:                Fri, 17 Oct 2014   Prob (F-statistic):          8.13e-100
Time:                        10:38:10   Log-Likelihood:                -194.51
No. Observations:                 272   AIC:                             393.0
Df Residuals:                     270   BIC:                             400.2
Df Model:                           1
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -1.8740      0.160    -11.702      0.000        -2.189    -1.559
waiting        0.0756      0.002     34.089      0.000         0.071     0.080
==============================================================================
Omnibus:                        4.133   Durbin-Watson:                   2.561
Prob(Omnibus):                  0.127   Jarque-Bera (JB):                3.173
Skew:                          -0.138   Prob(JB):                        0.205
Kurtosis:                       2.548   Cond. No.                         384.
==============================================================================
If you want to predict the time to the next eruption using a waiting time of 75, you can directly estimate this using the equation

$$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot 75 $$

or you can use resultsW0.predict.
newX = np.array([1,75])
resultsW0.params[0]*newX[0] + resultsW0.params[1] * newX[1]
3.7980801099789772
resultsW0.predict(newX)
3.7980801099789772
Based on this linear regression, if the waiting time since the last eruption has been 75 minutes, we expect the next one to last approximately 3.80 minutes.
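As an aside, we can attach a rough 95% confidence interval to this mean prediction using the estimated covariance of the coefficients. This is a sketch built only from standard attributes of the fitted OLS results (cov_params and df_resid):

# sketch: 95% confidence interval for the mean eruption time at waiting = 75
x0 = np.array([1, 75])
yhat = resultsW0.predict(x0)               # point estimate
V = np.asarray(resultsW0.cov_params())     # covariance matrix of the coefficient estimates
se = np.sqrt(x0.dot(V).dot(x0))            # standard error of x0' beta-hat
tcrit = stats.t.ppf(0.975, resultsW0.df_resid)
print yhat - tcrit*se, yhat + tcrit*se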
Instead of using resultsW0.predict(X), we can use resultsW0.fittedvalues, which are the fitted values $\hat{y}$.
plt.scatter(faithful.waiting, faithful.eruptions)
plt.xlabel('Waiting time to next eruption (in mins)')
plt.ylabel('Eruption time (in mins)')
plt.title('Old Faithful Geyser')
plt.plot(faithful.waiting, resultsW0.fittedvalues, color='blue', linewidth=3)
plt.show()
Recall, we can directly calculate the residuals as

$$ r_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) $$

and the residual sum of squares as

$$ S = \sum_{i=1}^n r_i^2 = \sum_{i=1}^n (y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))^2 $$

where $n$ is the number of observations. Alternatively, we can simply ask for the residuals using resultsW0.resid.
resids = faithful.eruptions - resultsW0.predict(X)  # residuals computed by hand
resids = resultsW0.resid                            # equivalently, the resid attribute
plt.plot(faithful.waiting, resids, 'o')
plt.hlines(y = 0, xmin=40, xmax = 100)
plt.xlabel('Waiting time')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
The residual sum of squares:
print np.sum((faithful.eruptions - resultsW0.predict(X)) ** 2)
66.5617757127
Mean squared error:
print np.mean((faithful.eruptions - resultsW0.predict(X)) ** 2)
0.244712410708
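As a sanity check (a sketch using quantities computed above), the R-squared reported in the summary can be recovered from the residual sum of squares:

rss = np.sum(resids ** 2)                                            # residual sum of squares
tss = np.sum((faithful.eruptions - faithful.eruptions.mean()) ** 2)  # total sum of squares
print 1 - rss/tss   # should match resultsW0.rsquared (about 0.811)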
Now let's build a linear regression model for the faithful DataFrame, but instead of using statsmodels (or sklearn), let's compute the least squares estimates of the coefficients directly using matrix operations.
The numpy function np.dot computes the dot product (or inner product) of two vectors (or arrays in Python). The numpy function np.linalg.inv can be used to compute the inverse of a matrix.
X = sm.add_constant(faithful.waiting)
y = faithful.eruptions
First, compute $X^{\top}X$
np.dot(X.T, X)
array([[    272,   19284],
       [  19284, 1417266]])
Next, compute the inverse of $X^{\top}X$ or $(X^{\top}X)^{-1}$.
np.linalg.inv(np.dot(X.T, X))
array([[  1.04029479e-01,  -1.41547492e-03],
       [ -1.41547492e-03,   1.99652136e-05]])
Finally, compute $\hat{\beta} = (X^{\top}X)^{-1} X^{\top}Y $
beta = np.linalg.inv(np.dot(X.T, X)).dot(X.T).dot(y)
print "Directly estimating beta:", beta
print "Estimating beta using statmodels: ", resultsW0.params.values
Directly estimating beta: [-1.87401599 0.07562795] Estimating beta using statmodels: [-1.87401599 0.07562795]
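A side note: forming $(X^{\top}X)^{-1}$ explicitly can be numerically unstable when $X^{\top}X$ is ill-conditioned. The numpy routine np.linalg.lstsq solves the same least squares problem more robustly (a sketch):

# solve the least squares problem without explicitly inverting X'X
beta_lstsq = np.linalg.lstsq(np.asarray(X), np.asarray(y))[0]
print "Estimating beta using np.linalg.lstsq:", beta_lstsq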
Let's return to the baseball data from Homework 1. In Problem 1(e), we asked all the students registered in 209 to complete the following problem:
Fit a linear regression to the data from each year and obtain the residuals. Plot the residuals against time to detect a competitive advantage in years 2001-2003 for the Oakland baseball team
def getZIP(zipFileName):
    r = requests.get(zipFileName).content  # download the raw bytes of the zip archive
    s = StringIO.StringIO(r)               # wrap the bytes in an in-memory file object
    zf = zipfile.ZipFile(s, 'r')           # read in a list of zipped files
    return zf
url = 'http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip'
zf = getZIP(url)
tablenames = zf.namelist()
salaries = pd.read_csv(zf.open(tablenames[tablenames.index('Salaries.csv')]))
teams = pd.read_csv(zf.open(tablenames[tablenames.index('Teams.csv')]))
teams = teams[['yearID', 'teamID', 'W']]
totSalaries = salaries.groupby(['yearID','teamID'], as_index=False).sum()
joined = pd.merge(totSalaries, teams, how="inner", on=['yearID', 'teamID'])
joined.head()
yearID | teamID | salary | W | |
---|---|---|---|---|
0 | 1985 | ATL | 14807000 | 66 |
1 | 1985 | BAL | 11560712 | 83 |
2 | 1985 | BOS | 10897560 | 81 |
3 | 1985 | CAL | 14427894 | 90 |
4 | 1985 | CHA | 9846178 | 85 |
For each year, we fit a linear regression of wins on total team salary and collect the residuals:
teamName = 'OAK'
years = np.arange(1999, 2005)
residData = pd.DataFrame()
for yr in years:
    df = joined[joined['yearID'] == yr]
    X = df['salary'].values / 1e6
    X = sm.add_constant(X)
    y = df['W'].values
    # least squares estimates
    model = sm.OLS(y, X)
    results = model.fit()             # fit the linear regression model
    beta = results.params             # least squares coefficients
    yhat = beta[0] + beta[1]*X[:, 1]  # regression line
    residData[yr] = results.resid     # residuals = y - yhat
residData.index = df['teamID']
residData = residData.T
residData.index = residData.index.format()
residData.plot(title='Residuals from least squares estimates across years', figsize=(15, 8),
               color=map(lambda x: 'blue' if x == 'OAK' else 'gray', df.teamID))
plt.xlabel('Year')
plt.ylabel('Residuals')
plt.show()
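To read the Oakland result off the plot numerically, we can print the OAK column of the residData frame built above (a quick sketch):

# Oakland's wins above the salary-based prediction, per year
print residData['OAK']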
We previously looked at the mtcars data set in Lab 3. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). This data set is also found in the base installation of the R programming language.
Column name | Description |
---|---|
mpg | Miles/(US) gallon |
cyl | Number of cylinders |
disp | Displacement (cu.in.) |
hp | Gross horsepower |
drat | Rear axle ratio |
wt | Weight (1000 lbs) |
qsec | 1/4 mile time |
vs | Engine (0 = V-shaped, 1 = straight) |
am | Transmission (0 = automatic, 1 = manual) |
gear | Number of forward gears |
carb | Number of carburetors |
First, read in the mtcars data set using the sm.datasets.get_rdataset function to import the dataset from R.
# your turn
mtcars = sm.datasets.get_rdataset("mtcars")
mtcars = mtcars.data
mtcars['mpg'].hist()
plt.title('Distribution of MPG')
plt.xlabel('Miles Per Gallon')
<matplotlib.text.Text at 0x10b7152d0>
Relationship between cyl and mpg:
plt.plot(mtcars.cyl, mtcars.mpg, 'o')
plt.xlim(3, 9)
plt.xlabel('Cylinders')
plt.ylabel('MPG')
plt.title('Relationship between cylinders and MPG')
<matplotlib.text.Text at 0x10b751e90>
Relationship between hp and mpg:
plt.plot(mtcars.hp, mtcars.mpg, 'o')
plt.xlabel('Horsepower')
plt.ylabel('MPG')
plt.title('Relationship between horsepower and MPG')
<matplotlib.text.Text at 0x10b78c590>
## Multiple linear regression using statsmodels

Now let's build a linear regression model for the mtcars DataFrame, and estimate the predicted mpg for a new car that has 6 cylinders and 180 horsepower.
# your turn
y = mtcars.mpg
X = mtcars[['cyl', 'hp']]
X = sm.add_constant(X)
model = sm.OLS(y,X)
results = model.fit()
print results.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.723
Method:                 Least Squares   F-statistic:                     41.42
Date:                Fri, 17 Oct 2014   Prob (F-statistic):           3.16e-09
Time:                        11:08:10   Log-Likelihood:                -80.781
No. Observations:                  32   AIC:                             167.6
Df Residuals:                      29   BIC:                             172.0
Df Model:                           2
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         36.9083      2.191     16.847      0.000        32.428    41.389
cyl           -2.2647      0.576     -3.933      0.000        -3.443    -1.087
hp            -0.0191      0.015     -1.275      0.213        -0.050     0.012
==============================================================================
Omnibus:                        1.178   Durbin-Watson:                   1.667
Prob(Omnibus):                  0.555   Jarque-Bera (JB):                1.092
Skew:                           0.411   Prob(JB):                        0.579
Kurtosis:                       2.623   Cond. No.                         645.
==============================================================================
newX = np.array([1, 6, 180])
results.predict(newX)
19.878263535641754
What if a new car had 4 cylinders and 120 horsepower?
# your turn
newX = np.array([1, 4, 120])
results.predict(newX)
25.554952517097814
Now compute the least squares estimates for $\beta_0$, $\beta_1$ and $\beta_2$ using matrix multiplication and the formula:

$$ \hat{\beta} = (X^{\top}X)^{-1} X^{\top}Y $$

# your turn
beta = np.linalg.inv(np.dot(X.T, X)).dot(X.T).dot(y)
beta
array([ 3.69083305e+01, -2.26469360e+00, -1.91216965e-02])
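As before, these directly computed coefficients should match what statsmodels reported (a quick check):

print results.params.values   # should match the beta computed above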
You will not always have a continuous $y$ variable. Sometimes it may be binary (e.g. 0 or 1); sometimes it may be count data. What do you do? Use other types of regression besides simple linear regression, for example logistic regression for binary outcomes or Poisson regression for counts.
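For the binary case, statsmodels provides sm.Logit. Below is a minimal sketch reusing the mtcars am column (0 = automatic, 1 = manual) as a binary outcome; the choice of mpg as the lone predictor is just for illustration:

# minimal sketch: logistic regression of transmission type on mpg
y = mtcars.am                         # binary outcome: 0 = automatic, 1 = manual
X = sm.add_constant(mtcars[['mpg']])
logit_results = sm.Logit(y, X).fit()  # maximum likelihood fit
print logit_results.summary()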