%load_ext load_style
%load_style talk.css
from IPython.display import Image
from talktools import website
statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An (incomplete) list of what you can do with statsmodels:
I will also mention briefly the seaborn package, for high-level, beautiful statistical visualisations
#website('http://statsmodels.sourceforge.net/', width=1000)
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from IPython.display import Image, HTML
%matplotlib inline
data = pd.read_csv('../data/soi_nino.csv', index_col=0, parse_dates=True)
data.tail()
SOI | nino | |
---|---|---|
2013-08-01 | -0.222680 | -0.37 |
2013-09-01 | 0.317375 | -0.27 |
2013-10-01 | -0.309098 | -0.14 |
2013-11-01 | 0.836244 | -0.19 |
2013-12-01 | -0.117755 | -0.43 |
data.corr()
SOI | nino | |
---|---|---|
SOI | 1.000000 | -0.710077 |
nino | -0.710077 | 1.000000 |
import statsmodels.api as sm
Get the data into numpy arrays
y = data['SOI'].values
X = data['nino'].values
add an axis to X and add a constant with the sm.add_constant
function
X = X[...,np.newaxis]
X = sm.add_constant(X)
X
array([[ 1. , -1.84], [ 1. , -1.63], [ 1. , -1.3 ], ..., [ 1. , -0.14], [ 1. , -0.19], [ 1. , -0.43]])
instantiate the model, arguments are y
(dependent variable), X
(independant variable)
lm1_mod = sm.OLS(y, X)
fit the model
lm1_fit = lm1_mod.fit()
lm1_fit.summary()
Dep. Variable: | y | R-squared: | 0.504 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.504 |
Method: | Least Squares | F-statistic: | 779.0 |
Date: | Wed, 25 Feb 2015 | Prob (F-statistic): | 8.08e-119 |
Time: | 09:15:55 | Log-Likelihood: | -905.59 |
No. Observations: | 768 | AIC: | 1815. |
Df Residuals: | 766 | BIC: | 1824. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
const | -0.2307 | 0.029 | -7.977 | 0.000 | -0.287 -0.174 |
x1 | -0.9506 | 0.034 | -27.911 | 0.000 | -1.017 -0.884 |
Omnibus: | 2.974 | Durbin-Watson: | 1.505 |
---|---|---|---|
Prob(Omnibus): | 0.226 | Jarque-Bera (JB): | 2.809 |
Skew: | -0.138 | Prob(JB): | 0.246 |
Kurtosis: | 3.106 | Cond. No. | 1.28 |
Out of sample prediction for ONE value of NINO3.4
You need to prepend a constant (1. is the default) in order for the matrices to be aligned
X_prime = np.array([1., 3.]).reshape((1,2))
X_prime.shape
(1, 2)
X.shape
(768, 2)
y_hat = lm1_fit.predict(X_prime)
y_hat
array([-3.08240935])
The statsmodels library has also a formula API that allows you to enter your model using a syntax which should be familiar to R users. The variable names are simply the names of the corresponding columns in the pandas DataFrame, which is given as the second argument.
The formula
API is based on the patsy project.
from statsmodels.formula.api import ols as OLS
instantiate the model (OLS), Specifying the columns of the pandas DataFrame
lm2_mod = OLS('SOI ~ nino', data)
fit the model
lm2_fit = lm2_mod.fit()
lm2_fit.summary()
Dep. Variable: | SOI | R-squared: | 0.504 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.504 |
Method: | Least Squares | F-statistic: | 779.0 |
Date: | Wed, 25 Feb 2015 | Prob (F-statistic): | 8.08e-119 |
Time: | 09:15:59 | Log-Likelihood: | -905.59 |
No. Observations: | 768 | AIC: | 1815. |
Df Residuals: | 766 | BIC: | 1824. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | -0.2307 | 0.029 | -7.977 | 0.000 | -0.287 -0.174 |
nino | -0.9506 | 0.034 | -27.911 | 0.000 | -1.017 -0.884 |
Omnibus: | 2.974 | Durbin-Watson: | 1.505 |
---|---|---|---|
Prob(Omnibus): | 0.226 | Jarque-Bera (JB): | 2.809 |
Skew: | -0.138 | Prob(JB): | 0.246 |
Kurtosis: | 3.106 | Cond. No. | 1.28 |
Seaborn Seaborn is a library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels.
import seaborn as sb
sb.regplot("nino", "SOI", data)
<matplotlib.axes._subplots.AxesSubplot at 0x10a275f10>
f, ax = plt.subplots()
sb.kdeplot(data['nino'], data['SOI'], shade=True, ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x10a355b50>