In [1]:

%load_ext load_style
%load_style talk.css
from IPython.display import Image
from talktools import website

Statistical modelling with statsmodels¶

statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An (incomplete) list of what you can do with statsmodels:

Linear regression models
Generalized linear models
A wide range of statistical tests
...

I will also mention briefly the seaborn package, for high-level, beautiful statistical visualisations

Statsmodels¶

In [2]:

#website('http://statsmodels.sourceforge.net/', width=1000)

In [3]:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from IPython.display import Image, HTML
%matplotlib inline

In [4]:

data = pd.read_csv('../data/soi_nino.csv', index_col=0, parse_dates=True)

In [5]:

data.tail()

Out[5]:

	SOI	nino
2013-08-01	-0.222680	-0.37
2013-09-01	0.317375	-0.27
2013-10-01	-0.309098	-0.14
2013-11-01	0.836244	-0.19
2013-12-01	-0.117755	-0.43

In [6]:

data.corr()

Out[6]:

	SOI	nino
SOI	1.000000	-0.710077
nino	-0.710077	1.000000

Using statsmodels with numpy arrays and the classic API¶

In [7]:

import statsmodels.api as sm

Get the data into numpy arrays

In [8]:

y = data['SOI'].values
X = data['nino'].values

add an axis to X and add a constant with the sm.add_constant function

In [9]:

X = X[...,np.newaxis]

In [10]:

X = sm.add_constant(X)

In [11]:

Out[11]:

array([[ 1.  , -1.84],
       [ 1.  , -1.63],
       [ 1.  , -1.3 ],
       ..., 
       [ 1.  , -0.14],
       [ 1.  , -0.19],
       [ 1.  , -0.43]])

instantiate the model, arguments are y (dependent variable), X (independant variable)

In [12]:

lm1_mod = sm.OLS(y, X)

fit the model

In [13]:

lm1_fit = lm1_mod.fit()

In [14]:

lm1_fit.summary()

Out[14]:

OLS Regression Results
Dep. Variable:	y	R-squared:	0.504
Model:	OLS	Adj. R-squared:	0.504
Method:	Least Squares	F-statistic:	779.0
Date:	Wed, 25 Feb 2015	Prob (F-statistic):	8.08e-119
Time:	09:15:55	Log-Likelihood:	-905.59
No. Observations:	768	AIC:	1815.
Df Residuals:	766	BIC:	1824.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
const	-0.2307	0.029	-7.977	0.000	-0.287 -0.174
x1	-0.9506	0.034	-27.911	0.000	-1.017 -0.884

Omnibus:	2.974	Durbin-Watson:	1.505
Prob(Omnibus):	0.226	Jarque-Bera (JB):	2.809
Skew:	-0.138	Prob(JB):	0.246
Kurtosis:	3.106	Cond. No.	1.28

Out of sample prediction for ONE value of NINO3.4

You need to prepend a constant (1. is the default) in order for the matrices to be aligned

In [15]:

X_prime = np.array([1., 3.]).reshape((1,2))

In [16]:

X_prime.shape

Out[16]:

(1, 2)

In [17]:

X.shape

Out[17]:

(768, 2)

In [18]:

y_hat = lm1_fit.predict(X_prime)

In [19]:

y_hat

Out[19]:

array([-3.08240935])

Using statsmodels with Pandas DataFrames and the formula API¶

The statsmodels library has also a formula API that allows you to enter your model using a syntax which should be familiar to R users. The variable names are simply the names of the corresponding columns in the pandas DataFrame, which is given as the second argument.

The formula API is based on the patsy project.

In [20]:

from statsmodels.formula.api import ols as OLS

instantiate the model (OLS), Specifying the columns of the pandas DataFrame

In [21]:

lm2_mod = OLS('SOI ~ nino', data)

fit the model

In [22]:

lm2_fit = lm2_mod.fit()

In [23]:

lm2_fit.summary()

Out[23]:

OLS Regression Results
Dep. Variable:	SOI	R-squared:	0.504
Model:	OLS	Adj. R-squared:	0.504
Method:	Least Squares	F-statistic:	779.0
Date:	Wed, 25 Feb 2015	Prob (F-statistic):	8.08e-119
Time:	09:15:59	Log-Likelihood:	-905.59
No. Observations:	768	AIC:	1815.
Df Residuals:	766	BIC:	1824.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
Intercept	-0.2307	0.029	-7.977	0.000	-0.287 -0.174
nino	-0.9506	0.034	-27.911	0.000	-1.017 -0.884

Omnibus:	2.974	Durbin-Watson:	1.505
Prob(Omnibus):	0.226	Jarque-Bera (JB):	2.809
Skew:	-0.138	Prob(JB):	0.246
Kurtosis:	3.106	Cond. No.	1.28

Statistical models and statistical visualisations with Seaborn¶

Seaborn Seaborn is a library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels.

In [24]:

import seaborn as sb

In [25]:

sb.regplot("nino", "SOI", data)

Out[25]:

<matplotlib.axes._subplots.AxesSubplot at 0x10a275f10>

In [26]:

f, ax = plt.subplots()
sb.kdeplot(data['nino'], data['SOI'], shade=True, ax=ax)

Out[26]:

<matplotlib.axes._subplots.AxesSubplot at 0x10a355b50>

In [ ]: