import statsmodels.api as sm
import numpy as np
from statsmodels.iolib.table import (SimpleTable, default_txt_fmt)
The Longley dataset is a time series dataset:
data = sm.datasets.longley.load()
data.exog = sm.add_constant(data.exog, prepend=False)
print(data.exog[:5])
[[     83.   234289.    2356.    1590.  107608.    1947.       1. ]
 [     88.5  259426.    2325.    1456.  108632.    1948.       1. ]
 [     88.2  258054.    3682.    1616.  109773.    1949.       1. ]
 [     89.5  284599.    3351.    1650.  110929.    1950.       1. ]
 [     96.2  328975.    2099.    3099.  112075.    1951.       1. ]]
Let's assume that the data is heteroskedastic and that we know the nature of the heteroskedasticity. We can then define sigma and use it to give us a GLS model.
First, we will obtain the residuals from an OLS fit:
ols_resid = sm.OLS(data.endog, data.exog).fit().resid
Assume that the error terms follow an AR(1) process with a trend:
resid[i] = beta_0 + rho*resid[i-1] + e[i]
where e ~ N(0, some_sigma**2), and that rho is simply the correlation of the residuals. A consistent estimator for rho is to regress the residuals on the lagged residuals:
resid_fit = sm.OLS(ols_resid[1:], sm.add_constant(ols_resid[:-1], prepend=False)).fit()
print(resid_fit.tvalues[0])
print(resid_fit.pvalues[0])
-1.43902298398
0.173784447887
While we don't have strong evidence that the errors follow an AR(1) process, we continue.
rho = resid_fit.params[0]
As we know, an AR(1) process means that near-neighbors have a stronger relation, so we can represent this structure using a Toeplitz matrix:
from scipy.linalg import toeplitz
toeplitz(range(5))
array([[0, 1, 2, 3, 4],
       [1, 0, 1, 2, 3],
       [2, 1, 0, 1, 2],
       [3, 2, 1, 0, 1],
       [4, 3, 2, 1, 0]])
order = toeplitz(range(len(ols_resid)))
so that our error covariance structure is actually rho**order, which defines an autocorrelation structure:
sigma = rho**order
gls_model = sm.GLS(data.endog, data.exog, sigma=sigma)
gls_results = gls_model.fit()
Of course, the exact rho in this instance is not known, so it might make more sense to use feasible GLS, which currently only has experimental support.
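As a rough illustration of the feasible GLS idea (a hand-rolled sketch for this example, not statsmodels' experimental implementation; the names rho_i, sigma_i, and fgls_results are ours), we could alternate between estimating rho from the current residuals and refitting GLS until rho stabilizes:
# Hand-rolled feasible GLS sketch: iterate between estimating rho from
# the residuals and refitting GLS with the implied covariance structure.
rho_i = rho
for _ in range(10):
    sigma_i = rho_i**order
    fgls_results = sm.GLS(data.endog, data.exog, sigma=sigma_i).fit()
    resid_i = fgls_results.resid
    # re-estimate rho by regressing the residuals on the lagged residuals
    rho_new = sm.OLS(resid_i[1:],
                     sm.add_constant(resid_i[:-1], prepend=False)).fit().params[0]
    if abs(rho_new - rho_i) < 1e-8:  # stop once rho has converged
        break
    rho_i = rho_new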
We can use the GLSAR model with one lag to get a similar result:
glsar_model = sm.GLSAR(data.endog, data.exog, 1)
glsar_results = glsar_model.iterative_fit(1)
print(glsar_results.summary())
                           GLSAR Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.996
Model:                          GLSAR   Adj. R-squared:                  0.992
Method:                 Least Squares   F-statistic:                     295.2
Date:                Sun, 26 Aug 2012   Prob (F-statistic):           6.09e-09
Time:                        20:51:48   Log-Likelihood:                -102.04
No. Observations:                  15   AIC:                             218.1
Df Residuals:                       8   BIC:                             223.0
Df Model:                           6
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            34.5568     84.734      0.408      0.694      -160.840   229.953
x2            -0.0343      0.033     -1.047      0.326        -0.110     0.041
x3            -1.9621      0.481     -4.083      0.004        -3.070    -0.854
x4            -1.0020      0.211     -4.740      0.001        -1.489    -0.515
x5            -0.0978      0.225     -0.435      0.675        -0.616     0.421
x6          1823.1829    445.829      4.089      0.003       795.100  2851.266
const      -3.468e+06   8.72e+05     -3.979      0.004     -5.48e+06 -1.46e+06
==============================================================================
Omnibus:                        1.960   Durbin-Watson:                   2.554
Prob(Omnibus):                  0.375   Jarque-Bera (JB):                1.423
Skew:                           0.713   Prob(JB):                        0.491
Kurtosis:                       2.508   Cond. No.                     4.80e+09
==============================================================================

The condition number is large, 4.8e+09. This might indicate that there are strong multicollinearity or other numerical problems.
Comparing the GLS and GLSAR results, we see that there are some small differences in the parameter estimates and the resulting standard errors of the parameter estimates. This might be due to numerical differences in the algorithms, e.g. the treatment of initial conditions, because of the small number of observations in the Longley dataset.
comparison = np.vstack([gls_results.params, glsar_results.params,
gls_results.bse, glsar_results.bse])
comparison = np.transpose(comparison)
colnames = ['gls_params', 'glsar_params', 'gls_bse', 'glsar_bse']
print(SimpleTable(comparison, colnames, txt_fmt=default_txt_fmt))
=================================================================
   gls_params      glsar_params       gls_bse        glsar_bse
-----------------------------------------------------------------
 -12.7656454401   34.5567846182    69.4308073335   84.7337145245
-0.0380013249817 -0.0343410089663  0.026247682233  0.0328032449964
 -2.18694871107   -1.96214395046   0.382393150849  0.480544864905
 -1.15177649259   -1.00197295929   0.165252691545  0.211383870914
-0.0680535580455 -0.0978045986166  0.176428333976  0.224774369449
 1993.95292851    1823.1828867     342.634627565   445.828747793
 -3797854.90154   -3467960.63254   670688.699307   871584.051696
-----------------------------------------------------------------
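Note that iterative_fit was run with only a single iteration above, so the rho used by GLSAR has not fully converged. A minimal sketch of letting the iteration run longer (glsar_model2 and glsar_results2 are illustrative names; we don't reproduce the output here):
# Let the GLSAR iteration run longer so that rho can converge
glsar_model2 = sm.GLSAR(data.endog, data.exog, 1)
glsar_results2 = glsar_model2.iterative_fit(maxiter=6)
print(glsar_model2.rho)        # the AR(1) coefficient after iterating
print(glsar_results2.params)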