In this notebook we compare different models that provide inference robust to clustered errors, as they commonly occur in longitudinal or repeated measures data.
In these models we can estimate the parameters of the mean model consistently (the estimates converge to the true parameters in large samples) even if the correlation structure or the distribution of the error term is misspecified. However, although the parameter estimates are consistent, the usual estimate of their covariance is not. If we use standard errors that are based on the assumption of independently and identically distributed (iid) observations, then those standard errors can be very misleading if the data are not close to satisfying the iid assumption.
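All of the robust covariance estimators used below share the "sandwich" form. Writing $\hat{B}$ for the Hessian of the objective function at the estimate and $s_g$ for the summed score contribution of cluster $g$ (a sketch of the general idea, not the exact statsmodels internals):

$$
\widehat{\operatorname{Cov}}(\hat\beta) \;=\; \hat{B}^{-1} \Bigl( \sum_{g=1}^{G} s_g s_g^{\top} \Bigr) \hat{B}^{-1}
$$

Only the middle ("meat") term uses the clustering; under a correctly specified iid model it reduces to the information matrix and the expression collapses to the usual maximum likelihood covariance.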
The commonly used solution is to correct the standard errors for unspecified or misspecified heteroscedasticity or correlation in the data. In the case of longitudinal data, where we observe an individual or a group repeatedly, a frequent problem is that observations for the same individual or group are correlated with each other. Ignoring this correlation can produce misleading inference, as the following example shows.
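To see why this matters, here is a minimal self-contained sketch (plain NumPy, not part of the original analysis; all names are illustrative). It simulates data with a cluster-level random effect and compares the usual iid OLS standard errors with hand-rolled cluster-robust (sandwich) standard errors:

```python
import numpy as np

rng = np.random.RandomState(0)
n_groups, n_per = 50, 10
n = n_groups * n_per
groups = np.repeat(np.arange(n_groups), n_per)

# regressor constant within each cluster, so clustering matters for its SE
x = np.repeat(rng.standard_normal(n_groups), n_per)
X = np.column_stack([np.ones(n), x])
u = np.repeat(rng.standard_normal(n_groups), n_per)  # cluster random effect
y = 1.0 + 0.5 * x + u + rng.standard_normal(n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta

# usual iid OLS covariance: sigma^2 (X'X)^{-1}
XtX_inv = np.linalg.inv(X.T @ X)
sigma2 = resid @ resid / (n - X.shape[1])
se_iid = np.sqrt(np.diag(sigma2 * XtX_inv))

# cluster-robust sandwich: (X'X)^{-1} (sum_g s_g s_g') (X'X)^{-1}
meat = np.zeros((2, 2))
for g in range(n_groups):
    idx = groups == g
    s_g = X[idx].T @ resid[idx]  # summed score of cluster g
    meat += np.outer(s_g, s_g)
cov_cluster = XtX_inv @ meat @ XtX_inv
se_cluster = np.sqrt(np.diag(cov_cluster))

print(se_iid)      # understates the uncertainty
print(se_cluster)  # markedly larger for the cluster-level regressor x
```

The iid formula treats all 500 observations as independent, while the cluster-constant regressor effectively only has 50 independent draws, so the iid standard error for its coefficient is far too small.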
The next parts use several models from statsmodels to estimate parameters and standard errors. The last section compares the results.
# -*- coding: utf-8 -*-
"""
Created on Sun Aug 17 21:29:39 2014
Author: Josef Perktold, based on GEE example by Kerby Shedden
"""
import numpy as np
import pandas as pd
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.dependence_structures import (Exchangeable,
    Independence, Autoregressive)
from statsmodels.genmod import families
data_url = "http://vincentarelbundock.github.io/Rdatasets/csv/MASS/epil.csv"
data = pd.read_csv(data_url)
print(data.head())
```
   Unnamed: 0  y      trt  base  age  V4  subject  period     lbase      lage
0           1  5  placebo    11   31   0        1       1 -0.756354  0.114204
1           2  3  placebo    11   31   0        1       2 -0.756354  0.114204
2           3  3  placebo    11   31   0        1       3 -0.756354  0.114204
3           4  3  placebo    11   31   1        1       4 -0.756354  0.114204
4           5  3  placebo    11   30   0        2       1 -0.756354  0.081414
```
fam = families.Poisson()
ind = Independence()
mod1 = GEE.from_formula("y ~ age + trt + base", "subject", data, cov_struct=ind, family=fam)
rslt1 = mod1.fit()
print('\n')
print(rslt1.summary())
```
                               GEE Regression Results
===================================================================================
Dep. Variable:                           y   No. Observations:                  236
Model:                                 GEE   No. clusters:                       59
Method:                        Generalized   Min. cluster size:                   4
                      Estimating Equations   Max. cluster size:                   4
Family:                            Poisson   Mean cluster size:                 4.0
Dependence structure:         Independence   Num. iterations:                    51
Date:                     Tue, 19 Aug 2014   Scale:                           5.087
Covariance type:                    robust   Time:                         11:22:35
====================================================================================
                       coef    std err          z      P>|z|  [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.5730      0.361      1.589      0.112      -0.134     1.280
trt[T.progabide]    -0.1519      0.171     -0.888      0.375      -0.487     0.183
age                  0.0223      0.011      1.960      0.050    2.11e-06     0.045
base                 0.0226      0.001     18.451      0.000       0.020     0.025
==============================================================================
Skew:                          3.7823   Kurtosis:                      28.6672
Centered skew:                 2.7597   Centered kurtosis:             21.9865
==============================================================================
```
rslt1.params
array([ 0.57304359, -0.15188049, 0.02234757, 0.02263524])
rslt1
<statsmodels.genmod.generalized_estimating_equations.GEEResults at 0x6f54fb0>
mod2 = GLM.from_formula("y ~ age + trt + base", data, family=fam)
groups = data["subject"]
rslt2 = mod2.fit(cov_type='cluster', cov_kwds=dict(groups=groups))
print('\n')
print(rslt2.summary())
```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  236
Model:                            GLM   Df Residuals:                      232
Model Family:                 Poisson   Df Model:                            3
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -861.39
Date:                Tue, 19 Aug 2014   Deviance:                       906.98
Time:                        11:22:35   Pearson chi2:                 1.18e+03
No. Iterations:                     8
====================================================================================
                       coef    std err          z      P>|z|  [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.5730      0.366      1.565      0.118      -0.145     1.291
trt[T.progabide]    -0.1519      0.174     -0.875      0.382      -0.492     0.188
age                  0.0223      0.012      1.931      0.053      -0.000     0.045
base                 0.0226      0.001     18.177      0.000       0.020     0.025
====================================================================================
```
rslt_glm = mod2.fit()
print('\n')
print(rslt_glm.summary())
```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  236
Model:                            GLM   Df Residuals:                      232
Model Family:                 Poisson   Df Model:                            3
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -861.39
Date:                Tue, 19 Aug 2014   Deviance:                       906.98
Time:                        11:22:35   Pearson chi2:                 1.18e+03
No. Iterations:                     8
====================================================================================
                       coef    std err          z      P>|z|  [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.5730      0.136      4.224      0.000       0.307     0.839
trt[T.progabide]    -0.1519      0.048     -3.175      0.001      -0.246    -0.058
age                  0.0223      0.004      5.549      0.000       0.014     0.030
base                 0.0226      0.001     44.436      0.000       0.022     0.024
====================================================================================
```
In the following we run the same analysis with the Poisson model from `discrete_model`.

Note: I switched to the 'bfgs' optimizer because the default 'newton' method did not converge correctly.
import statsmodels.discrete.discrete_model as dm
mod3 = dm.Poisson.from_formula("y ~ age + trt + base", data)
rslt3 = mod3.fit(method='bfgs', cov_type='cluster', cov_kwds=dict(groups=groups))
print('\n')
print(rslt3.summary())
```
Optimization terminated successfully.
         Current function value: 3.649945
         Iterations: 16
         Function evaluations: 25
         Gradient evaluations: 25
                          Poisson Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  236
Model:                        Poisson   Df Residuals:                      232
Method:                           MLE   Df Model:                            3
Date:                Tue, 19 Aug 2014   Pseudo R-squ.:                  0.4754
Time:                        11:22:35   Log-Likelihood:                -861.39
converged:                       True   LL-Null:                       -1641.9
                                        LLR p-value:                     0.000
====================================================================================
                       coef    std err          z      P>|z|  [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.5730      0.366      1.565      0.118      -0.145     1.291
trt[T.progabide]    -0.1519      0.174     -0.875      0.382      -0.492     0.188
age                  0.0223      0.012      1.931      0.053      -0.000     0.045
base                 0.0226      0.001     18.177      0.000       0.020     0.025
====================================================================================
```
rslt2t = mod2.fit(cov_type='cluster', cov_kwds=dict(groups=groups), use_t=True)
print(rslt2t.summary())
```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  236
Model:                            GLM   Df Residuals:                      232
Model Family:                 Poisson   Df Model:                            3
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -861.39
Date:                Tue, 19 Aug 2014   Deviance:                       906.98
Time:                        11:22:35   Pearson chi2:                 1.18e+03
No. Iterations:                     8
====================================================================================
                       coef    std err          t      P>|t|  [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.5730      0.366      1.565      0.123      -0.160     1.306
trt[T.progabide]    -0.1519      0.174     -0.875      0.385      -0.499     0.196
age                  0.0223      0.012      1.931      0.058      -0.001     0.046
base                 0.0226      0.001     18.177      0.000       0.020     0.025
====================================================================================
```
rslt3t = mod3.fit(method='bfgs', cov_type='cluster', cov_kwds=dict(groups=groups), use_t=True)
print(rslt3t.summary())
```
Optimization terminated successfully.
         Current function value: 3.649945
         Iterations: 16
         Function evaluations: 25
         Gradient evaluations: 25
                          Poisson Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  236
Model:                        Poisson   Df Residuals:                      232
Method:                           MLE   Df Model:                            3
Date:                Tue, 19 Aug 2014   Pseudo R-squ.:                  0.4754
Time:                        11:22:36   Log-Likelihood:                -861.39
converged:                       True   LL-Null:                       -1641.9
                                        LLR p-value:                     0.000
====================================================================================
                       coef    std err          t      P>|t|  [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.5730      0.366      1.565      0.123      -0.160     1.306
trt[T.progabide]    -0.1519      0.174     -0.875      0.385      -0.499     0.196
age                  0.0223      0.012      1.931      0.058      -0.001     0.046
base                 0.0226      0.001     18.177      0.000       0.020     0.025
====================================================================================
```
The following tables compare the parameter estimates and the resulting inference across the models estimated above.
The models corresponding to the columns in the tables are: 'GLM non-robust', 'GEE', 'GLM robust', 'Poisson robust', 'GLM robust t', 'Poisson robust t'.
First, all models estimate exactly the same values for the mean parameters. All models considered here assume the same underlying mean model and obtain the same parameter estimates, even though the estimation and optimization methods differ.
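This is expected: for the Poisson mean model with log link, GLM (via IRLS), the discrete Poisson MLE and GEE with independence working correlation all solve the same first-order (score) conditions,

$$ \sum_{i} \bigl(y_i - \exp(x_i^{\top}\beta)\bigr)\, x_i = 0, $$

so the point estimates coincide and only the covariance estimates differ.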
In terms of inference we get essentially three groups of results.
The first group is the Poisson model estimated with GLM, which assumes the data are generated from a Poisson distribution. The standard errors are relatively small, and based on the p-values we reject the null hypothesis that the coefficient is zero for every parameter.
In the second group, GEE, Poisson and GLM with the Poisson family, each with cluster-robust standard errors, the standard errors are much larger than the non-robust ones; based on the p-values, only base is clearly significant and age is right at the 0.05 threshold. These p-values and confidence intervals are based on the asymptotic normal distribution of the estimated parameters.
The last two columns show the cluster-robust results for GLM-Poisson and Poisson when we use the t-distribution instead of the normal distribution. In the linear case the t-distribution often provides a better small-sample approximation than the normal distribution. Since we are using the Poisson model here, we would have to check, for example by Monte Carlo, whether the t-distribution is indeed better; here we just use it to illustrate the difference between the two distributional assumptions. The standard errors are independent of this choice and do not change. However, the p-values are slightly larger than under the normal distribution, and the confidence intervals are wider, as can be seen in the summary tables of the models above.
The standard errors for the parameters differ between GEE on one side and GLM-Poisson and Poisson on the other side by a small scaling factor that reflects a different small-sample correction or variance scaling. GLM-Poisson and Poisson share the same code for the robust standard errors and produce identical results.
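The size of that factor can be checked by hand. A plausible candidate (an assumption on my part, not read off the statsmodels source) is the common cluster correction c = G/(G-1) * (N-1)/(N-k); with G = 59 clusters, N = 236 observations and k = 4 parameters, its square root comes close to the observed ratio of the robust intercept standard errors, 0.3662 / 0.3607:

```python
import math

n_obs, n_groups, k_params = 236, 59, 4

# common small-sample correction applied to the cluster sandwich estimate
c = n_groups / (n_groups - 1) * (n_obs - 1) / (n_obs - k_params)
print(math.sqrt(c))      # ~1.015
print(0.3662 / 0.3607)   # ratio of GLM-robust to GEE standard errors
```

The two numbers agree to about three decimals, consistent with GEE reporting the uncorrected sandwich and GLM/Poisson applying this correction.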
The default degrees of freedom, as well as the small-sample correction, in the case of cluster-robust standard errors are based on the number of clusters and not on the total number of observations. If we had based the degrees of freedom on the number of observations, then the results based on the t-distribution would be closer to those based on the normal distribution.
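To see how much this choice matters here, the following self-contained check (plain NumPy; it assumes, following the text above, that the cluster-based t reference uses 59 − 1 = 58 degrees of freedom) recomputes the two-sided p-value for the age statistic, z ≈ 1.931, under the normal distribution and under t-distributions with cluster-based and observation-based degrees of freedom:

```python
import math
import numpy as np

def t_sf(x, df, upper=60.0, n=200001):
    """P(T > x) for Student's t, by trapezoid integration of the density."""
    grid = np.linspace(x, upper, n)
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    dens = const * (1.0 + grid**2 / df) ** (-(df + 1) / 2)
    return float(np.sum((dens[1:] + dens[:-1]) / 2 * np.diff(grid)))

z = 1.931  # cluster-robust statistic for `age` in the tables above

p_norm = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))  # normal reference
p_t_clusters = 2 * t_sf(z, 58)   # df from the 59 clusters: 59 - 1
p_t_obs = 2 * t_sf(z, 232)       # df from the 236 observations: 236 - 4

print(round(p_norm, 4), round(p_t_obs, 4), round(p_t_clusters, 4))
```

The cluster-based t p-value approximately reproduces the 0.058 shown for age in the use_t summaries, while observation-based degrees of freedom land between that value and the normal-based 0.053.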
res_all = [rslt_glm, rslt1, rslt2, rslt3, rslt2t, rslt3t]
res_names = ['GLM non-robust', 'GEE', 'GLM robust', 'Poisson robust', 'GLM robust t', 'Poisson robust t']
import pandas
# pandas.DataFrame([getattr(res, 'params') for res in res_all], columns=res_names) # broken with my versions
# pandas.DataFrame([pandas.Series(getattr(res, 'params'), index=rslt2.params.index) for res in res_all], columns=res_names)
# the second has only nans
# detour through numpy
print('Parameters')
pandas.DataFrame(np.round(np.column_stack([np.asarray(getattr(res, 'params')) for res in res_all]), 4),
index=rslt2.params.index, columns=res_names)
Parameters
|                  | GLM non-robust | GEE     | GLM robust | Poisson robust | GLM robust t | Poisson robust t |
|------------------|----------------|---------|------------|----------------|--------------|------------------|
| Intercept        | 0.5730         | 0.5730  | 0.5730     | 0.5730         | 0.5730       | 0.5730           |
| trt[T.progabide] | -0.1519        | -0.1519 | -0.1519    | -0.1519        | -0.1519      | -0.1519          |
| age              | 0.0223         | 0.0223  | 0.0223     | 0.0223         | 0.0223       | 0.0223           |
| base             | 0.0226         | 0.0226  | 0.0226     | 0.0226         | 0.0226       | 0.0226           |
print('Standard Errors')
pandas.DataFrame(np.round(np.column_stack([np.asarray(getattr(res, 'bse')) for res in res_all]), 4),
index=rslt2.params.index, columns=res_names)
Standard Errors
|                  | GLM non-robust | GEE    | GLM robust | Poisson robust | GLM robust t | Poisson robust t |
|------------------|----------------|--------|------------|----------------|--------------|------------------|
| Intercept        | 0.1357         | 0.3607 | 0.3662     | 0.3662         | 0.3662       | 0.3662           |
| trt[T.progabide] | 0.0478         | 0.1711 | 0.1736     | 0.1736         | 0.1736       | 0.1736           |
| age              | 0.0040         | 0.0114 | 0.0116     | 0.0116         | 0.0116       | 0.0116           |
| base             | 0.0005         | 0.0012 | 0.0012     | 0.0012         | 0.0012       | 0.0012           |
print('P-values')
pandas.DataFrame(np.round(np.column_stack([np.asarray(getattr(res, 'pvalues')) for res in res_all]), 4),
index=rslt2.params.index, columns=res_names)
P-values
|                  | GLM non-robust | GEE    | GLM robust | Poisson robust | GLM robust t | Poisson robust t |
|------------------|----------------|--------|------------|----------------|--------------|------------------|
| Intercept        | 0.0000         | 0.1122 | 0.1176     | 0.1176         | 0.1230       | 0.1230           |
| trt[T.progabide] | 0.0015         | 0.3746 | 0.3817     | 0.3817         | 0.3853       | 0.3853           |
| age              | 0.0000         | 0.0500 | 0.0535     | 0.0535         | 0.0584       | 0.0584           |
| base             | 0.0000         | 0.0000 | 0.0000     | 0.0000         | 0.0000       | 0.0000           |
#covariance_type='robust'
# BUG see https://github.com/statsmodels/statsmodels/issues/1906
mod1b = GEE.from_formula("y ~ age + trt + base", "subject", data, cov_struct=ind, family=fam)
# fit twice with the same option; the two results report different
# covariance types below, which is the bug linked above
rslt1b = mod1b.fit(covariance_type='naive')
rslt1b_ = mod1b.fit(covariance_type='naive')
print('\n')
print(rslt1b.summary())
```
                               GEE Regression Results
===================================================================================
Dep. Variable:                           y   No. Observations:                  236
Model:                                 GEE   No. clusters:                       59
Method:                        Generalized   Min. cluster size:                   4
                      Estimating Equations   Max. cluster size:                   4
Family:                            Poisson   Mean cluster size:                 4.0
Dependence structure:         Independence   Num. iterations:                    51
Date:                     Tue, 19 Aug 2014   Scale:                           5.087
Covariance type:                    robust   Time:                         12:13:20
====================================================================================
                       coef    std err          z      P>|z|  [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.5730      0.361      1.589      0.112      -0.134     1.280
trt[T.progabide]    -0.1519      0.171     -0.888      0.375      -0.487     0.183
age                  0.0223      0.011      1.960      0.050    2.11e-06     0.045
base                 0.0226      0.001     18.451      0.000       0.020     0.025
==============================================================================
Skew:                          3.7823   Kurtosis:                      28.6672
Centered skew:                 2.7597   Centered kurtosis:             21.9865
==============================================================================
```
#print(rslt1b_.summary())
rslt1b.covariance_type, rslt1b_.covariance_type
('robust', 'naive')
rslt1b.bse, rslt1b_.bse
```
(array([ 0.36072614,  0.17105111,  0.01140096,  0.00122675]),
 array([ 0.30598606,  0.10789098,  0.00908399,  0.00114894]))
```
rslt1b_.bse - rslt1.bse
array([ -5.47400849e-02, -6.31601337e-02, -2.31696404e-03, -7.78130388e-05])
bse_naive = np.sqrt(np.diag(rslt1b.naive_covariance))
print(bse_naive)
print(rslt_glm.bse.values)
#np.sqrt(np.diag(rslt1c.robust_covariance_bc))
print(bse_naive / rslt_glm.bse.values)
```
[ 0.30598606  0.10789098  0.00908399  0.00114894]
[ 0.1356608   0.04783413  0.00402744  0.00050939]
[ 2.25552299  2.25552299  2.25552299  2.25552299]
```
# square root of the Pearson-based scale estimate, approximately
# 2.2555 = sqrt(5.087), matching the ratio printed above
np.sqrt(rslt_glm.pearson_chi2 / rslt_glm.model.df_resid)
mod1c = GEE.from_formula("y ~ age + trt + base", "subject", data, cov_struct=ind, family=fam)
rslt1c = mod1c.fit(covariance_type='bias-reduced')
print('\n')
print(rslt1c.summary())
rslt1c.bse - rslt1.bse
np.sqrt(np.diag(rslt1c.robust_covariance_bc))
rslt1c.bse
rslt1b.bse
rslt1b.covariance_type
rslt1b.covariance_type = 'naive'  # manually switch the covariance type on the results instance
del rslt1b._cache['bse']          # drop the cached standard errors so they are recomputed
rslt1b.bse
rslt1b_ = mod1b.fit(covariance_type='naive')
rslt1b_.covariance_type
rslt1b_.bse