In this notebook we compare different models that provide inference robust to clustered errors, as they commonly occur in longitudinal or repeated measures data.
In these models we can estimate the parameters of the mean model consistently (the estimates converge to the true parameters in large samples) even if the correlation structure or the distribution of the error term is misspecified. However, although the parameter estimates are consistent, the usual estimate of their covariance is not. If we use standard errors that are based on the assumption of independently and identically distributed (iid) observations, then those standard errors can be very misleading if the data are not close to satisfying the iid assumption.
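All of the robust covariance estimators used below share the "sandwich" form. Writing $\hat{B}$ for the Hessian of the objective function at the estimate and $s_g$ for the summed score contribution of cluster $g$ (a sketch of the general idea, not the exact statsmodels internals):

$$
\widehat{\operatorname{Cov}}(\hat\beta) \;=\; \hat{B}^{-1} \Bigl( \sum_{g=1}^{G} s_g s_g^{\top} \Bigr) \hat{B}^{-1}
$$

Only the middle ("meat") term uses the clustering; under a correctly specified iid model it reduces to the information matrix and the expression collapses to the usual maximum likelihood covariance.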
The commonly used solution is to correct the standard errors for unspecified or misspecified heteroscedasticity or correlation in the data. In the case of longitudinal data, where we observe an individual or a group repeatedly, a frequent problem is that observations for the same individual or group are correlated with each other. Ignoring this correlation can produce misleading inference, as the following example shows.
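To see why this matters, here is a minimal self-contained sketch (plain NumPy, not part of the original analysis; all names are illustrative). It simulates data with a cluster-level random effect and compares the usual iid OLS standard errors with hand-rolled cluster-robust (sandwich) standard errors:

```python
import numpy as np

rng = np.random.RandomState(0)
n_groups, n_per = 50, 10
n = n_groups * n_per
groups = np.repeat(np.arange(n_groups), n_per)

# regressor constant within each cluster, so clustering matters for its SE
x = np.repeat(rng.standard_normal(n_groups), n_per)
X = np.column_stack([np.ones(n), x])
u = np.repeat(rng.standard_normal(n_groups), n_per)  # cluster random effect
y = 1.0 + 0.5 * x + u + rng.standard_normal(n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta

# usual iid OLS covariance: sigma^2 (X'X)^{-1}
XtX_inv = np.linalg.inv(X.T @ X)
sigma2 = resid @ resid / (n - X.shape[1])
se_iid = np.sqrt(np.diag(sigma2 * XtX_inv))

# cluster-robust sandwich: (X'X)^{-1} (sum_g s_g s_g') (X'X)^{-1}
meat = np.zeros((2, 2))
for g in range(n_groups):
    idx = groups == g
    s_g = X[idx].T @ resid[idx]  # summed score of cluster g
    meat += np.outer(s_g, s_g)
cov_cluster = XtX_inv @ meat @ XtX_inv
se_cluster = np.sqrt(np.diag(cov_cluster))

print(se_iid)      # understates the uncertainty
print(se_cluster)  # markedly larger for the cluster-level regressor x
```

The iid formula treats all 500 observations as independent, while the cluster-constant regressor effectively only has 50 independent draws, so the iid standard error for its coefficient is far too small.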
The next parts use several models from statsmodels to estimate parameters and standard errors. The last section compares the results.
# -*- coding: utf-8 -*-
"""
Created on Sun Aug 17 21:29:39 2014
Author: Josef Perktold, based on GEE example by Kerby Shedden
"""
import numpy as np
import pandas as pd
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.dependence_structures import (Exchangeable,
    Independence, Autoregressive)
from statsmodels.genmod import families
data_url = "http://vincentarelbundock.github.io/Rdatasets/csv/MASS/epil.csv"
data = pd.read_csv(data_url)
print(data.head())
```
   Unnamed: 0  y      trt  base  age  V4  subject  period     lbase      lage
0           1  5  placebo    11   31   0        1       1 -0.756354  0.114204
1           2  3  placebo    11   31   0        1       2 -0.756354  0.114204
2           3  3  placebo    11   31   0        1       3 -0.756354  0.114204
3           4  3  placebo    11   31   1        1       4 -0.756354  0.114204
4           5  3  placebo    11   30   0        2       1 -0.756354  0.081414
```
fam = families.Poisson()
ind = Independence()
mod1 = GEE.from_formula("y ~ age + trt + base", "subject", data, cov_struct=ind, family=fam)
rslt1 = mod1.fit()
print('\n')
print(rslt1.summary())
```
                               GEE Regression Results
===================================================================================
Dep. Variable:                           y   No. Observations:                  236
Model:                                 GEE   No. clusters:                       59
Method:                        Generalized   Min. cluster size:                   4
                      Estimating Equations   Max. cluster size:                   4
Family:                            Poisson   Mean cluster size:                 4.0
Dependence structure:         Independence   Num. iterations:                    51
Date:                     Tue, 19 Aug 2014   Scale:                           5.087
Covariance type:                    robust   Time:                         11:22:35
====================================================================================
                       coef    std err          z      P>|z|  [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.5730      0.361      1.589      0.112      -0.134     1.280
trt[T.progabide]    -0.1519      0.171     -0.888      0.375      -0.487     0.183
age                  0.0223      0.011      1.960      0.050    2.11e-06     0.045
base                 0.0226      0.001     18.451      0.000       0.020     0.025
==============================================================================
Skew:                          3.7823   Kurtosis:                      28.6672
Centered skew:                 2.7597   Centered kurtosis:             21.9865
==============================================================================
```
rslt1.params
array([ 0.57304359, -0.15188049, 0.02234757, 0.02263524])
rslt1
<statsmodels.genmod.generalized_estimating_equations.GEEResults at 0x6f54fb0>
mod2 = GLM.from_formula("y ~ age + trt + base", data, family=fam)
groups = data["subject"]
rslt2 = mod2.fit(cov_type='cluster', cov_kwds=dict(groups=groups))
print('\n')
print(rslt2.summary())
```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  236
Model:                            GLM   Df Residuals:                      232
Model Family:                 Poisson   Df Model:                            3
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -861.39
Date:                Tue, 19 Aug 2014   Deviance:                       906.98
Time:                        11:22:35   Pearson chi2:                 1.18e+03
No. Iterations:                     8
====================================================================================
                       coef    std err          z      P>|z|  [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.5730      0.366      1.565      0.118      -0.145     1.291
trt[T.progabide]    -0.1519      0.174     -0.875      0.382      -0.492     0.188
age                  0.0223      0.012      1.931      0.053      -0.000     0.045
base                 0.0226      0.001     18.177      0.000       0.020     0.025
====================================================================================
```
rslt_glm = mod2.fit()
print('\n')
print(rslt_glm.summary())
```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  236
Model:                            GLM   Df Residuals:                      232
Model Family:                 Poisson   Df Model:                            3
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -861.39
Date:                Tue, 19 Aug 2014   Deviance:                       906.98
Time:                        11:22:35   Pearson chi2:                 1.18e+03
No. Iterations:                     8
====================================================================================
                       coef    std err          z      P>|z|  [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.5730      0.136      4.224      0.000       0.307     0.839
trt[T.progabide]    -0.1519      0.048     -3.175      0.001      -0.246    -0.058
age                  0.0223      0.004      5.549      0.000       0.014     0.030
base                 0.0226      0.001     44.436      0.000       0.022     0.024
====================================================================================
```
In the following we run the same analysis with the Poisson model from `discrete_model`.

Note: I switched to the 'bfgs' optimizer because the default 'newton' method did not converge correctly.
import statsmodels.discrete.discrete_model as dm
mod3 = dm.Poisson.from_formula("y ~ age + trt + base", data)
rslt3 = mod3.fit(method='bfgs', cov_type='cluster', cov_kwds=dict(groups=groups))
print('\n')
print(rslt3.summary())
```
Optimization terminated successfully.
         Current function value: 3.649945
         Iterations: 16
         Function evaluations: 25
         Gradient evaluations: 25
                          Poisson Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  236
Model:                        Poisson   Df Residuals:                      232
Method:                           MLE   Df Model:                            3
Date:                Tue, 19 Aug 2014   Pseudo R-squ.:                  0.4754
Time:                        11:22:35   Log-Likelihood:                -861.39
converged:                       True   LL-Null:                       -1641.9
                                        LLR p-value:                     0.000
====================================================================================
                       coef    std err          z      P>|z|  [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.5730      0.366      1.565      0.118      -0.145     1.291
trt[T.progabide]    -0.1519      0.174     -0.875      0.382      -0.492     0.188
age                  0.0223      0.012      1.931      0.053      -0.000     0.045
base                 0.0226      0.001     18.177      0.000       0.020     0.025
====================================================================================
```
rslt2t = mod2.fit(cov_type='cluster', cov_kwds=dict(groups=groups), use_t=True)
print(rslt2t.summary())
```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  236
Model:                            GLM   Df Residuals:                      232
Model Family:                 Poisson   Df Model:                            3
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -861.39
Date:                Tue, 19 Aug 2014   Deviance:                       906.98
Time:                        11:22:35   Pearson chi2:                 1.18e+03
No. Iterations:                     8
====================================================================================
                       coef    std err          t      P>|t|  [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.5730      0.366      1.565      0.123      -0.160     1.306
trt[T.progabide]    -0.1519      0.174     -0.875      0.385      -0.499     0.196
age                  0.0223      0.012      1.931      0.058      -0.001     0.046
base                 0.0226      0.001     18.177      0.000       0.020     0.025
====================================================================================
```
rslt3t = mod3.fit(method='bfgs', cov_type='cluster', cov_kwds=dict(groups=groups), use_t=True)
print(rslt3t.summary())
```
Optimization terminated successfully.
         Current function value: 3.649945
         Iterations: 16
         Function evaluations: 25
         Gradient evaluations: 25
                          Poisson Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  236
Model:                        Poisson   Df Residuals:                      232
Method:                           MLE   Df Model:                            3
Date:                Tue, 19 Aug 2014   Pseudo R-squ.:                  0.4754
Time:                        11:22:36   Log-Likelihood:                -861.39
converged:                       True   LL-Null:                       -1641.9
                                        LLR p-value:                     0.000
====================================================================================
                       coef    std err          t      P>|t|  [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.5730      0.366      1.565      0.123      -0.160     1.306
trt[T.progabide]    -0.1519      0.174     -0.875      0.385      -0.499     0.196
age                  0.0223      0.012      1.931      0.058      -0.001     0.046
base                 0.0226      0.001     18.177      0.000       0.020     0.025
====================================================================================
```
The following tables compare the parameter estimates and the resulting inference across the models estimated above.
The models corresponding to the columns in the tables are: 'GLM non-robust', 'GEE', 'GLM robust', 'Poisson robust', 'GLM robust t', 'Poisson robust t'.
First, all models estimate exactly the same values for the mean parameters. All models considered here assume the same underlying mean model and obtain the same parameter estimates, even though the estimation and optimization methods differ.
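This is expected: for the Poisson mean model with log link, GLM (via IRLS), the discrete Poisson MLE and GEE with independence working correlation all solve the same first-order (score) conditions,

$$ \sum_{i} \bigl(y_i - \exp(x_i^{\top}\beta)\bigr)\, x_i = 0, $$

so the point estimates coincide and only the covariance estimates differ.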
In terms of inference we get essentially three groups of results.
The first group is the Poisson model estimated with GLM, which assumes the data are generated from a Poisson distribution. The standard errors are relatively small, and based on the p-values we reject the null hypothesis that the coefficient is zero for every parameter.
In the second group, GEE, Poisson and GLM with the Poisson family, each with cluster-robust standard errors, the standard errors are much larger than the non-robust ones; based on the p-values, only base is clearly significant and age is right at the 0.05 threshold. These p-values and confidence intervals are based on the asymptotic normal distribution of the estimated parameters.
The last two columns show the cluster-robust results for GLM-Poisson and Poisson when we use the t-distribution instead of the normal distribution. In the linear case the t-distribution often provides a better small-sample approximation than the normal distribution. Since we are using the Poisson model here, we would have to check, for example by Monte Carlo, whether the t-distribution is indeed better; here we just use it to illustrate the difference between the two distributional assumptions. The standard errors are independent of this choice and do not change. However, the p-values are slightly larger than under the normal distribution, and the confidence intervals are wider, as can be seen in the summary tables of the models above.
The standard errors for the parameters differ between GEE on one side and GLM-Poisson and Poisson on the other side by a small scaling factor that reflects a different small-sample correction or variance scaling. GLM-Poisson and Poisson share the same code for the robust standard errors and produce identical results.
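The size of that factor can be checked by hand. A plausible candidate (an assumption on my part, not read off the statsmodels source) is the common cluster correction c = G/(G-1) * (N-1)/(N-k); with G = 59 clusters, N = 236 observations and k = 4 parameters, its square root comes close to the observed ratio of the robust intercept standard errors, 0.3662 / 0.3607:

```python
import math

n_obs, n_groups, k_params = 236, 59, 4

# common small-sample correction applied to the cluster sandwich estimate
c = n_groups / (n_groups - 1) * (n_obs - 1) / (n_obs - k_params)
print(math.sqrt(c))      # ~1.015
print(0.3662 / 0.3607)   # ratio of GLM-robust to GEE standard errors
```

The two numbers agree to about three decimals, consistent with GEE reporting the uncorrected sandwich and GLM/Poisson applying this correction.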
The default degrees of freedom, as well as the small-sample correction, in the case of cluster-robust standard errors are based on the number of clusters and not on the total number of observations. If we had based the degrees of freedom on the number of observations, then the results based on the t-distribution would be closer to those based on the normal distribution.
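To see how much this choice matters here, the following self-contained check (plain NumPy; it assumes, following the text above, that the cluster-based t reference uses 59 − 1 = 58 degrees of freedom) recomputes the two-sided p-value for the age statistic, z ≈ 1.931, under the normal distribution and under t-distributions with cluster-based and observation-based degrees of freedom:

```python
import math
import numpy as np

def t_sf(x, df, upper=60.0, n=200001):
    """P(T > x) for Student's t, by trapezoid integration of the density."""
    grid = np.linspace(x, upper, n)
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    dens = const * (1.0 + grid**2 / df) ** (-(df + 1) / 2)
    return float(np.sum((dens[1:] + dens[:-1]) / 2 * np.diff(grid)))

z = 1.931  # cluster-robust statistic for `age` in the tables above

p_norm = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))  # normal reference
p_t_clusters = 2 * t_sf(z, 58)   # df from the 59 clusters: 59 - 1
p_t_obs = 2 * t_sf(z, 232)       # df from the 236 observations: 236 - 4

print(round(p_norm, 4), round(p_t_obs, 4), round(p_t_clusters, 4))
```

The cluster-based t p-value approximately reproduces the 0.058 shown for age in the use_t summaries, while observation-based degrees of freedom land between that value and the normal-based 0.053.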
res_all = [rslt_glm, rslt1, rslt2, rslt3, rslt2t, rslt3t]
res_names = ['GLM non-robust', 'GEE', 'GLM robust', 'Poisson robust', 'GLM robust t', 'Poisson robust t']
import pandas
# pandas.DataFrame([getattr(res, 'params') for res in res_all], columns=res_names) # broken with my versions
# pandas.DataFrame([pandas.Series(getattr(res, 'params'), index=rslt2.params.index) for res in res_all], columns=res_names)
# the second has only nans
# detour through numpy
print('Parameters')
pandas.DataFrame(np.round(np.column_stack([np.asarray(getattr(res, 'params')) for res in res_all]), 4),
index=rslt2.params.index, columns=res_names)
Parameters
|                  | GLM non-robust | GEE     | GLM robust | Poisson robust | GLM robust t | Poisson robust t |
|------------------|----------------|---------|------------|----------------|--------------|------------------|
| Intercept        | 0.5730         | 0.5730  | 0.5730     | 0.5730         | 0.5730       | 0.5730           |
| trt[T.progabide] | -0.1519        | -0.1519 | -0.1519    | -0.1519        | -0.1519      | -0.1519          |
| age              | 0.0223         | 0.0223  | 0.0223     | 0.0223         | 0.0223       | 0.0223           |
| base             | 0.0226         | 0.0226  | 0.0226     | 0.0226         | 0.0226       | 0.0226           |
print('Standard Errors')
pandas.DataFrame(np.round(np.column_stack([np.asarray(getattr(res, 'bse')) for res in res_all]), 4),
index=rslt2.params.index, columns=res_names)
Standard Errors
|                  | GLM non-robust | GEE    | GLM robust | Poisson robust | GLM robust t | Poisson robust t |
|------------------|----------------|--------|------------|----------------|--------------|------------------|
| Intercept        | 0.1357         | 0.3607 | 0.3662     | 0.3662         | 0.3662       | 0.3662           |
| trt[T.progabide] | 0.0478         | 0.1711 | 0.1736     | 0.1736         | 0.1736       | 0.1736           |
| age              | 0.0040         | 0.0114 | 0.0116     | 0.0116         | 0.0116       | 0.0116           |
| base             | 0.0005         | 0.0012 | 0.0012     | 0.0012         | 0.0012       | 0.0012           |
print('P-values')
pandas.DataFrame(np.round(np.column_stack([np.asarray(getattr(res, 'pvalues')) for res in res_all]), 4),
index=rslt2.params.index, columns=res_names)
P-values
|                  | GLM non-robust | GEE    | GLM robust | Poisson robust | GLM robust t | Poisson robust t |
|------------------|----------------|--------|------------|----------------|--------------|------------------|
| Intercept        | 0.0000         | 0.1122 | 0.1176     | 0.1176         | 0.1230       | 0.1230           |
| trt[T.progabide] | 0.0015         | 0.3746 | 0.3817     | 0.3817         | 0.3853       | 0.3853           |
| age              | 0.0000         | 0.0500 | 0.0535     | 0.0535         | 0.0584       | 0.0584           |
| base             | 0.0000         | 0.0000 | 0.0000     | 0.0000         | 0.0000       | 0.0000           |
#covariance_type='robust'
# BUG see https://github.com/statsmodels/statsmodels/issues/1906
mod1b = GEE.from_formula("y ~ age + trt + base", "subject", data, cov_struct=ind, family=fam)
# fit twice with the same option; the two results report different
# covariance types below, which is the bug linked above
rslt1b = mod1b.fit(covariance_type='naive')
rslt1b_ = mod1b.fit(covariance_type='naive')
print('\n')
print(rslt1b.summary())
```
                               GEE Regression Results
===================================================================================
Dep. Variable:                           y   No. Observations:                  236
Model:                                 GEE   No. clusters:                       59
Method:                        Generalized   Min. cluster size:                   4
                      Estimating Equations   Max. cluster size:                   4
Family:                            Poisson   Mean cluster size:                 4.0
Dependence structure:         Independence   Num. iterations:                    51
Date:                     Tue, 19 Aug 2014   Scale:                           5.087
Covariance type:                    robust   Time:                         12:13:20
====================================================================================
                       coef    std err          z      P>|z|  [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.5730      0.361      1.589      0.112      -0.134     1.280
trt[T.progabide]    -0.1519      0.171     -0.888      0.375      -0.487     0.183
age                  0.0223      0.011      1.960      0.050    2.11e-06     0.045
base                 0.0226      0.001     18.451      0.000       0.020     0.025
==============================================================================
Skew:                          3.7823   Kurtosis:                      28.6672
Centered skew:                 2.7597   Centered kurtosis:             21.9865
==============================================================================
```
#print(rslt1b_.summary())
rslt1b.covariance_type, rslt1b_.covariance_type
('robust', 'naive')
rslt1b.bse, rslt1b_.bse
```
(array([ 0.36072614,  0.17105111,  0.01140096,  0.00122675]),
 array([ 0.30598606,  0.10789098,  0.00908399,  0.00114894]))
```
rslt1b_.bse - rslt1.bse
array([ -5.47400849e-02, -6.31601337e-02, -2.31696404e-03, -7.78130388e-05])
bse_naive = np.sqrt(np.diag(rslt1b.naive_covariance))
print(bse_naive)
print(rslt_glm.bse.values)
#np.sqrt(np.diag(rslt1c.robust_covariance_bc))
print(bse_naive / rslt_glm.bse.values)
```
[ 0.30598606  0.10789098  0.00908399  0.00114894]
[ 0.1356608   0.04783413  0.00402744  0.00050939]
[ 2.25552299  2.25552299  2.25552299  2.25552299]
```
# square root of the Pearson-based scale estimate, approximately
# 2.2555 = sqrt(5.087), matching the ratio printed above
np.sqrt(rslt_glm.pearson_chi2 / rslt_glm.model.df_resid)
mod1c = GEE.from_formula("y ~ age + trt + base", "subject", data, cov_struct=ind, family=fam)
rslt1c = mod1c.fit(covariance_type='bias-reduced')
print('\n')
print(rslt1c.summary())
rslt1c.bse - rslt1.bse
np.sqrt(np.diag(rslt1c.robust_covariance_bc))
rslt1c.bse
rslt1b.bse
rslt1b.covariance_type
rslt1b.covariance_type = 'naive'  # manually switch the covariance type on the results instance
del rslt1b._cache['bse']          # drop the cached standard errors so they are recomputed
rslt1b.bse
rslt1b_ = mod1b.fit(covariance_type='naive')
rslt1b_.covariance_type
rslt1b_.bse