# RESEARCH IN PYTHON: USING IVE TO RECOVER THE TREATMENT EFFECT
# by J. NATHAN MATIAS March 18, 2015
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
This section is taken from Chapter 11 of *Methods Matter* by Richard Murnane and John Willett.
In Chapter 10, Murnane and Willett introduce instrumental variables estimation (IVE) as a method for carving out causal claims from observational data (chapter summary) (example code).
In Chapter 11, the authors explain how IVE can be used to "recover" the treatment effect in cases where random assignment is applied to an offer to participate, where not everyone takes the offer, and where other people participate through some other means. They use the example of research on the effect of a financial aid offer on the likelihood that a student finishes 8th grade, using a subset of data from Bogotá from a study on "Vouchers for Private Schooling in Colombia" (2002) by Joshua Angrist, Eric Bettinger, Erik Bloom, Elizabeth King, and Michael Kremer (full data here, subset data here).
The dataset includes the following variables:

* **id**: student identifier
* **won_lottry**: whether the student won the voucher lottery
* **male**: whether the student is male
* **base_age**: the student's age at baseline
* **finish8th**: whether the student finished 8th grade
* **use_fin_aid**: whether the student used financial aid from any source
# THINGS TO IMPORT
# This is a baseline set of libraries I import by default if I'm rushed for time.
import codecs # load UTF-8 Content
import json # load JSON files
import pandas as pd # Pandas handles dataframes
import numpy as np # Numpy handles lots of basic maths operations
import matplotlib.pyplot as plt # Matplotlib for plotting
import seaborn as sns # Seaborn for beautiful plots
from dateutil import * # I prefer dateutil for parsing dates
import math # transformations
import statsmodels.formula.api as smf # for doing statistical regression
import statsmodels.api as sm # access to the wider statsmodels library, including R datasets
from collections import Counter # Counter is useful for grouping and counting
import scipy
import urllib2
import os.path
if(os.path.isfile("colombia_voucher.dta")!=True):
    response = urllib2.urlopen("http://www.ats.ucla.edu/stat/stata/examples/methods_matter/chapter11/colombia_voucher.dta")
    if(response.getcode()==200):
        f = open("colombia_voucher.dta","wb") # binary mode: .dta is a binary format
        f.write(response.read())
        f.close()
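A side note: `urllib2` exists only in Python 2. Under Python 3, the same download step could be sketched with `urllib.request` — the function name here is my own, not from the original post:

```python
import os.path
import urllib.request

def download_if_missing(url, path):
    """Fetch url to path unless the file already exists (Python 3 sketch)."""
    if not os.path.isfile(path):
        with urllib.request.urlopen(url) as response:
            if response.getcode() == 200:
                with open(path, "wb") as f:  # binary mode for .dta files
                    f.write(response.read())
    return path
```

Wrapping the check and the fetch in one function keeps the notebook idempotent: re-running it never re-downloads a file that is already on disk.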
voucher_df = pd.read_stata("colombia_voucher.dta")
print "=============================================================================="
print " OVERALL SUMMARY"
print "=============================================================================="
print voucher_df.describe()
for i in range(2):
    print "=============================================================================="
    print " LOTTERY = %(i)d" % {"i":i}
    print "=============================================================================="
    print voucher_df[voucher_df['won_lottry']==i].describe()
==============================================================================
                               OVERALL SUMMARY
==============================================================================
                id   won_lottry         male     base_age    finish8th  use_fin_aid
count  1171.000000  1171.000000  1171.000000  1171.000000  1171.000000  1171.000000
mean   1357.010248     0.505551     0.504697    12.004270     0.681469     0.581554
std     890.711584     0.500183     0.500192     1.347038     0.466106     0.493515
min       3.000000     0.000000     0.000000     7.000000     0.000000     0.000000
25%     616.000000     0.000000     0.000000    11.000000     0.000000     0.000000
50%    1280.000000     1.000000     1.000000    12.000000     1.000000     1.000000
75%    1982.500000     1.000000     1.000000    13.000000     1.000000     1.000000
max    4030.000000     1.000000     1.000000    17.000000     1.000000     1.000000
==============================================================================
                                 LOTTERY = 0
==============================================================================
                id  won_lottry        male    base_age   finish8th  use_fin_aid
count   579.000000         579  579.000000  579.000000  579.000000   579.000000
mean   1460.998273           0    0.504318   12.036269    0.625216     0.240069
std     960.839468           0    0.500414    1.351814    0.484486     0.427495
min       4.000000           0    0.000000    7.000000    0.000000     0.000000
25%     650.500000           0    0.000000   11.000000    0.000000     0.000000
50%    1392.000000           0    1.000000   12.000000    1.000000     0.000000
75%    2122.500000           0    1.000000   13.000000    1.000000     0.000000
max    4030.000000           0    1.000000   16.000000    1.000000     1.000000
==============================================================================
                                 LOTTERY = 1
==============================================================================
                id  won_lottry        male    base_age   finish8th  use_fin_aid
count   592.000000         592  592.000000  592.000000  592.000000   592.000000
mean   1255.305743           1    0.505068   11.972973    0.736486     0.915541
std     804.217066           0    0.500397    1.342755    0.440911     0.278311
min       3.000000           1    0.000000    9.000000    0.000000     0.000000
25%     578.750000           1    0.000000   11.000000    0.000000     1.000000
50%    1210.000000           1    1.000000   12.000000    1.000000     1.000000
75%    1707.250000           1    1.000000   13.000000    1.000000     1.000000
max    4006.000000           1    1.000000   17.000000    1.000000     1.000000
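Before running any regressions, the subgroup means above already support a back-of-the-envelope IV check (my own sanity check, not part of the chapter's code): the simple Wald estimator divides the difference in 8th-grade completion rates between lottery winners and losers by the difference in their financial-aid take-up rates.

```python
# Means reported in the subgroup summaries above
finish8th_won, finish8th_lost = 0.736486, 0.625216   # finish8th by lottery status
use_aid_won, use_aid_lost = 0.915541, 0.240069       # use_fin_aid by lottery status

itt = finish8th_won - finish8th_lost        # intent-to-treat difference: ~0.111
first_stage = use_aid_won - use_aid_lost    # difference in aid take-up: ~0.675
late = itt / first_stage                    # Wald estimate of the LATE
print(round(late, 3))                       # -> 0.165
```

In words: winning the lottery raised completion by about 11 percentage points overall, but since it moved aid take-up by only about 68 points, the implied effect of aid itself on the affected students is roughly 16 percentage points.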
If you're interested in the rationale and process behind this kind of analysis, see the Chapter 10 introduction to instrumental variables estimation (IVE) (chapter summary) (example code). Here we carry out the two stages by hand: first predicting financial aid use from the lottery, then predicting 8th-grade completion from the fitted values.
print "=============================================================================="
print " FIRST STAGE"
print "=============================================================================="
result = smf.glm(formula = "use_fin_aid ~ won_lottry + male + base_age",
data=voucher_df,
family=sm.families.Binomial()).fit()
voucher_df['use_fin_aid_fitted']= result.predict()
print result.summary()
print
print
print "=============================================================================="
print " SECOND STAGE"
print "=============================================================================="
result = smf.glm(formula = " finish8th ~ use_fin_aid_fitted + male + base_age",
data=voucher_df,
family=sm.families.Binomial()).fit()
print result.summary()
==============================================================================
                                 FIRST STAGE
==============================================================================
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            use_fin_aid   No. Observations:                 1171
Model:                            GLM   Df Residuals:                     1167
Model Family:                Binomial   Df Model:                            3
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -488.00
Date:                Thu, 19 Mar 2015   Deviance:                       975.99
Time:                        23:08:46   Pearson chi2:                 1.16e+03
No. Iterations:                     7
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.3455      0.731      0.472      0.637      -1.088      1.779
won_lottry     3.5514      0.178     19.934      0.000       3.202      3.901
male          -0.1622      0.164     -0.992      0.321      -0.483      0.158
base_age      -0.1184      0.061     -1.946      0.052      -0.238      0.001
==============================================================================

==============================================================================
                                 SECOND STAGE
==============================================================================
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:              finish8th   No. Observations:                 1171
Model:                            GLM   Df Residuals:                     1167
Model Family:                Binomial   Df Model:                            3
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -696.65
Date:                Thu, 19 Mar 2015   Deviance:                       1393.3
Time:                        23:08:46   Pearson chi2:                 1.17e+03
No. Iterations:                      6
======================================================================================
                         coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Intercept              4.0756      0.604      6.753      0.000       2.893      5.258
use_fin_aid_fitted     0.7743      0.192      4.036      0.000       0.398      1.150
male                  -0.4175      0.130     -3.208      0.001      -0.673     -0.162
base_age              -0.2919      0.048     -6.077      0.000      -0.386     -0.198
======================================================================================
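As an aside, the textbook version of this two-stage procedure (two-stage least squares) is linear in both stages rather than logit. A minimal numpy sketch on simulated data — the data-generating process and all names here are my own illustration, not from the chapter:

```python
import numpy as np

rng = np.random.RandomState(42)
n = 200000
z = rng.binomial(1, 0.5, n)                       # instrument: randomized offer
u = rng.normal(size=n)                            # unobserved confounder
d = (0.5 * z + 0.4 * u + rng.normal(size=n) > 0.4).astype(float)  # endogenous take-up
y = 0.3 * d + 0.5 * u + rng.normal(size=n)        # outcome; true effect of d is 0.3

# Stage 1: regress take-up on the instrument, keep the fitted values
X1 = np.column_stack([np.ones(n), z])
d_hat = X1.dot(np.linalg.lstsq(X1, d, rcond=None)[0])
# Stage 2: regress the outcome on the fitted take-up
X2 = np.column_stack([np.ones(n), d_hat])
beta = np.linalg.lstsq(X2, y, rcond=None)[0]
print(beta[1])  # should land near the true effect, 0.3
```

Because `u` raises both take-up and the outcome, a naive regression of `y` on `d` would be biased upward; replacing `d` with the part predicted by the randomized offer strips out that confounding.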
When we use IVE to "recover" the treatment effect, how can we actually describe the results? According to Murnane and Willett, "an estimate of a treatment effect obtained by IV methods should be regarded as an estimated local average treatment effect (LATE)." The chapter walks readers through the kinds of groups involved:
| **won_lottery = 1** | **won_lottery = 0** | |
| --- | --- | --- |
| use_fin_aid = 1 (*used financial aid from some source*) | use_fin_aid = 0 (*did not use financial aid from any source*) | "**Compliers**" |
| use_fin_aid = 1 (*used financial aid from some source*) | use_fin_aid = 1 (*used financial aid from some source*) | "**Always-Takers**" |
| use_fin_aid = 0 (*did not use financial aid from any source*) | use_fin_aid = 0 (*did not use financial aid from any source*) | "**Never-Takers**" |
Murnane and Willett offer a model that distinguishes among groups based on their compliance with "the intent of the lottery" (277), based on a paper by Angrist, Imbens and Rubin on "Identification of Causal Effects Using Instrumental Variables" (1996):
In this context, IV estimates of the local average treatment effect (LATE) for this quasi-experiment apply only to "compliers" -- and not to never-takers or always-takers.
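To see why the LATE is confined to compliers, here is a small simulation of my own (the group shares and probabilities are invented for illustration). Always-takers and never-takers don't change behavior with the lottery, so the intent-to-treat contrast is diluted by them, while dividing by the first-stage take-up difference recovers the effect among the students the lottery actually moved:

```python
import numpy as np

rng = np.random.RandomState(1)
n = 200000
# Compliance types: 50% compliers, 25% always-takers, 25% never-takers
ctype = rng.choice(3, size=n, p=[0.5, 0.25, 0.25])      # 0=complier, 1=always, 2=never
z = rng.binomial(1, 0.5, n)                             # randomized lottery offer
d = np.where(ctype == 1, 1, np.where(ctype == 2, 0, z)) # financial aid take-up
# Using aid raises the completion probability from 0.60 to 0.70
y = rng.binomial(1, 0.60 + 0.10 * d)

itt = y[z == 1].mean() - y[z == 0].mean()             # ~0.05: diluted by non-compliers
late = itt / (d[z == 1].mean() - d[z == 0].mean())    # ~0.10: effect among compliers
```

The intent-to-treat estimate here is only about half the true 10-point effect, because only half the sample complies with the lottery; the IV ratio undoes exactly that dilution, and says nothing about never-takers or always-takers.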