# RESEARCH IN PYTHON: USING IVE TO RECOVER THE TREATMENT EFFECT
# by J. NATHAN MATIAS March 18, 2015
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
This section is taken from Chapter 11 of *Methods Matter* by Richard Murnane and John Willett.
In Chapter 10, Murnane and Willett introduce instrumental variables estimation (IVE) as a method for carving out causal claims from observational data (chapter summary) (example code).
In Chapter 11, the authors explain how IVE can be used to "recover" the treatment effect in cases where random assignment is applied to an offer to participate, where not everyone takes the offer, and where other people participate through some other means. They use the example of research on the effect of a financial aid offer on the likelihood that a student finishes 8th grade, using a subset of data from Bogotá from a study on "Vouchers for Private Schooling in Colombia" (2002) by Joshua Angrist, Eric Bettinger, Erik Bloom, Elizabeth King, and Michael Kremer (full data here, subset data here).
The dataset includes the following variables:

* **id**: student identifier
* **won_lottry**: whether the student won the voucher lottery
* **male**: whether the student is male
* **base_age**: the student's age at baseline
* **finish8th**: whether the student finished 8th grade
* **use_fin_aid**: whether the student used financial aid from any source
# THINGS TO IMPORT
# This is a baseline set of libraries I import by default if I'm rushed for time.
import codecs # load UTF-8 Content
import json # load JSON files
import pandas as pd # Pandas handles dataframes
import numpy as np # Numpy handles lots of basic maths operations
import matplotlib.pyplot as plt # Matplotlib for plotting
import seaborn as sns # Seaborn for beautiful plots
from dateutil import * # I prefer dateutil for parsing dates
import math # transformations
import statsmodels.formula.api as smf # for doing statistical regression
import statsmodels.api as sm # access to the wider statsmodels library, including R datasets
from collections import Counter # Counter is useful for grouping and counting
import scipy
import urllib2
import os.path
if(os.path.isfile("colombia_voucher.dta")!=True):
    response = urllib2.urlopen("http://www.ats.ucla.edu/stat/stata/examples/methods_matter/chapter11/colombia_voucher.dta")
    if(response.getcode()==200):
        f = open("colombia_voucher.dta","wb") # binary mode: .dta is a binary format
        f.write(response.read())
        f.close()
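A side note: `urllib2` exists only in Python 2. Under Python 3, the same download step could be sketched with `urllib.request` — the function name here is my own, not from the original post:

```python
import os.path
import urllib.request

def download_if_missing(url, path):
    """Fetch url to path unless the file already exists (Python 3 sketch)."""
    if not os.path.isfile(path):
        with urllib.request.urlopen(url) as response:
            if response.getcode() == 200:
                with open(path, "wb") as f:  # binary mode for .dta files
                    f.write(response.read())
    return path
```

Wrapping the check and the fetch in one function keeps the notebook idempotent: re-running it never re-downloads a file that is already on disk.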
voucher_df = pd.read_stata("colombia_voucher.dta")
print "=============================================================================="
print " OVERALL SUMMARY"
print "=============================================================================="
print voucher_df.describe()
for i in range(2):
    print "=============================================================================="
    print " LOTTERY = %(i)d" % {"i":i}
    print "=============================================================================="
    print voucher_df[voucher_df['won_lottry']==i].describe()
==============================================================================
                               OVERALL SUMMARY
==============================================================================
                id   won_lottry         male     base_age    finish8th  use_fin_aid
count  1171.000000  1171.000000  1171.000000  1171.000000  1171.000000  1171.000000
mean   1357.010248     0.505551     0.504697    12.004270     0.681469     0.581554
std     890.711584     0.500183     0.500192     1.347038     0.466106     0.493515
min       3.000000     0.000000     0.000000     7.000000     0.000000     0.000000
25%     616.000000     0.000000     0.000000    11.000000     0.000000     0.000000
50%    1280.000000     1.000000     1.000000    12.000000     1.000000     1.000000
75%    1982.500000     1.000000     1.000000    13.000000     1.000000     1.000000
max    4030.000000     1.000000     1.000000    17.000000     1.000000     1.000000
==============================================================================
                                 LOTTERY = 0
==============================================================================
                id  won_lottry        male    base_age   finish8th  use_fin_aid
count   579.000000         579  579.000000  579.000000  579.000000   579.000000
mean   1460.998273           0    0.504318   12.036269    0.625216     0.240069
std     960.839468           0    0.500414    1.351814    0.484486     0.427495
min       4.000000           0    0.000000    7.000000    0.000000     0.000000
25%     650.500000           0    0.000000   11.000000    0.000000     0.000000
50%    1392.000000           0    1.000000   12.000000    1.000000     0.000000
75%    2122.500000           0    1.000000   13.000000    1.000000     0.000000
max    4030.000000           0    1.000000   16.000000    1.000000     1.000000
==============================================================================
                                 LOTTERY = 1
==============================================================================
                id  won_lottry        male    base_age   finish8th  use_fin_aid
count   592.000000         592  592.000000  592.000000  592.000000   592.000000
mean   1255.305743           1    0.505068   11.972973    0.736486     0.915541
std     804.217066           0    0.500397    1.342755    0.440911     0.278311
min       3.000000           1    0.000000    9.000000    0.000000     0.000000
25%     578.750000           1    0.000000   11.000000    0.000000     1.000000
50%    1210.000000           1    1.000000   12.000000    1.000000     1.000000
75%    1707.250000           1    1.000000   13.000000    1.000000     1.000000
max    4006.000000           1    1.000000   17.000000    1.000000     1.000000
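Before running any regressions, the subgroup means above already support a back-of-the-envelope IV check (my own sanity check, not part of the chapter's code): the simple Wald estimator divides the difference in 8th-grade completion rates between lottery winners and losers by the difference in their financial-aid take-up rates.

```python
# Means reported in the subgroup summaries above
finish8th_won, finish8th_lost = 0.736486, 0.625216   # finish8th by lottery status
use_aid_won, use_aid_lost = 0.915541, 0.240069       # use_fin_aid by lottery status

itt = finish8th_won - finish8th_lost        # intent-to-treat difference: ~0.111
first_stage = use_aid_won - use_aid_lost    # difference in aid take-up: ~0.675
late = itt / first_stage                    # Wald estimate of the LATE
print(round(late, 3))                       # -> 0.165
```

In words: winning the lottery raised completion by about 11 percentage points overall, but since it moved aid take-up by only about 68 points, the implied effect of aid itself on the affected students is roughly 16 percentage points.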
If you're interested in the rationale and process behind this kind of analysis, see the Chapter 10 introduction to instrumental variables estimation (IVE) (chapter summary) (example code). Here we carry out the two stages by hand: first predicting financial aid use from the lottery, then predicting 8th-grade completion from the fitted values.
print "=============================================================================="
print " FIRST STAGE"
print "=============================================================================="
result = smf.glm(formula = "use_fin_aid ~ won_lottry + male + base_age",
data=voucher_df,
family=sm.families.Binomial()).fit()
voucher_df['use_fin_aid_fitted']= result.predict()
print result.summary()
print
print
print "=============================================================================="
print " SECOND STAGE"
print "=============================================================================="
result = smf.glm(formula = " finish8th ~ use_fin_aid_fitted + male + base_age",
data=voucher_df,
family=sm.families.Binomial()).fit()
print result.summary()
==============================================================================
                                 FIRST STAGE
==============================================================================
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            use_fin_aid   No. Observations:                 1171
Model:                            GLM   Df Residuals:                     1167
Model Family:                Binomial   Df Model:                            3
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -488.00
Date:                Thu, 19 Mar 2015   Deviance:                       975.99
Time:                        23:08:46   Pearson chi2:                 1.16e+03
No. Iterations:                     7
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.3455      0.731      0.472      0.637      -1.088      1.779
won_lottry     3.5514      0.178     19.934      0.000       3.202      3.901
male          -0.1622      0.164     -0.992      0.321      -0.483      0.158
base_age      -0.1184      0.061     -1.946      0.052      -0.238      0.001
==============================================================================

==============================================================================
                                 SECOND STAGE
==============================================================================
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:              finish8th   No. Observations:                 1171
Model:                            GLM   Df Residuals:                     1167
Model Family:                Binomial   Df Model:                            3
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -696.65
Date:                Thu, 19 Mar 2015   Deviance:                       1393.3
Time:                        23:08:46   Pearson chi2:                 1.17e+03
No. Iterations:                      6
======================================================================================
                         coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Intercept              4.0756      0.604      6.753      0.000       2.893      5.258
use_fin_aid_fitted     0.7743      0.192      4.036      0.000       0.398      1.150
male                  -0.4175      0.130     -3.208      0.001      -0.673     -0.162
base_age              -0.2919      0.048     -6.077      0.000      -0.386     -0.198
======================================================================================
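As an aside, the textbook version of this two-stage procedure (two-stage least squares) is linear in both stages rather than logit. A minimal numpy sketch on simulated data — the data-generating process and all names here are my own illustration, not from the chapter:

```python
import numpy as np

rng = np.random.RandomState(42)
n = 200000
z = rng.binomial(1, 0.5, n)                       # instrument: randomized offer
u = rng.normal(size=n)                            # unobserved confounder
d = (0.5 * z + 0.4 * u + rng.normal(size=n) > 0.4).astype(float)  # endogenous take-up
y = 0.3 * d + 0.5 * u + rng.normal(size=n)        # outcome; true effect of d is 0.3

# Stage 1: regress take-up on the instrument, keep the fitted values
X1 = np.column_stack([np.ones(n), z])
d_hat = X1.dot(np.linalg.lstsq(X1, d, rcond=None)[0])
# Stage 2: regress the outcome on the fitted take-up
X2 = np.column_stack([np.ones(n), d_hat])
beta = np.linalg.lstsq(X2, y, rcond=None)[0]
print(beta[1])  # should land near the true effect, 0.3
```

Because `u` raises both take-up and the outcome, a naive regression of `y` on `d` would be biased upward; replacing `d` with the part predicted by the randomized offer strips out that confounding.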
When we use IVE to "recover" the treatment effect, how can we actually describe the results? According to Murnane and Willett, "an estimate of a treatment effect obtained by IV methods should be regarded as an estimated local average treatment effect (LATE)." The chapter walks readers through the kinds of groups involved:
| **won_lottery = 1** | **won_lottery = 0** | |
| --- | --- | --- |
| use_fin_aid = 1 (*used financial aid from some source*) | use_fin_aid = 0 (*did not use financial aid from any source*) | "**Compliers**" |
| use_fin_aid = 1 (*used financial aid from some source*) | use_fin_aid = 1 (*used financial aid from some source*) | "**Always-Takers**" |
| use_fin_aid = 0 (*did not use financial aid from any source*) | use_fin_aid = 0 (*did not use financial aid from any source*) | "**Never-Takers**" |
Murnane and Willett offer a model that distinguishes among groups based on their compliance with "the intent of the lottery" (277), based on a paper by Angrist, Imbens and Rubin on "Identification of Causal Effects Using Instrumental Variables" (1996):
In this context, IV estimates of the local average treatment effect (LATE) for this quasi-experiment apply only to "compliers" -- and not to never-takers or always-takers.
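To see why the LATE is confined to compliers, here is a small simulation of my own (the group shares and probabilities are invented for illustration). Always-takers and never-takers don't change behavior with the lottery, so the intent-to-treat contrast is diluted by them, while dividing by the first-stage take-up difference recovers the effect among the students the lottery actually moved:

```python
import numpy as np

rng = np.random.RandomState(1)
n = 200000
# Compliance types: 50% compliers, 25% always-takers, 25% never-takers
ctype = rng.choice(3, size=n, p=[0.5, 0.25, 0.25])      # 0=complier, 1=always, 2=never
z = rng.binomial(1, 0.5, n)                             # randomized lottery offer
d = np.where(ctype == 1, 1, np.where(ctype == 2, 0, z)) # financial aid take-up
# Using aid raises the completion probability from 0.60 to 0.70
y = rng.binomial(1, 0.60 + 0.10 * d)

itt = y[z == 1].mean() - y[z == 0].mean()             # ~0.05: diluted by non-compliers
late = itt / (d[z == 1].mean() - d[z == 0].mean())    # ~0.10: effect among compliers
```

The intent-to-treat estimate here is only about half the true 10-point effect, because only half the sample complies with the lottery; the IV ratio undoes exactly that dilution, and says nothing about never-takers or always-takers.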