Author: Ties de Kok tdekok@uw.edu
Homepage: https://github.com/TiesdeKok/ipystata
PyPi: https://pypi.python.org/pypi/ipystata
Stata Batch Mode

This example notebook uses the Stata batch mode method. See GitHub for an example notebook using the Windows-only Stata automation method.
import pandas as pd
import ipystata
from ipystata.config import config_stata
config_stata('/home/user/stata15/stata-se')
# config_stata(r"D:\Software\stata15\StataSE-64.exe", force_batch=True)  # Windows example; raw string avoids backslash escapes
Note: for this change to take effect you need to restart the notebook kernel (Kernel --> Restart).
%%stata
display "Hello, I am printed by Stata."
Hello, I am printed by Stata.
The code cell below runs the Stata command sysuse auto.dta
to load the dataset and returns it to Python as a dataframe via the -o car_df
argument.
%%stata -o car_df
sysuse auto.dta
(1978 Automobile Data)
car_df
is a regular Pandas dataframe on which Python / Pandas actions can be performed.
car_df.head()
 | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | AMC Concord | 4099 | 22 | 3.0 | 2.5 | 11 | 2930 | 186 | 40 | 121 | 3.58 | Domestic
1 | AMC Pacer | 4749 | 17 | 3.0 | 3.0 | 11 | 3350 | 173 | 40 | 258 | 2.53 | Domestic
2 | AMC Spirit | 3799 | 22 | NaN | 3.0 | 12 | 2640 | 168 | 35 | 121 | 3.08 | Domestic
3 | Buick Century | 4816 | 20 | 3.0 | 4.5 | 16 | 3250 | 196 | 40 | 196 | 2.93 | Domestic
4 | Buick Electra | 7827 | 15 | 4.0 | 4.0 | 20 | 4080 | 222 | 43 | 350 | 2.41 | Domestic
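As an illustration, since car_df itself lives in the notebook session, the sketch below rebuilds a few rows from the table above and applies ordinary pandas operations to them:

```python
import pandas as pd

# Stand-in for a few columns of car_df (values taken from the table above)
df = pd.DataFrame({
    'make': ['AMC Concord', 'AMC Pacer', 'AMC Spirit'],
    'price': [4099, 4749, 3799],
    'mpg': [22, 17, 22],
})

# Any regular pandas operation applies, e.g. filtering and aggregation
cheap = df[df['price'] < 4500]
avg_mpg = df['mpg'].mean()
```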
The argument -d or --data
is used to define which dataframe should be set as the dataset in Stata.
In the example below the Stata function tabulate
is used to generate descriptive statistics for the dataframe car_df.
%%stata -d car_df
tabulate foreign headroom
           |                                    headroom
   foreign |       1.5          2        2.5          3        3.5          4        4.5          5 |     Total
-----------+----------------------------------------------------------------------------------------+----------
  Domestic |         3         10          4          7         13         10          4          1 |        52
   Foreign |         1          3         10          6          2          0          0          0 |        22
-----------+----------------------------------------------------------------------------------------+----------
     Total |         4         13         14         13         15         10          4          1 |        74
These descriptive statistics can be replicated in Pandas using the crosstab
function, see the code below.
pd.crosstab(car_df['foreign'], car_df['headroom'], margins=True)
foreign \ headroom | 1.5 | 2.0 | 2.5 | 3.0 | 3.5 | 4.0 | 4.5 | 5.0 | All
---|---|---|---|---|---|---|---|---|---
Domestic | 3 | 10 | 4 | 7 | 13 | 10 | 4 | 1 | 52
Foreign | 1 | 3 | 10 | 6 | 2 | 0 | 0 | 0 | 22
All | 4 | 13 | 14 | 13 | 15 | 10 | 4 | 1 | 74
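Beyond raw counts, pandas' crosstab also supports a normalize parameter for proportions. A sketch on a small stand-in dataframe (crosstab labels its margins row and column "All" by default):

```python
import pandas as pd

# Small stand-in sample with the same two variables
df = pd.DataFrame({
    'foreign': ['Domestic', 'Domestic', 'Foreign', 'Foreign', 'Foreign'],
    'headroom': [2.5, 3.0, 2.5, 2.5, 3.0],
})

# Counts with row/column totals, and row-wise proportions
counts = pd.crosstab(df['foreign'], df['headroom'], margins=True)
shares = pd.crosstab(df['foreign'], df['headroom'], normalize='index')
```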
Note: due to a limitation of Stata it currently returns the graph as a PDF.
This is a temporary workaround that I hope to find a more suitable fix for in the future.
%%stata -gr
use https://stats.idre.ucla.edu/stat/data/hsb2.dta, clear
graph twoway scatter read math
(highschool and beyond (200 cases))
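For comparison, a similar scatter plot can be drawn on the Python side with matplotlib. A sketch with placeholder values, since the real hsb2 read and math scores live in the Stata session:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Placeholder values, not the actual hsb2 data
read = [57, 47, 44, 63, 47, 60]
math = [52, 57, 41, 65, 51, 58]

fig, ax = plt.subplots()
ax.scatter(math, read)
ax.set_xlabel('math')
ax.set_ylabel('read')
fig.savefig('read_math_scatter.png')
```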
In many situations it is convenient to define values or variable names in a Python list or, equivalently, in a Stata macro.
The -i or --input
argument makes a Python list available for use in Stata as a local macro.
For example, -i main_var
converts the Python list ['mpg', 'rep78']
into the Stata local macro `` `main_var' ``.
main_var = ['mpg', 'rep78']
control_var = ['gear_ratio', 'trunk', 'weight', 'displacement']
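Judging from the display output of the next cell, the macro contents are simply the list elements joined by spaces. A minimal Python sketch of that mapping (the conversion itself is handled by ipystata):

```python
main_var = ['mpg', 'rep78']

# What `main_var' expands to inside Stata: a space-separated token list
macro_value = ' '.join(main_var)
```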
%%stata -d car_df -i main_var -i control_var
display "`main_var'"
display "`control_var'"
regress price `main_var' `control_var', vce(robust)
mpg rep78
gear_ratio trunk weight displacement

Linear regression                               Number of obs     =         69
                                                F(6, 62)          =       8.60
                                                Prob > F          =     0.0000
                                                R-squared         =     0.4124
                                                Root MSE          =     2338.1

------------------------------------------------------------------------------
             |               Robust
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -76.95578   84.95038    -0.91   0.369    -246.7692     92.8576
       rep78 |   899.0818   299.7541     3.00   0.004      299.882    1498.282
  gear_ratio |   1479.744   917.5363     1.61   0.112    -354.3846    3313.873
       trunk |  -110.3163   80.16622    -1.38   0.174    -270.5663    49.93365
      weight |   1.139509   1.187361     0.96   0.341    -1.233991     3.51301
displacement |   17.82274   8.523647     2.09   0.041     .7842094    34.86126
       _cons |  -5163.323   4965.389    -1.04   0.302    -15088.99    4762.348
------------------------------------------------------------------------------
It is possible to create new variables or modify the existing dataset in Stata and have the result returned as a Pandas dataframe.
In the example below the output -o car_df
overwrites the input data -d car_df
, effectively modifying the dataframe in place.
Note: the argument -np or --noprint
can be used to suppress any output below the code cell.
%%stata -d car_df -o car_df -np
generate weight_squared = weight^2
generate log_weight = log(weight)
car_df.head(3)
 | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | weight_squared | log_weight
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | AMC Concord | 4099 | 22 | 3.0 | 2.5 | 11 | 2930 | 186 | 40 | 121 | 3.58 | Domestic | 8584900.0 | 7.982758
1 | AMC Pacer | 4749 | 17 | 3.0 | 3.0 | 11 | 3350 | 173 | 40 | 258 | 2.53 | Domestic | 11222500.0 | 8.116715
2 | AMC Spirit | 3799 | 22 | NaN | 3.0 | 12 | 2640 | 168 | 35 | 121 | 3.08 | Domestic | 6969600.0 | 7.878534
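The same derived columns can be created directly in pandas. A sketch on a stand-in dataframe with weights taken from the table above; note that Stata's log() is the natural logarithm, which matches numpy.log:

```python
import numpy as np
import pandas as pd

# Stand-in rows (weight values from the table above)
df = pd.DataFrame({'weight': [2930, 3350, 2640]})

# Pandas equivalents of the two Stata `generate` commands
df['weight_squared'] = df['weight'] ** 2
df['log_weight'] = np.log(df['weight'])  # natural log, like Stata's log()
```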
The -cwd argument sets the working directory for the Stata session; it accepts either a Python variable containing a path or a literal path.
directory = '~/sandbox'
%%stata -cwd directory -np
display "`c(pwd)'"
%%stata -cwd '~/sandbox' -np
display "`c(pwd)'"
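On the Python side, the leading ~ in such a path resolves to the user's home directory. A small sketch of the equivalent expansion with os.path.expanduser (the exact result depends on the machine):

```python
import os

# Expand ~ to the current user's home directory, e.g. '/home/user/sandbox'
path = os.path.expanduser('~/sandbox')
```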
Create the variable large
in Python and use it as the dependent variable for a binary choice estimation in Stata.
car_df['large'] = [1 if x > 3 and y > 200 else 0 for x, y in zip(car_df['headroom'], car_df['length'])]
car_df[['headroom', 'length', 'large']].head(7)
headroom | length | large | |
---|---|---|---|
0 | 2.5 | 186 | 0 |
1 | 3.0 | 173 | 0 |
2 | 3.0 | 168 | 0 |
3 | 4.5 | 196 | 0 |
4 | 4.0 | 222 | 1 |
5 | 4.0 | 218 | 1 |
6 | 3.0 | 170 | 0 |
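The list comprehension above can also be written as a vectorized pandas expression. A sketch on a stand-in dataframe with the headroom and length values from the table above:

```python
import pandas as pd

# Stand-in for the relevant columns of car_df (values from the table above)
df = pd.DataFrame({
    'headroom': [2.5, 3.0, 3.0, 4.5, 4.0, 4.0, 3.0],
    'length':   [186, 173, 168, 196, 222, 218, 170],
})

# Vectorized equivalent of the list comprehension: 1 where both conditions hold
df['large'] = ((df['headroom'] > 3) & (df['length'] > 200)).astype(int)
```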
%%stata -d car_df -i main_var -i control_var
logit large `main_var' `control_var', vce(cluster make)
Iteration 0:   log pseudolikelihood =  -39.60355
Iteration 1:   log pseudolikelihood = -19.307161
Iteration 2:   log pseudolikelihood = -13.526857
Iteration 3:   log pseudolikelihood = -10.999644
Iteration 4:   log pseudolikelihood = -10.726345
Iteration 5:   log pseudolikelihood = -10.723111
Iteration 6:   log pseudolikelihood = -10.723109
Iteration 7:   log pseudolikelihood = -10.723109

Logistic regression                             Number of obs     =         69
                                                Wald chi2(6)      =      12.90
                                                Prob > chi2       =     0.0446
Log pseudolikelihood = -10.723109               Pseudo R2         =     0.7292

                                  (Std. Err. adjusted for 69 clusters in make)
------------------------------------------------------------------------------
             |               Robust
       large |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -.5846335   .2941083    -1.99   0.047    -1.161075   -.0081918
       rep78 |  -1.298127   1.264918    -1.03   0.305    -3.777322    1.181067
  gear_ratio |  -1.331913   3.389448    -0.39   0.694    -7.975109    5.311283
       trunk |   1.210178   .4830082     2.51   0.012     .2634991    2.156856
      weight |  -.0007284   .0022358    -0.33   0.745    -.0051105    .0036536
displacement |    .001631   .0160425     0.10   0.919    -.0298119    .0330738
       _cons |  -.2977676    16.7841    -0.02   0.986    -33.19401    32.59847
------------------------------------------------------------------------------
Note: 8 failures and 0 successes completely determined.