Author: Ties de Kok tdekok@uw.edu
Homepage: https://github.com/TiesdeKok/ipystata
PyPi: https://pypi.python.org/pypi/ipystata
Stata Batch Mode

This example notebook uses the Stata batch mode method. See GitHub for an example notebook using the Windows-only Stata automation method.
import pandas as pd
import ipystata
from ipystata.config import config_stata
config_stata('/home/user/stata15/stata-se')
# config_stata(r"D:\Software\stata15\StataSE-64.exe", force_batch=True)  # Windows example; raw string avoids backslash escapes
Note: for this change to take effect you need to restart the notebook kernel (Kernel --> Restart).
%%stata
display "Hello, I am printed by Stata."
Hello, I am printed by Stata.
The code cell below runs the Stata command sysuse auto.dta
to load the dataset and returns it to Python as a dataframe via the -o car_df
argument.
%%stata -o car_df
sysuse auto.dta
(1978 Automobile Data)
car_df
is a regular Pandas dataframe on which Python / Pandas actions can be performed.
car_df.head()
 | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | AMC Concord | 4099 | 22 | 3.0 | 2.5 | 11 | 2930 | 186 | 40 | 121 | 3.58 | Domestic
1 | AMC Pacer | 4749 | 17 | 3.0 | 3.0 | 11 | 3350 | 173 | 40 | 258 | 2.53 | Domestic
2 | AMC Spirit | 3799 | 22 | NaN | 3.0 | 12 | 2640 | 168 | 35 | 121 | 3.08 | Domestic
3 | Buick Century | 4816 | 20 | 3.0 | 4.5 | 16 | 3250 | 196 | 40 | 196 | 2.93 | Domestic
4 | Buick Electra | 7827 | 15 | 4.0 | 4.0 | 20 | 4080 | 222 | 43 | 350 | 2.41 | Domestic
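As an illustration, since car_df itself lives in the notebook session, the sketch below rebuilds a few rows from the table above and applies ordinary pandas operations to them:

```python
import pandas as pd

# Stand-in for a few columns of car_df (values taken from the table above)
df = pd.DataFrame({
    'make': ['AMC Concord', 'AMC Pacer', 'AMC Spirit'],
    'price': [4099, 4749, 3799],
    'mpg': [22, 17, 22],
})

# Any regular pandas operation applies, e.g. filtering and aggregation
cheap = df[df['price'] < 4500]
avg_mpg = df['mpg'].mean()
```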
The argument -d or --data
is used to define which dataframe should be set as the dataset in Stata.
In the example below the Stata function tabulate
is used to generate descriptive statistics for the dataframe car_df.
%%stata -d car_df
tabulate foreign headroom
           |                                    headroom
   foreign |       1.5          2        2.5          3        3.5          4        4.5          5 |     Total
-----------+----------------------------------------------------------------------------------------+----------
  Domestic |         3         10          4          7         13         10          4          1 |        52
   Foreign |         1          3         10          6          2          0          0          0 |        22
-----------+----------------------------------------------------------------------------------------+----------
     Total |         4         13         14         13         15         10          4          1 |        74
These descriptive statistics can be replicated in Pandas using the crosstab
function, see the code below.
pd.crosstab(car_df['foreign'], car_df['headroom'], margins=True)
foreign \ headroom | 1.5 | 2.0 | 2.5 | 3.0 | 3.5 | 4.0 | 4.5 | 5.0 | All
---|---|---|---|---|---|---|---|---|---
Domestic | 3 | 10 | 4 | 7 | 13 | 10 | 4 | 1 | 52
Foreign | 1 | 3 | 10 | 6 | 2 | 0 | 0 | 0 | 22
All | 4 | 13 | 14 | 13 | 15 | 10 | 4 | 1 | 74
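Beyond raw counts, pandas' crosstab also supports a normalize parameter for proportions. A sketch on a small stand-in dataframe (crosstab labels its margins row and column "All" by default):

```python
import pandas as pd

# Small stand-in sample with the same two variables
df = pd.DataFrame({
    'foreign': ['Domestic', 'Domestic', 'Foreign', 'Foreign', 'Foreign'],
    'headroom': [2.5, 3.0, 2.5, 2.5, 3.0],
})

# Counts with row/column totals, and row-wise proportions
counts = pd.crosstab(df['foreign'], df['headroom'], margins=True)
shares = pd.crosstab(df['foreign'], df['headroom'], normalize='index')
```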
Note: due to a limitation of Stata it currently returns the graph as a PDF.
This is a temporary workaround that I hope to find a more suitable fix for in the future.
%%stata -gr
use https://stats.idre.ucla.edu/stat/data/hsb2.dta, clear
graph twoway scatter read math
(highschool and beyond (200 cases))
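For comparison, a similar scatter plot can be drawn on the Python side with matplotlib. A sketch with placeholder values, since the real hsb2 read and math scores live in the Stata session:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Placeholder values, not the actual hsb2 data
read = [57, 47, 44, 63, 47, 60]
math = [52, 57, 41, 65, 51, 58]

fig, ax = plt.subplots()
ax.scatter(math, read)
ax.set_xlabel('math')
ax.set_ylabel('read')
fig.savefig('read_math_scatter.png')
```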
In many situations it is convenient to define values or variable names in a Python list or, equivalently, in a Stata macro.
The -i or --input
argument makes a Python list available for use in Stata as a local macro.
For example, -i main_var
converts the Python list ['mpg', 'rep78']
into the Stata local macro `` `main_var' ``.
main_var = ['mpg', 'rep78']
control_var = ['gear_ratio', 'trunk', 'weight', 'displacement']
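Judging from the display output of the next cell, the macro contents are simply the list elements joined by spaces. A minimal Python sketch of that mapping (the conversion itself is handled by ipystata):

```python
main_var = ['mpg', 'rep78']

# What `main_var' expands to inside Stata: a space-separated token list
macro_value = ' '.join(main_var)
```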
%%stata -d car_df -i main_var -i control_var
display "`main_var'"
display "`control_var'"
regress price `main_var' `control_var', vce(robust)
mpg rep78
gear_ratio trunk weight displacement

Linear regression                               Number of obs     =         69
                                                F(6, 62)          =       8.60
                                                Prob > F          =     0.0000
                                                R-squared         =     0.4124
                                                Root MSE          =     2338.1

------------------------------------------------------------------------------
             |               Robust
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -76.95578   84.95038    -0.91   0.369    -246.7692     92.8576
       rep78 |   899.0818   299.7541     3.00   0.004      299.882    1498.282
  gear_ratio |   1479.744   917.5363     1.61   0.112    -354.3846    3313.873
       trunk |  -110.3163   80.16622    -1.38   0.174    -270.5663    49.93365
      weight |   1.139509   1.187361     0.96   0.341    -1.233991     3.51301
displacement |   17.82274   8.523647     2.09   0.041     .7842094    34.86126
       _cons |  -5163.323   4965.389    -1.04   0.302    -15088.99    4762.348
------------------------------------------------------------------------------
It is possible to create new variables or modify the existing dataset in Stata and have the result returned as a Pandas dataframe.
In the example below the output -o car_df
overwrites the input data -d car_df
, effectively modifying the dataframe in place.
Note: the argument -np or --noprint
can be used to suppress any output below the code cell.
%%stata -d car_df -o car_df -np
generate weight_squared = weight^2
generate log_weight = log(weight)
car_df.head(3)
 | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | weight_squared | log_weight
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | AMC Concord | 4099 | 22 | 3.0 | 2.5 | 11 | 2930 | 186 | 40 | 121 | 3.58 | Domestic | 8584900.0 | 7.982758
1 | AMC Pacer | 4749 | 17 | 3.0 | 3.0 | 11 | 3350 | 173 | 40 | 258 | 2.53 | Domestic | 11222500.0 | 8.116715
2 | AMC Spirit | 3799 | 22 | NaN | 3.0 | 12 | 2640 | 168 | 35 | 121 | 3.08 | Domestic | 6969600.0 | 7.878534
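The same derived columns can be created directly in pandas. A sketch on a stand-in dataframe with weights taken from the table above; note that Stata's log() is the natural logarithm, which matches numpy.log:

```python
import numpy as np
import pandas as pd

# Stand-in rows (weight values from the table above)
df = pd.DataFrame({'weight': [2930, 3350, 2640]})

# Pandas equivalents of the two Stata `generate` commands
df['weight_squared'] = df['weight'] ** 2
df['log_weight'] = np.log(df['weight'])  # natural log, like Stata's log()
```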
The -cwd argument sets the working directory for the Stata session; it accepts either a Python variable containing a path or a literal path.
directory = '~/sandbox'
%%stata -cwd directory -np
display "`c(pwd)'"
%%stata -cwd '~/sandbox' -np
display "`c(pwd)'"
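On the Python side, the leading ~ in such a path resolves to the user's home directory. A small sketch of the equivalent expansion with os.path.expanduser (the exact result depends on the machine):

```python
import os

# Expand ~ to the current user's home directory, e.g. '/home/user/sandbox'
path = os.path.expanduser('~/sandbox')
```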
Create the variable large
in Python and use it as the dependent variable for a binary choice estimation in Stata.
car_df['large'] = [1 if x > 3 and y > 200 else 0 for x, y in zip(car_df['headroom'], car_df['length'])]
car_df[['headroom', 'length', 'large']].head(7)
headroom | length | large | |
---|---|---|---|
0 | 2.5 | 186 | 0 |
1 | 3.0 | 173 | 0 |
2 | 3.0 | 168 | 0 |
3 | 4.5 | 196 | 0 |
4 | 4.0 | 222 | 1 |
5 | 4.0 | 218 | 1 |
6 | 3.0 | 170 | 0 |
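The list comprehension above can also be written as a vectorized pandas expression. A sketch on a stand-in dataframe with the headroom and length values from the table above:

```python
import pandas as pd

# Stand-in for the relevant columns of car_df (values from the table above)
df = pd.DataFrame({
    'headroom': [2.5, 3.0, 3.0, 4.5, 4.0, 4.0, 3.0],
    'length':   [186, 173, 168, 196, 222, 218, 170],
})

# Vectorized equivalent of the list comprehension: 1 where both conditions hold
df['large'] = ((df['headroom'] > 3) & (df['length'] > 200)).astype(int)
```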
%%stata -d car_df -i main_var -i control_var
logit large `main_var' `control_var', vce(cluster make)
Iteration 0:   log pseudolikelihood =  -39.60355
Iteration 1:   log pseudolikelihood = -19.307161
Iteration 2:   log pseudolikelihood = -13.526857
Iteration 3:   log pseudolikelihood = -10.999644
Iteration 4:   log pseudolikelihood = -10.726345
Iteration 5:   log pseudolikelihood = -10.723111
Iteration 6:   log pseudolikelihood = -10.723109
Iteration 7:   log pseudolikelihood = -10.723109

Logistic regression                             Number of obs     =         69
                                                Wald chi2(6)      =      12.90
                                                Prob > chi2       =     0.0446
Log pseudolikelihood = -10.723109               Pseudo R2         =     0.7292

                                  (Std. Err. adjusted for 69 clusters in make)
------------------------------------------------------------------------------
             |               Robust
       large |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -.5846335   .2941083    -1.99   0.047    -1.161075   -.0081918
       rep78 |  -1.298127   1.264918    -1.03   0.305    -3.777322    1.181067
  gear_ratio |  -1.331913   3.389448    -0.39   0.694    -7.975109    5.311283
       trunk |   1.210178   .4830082     2.51   0.012     .2634991    2.156856
      weight |  -.0007284   .0022358    -0.33   0.745    -.0051105    .0036536
displacement |    .001631   .0160425     0.10   0.919    -.0298119    .0330738
       _cons |  -.2977676    16.7841    -0.02   0.986    -33.19401    32.59847
------------------------------------------------------------------------------
Note: 8 failures and 0 successes completely determined.