Python example

This document shows how you can perform a very simple statistical analysis in the Python programming language. To run this, you will need the IPython notebook; however, you can copy and paste the code parts of the document into a plain text file and run it as a python script as follows:

python my_script.py

Assuming you have named the file my_script.py.

Imports and load data

Before we start, we need to import the specific libraries that contain the bulk of data and statistical functionality. These are the following:

In [1]:
import pandas as pd
import statsmodels.api as sm

Then we can load the file with the data we'll use (you'll need connectivity to pull it down):

In [3]:
db = pd.read_csv('https://raw.github.com/darribas/buzz_adam/master/buzz_data.csv')
In [4]:
db
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 96 entries, 0 to 95
Data columns (total 24 columns):
id                                     96  non-null values
4sq-pct_Arts & Entertainment           96  non-null values
4sq-pct_College & University           96  non-null values
4sq-pct_Food                           96  non-null values
4sq-pct_Other                          96  non-null values
4sq-pct_Outdoors & Recreation          96  non-null values
4sq-pct_Professional & Other Places    96  non-null values
4sq-pct_Travel & Transport             96  non-null values
industrie_pct_area                     96  non-null values
kantoor_pct_area                       96  non-null values
sport_pct_area                         96  non-null values
winkel_pct_area                        96  non-null values
4sq_total_venues                       96  non-null values
total_units                            96  non-null values
div_i                                  96  non-null values
checkins_all                           96  non-null values
('WeekDay', 'Afternoon')               96  non-null values
('WeekDay', 'Evening')                 96  non-null values
('WeekDay', 'Morning')                 96  non-null values
('WeekDay', 'Night')                   96  non-null values
('WeekEnd', 'Afternoon')               96  non-null values
('WeekEnd', 'Evening')                 96  non-null values
('WeekEnd', 'Morning')                 96  non-null values
('WeekEnd', 'Night')                   96  non-null values
dtypes: float64(20), int64(3), object(1)

Create decriptives

To first get a sense of the data, we can inspect the top rows (note the transpose with .T in order for the table to fit in):

In [8]:
db.head().T
Out[8]:
0 1 2 3 4
id BU03630000 BU03630001 BU03630002 BU03630003 BU03630004
4sq-pct_Arts & Entertainment 23.15789 22.41379 25 19.75309 29.26829
4sq-pct_College & University 2.105263 0 0 1.234568 0
4sq-pct_Food 5.263158 6.034483 7.352941 3.703704 2.439024
4sq-pct_Other 57.89474 52.58621 58.82353 65.4321 56.09756
4sq-pct_Outdoors & Recreation 0 0 0 0 0
4sq-pct_Professional & Other Places 4.210526 2.586207 5.882353 4.938272 2.439024
4sq-pct_Travel & Transport 4.210526 12.93103 2.941176 3.703704 4.878049
industrie_pct_area 3.180117 1.002782 2.649969 1.572429 4.066521
kantoor_pct_area 14.23777 17.45941 23.67496 30.78645 13.96826
sport_pct_area 0.03911904 0.9169031 0.07142607 0.3230344 0.3108789
winkel_pct_area 7.09489 29.07499 5.959857 9.063615 4.053148
4sq_total_venues 95 116 68 81 41
total_units 4238 4836 5923 3639 7162
div_i 0.5702 0.6314 0.5228 0.5196 0.5394
checkins_all 2750 6620 2521 3468 2076
('WeekDay', 'Afternoon') 743 1834 685 1040 664
('WeekDay', 'Evening') 620 1236 588 789 397
('WeekDay', 'Morning') 402 1407 427 463 422
('WeekDay', 'Night') 185 419 184 262 98
('WeekEnd', 'Afternoon') 346 816 311 372 238
('WeekEnd', 'Evening') 203 333 145 241 124
('WeekEnd', 'Morning') 111 324 80 96 77
('WeekEnd', 'Night') 140 251 101 205 56

Or the tail of the table:

In [9]:
db.tail().T
Out[9]:
91 92 93 94 95
id BU03631453 BU03631454 BU03631459 BU03631490 BU03631491
4sq-pct_Arts & Entertainment 33.33333 22.22222 0 0 0
4sq-pct_College & University 0 0 10 4.166667 0
4sq-pct_Food 0 22.22222 10 4.166667 0
4sq-pct_Other 33.33333 44.44444 60 54.16667 28.57143
4sq-pct_Outdoors & Recreation 0 0 0 0 0
4sq-pct_Professional & Other Places 0 11.11111 20 25 28.57143
4sq-pct_Travel & Transport 0 0 0 8.333333 14.28571
industrie_pct_area 0.5510426 1.05968 0.3963571 1.581789 1.801225
kantoor_pct_area 1.254399 5.504788 50.14588 36.31685 18.88625
sport_pct_area 0.5164656 0.1423928 0 0.3649561 1.867882
winkel_pct_area 3.936081 3.15024 1.114295 3.382682 0.7877441
4sq_total_venues 3 9 10 24 7
total_units 3172 5456 1444 10191 5051
div_i 0.6032 0.5514 0.5304 0.5692 0.5576
checkins_all 431 483 543 2179 815
('WeekDay', 'Afternoon') 87 98 169 889 202
('WeekDay', 'Evening') 64 124 82 190 111
('WeekDay', 'Morning') 37 104 181 789 239
('WeekDay', 'Night') 88 30 18 54 115
('WeekEnd', 'Afternoon') 55 55 46 102 59
('WeekEnd', 'Evening') 58 41 19 54 35
('WeekEnd', 'Morning') 7 25 24 79 29
('WeekEnd', 'Night') 35 6 4 22 25

Basic descriptives of the data are created with the following simple command (note the transpose again):

In [12]:
db.describe().T
Out[12]:
count mean std min 25% 50% 75% max
4sq-pct_Arts & Entertainment 96 7.993073 12.321591 0.000000 0.000000 0.000000 15.885417 46.153846
4sq-pct_College & University 96 1.502825 7.376999 0.000000 0.000000 0.000000 0.000000 50.000000
4sq-pct_Food 96 6.498157 16.067714 0.000000 0.000000 0.000000 5.794335 100.000000
4sq-pct_Other 96 46.071709 30.456685 0.000000 25.000000 50.000000 63.727273 100.000000
4sq-pct_Outdoors & Recreation 96 1.617461 10.619857 0.000000 0.000000 0.000000 0.000000 100.000000
4sq-pct_Professional & Other Places 96 12.211715 22.494019 0.000000 0.000000 0.000000 16.666667 100.000000
4sq-pct_Travel & Transport 96 7.041147 16.066249 0.000000 0.000000 0.000000 11.111111 100.000000
industrie_pct_area 96 7.194845 14.445025 0.000000 0.788846 1.970315 4.891877 76.819769
kantoor_pct_area 96 8.522095 12.678802 0.013986 1.398705 3.651725 10.338975 84.004070
sport_pct_area 96 1.351429 3.579424 0.000000 0.090404 0.362568 1.204256 29.412587
winkel_pct_area 96 3.408922 3.666317 0.000000 1.307761 2.381766 4.416234 29.074987
4sq_total_venues 96 12.864583 22.031852 0.000000 1.000000 6.000000 11.000000 116.000000
total_units 96 4822.031250 3001.489607 13.000000 2390.250000 4844.500000 6722.750000 14763.000000
div_i 96 0.601440 0.144747 0.000000 0.524900 0.599200 0.725200 0.806600
checkins_all 96 806.531250 1070.210338 5.000000 147.000000 422.000000 1028.250000 6620.000000
('WeekDay', 'Afternoon') 96 204.895833 299.699480 0.000000 33.000000 80.500000 262.000000 1834.000000
('WeekDay', 'Evening') 96 155.375000 220.786644 0.000000 20.000000 78.000000 178.000000 1236.000000
('WeekDay', 'Morning') 96 184.625000 250.739212 1.000000 33.750000 87.000000 274.000000 1407.000000
('WeekDay', 'Night') 96 59.364583 78.068003 0.000000 9.000000 30.000000 82.750000 419.000000
('WeekEnd', 'Afternoon') 96 84.031250 122.064388 0.000000 20.000000 42.000000 90.750000 816.000000
('WeekEnd', 'Evening') 96 49.052083 67.586883 0.000000 8.750000 23.500000 55.000000 333.000000
('WeekEnd', 'Morning') 96 37.937500 47.425024 0.000000 8.750000 19.500000 53.250000 324.000000
('WeekEnd', 'Night') 96 31.250000 48.053041 0.000000 4.000000 14.500000 38.000000 263.000000

Simple graphics

We can visualize the data too. For example, we can see the density distribution of the variable div_i:

In [14]:
db['div_i'].plot(kind='kde')
Out[14]:
<matplotlib.axes.AxesSubplot at 0x50fa890>

Or we can get a scatter plot between div_i and industrie_pct_area:

In [15]:
scatter(db['div_i'], db['industrie_pct_area'])
Out[15]:
<matplotlib.collections.PathCollection at 0x53a0b90>

we can get a full matrix of plots with the following commands:

In [20]:
from pandas.tools.plotting import scatter_matrix
scatter_matrix(db.ix[:, :5], figsize=(12, 12))
show()

Estimate a model

Let us run an OLS model with the following equation:

$$ checkins\_all = \alpha + \beta_1 total\_units + \beta_2 div_i $$
In [24]:
ols = sm.OLS(db['checkins_all'], db[['total_units', 'div_i']]).fit()

A plain text summary can be obtained as follows:

In [28]:
smry = ols.summary()
smry
Out[28]:
OLS Regression Results
Dep. Variable: checkins_all R-squared: 0.392
Model: OLS Adj. R-squared: 0.379
Method: Least Squares F-statistic: 30.29
Date: Tue, 14 Jan 2014 Prob (F-statistic): 7.02e-11
Time: 20:53:19 Log-Likelihood: -803.27
No. Observations: 96 AIC: 1611.
Df Residuals: 94 BIC: 1616.
Df Model: 2
coef std err t P>|t| [95.0% Conf. Int.]
total_units 0.1100 0.040 2.759 0.007 0.031 0.189
div_i 376.2314 365.721 1.029 0.306 -349.916 1102.378
Omnibus: 82.475 Durbin-Watson: 1.138
Prob(Omnibus): 0.000 Jarque-Bera (JB): 580.650
Skew: 2.850 Prob(JB): 8.19e-127
Kurtosis: 13.615 Cond. No. 1.93e+04

And if we need to export it to latex, it is also easy:

In [29]:
inlatex = smry.as_latex()
print inlatex
\begin{center}
\begin{tabular}{lclc}
\toprule
\textbf{Dep. Variable:}    &   checkins_all   & \textbf{  R-squared:         } &     0.392   \\
\textbf{Model:}            &       OLS        & \textbf{  Adj. R-squared:    } &     0.379   \\
\textbf{Method:}           &  Least Squares   & \textbf{  F-statistic:       } &     30.29   \\
\textbf{Date:}             & Tue, 14 Jan 2014 & \textbf{  Prob (F-statistic):} &  7.02e-11   \\
\textbf{Time:}             &     20:53:19     & \textbf{  Log-Likelihood:    } &   -803.27   \\
\textbf{No. Observations:} &          96      & \textbf{  AIC:               } &     1611.   \\
\textbf{Df Residuals:}     &          94      & \textbf{  BIC:               } &     1616.   \\
\bottomrule
\end{tabular}
\begin{tabular}{lccccc}
                     & \textbf{coef} & \textbf{std err} & \textbf{t} & \textbf{P$>$$|$t$|$} & \textbf{[95.0\% Conf. Int.]}  \\
\midrule
\textbf{total_units} &       0.1100  &        0.040     &     2.759  &         0.007        &         0.031     0.189       \\
\textbf{div_i}       &     376.2314  &      365.721     &     1.029  &         0.306        &      -349.916  1102.378       \\
\bottomrule
\end{tabular}
\begin{tabular}{lclc}
\textbf{Omnibus:}       & 82.475 & \textbf{  Durbin-Watson:     } &     1.138  \\
\textbf{Prob(Omnibus):} &  0.000 & \textbf{  Jarque-Bera (JB):  } &   580.650  \\
\textbf{Skew:}          &  2.850 & \textbf{  Prob(JB):          } & 8.19e-127  \\
\textbf{Kurtosis:}      & 13.615 & \textbf{  Cond. No.          } &  1.93e+04  \\
\bottomrule
\end{tabular}
%\caption{OLS Regression Results}
\end{center}