Python example¶

This document shows how you can perform a very simple statistical analysis in the Python programming language. To run this, you will need the IPython notebook; however, you can copy and paste the code parts of the document into a plain text file and run it as a python script as follows:

python my_script.py

Assuming you have named the file my_script.py.

Imports and load data¶

Before we start, we need to import the specific libraries that contain the bulk of data and statistical functionality. These are the following:

In [1]:

import pandas as pd
import statsmodels.api as sm

Then we can load the file with the data we'll use (you'll need connectivity to pull it down):

In [3]:

db = pd.read_csv('https://raw.github.com/darribas/buzz_adam/master/buzz_data.csv')

In [4]:

db

Out[4]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96 entries, 0 to 95
Data columns (total 24 columns):
id                                     96  non-null values
4sq-pct_Arts & Entertainment           96  non-null values
4sq-pct_College & University           96  non-null values
4sq-pct_Food                           96  non-null values
4sq-pct_Other                          96  non-null values
4sq-pct_Outdoors & Recreation          96  non-null values
4sq-pct_Professional & Other Places    96  non-null values
4sq-pct_Travel & Transport             96  non-null values
industrie_pct_area                     96  non-null values
kantoor_pct_area                       96  non-null values
sport_pct_area                         96  non-null values
winkel_pct_area                        96  non-null values
4sq_total_venues                       96  non-null values
total_units                            96  non-null values
div_i                                  96  non-null values
checkins_all                           96  non-null values
('WeekDay', 'Afternoon')               96  non-null values
('WeekDay', 'Evening')                 96  non-null values
('WeekDay', 'Morning')                 96  non-null values
('WeekDay', 'Night')                   96  non-null values
('WeekEnd', 'Afternoon')               96  non-null values
('WeekEnd', 'Evening')                 96  non-null values
('WeekEnd', 'Morning')                 96  non-null values
('WeekEnd', 'Night')                   96  non-null values
dtypes: float64(20), int64(3), object(1)

Create decriptives¶

To first get a sense of the data, we can inspect the top rows (note the transpose with .T in order for the table to fit in):

In [8]:

db.head().T

Out[8]:

	0	1	2	3	4
id	BU03630000	BU03630001	BU03630002	BU03630003	BU03630004
4sq-pct_Arts & Entertainment	23.15789	22.41379	25	19.75309	29.26829
4sq-pct_College & University	2.105263	0	0	1.234568	0
4sq-pct_Food	5.263158	6.034483	7.352941	3.703704	2.439024
4sq-pct_Other	57.89474	52.58621	58.82353	65.4321	56.09756
4sq-pct_Outdoors & Recreation	0	0	0	0	0
4sq-pct_Professional & Other Places	4.210526	2.586207	5.882353	4.938272	2.439024
4sq-pct_Travel & Transport	4.210526	12.93103	2.941176	3.703704	4.878049
industrie_pct_area	3.180117	1.002782	2.649969	1.572429	4.066521
kantoor_pct_area	14.23777	17.45941	23.67496	30.78645	13.96826
sport_pct_area	0.03911904	0.9169031	0.07142607	0.3230344	0.3108789
winkel_pct_area	7.09489	29.07499	5.959857	9.063615	4.053148
4sq_total_venues	95	116	68	81	41
total_units	4238	4836	5923	3639	7162
div_i	0.5702	0.6314	0.5228	0.5196	0.5394
checkins_all	2750	6620	2521	3468	2076
('WeekDay', 'Afternoon')	743	1834	685	1040	664
('WeekDay', 'Evening')	620	1236	588	789	397
('WeekDay', 'Morning')	402	1407	427	463	422
('WeekDay', 'Night')	185	419	184	262	98
('WeekEnd', 'Afternoon')	346	816	311	372	238
('WeekEnd', 'Evening')	203	333	145	241	124
('WeekEnd', 'Morning')	111	324	80	96	77
('WeekEnd', 'Night')	140	251	101	205	56

Or the tail of the table:

In [9]:

db.tail().T

Out[9]:

	91	92	93	94	95
id	BU03631453	BU03631454	BU03631459	BU03631490	BU03631491
4sq-pct_Arts & Entertainment	33.33333	22.22222	0	0	0
4sq-pct_College & University	0	0	10	4.166667	0
4sq-pct_Food	0	22.22222	10	4.166667	0
4sq-pct_Other	33.33333	44.44444	60	54.16667	28.57143
4sq-pct_Outdoors & Recreation	0	0	0	0	0
4sq-pct_Professional & Other Places	0	11.11111	20	25	28.57143
4sq-pct_Travel & Transport	0	0	0	8.333333	14.28571
industrie_pct_area	0.5510426	1.05968	0.3963571	1.581789	1.801225
kantoor_pct_area	1.254399	5.504788	50.14588	36.31685	18.88625
sport_pct_area	0.5164656	0.1423928	0	0.3649561	1.867882
winkel_pct_area	3.936081	3.15024	1.114295	3.382682	0.7877441
4sq_total_venues	3	9	10	24	7
total_units	3172	5456	1444	10191	5051
div_i	0.6032	0.5514	0.5304	0.5692	0.5576
checkins_all	431	483	543	2179	815
('WeekDay', 'Afternoon')	87	98	169	889	202
('WeekDay', 'Evening')	64	124	82	190	111
('WeekDay', 'Morning')	37	104	181	789	239
('WeekDay', 'Night')	88	30	18	54	115
('WeekEnd', 'Afternoon')	55	55	46	102	59
('WeekEnd', 'Evening')	58	41	19	54	35
('WeekEnd', 'Morning')	7	25	24	79	29
('WeekEnd', 'Night')	35	6	4	22	25

Basic descriptives of the data are created with the following simple command (note the transpose again):

In [12]:

db.describe().T

Out[12]:

	count	mean	std	min	25%	50%	75%	max
4sq-pct_Arts & Entertainment	96	7.993073	12.321591	0.000000	0.000000	0.000000	15.885417	46.153846
4sq-pct_College & University	96	1.502825	7.376999	0.000000	0.000000	0.000000	0.000000	50.000000
4sq-pct_Food	96	6.498157	16.067714	0.000000	0.000000	0.000000	5.794335	100.000000
4sq-pct_Other	96	46.071709	30.456685	0.000000	25.000000	50.000000	63.727273	100.000000
4sq-pct_Outdoors & Recreation	96	1.617461	10.619857	0.000000	0.000000	0.000000	0.000000	100.000000
4sq-pct_Professional & Other Places	96	12.211715	22.494019	0.000000	0.000000	0.000000	16.666667	100.000000
4sq-pct_Travel & Transport	96	7.041147	16.066249	0.000000	0.000000	0.000000	11.111111	100.000000
industrie_pct_area	96	7.194845	14.445025	0.000000	0.788846	1.970315	4.891877	76.819769
kantoor_pct_area	96	8.522095	12.678802	0.013986	1.398705	3.651725	10.338975	84.004070
sport_pct_area	96	1.351429	3.579424	0.000000	0.090404	0.362568	1.204256	29.412587
winkel_pct_area	96	3.408922	3.666317	0.000000	1.307761	2.381766	4.416234	29.074987
4sq_total_venues	96	12.864583	22.031852	0.000000	1.000000	6.000000	11.000000	116.000000
total_units	96	4822.031250	3001.489607	13.000000	2390.250000	4844.500000	6722.750000	14763.000000
div_i	96	0.601440	0.144747	0.000000	0.524900	0.599200	0.725200	0.806600
checkins_all	96	806.531250	1070.210338	5.000000	147.000000	422.000000	1028.250000	6620.000000
('WeekDay', 'Afternoon')	96	204.895833	299.699480	0.000000	33.000000	80.500000	262.000000	1834.000000
('WeekDay', 'Evening')	96	155.375000	220.786644	0.000000	20.000000	78.000000	178.000000	1236.000000
('WeekDay', 'Morning')	96	184.625000	250.739212	1.000000	33.750000	87.000000	274.000000	1407.000000
('WeekDay', 'Night')	96	59.364583	78.068003	0.000000	9.000000	30.000000	82.750000	419.000000
('WeekEnd', 'Afternoon')	96	84.031250	122.064388	0.000000	20.000000	42.000000	90.750000	816.000000
('WeekEnd', 'Evening')	96	49.052083	67.586883	0.000000	8.750000	23.500000	55.000000	333.000000
('WeekEnd', 'Morning')	96	37.937500	47.425024	0.000000	8.750000	19.500000	53.250000	324.000000
('WeekEnd', 'Night')	96	31.250000	48.053041	0.000000	4.000000	14.500000	38.000000	263.000000

Simple graphics¶

We can visualize the data too. For example, we can see the density distribution of the variable div_i:

In [14]:

db['div_i'].plot(kind='kde')

Out[14]:

<matplotlib.axes.AxesSubplot at 0x50fa890>

Or we can get a scatter plot between div_i and industrie_pct_area:

In [15]:

scatter(db['div_i'], db['industrie_pct_area'])

Out[15]:

<matplotlib.collections.PathCollection at 0x53a0b90>

we can get a full matrix of plots with the following commands:

In [20]:

from pandas.tools.plotting import scatter_matrix
scatter_matrix(db.ix[:, :5], figsize=(12, 12))
show()

Estimate a model¶

Let us run an OLS model with the following equation:

$$ checkins\_all = \alpha + \beta_1 total\_units + \beta_2 div_i $$

In [24]:

ols = sm.OLS(db['checkins_all'], db[['total_units', 'div_i']]).fit()

A plain text summary can be obtained as follows:

In [28]:

smry = ols.summary()
smry

Out[28]:

OLS Regression Results
Dep. Variable:	checkins_all	R-squared:	0.392
Model:	OLS	Adj. R-squared:	0.379
Method:	Least Squares	F-statistic:	30.29
Date:	Tue, 14 Jan 2014	Prob (F-statistic):	7.02e-11
Time:	20:53:19	Log-Likelihood:	-803.27
No. Observations:	96	AIC:	1611.
Df Residuals:	94	BIC:	1616.
Df Model:	2

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
total_units	0.1100	0.040	2.759	0.007	0.031 0.189
div_i	376.2314	365.721	1.029	0.306	-349.916 1102.378

Omnibus:	82.475	Durbin-Watson:	1.138
Prob(Omnibus):	0.000	Jarque-Bera (JB):	580.650
Skew:	2.850	Prob(JB):	8.19e-127
Kurtosis:	13.615	Cond. No.	1.93e+04

And if we need to export it to latex, it is also easy:

In [29]:

inlatex = smry.as_latex()
print inlatex

\begin{center}
\begin{tabular}{lclc}
\toprule
\textbf{Dep. Variable:}    &   checkins_all   & \textbf{  R-squared:         } &     0.392   \\
\textbf{Model:}            &       OLS        & \textbf{  Adj. R-squared:    } &     0.379   \\
\textbf{Method:}           &  Least Squares   & \textbf{  F-statistic:       } &     30.29   \\
\textbf{Date:}             & Tue, 14 Jan 2014 & \textbf{  Prob (F-statistic):} &  7.02e-11   \\
\textbf{Time:}             &     20:53:19     & \textbf{  Log-Likelihood:    } &   -803.27   \\
\textbf{No. Observations:} &          96      & \textbf{  AIC:               } &     1611.   \\
\textbf{Df Residuals:}     &          94      & \textbf{  BIC:               } &     1616.   \\
\bottomrule
\end{tabular}
\begin{tabular}{lccccc}
                     & \textbf{coef} & \textbf{std err} & \textbf{t} & \textbf{P$>$$|$t$|$} & \textbf{[95.0\% Conf. Int.]}  \\
\midrule
\textbf{total_units} &       0.1100  &        0.040     &     2.759  &         0.007        &         0.031     0.189       \\
\textbf{div_i}       &     376.2314  &      365.721     &     1.029  &         0.306        &      -349.916  1102.378       \\
\bottomrule
\end{tabular}
\begin{tabular}{lclc}
\textbf{Omnibus:}       & 82.475 & \textbf{  Durbin-Watson:     } &     1.138  \\
\textbf{Prob(Omnibus):} &  0.000 & \textbf{  Jarque-Bera (JB):  } &   580.650  \\
\textbf{Skew:}          &  2.850 & \textbf{  Prob(JB):          } & 8.19e-127  \\
\textbf{Kurtosis:}      & 13.615 & \textbf{  Cond. No.          } &  1.93e+04  \\
\bottomrule
\end{tabular}
%\caption{OLS Regression Results}
\end{center}