This document shows how you can perform a very simple statistical analysis in the Python programming language. To run this, you will need the IPython notebook; however, you can copy and paste the code parts of the document into a plain text file and run it as a python script as follows:
python my_script.py
Assuming you have named the file my_script.py
.
Before we start, we need to import the specific libraries that contain the bulk of data and statistical functionality. These are the following:
import pandas as pd
import statsmodels.api as sm
Then we can load the file with the data we'll use (you'll need connectivity to pull it down):
db = pd.read_csv('https://raw.github.com/darribas/buzz_adam/master/buzz_data.csv')
db
<class 'pandas.core.frame.DataFrame'> Int64Index: 96 entries, 0 to 95 Data columns (total 24 columns): id 96 non-null values 4sq-pct_Arts & Entertainment 96 non-null values 4sq-pct_College & University 96 non-null values 4sq-pct_Food 96 non-null values 4sq-pct_Other 96 non-null values 4sq-pct_Outdoors & Recreation 96 non-null values 4sq-pct_Professional & Other Places 96 non-null values 4sq-pct_Travel & Transport 96 non-null values industrie_pct_area 96 non-null values kantoor_pct_area 96 non-null values sport_pct_area 96 non-null values winkel_pct_area 96 non-null values 4sq_total_venues 96 non-null values total_units 96 non-null values div_i 96 non-null values checkins_all 96 non-null values ('WeekDay', 'Afternoon') 96 non-null values ('WeekDay', 'Evening') 96 non-null values ('WeekDay', 'Morning') 96 non-null values ('WeekDay', 'Night') 96 non-null values ('WeekEnd', 'Afternoon') 96 non-null values ('WeekEnd', 'Evening') 96 non-null values ('WeekEnd', 'Morning') 96 non-null values ('WeekEnd', 'Night') 96 non-null values dtypes: float64(20), int64(3), object(1)
To first get a sense of the data, we can inspect the top rows (note the transpose with .T
in order for the table to fit in):
db.head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
id | BU03630000 | BU03630001 | BU03630002 | BU03630003 | BU03630004 |
4sq-pct_Arts & Entertainment | 23.15789 | 22.41379 | 25 | 19.75309 | 29.26829 |
4sq-pct_College & University | 2.105263 | 0 | 0 | 1.234568 | 0 |
4sq-pct_Food | 5.263158 | 6.034483 | 7.352941 | 3.703704 | 2.439024 |
4sq-pct_Other | 57.89474 | 52.58621 | 58.82353 | 65.4321 | 56.09756 |
4sq-pct_Outdoors & Recreation | 0 | 0 | 0 | 0 | 0 |
4sq-pct_Professional & Other Places | 4.210526 | 2.586207 | 5.882353 | 4.938272 | 2.439024 |
4sq-pct_Travel & Transport | 4.210526 | 12.93103 | 2.941176 | 3.703704 | 4.878049 |
industrie_pct_area | 3.180117 | 1.002782 | 2.649969 | 1.572429 | 4.066521 |
kantoor_pct_area | 14.23777 | 17.45941 | 23.67496 | 30.78645 | 13.96826 |
sport_pct_area | 0.03911904 | 0.9169031 | 0.07142607 | 0.3230344 | 0.3108789 |
winkel_pct_area | 7.09489 | 29.07499 | 5.959857 | 9.063615 | 4.053148 |
4sq_total_venues | 95 | 116 | 68 | 81 | 41 |
total_units | 4238 | 4836 | 5923 | 3639 | 7162 |
div_i | 0.5702 | 0.6314 | 0.5228 | 0.5196 | 0.5394 |
checkins_all | 2750 | 6620 | 2521 | 3468 | 2076 |
('WeekDay', 'Afternoon') | 743 | 1834 | 685 | 1040 | 664 |
('WeekDay', 'Evening') | 620 | 1236 | 588 | 789 | 397 |
('WeekDay', 'Morning') | 402 | 1407 | 427 | 463 | 422 |
('WeekDay', 'Night') | 185 | 419 | 184 | 262 | 98 |
('WeekEnd', 'Afternoon') | 346 | 816 | 311 | 372 | 238 |
('WeekEnd', 'Evening') | 203 | 333 | 145 | 241 | 124 |
('WeekEnd', 'Morning') | 111 | 324 | 80 | 96 | 77 |
('WeekEnd', 'Night') | 140 | 251 | 101 | 205 | 56 |
Or the tail of the table:
db.tail().T
91 | 92 | 93 | 94 | 95 | |
---|---|---|---|---|---|
id | BU03631453 | BU03631454 | BU03631459 | BU03631490 | BU03631491 |
4sq-pct_Arts & Entertainment | 33.33333 | 22.22222 | 0 | 0 | 0 |
4sq-pct_College & University | 0 | 0 | 10 | 4.166667 | 0 |
4sq-pct_Food | 0 | 22.22222 | 10 | 4.166667 | 0 |
4sq-pct_Other | 33.33333 | 44.44444 | 60 | 54.16667 | 28.57143 |
4sq-pct_Outdoors & Recreation | 0 | 0 | 0 | 0 | 0 |
4sq-pct_Professional & Other Places | 0 | 11.11111 | 20 | 25 | 28.57143 |
4sq-pct_Travel & Transport | 0 | 0 | 0 | 8.333333 | 14.28571 |
industrie_pct_area | 0.5510426 | 1.05968 | 0.3963571 | 1.581789 | 1.801225 |
kantoor_pct_area | 1.254399 | 5.504788 | 50.14588 | 36.31685 | 18.88625 |
sport_pct_area | 0.5164656 | 0.1423928 | 0 | 0.3649561 | 1.867882 |
winkel_pct_area | 3.936081 | 3.15024 | 1.114295 | 3.382682 | 0.7877441 |
4sq_total_venues | 3 | 9 | 10 | 24 | 7 |
total_units | 3172 | 5456 | 1444 | 10191 | 5051 |
div_i | 0.6032 | 0.5514 | 0.5304 | 0.5692 | 0.5576 |
checkins_all | 431 | 483 | 543 | 2179 | 815 |
('WeekDay', 'Afternoon') | 87 | 98 | 169 | 889 | 202 |
('WeekDay', 'Evening') | 64 | 124 | 82 | 190 | 111 |
('WeekDay', 'Morning') | 37 | 104 | 181 | 789 | 239 |
('WeekDay', 'Night') | 88 | 30 | 18 | 54 | 115 |
('WeekEnd', 'Afternoon') | 55 | 55 | 46 | 102 | 59 |
('WeekEnd', 'Evening') | 58 | 41 | 19 | 54 | 35 |
('WeekEnd', 'Morning') | 7 | 25 | 24 | 79 | 29 |
('WeekEnd', 'Night') | 35 | 6 | 4 | 22 | 25 |
Basic descriptives of the data are created with the following simple command (note the transpose again):
db.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
4sq-pct_Arts & Entertainment | 96 | 7.993073 | 12.321591 | 0.000000 | 0.000000 | 0.000000 | 15.885417 | 46.153846 |
4sq-pct_College & University | 96 | 1.502825 | 7.376999 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 50.000000 |
4sq-pct_Food | 96 | 6.498157 | 16.067714 | 0.000000 | 0.000000 | 0.000000 | 5.794335 | 100.000000 |
4sq-pct_Other | 96 | 46.071709 | 30.456685 | 0.000000 | 25.000000 | 50.000000 | 63.727273 | 100.000000 |
4sq-pct_Outdoors & Recreation | 96 | 1.617461 | 10.619857 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.000000 |
4sq-pct_Professional & Other Places | 96 | 12.211715 | 22.494019 | 0.000000 | 0.000000 | 0.000000 | 16.666667 | 100.000000 |
4sq-pct_Travel & Transport | 96 | 7.041147 | 16.066249 | 0.000000 | 0.000000 | 0.000000 | 11.111111 | 100.000000 |
industrie_pct_area | 96 | 7.194845 | 14.445025 | 0.000000 | 0.788846 | 1.970315 | 4.891877 | 76.819769 |
kantoor_pct_area | 96 | 8.522095 | 12.678802 | 0.013986 | 1.398705 | 3.651725 | 10.338975 | 84.004070 |
sport_pct_area | 96 | 1.351429 | 3.579424 | 0.000000 | 0.090404 | 0.362568 | 1.204256 | 29.412587 |
winkel_pct_area | 96 | 3.408922 | 3.666317 | 0.000000 | 1.307761 | 2.381766 | 4.416234 | 29.074987 |
4sq_total_venues | 96 | 12.864583 | 22.031852 | 0.000000 | 1.000000 | 6.000000 | 11.000000 | 116.000000 |
total_units | 96 | 4822.031250 | 3001.489607 | 13.000000 | 2390.250000 | 4844.500000 | 6722.750000 | 14763.000000 |
div_i | 96 | 0.601440 | 0.144747 | 0.000000 | 0.524900 | 0.599200 | 0.725200 | 0.806600 |
checkins_all | 96 | 806.531250 | 1070.210338 | 5.000000 | 147.000000 | 422.000000 | 1028.250000 | 6620.000000 |
('WeekDay', 'Afternoon') | 96 | 204.895833 | 299.699480 | 0.000000 | 33.000000 | 80.500000 | 262.000000 | 1834.000000 |
('WeekDay', 'Evening') | 96 | 155.375000 | 220.786644 | 0.000000 | 20.000000 | 78.000000 | 178.000000 | 1236.000000 |
('WeekDay', 'Morning') | 96 | 184.625000 | 250.739212 | 1.000000 | 33.750000 | 87.000000 | 274.000000 | 1407.000000 |
('WeekDay', 'Night') | 96 | 59.364583 | 78.068003 | 0.000000 | 9.000000 | 30.000000 | 82.750000 | 419.000000 |
('WeekEnd', 'Afternoon') | 96 | 84.031250 | 122.064388 | 0.000000 | 20.000000 | 42.000000 | 90.750000 | 816.000000 |
('WeekEnd', 'Evening') | 96 | 49.052083 | 67.586883 | 0.000000 | 8.750000 | 23.500000 | 55.000000 | 333.000000 |
('WeekEnd', 'Morning') | 96 | 37.937500 | 47.425024 | 0.000000 | 8.750000 | 19.500000 | 53.250000 | 324.000000 |
('WeekEnd', 'Night') | 96 | 31.250000 | 48.053041 | 0.000000 | 4.000000 | 14.500000 | 38.000000 | 263.000000 |
We can visualize the data too. For example, we can see the density distribution of the variable div_i
:
db['div_i'].plot(kind='kde')
<matplotlib.axes.AxesSubplot at 0x50fa890>
Or we can get a scatter plot between div_i
and industrie_pct_area
:
scatter(db['div_i'], db['industrie_pct_area'])
<matplotlib.collections.PathCollection at 0x53a0b90>
we can get a full matrix of plots with the following commands:
from pandas.tools.plotting import scatter_matrix
scatter_matrix(db.ix[:, :5], figsize=(12, 12))
show()
Let us run an OLS model with the following equation:
$$ checkins\_all = \alpha + \beta_1 total\_units + \beta_2 div_i $$ols = sm.OLS(db['checkins_all'], db[['total_units', 'div_i']]).fit()
A plain text summary can be obtained as follows:
smry = ols.summary()
smry
Dep. Variable: | checkins_all | R-squared: | 0.392 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.379 |
Method: | Least Squares | F-statistic: | 30.29 |
Date: | Tue, 14 Jan 2014 | Prob (F-statistic): | 7.02e-11 |
Time: | 20:53:19 | Log-Likelihood: | -803.27 |
No. Observations: | 96 | AIC: | 1611. |
Df Residuals: | 94 | BIC: | 1616. |
Df Model: | 2 |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
total_units | 0.1100 | 0.040 | 2.759 | 0.007 | 0.031 0.189 |
div_i | 376.2314 | 365.721 | 1.029 | 0.306 | -349.916 1102.378 |
Omnibus: | 82.475 | Durbin-Watson: | 1.138 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 580.650 |
Skew: | 2.850 | Prob(JB): | 8.19e-127 |
Kurtosis: | 13.615 | Cond. No. | 1.93e+04 |
And if we need to export it to latex, it is also easy:
inlatex = smry.as_latex()
print inlatex
\begin{center} \begin{tabular}{lclc} \toprule \textbf{Dep. Variable:} & checkins_all & \textbf{ R-squared: } & 0.392 \\ \textbf{Model:} & OLS & \textbf{ Adj. R-squared: } & 0.379 \\ \textbf{Method:} & Least Squares & \textbf{ F-statistic: } & 30.29 \\ \textbf{Date:} & Tue, 14 Jan 2014 & \textbf{ Prob (F-statistic):} & 7.02e-11 \\ \textbf{Time:} & 20:53:19 & \textbf{ Log-Likelihood: } & -803.27 \\ \textbf{No. Observations:} & 96 & \textbf{ AIC: } & 1611. \\ \textbf{Df Residuals:} & 94 & \textbf{ BIC: } & 1616. \\ \bottomrule \end{tabular} \begin{tabular}{lccccc} & \textbf{coef} & \textbf{std err} & \textbf{t} & \textbf{P$>$$|$t$|$} & \textbf{[95.0\% Conf. Int.]} \\ \midrule \textbf{total_units} & 0.1100 & 0.040 & 2.759 & 0.007 & 0.031 0.189 \\ \textbf{div_i} & 376.2314 & 365.721 & 1.029 & 0.306 & -349.916 1102.378 \\ \bottomrule \end{tabular} \begin{tabular}{lclc} \textbf{Omnibus:} & 82.475 & \textbf{ Durbin-Watson: } & 1.138 \\ \textbf{Prob(Omnibus):} & 0.000 & \textbf{ Jarque-Bera (JB): } & 580.650 \\ \textbf{Skew:} & 2.850 & \textbf{ Prob(JB): } & 8.19e-127 \\ \textbf{Kurtosis:} & 13.615 & \textbf{ Cond. No. } & 1.93e+04 \\ \bottomrule \end{tabular} %\caption{OLS Regression Results} \end{center}