from IPython.display import HTML
lic = """
<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="http://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">"Replication of results for The socio-cultural sources of urban buzz"</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://darribas.org" property="cc:attributionName" rel="cc:attributionURL">Daniel Arribas-Bel</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>."""
HTML(lic)
This notebook is stored in an open repository at the following link: https://github.com/darribas/buzz_adam
import copy
import pysal as ps
import numpy as np
import pandas as pd
import statsmodels.api as sm
from pyGDsandbox.dataIO import dbf2df, df2dbf
The data are provided in the same repository as this notebook and comprise a csv
file with the attributes plus a shapefile with the geometry of the neighborhoods in Amsterdam. If you have downloaded the repository, you can load the data with:
db = pd.read_csv('buzz_data.csv', index_col=0)
If you don't have the csv around but you are connected to the internet, you can load it remotely like this:
url = 'https://raw.github.com/darribas/buzz_adam/master/buzz_data.csv'
db = pd.read_csv(url, index_col=0)
If the data have loaded correctly, the info of the table should look like this: 23 columns with data on 96 neighborhoods:
db
<class 'pandas.core.frame.DataFrame'>
Index: 96 entries, BU03630000 to BU03631491
Data columns (total 23 columns):
4sq-pct_Arts & Entertainment           96  non-null values
4sq-pct_College & University           96  non-null values
4sq-pct_Food                           96  non-null values
4sq-pct_Other                          96  non-null values
4sq-pct_Outdoors & Recreation          96  non-null values
4sq-pct_Professional & Other Places    96  non-null values
4sq-pct_Travel & Transport             96  non-null values
industrie_pct_area                     96  non-null values
kantoor_pct_area                       96  non-null values
sport_pct_area                         96  non-null values
winkel_pct_area                        96  non-null values
4sq_total_venues                       96  non-null values
total_units                            96  non-null values
div_i                                  96  non-null values
checkins_all                           96  non-null values
('WeekDay', 'Afternoon')               96  non-null values
('WeekDay', 'Evening')                 96  non-null values
('WeekDay', 'Morning')                 96  non-null values
('WeekDay', 'Night')                   96  non-null values
('WeekEnd', 'Afternoon')               96  non-null values
('WeekEnd', 'Evening')                 96  non-null values
('WeekEnd', 'Morning')                 96  non-null values
('WeekEnd', 'Night')                   96  non-null values
dtypes: float64(20), int64(3)
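Note that more recent versions of pandas render the table itself rather than this summary when you display db; in that case the equivalent overview can be obtained explicitly (a minimal sketch, assuming the csv has already been loaded into db as above):

db.info()  # 96 rows, 23 columns, a mix of float64 and int64 dtypes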
The paper contains two descriptive elements: a choropleth of the distribution of checkins and a table that shows basic statistics of each variable. The latter is very simple to replicate (note I only show the variables displayed in the table of the paper, although the dataset contains a few more):
vars = ['4sq-pct_Arts & Entertainment', \
'4sq-pct_College & University', \
'4sq-pct_Food', \
'4sq-pct_Other', \
'4sq-pct_Outdoors & Recreation', \
'4sq-pct_Professional & Other Places', \
'4sq-pct_Travel & Transport', \
'industrie_pct_area', \
'kantoor_pct_area', \
'sport_pct_area', \
'winkel_pct_area', \
'4sq_total_venues', \
'total_units', \
'div_i', \
'checkins_all']
np.round(db[vars].describe().T[['mean', 'std', 'min', '25%', '50%', '75%', 'max']], 3)
| | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|
| 4sq-pct_Arts & Entertainment | 7.993 | 12.322 | 0.000 | 0.000 | 0.000 | 15.885 | 46.154 |
| 4sq-pct_College & University | 1.503 | 7.377 | 0.000 | 0.000 | 0.000 | 0.000 | 50.000 |
| 4sq-pct_Food | 6.498 | 16.068 | 0.000 | 0.000 | 0.000 | 5.794 | 100.000 |
| 4sq-pct_Other | 46.072 | 30.457 | 0.000 | 25.000 | 50.000 | 63.727 | 100.000 |
| 4sq-pct_Outdoors & Recreation | 1.617 | 10.620 | 0.000 | 0.000 | 0.000 | 0.000 | 100.000 |
| 4sq-pct_Professional & Other Places | 12.212 | 22.494 | 0.000 | 0.000 | 0.000 | 16.667 | 100.000 |
| 4sq-pct_Travel & Transport | 7.041 | 16.066 | 0.000 | 0.000 | 0.000 | 11.111 | 100.000 |
| industrie_pct_area | 7.195 | 14.445 | 0.000 | 0.789 | 1.970 | 4.892 | 76.820 |
| kantoor_pct_area | 8.522 | 12.679 | 0.014 | 1.399 | 3.652 | 10.339 | 84.004 |
| sport_pct_area | 1.351 | 3.579 | 0.000 | 0.090 | 0.363 | 1.204 | 29.413 |
| winkel_pct_area | 3.409 | 3.666 | 0.000 | 1.308 | 2.382 | 4.416 | 29.075 |
| 4sq_total_venues | 12.865 | 22.032 | 0.000 | 1.000 | 6.000 | 11.000 | 116.000 |
| total_units | 4822.031 | 3001.490 | 13.000 | 2390.250 | 4844.500 | 6722.750 | 14763.000 |
| div_i | 0.601 | 0.145 | 0.000 | 0.525 | 0.599 | 0.725 | 0.807 |
| checkins_all | 806.531 | 1070.210 | 5.000 | 147.000 | 422.000 | 1028.250 | 6620.000 |
The map in the paper was created using QGIS, and it overlays the neighborhood polygons with the actual checkin points. I cannot re-share the individual data (although you can contact the authors), but we can re-create the choropleth (you will need PySAL 1.7 to reproduce the following cell):
from pysal.contrib.viz import mapping as maps
shp_link = 'adam_buzz_buurts.shp'
_ = maps.plot_choropleth(shp_link, db['checkins_all'].values, 'quantiles', \
k=9, cmap='YlGn', figsize=(12, 12), \
title='Checkin volume')
The main part of the paper is devoted to regressions on the total volume of checkins of a neighborhood. In particular, we estimate a baseline OLS model as specified in Equation (1):
$$ B_i = \alpha + \beta_1 F_i + \beta_2 E_i + \gamma Div_i + \epsilon_i $$

Such regression, displayed in the first column of Table 2, is estimated next:
name_y = 'checkins_all'
# Explanatory variables: everything in `vars` except the dependent variable
x = db[vars[:-1]]
x['constant'] = 1
# OLS on the log of total checkins
model = sm.OLS(np.log(db[name_y]), x).fit()
print model.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:           checkins_all   R-squared:                       0.735
Model:                            OLS   Adj. R-squared:                  0.689
Method:                 Least Squares   F-statistic:                     16.04
Date:                Wed, 18 Dec 2013   Prob (F-statistic):           7.37e-18
Time:                        11:40:45   Log-Likelihood:                -108.82
No. Observations:                  96   AIC:                             247.6
Df Residuals:                      81   BIC:                             286.1
Df Model:                          14
=======================================================================================================
                                          coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------------------------
4sq-pct_Arts & Entertainment            0.0285      0.008      3.456      0.001         0.012     0.045
4sq-pct_College & University           -0.0104      0.012     -0.880      0.381        -0.034     0.013
4sq-pct_Food                            0.0208      0.006      3.633      0.000         0.009     0.032
4sq-pct_Other                           0.0121      0.003      3.590      0.001         0.005     0.019
4sq-pct_Outdoors & Recreation           0.0171      0.008      2.059      0.043         0.001     0.034
4sq-pct_Professional & Other Places     0.0129      0.005      2.513      0.014         0.003     0.023
4sq-pct_Travel & Transport              0.0138      0.006      2.423      0.018         0.002     0.025
industrie_pct_area                      0.0184      0.008      2.339      0.022         0.003     0.034
kantoor_pct_area                        0.0339      0.008      4.467      0.000         0.019     0.049
sport_pct_area                         -0.0271      0.030     -0.915      0.363        -0.086     0.032
winkel_pct_area                         0.0187      0.032      0.593      0.555        -0.044     0.081
4sq_total_venues                        0.0209      0.006      3.390      0.001         0.009     0.033
total_units                             0.0002   3.63e-05      5.097      0.000         0.000     0.000
div_i                                   1.5176      0.803      1.890      0.062        -0.080     3.116
constant                                2.1549      0.524      4.112      0.000         1.112     3.198
==============================================================================
Omnibus:                        2.335   Durbin-Watson:                   1.596
Prob(Omnibus):                  0.311   Jarque-Bera (JB):                1.711
Skew:                          -0.273   Prob(JB):                        0.425
Kurtosis:                       3.360   Cond. No.                     6.24e+04
==============================================================================

Warnings:
[1] The condition number is large, 6.24e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
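The warning about the condition number is mostly a matter of scale: total_units runs into the thousands while the remaining variables are percentages or indices. The R code further below rescales total_units (by 1,000) and div_i (by 10) before the ML estimation; a quick sketch of the same rescaling in Python (only the scale of those two coefficients and standard errors changes, the fit itself is unaffected):

# Rescaled version of the design matrix, mirroring the rescaling in the R code below
x_r = x.copy()
x_r['total_units'] = x_r['total_units'] / 1000.  # thousands of units
x_r['div_i'] = x_r['div_i'] * 10.                # diversity index times ten
print sm.OLS(np.log(db[name_y]), x_r).fit().summary()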
As mentioned in the paper, values for nearby neighborhoods are likely to be correlated. To statistically test for the presence of spatial autocorrelation and, if so, determine its type, we run the LM-Lag and LM-Error diagnostics. For that we first need a spatial weights matrix, which we build following the queen contiguity criterion and row-standardize (we also write it out for later use):
wo = ps.queen_from_shapefile('adam_buzz_buurts.shp')
# Manually connect the island (observation 27) to its actual neighbors
neigh = copy.copy(wo.neighbors)
weigh = copy.copy(wo.weights)
neigh[27].extend([26, 28])
weigh[27].extend([1., 1.])
w = ps.W(neigh, weigh)
# Write the corrected matrix out for later use and row-standardize it
ps.open('adam_buzz_buurts_queenPS.gal', 'w').write(w)
w.transform = 'r'
WARNING: there is one disconnected observation (no neighbors)
Island id:  [27]
The island mentioned in the warning is actually a coding error in the shapefile, so in this case we fix it manually. The diagnostics are easily run as follows (note that we require an OLS model from PySAL, which we create as mmodel):
mmodel = ps.spreg.OLS(np.log(db[[name_y]].values),
db[vars[:-1]].values, w=w, spat_diag=True, nonspat_diag=False)
from pysal.spreg.summary_output import summary_spat_diag_ols
print '\n\t\tSPATIAL DIAGNOSTICS'
print '\t\t-------------------\n'
print summary_spat_diag_ols(mmodel, False)
		SPATIAL DIAGNOSTICS
		-------------------
Lagrange Multiplier (lag)       1      20.874762      0.0000049
Robust LM (lag)                 1      12.079756      0.0005097
Lagrange Multiplier (error)     1       8.942169      0.0027866
Robust LM (error)               1       0.147163      0.7012617
Lagrange Multiplier (SARMA)     2      21.021925      0.0000272
The clear indication towards a spatial lag (the robust LM-Lag statistic remains highly significant while the robust LM-Error does not) prompts us to run such specification. In formal notation, this boils down to:
$$ B_i = \alpha + \rho \displaystyle\sum^N_j w_{ij} B_j + \beta_1 F_i + \beta_2 E_i + \gamma Div_i + \epsilon_i $$

Since the model is estimated using maximum likelihood (ML), we use spdep in R for the regression:
%load_ext rmagic
%R library(spdep)
Loading required package: sp
Loading required package: Matrix
Loading required package: lattice
%%R
db.link = 'adam_vars_richdiv.csv'
db = read.csv(db.link, sep=';')
landuses = names(db)[grep('_pct_area', names(db))]
landuses = c('industrie_pct_area', 'kantoor_pct_area', 'sport_pct_area',
'winkel_pct_area')
sq4 = names(db)[grep('X4sq.pct', names(db))][1:7]
other = c()
name.X = c(sq4, landuses,
c(
'X4sq_total_venues',
'total_units',
'div_i'
)
)
x = db[name.X]
########### Rescaling ##############
x['total_units'] <- x['total_units'] / 1000
x['div_i'] <- x['div_i'] * 10
####################################
y = 'checkins_all'
y = log(db[, y])
# Read the spatial weights written out from PySAL and row-standardize them
wnb <- read.gal('adam_buzz_buurts_queenPS.gal',
                override.id=T)
w <- nb2listw(wnb, style="W")
# ML estimation of the spatial lag model
model <- lagsarlm(y ~ ., data=x, w)
print(summary(model))
Call:lagsarlm(formula = y ~ ., data = x, listw = w)

Residuals:
        Min          1Q      Median          3Q         Max
 -1.5708760  -0.3645739  -0.0035605   0.4496111   2.1311628

Type: lag
Coefficients: (asymptotic standard errors)
                                        Estimate  Std. Error z value  Pr(>|z|)
(Intercept)                          -0.0510550   0.6701247 -0.0762 0.9392701
X4sq.pct_Arts...Entertainment         0.0157820   0.0068313  2.3102 0.0208746
X4sq.pct_College...University        -0.0063080   0.0096102 -0.6564 0.5115730
X4sq.pct_Food                         0.0132307   0.0048092  2.7511 0.0059391
X4sq.pct_Other                        0.0127954   0.0027478  4.6566 3.214e-06
X4sq.pct_Outdoors...Recreation        0.0093330   0.0067592  1.3808 0.1673445
X4sq.pct_Professional...Other.Places  0.0099200   0.0042057  2.3587 0.0183402
X4sq.pct_Travel...Transport           0.0134285   0.0046176  2.9081 0.0036366
industrie_pct_area                    0.0216871   0.0063970  3.3902 0.0006985
kantoor_pct_area                      0.0223510   0.0063125  3.5407 0.0003990
sport_pct_area                       -0.0213363   0.0242902 -0.8784 0.3797309
winkel_pct_area                       0.0407958   0.0256985  1.5875 0.1124046
X4sq_total_venues                     0.0137894   0.0053181  2.5929 0.0095162
total_units                           0.1736649   0.0295318  5.8806 4.088e-09
div_i                                 0.1266651   0.0652172  1.9422 0.0521123

Rho: 0.43991, LR test value: 19.808, p-value: 8.5624e-06
Asymptotic standard error: 0.09327
    z-value: 4.7165, p-value: 2.3991e-06
Wald statistic: 22.246, p-value: 2.3991e-06

Log likelihood: -98.91989 for lag model
ML residual variance (sigma squared): 0.44128, (sigma: 0.66429)
Number of observations: 96
Number of parameters estimated: 17
AIC: 231.84, (AIC for lm: 249.65)
LM test for residual autocorrelation
test value: 3.6243, p-value: 0.056941
These results match those in the second column of Table 2.
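For completeness, the same specification can also be estimated without leaving Python: versions of PySAL from 1.7 onwards ship a maximum-likelihood lag estimator in spreg (ML_Lag). A minimal sketch, assuming ps.spreg.ML_Lag is available and reusing the row-standardized weights w built above (coefficients should match the spdep ones up to the rescaling of total_units and div_i):

# Sketch: ML estimation of the spatial lag model with PySAL's spreg.ML_Lag
yl = np.log(db[['checkins_all']].values)
xl = db[vars[:-1]].values
lag_ml = ps.spreg.ML_Lag(yl, xl, w=w, name_y='ln(checkins_all)', name_x=vars[:-1])
print lag_ml.summary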
The last part of the paper explores the variability of the effect of diversity over different times of the week. To do that, we run a separate regression for the volume of checkins that occur within a given time window of the week. This is equivalent to:
$$ B_{it} = \alpha + \rho \displaystyle\sum^N_j w_{ij} B_{jt} + \beta_1 F_i + \beta_2 E_i + \gamma Div_i + \epsilon_i $$

Since we are just exploring this variability, we do not report the entire models but only keep the coefficient for $Div_i$ to plot it over the week, together with its confidence intervals. The regressions, however, write the full summary of each model to a file named 'lag_models_summaries.txt' in the same folder.
%%R
cross.walk <- list("X..WeekDay....Morning.." = "Morning-WeekDay",
"X..WeekDay....Afternoon.." = "Afternoon-WeekDay",
"X..WeekDay....Evening.." = "Evening-WeekDay",
"X..WeekDay....Night.." = "Night-WeekDay",
"X..WeekEnd....Morning.." = "Morning-WeekEnd",
"X..WeekEnd....Afternoon.." = "Afternoon-WeekEnd",
"X..WeekEnd....Evening.." = "Evening-WeekEnd",
"X..WeekEnd....Night.." = "Night-WeekEnd"
)
db.link = 'buzz_data.csv'
db = read.csv(db.link)
landuses = names(db)[grep('_pct_area', names(db))]
landuses = c('industrie_pct_area', 'kantoor_pct_area', 'sport_pct_area',
'winkel_pct_area')
sq4 = names(db)[grep('X4sq.pct', names(db))][1:7]
other = c()
name.X = c(sq4, landuses,
c(
'X4sq_total_venues',
'total_units',
'div_i'
)
)
########### Rescaling ##############
db['total_units'] <- db['total_units'] / 1000
db['div_i'] <- db['div_i'] * 10
####################################
wnb <- read.gal('adam_buzz_buurts_queenPS.gal',
override.id=T)
w <- nb2listw(wnb, style="W")
week.times <- names(db)[grep('Week', names(db))]
res <- data.frame()
divs <- data.frame()
vari <- 15 # Position for Div_i
print('Running regressions')
# Divert the full model summaries to a text file
sink('lag_models_summaries.txt')
for(wtime in week.times){
print(paste('Model for week time:', wtime))
y = db[, wtime]
y[y == 0] = 0.000001
y = log(y)
x = db[name.X]
model <- lagsarlm(y ~ ., data=x, w)
coefs <- as.data.frame(model$coefficients)
colnames(coefs) <- cross.walk[wtime]
res <- c(res, coefs)
print(summary(model))
div <- data.frame(beta=model$coefficients[vari], se=model$rest.se[vari],
lower=model$coefficients[vari] - model$rest.se[vari]*1.96,
upper=model$coefficients[vari] + model$rest.se[vari]*1.96
)
rownames(div) <- cross.walk[wtime]
divs <- rbind(div, divs)
}
sink()
res <- as.data.frame(res)
time.line <- c("Morning-WeekDay", "Afternoon-WeekDay", "Evening-WeekDay", "Night-WeekDay",
"Morning-WeekEnd", "Afternoon-WeekEnd", "Evening-WeekEnd", "Night-WeekEnd")
divs <- divs[time.line, ]
print(divs)
[1] "Running regressions" beta se lower upper Morning-WeekDay 0.1047662 0.0789775 -0.05002975 0.2595621 Afternoon-WeekDay 0.3147000 0.1538102 0.01323208 0.6161680 Evening-WeekDay 0.4999191 0.2386536 0.03215808 0.9676800 Night-WeekDay 1.1662525 0.3208840 0.53731992 1.7951850 Morning-WeekEnd 0.7755847 0.2077812 0.36833347 1.1828359 Afternoon-WeekEnd 0.6429518 0.1903473 0.26987110 1.0160325 Evening-WeekEnd 0.8430234 0.2695521 0.31470129 1.3713454 Night-WeekEnd 0.8461017 0.2893739 0.27892897 1.4132745
These results can then be plotted, reproducing the structure of Figure 2 in the paper.
%%R
par(mar = c(7, 4, 4, 2) + 0.1, bty='n')
ylab <- 'Div. coefficients'
plot(seq(8), xaxt = "n", divs$beta, type='l', xlab='', ylab=ylab,
ylim=c(min(divs$lower), max(divs$upper)))
#lines(seq(8), xaxt = "n", divs$lower, col='green', lw=0.5, type='l', xlab='', ylab='Div_i Coef.')
#lines(seq(8), xaxt = "n", divs$upper, col='green', lw=0.5, type='l', xlab='', ylab='Div_i Coef.')
# Shade the band between the lower and upper confidence bounds
xx <- c(seq(8), rev(seq(8)))
yy <- c(divs$upper, rev(divs$lower))
polygon(xx, yy, col = 'darkgray', border = NA)
lines(seq(8), xaxt = "n", divs$beta, type='l', xlab='', ylab='Div. Coef.',
ylim=c(min(divs$lower), max(divs$upper)))
abline(h=0, col='red', lw=2)
abline(v=4.5, col='black', lw=0.5)
axis(1, labels = FALSE)
text(1:8, par("usr")[3] - 0.075, srt = 45, adj = 1,
labels = time.line, xpd = TRUE)
text(2.25, 1.5, labels='Week days')
text(6.75, 1.5, labels='Weekend')
title(main=paste('Effect of diversity on buzz\n(spatial lag models)'))