%matplotlib inline
import pandas as pd
import numpy as np
import pysal as ps
import matplotlib.pyplot as plt
pandas
Data structures to read, interact with, transform, and write structured data.
db = pd.read_csv('../workshop_data/Houston_pop00.csv')
db.info()
db.head()
db.tail()
db.describe().T
downtown = db[db['dcbd'] < 15]
downtown.info()
And many more operations on tabular (multi-)indexed data. Check the documentation for more info and tutorials.
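As a small illustration of a couple more of those operations (a sketch that assumes only the columns shown above):
# Column-wise totals for population and land area
db[['POP00', 'ALAND']].sum()
# Average tract population, grouping by the downtown condition used above
db.groupby(db['dcbd'] < 15)['POP00'].mean()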
numpy
numpy and scipy are the foundational libraries for any kind of numeric computing in Python. Numpy offers the efficient matrix structure called array or ndarray (the Numpy data array), as well as some basic statistical functions that can be applied to arrays.
To whet your appetite, let's first create a simple array. You can do this from a pre-existing Python list, for example:
l = [1, 2, 3, 4]
a = np.array(l)
a
At first sight, a is not very different from l; however, under the hood, it provides much more efficient structures for data manipulation (including C-optimized functions and other performance enhancements). Arrays contain only one data type and may have several dimensions, opening up the door for very fancy matrix manipulation.
# Check the type of the array's elements
print type(a[0])
# Appending a string makes the list mixed-type; numpy then coerces
# every element to a common (string) type
l += 'a'
a = np.array(l)
print type(a[0])
l = [[1, 2], [3, 4], [5, 6]]
a = np.array(l)
print 'Array a has a shape of: ', a.shape
print a
numpy supports operations between arrays, such as summation, difference, multiplication and division:
a = np.random.random((3, 2))
b = np.random.random((2, 3))
a, b
# Sum (note the transpose for dimensionality alignment)
a + b.T
# Difference (note the transpose for dimensionality alignment)
a - b.T
# Matrix product
c = np.dot(a, b)
c
# Matrix division by a scalar
c = a / 2.
c
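Multiplication and division between arrays are also elementwise; a quick sketch with the same a and b:
# Elementwise product and quotient (transposing again for alignment)
a * b.T
a / b.T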
scipy is the sister library of numpy and offers a wide range of statistical functions to operate on numpy arrays. This provides much of the functionality found in the core packages of other statistical languages like R (in the r-base package) or Matlab.
Besides the core of scipy, the project also includes additional packages called scikits that expand the main functionality in some particular way. Check out the scikits website to get a sense of what is covered.
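As a quick, illustrative sketch of the kind of routines scipy offers (the sample data here are made up, not part of the workshop files):
from scipy import stats
x = np.random.random(100)
stats.describe(x)  # basic summary statistics
stats.shapiro(x)   # Shapiro-Wilk test of normality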
pandas heavily relies on numpy under the hood, and it inherits many of its capabilities. For example, we can operate on vectors as one would do on numpy arrays:
db['pop_dens'] = db['POP00'] / db['ALAND']
matplotlib
matplotlib is the main tool for static graphical display in Python. It provides 2D and 3D functionality to plot data in a static way. The library may not appear very intuitive at first, but once you get over the learning curve, it is extremely flexible and allows you to tweak every aspect of a figure. Because of this focus on flexibility, the defaults may not be the prettiest, but with some work on them, Matplotlib can create beautiful figures that rival in quality those of any other library for static plotting (such as R's ggplot2).
Part of the basic functionality is wrapped by pandas, so that is a convenient way to get introduced to the library.
db['pop_dens'].plot(kind='kde')
db['dcbd'].hist(bins=50, color='k', alpha=0.5, grid=False)
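For finer control, you can work with matplotlib's own figure and axis objects; a minimal sketch (titles and labels are illustrative):
f, ax = plt.subplots(figsize=(6, 4))
db['dcbd'].hist(bins=50, ax=ax, color='k', alpha=0.5, grid=False)
ax.set_title('Distance to the CBD')
ax.set_xlabel('dcbd')
plt.show()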
# Use bracket syntax to assign columns (attribute assignment does not
# reliably create a new column)
db['pop_dens'] = db['POP00'] / db['ALAND']
db['pop_dens'].describe()
PySAL introduction
dbf files I/O:
dbf = ps.open('../workshop_data/houston_tract_pop_emp_wgs84.dbf')
dbf.header
df = pd.DataFrame({'emp_dens': dbf.by_col('emp_dens'), \
'pop_dens': dbf.by_col('pop_dens'), \
'dcbd': dbf.by_col('dcbd'), \
'downtown': dbf.by_col('downtown')})
df.info()
You can create spatial weights (W) objects from a shapefile:
w = ps.queen_from_shapefile('../../sdar_mini_repo/data/houston_tract_pop_emp_wgs84.shp')
w
Inspect and explore:
w.n  # number of observations
w[0]  # neighbors and weights of the first observation
w.transform = 'R'  # row-standardize the weights
w[0]
And save into a file:
f = ps.open('houston_tract_queen.gal', 'w')
f.write(w)
f.close()
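You can read it back in the same way (a quick check that the round trip worked):
w2 = ps.open('houston_tract_queen.gal', 'r').read()
w2.n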
PySAL has a submodule (pysal.weights.Wsets) that allows you to combine different weights to obtain more sophisticated representations of your geography:
from IPython.display import HTML
HTML('<iframe src=http://pysal.readthedocs.org/en/v1.7/library/weights/Wsets.html#pysal.weights.Wsets width=100% height=350></iframe>')
Let us create a W matrix that combines contiguity for the Houston tracts but excludes neighbors that are across the boundary of downtown (i.e. even if two tracts are contiguous, they are not taken as neighbors if one is downtown and the other one is in the suburbs).
# Downtown/suburbs block weights (block_weights was named
# regime_weights in older PySAL versions)
dt_sb = ps.weights.block_weights(dbf.by_col('downtown'))
# Queen example
queen = ps.queen_from_shapefile('../../sdar_mini_repo/data/houston_tract_pop_emp_wgs84.shp')
The matrix we want is the result of intersecting the two matrices we just created:
w = ps.weights.Wsets.w_intersection(queen, dt_sb)
w.transform = 'R'
This gives us a spatial weights matrix that combines contiguity with the downtown/suburbs block structure.
wy = ps.lag_spatial(w, df['emp_dens'])  # spatial lag of employment density
from pysal.contrib.viz import mapping as maps
shp_link = '../../sdar_mini_repo/data/houston_tract_pop_emp_wgs84.shp'
maps.plot_choropleth(shp_link, df['downtown'], 'unique_values', \
figsize=(12, 12))
shp_link = '../../sdar_mini_repo/data/houston_tract_pop_emp_wgs84.shp'
maps.plot_choropleth(shp_link, df['emp_dens'], 'quantiles', figsize=(12, 12))
ESDA tools are contained in the esda module.
mi = ps.Moran(df['emp_dens'], w)
mi.I
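Inference is available through the pseudo p-value computed from the random permutations Moran runs by default:
mi.p_sim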
lmi = ps.Moran_Local(df['emp_dens'].values, w)
lmi.p_sim
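To locate significant clusters, you can combine the pseudo p-values with the quadrant classification stored by Moran_Local; a minimal sketch using a 5% cut-off:
# Quadrant (1 HH, 2 LH, 3 LL, 4 HL) of the significant observations
sig = lmi.p_sim < 0.05
lmi.q[sig]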
spreg
Standard:
ols_base = ps.spreg.OLS(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens')
print ols_base.summary
ols_base.betas
ols_base.std_err
Using the White correction:
ols_white = ps.spreg.OLS(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, \
robust='white', \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens')
ols_white.std_err
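Passing the weights object and setting spat_diag=True adds spatial diagnostics (LM tests) to the output: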
ols_sp_diag = ps.spreg.OLS(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, w, \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens', \
spat_diag=True)
ols_sp_diag.lm_error
ols_sp_diag.lm_lag
print ols_sp_diag.summary
Fixed effects are not included by default in PySAL, but they are very straightforward to build with pandas.
fes = pd.get_dummies(df['downtown'], prefix='downtown')
fes.head()
x = df[['pop_dens', 'dcbd']].join(fes.drop('downtown_0', axis=1))
ols_fe = ps.spreg.OLS(df['emp_dens'].values[:, None], x.values, w, \
name_x = list(x.columns), name_y='emp_dens', \
spat_diag=True)
print ols_fe.summary
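You can also estimate separate coefficients for downtown and the suburbs with spatial regimes: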
downtown = df['downtown'].map({1: 'downtown', 0: 'suburbs'})
ols_regimes = ps.spreg.OLS_Regimes(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, downtown, w=w, \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens', \
spat_diag=True)
print ols_regimes.summary
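Another option is to include spatially lagged explanatory variables (a WX specification):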
df['w_pop_dens'] = ps.lag_spatial(w, df['pop_dens'])
ols_wx = ps.spreg.OLS(df['emp_dens'].values[:, None], \
df[['pop_dens', 'w_pop_dens', 'dcbd']].values, \
name_x = ['pop_dens', 'w_pop_dens', 'dcbd'], name_y='emp_dens')
print ols_wx.summary
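Spatial lag model: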
Using Instrumental Variables (IV), as in Kelejian & Prucha (1999):
lag_iv = ps.spreg.GM_Lag(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, w=w, \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens')
print lag_iv.summary
Using Maximum Likelihood (ML):
lag_ml = ps.spreg.ML_Lag(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, w=w, \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens', \
method='ord')
print lag_ml.summary
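Spatial error model: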
Using GMM proposed by Arraiz et al. (2010):
ps.spreg.GM_Error_Het?
error_gmm_arraiz = ps.spreg.GM_Error_Het(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, w=w, \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens')
print error_gmm_arraiz.summary
Using GMM proposed by Drukker et al. (2010):
error_gmm_drucker = ps.spreg.GM_Error_Hom(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, w=w, \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens')
print error_gmm_drucker.summary
Using Maximum Likelihood:
error_ml = ps.spreg.ML_Error(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, w=w, \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens')
print error_ml.summary
Applying a Spatial-HAC correction to the VC matrix:
wk = ps.kernelW_from_shapefile('../../sdar_mini_repo/data/houston_tract_pop_emp_wgs84.shp', \
k=15, function='triangular', fixed=False)
ols_hac = ps.spreg.OLS(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens', \
robust='hac', gwk=wk)
print ols_hac.summary
# Import the LM tests method
from pysal.spreg import LMtests
# Specify the weights objects we want to try as a list
ws = [w, queen, dt_sb]
# Run the OLS
model = ps.spreg.OLS(df['emp_dens'].values[:, None], \
                     df[['pop_dens', 'dcbd']].values, \
                     spat_diag=False, nonspat_diag=False)
# Set up the loop over the weights objects
for weights in ws:
    # Note we test against the current `weights`, not the global `w`
    lms = LMtests(model, weights)
    print '\tLM error: %.4f\t(%.4f)' % lms.lme
    print '\tLM lag: %.4f\t(%.4f)' % lms.lml
    print '\tSARMA: %.4f\t(%.4f)' % lms.sarma
    print '\tRobust LM error: %.4f\t(%.4f)' % lms.rlme
    print '\tRobust LM lag: %.4f\t(%.4f)' % lms.rlml
    print '----------------\n'