%matplotlib inline
import pandas as pd
import numpy as np
import pysal as ps
import matplotlib.pyplot as plt
pandas
Data structures to read, interact with, transform, and write structured data.
db = pd.read_csv('../workshop_data/Houston_pop00.csv')
db.info()
db.head()
db.tail()
db.describe().T
downtown = db[db['dcbd'] < 15]
downtown.info()
And many more operations on tabular (multi-)indexed data. Check the documentation for more info and tutorials.
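As a small illustration of a couple more of those operations (a sketch that assumes only the columns shown above):
# Column-wise totals for population and land area
db[['POP00', 'ALAND']].sum()
# Average tract population, grouping by the downtown condition used above
db.groupby(db['dcbd'] < 15)['POP00'].mean()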
numpy
numpy and scipy are the foundational libraries for any kind of numeric computing in Python. Numpy offers the efficient matrix structure called array or ndarray (the Numpy data array), as well as some basic statistical functions that can be applied to arrays.
To whet your appetite, let's first create a simple array. You can do this from a pre-existing Python list, for example:
l = [1, 2, 3, 4]
a = np.array(l)
a
At first sight, a is not very different from l; however, under the hood, it provides much more efficient structures for data manipulation (including C-optimized functions and other performance enhancements). Arrays contain only one data type and may have several dimensions, opening up the door for very fancy matrix manipulation.
# Check the type of the array's elements
print type(a[0])
# Appending a string makes the list mixed-type; numpy then coerces
# every element to a common (string) type
l += 'a'
a = np.array(l)
print type(a[0])
l = [[1, 2], [3, 4], [5, 6]]
a = np.array(l)
print 'Array a has a shape of: ', a.shape
print a
numpy supports operations between arrays, such as summation, difference, multiplication and division:
a = np.random.random((3, 2))
b = np.random.random((2, 3))
a, b
# Sum (note the transpose for dimensionality alignment)
a + b.T
# Difference (note the transpose for dimensionality alignment)
a - b.T
# Matrix product
c = np.dot(a, b)
c
# Matrix division by a scalar
c = a / 2.
c
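Multiplication and division between arrays are also elementwise; a quick sketch with the same a and b:
# Elementwise product and quotient (transposing again for alignment)
a * b.T
a / b.T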
scipy is the sister library of numpy and offers a wide range of statistical functions to operate on numpy arrays. This provides much of the functionality found in the core packages of other statistical languages like R (in the r-base package) or Matlab.
Besides the core of scipy, the project also includes additional packages called scikits that expand the main functionality in some particular way. Check out the scikits website to get a sense of what is covered.
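As a quick, illustrative sketch of the kind of routines scipy offers (the sample data here are made up, not part of the workshop files):
from scipy import stats
x = np.random.random(100)
stats.describe(x)  # basic summary statistics
stats.shapiro(x)   # Shapiro-Wilk test of normality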
pandas heavily relies on numpy under the hood, and it inherits many of its capabilities. For example, we can operate on vectors as one would do on numpy arrays:
db['pop_dens'] = db['POP00'] / db['ALAND']
matplotlib
matplotlib is the main tool for static graphical display in Python. It provides 2D and 3D functionality to plot data in a static way. The library may not appear very intuitive at first, but once you get over the learning curve, it is extremely flexible and allows you to tweak every aspect of a figure. Because of this focus on flexibility, the defaults may not be the prettiest, but with some work on them, Matplotlib can create beautiful figures that rival in quality those of any other library for static plotting (such as R's ggplot2).
Part of the basic functionality is wrapped by pandas, so that is a convenient way to get introduced to the library.
db['pop_dens'].plot(kind='kde')
db['dcbd'].hist(bins=50, color='k', alpha=0.5, grid=False)
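For finer control, you can work with matplotlib's own figure and axis objects; a minimal sketch (titles and labels are illustrative):
f, ax = plt.subplots(figsize=(6, 4))
db['dcbd'].hist(bins=50, ax=ax, color='k', alpha=0.5, grid=False)
ax.set_title('Distance to the CBD')
ax.set_xlabel('dcbd')
plt.show()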
# Use bracket syntax to assign columns (attribute assignment does not
# reliably create a new column)
db['pop_dens'] = db['POP00'] / db['ALAND']
db['pop_dens'].describe()
PySAL introduction
dbf files I/O:
dbf = ps.open('../workshop_data/houston_tract_pop_emp_wgs84.dbf')
dbf.header
df = pd.DataFrame({'emp_dens': dbf.by_col('emp_dens'), \
'pop_dens': dbf.by_col('pop_dens'), \
'dcbd': dbf.by_col('dcbd'), \
'downtown': dbf.by_col('downtown')})
df.info()
You can create spatial weights (W) objects from a shapefile:
w = ps.queen_from_shapefile('../../sdar_mini_repo/data/houston_tract_pop_emp_wgs84.shp')
w
Inspect and explore:
w.n  # number of observations
w[0]  # neighbors and weights of the first observation
w.transform = 'R'  # row-standardize the weights
w[0]
And save into a file:
f = ps.open('houston_tract_queen.gal', 'w')
f.write(w)
f.close()
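You can read it back in the same way (a quick check that the round trip worked):
w2 = ps.open('houston_tract_queen.gal', 'r').read()
w2.n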
PySAL has a submodule (pysal.weights.Wsets) that allows you to combine different weights to obtain more sophisticated representations of your geography:
from IPython.display import HTML
HTML('<iframe src=http://pysal.readthedocs.org/en/v1.7/library/weights/Wsets.html#pysal.weights.Wsets width=100% height=350></iframe>')
Let us create a W matrix that combines contiguity for the Houston tracts but excludes neighbors that are across the boundary of downtown (i.e. even if two tracts are contiguous, they are not taken as neighbors if one is downtown and the other one is in the suburbs).
# Downtown/suburbs block weights (block_weights was named
# regime_weights in older PySAL versions)
dt_sb = ps.weights.block_weights(dbf.by_col('downtown'))
# Queen example
queen = ps.queen_from_shapefile('../../sdar_mini_repo/data/houston_tract_pop_emp_wgs84.shp')
The matrix we want is the result of intersecting the two matrices we just created:
w = ps.weights.Wsets.w_intersection(queen, dt_sb)
w.transform = 'R'
This gives us a spatial weights matrix that combines contiguity with the downtown/suburbs block structure.
wy = ps.lag_spatial(w, df['emp_dens'])  # spatial lag of employment density
from pysal.contrib.viz import mapping as maps
shp_link = '../../sdar_mini_repo/data/houston_tract_pop_emp_wgs84.shp'
maps.plot_choropleth(shp_link, df['downtown'], 'unique_values', \
figsize=(12, 12))
shp_link = '../../sdar_mini_repo/data/houston_tract_pop_emp_wgs84.shp'
maps.plot_choropleth(shp_link, df['emp_dens'], 'quantiles', figsize=(12, 12))
ESDA tools are contained in the esda module.
mi = ps.Moran(df['emp_dens'], w)
mi.I
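Inference is available through the pseudo p-value computed from the random permutations Moran runs by default:
mi.p_sim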
lmi = ps.Moran_Local(df['emp_dens'].values, w)
lmi.p_sim
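To locate significant clusters, you can combine the pseudo p-values with the quadrant classification stored by Moran_Local; a minimal sketch using a 5% cut-off:
# Quadrant (1 HH, 2 LH, 3 LL, 4 HL) of the significant observations
sig = lmi.p_sim < 0.05
lmi.q[sig]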
spreg
Standard:
ols_base = ps.spreg.OLS(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens')
print ols_base.summary
ols_base.betas
ols_base.std_err
Using the White correction:
ols_white = ps.spreg.OLS(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, \
robust='white', \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens')
ols_white.std_err
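Passing the weights object and setting spat_diag=True adds spatial diagnostics (LM tests) to the output: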
ols_sp_diag = ps.spreg.OLS(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, w, \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens', \
spat_diag=True)
ols_sp_diag.lm_error
ols_sp_diag.lm_lag
print ols_sp_diag.summary
Fixed effects are not included by default in PySAL, but they are very straightforward to build with pandas.
fes = pd.get_dummies(df['downtown'], prefix='downtown')
fes.head()
x = df[['pop_dens', 'dcbd']].join(fes.drop('downtown_0', axis=1))
ols_fe = ps.spreg.OLS(df['emp_dens'].values[:, None], x.values, w, \
name_x = list(x.columns), name_y='emp_dens', \
spat_diag=True)
print ols_fe.summary
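You can also estimate separate coefficients for downtown and the suburbs with spatial regimes: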
downtown = df['downtown'].map({1: 'downtown', 0: 'suburbs'})
ols_regimes = ps.spreg.OLS_Regimes(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, downtown, w=w, \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens', \
spat_diag=True)
print ols_regimes.summary
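Another option is to include spatially lagged explanatory variables (a WX specification):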
df['w_pop_dens'] = ps.lag_spatial(w, df['pop_dens'])
ols_wx = ps.spreg.OLS(df['emp_dens'].values[:, None], \
df[['pop_dens', 'w_pop_dens', 'dcbd']].values, \
name_x = ['pop_dens', 'w_pop_dens', 'dcbd'], name_y='emp_dens')
print ols_wx.summary
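Spatial lag model: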
Using Instrumental Variables (IV), as in Kelejian & Prucha (1999):
lag_iv = ps.spreg.GM_Lag(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, w=w, \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens')
print lag_iv.summary
Using Maximum Likelihood (ML):
lag_ml = ps.spreg.ML_Lag(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, w=w, \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens', \
method='ord')
print lag_ml.summary
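Spatial error model: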
Using GMM proposed by Arraiz et al. (2010):
ps.spreg.GM_Error_Het?
error_gmm_arraiz = ps.spreg.GM_Error_Het(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, w=w, \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens')
print error_gmm_arraiz.summary
Using GMM proposed by Drukker et al. (2010):
error_gmm_drucker = ps.spreg.GM_Error_Hom(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, w=w, \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens')
print error_gmm_drucker.summary
Using Maximum Likelihood:
error_ml = ps.spreg.ML_Error(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, w=w, \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens')
print error_ml.summary
Applying a Spatial-HAC correction to the VC matrix:
wk = ps.kernelW_from_shapefile('../../sdar_mini_repo/data/houston_tract_pop_emp_wgs84.shp', \
k=15, function='triangular', fixed=False)
ols_hac = ps.spreg.OLS(df['emp_dens'].values[:, None], \
df[['pop_dens', 'dcbd']].values, \
name_x = ['pop_dens', 'dcbd'], name_y='emp_dens', \
robust='hac', gwk=wk)
print ols_hac.summary
# Import the LM tests method
from pysal.spreg import LMtests
# Specify the weights objects we want to try as a list
ws = [w, queen, dt_sb]
# Run the OLS
model = ps.spreg.OLS(df['emp_dens'].values[:, None], \
                     df[['pop_dens', 'dcbd']].values, \
                     spat_diag=False, nonspat_diag=False)
# Set up the loop over the weights objects
for weights in ws:
    # Note we test against the current `weights`, not the global `w`
    lms = LMtests(model, weights)
    print '\tLM error: %.4f\t(%.4f)' % lms.lme
    print '\tLM lag: %.4f\t(%.4f)' % lms.lml
    print '\tSARMA: %.4f\t(%.4f)' % lms.sarma
    print '\tRobust LM error: %.4f\t(%.4f)' % lms.rlme
    print '\tRobust LM lag: %.4f\t(%.4f)' % lms.rlml
    print '----------------\n'