In [2]:
import pysal as ps  # 1.5 or higher
import numpy as np  # 1.7 or higher
import pandas as pd # 0.10 or higher

A brief introduction to PySAL, a library for spatial analysis

In this notebook, I will present the main functionality of the PySAL library. I will start with a bit of background and an overview, and then walk through the following sections:

  • Data I/O: read and write shapefiles
  • Shapefile visualization
  • Spatial Weights
  • Core functionality of PySAL
    • ESDA & spatial autocorrelation
    • Specific modules: region, spreg, spatial_dynamics and inequality
    • contrib module
  • A sneak peek into current and future plans

Data I/O

One of the main advantages and contributions of PySAL is the easy interface it provides to ESRI shapefiles. The shapefile is the most common vector format for spatial data, and it is widely used in the social sciences because of the nature of the data typically at hand (areal or point).

For this example, we will use the shapefile amsterdam_hoods.shp from the data folder, which contains the polygons of the different neighborhoods in Amsterdam. Along with the geometry of the polygons comes a .dbf file that contains aggregated data for the checkins we explored in the previous notebook.

Shapefiles

PySAL has a very pythonic way of interacting with shapefiles, with an open command similar to that of the standard library:

In [3]:
shp_link = 'data/amsterdam_hoods.shp'
shp = ps.open(shp_link)
shp
Out[3]:
<pysal.core.IOHandlers.pyShpIO.PurePyShpWrapper at 0x3c9c250>

Once open, we can get different attributes of the shapefile right away:

In [4]:
shp.header
Out[4]:
{'BBOX Mmax': 0.0,
 'BBOX Mmin': 0.0,
 'BBOX Xmax': 5.06863004280464,
 'BBOX Xmin': 4.72926974364051,
 'BBOX Ymax': 52.43213807123786,
 'BBOX Ymin': 52.2794164487147,
 'BBOX Zmax': 0.0,
 'BBOX Zmin': 0.0,
 'File Code': 9994,
 'File Length': 29934,
 'Shape Type': 5,
 'Unused0': 0,
 'Unused1': 0,
 'Unused2': 0,
 'Unused3': 0,
 'Unused4': 0,
 'Version': 1000}
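
The header fields follow the ESRI shapefile spec; a Shape Type of 5, for instance, means the file contains polygons. As a quick sketch, you can translate the most common type codes yourself (the code-to-name mapping below comes from the ESRI white paper, not from PySAL):

In [ ]:
# Common shape type codes from the ESRI shapefile spec
shape_types = {1: 'Point', 3: 'PolyLine', 5: 'Polygon', 8: 'MultiPoint'}
print shape_types[shp.header['Shape Type']]  # -> Polygon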
In [5]:
shp.bbox
Out[5]:
[4.72926974364051, 52.2794164487147, 5.06863004280464, 52.43213807123786]
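
The bounding box comes back as [xmin, ymin, xmax, ymax] (compare it with the BBOX entries in the header above), so you can unpack it directly:

In [ ]:
west, south, east, north = shp.bbox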

The object is an iterator over the polygons, in the same order as the rows in the dbf, so you can loop over them to operate on each individually:

In [6]:
for poly in shp:
    print "Iterating over polygon: ", poly.id
Iterating over polygon:  1
Iterating over polygon:  2
Iterating over polygon:  3
...
Iterating over polygon:  95
Iterating over polygon:  96

Or, if the shapefile is not too big and your machine has enough RAM, you can pack all the polygons into a list:

In [7]:
polys = list(shp)
poly0 = shp[0]
poly0
Out[7]:
<pysal.cg.shapes.Polygon at 0x3c9c410>

Spend some time exploring the attributes of polygons:

In [8]:
# Note that this does not make much sense if the shapefile is not projected
poly0.area
Out[8]:
5.3677530927880024e-05
In [9]:
poly0.centroid
Out[9]:
(4.89742311742521, 52.37362424151918)
In [10]:
# Does not contain holes
poly0.holes
Out[10]:
[[]]
In [11]:
poly0.perimeter
Out[11]:
0.03272855678890349
In [12]:
poly0.vertices
Out[12]:
[(4.895853931294426, 52.3685306056879),
 (4.895608395880829, 52.368579135637),
 (4.895069059784516, 52.36827136401666),
 (4.894453735047168, 52.36833356576813),
 (4.893137111197583, 52.369431533940286),
 (4.892812396329113, 52.37008523352424),
 (4.893413457272433, 52.373189498497105),
 (4.893267486971745, 52.373313473206),
 (4.893312858925569, 52.37373586490692),
 (4.89351432623952, 52.37421329026471),
 (4.89689279143361, 52.37673534321573),
 (4.897384789054209, 52.376512932627534),
 (4.899711293018493, 52.37820146422082),
 (4.901999584276892, 52.37830579506311),
 (4.903090918019509, 52.37809026749979),
 (4.90352016827385, 52.3777682040377),
 (4.903251031067214, 52.377410159801585),
 (4.90281504973314, 52.37723979098532),
 (4.901595851275064, 52.37519349037614),
 (4.901280145956854, 52.3745915200307),
 (4.901099875926198, 52.37416215032509),
 (4.9014017684343205, 52.37410802005844),
 (4.900795976616639, 52.37307474415859),
 (4.900225948282847, 52.37319945442478),
 (4.899076049145883, 52.3719953434008),
 (4.895853931294426, 52.3685306056879)]
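
Polygons also come with handy predicates. For example, you can test whether a point falls inside a polygon; a minimal sketch, assuming pysal.cg polygons expose contains_point (and keeping in mind that the centroid of a concave polygon may fall outside it):

In [ ]:
# Check whether the polygon contains its own centroid
poly0.contains_point(poly0.centroid)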

Writing shapefiles is equally simple: all you have to do is open a new file for writing and set the appropriate specs:

In [13]:
shpo = ps.open('data/test.shp', 'w')
shpo.header = shp.header

Now assume you want to write only the first polygon:

In [14]:
shpo.write(poly0)
shpo.close()
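
To write the entire layer instead of a single polygon, loop over all the geometries before closing the file. A sketch, writing to a hypothetical data/test_all.shp:

In [ ]:
shpo = ps.open('data/test_all.shp', 'w')
shpo.header = shp.header
for poly in polys:
    shpo.write(poly)
shpo.close()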

DBF files

As mentioned, along with the .shp file there is a .dbf file that contains attribute data for the polygons. You can easily access and load it with PySAL as well:

In [15]:
dbf = ps.open(shp_link.replace('.shp', '.dbf'))

Very much like with shapefiles, once it is loaded you can access several of its attributes:

In [16]:
dbf.header
Out[16]:
['h_0',
 'h_1',
 'h_10',
 'h_11',
 'h_12',
 'h_13',
 'h_14',
 'h_15',
 'h_16',
 'h_17',
 'h_18',
 'h_19',
 'h_2',
 'h_20',
 'h_21',
 'h_22',
 'h_23',
 'h_3',
 'h_4',
 'h_5',
 'h_6',
 'h_7',
 'h_8',
 'h_9',
 'total',
 'total_rando',
 'venues',
 'buurt']
In [17]:
dbf.field_spec
Out[17]:
[('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('N', 36, 15),
 ('C', 14, 0)]
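
Each tuple in field_spec describes one column: the field type ('N' for numeric, 'C' for character), the field width, and the number of decimal places. You can pair every column name with its spec in one pass:

In [ ]:
for name, spec in zip(dbf.header, dbf.field_spec):
    print name, spec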
In [18]:
# Pull out column "total"
total = dbf.by_col("total")
total
Out[18]:
[2750.0,
 6620.0,
 2521.0,
 3468.0,
 ...
 2179.0,
 815.0]

Pulling out a column results in a Python list, which may not be the most efficient structure for storing and operating on the data; it is usually preferable to convert it into a numpy array. For instance, you can automatically pull out all of the columns in the header:

In [19]:
cols = np.array([dbf.by_col(col) for col in dbf.header]).T
cols
Out[19]:
array([['30.0', '12.0', '136.0', ..., '483.0', '95.0', 'BU03630000'],
       ['54.0', '34.0', '339.0', ..., '88.0', '116.0', 'BU03630001'],
       ['18.0', '10.0', '132.0', ..., '815.0', '68.0', 'BU03630002'],
       ..., 
       ['2.0', '0.0', '42.0', ..., '420.0', '10.0', 'BU03631459'],
       ['0.0', '0.0', '149.0', ..., '1477.0', '24.0', 'BU03631490'],
       ['3.0', '1.0', '43.0', ..., '739.0', '7.0', 'BU03631491']], 
      dtype='|S10')

Note that, because arrays only allow one data type, all the columns are coerced into strings (the neighborhood names cannot be interpreted as numbers). We can instead extract only the columns whose data type is not a string. This requires a bit of Python-fu, but it is possible in a single line:

In [20]:
numeric_cols = np.array([dbf.by_col(dbf.header[i]) for i in range(len(dbf.header)) if dbf.field_spec[i][0] == 'N']).T
numeric_cols
Out[20]:
array([[  3.00000000e+01,   1.20000000e+01,   1.36000000e+02, ...,
          2.75000000e+03,   4.83000000e+02,   9.50000000e+01],
       [  5.40000000e+01,   3.40000000e+01,   3.39000000e+02, ...,
          6.62000000e+03,   8.80000000e+01,   1.16000000e+02],
       [  1.80000000e+01,   1.00000000e+01,   1.32000000e+02, ...,
          2.52100000e+03,   8.15000000e+02,   6.80000000e+01],
       ..., 
       [  2.00000000e+00,   0.00000000e+00,   4.20000000e+01, ...,
          5.43000000e+02,   4.20000000e+02,   1.00000000e+01],
       [  0.00000000e+00,   0.00000000e+00,   1.49000000e+02, ...,
          2.17900000e+03,   1.47700000e+03,   2.40000000e+01],
       [  3.00000000e+00,   1.00000000e+00,   4.30000000e+01, ...,
          8.15000000e+02,   7.39000000e+02,   7.00000000e+00]])

However, it would be very nice to pull everything into one single object that holds all the information properly. This is a perfect task for a pandas.DataFrame. We will take advantage of the fact that you can create one from a dictionary where each key is a column name and each value is the column itself as a numpy array:

In [21]:
d = dict([(col, np.array(dbf.by_col(col))) for col in dbf.header])
df = pd.DataFrame(d)
df
Out[21]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 96 entries, 0 to 95
Data columns:
buurt          96  non-null values
h_0            96  non-null values
h_1            96  non-null values
h_10           96  non-null values
h_11           96  non-null values
h_12           96  non-null values
h_13           96  non-null values
h_14           96  non-null values
h_15           96  non-null values
h_16           96  non-null values
h_17           96  non-null values
h_18           96  non-null values
h_19           96  non-null values
h_2            96  non-null values
h_20           96  non-null values
h_21           96  non-null values
h_22           96  non-null values
h_23           96  non-null values
h_3            96  non-null values
h_4            96  non-null values
h_5            96  non-null values
h_6            96  non-null values
h_7            96  non-null values
h_8            96  non-null values
h_9            96  non-null values
total          96  non-null values
total_rando    96  non-null values
venues         96  non-null values
dtypes: float64(27), object(1)

Note that the columns are ordered alphabetically. If we want to restore the original order:

In [22]:
df = df[dbf.header]
df
Out[22]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 96 entries, 0 to 95
Data columns:
h_0            96  non-null values
h_1            96  non-null values
h_10           96  non-null values
h_11           96  non-null values
h_12           96  non-null values
h_13           96  non-null values
h_14           96  non-null values
h_15           96  non-null values
h_16           96  non-null values
h_17           96  non-null values
h_18           96  non-null values
h_19           96  non-null values
h_2            96  non-null values
h_20           96  non-null values
h_21           96  non-null values
h_22           96  non-null values
h_23           96  non-null values
h_3            96  non-null values
h_4            96  non-null values
h_5            96  non-null values
h_6            96  non-null values
h_7            96  non-null values
h_8            96  non-null values
h_9            96  non-null values
total          96  non-null values
total_rando    96  non-null values
venues         96  non-null values
buurt          96  non-null values
dtypes: float64(27), object(1)
In [23]:
df.describe().T # Use .T to transpose the DataFrame and allow it to fit on screen
Out[23]:
count mean std min 25% 50% 75% max
h_0 96 8.104167 14.072354 0 0.75 2.0 6.00 81
h_1 96 4.822917 8.776757 0 0.00 1.0 4.25 51
h_10 96 43.375000 56.858874 0 6.00 22.5 56.50 339
h_11 96 48.781250 65.720333 0 9.75 21.0 56.25 380
h_12 96 44.625000 61.220267 0 7.75 20.0 56.75 375
h_13 96 44.239583 64.715205 0 7.75 17.5 50.25 402
h_14 96 46.104167 66.630083 0 7.00 20.0 60.25 393
h_15 96 49.458333 72.998330 0 7.00 21.0 60.25 497
h_16 96 55.718750 85.670569 0 9.75 24.5 69.25 603
h_17 96 59.229167 84.347212 0 11.00 31.5 59.25 513
h_18 96 58.468750 83.960010 0 9.75 25.5 58.00 388
h_19 96 47.739583 68.305698 0 6.00 23.0 58.00 360
h_2 96 2.343750 4.366929 0 0.00 1.0 2.00 22
h_20 96 38.989583 54.179914 0 6.00 20.5 47.25 308
h_21 96 32.364583 44.066247 0 4.50 16.0 39.25 251
h_22 96 23.072917 32.214820 0 3.00 10.5 28.25 186
h_23 96 14.343750 21.206170 0 2.00 7.5 18.25 130
h_3 96 1.656250 3.505869 0 0.00 0.0 2.00 24
h_4 96 3.906250 7.170228 0 0.00 1.0 4.00 36
h_5 96 12.166667 25.137377 0 1.00 5.0 10.25 172
h_6 96 31.760417 48.716526 0 5.00 13.0 35.25 287
h_7 96 47.927083 64.391115 0 7.75 21.5 71.00 372
h_8 96 46.489583 58.026127 0 8.00 19.5 77.25 324
h_9 96 40.843750 51.638894 0 7.75 21.0 57.25 292
total 96 806.531250 1070.210338 5 147.00 422.0 1028.25 6620
total_rando 96 806.531250 1070.210338 5 147.00 422.0 1028.25 6620
venues 96 12.864583 22.031852 0 1.00 6.0 11.00 116

As before, writing dbf files is no problem: create a new output file, set up the right specifications, and write the records. For the sake of the example, we will write the record of the first polygon to a new .dbf, completing the example started with the shapefile.

In [24]:
dbfo = ps.open('data/test.dbf', 'w')
dbfo.header = dbf.header
dbfo.field_spec = dbf.field_spec
# Records need to be in a Python list
rec = list(df.ix[0].values)
dbfo.write(rec)
# Don't forget to close the file: the data is only written to disk at that moment!
dbfo.close()
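
Extending the example, writing out the full table is just a loop over the rows of the DataFrame; a sketch, writing to a hypothetical data/test_all.dbf:

In [ ]:
dbfo = ps.open('data/test_all.dbf', 'w')
dbfo.header = dbf.header
dbfo.field_spec = dbf.field_spec
for i in df.index:
    dbfo.write(list(df.ix[i].values))
dbfo.close()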

Shapefile visualization

Although still experimental and rough around the edges, PySAL provides a module to visualize shapefiles. It is clean and lean and, unlike other options such as Basemap, does not depend on hard-to-install third-party libraries. The code lives in the viz module, part of contrib, PySAL's sandbox layer for experimental code and interfaces to third-party libraries.

Eventually, there will be two tiers to the API: a low-level one that translates a shapefile object into a Matplotlib-compatible one but does not produce any map right away (very flexible); and a high-level one that offers canned versions of very common plots (à la pandas).

Let's begin with a quick and dirty visualization of the polygons:

In [25]:
from pysal.contrib.viz import mapping as viz
In [26]:
viz.plot_poly_lines(shp_link)

For the sake of pedagogy, let's create a similar plot using the low-level interface:

In [27]:
# figure() and show() assume the notebook was started with %pylab inline;
# import them explicitly to be safe
from matplotlib.pyplot import figure, show

f = figure()
ax = f.add_subplot(111)
poly_collection = viz.map_poly_shp(shp_link)
poly_collection.set_facecolor('none')
ax.add_collection(poly_collection)
ax.autoscale_view()
ax.set_frame_on(False)
ax.axes.get_yaxis().set_visible(False)
ax.axes.get_xaxis().set_visible(False)
show()

An advantage of the latter is that it gives you much more flexibility to, for example, add an extra layer to the plot. Let's bring back the checkins from the previous notebook and overlay them on the figure (we will create a larger one so the detail is easier to see):

In [28]:
db = pd.read_csv('http://ubuntuone.com/4oIpVJDCpdREhzNdSRwMx8')

# Again, import the pylab-style names explicitly
from matplotlib.pyplot import figure, scatter, title, show

f = figure(figsize=(12, 12))
ax = f.add_subplot(111)
poly_collection = viz.map_poly_shp(shp_link)
poly_collection.set_facecolor('none')
ax.add_collection(poly_collection)
ax.autoscale_view()
ax.set_frame_on(False)
ax.axes.get_yaxis().set_visible(False)
ax.axes.get_xaxis().set_visible(False)
scatter(db['lon'].values, db['lat'].values, marker='.', alpha=0.5, s=2.5)
title("Spatial distribution of checkins in Amsterdam")
show()
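
Finally, because map_poly_shp returns a standard Matplotlib collection, you can go beyond outlines and color the polygons by a variable with plain Matplotlib calls. A sketch, assuming the returned object behaves like a regular PatchCollection and that the polygon order matches the rows of the dbf:

In [ ]:
from matplotlib.pyplot import figure, colorbar, show

f = figure(figsize=(9, 9))
ax = f.add_subplot(111)
poly_collection = viz.map_poly_shp(shp_link)
# Color each polygon by its total number of checkins
poly_collection.set_array(df['total'].values)
ax.add_collection(poly_collection)
ax.autoscale_view()
ax.set_frame_on(False)
ax.axes.get_yaxis().set_visible(False)
ax.axes.get_xaxis().set_visible(False)
colorbar(poly_collection)
show()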