spandex
¶from IPython.display import Image
Image('diagrams/banner.png')
from IPython.display import Image
Image('diagrams/banner2.png')
spandex can be thought of as a lightweight, flexible spatial ETL (Extract, Transform, Load) for Python scripts.
We will work with a selection of public data corresponding to the Heather Farms neighborhood in Walnut Creek, Contra Costa County, CA. The data is provided as shapefiles.
Filename | Description | |
---|---|---|
bg.shp |
Block Groups | Census |
parcels.shp |
Parcels | Tax Assesor |
water.shp |
Bodies of Water | MTC |
%%bash
tree data
data ├── bg.dbf ├── bg.prj ├── bg.shp ├── bg.shx ├── parcels.dbf ├── parcels.prj ├── parcels.shp ├── parcels.shx ├── water.dbf ├── water.prj ├── water.shp └── water.shx 0 directories, 12 files
For demonstration purposes, we will increase the log level.
import logging
logging.basicConfig()
logger = logging.getLogger('spandex')
logger.setLevel(logging.INFO)
In a configuration file, we will configure spandex to use the local PostgreSQL database and the data directory.
%%bash
cat spandex.cfg
[database] # Database configuration is passed to psycopg2.connect, which also supports # libpq connection parameters: # http://initd.org/psycopg/docs/module.html#psycopg2.connect database = demo user = demo password = demo host = localhost [data] # Directory containing shapefiles to import. Can be an absolute path or # relative to the current working directory. directory = data # Spatial Reference System Identifier (SRID) of the imported shapefiles: # http://spatialreference.org/ # NAD83(HARN) / California zone 3. srid = 2768
Now let's load the shapefiles into the database with spandex.
from spandex import TableLoader
# Connect to database specified in configuration file.
loader = TableLoader(config_filename='spandex.cfg')
# Import the shapefiles.
# Automatically determines the projection and attribute encoding.
loader.load_shp_map({'bg': 'bg.shp',
'parcels': 'parcels.shp',
'water': 'water.shp'})
# Asssign shorter variable names to our tables for convenience.
bg = loader.tables.public.bg
parcels = loader.tables.public.parcels
water = loader.tables.public.water
INFO:spandex.io:Loading table water (SRID: 26910) from file water.shp (encoding: LATIN1). INFO:spandex.io:Loading table bg (SRID: 26910) from file bg.shp (encoding: LATIN1). INFO:spandex.io:Loading table parcels (SRID: 2227) from file parcels.shp (encoding: LATIN1).
Let's inspect the tables in Pandas.
from spandex.io import db_to_df
print(db_to_df.__doc__)
Return DataFrame from Query, table, or ORM objects, like columns. Parameters ---------- query : sqlalchemy.orm.Query, sqlalchemy.ext.declarative.DeclarativeMeta, or iterable Query ORM object, table ORM class, or list of ORM objects to query, like columns. index_name : str, optional Name of column to use as DataFrame index. If provided, column must be contained in query. Returns ------- df : pandas.DataFrame
# Census Block Groups
db_to_df(bg).head()
gid | objectid | statefp | countyfp | tractce | blkgrpce | geoid | namelsad | mtfcc | funcstat | ... | intptlat | intptlon | shape_leng | gisjoin | shape_le_1 | shape_area | hu | acres | huptac | geom | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1747 | 06 | 013 | 325000 | 1 | 060133250001 | Block Group 1 | G5030 | S | ... | +37.9373464 | -122.0657006 | 4403.996420 | G06001303250001 | 4394.178049 | 844524.907804 | 577 | 208.687 | 2.764910 | 3031303630303030323031453639303030303031303030... |
1 | 2 | 1748 | 06 | 013 | 340001 | 3 | 060133400013 | Block Group 3 | G5030 | S | ... | +37.9224487 | -122.0712187 | 4575.875026 | G06001303400013 | 4582.116715 | 880433.875758 | 527 | 217.560 | 2.422320 | 3031303630303030323031453639303030303031303030... |
2 | 3 | 1749 | 06 | 013 | 340001 | 4 | 060133400014 | Block Group 4 | G5030 | S | ... | +37.9160703 | -122.0700798 | 4275.774639 | G06001303400014 | 4275.691239 | 714331.577304 | 642 | 176.515 | 3.637080 | 3031303630303030323031453639303030303031303030... |
3 | 4 | 1789 | 06 | 013 | 346101 | 2 | 060133461012 | Block Group 2 | G5030 | S | ... | +37.8836058 | -122.0012773 | 14784.078161 | G06001303461012 | 14819.308820 | 6939940.134170 | 411 | 1714.900 | 0.239665 | 3031303630303030323031453639303030303031303030... |
4 | 5 | 1800 | 06 | 013 | 325000 | 3 | 060133250003 | Block Group 3 | G5030 | S | ... | +37.9300102 | -122.0799965 | 4654.367323 | G06001303250003 | 4664.409460 | 955305.947107 | 695 | 236.061 | 2.944150 | 3031303630303030323031453639303030303031303030... |
5 rows × 22 columns
# Parcels
parcels_df = db_to_df(parcels)
parcels_df.head()
gid | apn | parc_py_id | use_code | land_value | imp_val | yr_hs_blt | tla | yr_built | bldg_sqft | geom | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 148160029 | 135163 | 11 | 386290 | 443710 | 1979 | 2965 | None | 0 | 3031303630303030323042333038303030303031303030... |
1 | 2 | 148350020 | 136766 | 12 | 94218 | 58618 | 1927 | 3164 | None | 0 | 3031303630303030323042333038303030303031303030... |
2 | 3 | 143040077 | 138154 | 54 | 758486 | 14647195 | None | 0 | None | 103366 | 3031303630303030323042333038303030303031303030... |
3 | 4 | 148400035 | 132108 | 29 | 57421 | 41448 | 1979 | 708 | None | 0 | 3031303630303030323042333038303030303031303030... |
4 | 5 | 144170032 | 132165 | 29 | 195986 | 251514 | 1973 | 1568 | None | 0 | 3031303630303030323042333038303030303031303030... |
# Multiple records have the same parcel ID.
parcels_df[parcels_df.parc_py_id.duplicated()].head()
gid | apn | parc_py_id | use_code | land_value | imp_val | yr_hs_blt | tla | yr_built | bldg_sqft | geom | |
---|---|---|---|---|---|---|---|---|---|---|---|
8 | 9 | 148400036 | 132108 | 29 | 57421 | 41448 | 1979 | 708 | None | 0 | 3031303630303030323042333038303030303031303030... |
40 | 41 | 148400024 | 131862 | 29 | 57421 | 41448 | 1979 | 708 | None | 0 | 3031303630303030323042333038303030303031303030... |
42 | 43 | 148400022 | 131847 | 29 | 180442 | 79558 | 1979 | 708 | None | 0 | 3031303630303030323042333038303030303031303030... |
50 | 51 | 148400040 | 107759 | 87 | 0 | 0 | 0000 | 0 | None | 0 | 3031303630303030323042333038303030303031303030... |
58 | 59 | 148400018 | 131677 | 29 | 57421 | 41448 | 1979 | 708 | None | 0 | 3031303630303030323042333038303030303031303030... |
# Bodies of Water
db_to_df(water.geom).head()
geom | |
---|---|
0 | 3031303630303030323031453639303030303146303030... |
from spandex import spatialtoolz
# Reproject to a consistent projection.
print(spatialtoolz.conform_srids.__doc__)
print("Reprojecting to project SRID: {}".format(loader.srid))
spatialtoolz.conform_srids(loader.srid)
Reproject all non-conforming geometry columns into the specified SRID. Parameters ---------- srid : int Spatial Reference System Identifier (SRID). schema : schema class If schema is specified, only SRIDs within the specified schema are conformed. Returns ------- None Reprojecting to project SRID: 2768
# Find overlapping geometries.
print(spatialtoolz.geom_overlapping.__doc__)
spatialtoolz.geom_overlapping(parcels, key_name='parc_py_id',
output_table_name='parcels_overlapping')
Export overlapping geometries from a table into another table. The exported table contains the following columns: key_name_a, key_name_b: identifiers of the overlapping pair relation: DE-9IM representation of their spatial relation geom_a, geom_b: corresponding geometries overlap: 2D overlapping region (polygons) Parameters ---------- table : sqlalchemy.ext.declarative.DeclarativeMeta Table ORM class to query for overlapping geometries. key_name : str Name of column in the queried table containing a unique identifier, such as a primary key, to use for cross join and to identify geometries in the exported table. output_table_name : str Name of exported table. Table is created in the same schema as the queried table. Returns ------- None
from IPython.display import Image
Image('spatialtoolz/overlapping.png')
# Find unfilled geometries (interior rings).
print(spatialtoolz.geom_unfilled.__doc__)
spatialtoolz.geom_unfilled(parcels, output_table_name='parcels_unfilled')
Export rows containing interior rings into another table. Include the unfilled geometry in the exported table as a new column named "unfilled". Parameters ---------- table : sqlalchemy.ext.declarative.DeclarativeMeta Table ORM class to query for rows containing geometries with interior rings. output_table_name : str Name of exported table. Table is created in the same schema as the queried table. Returns ------- None
from IPython.display import Image
Image('spatialtoolz/unfilled.png')
# Tag parcels with block group FIPS code.
print(spatialtoolz.tag.__doc__)
spatialtoolz.tag(parcels, 'bg_id', bg, 'geoid')
Tag target table with attribute of a spatially-related source table. Parameters ---------- target_table : sqlalchemy.ext.declarative.DeclarativeMeta Target table ORM class to be tagged. target_column_name : str Name of column in target table to add (if doesn't exist) or update (if exists). This where the tag value will be stored. source_table : sqlalchemy.ext.declarative.DeclarativeMeta Source table ORM class containing information to tag target table. source_column_name : str Name of column in source table that contains the tagging information. how : str, optional How to relate the two tables spatially. If not specified, defaults to 'point_in_poly'. Other spatial relationships are not currently supported. df : pandas.DataFrame, optional DataFrame to return a tagged copy of. Returns ------- None However, if df argument is provided, pandas.DataFrame with the new or updated column is returned.
from IPython.display import Image
Image('spatialtoolz/tag.png')
# Calculate proportion of parcel overlapped by water.
print(spatialtoolz.proportion_overlap.__doc__)
spatialtoolz.proportion_overlap(parcels, water, 'proportion_water')
Calculate proportion of target table geometry overlap. Calculate proportion of geometry area in each row of target table that is overlapped by another table's geometry. Populate specified column in target table with proportion overlap value. Parameters ---------- target_table : sqlalchemy.ext.declarative.DeclarativeMeta Target table ORM class containing geometry to overlap. over_table : sqlalchemy.ext.declarative.DeclarativeMeta Table ORM class containing overlapping geometry. column_name : str Name of column in target table to add (if doesn't exist) or update (if exists). This is where the proportion overlap value will be stored. df : pandas.DataFrame, optional DataFrame to return a copy of with proportion overlap calculation. Returns ------- None However, if df argument is provided, pandas.DataFrame with the new or updated column is returned.
from IPython.display import Image
Image('spatialtoolz/overlap.png')
# Trim away parcel regions that are overlapped by water.
print(spatialtoolz.trim.__doc__)
spatialtoolz.trim(parcels.geom, water.geom)
Trim target geometry by removing intersection with a trim column. Parameters ---------- target_col : sqlalchemy.orm.attributes.InstrumentedAttribute Column ORM object to trim. trim_col : sqlalchemy.orm.attributes.InstrumentedAttribute Column ORM object to trim target column with. Returns ------- None
from IPython.display import Image
Image('spatialtoolz/trim.png')
import numpy as np
import pandas as pd
import urbansim.sim.simulation as sim
from spandex import TableLoader, TableFrame
from spandex.io import df_to_db, db_to_df
from spandex.sim import column
# Connect to database specified in configuration file.
loader = TableLoader(config_filename='spandex.cfg')
# Asssign shorter variable names to our tables for convenience.
bg = loader.tables.public.bg
parcels = loader.tables.public.parcels
water = loader.tables.public.water
# Register read-only input parcel table with UrbanSim simulation framework.
parcels_in = TableFrame(parcels, index_name='gid')
sim.add_table('parcels_in', parcels_in, copy=False)
# Register intermediate parcel table.
@sim.table()
def parcels_in2(parcels_in):
return pd.DataFrame(index=parcels_in.index)
# Register output parcel table.
@sim.table()
def parcels_out(parcels_in):
return pd.DataFrame(index=parcels_in.parc_py_id)
# Specify default output table.
def out(*args, **kwargs):
return column(table_name='parcels_out', groupby=parcels_in.parc_py_id,
*args, **kwargs)
Now we can register new, computed columns based on our input data.
Please note that these calculations are simplified for clarity, but can be as complex as needed in practice.
@out(astype=float, agg='sum')
def building_sqft(res_sqft='parcels_in.tla',
nonres_sqft='parcels_in.bldg_sqft',
res='parcels_in2.is_residential'):
return pd.concat([res_sqft[res], nonres_sqft[~res]])
@out(astype=float, agg='sum')
def improvement_value(value='parcels_in.imp_val'):
return value
@out(astype=float, agg='sum')
def land_value(value='parcels_in.land_value'):
return value
@out(fillna=-1, astype=int, agg='median')
def year_built(res_year='parcels_in.yr_hs_blt',
nonres_year='parcels_in.yr_built',
res='parcels_in2.is_residential'):
year = pd.concat([res_year[res], nonres_year[~res]])
return year
@column('parcels_in2', fillna=-1, astype=int)
def land_use_type_id(use_code='parcels_in.use_code'):
return use_code
@out(fillna=-1, astype=int, agg='median')
def land_use_type_id(use_code='parcels_in.use_code'):
return use_code
@out(astype=str, agg='first')
def apn(apn='parcels_in.apn'):
return apn
@out()
def county_id():
return 13
@out(astype=int)
def non_residential_sqft(sqft='parcels_out.building_sqft',
nonres='parcels_out.is_nonresidential'):
return sqft[nonres].fillna(0)
@column('parcels_in2', astype=bool)
def is_residential(use_code='parcels_in2.land_use_type_id'):
residential_codes = range(10, 30)
return use_code.isin(residential_codes)
@out(astype=bool)
def is_residential(use_code='parcels_out.land_use_type_id'):
residential_codes = range(10, 30)
return use_code.isin(residential_codes)
@out(astype=bool)
def is_nonresidential(use_code='parcels_out.land_use_type_id'):
nonresidential_codes = range(30, 100)
return use_code.isin(nonresidential_codes)
We register a simulation model that exports the output table of computed columns back to the database.
@sim.model()
def export(parcels_out):
df_to_db(parcels_out.to_frame(), 'parcels_out')
# Run the export back to the database.
sim.run(['export'])
Running model 'export' Time to execute model 'export': 5.61s Total time to execute: 5.61s
db_to_df(loader.tables.public.parcels_out, index_name='parc_py_id').head()
non_residential_sqft | year_built | is_residential | land_use_type_id | land_value | improvement_value | building_sqft | county_id | apn | is_nonresidential | id | |
---|---|---|---|---|---|---|---|---|---|---|---|
parc_py_id | |||||||||||
135163 | NaN | 1979 | True | 11 | 386290 | 443710 | 2965 | 13 | 148160029 | False | 1 |
136766 | NaN | 1927 | True | 12 | 94218 | 58618 | 3164 | 13 | 148350020 | False | 2 |
138154 | 103366 | -1 | False | 54 | 758486 | 14647195 | 103366 | 13 | 143040077 | True | 3 |
132108 | NaN | 1979 | True | 29 | 114842 | 82896 | 1416 | 13 | 148400035 | False | 4 |
132165 | NaN | 1973 | True | 29 | 195986 | 251514 | 1568 | 13 | 144170032 | False | 5 |
from IPython.display import SVG
SVG('diagrams/attributes.svg')
from IPython.display import Image
Image('usui/usui.png')
We'd like to get your feedback on spandex. For questions, comments, and suggestions, contact Dara Adib (dadib AT synthicity).