Loading data with `spandex`¶

Spatial Analysis and Data Extraction¶

In [1]:

from IPython.display import Image
Image('diagrams/banner.png')

Out[1]:

In [2]:

from IPython.display import Image
Image('diagrams/banner2.png')

Out[2]:

spandex can be thought of as a lightweight, flexible spatial ETL (Extract, Transform, Load) for Python scripts.

Example Data¶

We will work with a selection of public data corresponding to the Heather Farms neighborhood in Walnut Creek, Contra Costa County, CA. The data is provided as shapefiles.

Filename	Description
`bg.shp`	Block Groups	Census
`parcels.shp`	Parcels	Tax Assesor
`water.shp`	Bodies of Water	MTC

In [3]:

%%bash
tree data

data
├── bg.dbf
├── bg.prj
├── bg.shp
├── bg.shx
├── parcels.dbf
├── parcels.prj
├── parcels.shp
├── parcels.shx
├── water.dbf
├── water.prj
├── water.shp
└── water.shx

0 directories, 12 files

For demonstration purposes, we will increase the log level.

In [4]:

import logging
logging.basicConfig()
logger = logging.getLogger('spandex')
logger.setLevel(logging.INFO)

In a configuration file, we will configure spandex to use the local PostgreSQL database and the data directory.

In [5]:

%%bash
cat spandex.cfg

[database]
# Database configuration is passed to psycopg2.connect, which also supports
# libpq connection parameters:
# http://initd.org/psycopg/docs/module.html#psycopg2.connect
database = demo
user = demo
password = demo
host = localhost

[data]
# Directory containing shapefiles to import. Can be an absolute path or
# relative to the current working directory.
directory = data

# Spatial Reference System Identifier (SRID) of the imported shapefiles:
# http://spatialreference.org/
# NAD83(HARN) / California zone 3.
srid = 2768

Now let's load the shapefiles into the database with spandex.

In [6]:

from spandex import TableLoader


# Connect to database specified in configuration file.
loader = TableLoader(config_filename='spandex.cfg')

# Import the shapefiles.
# Automatically determines the projection and attribute encoding.
loader.load_shp_map({'bg':      'bg.shp',
                     'parcels': 'parcels.shp',
                     'water':   'water.shp'})

# Asssign shorter variable names to our tables for convenience.
bg = loader.tables.public.bg
parcels = loader.tables.public.parcels
water = loader.tables.public.water

INFO:spandex.io:Loading table water (SRID: 26910) from file water.shp (encoding: LATIN1).
INFO:spandex.io:Loading table bg (SRID: 26910) from file bg.shp (encoding: LATIN1).
INFO:spandex.io:Loading table parcels (SRID: 2227) from file parcels.shp (encoding: LATIN1).

Let's inspect the tables in Pandas.

In [7]:

from spandex.io import db_to_df
print(db_to_df.__doc__)

    Return DataFrame from Query, table, or ORM objects, like columns.

    Parameters
    ----------
    query : sqlalchemy.orm.Query, sqlalchemy.ext.declarative.DeclarativeMeta,
            or iterable
        Query ORM object, table ORM class, or list of ORM objects to query,
        like columns.
    index_name : str, optional
        Name of column to use as DataFrame index. If provided, column
        must be contained in query.

    Returns
    -------
    df : pandas.DataFrame

In [8]:

# Census Block Groups
db_to_df(bg).head()

Out[8]:

	gid	objectid	statefp	countyfp	tractce	blkgrpce	geoid	namelsad	mtfcc	funcstat	...	intptlat	intptlon	shape_leng	gisjoin	shape_le_1	shape_area	hu	acres	huptac	geom
0	1	1747	06	013	325000	1	060133250001	Block Group 1	G5030	S	...	+37.9373464	-122.0657006	4403.996420	G06001303250001	4394.178049	844524.907804	577	208.687	2.764910	3031303630303030323031453639303030303031303030...
1	2	1748	06	013	340001	3	060133400013	Block Group 3	G5030	S	...	+37.9224487	-122.0712187	4575.875026	G06001303400013	4582.116715	880433.875758	527	217.560	2.422320	3031303630303030323031453639303030303031303030...
2	3	1749	06	013	340001	4	060133400014	Block Group 4	G5030	S	...	+37.9160703	-122.0700798	4275.774639	G06001303400014	4275.691239	714331.577304	642	176.515	3.637080	3031303630303030323031453639303030303031303030...
3	4	1789	06	013	346101	2	060133461012	Block Group 2	G5030	S	...	+37.8836058	-122.0012773	14784.078161	G06001303461012	14819.308820	6939940.134170	411	1714.900	0.239665	3031303630303030323031453639303030303031303030...
4	5	1800	06	013	325000	3	060133250003	Block Group 3	G5030	S	...	+37.9300102	-122.0799965	4654.367323	G06001303250003	4664.409460	955305.947107	695	236.061	2.944150	3031303630303030323031453639303030303031303030...

5 rows × 22 columns

In [9]:

# Parcels
parcels_df = db_to_df(parcels)
parcels_df.head()

Out[9]:

	gid	apn	parc_py_id	use_code	land_value	imp_val	yr_hs_blt	tla	yr_built	bldg_sqft	geom
0	1	148160029	135163	11	386290	443710	1979	2965	None	0	3031303630303030323042333038303030303031303030...
1	2	148350020	136766	12	94218	58618	1927	3164	None	0	3031303630303030323042333038303030303031303030...
2	3	143040077	138154	54	758486	14647195	None	0	None	103366	3031303630303030323042333038303030303031303030...
3	4	148400035	132108	29	57421	41448	1979	708	None	0	3031303630303030323042333038303030303031303030...
4	5	144170032	132165	29	195986	251514	1973	1568	None	0	3031303630303030323042333038303030303031303030...

In [10]:

# Multiple records have the same parcel ID.
parcels_df[parcels_df.parc_py_id.duplicated()].head()

Out[10]:

	gid	apn	parc_py_id	use_code	land_value	imp_val	yr_hs_blt	tla	yr_built	geom
8	9	148400036	132108	29	57421	41448	1979	708	None	3031303630303030323042333038303030303031303030...
40	41	148400024	131862	29	57421	41448	1979	708	None	3031303630303030323042333038303030303031303030...
42	43	148400022	131847	29	180442	79558	1979	708	None	3031303630303030323042333038303030303031303030...
50	51	148400040	107759	87	0	0	0000	0	None	3031303630303030323042333038303030303031303030...
58	59	148400018	131677	29	57421	41448	1979	708	None	3031303630303030323042333038303030303031303030...

In [11]:

# Bodies of Water
db_to_df(water.geom).head()

Out[11]:

	geom
0	3031303630303030323031453639303030303146303030...

Spatial Demo¶

In [12]:

from spandex import spatialtoolz

In [13]:

# Reproject to a consistent projection.
print(spatialtoolz.conform_srids.__doc__)
print("Reprojecting to project SRID: {}".format(loader.srid))
spatialtoolz.conform_srids(loader.srid)

    Reproject all non-conforming geometry columns into the specified SRID.

    Parameters
    ----------
    srid : int
        Spatial Reference System Identifier (SRID).
    schema : schema class
        If schema is specified, only SRIDs within the specified schema
        are conformed.

    Returns
    -------
    None

    
Reprojecting to project SRID: 2768

In [14]:

# Find overlapping geometries.
print(spatialtoolz.geom_overlapping.__doc__)
spatialtoolz.geom_overlapping(parcels, key_name='parc_py_id',
                              output_table_name='parcels_overlapping')

    Export overlapping geometries from a table into another table.

    The exported table contains the following columns:
        key_name_a, key_name_b: identifiers of the overlapping pair
        relation: DE-9IM representation of their spatial relation
        geom_a, geom_b: corresponding geometries
        overlap: 2D overlapping region (polygons)

    Parameters
    ----------
    table : sqlalchemy.ext.declarative.DeclarativeMeta
        Table ORM class to query for overlapping geometries.
    key_name : str
        Name of column in the queried table containing a unique identifier,
        such as a primary key, to use for cross join and to identify
        geometries in the exported table.
    output_table_name : str
        Name of exported table. Table is created in the same schema as
        the queried table.

    Returns
    -------
    None

In [15]:

from IPython.display import Image
Image('spatialtoolz/overlapping.png')

Out[15]:

In [16]:

# Find unfilled geometries (interior rings).
print(spatialtoolz.geom_unfilled.__doc__)
spatialtoolz.geom_unfilled(parcels, output_table_name='parcels_unfilled')

    Export rows containing interior rings into another table.

    Include the unfilled geometry in the exported table as a new column
    named "unfilled".

    Parameters
    ----------
     table : sqlalchemy.ext.declarative.DeclarativeMeta
        Table ORM class to query for rows containing geometries with
        interior rings.
    output_table_name : str
        Name of exported table. Table is created in the same schema as
        the queried table.

    Returns
    -------
    None

In [17]:

from IPython.display import Image
Image('spatialtoolz/unfilled.png')

Out[17]:

In [18]:

# Tag parcels with block group FIPS code.
print(spatialtoolz.tag.__doc__)
spatialtoolz.tag(parcels, 'bg_id', bg, 'geoid')

    Tag target table with attribute of a spatially-related source table.

    Parameters
    ----------
    target_table : sqlalchemy.ext.declarative.DeclarativeMeta
        Target table ORM class to be tagged.
    target_column_name : str
        Name of column in target table to add (if doesn't exist)
        or update (if exists). This where the tag value will be stored.
    source_table : sqlalchemy.ext.declarative.DeclarativeMeta
        Source table ORM class containing information to tag target table.
    source_column_name : str
        Name of column in source table that contains the tagging information.
    how : str, optional
        How to relate the two tables spatially.
        If not specified, defaults to 'point_in_poly'.
        Other spatial relationships are not currently supported.
    df : pandas.DataFrame, optional
        DataFrame to return a tagged copy of.

    Returns
    -------
    None
        However, if df argument is provided, pandas.DataFrame with the
        new or updated column is returned.

In [19]:

from IPython.display import Image
Image('spatialtoolz/tag.png')

Out[19]:

In [20]:

# Calculate proportion of parcel overlapped by water.
print(spatialtoolz.proportion_overlap.__doc__)
spatialtoolz.proportion_overlap(parcels, water, 'proportion_water')

    Calculate proportion of target table geometry overlap.

    Calculate proportion of geometry area in each row of target table that
    is overlapped by another table's geometry. Populate specified column in
    target table with proportion overlap value.

    Parameters
    ----------
    target_table : sqlalchemy.ext.declarative.DeclarativeMeta
        Target table ORM class containing geometry to overlap.
    over_table : sqlalchemy.ext.declarative.DeclarativeMeta
        Table ORM class containing overlapping geometry.
    column_name : str
        Name of column in target table to add (if doesn't exist) or
        update (if exists). This is where the proportion overlap value
        will be stored.
    df : pandas.DataFrame, optional
        DataFrame to return a copy of with proportion overlap calculation.

    Returns
    -------
    None
        However, if df argument is provided, pandas.DataFrame with the
        new or updated column is returned.

In [21]:

from IPython.display import Image
Image('spatialtoolz/overlap.png')

Out[21]:

In [22]:

# Trim away parcel regions that are overlapped by water. 
print(spatialtoolz.trim.__doc__)
spatialtoolz.trim(parcels.geom, water.geom)

    Trim target geometry by removing intersection with a trim column.

    Parameters
    ----------
    target_col : sqlalchemy.orm.attributes.InstrumentedAttribute
        Column ORM object to trim.
    trim_col : sqlalchemy.orm.attributes.InstrumentedAttribute
        Column ORM object to trim target column with.

    Returns
    -------
    None

In [23]:

from IPython.display import Image
Image('spatialtoolz/trim.png')

Out[23]:

Attribute Demo¶

In [24]:

import numpy as np
import pandas as pd
import urbansim.sim.simulation as sim

from spandex import TableLoader, TableFrame
from spandex.io import df_to_db, db_to_df
from spandex.sim import column


# Connect to database specified in configuration file.
loader = TableLoader(config_filename='spandex.cfg')

# Asssign shorter variable names to our tables for convenience.
bg = loader.tables.public.bg
parcels = loader.tables.public.parcels
water = loader.tables.public.water

In [25]:

# Register read-only input parcel table with UrbanSim simulation framework.
parcels_in = TableFrame(parcels, index_name='gid')
sim.add_table('parcels_in', parcels_in, copy=False)

# Register intermediate parcel table.
@sim.table()
def parcels_in2(parcels_in):
    return pd.DataFrame(index=parcels_in.index)

# Register output parcel table.
@sim.table()
def parcels_out(parcels_in):
    return pd.DataFrame(index=parcels_in.parc_py_id)

# Specify default output table.
def out(*args, **kwargs):
    return column(table_name='parcels_out', groupby=parcels_in.parc_py_id,
                  *args, **kwargs)

Now we can register new, computed columns based on our input data.

Please note that these calculations are simplified for clarity, but can be as complex as needed in practice.

In [26]:

@out(astype=float, agg='sum')
def building_sqft(res_sqft='parcels_in.tla',
                  nonres_sqft='parcels_in.bldg_sqft',
                  res='parcels_in2.is_residential'):
    return pd.concat([res_sqft[res], nonres_sqft[~res]])


@out(astype=float, agg='sum')
def improvement_value(value='parcels_in.imp_val'):
    return value


@out(astype=float, agg='sum')
def land_value(value='parcels_in.land_value'):
    return value


@out(fillna=-1, astype=int, agg='median')
def year_built(res_year='parcels_in.yr_hs_blt',
               nonres_year='parcels_in.yr_built',
               res='parcels_in2.is_residential'):
    year = pd.concat([res_year[res], nonres_year[~res]])
    return year


@column('parcels_in2', fillna=-1, astype=int)
def land_use_type_id(use_code='parcels_in.use_code'):
    return use_code


@out(fillna=-1, astype=int, agg='median')
def land_use_type_id(use_code='parcels_in.use_code'):
    return use_code


@out(astype=str, agg='first')
def apn(apn='parcels_in.apn'):
    return apn


@out()
def county_id():
    return 13
    

@out(astype=int)
def non_residential_sqft(sqft='parcels_out.building_sqft',
                         nonres='parcels_out.is_nonresidential'):
    return sqft[nonres].fillna(0)


@column('parcels_in2', astype=bool)
def is_residential(use_code='parcels_in2.land_use_type_id'):
    residential_codes = range(10, 30)
    return use_code.isin(residential_codes)


@out(astype=bool)
def is_residential(use_code='parcels_out.land_use_type_id'):
    residential_codes = range(10, 30)
    return use_code.isin(residential_codes)


@out(astype=bool)
def is_nonresidential(use_code='parcels_out.land_use_type_id'):
    nonresidential_codes = range(30, 100)
    return use_code.isin(nonresidential_codes)

We register a simulation model that exports the output table of computed columns back to the database.

In [27]:

@sim.model()
def export(parcels_out):
    df_to_db(parcels_out.to_frame(), 'parcels_out')

# Run the export back to the database.
sim.run(['export'])

Running model 'export'
Time to execute model 'export': 5.61s
Total time to execute: 5.61s

In [28]:

db_to_df(loader.tables.public.parcels_out, index_name='parc_py_id').head()

Out[28]:

	non_residential_sqft	year_built	is_residential	land_use_type_id	land_value	improvement_value	building_sqft	county_id	apn	is_nonresidential	id
parc_py_id
135163	NaN	1979	True	11	386290	443710	2965	13	148160029	False	1
136766	NaN	1927	True	12	94218	58618	3164	13	148350020	False	2
138154	103366	-1	False	54	758486	14647195	103366	13	143040077	True	3
132108	NaN	1979	True	29	114842	82896	1416	13	148400035	False	4
132165	NaN	1973	True	29	195986	251514	1568	13	144170032	False	5

In [29]:

from IPython.display import SVG
SVG('diagrams/attributes.svg')

Out[29]:

In [30]:

from IPython.display import Image
Image('usui/usui.png')

Out[30]:

Contact¶

We'd like to get your feedback on spandex. For questions, comments, and suggestions, contact Dara Adib (dadib AT synthicity).

Loading data with spandex¶