This script combines building geometries and energy consumption information from publicly available New York City data, and then exports data to GeoJSON and TopoJSON formats for interactive visualization with d3.
Most of the heavy lifting is done by Pandas. GeoPandas is used to help manage the geometry data. You will need both of these installed in order to run this code. If you wish to perform the GeoJSON to TopoJSON included at the very end of this script then you will also need to have TopoJSON installed (notes are included for this below).
import os, re
from pandas import *
from geopandas import *
pylab.rcParams['savefig.dpi'] = 100
The data for building footprints, heights and elevations was downloaded from NYC Open Data:
https://data.cityofnewyork.us/Housing-Development/Building-Footprints/tb92-6tj8
https://data.cityofnewyork.us/download/tb92-6tj8/application/zip
The zip file contains a metadata file with some explanation of the data. The data is contained in a Shapefile.
Note that the building_1214_metadata.htm file that makes part of the downloaded zip describes a different map projection specification (in its 'well-known-text' field) than what is found in the building_1214.prj file. The metadata looks to be correct and the .prj file incorrect - the geometry data in the shape file is in feet rather than in longitude-latitude. So the 'well-known-text' field from the metadata file must be copied into the .prj file before reading the shape file with GeoPandas (else all variety of headaches arise when trying to translate and/or export to GeoJSON, as I discovered the hard way).
The shape file is 216 MB, so be prepared to wait for the following command - it takes about 4 minutes on my mac air.
shpfile = 'building_footprints_shape_webmercator_12-14/building_1214.shp'
bldgs = GeoDataFrame.from_file(shpfile)
print 'number of rows:\t' + str(len(bldgs))
print 'columns:\t' + str(bldgs.columns)
print 'crs:\t\t' + str(bldgs.crs) # projection info for the geometry data
number of rows: 1082381 columns: Index([u'BBL', u'BIN', u'BUILDING_T', u'DOITT_ID', u'FEATURE_CO', u'GROUND_ELE', u'HEIGHT_ROO', u'LAST_MODIF', u'NAME', u'OBJECTID', u'SUB_FEATUR', u'Shape_Area', u'Shape_Leng', u'geometry'], dtype='object') crs: {'init': u'epsg:2263'}
There are 1,082,381 buildings in this dataset. Note that the geometry values are in feet, using the EPSG 2263 projection (as specified in the source metadata file).
For clarity, we revise the table to only include the columns we are interested in. Note that 'BBL' stands for 'Borough - Block - Lot number'. This column is shared with the energy data described below, and will be used as a key for merging the tables, so we make it an index here. For convenience we add a new column 'boro' (keeping with traditional NY spelling) which contains the borough as a single-character string: '1'=Manhattan, '2'=Bronx, etc.
def boronum(bbl):
return int(bbl[0])
boro = bldgs['BBL'].apply(boronum)
bldgs['boro'] = boro
bldgs = bldgs[['BBL','boro','GROUND_ELE','HEIGHT_ROO','Shape_Area','Shape_Leng','geometry']]
bldgs.set_index('BBL', inplace=True)
bldgs.sort(inplace=True)
bldgs.head()
boro | GROUND_ELE | HEIGHT_ROO | Shape_Area | Shape_Leng | geometry | |
---|---|---|---|---|---|---|
BBL | ||||||
1000010010 | 1 | 9 | 10.29 | 277.908100 | 68.170905 | POLYGON ((-8239895.129312261 4966979.423426831... |
1000010010 | 1 | 10 | 21.52 | 3198.675255 | 365.577623 | POLYGON ((-8240318.942692845 4966178.186055238... |
1000010010 | 1 | 25 | 29.67 | 3863.966910 | 256.501983 | POLYGON ((-8239093.847102767 4966770.272564691... |
1000010010 | 1 | 7 | 26.66 | 3138.298489 | 240.395131 | POLYGON ((-8239010.691476121 4966850.24966739,... |
1000010010 | 1 | 11 | 28.60 | 3509.856484 | 301.591446 | POLYGON ((-8239472.205369915 4966449.2249446, ... |
Although Cartesian geometry in feet is useful for many purposes, it is not great for browser-based mapping with standard js libraries which expect geometries in longitude-latitude. We thus add a new column with the geometry projected to EPSG 4326, which is the standard geo description used in browser-based mapping.
bldgs['geometry_latlon'] = bldgs['geometry'].to_crs(epsg=4326)
bldgs.head()
boro | GROUND_ELE | HEIGHT_ROO | Shape_Area | Shape_Leng | geometry | geometry_latlon | |
---|---|---|---|---|---|---|---|
BBL | |||||||
1000010010 | 1 | 9 | 10.29 | 277.908100 | 68.170905 | POLYGON ((-8239895.129312261 4966979.423426831... | POLYGON ((-112.6757750753299 48.09341020820187... |
1000010010 | 1 | 10 | 21.52 | 3198.675255 | 365.577623 | POLYGON ((-8240318.942692845 4966178.186055238... | POLYGON ((-112.6759406036311 48.09094854809784... |
1000010010 | 1 | 25 | 29.67 | 3863.966910 | 256.501983 | POLYGON ((-8239093.847102767 4966770.272564691... | POLYGON ((-112.6724726056846 48.09382693018913... |
1000010010 | 1 | 7 | 26.66 | 3138.298489 | 240.395131 | POLYGON ((-8239010.691476121 4966850.24966739,... | POLYGON ((-112.6723061851977 48.09412009655221... |
1000010010 | 1 | 11 | 28.60 | 3509.856484 | 301.591446 | POLYGON ((-8239472.205369915 4966449.2249446, ... | POLYGON ((-112.6733040928157 48.09259833372172... |
Note the problem here though. These first few buildings in the dataset are located on Governor's Island (just south of the island of Manhattan), and we know that the longitude and latitude of the center of the island is 40.6914° N, 74.0161° W (ref: google search), but our conversion notes them as approximately 48.1° N, 112.7° W. Further investigation of the graphs (below) suggests that the result is also rotated. Something is incorrect about this conversion - if I figure out what, I will post a revision to this document. But for the immediate purposes of data visualization we can use the conversion as it stands, since the resulting maps look to be generally correct, just translated and rotated (or possibly scaled improperly).
# notes for further investigation of the conversion issue:
# GeoPandas uses pyproj; the code below provides an initial suggestion that something might be amiss with the units handling
#
#import pyproj
#x1 = -8239895.129312261
#y1 = 4966979.423426831
#wgs84 = pyproj.Proj("+init=EPSG:4326") # LatLon with WGS84 datum used by GPS units and Google Earth
#epsg2263 = pyproj.Proj("+init=EPSG:2263",preserve_units=True) # local NYC coordinate system
#print pyproj.transform(epsg2263, wgs84, x1, y1)
#conv = 0.3048
#print pyproj.transform(epsg2263, wgs84, x1*conv, y1*conv)
#
# outputs:
# (-112.67577507532987, 48.093410208201874)
# (-87.23011424552195, 43.594327921961934)
A couple of quick graphs to ensure that the data format is roughly as expected.
bldgs.head(500).plot()
<matplotlib.axes._subplots.AxesSubplot at 0x201891f90>
bldgs[bldgs.boro == 1].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x202ba9250>
Energy consumption data is available for a subset of the buildings via NYC Open Data, and was downloaded in csv form:
This dataset contains annual energy and water consumption data for buildings in New York City that have greater than 50,000 square feet of floor area, as per the disclosure specifications in NYC Local Law 84.
The csv file presents one major challenge before it can be read by Pandas. The last data column in the csv occasionally has newline characters within it - in these cases, the data entry is held within double quotes. In this cleaning script, these intra-entry newline characters are replaced by single space (' ') characters. The script also removes some unnecessary white space from the csv. A new 'cleaned' csv file is produced.
filename = 'Energy_and_Water_Data_Disclosure_for_Local_Law_84__2012_.csv'
cleanedcsv = ''
multilinekeeper = ''
with open(filename,'rU') as f:
for line in f:
line = re.sub(' +',' ',line)
line = re.sub(', ',',',line)
if multilinekeeper == '':
if len(re.findall('"',line)) == 1:
multilinekeeper = line.strip()
else:
cleanedcsv += line
else:
if '"' in line:
cleanedcsv += multilinekeeper + line
multilinekeeper = ''
else:
multilinekeeper += ' ' + line.strip()
with open('cleaned_' + filename,'w') as f:
f.write(cleanedcsv)
The cleaned csv can now be read into a dataframe by Pandas.
energy = read_csv('cleaned_' + filename)
print 'number of rows:\t' + str(len(energy))
print 'columns:'
print energy.columns
number of rows: 14112 columns: Index([u'BBL', u'Street Number', u'Street Name', u'Borough', u'Zip', u'Benchmarking Submission', u'Entry Number', u'Site EUI(kBtu/ft2)', u'Weather Normalized Source EUI(kBtu/ft2)', u'Indoor Water Intensity (All Water Sources)(gal/ft2)', u'Reported Water Method', u'ENERGY STAR Score', u'Total GHG Emissions(MtCO2e)', u'Property Floor Area (Buildngs and Parking)(ft2)', u'Primary Property Type - Self Selected', u'Number of Buildings', u'Reported BINs'], dtype='object')
Note that the number of entries in this table (14,112) is much smaller than the number of entries in the buildings table (1,082,381), since it only contains entries for buildings with more than 50,000 ft2 of floor area.
The BBL column was read in as integers, but want it to be strings in order to to compare with the bldgs BBL key. For clarity, we also take only the subset of data columns we are interested in, and we set the 'BBL' column to be the index.
energy['BBL'] = energy['BBL'].astype(str)
energy = energy.ix[:,[0,1,2,7,8,11,12,13,14,15]]
energy.set_index('BBL', inplace=True)
energy.sort(inplace=True)
energy.head()
Street Number | Street Name | Site EUI(kBtu/ft2) | Weather Normalized Source EUI(kBtu/ft2) | ENERGY STAR Score | Total GHG Emissions(MtCO2e) | Property Floor Area (Buildngs and Parking)(ft2) | Primary Property Type - Self Selected | Number of Buildings | |
---|---|---|---|---|---|---|---|---|---|
BBL | |||||||||
1000010010 | 1 | GOVERNORS ISLAND | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
1000020002 | NaN | MARGINAL STREET | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
1000047501 | 1 | WATER STREET | 102.0 | 287.9 | 75 | 25932.68 | 2428325 | Office | 1 |
1000057501 | 125 | BROAD STREET | 119.6 | 261.5 | 70 | 11637.42 | 1338000 | Office | 1 |
1000087501 | 39 | WHITEHALL STREET | 59.0 | 163.2 | NaN | 848.55 | 169055 | Multifamily Housing | 1 |
energy.plot(x=2,y=5,kind='scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x28be14610>
An outer merge brings the two data tables together. For convenience, we add two new columns to denote which original tables had which entries (which will let us subset data when desired, akin to taking an inner merge but without duplication). Note that the merge results in a Pandas DataFrame object, rather than a GeoPandas GeoDataFrame object.
bldgsenergy = merge(bldgs, energy, left_index=True, right_index=True, how='outer')
print 'number of rows:\t' + str(len(bldgsenergy))
print 'columns:'
print bldgsenergy.columns
bldgsenergy.ix[:,['HEIGHT_ROO','geometry','geometry_latlon','Site EUI(kBtu/ft2)','Total GHG Emissions(MtCO2e)',
'Number of Buildings']].head()
number of rows: 1100100 columns: Index([u'boro', u'GROUND_ELE', u'HEIGHT_ROO', u'Shape_Area', u'Shape_Leng', u'geometry', u'geometry_latlon', u'Street Number', u'Street Name', u'Site EUI(kBtu/ft2)', u'Weather Normalized Source EUI(kBtu/ft2)', u'ENERGY STAR Score', u'Total GHG Emissions(MtCO2e)', u'Property Floor Area (Buildngs and Parking)(ft2)', u'Primary Property Type - Self Selected', u'Number of Buildings'], dtype='object')
HEIGHT_ROO | geometry | geometry_latlon | Site EUI(kBtu/ft2) | Total GHG Emissions(MtCO2e) | Number of Buildings | |
---|---|---|---|---|---|---|
BBL | ||||||
1000010010 | 10.29 | POLYGON ((-8239895.129312261 4966979.423426831... | POLYGON ((-112.6757750753299 48.09341020820187... | NaN | NaN | 1 |
1000010010 | 21.52 | POLYGON ((-8240318.942692845 4966178.186055238... | POLYGON ((-112.6759406036311 48.09094854809784... | NaN | NaN | 1 |
1000010010 | 29.67 | POLYGON ((-8239093.847102767 4966770.272564691... | POLYGON ((-112.6724726056846 48.09382693018913... | NaN | NaN | 1 |
1000010010 | 26.66 | POLYGON ((-8239010.691476121 4966850.24966739,... | POLYGON ((-112.6723061851977 48.09412009655221... | NaN | NaN | 1 |
1000010010 | 28.60 | POLYGON ((-8239472.205369915 4966449.2249446, ... | POLYGON ((-112.6733040928157 48.09259833372172... | NaN | NaN | 1 |
def inbldgs(i):
return i in bldgs.index
def inenergy(i):
return i in energy.index
bldgsenergy['geometry_avail'] = Series(bldgsenergy.index,index=bldgsenergy.index).apply(inbldgs)
bldgsenergy['energy_avail'] = Series(bldgsenergy.index,index=bldgsenergy.index).apply(inenergy)
bldgsenergy.ix[:,['geometry_avail','energy_avail','HEIGHT_ROO','geometry','geometry_latlon','Site EUI(kBtu/ft2)',
'Total GHG Emissions(MtCO2e)','Number of Buildings']].head()
geometry_avail | energy_avail | HEIGHT_ROO | geometry | geometry_latlon | Site EUI(kBtu/ft2) | Total GHG Emissions(MtCO2e) | Number of Buildings | |
---|---|---|---|---|---|---|---|---|
BBL | ||||||||
1000010010 | True | True | 10.29 | POLYGON ((-8239895.129312261 4966979.423426831... | POLYGON ((-112.6757750753299 48.09341020820187... | NaN | NaN | 1 |
1000010010 | True | True | 21.52 | POLYGON ((-8240318.942692845 4966178.186055238... | POLYGON ((-112.6759406036311 48.09094854809784... | NaN | NaN | 1 |
1000010010 | True | True | 29.67 | POLYGON ((-8239093.847102767 4966770.272564691... | POLYGON ((-112.6724726056846 48.09382693018913... | NaN | NaN | 1 |
1000010010 | True | True | 26.66 | POLYGON ((-8239010.691476121 4966850.24966739,... | POLYGON ((-112.6723061851977 48.09412009655221... | NaN | NaN | 1 |
1000010010 | True | True | 28.60 | POLYGON ((-8239472.205369915 4966449.2249446, ... | POLYGON ((-112.6733040928157 48.09259833372172... | NaN | NaN | 1 |
print 'number of rows in bldgs:\t\t\t\t\t' + str(len(bldgs))
print 'number of rows in energy:\t\t\t\t\t' + str(len(energy))
print 'number of rows in bldgsenergy:\t\t\t\t\t' + str(len(bldgsenergy))
print 'number of rows in bldgsenergy with geometry:\t\t\t' + str(len(bldgsenergy[bldgsenergy['geometry_avail']]))
print 'number of rows in bldgsenergy with energy:\t\t\t' + str(len(bldgsenergy[bldgsenergy['energy_avail']]))
print 'number of rows in bldgsenergy with both gometry and energy:\t' + \
str(len(bldgsenergy[bldgsenergy['energy_avail'] & bldgsenergy['geometry_avail']]))
number of rows in bldgs: 1082381 number of rows in energy: 14112 number of rows in bldgsenergy: 1100100 number of rows in bldgsenergy with geometry: 1099761 number of rows in bldgsenergy with energy: 45620 number of rows in bldgsenergy with both gometry and energy: 45281
We now have a tidy data table in the variable 'bldgsenergy'. A note for further investigation later, if this data is to be used for further analysis: there look to be a lot of buildings that share a BBL identifier, which will likely cause some hassles.
For the d3 visualization, we will use just a small section of the city, and will only consider those buildings for which we have both geometry data and energy data.
bldgsenergygeo = GeoDataFrame(bldgsenergy[bldgsenergy['geometry_avail'] & bldgsenergy['energy_avail']])
bldgsenergygeo.head(500).plot()
<matplotlib.axes._subplots.AxesSubplot at 0x28c0c5750>
For the GeoJSON export, we will use the latitude and longitude as the geometry.
bldgsenergygeo.rename(columns={'geometry':'geometryfeet'}, inplace=True)
bldgsenergygeo.rename(columns={'geometry_latlon':'geometry'}, inplace=True)
bldgsenergygeo.head(500).plot()
<matplotlib.axes._subplots.AxesSubplot at 0x27e55a510>
In the interactive graph, we want to have the address variables reformatted to a single column of strings.
bldgsenergygeo['Address'] = bldgsenergygeo['Street Number'] + ' ' + bldgsenergygeo['Street Name']
We want to keep the file size to a minimum in order to keep the graph interaction smooth, so we will only include the columns we will use.
bldgsenergygeo.columns
Index([u'boro', u'GROUND_ELE', u'HEIGHT_ROO', u'Shape_Area', u'Shape_Leng', u'geometryfeet', u'geometry', u'Street Number', u'Street Name', u'Site EUI(kBtu/ft2)', u'Weather Normalized Source EUI(kBtu/ft2)', u'ENERGY STAR Score', u'Total GHG Emissions(MtCO2e)', u'Property Floor Area (Buildngs and Parking)(ft2)', u'Primary Property Type - Self Selected', u'Number of Buildings', u'geometry_avail', u'energy_avail', u'Address'], dtype='object')
bldgsenergygeo = bldgsenergygeo.ix[:,[0,2,6,9,10,11,12,13,14,18]]
bldgsenergygeo.head()
boro | HEIGHT_ROO | geometry | Site EUI(kBtu/ft2) | Weather Normalized Source EUI(kBtu/ft2) | ENERGY STAR Score | Total GHG Emissions(MtCO2e) | Property Floor Area (Buildngs and Parking)(ft2) | Primary Property Type - Self Selected | Address | |
---|---|---|---|---|---|---|---|---|---|---|
BBL | ||||||||||
1000010010 | 1 | 10.29 | POLYGON ((-112.6757750753299 48.09341020820187... | NaN | NaN | NaN | NaN | NaN | NaN | 1 GOVERNORS ISLAND |
1000010010 | 1 | 21.52 | POLYGON ((-112.6759406036311 48.09094854809784... | NaN | NaN | NaN | NaN | NaN | NaN | 1 GOVERNORS ISLAND |
1000010010 | 1 | 29.67 | POLYGON ((-112.6724726056846 48.09382693018913... | NaN | NaN | NaN | NaN | NaN | NaN | 1 GOVERNORS ISLAND |
1000010010 | 1 | 26.66 | POLYGON ((-112.6723061851977 48.09412009655221... | NaN | NaN | NaN | NaN | NaN | NaN | 1 GOVERNORS ISLAND |
1000010010 | 1 | 28.60 | POLYGON ((-112.6733040928157 48.09259833372172... | NaN | NaN | NaN | NaN | NaN | NaN | 1 GOVERNORS ISLAND |
For the graph, we are only interested in a region roughly correlated with Midtown in Manhattan - we will select only those rows for which boro = 1 and they have points between 48.117 and 48.129 degrees latitude (noting that these are not the proper latitude numbers, given our earlier problem).
bldgsenergygeomidtown = bldgsenergygeo[(bldgsenergygeo.boro == 1) &
(bldgsenergygeo.geometry.bounds['miny'] < 48.129) &
(bldgsenergygeo.geometry.bounds['maxy'] > 48.117) ]
print len(bldgsenergygeomidtown)
bldgsenergygeomidtown.head()
2348
boro | HEIGHT_ROO | geometry | Site EUI(kBtu/ft2) | Weather Normalized Source EUI(kBtu/ft2) | ENERGY STAR Score | Total GHG Emissions(MtCO2e) | Property Floor Area (Buildngs and Parking)(ft2) | Primary Property Type - Self Selected | Address | |
---|---|---|---|---|---|---|---|---|---|---|
BBL | ||||||||||
1006650010 | 1 | 37.756688 | POLYGON ((-112.6865457039683 48.11973501029388... | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1006800001 | 1 | 12.110000 | POLYGON ((-112.6858515626195 48.1203140388577,... | NaN | NaN | NaN | NaN | NaN | NaN | 360 WEST STREET |
1006800001 | 1 | 32.826755 | POLYGON ((-112.6848804688391 48.12020064719905... | NaN | NaN | NaN | NaN | NaN | NaN | 360 WEST STREET |
1006800001 | 1 | 12.467149 | POLYGON ((-112.6853330443247 48.12031129386528... | NaN | NaN | NaN | NaN | NaN | NaN | 360 WEST STREET |
1006800001 | 1 | 11.625806 | POLYGON ((-112.6852809913021 48.12031449853709... | NaN | NaN | NaN | NaN | NaN | NaN | 360 WEST STREET |
with open('bldgsenergygeo.json','w') as f:
f.write(bldgsenergygeomidtown.to_json())
This writes a GeoJSON file for Manhattan to 'bldgsenergygeo.json' that can be used for browser-based mapping. For the d3.js interactive graph, the TopoJSON format is used because it is a smaller file format and thus allows for faster uploading and interaction. Converting between the GeoJSON file and the TopoJSON file is carried out using this js library, which requires node.js and is installed via 'sudo npm install topojson'. The conversion is carried out with the following command line call. The resulting file 'bldgsenergytopo.json' is 1.2 MB, compared with the 2.9 MB GeoJSON file.
os.system('topojson -p -o bldgsenergytopo.json bldgsenergygeo.json')
0
The html and js code for the graph is included in this github repo, and the graph can be viewed in this blog post.