Created: 2014-08-02 / Last Updated: 2014-10-14
Contents:
Package Dependencies:
numpy
: Numerical computing and array processing capabilities for Python. more infopandas
: Data structures and data manipulation tools for analysis and statistics. more infopandasql
: Perform SQL SELECT
statements on Python Pandas DataFrame objects. more info & sourceReferences:
The pandasql
package is a Python package that creates tables in a local SQLite database, runs the requested SQL query on those tables, and then drops the tables from the SQLite database, modeled after the sqldf
library in R. Although this can be slower than subsetting a data frame directly, in cases where the data frame is very complex and conditions get difficult to parse, SQL can provide a straightforward, declarative way of requesting a particular data subset.
The pandasql
package depends on the pandas
package, which provides data manipulation and DataFrame support, which in turn depends on the numpy
package. If you are starting with Python, I recommend the Anaconda distribution from Continuum Analytics, which has both numpy
and pandas
.
If the pandasql
package doesn't exist in your environment, install the package. Once you have it, load it.
$ pip install -U pandasql
import pandas
from pandasql import sqldf
pandasql
package: SQL on Pandas DataFrames in Python¶To use the sqldf
function (the main function in this package), we must provide two parameters:
locals()
or globals()
)To make this look more like R's sqldf
operates, and to remove the annoyance of having to specify the session/environment variables we can write a lambda proxy function.
mysqldf = lambda q: sqldf(q, globals())
Now that we have the function and know how to use it, we'll load some test data.
For this example, we'll use the Basic Business Licenses data from the DC government open data site.
The data is available as a zip file, so we'll download it if it's not already available, and then unzip it if it hasn't been already.
import os
import urllib2
import zipfile
url = "http://data.octo.dc.gov/feeds/bbl/bbl_current_csv.zip"
data_dir = "../data/"
current_dir = os.getcwd()
os.chdir(data_dir)
local_zipped = "bbl_current_csv.zip"
local_data = "bbl_current_csv.csv"
if not (os.path.isfile(local_data) and os.path.isfile(local_zipped)):
filename = os.path.join(os.getcwd(), local_zipped)
urllib2.urlretrieve(url, local_zipped)
if not os.path.isfile(local_data):
with ZipFile(local_zipped, 'r') as zipdata:
zipdata.extractall()
bbl = pandas.DataFrame.from_csv(local_data)
os.chdir(data_dir)
Now that we have the data frame bbl
, let's take a look at what's in it.
bbl.head()
LICENSESTATUS | LICENSECATEGORY | CUST_NUM | TRADE_NAME | LICENSE_START_DATE | LICENSE_EXPIRATION_DATE | LICENSE_ISSUE_DATE | AGENT_PHONE | LASTMODIFIEDDATE | CITY | ... | WARD | ANC | SMD | DISTRICT | PSA | NEIGHBORHOODCLUSTER | HOTSPOT2006NAME | HOTSPOT2005NAME | HOTSPOT2004NAME | BUSINESSIMPROVEMENTDISTRICT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BBL_LICENSE_FACT_ID | |||||||||||||||||||||
144376 | CANCELLED | Charitable Solicitation | 65990715 | WASHINGTON ARCHITECTURAL FOUNDATION | 9/1/2011 12:00:00 AM | 8/31/2013 12:00:00 AM | 7/15/2014 12:00:00 AM | 9999999999 | 7/17/2014 6:45:37 AM | WASHINGTON | ... | 2 | 2B | 2B07 | SECOND | 208 | 6 | NONE | NONE | NONE | NONE |
146806 | CANCELLED | General Business Licenses | 400312000769 | NaN | 3/1/2012 12:00:00 AM | 2/28/2014 12:00:00 AM | 7/15/2014 12:00:00 AM | 2028350680 | 7/17/2014 6:45:37 AM | WASHINGTON | ... | 2 | 2B | 2B05 | SECOND | 207 | 6 | NONE | NONE | NONE | GOLDEN TRIANGLE BID |
151842 | CANCELLED | Cigarette Wholesale | 410411000056 | WAREHOUSE ON THE WHEEL | 7/1/2011 12:00:00 AM | 6/30/2013 12:00:00 AM | 7/15/2014 12:00:00 AM | 2025488000 | 7/17/2014 6:45:37 AM | WASHINGTON | ... | 5 | 5B | 5B09 | FIFTH | 505 | 23 | NONE | NONE | NONE | NONE |
152391 | ACTIVE | Gen Contr-Construction Mngr | 410512000305 | REDLINE PROJECTS LLC | 4/1/2014 12:00:00 AM | 3/31/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2024210949 | 7/17/2014 6:45:37 AM | WASHINGTON | ... | 5 | 5C | 5C11 | FOURTH | 405 | 21 | NONE | NONE | NONE | NONE |
162782 | READY TO BATCH PRINT | One Family Rental | 500512001166 | THE DUMAN KIM 2008 IRREVOCABLE TRUST | 3/1/2014 12:00:00 AM | 2/28/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2024664747 | 7/17/2014 6:45:37 AM | WASHINGTON | ... | 2 | 2A | 2A02 | SECOND | 207 | 5 | NONE | NONE | NONE | NONE |
5 rows × 28 columns
Now, let's do a few queries on it just to get the hang of using sqldf
(and our version mysqldf
).
mysqldf("select * from bbl where BUSINESSIMPROVEMENTDISTRICT = 'DOWNTOWN BID'")
BBL_LICENSE_FACT_ID | LICENSESTATUS | LICENSECATEGORY | CUST_NUM | TRADE_NAME | LICENSE_START_DATE | LICENSE_EXPIRATION_DATE | LICENSE_ISSUE_DATE | AGENT_PHONE | LASTMODIFIEDDATE | ... | WARD | ANC | SMD | DISTRICT | PSA | NEIGHBORHOODCLUSTER | HOTSPOT2006NAME | HOTSPOT2005NAME | HOTSPOT2004NAME | BUSINESSIMPROVEMENTDISTRICT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 168986 | ACTIVE | Hotel | 5.000594e+07 | CCMH METRO CENTER LLC | 5/1/2014 12:00:00 AM | 4/30/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2027372200 | 7/17/2014 6:45:37 AM | ... | 2 | 2C | 2C03 | FIRST | 101 | 8 | NONE | NONE | NONE | DOWNTOWN BID |
1 | 240380 | READY TO BATCH PRINT | General Business Licenses | 7.010787e+07 | KIPLINGER WASHINGTON EDITORS, INC. (THE) | 9/1/2014 12:00:00 AM | 8/31/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2025723100 | 7/17/2014 6:45:37 AM | ... | 2 | 2F | 2F03 | THIRD | 307 | 8 | NONE | NONE | NONE | DOWNTOWN BID |
2 | 240701 | ACTIVE | Cigarette Retail | 5.300160e+07 | MAHIL CORPORATION | 8/1/2014 12:00:00 AM | 7/31/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 9999999999 | 7/17/2014 6:45:37 AM | ... | 2 | 2F | 2F03 | SECOND | 207 | 8 | NONE | NONE | NONE | DOWNTOWN BID |
3 | 249449 | ACTIVE | Charitable Solicitation | 4.002120e+11 | WASHINGTON AREA WOMEN'S FOUNDATION | 8/1/2014 12:00:00 AM | 7/31/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2023477737 | 7/17/2014 6:45:37 AM | ... | 2 | 2F | 2F03 | FIRST | 101 | 8 | NONE | NONE | NONE | DOWNTOWN BID |
4 | 249571 | READY TO BATCH PRINT | One Family Rental | 5.005120e+11 | None | 8/1/2014 12:00:00 AM | 7/31/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2022979315 | 7/17/2014 6:45:37 AM | ... | 2 | 2F | 2F06 | FIRST | 101 | 8 | NONE | NONE | NONE | DOWNTOWN BID |
5 | 272574 | ACTIVE | Charitable Solicitation | 4.002140e+11 | CITYDANCE ENSEMBLE INC | 7/1/2014 12:00:00 AM | 6/30/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2025722100 | 7/17/2014 6:45:38 AM | ... | 2 | 2B | 2B05 | SECOND | 207 | 6 | NONE | NONE | NONE | DOWNTOWN BID |
6 | 272576 | ACTIVE | Charitable Solicitation | 4.002140e+11 | WASHINGTON ARCHITECHTURAL FOUNDATION | 7/1/2014 12:00:00 AM | 6/30/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2026831566 | 7/17/2014 6:45:38 AM | ... | 2 | 6C | 6C09 | FIRST | 102 | 8 | NONE | NONE | NONE | DOWNTOWN BID |
7 | 272589 | ACTIVE | Parking Facility Attendant | 4.110140e+11 | COLONIAL PARKING, INC. | 7/1/2014 12:00:00 AM | 6/30/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2023931747 | 7/17/2014 6:45:38 AM | ... | 2 | 2F | 2F03 | SECOND | 207 | 8 | NONE | NONE | NONE | DOWNTOWN BID |
8 rows × 29 columns
mysqldf("select * from bbl where WARD = 6")
BBL_LICENSE_FACT_ID | LICENSESTATUS | LICENSECATEGORY | CUST_NUM | TRADE_NAME | LICENSE_START_DATE | LICENSE_EXPIRATION_DATE | LICENSE_ISSUE_DATE | AGENT_PHONE | LASTMODIFIEDDATE | ... | WARD | ANC | SMD | DISTRICT | PSA | NEIGHBORHOODCLUSTER | HOTSPOT2006NAME | HOTSPOT2005NAME | HOTSPOT2004NAME | BUSINESSIMPROVEMENTDISTRICT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 164337 | READY TO BATCH PRINT | One Family Rental | 6.600302e+07 | LAWRENCE PLUTKIN & RUTH HANSEN | 5/1/2014 12:00:00 AM | 4/30/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2025443900 | 7/17/2014 6:45:37 AM | ... | 6 | 6C | 6C06 | FIRST | 104 | 25 | NONE | NONE | NONE | NONE |
1 | 239799 | READY TO BATCH PRINT | Charitable Solicitation | 7.010666e+07 | CENTER FOR STATE & LOCAL GOVERNMENT EXCELLENCE... | 8/1/2014 12:00:00 AM | 7/31/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2029999999 | 7/17/2014 6:45:37 AM | ... | 6 | 6C | 6C05 | FIRST | 103 | 25 | NONE | NONE | NONE | NOMA BID |
2 | 242055 | READY TO BATCH PRINT | One Family Rental | 6.600593e+07 | CHRISTOPHER SIMPSON | 9/1/2014 12:00:00 AM | 8/31/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2025467000 | 7/17/2014 6:45:37 AM | ... | 6 | 6B | 6B08 | FIRST | 108 | 26 | NONE | NONE | NONE | NONE |
3 | 242787 | READY TO BATCH PRINT | One Family Rental | 6.800585e+07 | UNITED HOUSE OF PRAYER FOR ALL PEOPLE OF THE C... | 9/1/2014 12:00:00 AM | 8/31/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2028823956 | 7/17/2014 6:45:37 AM | ... | 6 | 2C | 2C03 | THIRD | 308 | 8 | NONE | NONE | NONE | NONE |
4 | 245630 | CANCELLED | Cigarette Retail | 4.103120e+11 | RASO CORPORATION | 5/1/2014 12:00:00 AM | 4/30/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2025752450 | 7/17/2014 6:45:37 AM | ... | 6 | 6A | 6A01 | FIRST | 104 | 25 | NONE | NONE | NONE | NONE |
5 | 245768 | ACTIVE | Restaurant | 9.313120e+11 | PACIFICO ON EIGHTH, LLC | 6/1/2014 12:00:00 AM | 5/31/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2025365650 | 7/17/2014 6:45:37 AM | ... | 6 | 6B | 6B04 | FIRST | 107 | 26 | NONE | NONE | NONE | CAPITOL HILL BID |
6 | 248720 | READY TO BATCH PRINT | One Family Rental | 5.005120e+11 | YARMOUTH MANAGEMENT | 8/1/2014 12:00:00 AM | 7/31/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2025473511 | 7/17/2014 6:45:37 AM | ... | 6 | 6A | 6A06 | FIRST | 108 | 25 | NONE | NONE | NONE | NONE |
7 | 249244 | READY TO BATCH PRINT | One Family Rental | 5.005120e+11 | YARMOUTH MANAGEMENT | 8/1/2014 12:00:00 AM | 7/31/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2026473511 | 7/17/2014 6:45:37 AM | ... | 6 | 6A | 6A05 | FIRST | 107 | 25 | NONE | NONE | NONE | NONE |
8 | 249666 | READY TO BATCH PRINT | One Family Rental | 5.005120e+11 | YARMOUTH MANAGEMENT | 8/1/2014 12:00:00 AM | 7/31/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2025473511 | 7/17/2014 6:45:37 AM | ... | 6 | 6A | 6A03 | FIRST | 107 | 25 | NONE | NONE | NONE | NONE |
9 | 249875 | READY TO BATCH PRINT | One Family Rental | 5.005120e+11 | None | 8/1/2014 12:00:00 AM | 7/31/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2025473511 | 7/17/2014 6:45:37 AM | ... | 6 | 6B | 6B10 | FIRST | 108 | 26 | NONE | NONE | NONE | NONE |
10 | 272577 | ACTIVE | Charitable Solicitation | 4.002140e+11 | SHAW COMMUNITY MINISTRY | 7/1/2014 12:00:00 AM | 6/30/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | None | 7/17/2014 6:45:38 AM | ... | 6 | 2C | 2C01 | THIRD | 307 | 7 | NONE | NONE | NONE | NONE |
11 | 272580 | ACTIVE | General Business Licenses | 4.003149e+11 | DC Tree LLC; Matthew T. Jones | 7/1/2014 12:00:00 AM | 6/30/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 202-294-1570 | 7/17/2014 6:45:38 AM | ... | 6 | 2C | 2C02 | THIRD | 308 | 7 | NONE | NONE | NONE | NONE |
12 | 272584 | ACTIVE | General Business Licenses | 4.003149e+11 | Health Strategies and Associates; Dianne Harris | 7/1/2014 12:00:00 AM | 6/30/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 202-578-4031 | 7/17/2014 6:45:38 AM | ... | 6 | 6D | 6D07 | FIRST | 106 | 27 | NONE | NONE | NONE | CAPITOL RIVERFRONT BID |
13 | 272593 | ACTIVE | One Family Rental | 5.005140e+11 | None | 7/1/2014 12:00:00 AM | 6/30/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 2022907971 | 7/17/2014 6:45:38 AM | ... | 6 | 6A | 6A04 | FIRST | 108 | 26 | NONE | NONE | NONE | NONE |
14 | 272600 | ACTIVE | One Family Rental | 5.005149e+11 | MEI MEI PENG | 7/1/2014 12:00:00 AM | 6/30/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 202-546-1771 | 7/17/2014 6:45:38 AM | ... | 6 | 6B | 6B03 | FIRST | 107 | 26 | NONE | NONE | NONE | NONE |
15 | 272601 | ACTIVE | One Family Rental | 5.005149e+11 | Mark Muenchrath; Mark Muenchrath | 7/1/2014 12:00:00 AM | 6/30/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 202-997-3575 | 7/17/2014 6:45:38 AM | ... | 6 | 6B | 6B01 | FIRST | 106 | 26 | NONE | NONE | NONE | NONE |
16 | 272602 | ACTIVE | One Family Rental | 5.005149e+11 | Holmes Enterprises; Paloma Holmes | 7/1/2014 12:00:00 AM | 6/30/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 202-462-7200 | 7/17/2014 6:45:38 AM | ... | 6 | 6B | 6B04 | FIRST | 107 | 26 | NONE | NONE | NONE | NONE |
17 | 272614 | ACTIVE | Hotel | 5.107140e+11 | MHF NOMA OPERATING IV LLC | 7/1/2014 12:00:00 AM | 6/30/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 8004831140 | 7/17/2014 6:45:38 AM | ... | 6 | 6C | 6C04 | FIFTH | 506 | 25 | NONE | NONE | NONE | NOMA BID |
18 | 272616 | ACTIVE | Swimming Pool | 9.211140e+11 | MHF NOMA OPERATING IV LLC | 7/1/2014 12:00:00 AM | 6/30/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 8004831140 | 7/17/2014 6:45:38 AM | ... | 6 | 6C | 6C04 | FIFTH | 506 | 25 | NONE | NONE | NONE | NOMA BID |
19 | 272618 | ACTIVE | Food Products | 9.306140e+11 | MHF NOMA OPERATING IV LLC | 7/1/2014 12:00:00 AM | 6/30/2016 12:00:00 AM | 7/15/2014 12:00:00 AM | 8004831140 | 7/17/2014 6:45:38 AM | ... | 6 | 6C | 6C04 | FIFTH | 506 | 25 | NONE | NONE | NONE | NOMA BID |
20 rows × 29 columns
The sqldf()
function returns a data frame, so we can store it in another variable for use in a plot, for example.
ward6 = mysqldf("select LICENSECATEGORY, count(*) as LICENSECOUNT from bbl where WARD = 6 group by LICENSECATEGORY")
ward6
LICENSECATEGORY | LICENSECOUNT | |
---|---|---|
0 | Charitable Solicitation | 2 |
1 | Cigarette Retail | 1 |
2 | Food Products | 1 |
3 | General Business Licenses | 2 |
4 | Hotel | 1 |
5 | One Family Rental | 11 |
6 | Restaurant | 1 |
7 | Swimming Pool | 1 |
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure()
ward6.plot(kind='barh')
<matplotlib.axes.AxesSubplot at 0x10adc6710>
<matplotlib.figure.Figure at 0x10a82ab90>
query = """
select BBL_LICENSE_FACT_ID as bbl_id,
LATITUDE as lat, LONGITUDE as lon,
WARD as ward, LICENSESTATUS as status,
BUSINESSIMPROVEMENTDISTRICT as bid
from bbl
where LICENSESTATUS != 'CANCELLED'
and WARD != 'NA'
and LATITUDE <> 0
AND LONGITUDE <> 0"""
results = mysqldf(query)
results.head()
bbl_id | lat | lon | ward | status | bid | |
---|---|---|---|---|---|---|
0 | 152391 | 38.928336 | -77.005566 | 5 | ACTIVE | NONE |
1 | 162782 | 38.904394 | -77.049809 | 2 | READY TO BATCH PRINT | NONE |
2 | 164337 | 38.895944 | -76.999045 | 6 | READY TO BATCH PRINT | NONE |
3 | 168986 | 38.899361 | -77.027786 | 2 | ACTIVE | DOWNTOWN BID |
4 | 174936 | 38.923781 | -77.051009 | 3 | ACTIVE | NONE |
from ggplot import *
ggplot(aes(x='lat', y='lon', color='bid'), data=results) + \
geom_point() + ggtitle("Business Improvement District Locations") + \
xlab("Latitude") + ylab("Longitude") + facet_grid("ward", "status", scales="free_xy")
<ggplot: (282735529)>