PfDA, Chap 1 Preliminaries, especially the installation instructions for EPD Free for your computer platform. I want you to try installing EPD Free (or EPD Academic) before class on Thursday.
PfDA, Chap 3
PfDA, Appendix: Python Language Essentials -- to help remind yourself of key elements of standard Python
PfDA, Chap 2 Introductory Examples
On Tuesday, I asked you to discuss the population of California. If you do a Google search, you might end up at California QuickFacts from the US Census Bureau. Compare to the QuickFacts about Alameda County.
Today we download the data for the USA, states, and counties:
The entire State and County QuickFacts dataset, with U.S., state, and county data is available for download. Downloadable data files for cities may be issued later. The current downloadable data set may include items not displayed on QuickFacts tables.
Download the 3 files (DataSet.txt, DataDict.txt, FIPS_CountyName.txt) into a directory, perhaps the one where you launched IPython:
# YOU NEED TO FILL IN data_dir for your own directory path
data_dir = "/Users/raymondyee/D/Document/Working_with_Open_Data/day02/"
dataset_fname = data_dir + "DataSet.txt"
datadict_fname = data_dir + "DataDict.txt"
fips_fname = data_dir + "FIPS_CountyName.txt"
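If you would rather not depend on a trailing slash in data_dir, the same three paths can be built with os.path.join. This is a minimal sketch written for Python 3 (the cells in these notes are Python 2); the directory under your home folder is a hypothetical placeholder, not a required location.

```python
import os

# Hypothetical location: replace with the directory holding your downloads.
data_dir = os.path.join(os.path.expanduser("~"), "Working_with_Open_Data", "day02")

# os.path.join inserts the right separator for your platform.
dataset_fname = os.path.join(data_dir, "DataSet.txt")
datadict_fname = os.path.join(data_dir, "DataDict.txt")
fips_fname = os.path.join(data_dir, "FIPS_CountyName.txt")

# Report which of the three files are actually present.
for fname in (dataset_fname, datadict_fname, fips_fname):
    print(fname, os.path.exists(fname))
```

A False next to a file name usually means data_dir points at the wrong place or the download did not finish.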
5+5
10
# on Mac, Linux system
!head $fips_fname
00000 UNITED STATES
01000 ALABAMA
01001 Autauga County, AL
01003 Baldwin County, AL
01005 Barbour County, AL
01007 Bibb County, AL
01009 Blount County, AL
01011 Bullock County, AL
01013 Butler County, AL
01015 Calhoun County, AL
!grep -i California $fips_fname
06000 CALIFORNIA
!grep Alameda $fips_fname
06001 Alameda County, CA
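The !head and !grep commands are only available on Mac and Linux. As a rough stand-in that works anywhere Python runs, here is a sketch (Python 3) of two helper functions, head and grep, which are my own names and not part of any library. Note that this grep does plain substring matching, whereas the shell command matches regular expressions.

```python
from itertools import islice

def head(lines, n=10):
    """Return the first n lines, like the shell command `head`."""
    return list(islice(lines, n))

def grep(lines, pattern, ignore_case=False):
    """Return the lines containing pattern (substring match), like `grep` / `grep -i`."""
    if ignore_case:
        pattern = pattern.lower()
        return [ln for ln in lines if pattern in ln.lower()]
    return [ln for ln in lines if pattern in ln]

# A few rows copied from the output above stand in for the real file.
sample = [
    "00000 UNITED STATES",
    "01000 ALABAMA",
    "06000 CALIFORNIA",
    "06001 Alameda County, CA",
]

print(head(sample, 2))
print(grep(sample, "California", ignore_case=True))
print(grep(sample, "Alameda"))
```

Both helpers accept any iterable of lines, so an open file object can be passed in place of sample.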
# You might do something like this....
data_dir = "/Users/raymondyee/D/Document/Working_with_Open_Data/day02/"
fips_fname = data_dir + "FIPS_CountyName.txt"
# PfDA p. 430 for brief explanation of file open
from itertools import islice
f = open(fips_fname)
for (i, row) in enumerate(islice(f,5)):
    print i, row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
data_dir = "/Users/raymondyee/D/Document/Working_with_Open_Data/day02/"
fips_fname = data_dir + "FIPS_CountyName.txt"
from itertools import islice
# PfDA p. 430 for brief explanation of file open
f = open(fips_fname)
[row for row in islice(f,5)]
['00000 UNITED STATES\n', '01000 ALABAMA\n', '01001 Autauga County, AL\n', '01003 Baldwin County, AL\n', '01005 Barbour County, AL\n']
data_dir = "/Users/raymondyee/D/Document/Working_with_Open_Data/day02/"
fips_fname = data_dir + "FIPS_CountyName.txt"
f = open(fips_fname)
for (i, row) in enumerate(f):
    try:
        a = row.decode('ascii')
    except Exception as e:
        print i, row, e
1835 35013 Do�a Ana County, NM 'ascii' codec can't decode byte 0xb1 in position 8: ordinal not in range(128)
http://www.doughellmann.com/PyMOTW/codecs/#working-with-files
encodings: http://docs.python.org/2/library/codecs.html#standard-encodings
'ascii' vs 'utf-8' vs 'iso-8859-1'
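To make the difference between the three encodings concrete, here is a small sketch in Python 3. The byte string is an illustration I constructed (ñ stored as the single ISO-8859-1 byte 0xF1); it is not copied from the Census file.

```python
# Illustrative bytes: in ISO-8859-1 the letter n-with-tilde is the single byte 0xF1.
raw = b"Do\xf1a Ana County, NM"

results = {}
for encoding in ("ascii", "utf-8", "iso-8859-1"):
    try:
        results[encoding] = raw.decode(encoding)
    except UnicodeDecodeError as e:
        # ascii rejects any byte above 127; utf-8 rejects 0xF1 here because
        # it is not followed by valid continuation bytes.
        results[encoding] = None
        print(encoding, "failed:", e.reason)

print(results["iso-8859-1"])

# ISO-8859-1 assigns a character to every byte value, so decoding with it
# never raises. Success alone does not prove the file really is ISO-8859-1.
all_bytes_decoded = bytes(range(256)).decode("iso-8859-1")
```

Because ISO-8859-1 accepts every byte, it is worth eyeballing a few non-ASCII names after decoding to confirm they look right.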
import codecs
from itertools import islice
# YOU NEED TO FILL IN data_dir for your own directory path
data_dir = "/Users/raymondyee/D/Document/Working_with_Open_Data/day02/"
fips_fname = data_dir + "FIPS_CountyName.txt"
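# 'latin8' is Python's alias for ISO-8859-14
f = codecs.open(fips_fname, encoding='latin8')
# read every row; a byte that cannot be decoded would raise an error here
for (i, row) in enumerate(islice(f, None)):
    pass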
import codecs
from itertools import islice
# YOU NEED TO FILL IN data_dir for your own directory path
data_dir = "/Users/raymondyee/D/Document/Working_with_Open_Data/day02/"
fips_fname = data_dir + "FIPS_CountyName.txt"
f = codecs.open(fips_fname, encoding='iso-8859-1')
fips = dict()
for row in islice(f, 5):
    print row[:5], row[6:-1]
    fips[row[:5]] = row[6:-1]
print fips
00000 UNITED STATES
01000 ALABAMA
01001 Autauga County, AL
01003 Baldwin County, AL
01005 Barbour County, AL
{u'01003': u'Baldwin County, AL', u'01005': u'Barbour County, AL', u'00000': u'UNITED STATES', u'01000': u'ALABAMA', u'01001': u'Autauga County, AL'}
import re
import string
import codecs
from itertools import islice
data_dir = "/Users/raymondyee/D/Document/Working_with_Open_Data/day02/"
fips_fname = data_dir + "FIPS_CountyName.txt"
f = codecs.open(fips_fname, encoding='iso-8859-1')
fips = dict(re.match(r"([^ ]+)\s(.*)$", row).groups() for row in islice(f, None))
print list(islice(fips.keys(),5))
[u'16079', u'33017', u'16073', u'16071', u'16077']
len(fips)
3195
# check on CA and Alameda County
print fips['06000'], fips['06001']
CALIFORNIA Alameda County, CA
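It may help to see what the regular expression does to a single row. The sketch below (Python 3) applies the same pattern to one sample line; parse_row is a hypothetical helper name I added to show how a non-matching line could be handled.

```python
import re

# Same pattern as in the cell above: the code is everything up to the first
# space, the name is the rest of the line.
FIPS_ROW = re.compile(r"([^ ]+)\s(.*)$")

row = "06001 Alameda County, CA\n"   # rows read from a file keep their newline
code, name = FIPS_ROW.match(row).groups()
print(repr(code), repr(name))

def parse_row(row):
    """Return (code, name), or None when the row does not match (e.g. a blank line)."""
    m = FIPS_ROW.match(row)
    return m.groups() if m else None
```

The trailing newline is dropped because `.` does not match a newline and `$` matches just before a final newline, so no explicit strip is needed.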
Give me a list of FIPS codes that correspond to states, sorted by FIPS code.
# work out hierarchy w/ FIPS
# not CSV -- parse on first space
import codecs
from itertools import islice
import re
import string
# YOU NEED TO FILL IN data_dir for your own directory path
data_dir = "/Users/raymondyee/D/Document/Working_with_Open_Data/day02/"
fips_fname = data_dir + "FIPS_CountyName.txt"
f = codecs.open(fips_fname, encoding='iso-8859-1')
fips = dict()
fips = dict(re.match(r"([^ ]+)\s(.*)$", row).groups() for row in islice(f, None))
states_fips = sorted([k for k in fips.keys() if k[-3:] == '000' and k != '00000'])
print states_fips
[u'01000', u'02000', u'04000', u'05000', u'06000', u'08000', u'09000', u'10000', u'11000', u'12000', u'13000', u'15000', u'16000', u'17000', u'18000', u'19000', u'20000', u'21000', u'22000', u'23000', u'24000', u'25000', u'26000', u'27000', u'28000', u'29000', u'30000', u'31000', u'32000', u'33000', u'34000', u'35000', u'36000', u'37000', u'38000', u'39000', u'40000', u'41000', u'42000', u'44000', u'45000', u'46000', u'47000', u'48000', u'49000', u'50000', u'51000', u'53000', u'54000', u'55000', u'56000']
# work out hierarchy w/ FIPS
# not CSV -- parse on first space
import codecs
from itertools import islice
import re
import string
# YOU NEED TO FILL IN data_dir for your own directory path
data_dir = "/Users/raymondyee/D/Document/Working_with_Open_Data/day02/"
fips_fname = data_dir + "FIPS_CountyName.txt"
f = codecs.open(fips_fname, encoding='iso-8859-1')
fips = dict()
for row in islice(f, None):
    row = string.strip(row)
    row = row.split(" ",1)
    fips.update({row[0]:row[1]})
states_fips = sorted([k for k in fips.keys() if k[-3:] == '000' and k != '00000'])
print states_fips
[u'01000', u'02000', u'04000', u'05000', u'06000', u'08000', u'09000', u'10000', u'11000', u'12000', u'13000', u'15000', u'16000', u'17000', u'18000', u'19000', u'20000', u'21000', u'22000', u'23000', u'24000', u'25000', u'26000', u'27000', u'28000', u'29000', u'30000', u'31000', u'32000', u'33000', u'34000', u'35000', u'36000', u'37000', u'38000', u'39000', u'40000', u'41000', u'42000', u'44000', u'45000', u'46000', u'47000', u'48000', u'49000', u'50000', u'51000', u'53000', u'54000', u'55000', u'56000']
# to check
states_fips == [u'01000', u'02000', u'04000', u'05000', u'06000', u'08000',
u'09000', u'10000', u'11000', u'12000', u'13000', u'15000', u'16000', u'17000',
u'18000', u'19000', u'20000', u'21000', u'22000', u'23000', u'24000', u'25000',
u'26000', u'27000', u'28000', u'29000', u'30000', u'31000', u'32000', u'33000',
u'34000', u'35000', u'36000', u'37000', u'38000', u'39000', u'40000', u'41000',
u'42000', u'44000', u'45000', u'46000', u'47000', u'48000', u'49000', u'50000',
u'51000', u'53000', u'54000', u'55000', u'56000']
True
len(fips)
3195
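The hierarchy used in the cells above can be spelled out directly: '00000' is the whole country, a code ending in '000' is a state (DC is treated as a state here), and anything else is a county whose first two digits give its state. Here is a sketch in Python 3; fips_level and parent_fips are names I made up for illustration.

```python
def fips_level(code):
    """Classify a 5-digit code from FIPS_CountyName.txt as nation, state, or county."""
    if code == "00000":
        return "nation"
    if code[-3:] == "000":
        return "state"
    return "county"

def parent_fips(code):
    """Return the code one level up in the hierarchy, or None for the nation."""
    level = fips_level(code)
    if level == "county":
        return code[:2] + "000"   # first two digits identify the state
    if level == "state":
        return "00000"
    return None

for code in ("00000", "06000", "06001"):
    print(code, fips_level(code), parent_fips(code))
```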
from collections import Counter
Counter([1,2,3,2])
Counter({2: 2, 1: 1, 3: 1})
from collections import Counter
counties_count_by_state = Counter((k[:2] for k in fips.iterkeys() if k[-3:] != '000'))
counties_count_by_state['06'] #CA
58
# check
counties_count_by_state['06'] == 58 #CA
True
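The counting step above combines a filter (skip codes ending in '000') with a key (the two-digit state prefix). Here is the same idea on a handful of codes, written for Python 3, where dict.iterkeys() no longer exists and iterating over the collection directly does the job. The sample codes are taken from the outputs earlier in these notes.

```python
from collections import Counter

# Small sample: one nation code, two state codes, four county codes.
sample_codes = ["00000", "01000", "01001", "01003", "01005", "06000", "06001"]

# Keep only county codes, then count them by state prefix.
county_counts = Counter(k[:2] for k in sample_codes if k[-3:] != "000")

print(county_counts)
print(county_counts.most_common(1))   # the state prefix with the most counties
```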
# return fips codes for county for given state prefix
def county_fips_for_state(state):
    # allow state to be of form '06000' or '06' -> look at first 2 digits only in state
    for k in fips.iterkeys():
        if k[:2] == state[:2] and k[-3:] != '000':
            yield k
# check for CA
print [(k, fips[k]) for k in list(county_fips_for_state('06000')) ]
[(u'06115', u'Yuba County, CA'), (u'06111', u'Ventura County, CA'), (u'06113', u'Yolo County, CA'), (u'06029', u'Kern County, CA'), (u'06021', u'Glenn County, CA'), (u'06023', u'Humboldt County, CA'), (u'06025', u'Imperial County, CA'), (u'06027', u'Inyo County, CA'), (u'06107', u'Tulare County, CA'), (u'06105', u'Trinity County, CA'), (u'06103', u'Tehama County, CA'), (u'06101', u'Sutter County, CA'), (u'06033', u'Lake County, CA'), (u'06007', u'Butte County, CA'), (u'06005', u'Amador County, CA'), (u'06003', u'Alpine County, CA'), (u'06001', u'Alameda County, CA'), (u'06009', u'Calaveras County, CA'), (u'06093', u'Siskiyou County, CA'), (u'06059', u'Orange County, CA'), (u'06095', u'Solano County, CA'), (u'06097', u'Sonoma County, CA'), (u'06011', u'Colusa County, CA'), (u'06013', u'Contra Costa County, CA'), (u'06015', u'Del Norte County, CA'), (u'06017', u'El Dorado County, CA'), (u'06019', u'Fresno County, CA'), (u'06043', u'Mariposa County, CA'), (u'06083', u'Santa Barbara County, CA'), (u'06045', u'Mendocino County, CA'), (u'06087', u'Santa Cruz County, CA'), (u'06085', u'Santa Clara County, CA'), (u'06081', u'San Mateo County, CA'), (u'06041', u'Marin County, CA'), (u'06109', u'Tuolumne County, CA'), (u'06039', u'Madera County, CA'), (u'06031', u'Kings County, CA'), (u'06037', u'Los Angeles County, CA'), (u'06035', u'Lassen County, CA'), (u'06091', u'Sierra County, CA'), (u'06089', u'Shasta County, CA'), (u'06065', u'Riverside County, CA'), (u'06067', u'Sacramento County, CA'), (u'06061', u'Placer County, CA'), (u'06063', u'Plumas County, CA'), (u'06069', u'San Benito County, CA'), (u'06099', u'Stanislaus County, CA'), (u'06077', u'San Joaquin County, CA'), (u'06075', u'San Francisco County, CA'), (u'06073', u'San Diego County, CA'), (u'06071', u'San Bernardino County, CA'), (u'06079', u'San Luis Obispo County, CA'), (u'06047', u'Merced County, CA'), (u'06049', u'Modoc County, CA'), (u'06055', u'Napa County, CA'), (u'06057', u'Nevada County, CA'), 
(u'06051', u'Mono County, CA'), (u'06053', u'Monterey County, CA')]
CA_county_fips = set([u'06099',
u'06057', u'06069', u'06093', u'06095', u'06097', u'06011', u'06013',
u'06015', u'06017', u'06115', u'06019', u'06079', u'06111', u'06047',
u'06113', u'06077', u'06039', u'06073', u'06071', u'06033', u'06031',
u'06037', u'06035', u'06091', u'06051', u'06065', u'06089', u'06087',
u'06085', u'06083', u'06041', u'06081', u'06007', u'06005', u'06075',
u'06003', u'06001', u'06109', u'06107', u'06105', u'06103', u'06009',
u'06101', u'06029', u'06067', u'06061', u'06045', u'06063', u'06021',
u'06059', u'06023', u'06025', u'06027', u'06043', u'06055', u'06053',
u'06049'])
print set(county_fips_for_state('06000')) == CA_county_fips
True
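county_fips_for_state scans every key each time it is called. An alternative, sketched here in Python 3, is to group the county codes by state prefix once and then look them up. build_county_index is a hypothetical helper of mine, shown on a small sample rather than the full fips dict.

```python
from collections import defaultdict

def build_county_index(fips_codes):
    """Map each 2-digit state prefix to a sorted list of its county codes."""
    index = defaultdict(list)
    for code in fips_codes:
        if code[-3:] != "000":          # skip the nation and state rows
            index[code[:2]].append(code)
    for codes in index.values():
        codes.sort()
    return index

sample_codes = ["00000", "01000", "01003", "01001", "06000", "06001"]
county_index = build_county_index(sample_codes)
print(county_index["01"])
print(county_index["06"])
```

The trade-off is a little extra memory for the index in exchange for not rescanning all keys once per state.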
Suggestion: use csv.DictReader to read the file into a dataset dict
# now load dataset
import codecs
import csv
from itertools import islice
# YOU NEED TO FILL IN data_dir for your own directory path
data_dir = "/Users/raymondyee/D/Document/Working_with_Open_Data/day02/"
dataset_fname = data_dir + "DataSet.txt"
f = codecs.open(dataset_fname, encoding='utf-8')
reader = csv.DictReader(f)
dataset = dict([(row["fips"], row) for row in islice(reader, None)])
# check number of keys and population of US
print len(dataset.keys()) == 3195
print dataset['00000']['POP010210'] == '308745538'
True
True
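To see what csv.DictReader produces, here is a sketch (Python 3) that reads a two-column excerpt from an in-memory string. The two population values are the ones shown elsewhere in these notes; the real DataSet.txt has many more columns, and sample_dataset is just an illustrative name.

```python
import csv
import io

# Two-column excerpt for illustration only.
sample = io.StringIO(
    "fips,POP010210\n"
    "00000,308745538\n"
    "06000,37253956\n"
)

# The first line supplies the keys; each later line becomes one dict.
reader = csv.DictReader(sample)
sample_dataset = {row["fips"]: row for row in reader}

print(sample_dataset["06000"]["POP010210"])

# Every value comes back as a string, so convert before doing arithmetic.
us_pop = int(sample_dataset["00000"]["POP010210"])
print(us_pop)
```

This is why the checks compare against the string '308745538' and why the sums wrap each value in int().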
sum([int(dataset[k]['POP010210']) for k in states_fips])
308745538
int(dataset['06000']['POP010210']) # total 2010 population of CA
37253956
# sum up all the counties too to confirm state totals
# CA
state = '06000'
sum([int(dataset[cf]['POP010210']) for cf in county_fips_for_state(state)]) == int(dataset[state]['POP010210'])
True
for state in states_fips:
    state_total = int(dataset[state]['POP010210'])
    county_total = sum([int(dataset[cf]['POP010210']) for cf in county_fips_for_state(state)])
    print state, fips[state], state_total, county_total == state_total
01000 ALABAMA 4779736 True
02000 ALASKA 710231 True
04000 ARIZONA 6392017 True
05000 ARKANSAS 2915918 True
06000 CALIFORNIA 37253956 True
08000 COLORADO 5029196 True
09000 CONNECTICUT 3574097 True
10000 DELAWARE 897934 True
11000 DISTRICT OF COLUMBIA 601723 True
12000 FLORIDA 18801310 True
13000 GEORGIA 9687653 True
15000 HAWAII 1360301 True
16000 IDAHO 1567582 True
17000 ILLINOIS 12830632 True
18000 INDIANA 6483802 True
19000 IOWA 3046355 True
20000 KANSAS 2853118 True
21000 KENTUCKY 4339367 True
22000 LOUISIANA 4533372 True
23000 MAINE 1328361 True
24000 MARYLAND 5773552 True
25000 MASSACHUSETTS 6547629 True
26000 MICHIGAN 9883640 True
27000 MINNESOTA 5303925 True
28000 MISSISSIPPI 2967297 True
29000 MISSOURI 5988927 True
30000 MONTANA 989415 True
31000 NEBRASKA 1826341 True
32000 NEVADA 2700551 True
33000 NEW HAMPSHIRE 1316470 True
34000 NEW JERSEY 8791894 True
35000 NEW MEXICO 2059179 True
36000 NEW YORK 19378102 True
37000 NORTH CAROLINA 9535483 True
38000 NORTH DAKOTA 672591 True
39000 OHIO 11536504 True
40000 OKLAHOMA 3751351 True
41000 OREGON 3831074 True
42000 PENNSYLVANIA 12702379 True
44000 RHODE ISLAND 1052567 True
45000 SOUTH CAROLINA 4625364 True
46000 SOUTH DAKOTA 814180 True
47000 TENNESSEE 6346105 True
48000 TEXAS 25145561 True
49000 UTAH 2763885 True
50000 VERMONT 625741 True
51000 VIRGINIA 8001024 True
53000 WASHINGTON 6724540 True
54000 WEST VIRGINIA 1852994 True
55000 WISCONSIN 5686986 True
56000 WYOMING 563626 True
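Rather than scanning a long column of True values by eye, the same check can report only the states that fail. This is a sketch in Python 3 using a hypothetical helper, states_with_mismatch, and toy numbers that are deliberately not Census figures.

```python
def states_with_mismatch(state_pop, county_pop):
    """Return the state codes whose county populations do not sum to the state total."""
    totals = {}
    for code, pop in county_pop.items():
        prefix = code[:2]                       # 2-digit state prefix
        totals[prefix] = totals.get(prefix, 0) + pop
    return sorted(s for s, pop in state_pop.items()
                  if totals.get(s[:2], 0) != pop)

# Toy numbers, not Census figures: state '02' is short by 10.
toy_state_pop = {"01000": 30, "02000": 50}
toy_county_pop = {"01001": 10, "01003": 20, "02013": 40}

print(states_with_mismatch(toy_state_pop, toy_county_pop))
```

An empty list from a function like this would correspond to the all-True column printed above.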