http://mathandmultimedia.com/2010/09/15/sum-first-n-positive-integers/
Gauss displayed his genius at an early age. According to anecdote, when he was in primary school he was punished by his teacher for misbehavior and told to add the numbers from 1 to 100. He computed the sum, 5050, in a matter of seconds.
Now, how on earth did he do it?
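One story has it that Gauss paired the numbers off: 1 with 100, 2 with 99, and so on, giving 50 pairs that each sum to 101, for a total of 50 × 101 = 5050. Here is a minimal sketch of that pairing trick (a reconstruction, not necessarily Gauss's actual method):
# the pairing trick: 50 pairs, each summing to 101
pairs = zip(range(1, 51), range(100, 50, -1))
print len(pairs), pairs[0], pairs[-1]
print sum(a + b for (a, b) in pairs)
50 (1, 100) (50, 51)
5050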
Let's verify this result in a number of ways. Take some time now to write some code to add up 1 to 100. Specifically, try using sum.
See also: The Python yield keyword explained
Beware: in ipython with pylab mode, sum might be overwritten by numpy's sum -- use __builtin__.sum
if you want http://docs.python.org/2/library/functions.html#sum as opposed to http://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html
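A quick way to check which sum is currently bound (a minimal sketch; in a plain Python session the first print is True and the second False, while in pylab mode the situation can be reversed):
import numpy as np
import __builtin__
print sum is __builtin__.sum
print sum is np.sum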
# using a loop
n = 100
s = 0L
for i in xrange(n+1):
    s += i
print s
5050
# range
print range(101)
sum(range(101))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
5050
# xrange
sum(xrange(101))
5050
from itertools import islice, count
c = count(0, 1)
# look at how count() works by repeatedly calling c.next()
print c.next()
print c.next()
0
1
# count
import __builtin__
__builtin__.sum(islice(count(0,1), 101L))
5050
import string

def alpha1():
    m = list(string.lowercase)
    while m:
        yield m.pop(0)
import string
k = (s for s in list(string.lowercase))
list(k)
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
def my_count(start, step):
    n = start
    while True:
        yield n
        n += step
__builtin__.sum(islice(my_count(0,1), 101L))
5050
$T_n= \sum_{k=1}^n k = 1+2+3+ \dotsb +n = \frac{n(n+1)}{2} = {n+1 \choose 2}$
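Plugging n = 100 into the closed form reproduces Gauss's answer directly:
# closed-form triangular number for n = 100
n = 100
print n * (n + 1) / 2
5050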
from itertools import islice
def triangular():
    n = 1
    i = 1
    while True:
        yield n
        i += 1
        n += i

for i, n in enumerate(islice(triangular(), 10)):
    print i+1, n
1 1
2 3
3 6
4 10
5 15
6 21
7 28
8 36
9 45
10 55
list(islice(triangular(), 100))[-1]
5050
list(islice(triangular(),99,100))[0]
5050
From http://en.wikipedia.org/wiki/Wheat_and_chessboard_problem:
If a chessboard were to have wheat placed upon each square such that one grain were placed on the first square, two on the second, four on the third, and so on (doubling the number of grains on each subsequent square), how many grains of wheat would be on the chessboard at the finish?
The total number of grains equals 18,446,744,073,709,551,615, which is a much higher number than most people intuitively expect.
# Legend of the Chessboard YouTube video
from IPython.display import YouTubeVideo
YouTubeVideo('t3d0Y-JpRRg')
# generator expression
k = (pow(2,n) for n in xrange(64))
k.next()
1
__builtin__.sum((pow(2,n) for n in xrange(64)))
18446744073709551615L
pow(2,64) -1
18446744073709551615L
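The agreement between the last two results is the finite geometric series identity

$\sum_{k=0}^{n-1} 2^k = 2^n - 1$

with $n = 64$.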
http://stackoverflow.com/a/509295/7782
Slicing works on any of the sequence types (see the Python docs on sequence types):
There are seven sequence types: strings, Unicode strings, lists, tuples, bytearrays, buffers, and xrange objects.
Square brackets are used to access slices of a sequence. Let's remind ourselves of how to use slices:
s[i]
s[i:j]
s[i:j:k]
m = range(10)
m
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
m[0]
0
m[-1]
9
m[::-1]
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
m[2:3]
[2]
import string
alphabet = string.lowercase
alphabet
'abcdefghijklmnopqrstuvwxyz'
# 13th letter of the alphabet
alphabet[12]
'm'
We will revisit generalized slicing in NumPy.
http://my.safaribooksonline.com/book/programming/python/9781449323592/1dot-preliminaries/id2699702
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
These imports are done for you in pylab mode.
ipython --help
yields
--pylab=<CaselessStrEnum> (InteractiveShellApp.pylab)
Default: None
Choices: ['tk', 'qt', 'wx', 'gtk', 'osx', 'inline', 'auto']
Pre-load matplotlib and numpy for interactive use, selecting a particular
matplotlib backend and loop integration.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
NumPy is the fundamental package for scientific computing with Python. It contains among other things:
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
See PfDA, Chapter 4.
# zero-dimensions
a0 = array(5)
a0
array(5)
Use ndim and shape to get the number of dimensions and a tuple of array dimensions:
a0.ndim, a0.shape
(0, ())
# 1-d array
a1 = array([1,2])
a1.ndim, a1.shape
(1, (2,))
# 2-d array
a2 = array(([1,2], [3,4]))
a2.ndim, a2.shape
(2, (2, 2))
a2.dtype
dtype('int32')
arange is one instance of an ndarray-creating function in NumPy. Compare to xrange.
type(arange(10))
numpy.ndarray
for k in arange(10):
    print k
0
1
2
3
4
5
6
7
8
9
list(arange(10)) == list(xrange(10))
True
# how to map 0..63 -> a 2-d (8x8) array
a3 = np.arange(64).reshape(8,8)
a3
array([[ 0, 1, 2, 3, 4, 5, 6, 7], [ 8, 9, 10, 11, 12, 13, 14, 15], [16, 17, 18, 19, 20, 21, 22, 23], [24, 25, 26, 27, 28, 29, 30, 31], [32, 33, 34, 35, 36, 37, 38, 39], [40, 41, 42, 43, 44, 45, 46, 47], [48, 49, 50, 51, 52, 53, 54, 55], [56, 57, 58, 59, 60, 61, 62, 63]])
a3[1,2]
10
for i in range(8):
    for j in range(8):
        if a3[i,j] != i*8 + j:
            print i, j
An example of broadcasting (this description is from the NumPy documentation):
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.
2*a3
array([[ 0, 2, 4, 6, 8, 10, 12, 14], [ 16, 18, 20, 22, 24, 26, 28, 30], [ 32, 34, 36, 38, 40, 42, 44, 46], [ 48, 50, 52, 54, 56, 58, 60, 62], [ 64, 66, 68, 70, 72, 74, 76, 78], [ 80, 82, 84, 86, 88, 90, 92, 94], [ 96, 98, 100, 102, 104, 106, 108, 110], [112, 114, 116, 118, 120, 122, 124, 126]])
a3+2
array([[ 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 16, 17], [18, 19, 20, 21, 22, 23, 24, 25], [26, 27, 28, 29, 30, 31, 32, 33], [34, 35, 36, 37, 38, 39, 40, 41], [42, 43, 44, 45, 46, 47, 48, 49], [50, 51, 52, 53, 54, 55, 56, 57], [58, 59, 60, 61, 62, 63, 64, 65]])
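Broadcasting is not limited to scalars: a smaller array can be stretched across a larger one. A minimal sketch, assuming a3 is still the 8x8 array from above -- adding arange(8) adds 0..7 elementwise to every row, so the first row becomes [ 0 2 4 6 8 10 12 14]:
# broadcast a length-8 array across each row of the 8x8 array
a3 + np.arange(8)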
# reverse sort -- best way?
#http://stackoverflow.com/a/6771620/7782
np.sort(arange(100))[::-1]
array([99, 98, 97, 96, 95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84, 83, 82, 81, 80, 79, 78, 77, 76, 75, 74, 73, 72, 71, 70, 69, 68, 67, 66, 65, 64, 63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
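Since arange(100) is already sorted ascending, the sort is redundant here; a simpler sketch:
# reversing an already-sorted array with a slice
arange(100)[::-1]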
This stuff is a bit tricky (see PfDA, pp. 89-92)
Consider the example of picking out whole numbers less than 20 that are evenly divisible by 3. Generate a list of such numbers.
# list comprehension
[i for i in xrange(20) if i % 3 == 0]
[0, 3, 6, 9, 12, 15, 18]
a3 = arange(20)
a3
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
# basic indexing
print a3[0]
print a3[::-1]
print a3[2:5]
0
[19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0]
[2 3 4]
np.mod(a3, 3)
array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1])
np.mod(a3, 3) == 0
array([ True, False, False, True, False, False, True, False, False, True, False, False, True, False, False, True, False, False, True, False], dtype=bool)
divisible_by_3 = np.mod(a3, 3) == 0
a3[divisible_by_3]
array([ 0, 3, 6, 9, 12, 15, 18])
# if you want to understand this in terms of the overloaded operators -- don't worry if you don't get this.
a3.__getitem__(np.mod(a3,3).__eq__(0))
array([ 0, 3, 6, 9, 12, 15, 18])
Use arange, np.sqrt, and astype to pick out the perfect squares less than 100:
a4 = arange(100)
a4sqrt = np.sqrt(a4)
a4[a4sqrt == a4sqrt.astype(np.int)]
array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81])
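The exact equality test works here because np.sqrt of a small perfect square is exact in float64, but comparing floats with == is fragile in general. A more defensive sketch (an addition, with an arbitrary tolerance):
# guard against floating-point error with a tolerance instead of exact equality
a4[np.abs(a4sqrt - np.round(a4sqrt)) < 1e-9]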
Make a series out of an array
s1 = Series(arange(5))
confirm that the type of s1 is what you would expect
type(s1)
pandas.core.series.Series
show that the series is also an array
s1.ndim, isinstance(s1, np.ndarray)
(1, True)
s1.index
Int64Index([0, 1, 2, 3, 4], dtype=int64)
import string
allTheLetters = string.lowercase
allTheLetters
'abcdefghijklmnopqrstuvwxyz'
s2 = Series(data=arange(5), index=list(allTheLetters)[:5])
s2
a    0
b    1
c    2
d    3
e    4
s2.index
Index([a, b, c, d, e], dtype=object)
Compared with a regular NumPy array, you can use values in the index when selecting single values or a set of values
# can use both numeric indexing and the labels
s2[0], s2['a']
(0, 0)
for i in range(len(s2)):
    print i, s2[i]
0 0
1 1
2 2
3 3
4 4
It is possible to have conflicts in indexing -- consider:
s3 = Series(data=['albert', 'betty', 'cathy'], index=[3,1, 0])
s3
3    albert
1     betty
0     cathy
s3[0], list(s3)[0]
('cathy', 'albert')
but slicing works by position, not by label:
s3[::-1]
0     cathy
1     betty
3    albert
for i in range(len(s3)):
    print i, s3[i:i+1]
0 3    albert
1 1    betty
2 0    cathy
s3.name = 'person names'
s3.name
'person names'
s3.index.name = 'confounding label'
s3.index.name
'confounding label'
s3
confounding label
3    albert
1     betty
0     cathy
Name: person names
Important points remaining: you get some nice matplotlib integration via pandas.
# Gauss addition using np.arange, Series
from pandas import Series
Series(arange(101).cumsum()).plot()
<matplotlib.axes.AxesSubplot at 0x4dce930>
from pandas import Series
Series((pow(2,k) for k in xrange(64)), dtype=np.float64).cumsum().plot()
<matplotlib.axes.AxesSubplot at 0x4e1ab30>
2*ones(64, dtype=np.int)
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
arange(64)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63])
sum(np.power(2, arange(64, dtype=np.uint64)))
18446744073709551615
sum(np.power(2*ones(64, dtype=np.uint64), arange(64)))
1.8446744073709552e+19
precise_ans = sum([pow(2,n) for n in xrange(64)])
np_ans = sum(np.power(2*ones(64, dtype=np.uint64), arange(64)))
precise_ans, np_ans
(18446744073709551615L, 1.8446744073709552e+19)
# Raise an assertion if two items are not equal up to desired precision.
np.testing.assert_almost_equal(precise_ans, np_ans) is None
True
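Why "almost" rather than exactly equal? float64 carries only a 53-bit significand, so the exact 64-bit value 2**64 - 1 is not representable and rounds up:
# 2**64 - 1 rounds to 2.0**64 in float64
float(2**64 - 1) == 2.0**64
True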
Here we check to make sure that the expected files are in place.
This notebook assumes relative location of files as laid out in https://github.com/rdhyee/working-open-data
import os
# relative to parent dir
DATA_FILES = {"datadict": "data/census/DataDict.txt",
              "dataset": "data/census/DataSet.txt",
              "fips": "data/census/FIPS_CountyName.txt"}

def file_path(key):
    return os.path.join(os.pardir, DATA_FILES[key])

for file_key in DATA_FILES.keys():
    abs_fname = file_path(file_key)
    print abs_fname, os.path.exists(abs_fname)
../data/census/DataDict.txt True
../data/census/FIPS_CountyName.txt True
../data/census/DataSet.txt True
You can download the git repo as a zip file if you are having problems with GitHub.
import pandas as pd
from pandas import Series, DataFrame
# use head on Mac/Linux to look at the file
fips_path = file_path("fips")
!head $fips_path
00000 UNITED STATES
01000 ALABAMA
01001 Autauga County, AL
01003 Baldwin County, AL
01005 Barbour County, AL
01007 Bibb County, AL
01009 Blount County, AL
01011 Bullock County, AL
01013 Butler County, AL
01015 Calhoun County, AL
df1 = DataFrame([{'fips':'00000', 'geog_entity':'UNITED STATES'},
                 {'fips':'01000', 'geog_entity':'ALABAMA'}])
df1
fips | geog_entity | |
---|---|---|
0 | 00000 | UNITED STATES |
1 | 01000 | ALABAMA |
# check type of df1
type(df1)
pandas.core.frame.DataFrame
# column headings
df1.columns
Index([fips, geog_entity], dtype=object)
# index -- ascribed automatically since we didn't explicitly specify labels
df1.index
Int64Index([0, 1], dtype=int64)
We can use set_index to make fips the index.
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html#pandas.DataFrame.set_index
# p. 150
df1.set_index('fips', inplace=True, drop=True)
/Users/raymondyee/.virtualenvs/epd1/lib/python2.7/site-packages/pandas/core/frame.py:2762: FutureWarning: set_index with inplace=True will return None from pandas 0.11 onward
  " from pandas 0.11 onward", FutureWarning)
geog_entity | |
---|---|
fips | |
00000 | UNITED STATES |
01000 | ALABAMA |
# now check the index
df1.index
Index([00000, 01000], dtype=object)
# to access row with index of 01000
df1.ix['01000']
geog_entity    ALABAMA
Name: 01000
# to read off the geog_entity column
df1['geog_entity']
fips
00000    UNITED STATES
01000          ALABAMA
Name: geog_entity
# look at how we can do row slices
df1.ix[0:1]
geog_entity | |
---|---|
fips | |
00000 | UNITED STATES |
# get rows in reverse order
df1.ix[::-1]
geog_entity | |
---|---|
fips | |
01000 | ALABAMA |
00000 | UNITED STATES |
Create the index explicitly when instantiating the simple DataFrame:
df2 = DataFrame(data=[{'geog_entity':'UNITED STATES'},
                      {'geog_entity':'ALABAMA'}],
                index=['00000', '01000'])
df2
geog_entity | |
---|---|
00000 | UNITED STATES |
01000 | ALABAMA |
# note index name is empty
df2.index.name is None
True
# set index name
df2.index.name = 'fips'
df2
geog_entity | |
---|---|
fips | |
00000 | UNITED STATES |
01000 | ALABAMA |
import codecs
from itertools import islice
from pandas import Series, DataFrame
f = codecs.open(file_path("fips"), encoding='iso-8859-1')
fips_list = list()
for row in islice(f, None):
    fips_list.append({'fips': row[:5], 'geog_entity': row[6:-1]})
fips_df = DataFrame(fips_list)
fips_df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3195 entries, 0 to 3194
Data columns:
fips           3195 non-null values
geog_entity    3195 non-null values
dtypes: object(2)
# display the first 5 rows
fips_df[:5]
fips | geog_entity | |
---|---|---|
0 | 00000 | UNITED STATES |
1 | 01000 | ALABAMA |
2 | 01001 | Autauga County, AL |
3 | 01003 | Baldwin County, AL |
4 | 01005 | Barbour County, AL |
import codecs
from itertools import islice
import pandas as pd
from pandas import Series, DataFrame
f = codecs.open(file_path("fips"), encoding='iso-8859-1')
fips_df2 = pd.read_table(f, header=None, names=['row'])
# show first five rows
fips_df2[:5]
row | |
---|---|
0 | 00000 UNITED STATES |
1 | 01000 ALABAMA |
2 | 01001 Autauga County, AL |
3 | 01003 Baldwin County, AL |
4 | 01005 Barbour County, AL |
Hint: consider using fips_df2.row.str[:] and see what happens if you use fips_df2["junk"] = 'hello'.
BTW, you can delete columns -- e.g., del fips_df2["junk"]
fips_df2["fips"] = fips_df2.row.str[:5]
# show first five rows
fips_df2[:5]
row | fips | |
---|---|---|
0 | 00000 UNITED STATES | 00000 |
1 | 01000 ALABAMA | 01000 |
2 | 01001 Autauga County, AL | 01001 |
3 | 01003 Baldwin County, AL | 01003 |
4 | 01005 Barbour County, AL | 01005 |
fips_df2["geog_entity"] = fips_df2.row.str[6:]
fips_df2[:5]
row | fips | geog_entity | |
---|---|---|---|
0 | 00000 UNITED STATES | 00000 | UNITED STATES |
1 | 01000 ALABAMA | 01000 | ALABAMA |
2 | 01001 Autauga County, AL | 01001 | Autauga County, AL |
3 | 01003 Baldwin County, AL | 01003 | Baldwin County, AL |
4 | 01005 Barbour County, AL | 01005 | Barbour County, AL |
del fips_df2["row"]
fips_df2[:5]
fips | geog_entity | |
---|---|---|
0 | 00000 | UNITED STATES |
1 | 01000 | ALABAMA |
2 | 01001 | Autauga County, AL |
3 | 01003 | Baldwin County, AL |
4 | 01005 | Barbour County, AL |
# this is a check that you have the right columns
assert set(fips_df2.columns) == set(['fips', 'geog_entity'])
assert len(fips_df2) == 3195
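As an aside, since the fips file is fixed-width, pandas can parse it directly; a hedged sketch, assuming this pandas version supports read_fwf with the colspecs and converters keyword arguments:
# hypothetical alternative: parse the fixed-width fips file in one call
f = codecs.open(file_path("fips"), encoding='iso-8859-1')
fwf_df = pd.read_fwf(f, colspecs=[(0, 5), (6, 100)], header=None,
                     names=['fips', 'geog_entity'], converters={'fips': str})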
fips_df2.set_index('fips', inplace=True, drop=True)
<class 'pandas.core.frame.DataFrame'>
Index: 3195 entries, 00000 to 56045
Data columns:
geog_entity    3195 non-null values
dtypes: object(1)
# print out first five rows
fips_df2[:5]
geog_entity | |
---|---|
fips | |
00000 | UNITED STATES |
01000 | ALABAMA |
01001 | Autauga County, AL |
01003 | Baldwin County, AL |
01005 | Barbour County, AL |
fips_df2.ix['00000']
geog_entity    UNITED STATES
Name: 00000
# first 5 rows
fips_df2[Series(fips_df2.index).str[-3:] == '000'][:5]
geog_entity | |
---|---|
fips | |
00000 | UNITED STATES |
01000 | ALABAMA |
02000 | ALASKA |
04000 | ARIZONA |
05000 | ARKANSAS |
# 52 = the USA itself + 50 states + DC
len(fips_df2[Series(fips_df2.index).str[-3:] == '000'])
52
# check type
isinstance(fips_df2.index, pd.Index)
True
# can create a Boolean slice to pull out a country
is_country = fips_df.fips.str[:] == '00000'
fips_df[is_country]
fips | geog_entity | |
---|---|---|
0 | 00000 | UNITED STATES |
Write is_state to pull out states and a DataFrame states_df to hold the states + DC; also write is_county to pull out counties.
Hints: use & to do a boolean and on Boolean indices (see PfDA, p. 91).
# how about states?
# make use of
# http://proquest.safaribooksonline.com/book/programming/python/9781449323592/7dot-data-wrangling-clean-transform-merge-reshape/id2801165
is_state = (fips_df.fips.str[-3:] == '000') & (fips_df.fips.str[:] != '00000')
is_county = fips_df.fips.str[-3:] != '000'
states_df = fips_df[is_state]
print len(states_df)
states_df[:5]
51
fips | geog_entity | |
---|---|---|
1 | 01000 | ALABAMA |
69 | 02000 | ALASKA |
99 | 04000 | ARIZONA |
115 | 05000 | ARKANSAS |
191 | 06000 | CALIFORNIA |
assert set(is_state.value_counts().iteritems()) == set([(True, 51), (False, 3144)])
assert set(states_df["geog_entity"]) == set([u'VERMONT',
u'GEORGIA', u'IOWA', u'KANSAS', u'FLORIDA', u'VIRGINIA', u'NORTH CAROLINA',
u'HAWAII', u'NEW YORK', u'CALIFORNIA', u'ALABAMA', u'IDAHO', u'DELAWARE',
u'ALASKA', u'ILLINOIS', u'SOUTH DAKOTA', u'CONNECTICUT', u'MONTANA',
u'MASSACHUSETTS', u'NEW HAMPSHIRE', u'MARYLAND', u'NEW MEXICO',
u'MISSISSIPPI', u'TENNESSEE', u'COLORADO', u'NEW JERSEY', u'UTAH',
u'MICHIGAN', u'WEST VIRGINIA', u'WASHINGTON', u'MINNESOTA', u'OREGON',
u'WYOMING', u'OHIO', u'SOUTH CAROLINA', u'INDIANA', u'NEVADA',
u'LOUISIANA', u'NEBRASKA', u'ARIZONA', u'WISCONSIN', u'NORTH DAKOTA',
u'PENNSYLVANIA', u'OKLAHOMA', u'KENTUCKY', u'RHODE ISLAND', u'DISTRICT OF COLUMBIA',
u'ARKANSAS', u'MISSOURI', u'TEXAS', u'MAINE'])
assert set(is_county.value_counts().iteritems()) == set([(False, 52), (True, 3143)])
import csv
import codecs
import pandas as pd
from pandas import DataFrame, Series
from itertools import islice
f = codecs.open(file_path("dataset"), encoding='iso-8859-1')
reader = csv.DictReader(f)
dataset = dict([(row["fips"], row) for row in islice(reader, None)])
f.close()
# print out entry for USA
print dataset['00000']
{'RHI125211': '78.1', 'SBO415207': '8.3', 'VET605211': '22215303', 'RHI225211': '13.1', 'PVY020211': '14.3', 'HSD310211': '2.60', 'POP645211': '12.8', 'AGE135211': '6.5', 'EDU685211': '28.2', 'RHI625211': '2.3', 'SBO215207': '5.7', 'PST045212': '313914040', 'SBO015207': '28.8', 'POP715211': '84.6', 'PST120211': '0.9', 'PST120212': '1.7', 'POP010210': '308745538', 'PST045211': '311587816', 'SBO315207': '7.1', 'POP060210': '87.4', 'RHI425211': '5.0', 'POP815211': '20.3', 'HSD410211': '114761359', 'HSG495211': '186200', 'BZA010210': '7396628', 'LFE305211': '25.4', 'BZA110210': '111970095', 'AGE775211': '13.3', 'HSG096211': '25.9', 'RHI525211': '0.2', 'LND110210': '3531905.43', 'PST040210': '308747508', 'RHI825211': '63.4', 'BZA115210': '-1.8', 'NES010210': '22110628', 'fips': '00000', 'MAN450207': '5338306501', 'EDU635211': '85.4', 'RTN131207': '12990', 'RHI725211': '16.7', 'BPS030211': '624061', 'INC110211': '52762', 'AGE295211': '23.7', 'SBO115207': '0.9', 'INC910211': '27915', 'RHI325211': '1.2', 'WTN220207': '4174286516', 'HSG445211': '66.1', 'SBO515207': '0.1', 'AFN120207': '613795732', 'RTN130207': '3917663456', 'HSG010211': '132312404', 'SEX255211': '50.8', 'SBO001207': '27092908'}
# make sure we have the right 2010 census population for the USA
assert dataset['00000']['POP010210'] == '308745538'
# let's try pd.read_csv
# how to give hints about data type to pd.read_csv?
import codecs
# read data in and merge with fips_df
#dtype = [('fips', 'S'), ('POP010210', 'i')]
# problems with string conversion when I try to explicitly parse fips as a np.string_
# dtype = {'fips':np.string_}
# possibly relevant sections of the code:
# v0.10.1: https://github.com/pydata/pandas/tree/v0.10.1
# https://github.com/pydata/pandas/blob/v0.10.1/pandas/io/parsers.py
# https://github.com/pydata/pandas/blob/v0.10.1/pandas/src/parser.pyx#L897 -> _convert_with_dtype
# let's try object for fips and int for POP010210
dtype = {'fips':np.object, 'POP010210':np.int}
f = codecs.open(file_path("dataset"), encoding='iso-8859-1')
dataset_df = pd.read_csv(f, dtype=dtype)
# confirm data types for fips, which should not be converted to int
dataset_df.fips.dtype, dataset_df.POP010210.dtype
(dtype('object'), dtype('int64'))
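A possible workaround for the string-conversion problem noted in the comments above: use a converter rather than a dtype (a sketch, assuming converters is honored by this version of pd.read_csv):
# hypothetical alternative: keep fips as a string via a converter
f = codecs.open(file_path("dataset"), encoding='iso-8859-1')
dataset_df2 = pd.read_csv(f, converters={'fips': str})
dataset_df2.fips.dtype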
# read off US population
dataset_df[is_country].ix[0]['POP010210']
308745538
# 3 most populous entities
# http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.sort_index.html
dataset_df.sort_index(by='POP010210')[-1:-4:-1][["fips", "POP010210"]]
fips | POP010210 | |
---|---|---|
0 | 00000 | 308745538 |
191 | 06000 | 37253956 |
2567 | 48000 | 25145561 |
# before doing merge, list expected columns
'fips' in fips_df.columns, 'fips' in dataset_df.columns
(True, True)
# PfDA, p. 28
df = pd.merge(fips_df, dataset_df)
df.sort_index(by='POP010210')[-1:-5:-1][["fips", "geog_entity", "POP010210"]]
fips | geog_entity | POP010210 | |
---|---|---|---|
0 | 00000 | UNITED STATES | 308745538 |
191 | 06000 | CALIFORNIA | 37253956 |
2567 | 48000 | TEXAS | 25145561 |
1861 | 36000 | NEW YORK | 19378102 |
state_pop = df[is_state][["fips", "geog_entity", "POP010210"]].sort_index(by='POP010210')[::-1]
state_pop[:5]
fips | geog_entity | POP010210 | |
---|---|---|---|
191 | 06000 | CALIFORNIA | 37253956 |
2567 | 48000 | TEXAS | 25145561 |
1861 | 36000 | NEW YORK | 19378102 |
330 | 12000 | FLORIDA | 18801310 |
609 | 17000 | ILLINOIS | 12830632 |
# check state_pop behavior
assert state_pop.shape == (51, 3)
assert set(state_pop.columns) == set(['POP010210', 'fips', 'geog_entity'])
assert list(state_pop[:5]["geog_entity"]) == [u'CALIFORNIA', u'TEXAS', u'NEW YORK', u'FLORIDA', u'ILLINOIS']
assert state_pop[:5]["POP010210"].sum() == 113409561
# add up all the states to match US population
assert state_pop["POP010210"].sum() == df[is_country]["POP010210"].sum()
# country total, state totals, and county totals should all agree
sum(df[is_country]['POP010210']), sum(df[is_state]['POP010210']), sum(df[is_county]['POP010210'])
(308745538, 308745538, 308745538)
E.g., df[counties_for_state('06')] should be a DataFrame with CA counties:
def counties_for_state(state):
    return (df.fips.str[:2] == state[:2]) & (df.fips.str[-3:] != '000')
assert set(df[counties_for_state('06')]["geog_entity"]) == set([u'Nevada County, CA',
u'Alameda County, CA', u'Kings County, CA', u'Ventura County, CA',
u'El Dorado County, CA', u'San Joaquin County, CA', u'Alpine County, CA',
u'San Luis Obispo County, CA', u'Modoc County, CA', u'Colusa County, CA',
u'Stanislaus County, CA', u'Sonoma County, CA', u'Tulare County, CA',
u'Shasta County, CA', u'Yolo County, CA', u'Placer County, CA', u'Glenn County, CA',
u'Sacramento County, CA', u'San Francisco County, CA',
u'Madera County, CA', u'Imperial County, CA', u'Plumas County, CA',
u'San Mateo County, CA', u'Riverside County, CA', u'Calaveras County, CA',
u'Napa County, CA', u'Mendocino County, CA', u'Mariposa County, CA',
u'Santa Barbara County, CA', u'Inyo County, CA', u'Butte County, CA',
u'Trinity County, CA', u'Los Angeles County, CA', u'Lassen County, CA',
u'Yuba County, CA', u'Amador County, CA', u'Marin County, CA', u'Humboldt County, CA',
u'Merced County, CA', u'Lake County, CA', u'San Diego County, CA',
u'Monterey County, CA', u'Sutter County, CA', u'Solano County, CA',
u'Tuolumne County, CA', u'San Bernardino County, CA', u'Fresno County, CA',
u'Santa Cruz County, CA', u'San Benito County, CA', u'Contra Costa County, CA',
u'Orange County, CA', u'Del Norte County, CA', u'Mono County, CA',
u'Siskiyou County, CA', u'Kern County, CA', u'Sierra County, CA', u'Tehama County, CA',
u'Santa Clara County, CA'])
for (k, s) in df[is_state][['fips', 'geog_entity', 'POP010210']].iterrows():
    print s["fips"], s["geog_entity"], s["POP010210"] == df[counties_for_state(s["fips"])]['POP010210'].sum()
01000 ALABAMA True 02000 ALASKA True 04000 ARIZONA True 05000 ARKANSAS True 06000 CALIFORNIA True 08000 COLORADO True 09000 CONNECTICUT True 10000 DELAWARE True 11000 DISTRICT OF COLUMBIA True 12000 FLORIDA True 13000 GEORGIA True 15000 HAWAII True 16000 IDAHO True 17000 ILLINOIS True 18000 INDIANA True 19000 IOWA True 20000 KANSAS True 21000 KENTUCKY True 22000 LOUISIANA True 23000 MAINE True 24000 MARYLAND True 25000 MASSACHUSETTS True 26000 MICHIGAN True 27000 MINNESOTA True 28000 MISSISSIPPI True 29000 MISSOURI True 30000 MONTANA True 31000 NEBRASKA True 32000 NEVADA True 33000 NEW HAMPSHIRE True 34000 NEW JERSEY True 35000 NEW MEXICO True 36000 NEW YORK True 37000 NORTH CAROLINA True 38000 NORTH DAKOTA True 39000 OHIO True 40000 OKLAHOMA True 41000 OREGON True 42000 PENNSYLVANIA True 44000 RHODE ISLAND True 45000 SOUTH CAROLINA True 46000 SOUTH DAKOTA True 47000 TENNESSEE True 48000 TEXAS True 49000 UTAH True 50000 VERMONT True 51000 VIRGINIA True 53000 WASHINGTON True 54000 WEST VIRGINIA True 55000 WISCONSIN True 56000 WYOMING True
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html#pandas.DataFrame.set_index
# p. 150
df.set_index('fips', inplace=True, drop=False)
<class 'pandas.core.frame.DataFrame'> Index: 3195 entries, 00000 to 56045 Data columns: fips 3195 non-null values geog_entity 3195 non-null values PST045212 3195 non-null values PST045211 3195 non-null values PST040210 3195 non-null values PST120212 3195 non-null values PST120211 3195 non-null values POP010210 3195 non-null values AGE135211 3195 non-null values AGE295211 3195 non-null values AGE775211 3195 non-null values SEX255211 3195 non-null values RHI125211 3195 non-null values RHI225211 3195 non-null values RHI325211 3195 non-null values RHI425211 3195 non-null values RHI525211 3195 non-null values RHI625211 3195 non-null values RHI725211 3195 non-null values RHI825211 3195 non-null values POP715211 3195 non-null values POP645211 3195 non-null values POP815211 3195 non-null values EDU635211 3195 non-null values EDU685211 3195 non-null values VET605211 3195 non-null values LFE305211 3195 non-null values HSG010211 3195 non-null values HSG445211 3195 non-null values HSG096211 3195 non-null values HSG495211 3195 non-null values HSD410211 3195 non-null values HSD310211 3195 non-null values INC910211 3195 non-null values INC110211 3195 non-null values PVY020211 3195 non-null values BZA010210 3195 non-null values BZA110210 3195 non-null values BZA115210 3195 non-null values NES010210 3195 non-null values SBO001207 3195 non-null values SBO315207 3195 non-null values SBO115207 3195 non-null values SBO215207 3195 non-null values SBO515207 3195 non-null values SBO415207 3195 non-null values SBO015207 3195 non-null values MAN450207 3195 non-null values WTN220207 3195 non-null values RTN130207 3195 non-null values RTN131207 3195 non-null values AFN120207 3195 non-null values BPS030211 3195 non-null values LND110210 3195 non-null values POP060210 3195 non-null values dtypes: float64(33), int64(20), object(2)
for k in df[is_state]['fips']:
    print k, df.ix[k]["geog_entity"], df.ix[k]["POP010210"] == sum(df[counties_for_state(k)]['POP010210'])
01000 ALABAMA True 02000 ALASKA True 04000 ARIZONA True 05000 ARKANSAS True 06000 CALIFORNIA True 08000 COLORADO True 09000 CONNECTICUT True 10000 DELAWARE True 11000 DISTRICT OF COLUMBIA True 12000 FLORIDA True 13000 GEORGIA True 15000 HAWAII True 16000 IDAHO True 17000 ILLINOIS True 18000 INDIANA True 19000 IOWA True 20000 KANSAS True 21000 KENTUCKY True 22000 LOUISIANA True 23000 MAINE True 24000 MARYLAND True 25000 MASSACHUSETTS True 26000 MICHIGAN True 27000 MINNESOTA True 28000 MISSISSIPPI True 29000 MISSOURI True 30000 MONTANA True 31000 NEBRASKA True 32000 NEVADA True 33000 NEW HAMPSHIRE True 34000 NEW JERSEY True 35000 NEW MEXICO True 36000 NEW YORK True 37000 NORTH CAROLINA True 38000 NORTH DAKOTA True 39000 OHIO True 40000 OKLAHOMA True 41000 OREGON True 42000 PENNSYLVANIA True 44000 RHODE ISLAND True 45000 SOUTH CAROLINA True 46000 SOUTH DAKOTA True 47000 TENNESSEE True 48000 TEXAS True 49000 UTAH True 50000 VERMONT True 51000 VIRGINIA True 53000 WASHINGTON True 54000 WEST VIRGINIA True 55000 WISCONSIN True 56000 WYOMING True
# does the order of A == B vs B == A matter? it shouldn't
state = '06000'
k0 = df.fips.str[:2] == state[:2]
type(k0)
k1 = state[:2] == df.fips.str[:2]
print type(k0), type(k1)
print k0.value_counts()
print k1.value_counts()
print np.all(k0 == k1)
<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
False    3136
True       59
False    3136
True       59
True
n0 = 5
n0 == 5
True
Now I thought I'd be able to use n0.__eq__(5), but nope -- it's complicated -- see http://stackoverflow.com/questions/2281222/why-when-in-python-does-x-y-call-y-eq-x#comment2254663_2282795
try:
    n0.__eq__(5)
except Exception as e:
    print e
'int' object has no attribute '__eq__'
We can, however, use int.__cmp__:
(n0.__cmp__(4), n0.__cmp__(5), n0.__cmp__(6))
(1, 0, -1)
How about ndarray?
arange(5) == 2
array([False, False, True, False, False], dtype=bool)
# http://docs.scipy.org/doc/numpy/reference/generated/numpy.array_equal.html
np.array_equal(arange(5) == 2 , arange(5).__eq__(2))
True
Useful if you want to understand how the slicing syntax really works.
isinstance([1,2], list)
True
isinstance(arange(5), list) # what does that mean -- could still be list-like
False
l1 = range(5)
type(l1)
list
l1[0], l1.__getitem__(0), l1[0] == l1.__getitem__(0)
(0, 0, True)
l1[::-1], l1.__getitem__(slice(None, None, -1))
([4, 3, 2, 1, 0], [4, 3, 2, 1, 0])
ar1 = arange(5)
ar1[3], ar1.__getitem__(3)
(3, 3)
ar1 == 2
array([False, False, True, False, False], dtype=bool)
ar1[ar1 == 2].shape
(1,)
ar1.__eq__(2)
array([False, False, True, False, False], dtype=bool)
ar1.__getitem__(slice(2, 4, None))
array([2, 3])
slice(ar1.__eq__(2), None, None)
slice(array([False, False, True, False, False], dtype=bool), None, None)
ar1.__getitem__(ar1.__eq__(2))
array([2])
ar1[:2], ar1.__getitem__(slice(2))
(array([0, 1]), array([0, 1]))
ar1 + 7
array([ 7, 8, 9, 10, 11])
ar1.__add__(7)
array([ 7, 8, 9, 10, 11])
min(ar1 + 7)
7
alphabet[:]
'abcdefghijklmnopqrstuvwxyz'