http://mathandmultimedia.com/2010/09/15/sum-first-n-positive-integers/
Gauss displayed his genius at an early age. According to anecdote, when he was in primary school, his teacher punished him for misbehavior by telling him to add the numbers from 1 to 100. He computed the sum, 5050, in a matter of seconds.
Now, how on earth did he do it?
Let's verify this result in a number of ways. Take some time now to write some code to add up 1 to 100. Specifically, use sum.

See also: The Python yield keyword explained

Beware: in ipython with pylab mode, sum might be overwritten by numpy's sum -- use __builtin__.sum if you want http://docs.python.org/2/library/functions.html#sum as opposed to http://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html
#FILL IN WITH YOUR CODE TO CALCULATE ANSWER OF 5050
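One minimal way to fill this in, using the builtin sum over range (range's stop value is exclusive, so range(1, 101) covers 1 through 100):

```python
# add the integers 1 through 100
total = sum(range(1, 101))
print(total)  # 5050
```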
Here's a formula for the n-th triangular number
$T_n= \sum_{k=1}^n k = 1+2+3+ \dotsb +n = \frac{n(n+1)}{2} = {n+1 \choose 2}$
n = 100
n*(n+1)/2
The following is a hint for writing triangular
from itertools import islice
def my_count(start, step):
    n = start
    while True:
        yield n
        n += step
__builtin__.sum(islice(my_count(0,1), 101L))
from itertools import islice
def triangular():
    # FILL IN
    pass
# a check
assert list(islice(triangular(), 5)) == [1, 3, 6, 10, 15]
for i, n in enumerate(islice(triangular(), 10)):
    print i+1, n
list(islice(triangular(), 100))[-1]
list(islice(triangular(),99,100))[0]
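One possible implementation of triangular, written as an infinite generator of running sums in the style of my_count above:

```python
from itertools import islice

def triangular():
    # infinite generator of the running sums 1, 1+2, 1+2+3, ...
    total = 0
    n = 1
    while True:
        total += n
        yield total
        n += 1

print(list(islice(triangular(), 5)))  # [1, 3, 6, 10, 15]
```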
http://en.wikipedia.org/wiki/Wheat_and_chessboard_problem :
If a chessboard were to have wheat placed upon each square such that one grain were placed on the first square, two on the second, four on the third, and so on (doubling the number of grains on each subsequent square), how many grains of wheat would be on the chessboard at the finish?
The total number of grains equals 18,446,744,073,709,551,615, which is a much higher number than most people intuitively expect.
# Legend of the Chessboard YouTube video
from IPython.display import YouTubeVideo
YouTubeVideo('t3d0Y-JpRRg')
#FILL IN YOUR CODE
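One possible fill-in: Python integers have arbitrary precision, so the doubling sum can be computed directly without overflow:

```python
# one grain on square 1, doubling each time: 2**0 + 2**1 + ... + 2**63
grains = sum(2**k for k in range(64))
print(grains)  # 18446744073709551615, i.e. 2**64 - 1
```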
http://stackoverflow.com/a/509295/7782
Slicing works on any of the sequence types (see the Python docs on sequence types):
There are seven sequence types: strings, Unicode strings, lists, tuples, bytearrays, buffers, and xrange objects.
Square brackets are used for accessing slices of a sequence.
Let's remind ourselves of how to use slices
s[i]     -- item i of s
s[i:j]   -- slice of s from i to j
s[i:j:k] -- slice of s from i to j with step k
m = range(10)
m
m[0]
m[-1]
m[::-1]
m[2:3]
import string
alphabet = string.lowercase
alphabet
# 13th letter of the alphabet
alphabet[12]
We will revisit generalized slicing in NumPy.
http://my.safaribooksonline.com/book/programming/python/9781449323592/1dot-preliminaries/id2699702
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
These imports are done for you in pylab mode.
ipython --help
yields
--pylab=<CaselessStrEnum> (InteractiveShellApp.pylab)
Default: None
Choices: ['tk', 'qt', 'wx', 'gtk', 'osx', 'inline', 'auto']
Pre-load matplotlib and numpy for interactive use, selecting a particular
matplotlib backend and loop integration.
NumPy is the fundamental package for scientific computing with Python. It contains, among other things: a powerful N-dimensional array object; sophisticated (broadcasting) functions; tools for integrating C/C++ and Fortran code; and useful linear algebra, Fourier transform, and random number capabilities.
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
See PfDA, Chapter 4.
# zero-dimensions
a0 = array(5)
a0
use shape to get a tuple of array dimensions
a0.ndim, a0.shape
# 1-d array
a1 = array([1,2])
a1.ndim, a1.shape
# 2-d array
a2 = array(([1,2], [3,4]))
a2.ndim, a2.shape
a2.dtype
arange is one instance of an ndarray-creating function in NumPy. Compare to xrange.
type(arange(10))
for k in arange(10):
    print k
list(arange(10)) == list(xrange(10))
# how to map 0..63 -> 8x8 array
a3 = np.arange(64).reshape(8,8)
a3
a3[1,2]
for i in range(8):
    for j in range(8):
        if a3[i,j] != i*8 + j:
            print i, j
example of broadcasting:
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.
2*a3
a3+2
# reverse sort -- best way?
#http://stackoverflow.com/a/6771620/7782
np.sort(arange(100))[::-1]
This stuff is a bit tricky (see PfDA, pp. 89-92).
Consider the example of picking out the whole numbers less than 20 that are evenly divisible by 3. Generate a list of such numbers.
# list comprehension
[i for i in xrange(20) if i % 3 == 0]
a3 = arange(20)
a3
# basic indexing
print a3[0]
print a3[::-1]
print a3[2:5]
np.mod(a3, 3)
np.mod(a3, 3) == 0
divisible_by_3 = np.mod(a3, 3) == 0
a3[divisible_by_3]
# if you want to understand this in terms of the overloaded operators -- don't worry if you don't get this.
a3.__getitem__(np.mod(a3,3).__eq__(0))
Hint: one way is to use arange, np.sqrt, astype
Answer should be:
array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81])
#FILL IN
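One possible sketch using the hinted functions: filter arange(100) down to the values whose square root survives truncation to int unchanged:

```python
import numpy as np

a = np.arange(100)
# a value is a perfect square exactly when truncating its square root
# to an integer loses nothing
squares = a[np.sqrt(a) == np.sqrt(a).astype(int)]
print(squares)  # the 10 perfect squares below 100
```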
Make a series out of an array
s1 = Series(arange(5))
confirm that the type of s1 is what you would expect
type(s1)
show that the series is also an array
s1.ndim, isinstance(s1, np.ndarray)
s1.index
import string
allTheLetters = string.lowercase
allTheLetters
s2 = Series(data=arange(5), index=list(allTheLetters)[:5])
s2
s2.index
Compared with a regular NumPy array, you can use values in the index when selecting single values or a set of values
# can use both numeric indexing and the labels
s2[0], s2['a']
for i in range(len(s2)):
    print i, s2[i]
conflicts in indexing are possible -- consider
s3 = Series(data=['albert', 'betty', 'cathy'], index=[3,1, 0])
s3
s3[0], list(s3)[0]
but slicing works by numeric position
s3[::-1]
for i in range(len(s3)):
    print i, s3[i:i+1]
s3.name = 'person names'
s3.name
s3.index.name = 'confounding label'
s3.index.name
s3
Important points remaining: you get some nice matplotlib integration via pandas.
# Gauss addition using np.arange, Series
from pandas import Series
Series(arange(101).cumsum()).plot()
from pandas import Series
Series((pow(2,k) for k in xrange(64)), dtype=np.float64).cumsum().plot()
2*ones(64, dtype=np.int)
arange(64)
sum(np.power(2, arange(64, dtype=np.uint64)))
sum(np.power(2*ones(64, dtype=np.uint64), arange(64)))
precise_ans = sum([pow(2,n) for n in xrange(64)])
np_ans = sum(np.power(2*ones(64, dtype=np.uint64), arange(64)))
precise_ans, np_ans
# Raise an assertion if two items are not equal up to desired precision.
np.testing.assert_almost_equal(precise_ans, np_ans) is None
Here we check to make sure that the expected files are in place.
This notebook assumes relative location of files as laid out in https://github.com/rdhyee/working-open-data
import os
# relative to parent dir
DATA_FILES = {"datadict": "data/census/DataDict.txt",
              "dataset": "data/census/DataSet.txt",
              "fips": "data/census/FIPS_CountyName.txt"}

def file_path(key):
    return os.path.join(os.pardir, DATA_FILES[key])

for file_key in DATA_FILES.keys():
    abs_fname = file_path(file_key)
    print abs_fname, os.path.exists(abs_fname)
You can download the git repo as a zip file if you are having problems with GitHub.
import pandas as pd
from pandas import Series, DataFrame
# use head on Mac/Linux to look at the file
fips_path = file_path("fips")
!head $fips_path
df1 = DataFrame([{'fips': '00000', 'geog_entity': 'UNITED STATES'},
                 {'fips': '01000', 'geog_entity': 'ALABAMA'},
                 ])
df1
# check type of df1
type(df1)
# column headings
df1.columns
# index -- ascribed automatically since we didn't explicitly specify labels
df1.index
We can use set_index to make fips the index.
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html#pandas.DataFrame.set_index
# p. 150
df1.set_index('fips', inplace=True, drop=True)
# now check the index
df1.index
# to access row with index of 01000
df1.ix['01000']
# to read off the geog_entity column
df1['geog_entity']
# look at how we can do row slices
df1.ix[0:1]
# get rows in reverse order
df1.ix[::-1]
Create the index explicitly in instantiation of the simple DataFrame
df2 = DataFrame(data=[{'geog_entity': 'UNITED STATES'},
                      {'geog_entity': 'ALABAMA'},
                      ], index=['00000', '01000'])
df2
# note index name is empty
df2.index.name is None
# set index name
df2.index.name = 'fips'
df2
import codecs
from itertools import islice
from pandas import Series, DataFrame
f = codecs.open(file_path("fips"), encoding='iso-8859-1')
fips_list = list()
for row in islice(f, None):
    # FILL IN
    pass
fips_df = DataFrame(fips_list)
fips_df
# display the first 5 rows
fips_df[:5]
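One possible fill-in, assuming each line of FIPS_CountyName.txt is a 5-character code, one space, then the entity name (check against the head output above); here shown on inline sample lines instead of the file:

```python
from pandas import DataFrame

# assumed line layout: 5-character fips code, one space, entity name
sample_lines = ["00000 UNITED STATES\n",
                "01000 ALABAMA\n",
                "01001 Autauga County, AL\n"]

fips_list = []
for row in sample_lines:
    fips_list.append({"fips": row[:5], "geog_entity": row[6:].strip()})

fips_df = DataFrame(fips_list)
print(fips_df)
```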
import codecs
from itertools import islice
import pandas as pd
from pandas import Series, DataFrame
f = codecs.open(file_path("fips"), encoding='iso-8859-1')
fips_df2 = pd.read_table(f, header=None,names=['row'])
# show first five rows
fips_df2[:5]
Hint: consider using fips_df2.row.str[:] and see what happens if you use fips_df2["junk"] = 'hello'. BTW, you can delete columns -- e.g., del fips_df2["junk"]
# FILL IN
# show first five rows
fips_df2[:5]
#FILL IN
# this is a check that you have the right columns
assert set(fips_df2.columns) == set(['fips', 'geog_entity'])
assert len(fips_df2) == 3195
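One possible fill-in using the vectorized .str accessor, demonstrated on a small stand-in frame (again assuming the "code space name" line layout):

```python
from pandas import DataFrame

# stand-in for fips_df2: one raw string column named 'row'
fips_df2 = DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA']})

# slice the string column into the two real columns, then drop the raw one
fips_df2['fips'] = fips_df2.row.str[:5]
fips_df2['geog_entity'] = fips_df2.row.str[6:]
del fips_df2['row']
print(fips_df2)
```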
fips_df2.set_index('fips', inplace=True, drop=True)
# print out first five rows
fips_df2[:5]
fips_df2.ix['00000']
# first 5 rows
fips_df2[Series(fips_df2.index).str[-3:] == '000'][:5]
len(fips_df2[Series(fips_df2.index).str[-3:] == '000'])
# check type
isinstance(fips_df2.index, pd.Index)
# can create a Boolean slice to pull out a country
is_country = fips_df.fips.str[:] == '00000'
fips_df[is_country]
Write is_state to pull out states and a DataFrame states_df to hold the states + DC; also write is_county to pull out counties.

Hints: use & to do a boolean and on Boolean indices (see PfDA, p. 91).

# how about states?
# make use of
# http://proquest.safaribooksonline.com/book/programming/python/9781449323592/7dot-data-wrangling-clean-transform-merge-reshape/id2801165
#FILL IN
is_state = None
is_county = None
states_df = fips_df[is_state]
print len(states_df)
states_df[:5]
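One possible approach, demonstrated on a toy stand-in for fips_df: state codes end in '000' (excluding the national code '00000'), while county codes don't end in '000':

```python
from pandas import DataFrame

# toy stand-in: the US total, one state, one county
fips_df = DataFrame({'fips': ['00000', '01000', '01001'],
                     'geog_entity': ['UNITED STATES', 'ALABAMA',
                                     'Autauga County, AL']})

# states end in '000' but are not the national total '00000'
is_state = (fips_df.fips.str[-3:] == '000') & (fips_df.fips != '00000')
# counties are everything whose code does not end in '000'
is_county = fips_df.fips.str[-3:] != '000'
states_df = fips_df[is_state]
print(states_df)
```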
assert set(is_state.value_counts().iteritems()) == set([(True, 51), (False, 3144)])
assert set(states_df["geog_entity"]) == set([u'VERMONT',
u'GEORGIA', u'IOWA', u'KANSAS', u'FLORIDA', u'VIRGINIA', u'NORTH CAROLINA',
u'HAWAII', u'NEW YORK', u'CALIFORNIA', u'ALABAMA', u'IDAHO', u'DELAWARE',
u'ALASKA', u'ILLINOIS', u'SOUTH DAKOTA', u'CONNECTICUT', u'MONTANA',
u'MASSACHUSETTS', u'NEW HAMPSHIRE', u'MARYLAND', u'NEW MEXICO',
u'MISSISSIPPI', u'TENNESSEE', u'COLORADO', u'NEW JERSEY', u'UTAH',
u'MICHIGAN', u'WEST VIRGINIA', u'WASHINGTON', u'MINNESOTA', u'OREGON',
u'WYOMING', u'OHIO', u'SOUTH CAROLINA', u'INDIANA', u'NEVADA',
u'LOUISIANA', u'NEBRASKA', u'ARIZONA', u'WISCONSIN', u'NORTH DAKOTA',
u'PENNSYLVANIA', u'OKLAHOMA', u'KENTUCKY', u'RHODE ISLAND', u'DISTRICT OF COLUMBIA',
u'ARKANSAS', u'MISSOURI', u'TEXAS', u'MAINE'])
assert set(is_county.value_counts().iteritems()) == set([(False, 52), (True, 3143)])
import csv
import codecs
import pandas as pd
from pandas import DataFrame, Series
from itertools import islice
f = codecs.open(file_path("dataset"), encoding='iso-8859-1')
reader = csv.DictReader(f)
dataset = dict([(row["fips"], row) for row in islice(reader, None)])
f.close()
# print out entry for USA
print dataset['00000']
# make sure we have the right 2010 census population for the USA
assert dataset['00000']['POP010210'] == '308745538'
# let's try pd.read_csv
# how to give hints about data type to pd.read_csv?
import codecs
# read data in and merge with fips_df
#dtype = [('fips', 'S'), ('POP010210', 'i')]
# problems with string conversion when I try to explicitly parse fips as a np.string_
# dtype = {'fips':np.string_}
# possibly relevant sections of the code:
# v0.10.1: https://github.com/pydata/pandas/tree/v0.10.1
# https://github.com/pydata/pandas/blob/v0.10.1/pandas/io/parsers.py
# https://github.com/pydata/pandas/blob/v0.10.1/pandas/src/parser.pyx#L897 -> _convert_with_dtype
# let's try object for fips and int for POP010210
dtype = {'fips':np.object, 'POP010210':np.int}
f = codecs.open(file_path("dataset"), encoding='iso-8859-1')
dataset_df = pd.read_csv(f, dtype=dtype)
# confirm data types for fips, which should not be converted to int
dataset_df.fips.dtype, dataset_df.POP010210.dtype
# read off US population
dataset_df[is_country].ix[0]['POP010210']
# 3 most populous entities
# http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.sort_index.html
dataset_df.sort_index(by='POP010210')[-1:-4:-1][["fips", "POP010210"]]
# PfDA, p. 28
df = pd.merge(fips_df, dataset_df)
df.sort_index(by='POP010210')[-1:-5:-1][["fips", "geog_entity", "POP010210"]]
# calculate state_pop
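A sketch of one way to build state_pop, shown on a toy stand-in for df: keep the state rows, keep only the three expected columns, and sort by population descending. (Notebook-era pandas spells the sort sort_index(by=...); newer pandas uses sort_values, as here.)

```python
from pandas import DataFrame

# toy stand-in for the merged df: the US total plus two states
df = DataFrame({'fips': ['00000', '48000', '06000'],
                'geog_entity': ['UNITED STATES', 'TEXAS', 'CALIFORNIA'],
                'POP010210': [308745538, 25145561, 37253956]})
is_state = (df.fips.str[-3:] == '000') & (df.fips != '00000')

# state rows only, three columns, most populous first
state_pop = (df[is_state][['fips', 'geog_entity', 'POP010210']]
             .sort_values(by='POP010210', ascending=False))
print(state_pop)
```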
# check state_pop behavior
assert state_pop.shape == (51, 3)
assert set(state_pop.columns) == set(['POP010210', 'fips', 'geog_entity'])
assert list(state_pop[:5]["geog_entity"]) == [u'CALIFORNIA', u'TEXAS', u'NEW YORK', u'FLORIDA', u'ILLINOIS']
assert state_pop[:5]["POP010210"].sum() == 113409561
# add up all the states to match US population
assert state_pop["POP010210"].sum() == df[is_country]["POP010210"].sum()
sum(df[is_country]['POP010210']), sum(df[is_state]['POP010210']), sum(df[is_county]['POP010210'])
e.g., df[counties_for_state('06')] should be a DataFrame with CA counties.
def counties_for_state(state):
    # FILL IN
    pass
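One possible fill-in, shown on a toy stand-in for df: counties in a state share its two-digit prefix but don't end in '000'. (This assumes the state argument is the two-digit prefix, e.g. '06' for California.)

```python
from pandas import DataFrame

# toy stand-in for df: a state row, two CA counties, one TX county
df = DataFrame({'fips': ['06000', '06001', '06003', '48001'],
                'geog_entity': ['CALIFORNIA', 'Alameda County, CA',
                                'Alpine County, CA', 'Anderson County, TX']})

def counties_for_state(state):
    # boolean mask: same two-digit state prefix, but not the state total
    return (df.fips.str[:2] == state) & (df.fips.str[-3:] != '000')

print(df[counties_for_state('06')]['geog_entity'].tolist())
```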
assert set(df[counties_for_state('06')]["geog_entity"]) == set([u'Nevada County, CA',
u'Alameda County, CA', u'Kings County, CA', u'Ventura County, CA',
u'El Dorado County, CA', u'San Joaquin County, CA', u'Alpine County, CA',
u'San Luis Obispo County, CA', u'Modoc County, CA', u'Colusa County, CA',
u'Stanislaus County, CA', u'Sonoma County, CA', u'Tulare County, CA',
u'Shasta County, CA', u'Yolo County, CA', u'Placer County, CA', u'Glenn County, CA',
u'Sacramento County, CA', u'San Francisco County, CA',
u'Madera County, CA', u'Imperial County, CA', u'Plumas County, CA',
u'San Mateo County, CA', u'Riverside County, CA', u'Calaveras County, CA',
u'Napa County, CA', u'Mendocino County, CA', u'Mariposa County, CA',
u'Santa Barbara County, CA', u'Inyo County, CA', u'Butte County, CA',
u'Trinity County, CA', u'Los Angeles County, CA', u'Lassen County, CA',
u'Yuba County, CA', u'Amador County, CA', u'Marin County, CA', u'Humboldt County, CA',
u'Merced County, CA', u'Lake County, CA', u'San Diego County, CA',
u'Monterey County, CA', u'Sutter County, CA', u'Solano County, CA',
u'Tuolumne County, CA', u'San Bernardino County, CA', u'Fresno County, CA',
u'Santa Cruz County, CA', u'San Benito County, CA', u'Contra Costa County, CA',
u'Orange County, CA', u'Del Norte County, CA', u'Mono County, CA',
u'Siskiyou County, CA', u'Kern County, CA', u'Sierra County, CA', u'Tehama County, CA',
u'Santa Clara County, CA'])
for (k, s) in df[is_state][['fips', 'geog_entity', 'POP010210']].iterrows():
    print s["fips"], s["geog_entity"], s["POP010210"] == df[counties_for_state(s["fips"])]['POP010210'].sum()
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html#pandas.DataFrame.set_index
# p. 150
df.set_index('fips', inplace=True, drop=False)
for k in df[is_state]['fips']:
    print k, df.ix[k]["geog_entity"], df.ix[k]["POP010210"] == sum(df[counties_for_state(k)]['POP010210'])
n0 = 5
n0 == 5
Now I thought I'd be able to use n0.__eq__(5), but nope -- it's complicated -- see http://stackoverflow.com/questions/2281222/why-when-in-python-does-x-y-call-y-eq-x#comment2254663_2282795
try:
    n0.__eq__(5)
except Exception as e:
    print e
But you can do int.__cmp__(x):
(n0.__cmp__(4), n0.__cmp__(5), n0.__cmp__(6))
how about ndarray?
arange(5) == 2
#
# http://docs.scipy.org/doc/numpy/reference/generated/numpy.array_equal.html
np.array_equal(arange(5) == 2 , arange(5).__eq__(2))
Useful if you want to understand how the slicing syntax really works.
isinstance([1,2], list)
isinstance(arange(5), list) # what does that mean -- could still be list-like
l1 = range(5)
type(l1)
l1[0], l1.__getitem__(0), l1[0] == l1.__getitem__(0)
l1[::-1], l1.__getitem__(slice(None, None, -1))
ar1 = arange(5)
ar1[3], ar1.__getitem__(3)
ar1 == 2
ar1[ar1 == 2].shape
ar1.__eq__(2)
ar1.__getitem__(slice(2, 4, None))
slice(ar1.__eq__(2), None, None)
ar1.__getitem__(ar1.__eq__(2))
ar1[:2], ar1.__getitem__(slice(2))
ar1 + 7
ar1.__add__(7)
min(ar1 + 7)
alphabet[:]