http://mathandmultimedia.com/2010/09/15/sum-first-n-positive-integers/
Gauss displayed his genius at an early age. According to anecdote, when he was in primary school he was punished by his teacher for misbehavior and told to add the numbers from 1 to 100. He computed the sum, 5050, in a matter of seconds.
Now, how on earth did he do it?
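One story has it that Gauss paired the numbers off: 1 with 100, 2 with 99, and so on, giving 50 pairs that each sum to 101, for a total of 50 × 101 = 5050. Here is a minimal sketch of that pairing trick (a reconstruction, not necessarily Gauss's actual method):
# the pairing trick: 50 pairs, each summing to 101
pairs = zip(range(1, 51), range(100, 50, -1))
print len(pairs), pairs[0], pairs[-1]
print sum(a + b for (a, b) in pairs)
50 (1, 100) (50, 51)
5050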
Let's verify this result in a number of ways. Take some time now to write some code to add up 1 to 100. Specifically, try using sum.
See also: The Python yield keyword explained
Beware: in ipython with pylab mode, sum might be overwritten by numpy's sum -- use __builtin__.sum
if you want http://docs.python.org/2/library/functions.html#sum as opposed to http://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html
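A quick way to check which sum is currently bound (a minimal sketch; in a plain Python session the first print is True and the second False, while in pylab mode the situation can be reversed):
import numpy as np
import __builtin__
print sum is __builtin__.sum
print sum is np.sum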
# using a loop
n = 100
s = 0L
for i in xrange(n+1):
    s += i
print s
5050
# range
print range(101)
sum(range(101))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
5050
# xrange
sum(xrange(101))
5050
from itertools import islice, count
c = count(0, 1)
# look at how count() works by repeatedly calling c.next()
print c.next()
print c.next()
0
1
# count
import __builtin__
__builtin__.sum(islice(count(0,1), 101L))
5050
import string

def alpha1():
    m = list(string.lowercase)
    while m:
        yield m.pop(0)
import string
k = (s for s in list(string.lowercase))
list(k)
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
def my_count(start, step):
    n = start
    while True:
        yield n
        n += step
__builtin__.sum(islice(my_count(0,1), 101L))
5050
$T_n= \sum_{k=1}^n k = 1+2+3+ \dotsb +n = \frac{n(n+1)}{2} = {n+1 \choose 2}$
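Plugging n = 100 into the closed form reproduces Gauss's answer directly:
# closed-form triangular number for n = 100
n = 100
print n * (n + 1) / 2
5050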
from itertools import islice
def triangular():
    n = 1
    i = 1
    while True:
        yield n
        i += 1
        n += i

for i, n in enumerate(islice(triangular(), 10)):
    print i+1, n
1 1
2 3
3 6
4 10
5 15
6 21
7 28
8 36
9 45
10 55
list(islice(triangular(), 100))[-1]
5050
list(islice(triangular(),99,100))[0]
5050
From http://en.wikipedia.org/wiki/Wheat_and_chessboard_problem:
If a chessboard were to have wheat placed upon each square such that one grain were placed on the first square, two on the second, four on the third, and so on (doubling the number of grains on each subsequent square), how many grains of wheat would be on the chessboard at the finish?
The total number of grains equals 18,446,744,073,709,551,615, which is a much higher number than most people intuitively expect.
# Legend of the Chessboard YouTube video
from IPython.display import YouTubeVideo
YouTubeVideo('t3d0Y-JpRRg')
# generator expression
k = (pow(2,n) for n in xrange(64))
k.next()
1
__builtin__.sum((pow(2,n) for n in xrange(64)))
18446744073709551615L
pow(2,64) -1
18446744073709551615L
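The agreement between the last two results is the finite geometric series identity

$\sum_{k=0}^{n-1} 2^k = 2^n - 1$

with $n = 64$.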
http://stackoverflow.com/a/509295/7782
Slicing works on any of the sequence types (see the Python docs on sequence types):
There are seven sequence types: strings, Unicode strings, lists, tuples, bytearrays, buffers, and xrange objects.
Square brackets are used to access slices of a sequence. Let's remind ourselves of how to use slices:
s[i]
s[i:j]
s[i:j:k]
m = range(10)
m
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
m[0]
0
m[-1]
9
m[::-1]
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
m[2:3]
[2]
import string
alphabet = string.lowercase
alphabet
'abcdefghijklmnopqrstuvwxyz'
# 13th letter of the alphabet
alphabet[12]
'm'
We will revisit generalized slicing in NumPy.
http://my.safaribooksonline.com/book/programming/python/9781449323592/1dot-preliminaries/id2699702
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
These imports are done for you in pylab mode.
ipython --help
yields
--pylab=<CaselessStrEnum> (InteractiveShellApp.pylab)
Default: None
Choices: ['tk', 'qt', 'wx', 'gtk', 'osx', 'inline', 'auto']
Pre-load matplotlib and numpy for interactive use, selecting a particular
matplotlib backend and loop integration.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
NumPy is the fundamental package for scientific computing with Python. It contains among other things:
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
See PfDA, Chapter 4.
# zero-dimensions
a0 = array(5)
a0
array(5)
Use ndim and shape to get the number of dimensions and a tuple of array dimensions:
a0.ndim, a0.shape
(0, ())
# 1-d array
a1 = array([1,2])
a1.ndim, a1.shape
(1, (2,))
# 2-d array
a2 = array(([1,2], [3,4]))
a2.ndim, a2.shape
(2, (2, 2))
a2.dtype
dtype('int32')
arange is one instance of an ndarray-creating function in NumPy. Compare to xrange.
type(arange(10))
numpy.ndarray
for k in arange(10):
    print k
0
1
2
3
4
5
6
7
8
9
list(arange(10)) == list(xrange(10))
True
# how to map 0..63 -> a 2-d (8x8) array
a3 = np.arange(64).reshape(8,8)
a3
array([[ 0, 1, 2, 3, 4, 5, 6, 7], [ 8, 9, 10, 11, 12, 13, 14, 15], [16, 17, 18, 19, 20, 21, 22, 23], [24, 25, 26, 27, 28, 29, 30, 31], [32, 33, 34, 35, 36, 37, 38, 39], [40, 41, 42, 43, 44, 45, 46, 47], [48, 49, 50, 51, 52, 53, 54, 55], [56, 57, 58, 59, 60, 61, 62, 63]])
a3[1,2]
10
for i in range(8):
    for j in range(8):
        if a3[i,j] != i*8 + j:
            print i, j
An example of broadcasting (this description is from the NumPy documentation):
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.
2*a3
array([[ 0, 2, 4, 6, 8, 10, 12, 14], [ 16, 18, 20, 22, 24, 26, 28, 30], [ 32, 34, 36, 38, 40, 42, 44, 46], [ 48, 50, 52, 54, 56, 58, 60, 62], [ 64, 66, 68, 70, 72, 74, 76, 78], [ 80, 82, 84, 86, 88, 90, 92, 94], [ 96, 98, 100, 102, 104, 106, 108, 110], [112, 114, 116, 118, 120, 122, 124, 126]])
a3+2
array([[ 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 16, 17], [18, 19, 20, 21, 22, 23, 24, 25], [26, 27, 28, 29, 30, 31, 32, 33], [34, 35, 36, 37, 38, 39, 40, 41], [42, 43, 44, 45, 46, 47, 48, 49], [50, 51, 52, 53, 54, 55, 56, 57], [58, 59, 60, 61, 62, 63, 64, 65]])
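Broadcasting is not limited to scalars: a smaller array can be stretched across a larger one. A minimal sketch, assuming a3 is still the 8x8 array from above -- adding arange(8) adds 0..7 elementwise to every row, so the first row becomes [ 0 2 4 6 8 10 12 14]:
# broadcast a length-8 array across each row of the 8x8 array
a3 + np.arange(8)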
# reverse sort -- best way?
#http://stackoverflow.com/a/6771620/7782
np.sort(arange(100))[::-1]
array([99, 98, 97, 96, 95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84, 83, 82, 81, 80, 79, 78, 77, 76, 75, 74, 73, 72, 71, 70, 69, 68, 67, 66, 65, 64, 63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
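Since arange(100) is already sorted ascending, the sort is redundant here; a simpler sketch:
# reversing an already-sorted array with a slice
arange(100)[::-1]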
This stuff is a bit tricky (see PfDA, pp. 89-92)
Consider the example of picking out whole numbers less than 20 that are evenly divisible by 3. Generate a list of such numbers.
# list comprehension
[i for i in xrange(20) if i % 3 == 0]
[0, 3, 6, 9, 12, 15, 18]
a3 = arange(20)
a3
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
# basic indexing
print a3[0]
print a3[::-1]
print a3[2:5]
0
[19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0]
[2 3 4]
np.mod(a3, 3)
array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1])
np.mod(a3, 3) == 0
array([ True, False, False, True, False, False, True, False, False, True, False, False, True, False, False, True, False, False, True, False], dtype=bool)
divisible_by_3 = np.mod(a3, 3) == 0
a3[divisible_by_3]
array([ 0, 3, 6, 9, 12, 15, 18])
# if you want to understand this in terms of the overloaded operators -- don't worry if you don't get this.
a3.__getitem__(np.mod(a3,3).__eq__(0))
array([ 0, 3, 6, 9, 12, 15, 18])
Use arange, np.sqrt, and astype to pick out the perfect squares less than 100:
a4 = arange(100)
a4sqrt = np.sqrt(a4)
a4[a4sqrt == a4sqrt.astype(np.int)]
array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81])
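The exact equality test works here because np.sqrt of a small perfect square is exact in float64, but comparing floats with == is fragile in general. A more defensive sketch (an addition, with an arbitrary tolerance):
# guard against floating-point error with a tolerance instead of exact equality
a4[np.abs(a4sqrt - np.round(a4sqrt)) < 1e-9]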
Make a series out of an array
s1 = Series(arange(5))
confirm that the type of s1 is what you would expect
type(s1)
pandas.core.series.Series
show that the series is also an array
s1.ndim, isinstance(s1, np.ndarray)
(1, True)
s1.index
Int64Index([0, 1, 2, 3, 4], dtype=int64)
import string
allTheLetters = string.lowercase
allTheLetters
'abcdefghijklmnopqrstuvwxyz'
s2 = Series(data=arange(5), index=list(allTheLetters)[:5])
s2
a    0
b    1
c    2
d    3
e    4
s2.index
Index([a, b, c, d, e], dtype=object)
Compared with a regular NumPy array, you can use values in the index when selecting single values or a set of values
# can use both numeric indexing and the labels
s2[0], s2['a']
(0, 0)
for i in range(len(s2)):
    print i, s2[i]
0 0
1 1
2 2
3 3
4 4
It is possible to have conflicts in indexing -- consider:
s3 = Series(data=['albert', 'betty', 'cathy'], index=[3,1, 0])
s3
3    albert
1     betty
0     cathy
s3[0], list(s3)[0]
('cathy', 'albert')
but slicing works by position, not by label:
s3[::-1]
0     cathy
1     betty
3    albert
for i in range(len(s3)):
    print i, s3[i:i+1]
0 3    albert
1 1    betty
2 0    cathy
s3.name = 'person names'
s3.name
'person names'
s3.index.name = 'confounding label'
s3.index.name
'confounding label'
s3
confounding label
3    albert
1     betty
0     cathy
Name: person names
Important points remaining: you get some nice matplotlib integration via pandas.
# Gauss addition using np.arange, Series
from pandas import Series
Series(arange(101).cumsum()).plot()
<matplotlib.axes.AxesSubplot at 0x4dce930>
from pandas import Series
Series((pow(2,k) for k in xrange(64)), dtype=np.float64).cumsum().plot()
<matplotlib.axes.AxesSubplot at 0x4e1ab30>
2*ones(64, dtype=np.int)
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
arange(64)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63])
sum(np.power(2, arange(64, dtype=np.uint64)))
18446744073709551615
sum(np.power(2*ones(64, dtype=np.uint64), arange(64)))
1.8446744073709552e+19
precise_ans = sum([pow(2,n) for n in xrange(64)])
np_ans = sum(np.power(2*ones(64, dtype=np.uint64), arange(64)))
precise_ans, np_ans
(18446744073709551615L, 1.8446744073709552e+19)
# Raise an assertion if two items are not equal up to desired precision.
np.testing.assert_almost_equal(precise_ans, np_ans) is None
True
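Why "almost" rather than exactly equal? float64 carries only a 53-bit significand, so the exact 64-bit value 2**64 - 1 is not representable and rounds up:
# 2**64 - 1 rounds to 2.0**64 in float64
float(2**64 - 1) == 2.0**64
True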
Here we check to make sure that the expected files are in place.
This notebook assumes relative location of files as laid out in https://github.com/rdhyee/working-open-data
import os
# relative to parent dir
DATA_FILES = {"datadict": "data/census/DataDict.txt",
              "dataset": "data/census/DataSet.txt",
              "fips": "data/census/FIPS_CountyName.txt"}

def file_path(key):
    return os.path.join(os.pardir, DATA_FILES[key])

for file_key in DATA_FILES.keys():
    abs_fname = file_path(file_key)
    print abs_fname, os.path.exists(abs_fname)
../data/census/DataDict.txt True
../data/census/FIPS_CountyName.txt True
../data/census/DataSet.txt True
You can download the git repo as a zip file if you are having problems with GitHub.
import pandas as pd
from pandas import Series, DataFrame
# use head on Mac/Linux to look at the file
fips_path = file_path("fips")
!head $fips_path
00000 UNITED STATES
01000 ALABAMA
01001 Autauga County, AL
01003 Baldwin County, AL
01005 Barbour County, AL
01007 Bibb County, AL
01009 Blount County, AL
01011 Bullock County, AL
01013 Butler County, AL
01015 Calhoun County, AL
df1 = DataFrame([{'fips':'00000', 'geog_entity':'UNITED STATES'},
                 {'fips':'01000', 'geog_entity':'ALABAMA'}])
df1
fips | geog_entity | |
---|---|---|
0 | 00000 | UNITED STATES |
1 | 01000 | ALABAMA |
# check type of df1
type(df1)
pandas.core.frame.DataFrame
# column headings
df1.columns
Index([fips, geog_entity], dtype=object)
# index -- ascribed automatically since we didn't explicitly specify labels
df1.index
Int64Index([0, 1], dtype=int64)
We can use set_index to make fips the index.
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html#pandas.DataFrame.set_index
# p. 150
df1.set_index('fips', inplace=True, drop=True)
/Users/raymondyee/.virtualenvs/epd1/lib/python2.7/site-packages/pandas/core/frame.py:2762: FutureWarning: set_index with inplace=True will return None from pandas 0.11 onward
  " from pandas 0.11 onward", FutureWarning)
geog_entity | |
---|---|
fips | |
00000 | UNITED STATES |
01000 | ALABAMA |
# now check the index
df1.index
Index([00000, 01000], dtype=object)
# to access row with index of 01000
df1.ix['01000']
geog_entity    ALABAMA
Name: 01000
# to read off the geog_entity column
df1['geog_entity']
fips
00000    UNITED STATES
01000          ALABAMA
Name: geog_entity
# look at how we can do row slices
df1.ix[0:1]
geog_entity | |
---|---|
fips | |
00000 | UNITED STATES |
# get rows in reverse order
df1.ix[::-1]
geog_entity | |
---|---|
fips | |
01000 | ALABAMA |
00000 | UNITED STATES |
Create the index explicitly when instantiating the simple DataFrame:
df2 = DataFrame(data=[{'geog_entity':'UNITED STATES'},
                      {'geog_entity':'ALABAMA'}],
                index=['00000', '01000'])
df2
geog_entity | |
---|---|
00000 | UNITED STATES |
01000 | ALABAMA |
# note index name is empty
df2.index.name is None
True
# set index name
df2.index.name = 'fips'
df2
geog_entity | |
---|---|
fips | |
00000 | UNITED STATES |
01000 | ALABAMA |
import codecs
from itertools import islice
from pandas import Series, DataFrame
f = codecs.open(file_path("fips"), encoding='iso-8859-1')
fips_list = list()
for row in islice(f, None):
    fips_list.append({'fips': row[:5], 'geog_entity': row[6:-1]})
fips_df = DataFrame(fips_list)
fips_df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3195 entries, 0 to 3194
Data columns:
fips           3195 non-null values
geog_entity    3195 non-null values
dtypes: object(2)
# display the first 5 rows
fips_df[:5]
fips | geog_entity | |
---|---|---|
0 | 00000 | UNITED STATES |
1 | 01000 | ALABAMA |
2 | 01001 | Autauga County, AL |
3 | 01003 | Baldwin County, AL |
4 | 01005 | Barbour County, AL |
import codecs
from itertools import islice
import pandas as pd
from pandas import Series, DataFrame
f = codecs.open(file_path("fips"), encoding='iso-8859-1')
fips_df2 = pd.read_table(f, header=None, names=['row'])
# show first five rows
fips_df2[:5]
row | |
---|---|
0 | 00000 UNITED STATES |
1 | 01000 ALABAMA |
2 | 01001 Autauga County, AL |
3 | 01003 Baldwin County, AL |
4 | 01005 Barbour County, AL |
Hint: consider using fips_df2.row.str[:] and see what happens if you use fips_df2["junk"] = 'hello'.
BTW, you can delete columns -- e.g., del fips_df2["junk"]
fips_df2["fips"] = fips_df2.row.str[:5]
# show first five rows
fips_df2[:5]
row | fips | |
---|---|---|
0 | 00000 UNITED STATES | 00000 |
1 | 01000 ALABAMA | 01000 |
2 | 01001 Autauga County, AL | 01001 |
3 | 01003 Baldwin County, AL | 01003 |
4 | 01005 Barbour County, AL | 01005 |
fips_df2["geog_entity"] = fips_df2.row.str[6:]
fips_df2[:5]
row | fips | geog_entity | |
---|---|---|---|
0 | 00000 UNITED STATES | 00000 | UNITED STATES |
1 | 01000 ALABAMA | 01000 | ALABAMA |
2 | 01001 Autauga County, AL | 01001 | Autauga County, AL |
3 | 01003 Baldwin County, AL | 01003 | Baldwin County, AL |
4 | 01005 Barbour County, AL | 01005 | Barbour County, AL |
del fips_df2["row"]
fips_df2[:5]
fips | geog_entity | |
---|---|---|
0 | 00000 | UNITED STATES |
1 | 01000 | ALABAMA |
2 | 01001 | Autauga County, AL |
3 | 01003 | Baldwin County, AL |
4 | 01005 | Barbour County, AL |
# this is a check that you have the right columns
assert set(fips_df2.columns) == set(['fips', 'geog_entity'])
assert len(fips_df2) == 3195
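As an aside, since the fips file is fixed-width, pandas can parse it directly; a hedged sketch, assuming this pandas version supports read_fwf with the colspecs and converters keyword arguments:
# hypothetical alternative: parse the fixed-width fips file in one call
f = codecs.open(file_path("fips"), encoding='iso-8859-1')
fwf_df = pd.read_fwf(f, colspecs=[(0, 5), (6, 100)], header=None,
                     names=['fips', 'geog_entity'], converters={'fips': str})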
fips_df2.set_index('fips', inplace=True, drop=True)
<class 'pandas.core.frame.DataFrame'>
Index: 3195 entries, 00000 to 56045
Data columns:
geog_entity    3195 non-null values
dtypes: object(1)
# print out first five rows
fips_df2[:5]
geog_entity | |
---|---|
fips | |
00000 | UNITED STATES |
01000 | ALABAMA |
01001 | Autauga County, AL |
01003 | Baldwin County, AL |
01005 | Barbour County, AL |
fips_df2.ix['00000']
geog_entity    UNITED STATES
Name: 00000
# first 5 rows
fips_df2[Series(fips_df2.index).str[-3:] == '000'][:5]
geog_entity | |
---|---|
fips | |
00000 | UNITED STATES |
01000 | ALABAMA |
02000 | ALASKA |
04000 | ARIZONA |
05000 | ARKANSAS |
# 52 = the USA itself + 50 states + DC
len(fips_df2[Series(fips_df2.index).str[-3:] == '000'])
52
# check type
isinstance(fips_df2.index, pd.Index)
True
# can create a Boolean slice to pull out a country
is_country = fips_df.fips.str[:] == '00000'
fips_df[is_country]
fips | geog_entity | |
---|---|---|
0 | 00000 | UNITED STATES |
Write is_state to pull out states and a DataFrame states_df to hold the states + DC; also write is_county to pull out counties.
Hints: use & to do a boolean and on Boolean indices (see PfDA, p. 91).
# how about states?
# make use of
# http://proquest.safaribooksonline.com/book/programming/python/9781449323592/7dot-data-wrangling-clean-transform-merge-reshape/id2801165
is_state = (fips_df.fips.str[-3:] == '000') & (fips_df.fips.str[:] != '00000')
is_county = fips_df.fips.str[-3:] != '000'
states_df = fips_df[is_state]
print len(states_df)
states_df[:5]
51
fips | geog_entity | |
---|---|---|
1 | 01000 | ALABAMA |
69 | 02000 | ALASKA |
99 | 04000 | ARIZONA |
115 | 05000 | ARKANSAS |
191 | 06000 | CALIFORNIA |
assert set(is_state.value_counts().iteritems()) == set([(True, 51), (False, 3144)])
assert set(states_df["geog_entity"]) == set([u'VERMONT',
u'GEORGIA', u'IOWA', u'KANSAS', u'FLORIDA', u'VIRGINIA', u'NORTH CAROLINA',
u'HAWAII', u'NEW YORK', u'CALIFORNIA', u'ALABAMA', u'IDAHO', u'DELAWARE',
u'ALASKA', u'ILLINOIS', u'SOUTH DAKOTA', u'CONNECTICUT', u'MONTANA',
u'MASSACHUSETTS', u'NEW HAMPSHIRE', u'MARYLAND', u'NEW MEXICO',
u'MISSISSIPPI', u'TENNESSEE', u'COLORADO', u'NEW JERSEY', u'UTAH',
u'MICHIGAN', u'WEST VIRGINIA', u'WASHINGTON', u'MINNESOTA', u'OREGON',
u'WYOMING', u'OHIO', u'SOUTH CAROLINA', u'INDIANA', u'NEVADA',
u'LOUISIANA', u'NEBRASKA', u'ARIZONA', u'WISCONSIN', u'NORTH DAKOTA',
u'PENNSYLVANIA', u'OKLAHOMA', u'KENTUCKY', u'RHODE ISLAND', u'DISTRICT OF COLUMBIA',
u'ARKANSAS', u'MISSOURI', u'TEXAS', u'MAINE'])
assert set(is_county.value_counts().iteritems()) == set([(False, 52), (True, 3143)])
import csv
import codecs
import pandas as pd
from pandas import DataFrame, Series
from itertools import islice
f = codecs.open(file_path("dataset"), encoding='iso-8859-1')
reader = csv.DictReader(f)
dataset = dict([(row["fips"], row) for row in islice(reader, None)])
f.close()
# print out entry for USA
print dataset['00000']
{'RHI125211': '78.1', 'SBO415207': '8.3', 'VET605211': '22215303', 'RHI225211': '13.1', 'PVY020211': '14.3', 'HSD310211': '2.60', 'POP645211': '12.8', 'AGE135211': '6.5', 'EDU685211': '28.2', 'RHI625211': '2.3', 'SBO215207': '5.7', 'PST045212': '313914040', 'SBO015207': '28.8', 'POP715211': '84.6', 'PST120211': '0.9', 'PST120212': '1.7', 'POP010210': '308745538', 'PST045211': '311587816', 'SBO315207': '7.1', 'POP060210': '87.4', 'RHI425211': '5.0', 'POP815211': '20.3', 'HSD410211': '114761359', 'HSG495211': '186200', 'BZA010210': '7396628', 'LFE305211': '25.4', 'BZA110210': '111970095', 'AGE775211': '13.3', 'HSG096211': '25.9', 'RHI525211': '0.2', 'LND110210': '3531905.43', 'PST040210': '308747508', 'RHI825211': '63.4', 'BZA115210': '-1.8', 'NES010210': '22110628', 'fips': '00000', 'MAN450207': '5338306501', 'EDU635211': '85.4', 'RTN131207': '12990', 'RHI725211': '16.7', 'BPS030211': '624061', 'INC110211': '52762', 'AGE295211': '23.7', 'SBO115207': '0.9', 'INC910211': '27915', 'RHI325211': '1.2', 'WTN220207': '4174286516', 'HSG445211': '66.1', 'SBO515207': '0.1', 'AFN120207': '613795732', 'RTN130207': '3917663456', 'HSG010211': '132312404', 'SEX255211': '50.8', 'SBO001207': '27092908'}
# make sure we have the right 2010 census population for the USA
assert dataset['00000']['POP010210'] == '308745538'
# let's try pd.read_csv
# how to give hints about data type to pd.read_csv?
import codecs
# read data in and merge with fips_df
#dtype = [('fips', 'S'), ('POP010210', 'i')]
# problems with string conversion when I try to explicitly parse fips as a np.string_
# dtype = {'fips':np.string_}
# possibly relevant sections of the code:
# v0.10.1: https://github.com/pydata/pandas/tree/v0.10.1
# https://github.com/pydata/pandas/blob/v0.10.1/pandas/io/parsers.py
# https://github.com/pydata/pandas/blob/v0.10.1/pandas/src/parser.pyx#L897 -> _convert_with_dtype
# let's try object for fips and int for POP010210
dtype = {'fips':np.object, 'POP010210':np.int}
f = codecs.open(file_path("dataset"), encoding='iso-8859-1')
dataset_df = pd.read_csv(f, dtype=dtype)
# confirm data types for fips, which should not be converted to int
dataset_df.fips.dtype, dataset_df.POP010210.dtype
(dtype('object'), dtype('int64'))
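A possible workaround for the string-conversion problem noted in the comments above: use a converter rather than a dtype (a sketch, assuming converters is honored by this version of pd.read_csv):
# hypothetical alternative: keep fips as a string via a converter
f = codecs.open(file_path("dataset"), encoding='iso-8859-1')
dataset_df2 = pd.read_csv(f, converters={'fips': str})
dataset_df2.fips.dtype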
# read off US population
dataset_df[is_country].ix[0]['POP010210']
308745538
# 3 most populous entities
# http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.sort_index.html
dataset_df.sort_index(by='POP010210')[-1:-4:-1][["fips", "POP010210"]]
fips | POP010210 | |
---|---|---|
0 | 00000 | 308745538 |
191 | 06000 | 37253956 |
2567 | 48000 | 25145561 |
# before doing merge, list expected columns
'fips' in fips_df.columns, 'fips' in dataset_df.columns
(True, True)
# PfDA, p. 28
df = pd.merge(fips_df, dataset_df)
df.sort_index(by='POP010210')[-1:-5:-1][["fips", "geog_entity", "POP010210"]]
fips | geog_entity | POP010210 | |
---|---|---|---|
0 | 00000 | UNITED STATES | 308745538 |
191 | 06000 | CALIFORNIA | 37253956 |
2567 | 48000 | TEXAS | 25145561 |
1861 | 36000 | NEW YORK | 19378102 |
state_pop = df[is_state][["fips", "geog_entity", "POP010210"]].sort_index(by='POP010210')[::-1]
state_pop[:5]
fips | geog_entity | POP010210 | |
---|---|---|---|
191 | 06000 | CALIFORNIA | 37253956 |
2567 | 48000 | TEXAS | 25145561 |
1861 | 36000 | NEW YORK | 19378102 |
330 | 12000 | FLORIDA | 18801310 |
609 | 17000 | ILLINOIS | 12830632 |
# check state_pop behavior
assert state_pop.shape == (51, 3)
assert set(state_pop.columns) == set(['POP010210', 'fips', 'geog_entity'])
assert list(state_pop[:5]["geog_entity"]) == [u'CALIFORNIA', u'TEXAS', u'NEW YORK', u'FLORIDA', u'ILLINOIS']
assert state_pop[:5]["POP010210"].sum() == 113409561
# add up all the states to match US population
assert state_pop["POP010210"].sum() == df[is_country]["POP010210"].sum()
# country total, state totals, and county totals should all agree
sum(df[is_country]['POP010210']), sum(df[is_state]['POP010210']), sum(df[is_county]['POP010210'])
(308745538, 308745538, 308745538)
E.g., df[counties_for_state('06')] should be a DataFrame with CA counties:
def counties_for_state(state):
    return (df.fips.str[:2] == state[:2]) & (df.fips.str[-3:] != '000')
assert set(df[counties_for_state('06')]["geog_entity"]) == set([u'Nevada County, CA',
u'Alameda County, CA', u'Kings County, CA', u'Ventura County, CA',
u'El Dorado County, CA', u'San Joaquin County, CA', u'Alpine County, CA',
u'San Luis Obispo County, CA', u'Modoc County, CA', u'Colusa County, CA',
u'Stanislaus County, CA', u'Sonoma County, CA', u'Tulare County, CA',
u'Shasta County, CA', u'Yolo County, CA', u'Placer County, CA', u'Glenn County, CA',
u'Sacramento County, CA', u'San Francisco County, CA',
u'Madera County, CA', u'Imperial County, CA', u'Plumas County, CA',
u'San Mateo County, CA', u'Riverside County, CA', u'Calaveras County, CA',
u'Napa County, CA', u'Mendocino County, CA', u'Mariposa County, CA',
u'Santa Barbara County, CA', u'Inyo County, CA', u'Butte County, CA',
u'Trinity County, CA', u'Los Angeles County, CA', u'Lassen County, CA',
u'Yuba County, CA', u'Amador County, CA', u'Marin County, CA', u'Humboldt County, CA',
u'Merced County, CA', u'Lake County, CA', u'San Diego County, CA',
u'Monterey County, CA', u'Sutter County, CA', u'Solano County, CA',
u'Tuolumne County, CA', u'San Bernardino County, CA', u'Fresno County, CA',
u'Santa Cruz County, CA', u'San Benito County, CA', u'Contra Costa County, CA',
u'Orange County, CA', u'Del Norte County, CA', u'Mono County, CA',
u'Siskiyou County, CA', u'Kern County, CA', u'Sierra County, CA', u'Tehama County, CA',
u'Santa Clara County, CA'])
for (k, s) in df[is_state][['fips', 'geog_entity', 'POP010210']].iterrows():
    print s["fips"], s["geog_entity"], s["POP010210"] == df[counties_for_state(s["fips"])]['POP010210'].sum()
01000 ALABAMA True 02000 ALASKA True 04000 ARIZONA True 05000 ARKANSAS True 06000 CALIFORNIA True 08000 COLORADO True 09000 CONNECTICUT True 10000 DELAWARE True 11000 DISTRICT OF COLUMBIA True 12000 FLORIDA True 13000 GEORGIA True 15000 HAWAII True 16000 IDAHO True 17000 ILLINOIS True 18000 INDIANA True 19000 IOWA True 20000 KANSAS True 21000 KENTUCKY True 22000 LOUISIANA True 23000 MAINE True 24000 MARYLAND True 25000 MASSACHUSETTS True 26000 MICHIGAN True 27000 MINNESOTA True 28000 MISSISSIPPI True 29000 MISSOURI True 30000 MONTANA True 31000 NEBRASKA True 32000 NEVADA True 33000 NEW HAMPSHIRE True 34000 NEW JERSEY True 35000 NEW MEXICO True 36000 NEW YORK True 37000 NORTH CAROLINA True 38000 NORTH DAKOTA True 39000 OHIO True 40000 OKLAHOMA True 41000 OREGON True 42000 PENNSYLVANIA True 44000 RHODE ISLAND True 45000 SOUTH CAROLINA True 46000 SOUTH DAKOTA True 47000 TENNESSEE True 48000 TEXAS True 49000 UTAH True 50000 VERMONT True 51000 VIRGINIA True 53000 WASHINGTON True 54000 WEST VIRGINIA True 55000 WISCONSIN True 56000 WYOMING True
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html#pandas.DataFrame.set_index
# p. 150
df.set_index('fips', inplace=True, drop=False)
<class 'pandas.core.frame.DataFrame'> Index: 3195 entries, 00000 to 56045 Data columns: fips 3195 non-null values geog_entity 3195 non-null values PST045212 3195 non-null values PST045211 3195 non-null values PST040210 3195 non-null values PST120212 3195 non-null values PST120211 3195 non-null values POP010210 3195 non-null values AGE135211 3195 non-null values AGE295211 3195 non-null values AGE775211 3195 non-null values SEX255211 3195 non-null values RHI125211 3195 non-null values RHI225211 3195 non-null values RHI325211 3195 non-null values RHI425211 3195 non-null values RHI525211 3195 non-null values RHI625211 3195 non-null values RHI725211 3195 non-null values RHI825211 3195 non-null values POP715211 3195 non-null values POP645211 3195 non-null values POP815211 3195 non-null values EDU635211 3195 non-null values EDU685211 3195 non-null values VET605211 3195 non-null values LFE305211 3195 non-null values HSG010211 3195 non-null values HSG445211 3195 non-null values HSG096211 3195 non-null values HSG495211 3195 non-null values HSD410211 3195 non-null values HSD310211 3195 non-null values INC910211 3195 non-null values INC110211 3195 non-null values PVY020211 3195 non-null values BZA010210 3195 non-null values BZA110210 3195 non-null values BZA115210 3195 non-null values NES010210 3195 non-null values SBO001207 3195 non-null values SBO315207 3195 non-null values SBO115207 3195 non-null values SBO215207 3195 non-null values SBO515207 3195 non-null values SBO415207 3195 non-null values SBO015207 3195 non-null values MAN450207 3195 non-null values WTN220207 3195 non-null values RTN130207 3195 non-null values RTN131207 3195 non-null values AFN120207 3195 non-null values BPS030211 3195 non-null values LND110210 3195 non-null values POP060210 3195 non-null values dtypes: float64(33), int64(20), object(2)
for k in df[is_state]['fips']:
    print k, df.ix[k]["geog_entity"], df.ix[k]["POP010210"] == sum(df[counties_for_state(k)]['POP010210'])
01000 ALABAMA True 02000 ALASKA True 04000 ARIZONA True 05000 ARKANSAS True 06000 CALIFORNIA True 08000 COLORADO True 09000 CONNECTICUT True 10000 DELAWARE True 11000 DISTRICT OF COLUMBIA True 12000 FLORIDA True 13000 GEORGIA True 15000 HAWAII True 16000 IDAHO True 17000 ILLINOIS True 18000 INDIANA True 19000 IOWA True 20000 KANSAS True 21000 KENTUCKY True 22000 LOUISIANA True 23000 MAINE True 24000 MARYLAND True 25000 MASSACHUSETTS True 26000 MICHIGAN True 27000 MINNESOTA True 28000 MISSISSIPPI True 29000 MISSOURI True 30000 MONTANA True 31000 NEBRASKA True 32000 NEVADA True 33000 NEW HAMPSHIRE True 34000 NEW JERSEY True 35000 NEW MEXICO True 36000 NEW YORK True 37000 NORTH CAROLINA True 38000 NORTH DAKOTA True 39000 OHIO True 40000 OKLAHOMA True 41000 OREGON True 42000 PENNSYLVANIA True 44000 RHODE ISLAND True 45000 SOUTH CAROLINA True 46000 SOUTH DAKOTA True 47000 TENNESSEE True 48000 TEXAS True 49000 UTAH True 50000 VERMONT True 51000 VIRGINIA True 53000 WASHINGTON True 54000 WEST VIRGINIA True 55000 WISCONSIN True 56000 WYOMING True
# does the order of A == B vs B == A matter? it shouldn't
state = '06000'
k0 = df.fips.str[:2] == state[:2]
type(k0)
k1 = state[:2] == df.fips.str[:2]
print type(k0), type(k1)
print k0.value_counts()
print k1.value_counts()
print np.all(k0 == k1)
<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
False    3136
True       59
False    3136
True       59
True
n0 = 5
n0 == 5
True
Now I thought I'd be able to use n0.__eq__(5), but nope -- it's complicated -- see http://stackoverflow.com/questions/2281222/why-when-in-python-does-x-y-call-y-eq-x#comment2254663_2282795
try:
    n0.__eq__(5)
except Exception as e:
    print e
'int' object has no attribute '__eq__'
We can, however, use int.__cmp__:
(n0.__cmp__(4), n0.__cmp__(5), n0.__cmp__(6))
(1, 0, -1)
How about ndarray?
arange(5) == 2
array([False, False, True, False, False], dtype=bool)
# http://docs.scipy.org/doc/numpy/reference/generated/numpy.array_equal.html
np.array_equal(arange(5) == 2 , arange(5).__eq__(2))
True
Useful if you want to understand how the slicing syntax really works.
isinstance([1,2], list)
True
isinstance(arange(5), list) # what does that mean -- could still be list-like
False
l1 = range(5)
type(l1)
list
l1[0], l1.__getitem__(0), l1[0] == l1.__getitem__(0)
(0, 0, True)
l1[::-1], l1.__getitem__(slice(None, None, -1))
([4, 3, 2, 1, 0], [4, 3, 2, 1, 0])
ar1 = arange(5)
ar1[3], ar1.__getitem__(3)
(3, 3)
ar1 == 2
array([False, False, True, False, False], dtype=bool)
ar1[ar1 == 2].shape
(1,)
ar1.__eq__(2)
array([False, False, True, False, False], dtype=bool)
ar1.__getitem__(slice(2, 4, None))
array([2, 3])
slice(ar1.__eq__(2), None, None)
slice(array([False, False, True, False, False], dtype=bool), None, None)
ar1.__getitem__(ar1.__eq__(2))
array([2])
ar1[:2], ar1.__getitem__(slice(2))
(array([0, 1]), array([0, 1]))
ar1 + 7
array([ 7, 8, 9, 10, 11])
ar1.__add__(7)
array([ 7, 8, 9, 10, 11])
min(ar1 + 7)
7
alphabet[:]
'abcdefghijklmnopqrstuvwxyz'