http://mathandmultimedia.com/2010/09/15/sum-first-n-positive-integers/
Gauss displayed his genius at an early age. According to anecdote, when he was in primary school, his teacher punished him for misbehavior by telling him to add the numbers from 1 to 100. He computed the sum, 5050, in a matter of seconds.
Now, how on earth did he do it?
Let's verify this result in a number of ways. Take some time now to write some code to add up 1 to 100. Specifically, use sum.

See also: The Python yield keyword explained

Beware: in ipython with pylab mode, sum might be overwritten by numpy's sum -- use __builtin__.sum if you want http://docs.python.org/2/library/functions.html#sum as opposed to http://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html
#FILL IN WITH YOUR CODE TO CALCULATE ANSWER OF 5050
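One minimal way to fill this in, using the builtin sum over range (range's stop value is exclusive, so range(1, 101) covers 1 through 100):

```python
# add the integers 1 through 100
total = sum(range(1, 101))
print(total)  # 5050
```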
Here's a formula for the n-th triangular number
$T_n= \sum_{k=1}^n k = 1+2+3+ \dotsb +n = \frac{n(n+1)}{2} = {n+1 \choose 2}$
n = 100
n*(n+1)/2
The following is a hint for writing triangular
from itertools import islice
def my_count(start, step):
    n = start
    while True:
        yield n
        n += step
__builtin__.sum(islice(my_count(0,1), 101L))
from itertools import islice
def triangular():
    # FILL IN
    pass
# a check
assert list(islice(triangular(), 5)) == [1, 3, 6, 10, 15]
for i, n in enumerate(islice(triangular(), 10)):
    print i+1, n
list(islice(triangular(), 100))[-1]
list(islice(triangular(),99,100))[0]
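One possible implementation of triangular, written as an infinite generator of running sums in the style of my_count above:

```python
from itertools import islice

def triangular():
    # infinite generator of the running sums 1, 1+2, 1+2+3, ...
    total = 0
    n = 1
    while True:
        total += n
        yield total
        n += 1

print(list(islice(triangular(), 5)))  # [1, 3, 6, 10, 15]
```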
http://en.wikipedia.org/wiki/Wheat_and_chessboard_problem :
If a chessboard were to have wheat placed upon each square such that one grain were placed on the first square, two on the second, four on the third, and so on (doubling the number of grains on each subsequent square), how many grains of wheat would be on the chessboard at the finish?
The total number of grains equals 18,446,744,073,709,551,615, which is a much higher number than most people intuitively expect.
# Legend of the Chessboard YouTube video
from IPython.display import YouTubeVideo
YouTubeVideo('t3d0Y-JpRRg')
#FILL IN YOUR CODE
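One possible fill-in: Python integers have arbitrary precision, so the doubling sum can be computed directly without overflow:

```python
# one grain on square 1, doubling each time: 2**0 + 2**1 + ... + 2**63
grains = sum(2**k for k in range(64))
print(grains)  # 18446744073709551615, i.e. 2**64 - 1
```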
http://stackoverflow.com/a/509295/7782
Slicing works on any of the sequence types (see the Python docs on sequence types):
There are seven sequence types: strings, Unicode strings, lists, tuples, bytearrays, buffers, and xrange objects.
Square brackets are used for accessing slices of a sequence.
Let's remind ourselves of how to use slices
s[i]     -- item i of s
s[i:j]   -- slice of s from i to j
s[i:j:k] -- slice of s from i to j with step k
m = range(10)
m
m[0]
m[-1]
m[::-1]
m[2:3]
import string
alphabet = string.lowercase
alphabet
# 13th letter of the alphabet
alphabet[12]
We will revisit generalized slicing in NumPy.
http://my.safaribooksonline.com/book/programming/python/9781449323592/1dot-preliminaries/id2699702
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
These imports are done for you in pylab mode.
ipython --help
yields
--pylab=<CaselessStrEnum> (InteractiveShellApp.pylab)
Default: None
Choices: ['tk', 'qt', 'wx', 'gtk', 'osx', 'inline', 'auto']
Pre-load matplotlib and numpy for interactive use, selecting a particular
matplotlib backend and loop integration.
NumPy is the fundamental package for scientific computing with Python. It contains, among other things: a powerful N-dimensional array object; sophisticated (broadcasting) functions; tools for integrating C/C++ and Fortran code; and useful linear algebra, Fourier transform, and random number capabilities.
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
See PfDA, Chapter 4.
# zero-dimensions
a0 = array(5)
a0
use shape to get a tuple of array dimensions
a0.ndim, a0.shape
# 1-d array
a1 = array([1,2])
a1.ndim, a1.shape
# 2-d array
a2 = array(([1,2], [3,4]))
a2.ndim, a2.shape
a2.dtype
arange is one instance of an ndarray-creating function in NumPy. Compare to xrange.
type(arange(10))
for k in arange(10):
    print k
list(arange(10)) == list(xrange(10))
# how to map 0..63 -> 8x8 array
a3 = np.arange(64).reshape(8,8)
a3
a3[1,2]
for i in range(8):
    for j in range(8):
        if a3[i,j] != i*8 + j:
            print i, j
example of broadcasting:
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.
2*a3
a3+2
# reverse sort -- best way?
#http://stackoverflow.com/a/6771620/7782
np.sort(arange(100))[::-1]
This stuff is a bit tricky (see PfDA, pp. 89-92).
Consider the example of picking out the whole numbers less than 20 that are evenly divisible by 3. Generate a list of such numbers.
# list comprehension
[i for i in xrange(20) if i % 3 == 0]
a3 = arange(20)
a3
# basic indexing
print a3[0]
print a3[::-1]
print a3[2:5]
np.mod(a3, 3)
np.mod(a3, 3) == 0
divisible_by_3 = np.mod(a3, 3) == 0
a3[divisible_by_3]
# if you want to understand this in terms of the overloaded operators -- don't worry if you don't get this.
a3.__getitem__(np.mod(a3,3).__eq__(0))
Hint: one way is to use arange, np.sqrt, astype
Answer should be:
array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81])
#FILL IN
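One possible sketch using the hinted functions: filter arange(100) down to the values whose square root survives truncation to int unchanged:

```python
import numpy as np

a = np.arange(100)
# a value is a perfect square exactly when truncating its square root
# to an integer loses nothing
squares = a[np.sqrt(a) == np.sqrt(a).astype(int)]
print(squares)  # the 10 perfect squares below 100
```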
Make a series out of an array
s1 = Series(arange(5))
confirm that the type of s1 is what you would expect
type(s1)
show that the series is also an array
s1.ndim, isinstance(s1, np.ndarray)
s1.index
import string
allTheLetters = string.lowercase
allTheLetters
s2 = Series(data=arange(5), index=list(allTheLetters)[:5])
s2
s2.index
Compared with a regular NumPy array, you can use values in the index when selecting single values or a set of values
# can use both numeric indexing and the labels
s2[0], s2['a']
for i in range(len(s2)):
    print i, s2[i]
conflicts in indexing are possible -- consider
s3 = Series(data=['albert', 'betty', 'cathy'], index=[3,1, 0])
s3
s3[0], list(s3)[0]
but slicing works by numeric position
s3[::-1]
for i in range(len(s3)):
    print i, s3[i:i+1]
s3.name = 'person names'
s3.name
s3.index.name = 'confounding label'
s3.index.name
s3
Important points remaining: you get some nice matplotlib integration via pandas.
# Gauss addition using np.arange, Series
from pandas import Series
Series(arange(101).cumsum()).plot()
from pandas import Series
Series((pow(2,k) for k in xrange(64)), dtype=np.float64).cumsum().plot()
2*ones(64, dtype=np.int)
arange(64)
sum(np.power(2, arange(64, dtype=np.uint64)))
sum(np.power(2*ones(64, dtype=np.uint64), arange(64)))
precise_ans = sum([pow(2,n) for n in xrange(64)])
np_ans = sum(np.power(2*ones(64, dtype=np.uint64), arange(64)))
precise_ans, np_ans
# Raise an assertion if two items are not equal up to desired precision.
np.testing.assert_almost_equal(precise_ans, np_ans) is None
Here we check to make sure that the expected files are in place.
This notebook assumes relative location of files as laid out in https://github.com/rdhyee/working-open-data
import os
# relative to parent dir
DATA_FILES = {"datadict": "data/census/DataDict.txt",
              "dataset": "data/census/DataSet.txt",
              "fips": "data/census/FIPS_CountyName.txt"}

def file_path(key):
    return os.path.join(os.pardir, DATA_FILES[key])

for file_key in DATA_FILES.keys():
    abs_fname = file_path(file_key)
    print abs_fname, os.path.exists(abs_fname)
You can download the git repo as a zip file if you are having problems with GitHub.
import pandas as pd
from pandas import Series, DataFrame
# use head on Mac/Linux to look at the file
fips_path = file_path("fips")
!head $fips_path
df1 = DataFrame([{'fips': '00000', 'geog_entity': 'UNITED STATES'},
                 {'fips': '01000', 'geog_entity': 'ALABAMA'},
                 ])
df1
# check type of df1
type(df1)
# column headings
df1.columns
# index -- ascribed automatically since we didn't explicitly specify labels
df1.index
We can use set_index to make fips the index.
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html#pandas.DataFrame.set_index
# p. 150
df1.set_index('fips', inplace=True, drop=True)
# now check the index
df1.index
# to access row with index of 01000
df1.ix['01000']
# to read off the geog_entity column
df1['geog_entity']
# look at how we can do row slices
df1.ix[0:1]
# get rows in reverse order
df1.ix[::-1]
Create the index explicitly in instantiation of the simple DataFrame
df2 = DataFrame(data=[{'geog_entity': 'UNITED STATES'},
                      {'geog_entity': 'ALABAMA'},
                      ], index=['00000', '01000'])
df2
# note index name is empty
df2.index.name is None
# set index name
df2.index.name = 'fips'
df2
import codecs
from itertools import islice
from pandas import Series, DataFrame
f = codecs.open(file_path("fips"), encoding='iso-8859-1')
fips_list = list()
for row in islice(f, None):
    # FILL IN
    pass
fips_df = DataFrame(fips_list)
fips_df
# display the first 5 rows
fips_df[:5]
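One possible fill-in, assuming each line of FIPS_CountyName.txt is a 5-character code, one space, then the entity name (check against the head output above); here shown on inline sample lines instead of the file:

```python
from pandas import DataFrame

# assumed line layout: 5-character fips code, one space, entity name
sample_lines = ["00000 UNITED STATES\n",
                "01000 ALABAMA\n",
                "01001 Autauga County, AL\n"]

fips_list = []
for row in sample_lines:
    fips_list.append({"fips": row[:5], "geog_entity": row[6:].strip()})

fips_df = DataFrame(fips_list)
print(fips_df)
```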
import codecs
from itertools import islice
import pandas as pd
from pandas import Series, DataFrame
f = codecs.open(file_path("fips"), encoding='iso-8859-1')
fips_df2 = pd.read_table(f, header=None,names=['row'])
# show first five rows
fips_df2[:5]
Hint: consider using fips_df2.row.str[:] and see what happens if you use fips_df2["junk"] = 'hello'. BTW, you can delete columns -- e.g., del fips_df2["junk"]
# FILL IN
# show first five rows
fips_df2[:5]
#FILL IN
# this is a check that you have the right columns
assert set(fips_df2.columns) == set(['fips', 'geog_entity'])
assert len(fips_df2) == 3195
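One possible fill-in using the vectorized .str accessor, demonstrated on a small stand-in frame (again assuming the "code space name" line layout):

```python
from pandas import DataFrame

# stand-in for fips_df2: one raw string column named 'row'
fips_df2 = DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA']})

# slice the string column into the two real columns, then drop the raw one
fips_df2['fips'] = fips_df2.row.str[:5]
fips_df2['geog_entity'] = fips_df2.row.str[6:]
del fips_df2['row']
print(fips_df2)
```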
fips_df2.set_index('fips', inplace=True, drop=True)
# print out first five rows
fips_df2[:5]
fips_df2.ix['00000']
# first 5 rows
fips_df2[Series(fips_df2.index).str[-3:] == '000'][:5]
len(fips_df2[Series(fips_df2.index).str[-3:] == '000'])
# check type
isinstance(fips_df2.index, pd.Index)
# can create a Boolean slice to pull out a country
is_country = fips_df.fips.str[:] == '00000'
fips_df[is_country]
Write is_state to pull out states and a DataFrame states_df to hold the states + DC; also write is_county to pull out counties.

Hints: use & to do a boolean and on Boolean indices (see PfDA, p. 91).

# how about states?
# make use of
# http://proquest.safaribooksonline.com/book/programming/python/9781449323592/7dot-data-wrangling-clean-transform-merge-reshape/id2801165
#FILL IN
is_state = None
is_county = None
states_df = fips_df[is_state]
print len(states_df)
states_df[:5]
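One possible approach, demonstrated on a toy stand-in for fips_df: state codes end in '000' (excluding the national code '00000'), while county codes don't end in '000':

```python
from pandas import DataFrame

# toy stand-in: the US total, one state, one county
fips_df = DataFrame({'fips': ['00000', '01000', '01001'],
                     'geog_entity': ['UNITED STATES', 'ALABAMA',
                                     'Autauga County, AL']})

# states end in '000' but are not the national total '00000'
is_state = (fips_df.fips.str[-3:] == '000') & (fips_df.fips != '00000')
# counties are everything whose code does not end in '000'
is_county = fips_df.fips.str[-3:] != '000'
states_df = fips_df[is_state]
print(states_df)
```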
assert set(is_state.value_counts().iteritems()) == set([(True, 51), (False, 3144)])
assert set(states_df["geog_entity"]) == set([u'VERMONT',
u'GEORGIA', u'IOWA', u'KANSAS', u'FLORIDA', u'VIRGINIA', u'NORTH CAROLINA',
u'HAWAII', u'NEW YORK', u'CALIFORNIA', u'ALABAMA', u'IDAHO', u'DELAWARE',
u'ALASKA', u'ILLINOIS', u'SOUTH DAKOTA', u'CONNECTICUT', u'MONTANA',
u'MASSACHUSETTS', u'NEW HAMPSHIRE', u'MARYLAND', u'NEW MEXICO',
u'MISSISSIPPI', u'TENNESSEE', u'COLORADO', u'NEW JERSEY', u'UTAH',
u'MICHIGAN', u'WEST VIRGINIA', u'WASHINGTON', u'MINNESOTA', u'OREGON',
u'WYOMING', u'OHIO', u'SOUTH CAROLINA', u'INDIANA', u'NEVADA',
u'LOUISIANA', u'NEBRASKA', u'ARIZONA', u'WISCONSIN', u'NORTH DAKOTA',
u'PENNSYLVANIA', u'OKLAHOMA', u'KENTUCKY', u'RHODE ISLAND', u'DISTRICT OF COLUMBIA',
u'ARKANSAS', u'MISSOURI', u'TEXAS', u'MAINE'])
assert set(is_county.value_counts().iteritems()) == set([(False, 52), (True, 3143)])
import csv
import codecs
import pandas as pd
from pandas import DataFrame, Series
from itertools import islice
f = codecs.open(file_path("dataset"), encoding='iso-8859-1')
reader = csv.DictReader(f)
dataset = dict([(row["fips"], row) for row in islice(reader, None)])
f.close()
# print out entry for USA
print dataset['00000']
# make sure we have the right 2010 census population for the USA
assert dataset['00000']['POP010210'] == '308745538'
# let's try pd.read_csv
# how to give hints about data type to pd.read_csv?
import codecs
# read data in and merge with fips_df
#dtype = [('fips', 'S'), ('POP010210', 'i')]
# problems with string conversion when I try to explicitly parse fips as a np.string_
# dtype = {'fips':np.string_}
# possibly relevant sections of the code:
# v0.10.1: https://github.com/pydata/pandas/tree/v0.10.1
# https://github.com/pydata/pandas/blob/v0.10.1/pandas/io/parsers.py
# https://github.com/pydata/pandas/blob/v0.10.1/pandas/src/parser.pyx#L897 -> _convert_with_dtype
# let's try object for fips and int for POP010210
dtype = {'fips':np.object, 'POP010210':np.int}
f = codecs.open(file_path("dataset"), encoding='iso-8859-1')
dataset_df = pd.read_csv(f, dtype=dtype)
# confirm data types for fips, which should not be converted to int
dataset_df.fips.dtype, dataset_df.POP010210.dtype
# read off US population
dataset_df[is_country].ix[0]['POP010210']
# 3 most populous entities
# http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.sort_index.html
dataset_df.sort_index(by='POP010210')[-1:-4:-1][["fips", "POP010210"]]
# PfDA, p. 28
df = pd.merge(fips_df, dataset_df)
df.sort_index(by='POP010210')[-1:-5:-1][["fips", "geog_entity", "POP010210"]]
# calculate state_pop
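A sketch of one way to build state_pop, shown on a toy stand-in for df: keep the state rows, keep only the three expected columns, and sort by population descending. (Notebook-era pandas spells the sort sort_index(by=...); newer pandas uses sort_values, as here.)

```python
from pandas import DataFrame

# toy stand-in for the merged df: the US total plus two states
df = DataFrame({'fips': ['00000', '48000', '06000'],
                'geog_entity': ['UNITED STATES', 'TEXAS', 'CALIFORNIA'],
                'POP010210': [308745538, 25145561, 37253956]})
is_state = (df.fips.str[-3:] == '000') & (df.fips != '00000')

# state rows only, three columns, most populous first
state_pop = (df[is_state][['fips', 'geog_entity', 'POP010210']]
             .sort_values(by='POP010210', ascending=False))
print(state_pop)
```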
# check state_pop behavior
assert state_pop.shape == (51, 3)
assert set(state_pop.columns) == set(['POP010210', 'fips', 'geog_entity'])
assert list(state_pop[:5]["geog_entity"]) == [u'CALIFORNIA', u'TEXAS', u'NEW YORK', u'FLORIDA', u'ILLINOIS']
assert state_pop[:5]["POP010210"].sum() == 113409561
# add up all the states to match US population
assert state_pop["POP010210"].sum() == df[is_country]["POP010210"].sum()
sum(df[is_country]['POP010210']), sum(df[is_state]['POP010210']), sum(df[is_county]['POP010210'])
e.g., df[counties_for_state('06')] should be a DataFrame with CA counties.
def counties_for_state(state):
    # FILL IN
    pass
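One possible fill-in, shown on a toy stand-in for df: counties in a state share its two-digit prefix but don't end in '000'. (This assumes the state argument is the two-digit prefix, e.g. '06' for California.)

```python
from pandas import DataFrame

# toy stand-in for df: a state row, two CA counties, one TX county
df = DataFrame({'fips': ['06000', '06001', '06003', '48001'],
                'geog_entity': ['CALIFORNIA', 'Alameda County, CA',
                                'Alpine County, CA', 'Anderson County, TX']})

def counties_for_state(state):
    # boolean mask: same two-digit state prefix, but not the state total
    return (df.fips.str[:2] == state) & (df.fips.str[-3:] != '000')

print(df[counties_for_state('06')]['geog_entity'].tolist())
```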
assert set(df[counties_for_state('06')]["geog_entity"]) == set([u'Nevada County, CA',
u'Alameda County, CA', u'Kings County, CA', u'Ventura County, CA',
u'El Dorado County, CA', u'San Joaquin County, CA', u'Alpine County, CA',
u'San Luis Obispo County, CA', u'Modoc County, CA', u'Colusa County, CA',
u'Stanislaus County, CA', u'Sonoma County, CA', u'Tulare County, CA',
u'Shasta County, CA', u'Yolo County, CA', u'Placer County, CA', u'Glenn County, CA',
u'Sacramento County, CA', u'San Francisco County, CA',
u'Madera County, CA', u'Imperial County, CA', u'Plumas County, CA',
u'San Mateo County, CA', u'Riverside County, CA', u'Calaveras County, CA',
u'Napa County, CA', u'Mendocino County, CA', u'Mariposa County, CA',
u'Santa Barbara County, CA', u'Inyo County, CA', u'Butte County, CA',
u'Trinity County, CA', u'Los Angeles County, CA', u'Lassen County, CA',
u'Yuba County, CA', u'Amador County, CA', u'Marin County, CA', u'Humboldt County, CA',
u'Merced County, CA', u'Lake County, CA', u'San Diego County, CA',
u'Monterey County, CA', u'Sutter County, CA', u'Solano County, CA',
u'Tuolumne County, CA', u'San Bernardino County, CA', u'Fresno County, CA',
u'Santa Cruz County, CA', u'San Benito County, CA', u'Contra Costa County, CA',
u'Orange County, CA', u'Del Norte County, CA', u'Mono County, CA',
u'Siskiyou County, CA', u'Kern County, CA', u'Sierra County, CA', u'Tehama County, CA',
u'Santa Clara County, CA'])
for (k, s) in df[is_state][['fips', 'geog_entity', 'POP010210']].iterrows():
    print s["fips"], s["geog_entity"], s["POP010210"] == df[counties_for_state(s["fips"])]['POP010210'].sum()
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html#pandas.DataFrame.set_index
# p. 150
df.set_index('fips', inplace=True, drop=False)
for k in df[is_state]['fips']:
    print k, df.ix[k]["geog_entity"], df.ix[k]["POP010210"] == sum(df[counties_for_state(k)]['POP010210'])
n0 = 5
n0 == 5
Now I thought I'd be able to use n0.__eq__(5), but nope -- it's complicated -- see http://stackoverflow.com/questions/2281222/why-when-in-python-does-x-y-call-y-eq-x#comment2254663_2282795
try:
    n0.__eq__(5)
except Exception as e:
    print e
But you can do int.__cmp__(x):
(n0.__cmp__(4), n0.__cmp__(5), n0.__cmp__(6))
how about ndarray?
arange(5) == 2
#
# http://docs.scipy.org/doc/numpy/reference/generated/numpy.array_equal.html
np.array_equal(arange(5) == 2 , arange(5).__eq__(2))
Useful if you want to understand how the slicing syntax really works.
isinstance([1,2], list)
isinstance(arange(5), list) # what does that mean -- could still be list-like
l1 = range(5)
type(l1)
l1[0], l1.__getitem__(0), l1[0] == l1.__getitem__(0)
l1[::-1], l1.__getitem__(slice(None, None, -1))
ar1 = arange(5)
ar1[3], ar1.__getitem__(3)
ar1 == 2
ar1[ar1 == 2].shape
ar1.__eq__(2)
ar1.__getitem__(slice(2, 4, None))
slice(ar1.__eq__(2), None, None)
ar1.__getitem__(ar1.__eq__(2))
ar1[:2], ar1.__getitem__(slice(2))
ar1 + 7
ar1.__add__(7)
min(ar1 + 7)
alphabet[:]