Working with Open Data Midterm (March 18, 2014)
There are 94 points in this exam: 2 each for the 47 questions. The questions are either multiple choice or short answers. For multiple choice, just write the number of the choice selected.
Name: ______________________________________
`
Consider this code to construct a DataFrame of populations of countries.
import json
import requests
from pandas import DataFrame
# read population in from JSON-formatted data derived from the Wikipedia
pop_json_url = "https://gist.github.com/rdhyee/8511607/" + \
"raw/f16257434352916574473e63612fcea55a0c1b1c/population_of_countries.json"
pop_list= requests.get(pop_json_url).json()
df = DataFrame(pop_list)
df[:5]
0 | 1 | 2 | |
---|---|---|---|
0 | 1 | China | 1385566537 |
1 | 2 | India | 1252139596 |
2 | 3 | United States | 320050716 |
3 | 4 | Indonesia | 249865631 |
4 | 5 | Brazil | 200361925 |
5 rows × 3 columns
Note the dtypes
of the columns.
df.dtypes
0 float64 1 object 2 int64 dtype: object
s = sum(df[df[1].str.startswith('C')][2])
s
1667559248
Q1: What is the relationship between s
and the population of China, where s
is defined as
s = sum(df[df[1].str.startswith('C')][2])
s
is greater than the population of Chinas
is the same as the population of Chinas
is less than the population of Chinas
is not a number.A1:
1
s2 = sum(df[df[1].str.startswith('X')][2])
s2
0
Q2: What is the relationship between s2
and the population of China, where s2
is defined by:
s2 = sum(df[df[1].str.startswith('X')][2])
s2
is greater than the population of Chinas2
is the same as the population of Chinas2
is less than the population of Chinas2
is not a number.A2:
3
df.columns = ['Number','Country','Population']
Q3: What happens when the following statement is run?
df.columns = ['Number','Country','Population']
df
gets a new attribute called columns
df
's columns are renamed based on the listA3:
3
try:
df.columns = ['Number','Country']
except Exception as e:
print e
Length mismatch: Expected axis has 3 elements, new values have 2 elements
Q4: This statement does the following
df.columns = ['Number','Country']
df
gets a new attribute called columns
df
's columns are renamed based on the listA4:
4
df.columns = ['Number','Country','Population']
s=sum(df[df['Country'].str.startswith('C')]['Population'])
s
1667559248
Q5: How would you rewrite the following statement to get the same result as
s = sum(df[df[1].str.startswith('C')][2])
after running:
df.columns = ['Number','Country','Population']
A5:
s=sum(df[df['Country'].str.startswith('C')]['Population'])
len(df[df["Population"]>1000000000])
2
Q6. What is
len(df[df["Population"] > 1000000000])
A6:
2
";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0]))
u'C;I'
Q7. What is
";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0]))
A7:
C;I
len(";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0])))
3
Q8. What is
len(";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0])))
A8:
3
from pandas import DataFrame, Series
import numpy as np
s1 = Series(np.arange(-1,4))
s1
0 -1 1 0 2 1 3 2 4 3 dtype: int64
s1 + 1
0 0 1 1 2 2 3 3 4 4 dtype: int64
Q9: What is
s1 + 1
A9:
0 0 1 1 2 2 3 3 4 4
s1.apply(lambda k: 2*k).sum()
10
Q10: What is
s1.apply(lambda k: 2*k).sum()
A10:
10
s1.cumsum()[3]
2
Q11: What is
s1.cumsum()[3]
A11:
2
s1.cumsum() - s1.cumsum()
0 0 1 0 2 0 3 0 4 0 dtype: int64
Q12: What is
s1.cumsum() - s1.cumsum()
A12:
0 0 1 0 2 0 3 0 4 0
len(s1.cumsum() - s1.cumsum())
5
Q13. What is
len(s1.cumsum() - s1.cumsum())
A13:
5
np.any(s1 > 2)
True
Q14: What is
np.any(s1 > 2)
A14:
True
np.all(s1<3)
False
Q15. What is
np.all(s1<3)
A15:
False
Consider the following code to load population(s) from the Census API.
from census import Census
from us import states
import settings
c = Census(settings.CENSUS_KEY)
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})
[{u'NAME': u'California', u'P0010001': u'37253956', u'state': u'06'}]
Q16: What is the purpose of settings.CENSUS_KEY
?
A16:
2
Q17. When we run
pip install census
we are:
A17:
1
Consider r1
and r2
:
r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*' })
len(r1), len(r2)
(58, 3221)
Q18: What is the difference between r1
and r2
?
r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*' })
A18:
r1
is a list holding the name and total population from the 2010 US Census for every county in California.
r2
holds the name and total population from the 2010 US Census for every county in all US states, DC, and Puerto Rico.
Q19. What's the relationship between len(r1)
and len(r2)
?
len(r1)
is less than len(r2)
len(r1)
equals len(r2)
len(r1)
is greater than len(r2)
A19:
1
Q20: Which is a correct geographic hierarchy?
Nation > States = Nation is subdivided into States
A20:
3
from pandas import DataFrame
import numpy as np
from census import Census
from us import states
import settings
c = Census(settings.CENSUS_KEY)
r = c.sf1.get(('NAME', 'P0010001'), {'for': 'state:*'})
df1 = DataFrame(r)
df1.head()
NAME | P0010001 | state | |
---|---|---|---|
0 | Alabama | 4779736 | 01 |
1 | Alaska | 710231 | 02 |
2 | Arizona | 6392017 | 04 |
3 | Arkansas | 2915918 | 05 |
4 | California | 37253956 | 06 |
5 rows × 3 columns
len(df1)
52
Q21: Why does df
have 52 items? Please explain
A21:
When queried for "states", the US Census API returns data for the 50 states, the District of Columbia, and Puerto Rico: (50+1+1 = 52 entities).
Consider the two following expressions:
print df1.P0010001.sum()
print
print df1.P0010001.astype(int).sum()
477973671023163920172915918372539565029196357409789793460172318801310968765313603011567582128306326483802304635528531184339367453337213283615773552654762998836405303925296729759889279894151826341270055113164708791894205917919378102953548367259111536504375135138310741270237910525674625364814180634610525145561276388562574180010246724540185299456869865636263725789 312471327
Q22: Why is df1.P0010001.sum()
different from df1.P0010001.astype(int).sum()
?
A22:
The data type of df1.P0010001
is a string. Hence, performing sum
on it concatenates the string representation of populations into a longer string. In contrast, once df1.P0010001
is converted into integers via df1.P0010001.astype(int)
, a sum
operation adds up all the populations into a single integer.
df1.P0010001 = df1.P0010001.astype(int)
df1[['NAME','P0010001']].sort('P0010001', ascending=True).head()
NAME | P0010001 | |
---|---|---|
50 | Wyoming | 563626 |
8 | District of Columbia | 601723 |
45 | Vermont | 625741 |
34 | North Dakota | 672591 |
1 | Alaska | 710231 |
5 rows × 2 columns
Q23: Describe the output of the following:
df1.P0010001 = df1.P0010001.astype(int)
df1[['NAME','P0010001']].sort('P0010001', ascending=True).head()
A23: A DataFrame (with 5 rows and 2 columns (NAME, P0010001)) listing the 5 least populous states in ascending order by population.
df1.set_index('NAME', inplace=True)
df1.ix['Nebraska']
P0010001 1826341 state 31 Name: Nebraska, dtype: object
Q24: After running:
df1.set_index('NAME', inplace=True)
how would you access the Series for the state of Nebraska?
df1['Nebraska']
df1[1]
df1.ix['Nebraska']
df1[df1['NAME'] == 'Nebraska']
A24:
3
len(states.STATES)
51
Q25. What is len(states.STATES)
?
A25:
51
len(df1[np.in1d(df1.state, [s.fips for s in states.STATES])])
51
Q26. What is
len(df1[np.in1d(df1.state, [s.fips for s in states.STATES])])
A26:
51
In the next question, we will make use of the negation operator ~
. Take a look at a specific example
~Series([True, True, False, True])
0 False 1 False 2 True 3 False dtype: bool
list(df1[~np.in1d(df1.state, [s.fips for s in states.STATES])].index)[0]
u'Puerto Rico'
Q27. What is
list(df1[~np.in1d(df1.state, [s.fips for s in states.STATES])].index)[0]
A27:
Puerto Rico
Consider pop1
and pop2
:
pop1 = df1['P0010001'].astype('int').sum()
pop2 = df1[np.in1d(df1.state, [s.fips for s in states.STATES])]['P0010001'].astype('int').sum()
pop1-pop2
3725789
Q28. What does pop11 - pop2
represent?
A28: The population of Puerto Rico in the 2010 Census.
sum(range(1, 101))
5050
Q29. Given that
range(10)
is
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
How to get the total of every integer from 1 to 100?
sum(range(1, 101))
sum(range(100))
sum(range(1, 100))
A29:
1
# itertools is a great library
# http://docs.python.org/2/library/itertools.html#itertools.count
# itertools.count(start=0, step=1):
# "Make an iterator that returns evenly spaced values starting with step."
from itertools import islice, count
c = count(0, 1)
print c.next()
print c.next()
0 1
Q30. What output is produced from
# itertools is a great library
# http://docs.python.org/2/library/itertools.html#itertools.count
# itertools.count(start=0, step=1):
# "Make an iterator that returns evenly spaced values starting with step."
from itertools import islice, count
c = count(0, 1)
print c.next()
print c.next()
A30:
0 1
(2*Series(np.arange(101))).sum()
10100
Q31. Recalling that
1+2+3+...+100 = 5050
what is:
(2*Series(np.arange(101))).sum()
A31:
10100
Consider the follow generator that we used to query for census places.
import pandas as pd
from pandas import DataFrame
import census
import settings
import us
from itertools import islice
c=census.Census(settings.CENSUS_KEY)
def places(variables="NAME"):
for state in us.states.STATES:
geo = {'for':'place:*', 'in':'state:{s_fips}'.format(s_fips=state.fips)}
for place in c.sf1.get(variables, geo=geo):
yield place
Now we compute a DataFrame for the places: places_df
r = list(islice(places("NAME,P0010001"), None))
places_df = DataFrame(r)
places_df.P0010001 = places_df.P0010001.astype('int')
print "number of places", len(places_df)
print "total pop", places_df.P0010001.sum()
places_df.head()
number of places 29261 total pop 228457238
NAME | P0010001 | place | state | |
---|---|---|---|---|
0 | Abanda CDP | 192 | 00100 | 01 |
1 | Abbeville city | 2688 | 00124 | 01 |
2 | Adamsville city | 4522 | 00460 | 01 |
3 | Addison town | 758 | 00484 | 01 |
4 | Akron town | 356 | 00676 | 01 |
5 rows × 4 columns
We display the most populous places from California
places_df[places_df.state=='06'].sort_index(by='P0010001', ascending=False).head()
NAME | P0010001 | place | state | |
---|---|---|---|---|
2714 | Los Angeles city | 3792621 | 44000 | 06 |
3112 | San Diego city | 1307402 | 66000 | 06 |
3122 | San Jose city | 945942 | 68000 | 06 |
3116 | San Francisco city | 805235 | 67000 | 06 |
2425 | Fresno city | 494665 | 27000 | 06 |
5 rows × 4 columns
places_df['label'] = places_df.apply(lambda s: s['state']+s['place'], axis=1)
places_df.ix[3122]['label']
u'0668000'
Q32. Given
places_df[places_df.state=='06'].sort_index(by='P0010001', ascending=False).head()
is
NAME | P0010001 | place | state | |
---|---|---|---|---|
2714 | Los Angeles city | 3792621 | 44000 | 06 |
3112 | San Diego city | 1307402 | 66000 | 06 |
3122 | San Jose city | 945942 | 68000 | 06 |
3116 | San Francisco city | 805235 | 67000 | 06 |
2425 | Fresno city | 494665 | 27000 | 06 |
5 rows × 4 columns
places_df.ix[3122]['label']
after we add the label
column with:
places_df['label'] = places_df.apply(lambda s: s['state']+s['place'], axis=1)
A32:
0668000
places_df["NAME"][3122]
u'San Jose city'
Q33. What is
places_df["NAME"][3122]
A33:
San Jose city
Now let's set up a DataFrame with some letters and properties of letters.
# numpy and pandas related imports
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
# for example, using lower and uppercase English letters
import string
lower = Series(list(string.lowercase), name='lower')
upper = Series(list(string.uppercase), name='upper')
df2 = pd.concat((lower, upper), axis=1)
df2['ord'] = df2['lower'].apply(ord)
df2.head()
lower | upper | ord | |
---|---|---|---|
0 | a | A | 97 |
1 | b | B | 98 |
2 | c | C | 99 |
3 | d | D | 100 |
4 | e | E | 101 |
5 rows × 3 columns
Note that string.upper
takes a letter and returns its uppercase version. For example:
string.upper('b')
'B'
np.all(df2['lower'].apply(string.upper) == df2['upper'])
True
Q34. What is
np.all(df2['lower'].apply(string.upper) == df2['upper'])
A34:
True
df2.apply(lambda s: s['lower'] + s['upper'], axis=1)[6]
'gG'
Q35. What is
df2.apply(lambda s: s['lower'] + s['upper'], axis=1)[6]
A35:
gG
Please remind yourself what enumerate
does.
words = ['Berkeley', 'I', 'School']
for (i, word) in islice(enumerate(words),1):
print (i, word)
(0, 'Berkeley')
list(enumerate(words))[2][1]
'School'
Q36. What is
list(enumerate(words))[2][1]
A36:
School
Now consider the generator g2
def g2():
words = ['Berkeley', 'I', 'School']
for word in words:
if word != 'I':
for letter in list(word):
yield letter
my_g2 = g2()
len(list(my_g2))
14
Q37. What is
len(list(my_g2))
A37:
14
def g3():
words = ['Berkeley', 'I', 'School']
for word in words:
yield words
len(list(g3()))
3
Q38. What is
len(list(g3()))
A38:
3
Consider using groupby
with a DataFrame with states.
import us
import census
import settings
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
from itertools import islice
c = census.Census(settings.CENSUS_KEY)
def states(variables='NAME'):
geo={'for':'state:*'}
states_fips = set([state.fips for state in us.states.STATES])
# need to filter out non-states
for r in c.sf1.get(variables, geo=geo, year=2010):
if r['state'] in states_fips:
yield r
# make a dataframe from the total populations of states in the 2010 Census
df = DataFrame(states('NAME,P0010001'))
df.P0010001 = df.P0010001.astype('int')
df['first_letter'] = df.NAME.apply(lambda s:s[0])
df.head()
NAME | P0010001 | state | first_letter | |
---|---|---|---|---|
0 | Alabama | 4779736 | 01 | A |
1 | Alaska | 710231 | 02 | A |
2 | Arizona | 6392017 | 04 | A |
3 | Arkansas | 2915918 | 05 | A |
4 | California | 37253956 | 06 | C |
5 rows × 4 columns
For reference, here's a list of all the states
print list(df.NAME)
[u'Alabama', u'Alaska', u'Arizona', u'Arkansas', u'California', u'Colorado', u'Connecticut', u'Delaware', u'District of Columbia', u'Florida', u'Georgia', u'Hawaii', u'Idaho', u'Illinois', u'Indiana', u'Iowa', u'Kansas', u'Kentucky', u'Louisiana', u'Maine', u'Maryland', u'Massachusetts', u'Michigan', u'Minnesota', u'Mississippi', u'Missouri', u'Montana', u'Nebraska', u'Nevada', u'New Hampshire', u'New Jersey', u'New Mexico', u'New York', u'North Carolina', u'North Dakota', u'Ohio', u'Oklahoma', u'Oregon', u'Pennsylvania', u'Rhode Island', u'South Carolina', u'South Dakota', u'Tennessee', u'Texas', u'Utah', u'Vermont', u'Virginia', u'Washington', u'West Virginia', u'Wisconsin', u'Wyoming']
df.groupby('first_letter').apply(lambda g:list(g.NAME))['C']
[u'California', u'Colorado', u'Connecticut']
Q39. What is
df.groupby('first_letter').apply(lambda g:list(g.NAME))['C']
A39:
[u'California', u'Colorado', u'Connecticut']
df.groupby('first_letter').apply(lambda g:len(g.NAME))['A']
4
Q40. What is
df.groupby('first_letter').apply(lambda g:len(g.NAME))['A']
A40:
4
df.groupby('first_letter').agg('count')['first_letter']['P']
1
Q41. What is
df.groupby('first_letter').agg('count')['first_letter']['P']
A41:
1
len(df.groupby('NAME'))
51
Q42. What is
len(df.groupby('NAME'))
A42:
51
Recall the code from the diversity calculations
def normalize(s):
"""take a Series and divide each item by the sum so that the new series adds up to 1.0"""
total = np.sum(s)
return s.astype('float') / total
def entropy(series):
"""Normalized Shannon Index"""
# a series in which all the entries are equal should result in normalized entropy of 1.0
# eliminate 0s
series1 = series[series!=0]
# if len(series) < 2 (i.e., 0 or 1) then return 0
if len(series1) > 1:
# calculate the maximum possible entropy for given length of input series
max_s = -np.log(1.0/len(series))
total = float(sum(series1))
p = series1.astype('float')/float(total)
return sum(-p*np.log(p))/max_s
else:
return 0.0
def gini_simpson(s):
# https://en.wikipedia.org/wiki/Diversity_index#Gini.E2.80.93Simpson_index
s1 = normalize(s)
return 1-np.sum(s1*s1)
Q43. Suppose you have 10 people and 5 categories, how you would you maximize the Shannon entropy?
A43:
3
entropy(Series([0,0,10,0,0]))
0.0
Q44. What is
entropy(Series([0,0,10,0,0]))
A44:
0
entropy(Series([10,0,0,0,0]))
0.0
Q45. What is
entropy(Series([10,0,0,0,0]))
A45:
0
entropy(Series([1,1,1,1,1]))
1.0000000000000002
Q46. What is
entropy(Series([1,1,1,1,1]))
A46:
1
gini_simpson(Series([2,2,2,2,2]))
0.79999999999999993
Q47. What is
gini_simpson(Series([2,2,2,2,2]))
A47:
0.8