Read in JSON and DataFrame Basics

In [2]:

# read population in
import json
import requests
from pandas import DataFrame

# pop_json_url holds the URL of a JSON file listing countries by population
pop_json_url = "https://gist.github.com/rdhyee/8511607/raw/f16257434352916574473e63612fcea55a0c1b1c/population_of_countries.json"
pop_list = requests.get(pop_json_url).json()

df = DataFrame(pop_list)
df[:5]

Out[2]:

	0	1	2
0	1	China	1385566537
1	2	India	1252139596
2	3	United States	320050716
3	4	Indonesia	249865631
4	5	Brazil	200361925

5 rows × 3 columns

In [11]:

df.dtypes

Out[11]:

0    float64
1     object
2      int64
dtype: object

Q: Based on the dtypes output above, which of these would you expect to see in pop_list?

['1', 'United States', '320050716']
[1, 'United States', 320050716]
['United States', 320050716]
[1, 'United States', '320050716']
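
As a hint for reasoning about dtypes, here is a minimal sketch of how pandas infers column types from a list of lists. The rows are hypothetical and not taken from pop_list; the last case shows one possible way an otherwise integer column can end up as float64, which is an assumption about the mechanism rather than a statement about the real data.

```python
from pandas import DataFrame

# Hypothetical rows for illustration; they are not taken from pop_list.
all_numbers = [[1, "Atlantis", 1000], [2, "Erewhon", 500]]
all_strings = [["1", "Atlantis", "1000"], ["2", "Erewhon", "500"]]
missing_rank = [[1, "Atlantis", 1000], [None, "Erewhon", 500]]

numbers_dtypes = DataFrame(all_numbers).dtypes
strings_dtypes = DataFrame(all_strings).dtypes
missing_dtypes = DataFrame(missing_rank).dtypes

# Integers in the list become integer columns; quoted digits stay text.
print(numbers_dtypes)
print(strings_dtypes)
# A None among integers is stored as NaN, which forces a float column.
print(missing_dtypes)
```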

Q: What is the relationship between s and the population of China?

s = sum(df[df[1].str.startswith('C')][2])

s is greater than the population of China
s is the same as the population of China
s is less than the population of China
s is not a number.
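
The expression combines a boolean mask, row selection, and a sum. A small sketch on made-up data (the names and numbers below are hypothetical) breaks it into its three steps:

```python
from pandas import DataFrame

# Hypothetical toy data; columns keep the default integer labels 0, 1, 2.
toy = DataFrame([[1, "Cathay", 50], [2, "Ind", 40], [3, "Cipango", 10]])

mask = toy[1].str.startswith("C")  # boolean Series: True, False, True
selected = toy[mask]               # keeps only the rows where mask is True
total = sum(selected[2])           # adds up column 2 of the selected rows

print(total)  # 60
```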

Q: What does this statement do?

df.columns = ['Number','Country','Population']

Nothing
df gets a new attribute called columns
df's columns are renamed based on the list
Throws an exception
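
To experiment with assignment to the columns attribute, a minimal sketch on a hypothetical three-column frame can help. It also tries a list of the wrong length, which pandas rejects with a ValueError.

```python
from pandas import DataFrame

# Hypothetical toy frame with default integer column labels.
toy = DataFrame([[1, "Atlantis", 1000], [2, "Erewhon", 500]])
before = list(toy.columns)

toy.columns = ["Number", "Country", "Population"]
after = list(toy.columns)

print(before)  # [0, 1, 2]
print(after)   # ['Number', 'Country', 'Population']

# The list must have one name per existing column.
mismatch_error = None
try:
    toy.columns = ["Number", "Country"]  # two names for three columns
except ValueError as err:
    mismatch_error = err
print(type(mismatch_error).__name__)  # ValueError
```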

Q: How would you rewrite this statement to get the same result?

s = sum(df[df[1].str.startswith('C')][2])

after running:

df.columns = ['Number','Country','Population']

Series Examples

In [54]:

from pandas import DataFrame, Series
import numpy as np

s1 = Series(np.arange(1,4))
s1

Out[54]:

0    1
1    2
2    3
dtype: int64

Q: What is

s1 + 1

Q: What is

s1.apply(lambda k: 2*k).sum()

Q: What is

s1.cumsum()[1]

Q: What is

s1.cumsum() + s1.cumsum()

Q: Describe what is happening in these statements:

s1 + 1

and

s1.cumsum() + s1.cumsum()
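
When comparing the two statements, it may help to see scalar arithmetic and Series-to-Series arithmetic side by side. This sketch uses hypothetical Series a and b (not s1); b deliberately has its index in a different order to show that pandas matches elements by index label rather than by position.

```python
from pandas import Series

a = Series([10, 20, 30])                # index 0, 1, 2
b = Series([1, 2, 3], index=[2, 1, 0])  # same labels, reversed order

plus_scalar = a + 5           # the scalar is applied to every element
aligned = (a + b).sort_index()  # elements are paired by label, then added
running = a.cumsum()          # running total: 10, 30, 60

print(list(plus_scalar))  # [15, 25, 35]
print(list(aligned))      # [13, 22, 31]
print(list(running))      # [10, 30, 60]
```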

Q: What is

np.any(s1 > 2)
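
A comparison on a Series produces a boolean Series, which np.any and np.all then reduce to a single truth value. A short sketch with a hypothetical Series named temps:

```python
import numpy as np
from pandas import Series

temps = Series([3, 8, 5])  # hypothetical values

over_seven = temps > 7          # elementwise comparison: False, True, False
any_over = np.any(over_seven)   # True if at least one element is True
all_over = np.all(over_seven)   # True only if every element is True

print(list(over_seven))
print(bool(any_over), bool(all_over))  # True False
```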

Census API Examples

In [62]:

from census import Census
from us import states

import settings

c = Census(settings.CENSUS_KEY)
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})

Out[62]:

[{u'NAME': u'California', u'P0010001': u'37253956', u'state': u'06'}]

Q: What is the purpose of settings.CENSUS_KEY?

It is the password for the Census Python package
It is an API Access key for authentication with the Census API
It is an API Access key for authentication with Github
It is a key shared by all users of the Census API
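
The contents of the settings module are not shown in this notebook. One plausible way to write such a module, offered only as a sketch, is to read the key from an environment variable so that it never appears in the notebook itself. The function name load_census_key and the variable name CENSUS_KEY in the environment are assumptions.

```python
import os


def load_census_key(env=os.environ):
    """Return the Census API key stored in the environment (hypothetical helper)."""
    key = env.get("CENSUS_KEY")
    if not key:
        raise RuntimeError("Set the CENSUS_KEY environment variable first")
    return key


# In a settings.py file, one might then write:
# CENSUS_KEY = load_census_key()
```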

Q: What is the difference between r1 and r2?

r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*' })
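
The only difference between the two calls is the geography dictionary. A small helper (hypothetical, not part of the census package) makes the pattern explicit without calling the API:

```python
def county_geo(state_fips="*"):
    """Build the geography filter for all counties in one state, or in all states."""
    return {"for": "county:*", "in": "state:%s" % state_fips}


# '06' is the FIPS code for California, as shown in Out[62].
print(county_geo("06"))  # {'for': 'county:*', 'in': 'state:06'}
print(county_geo())      # {'for': 'county:*', 'in': 'state:*'}
```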

Q: Which is the correct geographic hierarchy?

Notation: "Nation > States" means the Nation is subdivided into States.

Counties > States
Counties > Census Blocks > Census Tracts
Places > Counties
Census Tracts > Block Groups > Census Blocks

In [72]:

from pandas import DataFrame

r = c.sf1.get(('NAME', 'P0010001'), {'for': 'state:*'})
df = DataFrame(r)

df.head()

Out[72]:

	NAME	P0010001	state
0	Alabama	4779736	01
1	Alaska	710231	02
2	Arizona	6392017	04
3	Arkansas	2915918	05
4	California	37253956	06

5 rows × 3 columns

Q: Why does df have 52 rows? Please explain.

In [75]:

len(df)

Out[75]:

52

Q: Why do the two results below differ? Please explain.

In [84]:

print df.P0010001.sum()
print
print df.P0010001.astype(int).sum()

477973671023163920172915918372539565029196357409789793460172318801310968765313603011567582128306326483802304635528531184339367453337213283615773552654762998836405303925296729759889279894151826341270055113164708791894205917919378102953548367259111536504375135138310741270237910525674625364814180634610525145561276388562574180010246724540185299456869865636263725789

312471327
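
The same effect can be reproduced on a tiny scale. This sketch uses a hypothetical Series of digit strings, stored with object dtype as an assumption about how the API results end up in the DataFrame:

```python
from pandas import Series

# Hypothetical values stored as text, like the strings returned by the API.
counts = Series(["12", "7", "30"], dtype=object)

as_text = counts.sum()                # strings are concatenated
as_number = counts.astype(int).sum()  # integers are added

print(as_text)    # 12730
print(as_number)  # 49
```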

Q: Describe the output of the following:

df.P0010001 = df.P0010001.astype(int)
df[['NAME','P0010001']].sort('P0010001', ascending=False).head()
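
Note that DataFrame.sort belongs to the pandas version used in this notebook; later pandas releases replaced it with sort_values. A sketch of the same pattern with sort_values, on hypothetical rows, also shows why the conversion to int matters before sorting:

```python
from pandas import DataFrame

# Hypothetical rows; populations start out as strings.
toy = DataFrame({"NAME": ["Xanadu", "Ys", "Zion"],
                 "P0010001": ["20", "100", "3"]})

# As strings, '3' sorts above '20' and '100'; as integers, 100 comes first.
toy["P0010001"] = toy["P0010001"].astype(int)
top = toy[["NAME", "P0010001"]].sort_values("P0010001", ascending=False).head(2)

print(top)
```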

Q: After running:

df.set_index('NAME', inplace=True)

how would you access the Series for the state of Alaska?

df['Alaska']
df[1]
df.ix['Alaska']
df[df['NAME'] == 'Alaska']
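
The .ix indexer exists in the pandas version used here but was removed in later releases, where .loc handles label-based row access. A minimal sketch on a hypothetical frame shows what set_index does to the columns and how a row is then looked up by label:

```python
from pandas import DataFrame

# Hypothetical rows, not Census data.
toy = DataFrame({"NAME": ["Xanadu", "Ys"],
                 "P0010001": [20, 100],
                 "state": ["98", "99"]})

toy.set_index("NAME", inplace=True)  # NAME moves from the columns to the index
row = toy.loc["Ys"]                  # one row, returned as a Series

print(list(toy.columns))  # ['P0010001', 'state']
print(row["P0010001"])    # 100
```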

In [90]:

np.in1d([ s.fips for s in states.STATES], df.state)

Out[90]:

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True], dtype=bool)

In [91]:

df[np.in1d(df.state, [ s.fips for s in states.STATES])]

Out[91]:

	NAME	P0010001	state
0	Alabama	4779736	01
1	Alaska	710231	02
2	Arizona	6392017	04
3	Arkansas	2915918	05
4	California	37253956	06
5	Colorado	5029196	08
6	Connecticut	3574097	09
7	Delaware	897934	10
8	District of Columbia	601723	11
9	Florida	18801310	12
10	Georgia	9687653	13
11	Hawaii	1360301	15
12	Idaho	1567582	16
13	Illinois	12830632	17
14	Indiana	6483802	18
15	Iowa	3046355	19
16	Kansas	2853118	20
17	Kentucky	4339367	21
18	Louisiana	4533372	22
19	Maine	1328361	23
20	Maryland	5773552	24
21	Massachusetts	6547629	25
22	Michigan	9883640	26
23	Minnesota	5303925	27
24	Mississippi	2967297	28
25	Missouri	5988927	29
26	Montana	989415	30
27	Nebraska	1826341	31
28	Nevada	2700551	32
29	New Hampshire	1316470	33
30	New Jersey	8791894	34
31	New Mexico	2059179	35
32	New York	19378102	36
33	North Carolina	9535483	37
34	North Dakota	672591	38
35	Ohio	11536504	39
36	Oklahoma	3751351	40
37	Oregon	3831074	41
38	Pennsylvania	12702379	42
39	Rhode Island	1052567	44
40	South Carolina	4625364	45
41	South Dakota	814180	46
42	Tennessee	6346105	47
43	Texas	25145561	48
44	Utah	2763885	49
45	Vermont	625741	50
46	Virginia	8001024	51
47	Washington	6724540	53
48	West Virginia	1852994	54
49	Wisconsin	5686986	55
50	Wyoming	563626	56

51 rows × 3 columns
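
The two cells above test membership in opposite directions: In [90] asks whether every code from states.STATES appears in df, and In [91] keeps the rows of df whose code appears in states.STATES. The sketch below shows the same idea on hypothetical data, using the pandas Series.isin method in place of np.in1d; the names toy and wanted are made up for illustration.

```python
from pandas import DataFrame

# Hypothetical rows and codes, not Census data.
toy = DataFrame({"NAME": ["Xanadu", "Ys", "Zion"],
                 "state": ["01", "02", "72"]})
wanted = ["01", "02"]

mask = toy["state"].isin(wanted)  # True where the row's code is in wanted
filtered = toy[mask]              # rows whose code is in wanted
all_found = all(code in set(toy["state"]) for code in wanted)

print(list(mask))  # [True, True, False]
print(len(filtered), all_found)  # 2 True
```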