The upcoming midterm is on Day 17 (2014-03-18). It will probably consist mostly of multiple-choice questions.
The goal of this notebook is to help students prepare for the midterm by highlighting what we've covered so far.
[Working definition of open data](http://rdhyee.github.io/wwod14/day01.html#(19%29)
From http://en.wikipedia.org/w/index.php?title=Special:Cite&page=Open_data&id=532390265:
Open data is the idea that certain data should be freely available to everyone
to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.
A piece of content or data is open if anyone is free to use, reuse, and
redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.
[Day 1: OKFestival /OKCon as indicator of vibrancy of the international open data community](http://rdhyee.github.io/wwod14/day01.html#(20%29)
[Day 1: Examples of Open Data](http://rdhyee.github.io/wwod14/day01.html#(31%29)
PfDA, Chap 3: IPython: An Interactive Computing and Development Environment
PfDA, Appendix: Python Language Essentials -- to help remind yourself of key elements of standard Python
PfDA, Chap 2: Introductory Examples
Day_01_B_World_Population.ipynb
The Racial Dot Map: One Dot Per Person | Weldon Cooper Center for Public Service
pip and how to use it?
# set up your census object
# example from https://github.com/sunlightlabs/census
from census import Census
from us import states
import settings
c = Census(settings.CENSUS_KEY)
for (i, state) in enumerate(states.STATES):
    print i, state.name, state.fips
0 Alabama 01 1 Alaska 02 2 Arizona 04 3 Arkansas 05 4 California 06 5 Colorado 08 6 Connecticut 09 7 Delaware 10 8 District of Columbia 11 9 Florida 12 10 Georgia 13 11 Hawaii 15 12 Idaho 16 13 Illinois 17 14 Indiana 18 15 Iowa 19 16 Kansas 20 17 Kentucky 21 18 Louisiana 22 19 Maine 23 20 Maryland 24 21 Massachusetts 25 22 Michigan 26 23 Minnesota 27 24 Mississippi 28 25 Missouri 29 26 Montana 30 27 Nebraska 31 28 Nevada 32 29 New Hampshire 33 30 New Jersey 34 31 New Mexico 35 32 New York 36 33 North Carolina 37 34 North Dakota 38 35 Ohio 39 36 Oklahoma 40 37 Oregon 41 38 Pennsylvania 42 39 Rhode Island 44 40 South Carolina 45 41 South Dakota 46 42 Tennessee 47 43 Texas 48 44 Utah 49 45 Vermont 50 46 Virginia 51 47 Washington 53 48 West Virginia 54 49 Wisconsin 55 50 Wyoming 56
import requests
# get the total population of all states
url = "http://api.census.gov/data/2010/sf1?key={key}&get=P0010001,NAME&for=state:*".format(key=settings.CENSUS_KEY)
r = requests.get(url)
r.json()[:5]
[[u'P0010001', u'NAME', u'state'], [u'4779736', u'Alabama', u'01'], [u'710231', u'Alaska', u'02'], [u'6392017', u'Arizona', u'04'], [u'2915918', u'Arkansas', u'05']]
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})
[{u'NAME': u'California', u'P0010001': u'37253956', u'state': u'06'}]
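The raw API response shown above is a list of lists whose first row is the header. Here is a minimal sketch of turning that shape into records and totaling the population. The `sample` literal copies only the first five rows of the output above, so the total covers just those four states, and `to_records` is a hypothetical helper name, not part of the `census` package.

```python
def to_records(response_rows):
    """Convert [header, row, row, ...] into a list of dicts keyed by the header."""
    header, rows = response_rows[0], response_rows[1:]
    return [dict(zip(header, row)) for row in rows]


# First five rows of the response shown above (header + four states).
sample = [
    ['P0010001', 'NAME', 'state'],
    ['4779736', 'Alabama', '01'],
    ['710231', 'Alaska', '02'],
    ['6392017', 'Arizona', '04'],
    ['2915918', 'Arkansas', '05'],
]

records = to_records(sample)

# The API returns counts as strings, so convert them before summing.
partial_total = sum(int(rec['P0010001']) for rec in records)
print(partial_total)
```

Running the same two steps on the full `r.json()` result would give the total over every row the API returned.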
[Day 3: Key Concept for Today: Execution Environment of Python](http://rdhyee.github.io/wwod14/day03.html#(13%29)
[Day 3: How I use conda](http://rdhyee.github.io/wwod14/day03.html#(17%29)
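One quick way to see which execution environment a notebook or script is actually using is to ask the interpreter itself. This is a small sketch using only the standard `sys` module; `describe_environment` is a hypothetical helper name chosen for illustration.

```python
import sys


def describe_environment():
    """Report which interpreter is running and where it looks for modules."""
    return {
        "executable": sys.executable,          # path of the running Python binary
        "version": sys.version.split()[0],     # e.g. the "X.Y.Z" part of the version string
        "prefix": sys.prefix,                  # root of the active installation or environment
        "first_path_entries": sys.path[:3],    # where imports are searched first
    }


env = describe_environment()
for key in sorted(env):
    print("{0} -> {1}".format(key, env[key]))
```

If the `executable` path does not point into the environment you expected, imports such as `census` or `us` may fail even though you installed them elsewhere.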
PfDA, Appendix: Python Language Essentials -- to help remind yourself of key elements of standard Python
PfDA, Chap 3: IPython: An Interactive Computing and Development Environment
PfDA, Chap 2: Introductory Examples
[Day 4](http://rdhyee.github.io/wwod14/day04.html#(7%29): Work through Day_04_B_numpy_and_pandas_series.ipynb (everything before Advanced: Operator Overloading)
You should be able to calculate the total population of the US in the fill-in section of Day_04_C_Census.ipynb.
You should be able to calculate the population of California by totaling the county populations.
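As a reminder of the pattern (not the notebook's actual solution), totaling populations with a pandas Series might look like the sketch below. The county names and counts are placeholders, not real Census figures; they are strings because the API returns counts as strings.

```python
import pandas as pd

# Placeholder county counts (NOT real Census data), stored as strings like the API output.
county_pop = pd.Series(
    {"County A": "100", "County B": "250", "County C": "50"},
    name="P0010001",
)

# Convert to integers first; summing strings would concatenate them instead.
state_total = county_pop.astype(int).sum()
print(state_total)
```

With the real county rows for California, the total should match the state-level value of 37253956 returned by the `c.sf1.get` call earlier in this notebook.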
For [Day 5: Geographical Hierarchies in the Census](http://rdhyee.github.io/wwod14/day05.html#(1%29), study:
Day_06_C_Calculating_Diversity_Preview.ipynb and Day_06_D_Assignment
generators: Day 6: Generators for Geographic Entities
Day_06_D_Assignment.ipynb: exercise to write a generator for Census Places (answer: Day_06_E_Assignment_Answers.ipynb)
# You should understand how this works.
import pandas as pd
from pandas import DataFrame
import census
import settings
import us
from itertools import islice
c=census.Census(settings.CENSUS_KEY)
def places(variables="NAME"):
    for state in us.states.STATES:
        print state
        geo = {'for':'place:*', 'in':'state:{s_fips}'.format(s_fips=state.fips)}
        for place in c.sf1.get(variables, geo=geo):
            yield place
r = list(islice(places("NAME,P0010001"), None))
places_df = DataFrame(r)
places_df.P0010001 = places_df.P0010001.astype('int')
places_df['FIPS'] = places_df.apply(lambda s: s['state']+s['place'], axis=1)
print "number of places", len(places_df)
print "total pop", places_df.P0010001.sum()
places_df.head()
assert places_df.P0010001.sum() == 228457238
# number of places in 2010 Census
assert len(places_df) == 29261
Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware District of Columbia Florida Georgia Hawaii Idaho Illinois Indiana Iowa Kansas Kentucky Louisiana Maine Maryland Massachusetts Michigan Minnesota Mississippi Missouri Montana Nebraska Nevada New Hampshire New Jersey New Mexico New York North Carolina North Dakota Ohio Oklahoma Oregon Pennsylvania Rhode Island South Carolina South Dakota Tennessee Texas Utah Vermont Virginia Washington West Virginia Wisconsin Wyoming number of places 29261 total pop 228457238
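To see why the generator above pairs well with `islice`, here is a self-contained sketch that needs no API key. `FAKE_PLACES` and `fake_places` are hypothetical stand-ins for the per-state Census calls; the `calls` list records which states were actually "queried".

```python
from itertools import islice

# Hypothetical lookup standing in for the per-state API calls.
FAKE_PLACES = {
    "State A": ["place 1", "place 2"],
    "State B": ["place 3"],
    "State C": ["place 4", "place 5"],
}
calls = []  # which states the generator has reached so far


def fake_places():
    for state in sorted(FAKE_PLACES):
        calls.append(state)  # runs only when the generator gets to this state
        for place in FAKE_PLACES[state]:
            yield (state, place)


# Ask for just the first three items; the generator stops after producing them.
first_three = list(islice(fake_places(), 3))
print(first_three)
print(calls)
```

Because generators are lazy, "State C" is never visited here. Passing `None` as the stop value, as in `islice(places("NAME,P0010001"), None)`, consumes the whole generator; a small number is handy while testing.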
apply + lambda functions: Day_06_A_Apply_Lambda.ipynb
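The `FIPS` column in the places code above is built with exactly this apply + lambda pattern. Below is a minimal sketch on a toy DataFrame; the state and place codes are placeholders for illustration, not a real Census extract.

```python
import pandas as pd

# Placeholder codes for illustration only.
df = pd.DataFrame({
    "state": ["06", "06", "36"],
    "place": ["00001", "00002", "00003"],
})

# Row-wise: axis=1 hands each row to the lambda as a Series.
df["FIPS"] = df.apply(lambda s: s["state"] + s["place"], axis=1)

# For plain string concatenation, the column-wise form gives the same result.
df["FIPS_vectorized"] = df["state"] + df["place"]
print(df)
```

The row-wise form is the more general one: the lambda can contain any logic that needs several columns of the same row.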
http://www.census.gov/developers/data/sf1.xml
compare to http://www.census.gov/prod/cen2010/briefs/c2010br-02.pdf
I think the P0050001 might be the key category
P0050002 Not Hispanic or Latino (total) =
P0050003 Not Hispanic White only
P0050004 Not Hispanic Black only
P0050006 Not Hispanic Asian only
Not Hispanic Other (should be P0050002 - (P0050003 + P0050004 + P0050006))
P0050010 Hispanic or Latino
P0050010 = P0050011 + ... + P0050017
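The arithmetic in the list above can be sketched as a small function that collapses the P005 variables into five groups (White, Black, Asian, Hispanic, Other). This is an illustration under stated assumptions: `five_groups` is a hypothetical name, the numbers in `example` are placeholders rather than Census counts, and P0050001 is assumed to be the table total.

```python
def five_groups(row):
    """Collapse P005 counts (given as strings) into five groups."""
    white = int(row["P0050003"])
    black = int(row["P0050004"])
    asian = int(row["P0050006"])
    hispanic = int(row["P0050010"])
    # Everything non-Hispanic that is not White, Black, or Asian alone.
    other = int(row["P0050002"]) - (white + black + asian)
    return {
        "White": white,
        "Black": black,
        "Asian": asian,
        "Hispanic": hispanic,
        "Other": other,
    }


# Placeholder numbers, not real Census counts.
example = {
    "P0050001": "1000",
    "P0050002": "700",
    "P0050003": "400",
    "P0050004": "150",
    "P0050006": "100",
    "P0050010": "300",
}

groups = five_groups(example)
print(groups)

# Sanity check: the five groups should add back up to the assumed total.
assert sum(groups.values()) == int(example["P0050001"])
```

The closing check is worth running on real rows too, since it catches a wrong variable code quickly.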
"Whites are coded as blue; African-Americans, green; Asians, red; Hispanics, orange; and all other racial categories are coded as brown."
[Day 7: Preview of Plotting Graphs and Maps](http://rdhyee.github.io/wwod14/day07.html#(3%29)
Do the following notebooks work for you to show basic graphics?
Day_07_E_Census_fields.ipynb is an exploration of the concepts and variables in the 2010 Census.
Day_07_F_Groupby.ipynb: gives you background on how to understand and use groupby in Pandas. Don't miss AJ's Day_10_Groupby_Examples.ipynb, which should be helpful, especially if you found Day_07_F_Groupby.ipynb obscure.
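As a quick refresher on the split-apply-combine idea behind groupby, here is a minimal sketch. The DataFrame mimics the shape of `places_df` from earlier in this notebook, but every value in it is a placeholder, not Census data.

```python
import pandas as pd

# Placeholder rows (not real Census figures).
places_df = pd.DataFrame({
    "state": ["06", "06", "36", "36", "36"],
    "NAME": ["Place A", "Place B", "Place C", "Place D", "Place E"],
    "P0010001": [100, 300, 50, 150, 200],
})

# Split the rows by state, then apply several aggregations to the population column.
by_state = places_df.groupby("state")["P0010001"]
summary = by_state.agg(["sum", "count", "max"])
print(summary)
```

Each row of `summary` corresponds to one group (one state code), and each column to one aggregation.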
Day_07_G_Calculating_Diversity.ipynb: a prelude to the big diversity-calculation assignment Day_08_A_Metro_Diversity.ipynb
Projects are not a focal point for the midterm (though, of course, it's good for projects to be in the background of your thinking)
Relevant references:
I will assume that you've read Chapter 8 of PfDA and can run Day_11_B_Setting_Up_for_PfDA.ipynb.
Study the overview slide: [Day 12: Overview of Plotting Options](http://rdhyee.github.io/wwod14/day12.html#(3%29).
Note some fundamental conceptual aspects of matplotlib (as I outline in Day_12_A_Matplotlib_Intro.ipynb), and try to make basic plots on your own (line plots, scatter plots, bar plots).
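For practice, the three basic plot types can be drawn side by side with the figure/axes style of matplotlib. This is only a sketch with made-up numbers; the `Agg` backend line is an assumption for running outside a notebook and can be dropped when plotting inline.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

# One figure holding three axes, one per plot type.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].plot(x, y)
axes[0].set_title("line plot")

axes[1].scatter(x, y)
axes[1].set_title("scatter plot")

axes[2].bar(x, y)
axes[2].set_title("bar plot")

for ax in axes:
    ax.set_xlabel("x")
    ax.set_ylabel("y")

fig.tight_layout()
```

The point to notice is the hierarchy: a figure contains axes, and each plotting call is a method on a specific axes object.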
Day_12_B_Baby_Names_Starter.ipynb#Names-that-are-both-M-and-F
Before you use Day_13_C_Baby_Names_MF_Completed.ipynb, try the approach in Day_13_B_Baby_Names_MF_Starter.ipynb
Assignment in nbviewer.ipython.org/github/rdhyee/working-open-data-2014/blob/master/notebooks/Day_13_B_Baby_Names_MF_Starter.ipynb:
Submit a notebook that describes what you've learned about the nature of
ambigendered names in the baby names database. (Due date: Wed, March 12 at
11:5pm, moved from Monday, March 10 --> bCourses assignment) I'm interested in seeing what you do
with the data set in this regard. At the minimum, show that you are able to run
Day_13_C_Baby_Names_MF_Completed. Be creative and have fun.
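One possible starting point for finding names that appear as both M and F is a pivot table of births by name and sex. The sketch below is not the approach of the completed notebook; it assumes the `name`, `sex`, `births` column layout of the PfDA baby names data, and all rows are placeholders.

```python
import pandas as pd

# Placeholder records in an assumed (name, sex, births) layout; not real counts.
names = pd.DataFrame({
    "name": ["Alex", "Alex", "Mary", "John", "Jordan", "Jordan"],
    "sex": ["F", "M", "F", "M", "F", "M"],
    "births": [20, 80, 100, 120, 30, 70],
})

# One row per name, one column per sex; missing combinations become 0.
by_sex = names.pivot_table(
    values="births", index="name", columns="sex", aggfunc="sum", fill_value=0
)

# Keep only names with births recorded under both sexes.
both = by_sex[(by_sex["F"] > 0) & (by_sex["M"] > 0)].copy()
both["prop_F"] = both["F"] / (both["F"] + both["M"])
print(both)
```

From here you could ask how `prop_F` for a given name changes over the years, which requires keeping the `year` column in the pivot.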
[Day 13: mpld3 references](http://rdhyee.github.io/wwod14/day13.html#(5%29)