Midterm¶

The upcoming midterm is on Day 17 (2014-03-18). It will probably consist mostly of multiple-choice questions.

The goal of this notebook is to help students prepare for the midterm by highlighting what we've covered so far.

Suggestions about How to Prepare¶

read through all the course materials so far and outline what you understand and what you don't.
focus on key concepts and on the programming constructs that recur often.

Open Data¶

[Working definition of open data](http://rdhyee.github.io/wwod14/day01.html#(19%29)

From http://en.wikipedia.org/w/index.php?title=Special:Cite&page=Open_data&id=532390265:

Open data is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.

http://opendefinition.org/:

A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.

Examples of Open Data¶

[Day 1: OKFestival /OKCon as indicator of vibrancy of the international open data community](http://rdhyee.github.io/wwod14/day01.html#(20%29)

[Day 1: Examples of Open Data](http://rdhyee.github.io/wwod14/day01.html#(31%29)

Readings from Day 1 ¶

read Python for Data Analysis (PfDA), Chap 1. Preliminaries (Safari Books Online). The instructions for using Enthought Python Distribution are out of date. If you are looking for a distribution, follow the installation instructions for Anaconda for your computer platform.
read PfDA, Chap 3. IPython: An Interactive Computing and Development Environment
skim PfDA, Appendix: Python Language Essentials -- to help remind yourself of key elements of standard Python
skim PfDA, Chap 2 Introductory Examples

World Populations¶

Day_01_B_World_Population.ipynb

How was the JSON data from Wikipedia and the CIA Factbook produced?
Why do the totals from the two sources differ?

Racial Dot Map (as a framing example)¶

The Racial Dot Map: One Dot Per Person | Weldon Cooper Center for Public Service

What is the Racial Dot Map displaying?
How would you get data relevant to the Racial Dot Map from the Census API?

Census API¶

Day_02_A_US_Census_API.ipynb

What's the purpose of an API key?
What is pip, and how do you use it?
Remember that you sometimes have to filter out Puerto Rico

In [1]:

# set up your census object
# example from https://github.com/sunlightlabs/census

from census import Census
from us import states

import settings

c = Census(settings.CENSUS_KEY)
for (i, state) in enumerate(states.STATES):
    print i, state.name, state.fips

0 Alabama 01
1 Alaska 02
2 Arizona 04
3 Arkansas 05
4 California 06
5 Colorado 08
6 Connecticut 09
7 Delaware 10
8 District of Columbia 11
9 Florida 12
10 Georgia 13
11 Hawaii 15
12 Idaho 16
13 Illinois 17
14 Indiana 18
15 Iowa 19
16 Kansas 20
17 Kentucky 21
18 Louisiana 22
19 Maine 23
20 Maryland 24
21 Massachusetts 25
22 Michigan 26
23 Minnesota 27
24 Mississippi 28
25 Missouri 29
26 Montana 30
27 Nebraska 31
28 Nevada 32
29 New Hampshire 33
30 New Jersey 34
31 New Mexico 35
32 New York 36
33 North Carolina 37
34 North Dakota 38
35 Ohio 39
36 Oklahoma 40
37 Oregon 41
38 Pennsylvania 42
39 Rhode Island 44
40 South Carolina 45
41 South Dakota 46
42 Tennessee 47
43 Texas 48
44 Utah 49
45 Vermont 50
46 Virginia 51
47 Washington 53
48 West Virginia 54
49 Wisconsin 55
50 Wyoming 56

Formulating URL requests to the API explicitly

In [2]:

import requests
# get the total population of all states
url = "http://api.census.gov/data/2010/sf1?key={key}&get=P0010001,NAME&for=state:*".format(key=settings.CENSUS_KEY)
r = requests.get(url)

r.json()[:5]

Out[2]:

[[u'P0010001', u'NAME', u'state'],
 [u'4779736', u'Alabama', u'01'],
 [u'710231', u'Alaska', u'02'],
 [u'6392017', u'Arizona', u'04'],
 [u'2915918', u'Arkansas', u'05']]
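The JSON returned by the API is a list of rows in which the first row holds the column names. As a minimal sketch (not part of the original notebook), the rows shown in Out[2] can be paired with that header to build one dictionary per state. The helper name `rows_to_records` is made up for illustration.

```python
# Rows copied from Out[2] above: a header row followed by data rows.
rows = [
    ["P0010001", "NAME", "state"],
    ["4779736", "Alabama", "01"],
    ["710231", "Alaska", "02"],
    ["6392017", "Arizona", "04"],
    ["2915918", "Arkansas", "05"],
]


def rows_to_records(rows):
    """Pair each data row with the header row to build one dict per state."""
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


records = rows_to_records(rows)

# The API returns every value as a string, so convert before doing arithmetic.
populations = {rec["NAME"]: int(rec["P0010001"]) for rec in records}
total_of_sample = sum(populations.values())
```

The same conversion applies to the full response; only the first four states are used here because those are the rows displayed above.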

Total Population of California

In [3]:

c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})

Out[3]:

[{u'NAME': u'California', u'P0010001': u'37253956', u'state': u'06'}]

Execution environments for programs¶

[Day 3: Key Concept for Today: Execution Environment of Python](http://rdhyee.github.io/wwod14/day03.html#(13%29)

[Day 3: How I use conda](http://rdhyee.github.io/wwod14/day03.html#(17%29)

Learning the Basics of NumPy and Pandas¶

read Python for Data Analysis (PfDA), Chap 1. Preliminaries (Safari Books Online). The instructions for using Enthought Python Distribution are out of date. If you are looking for a distribution, follow the installation instructions for Anaconda for your computer platform.
skim PfDA, Appendix: Python Language Essentials -- to help remind yourself of key elements of standard Python
read PfDA, Chap 3. IPython: An Interactive Computing and Development Environment
read PfDA, Chap 2 Introductory Examples

[Day 4](http://rdhyee.github.io/wwod14/day04.html#(7%29): Work through Day_04_B_numpy_and_pandas_series.ipynb (everything before Advanced: Operator Overloading)

Census API skills¶

You should be able to calculate the total population of the US in the fill-in section of Day_04_C_Census.ipynb.

You should be able to calculate the population of California by totaling the county populations.
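One way to approach the county-based total, sketched under assumptions: the commented-out call mirrors the `c.sf1.get` pattern used elsewhere in this notebook, with a `county:*` geography that I am assuming works the same way as `place:*`. The three records are made-up placeholders (not real census figures) so that the summing logic can run without an API key, and `total_population` is a hypothetical helper name.

```python
# With a configured Census object `c`, the county records for California
# could presumably be fetched with the same pattern used elsewhere here:
#
#   counties = c.sf1.get(('NAME', 'P0010001'),
#                        geo={'for': 'county:*', 'in': 'state:06'})
#
# Placeholder records in the same shape (values are NOT real census counts).
counties = [
    {"NAME": "County A", "P0010001": "1000", "state": "06", "county": "001"},
    {"NAME": "County B", "P0010001": "2500", "state": "06", "county": "003"},
    {"NAME": "County C", "P0010001": "400", "state": "06", "county": "005"},
]


def total_population(records, variable="P0010001"):
    """Sum a population variable over records, converting strings to int."""
    return sum(int(rec[variable]) for rec in records)


state_total = total_population(counties)
```

With real data, the result can be checked against the state-level figure returned by the `state:06` query.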

For [Day 5: Geographical Hierarchies in the Census](http://rdhyee.github.io/wwod14/day05.html#(1%29), study:

Day_05_A_Geographical_Hierarchies.ipynb
and answers (Day_05_B_Geographical_Hierarchies.ipynb)

Day_06_C_Calculating_Diversity_Preview.ipynb and Day_06_D_Assignment.ipynb

Generators¶

generators Day 6: Generators for Geographic Entities

Day_06_D_Assignment.ipynb: exercise to write a generator for Census Places (answer: Day_06_E_Assignment_Answers.ipynb)

In [4]:

# You should understand how this works.

import pandas as pd
from pandas import DataFrame

import census
import settings
import us

from itertools import islice

c=census.Census(settings.CENSUS_KEY)

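def places(variables="NAME"):
    """Generator: yield one Census place record at a time, state by state."""
    for state in us.states.STATES:
        # progress indicator: one API request is made per state
        print state
        # geography: all places ('place:*') within the current state
        geo = {'for':'place:*', 'in':'state:{s_fips}'.format(s_fips=state.fips)}
        for place in c.sf1.get(variables, geo=geo):
            yield place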

r = list(islice(places("NAME,P0010001"), None))
places_df = DataFrame(r)
places_df.P0010001 = places_df.P0010001.astype('int')

places_df['FIPS'] = places_df.apply(lambda s: s['state']+s['place'], axis=1)

print "number of places", len(places_df)
print "total pop", places_df.P0010001.sum()
places_df.head()

assert places_df.P0010001.sum() == 228457238
# number of places in 2010 Census
assert len(places_df) == 29261

Alabama
Alaska
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
District of Columbia
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire
New Jersey
New Mexico
New York
North Carolina
North Dakota
Ohio
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Tennessee
Texas
Utah
Vermont
Virginia
Washington
West Virginia
Wisconsin
Wyoming
number of places 29261
total pop 228457238
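To see why a generator is useful here, it can help to watch one run lazily. The sketch below is a network-free stand-in for `places`: the `FAKE_PLACES` table, the `fake_places` generator, and the `visited` list are all hypothetical names introduced for illustration. It shows that `islice` with a limit stops pulling records early, while `islice(..., None)` (as used above) consumes everything.

```python
from itertools import islice

# Hypothetical stand-in for the API: state FIPS -> place names.
FAKE_PLACES = {
    "01": ["Place A", "Place B"],
    "02": ["Place C"],
    "04": ["Place D", "Place E", "Place F"],
}

visited = []  # records which states were actually "requested"


def fake_places():
    """Yield one place record at a time, state by state."""
    for s_fips in sorted(FAKE_PLACES):
        visited.append(s_fips)  # stands in for the API request per state
        for name in FAKE_PLACES[s_fips]:
            yield {"state": s_fips, "NAME": name}


# islice pulls only the first three records, so the generator never
# reaches state "04".
first_three = list(islice(fake_places(), 3))
visited_after_three = list(visited)

# islice(..., None) applies no limit and consumes everything.
del visited[:]
everything = list(islice(fake_places(), None))
```

Replacing `None` with a small number is a handy way to test a slow generator on a few records before running it over all states.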

Apply and lambda functions¶

apply + lambda functions: Day_06_A_Apply_Lambda.ipynb
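As a small sketch of the pattern (using made-up rows, with illustrative codes rather than real places), the example below contrasts `DataFrame.apply` with `axis=1`, which hands each row to the lambda, with `Series.apply`, which hands it each value. The `FIPS` line mirrors the one used in the generator cell of this notebook; the `is_large` column is a hypothetical addition.

```python
import pandas as pd

# Made-up rows in the shape returned for places (codes are illustrative).
df = pd.DataFrame({
    "NAME": ["Place A", "Place B", "Place C"],
    "state": ["06", "06", "36"],
    "place": ["00001", "00002", "00003"],
    "P0010001": ["100", "250", "75"],
})

# Convert the population strings to integers, column-wise.
df["P0010001"] = df["P0010001"].astype(int)

# axis=1 passes each ROW to the lambda as a Series, so columns can be
# combined by label.
df["FIPS"] = df.apply(lambda row: row["state"] + row["place"], axis=1)

# Series.apply calls the lambda on each individual VALUE instead.
df["is_large"] = df["P0010001"].apply(lambda pop: pop >= 100)
```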

P005* variables in the census¶

http://www.census.gov/developers/data/sf1.xml

compare to http://www.census.gov/prod/cen2010/briefs/c2010br-02.pdf

I think the P0050001 might be the key category

P0010001 = P0050001
P0050001 = P0050002 + P0050010

P0050002 Not Hispanic or Latino (total) =

P0050003 Not Hispanic White only
P0050004 Not Hispanic Black only
P0050006 Not Hispanic Asian only
Not Hispanic Other, which should also equal P0050002 - (P0050003 + P0050004 + P0050006):
- P0050005 Not Hispanic: American Indian and Alaska Native alone
- P0050007 Not Hispanic: Native Hawaiian and Other Pacific Islander alone
- P0050008 Not Hispanic: Some Other Race alone
- P0050009 Not Hispanic: Two or More Races
P0050010 Hispanic or Latino

P0050010 = P0050011 + ... + P0050017 (the sum of P0050011 through P0050017)
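The relationships above can be checked with a little arithmetic. The sketch below uses hypothetical counts for a single geography (not real census data) and collapses the P005 variables into the five groups the Racial Dot Map uses: White, Black, Asian, Hispanic, and all other. The function name `dot_map_categories` is made up for illustration.

```python
# Hypothetical P005 counts for a single geography (not real census data).
counts = {
    "P0050003": 500,  # Not Hispanic: White alone
    "P0050004": 120,  # Not Hispanic: Black or African American alone
    "P0050005": 10,   # Not Hispanic: American Indian and Alaska Native alone
    "P0050006": 90,   # Not Hispanic: Asian alone
    "P0050007": 5,    # Not Hispanic: Native Hawaiian and Other Pacific Islander alone
    "P0050008": 15,   # Not Hispanic: Some Other Race alone
    "P0050009": 40,   # Not Hispanic: Two or More Races
    "P0050010": 220,  # Hispanic or Latino
}
# Totals implied by the relationships listed above.
counts["P0050002"] = sum(counts["P005000%d" % i] for i in range(3, 10))
counts["P0050001"] = counts["P0050002"] + counts["P0050010"]


def dot_map_categories(c):
    """Collapse P005 variables into the five Racial Dot Map categories."""
    white, black, asian = c["P0050003"], c["P0050004"], c["P0050006"]
    return {
        "White": white,
        "Black": black,
        "Asian": asian,
        "Hispanic": c["P0050010"],
        "Other": c["P0050002"] - (white + black + asian),
    }


categories = dot_map_categories(counts)
```

Because the five categories partition the population, their sum should equal P0050001, which is a useful sanity check on real data as well.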

"Whites are coded as blue; African-Americans, green; Asians, red; Hispanics, orange; and all other racial categories are coded as brown."

Some graphics demonstrations you should try¶

[Day 7: Preview of Plotting Graphs and Maps](http://rdhyee.github.io/wwod14/day07.html#(3%29)

Do the following notebooks work for you to show basic graphics?

Census has lots of interesting data (optional)¶

Day_07_E_Census_fields.ipynb is an exploration of the concepts and variables in the 2010 Census.

Groupby¶

Day_07_F_Groupby.ipynb gives you background on how to understand and use groupby in Pandas. Don't miss AJ's Day_10_Groupby_Examples.ipynb, which should be helpful, especially if you found Day_07_F_Groupby.ipynb obscure.
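As a quick reminder of the split-apply-combine idea behind groupby, here is a minimal sketch on made-up place populations (illustrative values only, not taken from the course notebooks). It groups rows by state and then aggregates within each group.

```python
import pandas as pd

# Made-up place populations (illustrative values only).
places = pd.DataFrame({
    "state": ["06", "06", "36", "36", "36"],
    "NAME": ["Place A", "Place B", "Place C", "Place D", "Place E"],
    "P0010001": [100, 250, 75, 300, 25],
})

# Split rows by state, then sum the population within each group.
pop_by_state = places.groupby("state")["P0010001"].sum()

# Several aggregations at once: number of places and largest place per state.
summary = places.groupby("state")["P0010001"].agg(["count", "max"])
```

`pop_by_state` is a Series indexed by state, and `summary` is a DataFrame with one column per aggregation.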

Census Metro Diversity Exercise¶

Day_07_G_Calculating_Diversity.ipynb: a prelude to the big diversity-calculation assignment Day_08_A_Metro_Diversity.ipynb
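The notebooks define the diversity measure actually used in the assignment, so treat the following only as an illustration. Assuming a Shannon-entropy style index (one common choice, swapped in here as an example), a sketch of computing it from group counts might look like this. Both function names are hypothetical.

```python
from math import log


def entropy(counts):
    """Shannon entropy (natural log) of a list of group counts."""
    total = sum(counts)
    # Groups with zero members contribute nothing and are skipped.
    proportions = [c / float(total) for c in counts if c > 0]
    return -sum(p * log(p) for p in proportions)


def normalized_entropy(counts):
    """Entropy divided by its maximum, log(number of groups): 0 to 1."""
    return entropy(counts) / log(len(counts))


# Five equally sized groups give the maximum value; one group gives zero.
evenly_mixed = normalized_entropy([200, 200, 200, 200, 200])
single_group = normalized_entropy([1000, 0, 0, 0, 0])
```

Dividing by the log of the number of groups makes values comparable across analyses that use the same set of categories.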

Projects¶

not a focal point for the midterm (though, of course, it's good for projects to be in the background of your thinking)

Relevant references:

Day 9: Creating Projects
[Day 9: Creating Projects: Project Topic Ideas](http://rdhyee.github.io/wwod14/day09.html#(5%29)
Project-Starter_OpenContext.ipynb
[Day 11: Project Brainstorming](http://rdhyee.github.io/wwod14/day11.html#(6%29)

Plotting and Mapping preparation¶

I will assume that you've read Chapter 8 of PfDA and can run Day_11_B_Setting_Up_for_PfDA.ipynb.

Study the overview slide: [Day 12: Overview of Plotting Options](http://rdhyee.github.io/wwod14/day12.html#(3%29).

Note some fundamental conceptual aspects of matplotlib (as I outline in Day_12_A_Matplotlib_Intro.ipynb), and try to make basic plots on your own (line plots, scatter plots, bar plots).
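As a starting point for practicing, here is a minimal sketch that draws the three basic plot types side by side. The data are arbitrary, and the `Agg` backend line is an assumption that makes the code run outside a notebook; inside IPython you would more likely use `%matplotlib inline`. One distinction that often helps is between the Figure (the whole canvas) and each Axes (one plot area within it).

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

# A Figure is the whole canvas; each Axes is one plot area within it.
fig, (ax_line, ax_scatter, ax_bar) = plt.subplots(1, 3, figsize=(12, 3))

ax_line.plot(x, y)
ax_line.set_title("line plot")

ax_scatter.scatter(x, y)
ax_scatter.set_title("scatter plot")

ax_bar.bar(x, y)
ax_bar.set_title("bar plot")

fig.tight_layout()
```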

Baby Names¶

Day_12_B_Baby_Names_Starter.ipynb#Names-that-are-both-M-and-F

Before you use Day_13_C_Baby_Names_MF_Completed.ipynb, try the approach in Day_13_B_Baby_Names_MF_Starter.ipynb

Assignment in nbviewer.ipython.org/github/rdhyee/working-open-data-2014/blob/master/notebooks/Day_13_B_Baby_Names_MF_Starter.ipynb:

Submit a notebook that describes what you've learned about the nature of ambigendered names in the baby names database. (Due date: ~~Monday, March 10~~ Wed, March 12 at 11:5pm --> bCourses assignment) I'm interested in seeing what you do with the data set in this regard. At the minimum, show that you are able to run Day_13_C_Baby_Names_MF_Completed. Be creative and have fun.

mpld3¶

[Day 13: mpld3 references](http://rdhyee.github.io/wwod14/day13.html#(5%29)

Day_13_A_mpl3d.ipynb

pivot_table¶

Day_14_A_Pivot_table_example.ipynb
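To connect pivot_table with the baby names exercise, here is a minimal sketch on made-up records (not real counts). It assumes the `index`/`columns` keyword names of recent pandas releases; older versions may spell these differently. The table has one row per name and one column per sex, which makes names recorded for both sexes easy to pick out.

```python
import pandas as pd

# Made-up records in the spirit of the baby names data (not real counts).
names = pd.DataFrame({
    "name": ["Alex", "Alex", "Jordan", "Jordan", "Mary", "John"],
    "sex": ["F", "M", "F", "M", "F", "M"],
    "births": [30, 70, 40, 60, 100, 90],
})

# One row per name, one column per sex; missing combinations become 0.
by_sex = names.pivot_table(values="births", index="name", columns="sex",
                           aggfunc="sum", fill_value=0)

# Names recorded for both sexes have a nonzero count in both columns.
both = by_sex[(by_sex["F"] > 0) & (by_sex["M"] > 0)]
```

Here `both` keeps only the rows for names that appear with both F and M births.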