Chapter 2, 3 of PDA
import matplotlib.pyplot as plt
import numpy as np
from pylab import figure, show
from pandas import DataFrame, Series
import pandas as pd
To make it more practical for me to look at your homework, I'm again going to assume a relative placement of files. I placed the files from
https://github.com/pydata/pydata-book
in a local directory, which in my case is "/Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/"
and then symbolically linked (ln -s
) to the the pydata-book from the root directory of the working-open-data folder. i.e., on OS X
cd /Users/raymondyee/D/Document/Working_with_Open_Data/working-open-data
ln -s /Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/ pydata-book
That way the files from the pydata-book repository look like they sit in the working-open-data directory -- without having to actually copy the files.
With this arrangment, I should then be able to drop your notebook into my own notebooks directory and run them without having to mess around with paths.
import os
USAGOV_BITLY_PATH = os.path.join(os.pardir, "pydata-book", "ch02", "usagov_bitly_data2012-03-16-1331923249.txt")
MOVIELENS_DIR = os.path.join(os.pardir, "pydata-book", "ch02", "movielens")
NAMES_DIR = os.path.join(os.pardir, "pydata-book", "ch02", "names")
assert os.path.exists(USAGOV_BITLY_PATH)
assert os.path.exists(MOVIELENS_DIR)
assert os.path.exists(NAMES_DIR)
Please make sure the above assertions work
(PfDA
, p. 18)
What's in the data file?
In 2011, URL shortening service bit.ly partnered with the United States government website usa.gov to provide a feed of anonymous data gathered from users who shorten links ending with .gov or .mil.
Hourly archive of data: http://bitly.measuredvoice.com/bitly_archive/?C=M;O=D
open(USAGOV_BITLY_PATH).readline()
import json
records = [json.loads(line) for line in open(USAGOV_BITLY_PATH)] # list comprehension
Recall what records
is
len(records)
# list of dict -> DataFrame
frame = DataFrame(records)
frame
tz_counts = frame['tz'].value_counts()
tz_counts[:10]
# fillna
clean_tz = frame['tz'].fillna('Missing')
tz_counts = clean_tz.value_counts()
print tz_counts[:10]
(clean_tz == '').value_counts()
# '' -> 'Unknown'
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
tz_counts[:10]
frame['a'][1]
frame['a'][50]
frame['a'][51]
tz_counts[:10].plot(kind='barh', rot=0)
results = Series([x.split()[0] for x in frame.a.dropna()])
results[:5]
results.value_counts()[:8]
frame.a.notnull()
frame[frame.a.notnull()]
cframe = frame[frame.a.notnull()]
meaning of other attributes?
http://www.usa.gov/About/developer-resources/1usagov.shtml#data
frame.ll.notnull()
Hints:
apply
on Series
frame.t.dropna().apply(datetime.datetime.fromtimestamp)
# FILL IN
assert earliest_dt == datetime.datetime(2012, 3, 16, 11, 40, 47)
assert latest_dt == datetime.datetime(2012, 3, 16, 12, 40, 49)
Hints:
netlocs
as a Series, indexed by Network location part (http://docs.python.org/2/library/urlparse.html) of frame.u
, and holding the number of times that netloc occurs in frame.u
and sorted in descending order by that number.frame.u
# FILL IN
# https://github.com/pydata/pandas/issues/240
assert isinstance(netlocs, Series)
assert set(list(netlocs[:5].iteritems())) == set([(u'www.whitehouse.gov', 169),
(u'www.monroecounty.gov', 121),
(u'www.fda.gov', 112),
(u'www.nasa.gov', 733),
(u'www.nysdot.gov', 836)])
import pandas as pd
import codecs
names1880_file = codecs.open(os.path.join(NAMES_DIR,'yob2010.txt'), encoding='iso-8859-1')
names1880 = pd.read_csv(names1880_file, names=['name', 'sex', 'births'])
names1880
# sort by name
names1880.sort('births', ascending=False)[:10]
names1880[names1880.sex == 'F'].sort('births', ascending=False)[:10]
names1880['births'].plot()
names1880['births'].order(ascending=False).plot()
names1880['births'].order(ascending=False).cumsum().plot()
names1880['births'].count()
names1880.groupby('sex').births.sum()
# 2010 is the last available year right now
import os
years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']
for year in years:
path = os.path.join(NAMES_DIR, 'yob%d.txt' % year)
frame = pd.read_csv(path, names=columns)
frame['year'] = year
pieces.append(frame)
# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)
names
total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)
total_births[:5]
# how to calculate the total births / year?
# add prop
def add_prop(group):
# Integer division floors
births = group.births.astype(float)
group['prop'] = births / births.sum()
return group
names = names.groupby(['year', 'sex']).apply(add_prop)
# verify prop
np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)
total_births.plot(title='Total births by sex and year')