This notebook will walk you through some exercises to get practice using Pandas for data manipulation.
As you use this, feel free to make ample use of the Pandas Documentation, the Pandas StackOverflow Channel, and your favorite search engine. For example, if you search phrases like "Pandas sum all columns", you're very likely to find an answer to the question you have in mind.
Also, if it comes down to it, note that solutions are available in the Git repository.
# Start with our normal batch of imports and settings
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Following is optional: set plotting styles
import seaborn; seaborn.set()
In the lecture, we looked at the US Social Security Baby Names data. Here let's dive a little bit deeper into this.
Try to do the following with the Baby Names data.
(Here you can copy the code from the other notebook; make sure you understand what it's doing!)
def load_year(year):
data = pd.read_csv('../data/names/yob{0}.txt'.format(year),
names=['name', 'gender', 'births'])
data['year'] = year
return data
names = pd.concat([load_year(year) for year in range(1880, 2014)])
names.head()
name | gender | births | year | |
---|---|---|---|---|
0 | Mary | F | 7065 | 1880 |
1 | Anna | F | 2604 | 1880 |
2 | Emma | F | 2003 | 1880 |
3 | Elizabeth | F | 1939 | 1880 |
4 | Minnie | F | 1746 | 1880 |
Note: there are multiple ways to do this, but the first part will use masking and pivot tables, while the second part might also throw-in a groupby.
# Plot the prevalence of my name over time
my_name = 'Jacob'
subset = names[names.name == my_name]
births = subset.pivot_table('births', index='year',
columns='gender', aggfunc='sum')
births.plot();
# Whoah... there are some female "Jacob"s? Let's look at this:
births['F'].fillna(0).plot();
# Now we'll normalize the births against the total for each year
def add_birth_frac(group):
group['birth_frac'] = group.births / group.births.sum()
return group
names = names.groupby(['year', 'gender']).apply(add_birth_frac)
names.head()
name | gender | births | year | birth_frac | |
---|---|---|---|---|---|
0 | Mary | F | 7065 | 1880 | 0.077643 |
1 | Anna | F | 2604 | 1880 | 0.028618 |
2 | Emma | F | 2003 | 1880 | 0.022013 |
3 | Elizabeth | F | 1939 | 1880 | 0.021309 |
4 | Minnie | F | 1746 | 1880 | 0.019188 |
names[names.name == my_name].pivot_table('birth_frac', index='year',
columns='gender', aggfunc='sum').plot();
My name (Jacob) seems to have dipped in popularity in the 1960s, only to make a profound rise around the time I was born (1981). I'll choose to presume that I was the trend-setter.
This is a bit tricky: you might be tempted to use a groupby
and apply
over the multiple indices ['year', 'gender', 'name']
, but if you try this you'll find that it's very computationally intensive.
I'd suggest doing the following:
Is a name more likely to transition from female to male, or from male to female?
births = names.pivot_table('births', index=['year', 'name'], columns='gender')
births = births.fillna(0)
births.head()
gender | F | M | |
---|---|---|---|
year | name | ||
1880 | Aaron | 0 | 102 |
Ab | 0 | 5 | |
Abbie | 71 | 0 | |
Abbott | 0 | 5 | |
Abby | 6 | 0 |
early_period = births[:1920].fillna(0).sum(axis=0, level=1)
late_period = births[1980:].fillna(0).sum(axis=0, level=1)
early_period.head()
gender | F | M |
---|---|---|
name | ||
Aaron | 0 | 196 |
Ab | 0 | 5 |
Abbie | 152 | 0 |
Abbott | 0 | 5 |
Abby | 13 | 0 |
early_pct_M = 100 * early_period['M'] / early_period.sum(1)
late_pct_M = 100 * late_period['M'] / late_period.sum(1)
fractions = pd.DataFrame({'early_pct_M': early_pct_M,
'late_pct_M': late_pct_M})
fractions.head()
early_pct_M | late_pct_M | |
---|---|---|
Aaban | NaN | 100 |
Aabha | NaN | 0 |
Aabid | NaN | 100 |
Aabriella | NaN | 0 |
Aadam | NaN | 100 |
fractions[(fractions.early_pct_M < 40) & (fractions.late_pct_M > 60)]
early_pct_M | late_pct_M | |
---|---|---|
Byrd | 0 | 74.163180 |
Donnie | 0 | 89.382491 |
Texas | 0 | 61.595547 |
fractions[(fractions.late_pct_M < 40) & (fractions.early_pct_M > 60)]
early_pct_M | late_pct_M | |
---|---|---|
Addison | 100.000000 | 11.834706 |
Allison | 100.000000 | 1.339702 |
Ashley | 100.000000 | 1.829521 |
Aubrey | 100.000000 | 27.222843 |
Avery | 100.000000 | 35.322797 |
Bailey | 100.000000 | 19.046721 |
Beverly | 100.000000 | 1.213195 |
Billie | 100.000000 | 23.801506 |
Dana | 100.000000 | 21.765366 |
Dee | 76.923077 | 31.270896 |
Edie | 100.000000 | 1.747844 |
Hollie | 100.000000 | 7.608124 |
Holly | 100.000000 | 0.772741 |
Ivey | 100.000000 | 38.790821 |
Jodie | 100.000000 | 11.670106 |
Kelly | 100.000000 | 14.757141 |
Leslie | 90.804598 | 29.935140 |
Lindsay | 100.000000 | 3.829148 |
Lindsey | 100.000000 | 4.732257 |
Lonie | 100.000000 | 38.973799 |
Lynn | 85.714286 | 22.320630 |
Madison | 100.000000 | 2.156303 |
Marion | 62.171053 | 27.595220 |
Merida | 100.000000 | 0.000000 |
Morgan | 100.000000 | 16.627275 |
Nova | 100.000000 | 7.898499 |
Orla | 100.000000 | 29.629630 |
Ossie | 64.285714 | 36.361646 |
Pat | 100.000000 | 39.964689 |
Payton | 100.000000 | 33.960323 |
Presley | 100.000000 | 20.217981 |
Reese | 100.000000 | 38.903715 |
Sandy | 100.000000 | 13.159606 |
Shelby | 100.000000 | 10.215817 |
Stacy | 100.000000 | 11.978544 |
Sydney | 100.000000 | 5.343434 |
Taylor | 100.000000 | 26.055568 |
Tracy | 100.000000 | 19.573779 |
Apparently it's much more common for a name to transition from male to female than the other way around!