Solution: Hands-on with Pandas¶

This notebook will walk you through some exercises to get practice using Pandas for data manipulation.

As you use this, feel free to make ample use of the Pandas Documentation, the Pandas StackOverflow Channel, and your favorite search engine. For example, if you search phrases like "Pandas sum all columns", you're very likely to find an answer to the question you have in mind.

Also, if it comes down to it, note that solutions are available in the Git repository.

In [1]:

# Start with our normal batch of imports and settings
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Following is optional: set plotting styles
import seaborn; seaborn.set()

Diving Deeper into Baby Names¶

In the lecture, we looked at the US Social Security Baby Names data. Here let's dive a little bit deeper into this.

Try to do the following with the Baby Names data.

0. Load the baby names data¶

(Here you can copy the code from the other notebook; make sure you understand what it's doing!)

In [2]:

def load_year(year):
    data = pd.read_csv('../data/names/yob{0}.txt'.format(year),
                       names=['name', 'gender', 'births'])
    data['year'] = year
    return data

names = pd.concat([load_year(year) for year in range(1880, 2014)])
names.head()

Out[2]:

	name	gender	births	year
0	Mary	F	7065	1880
1	Anna	F	2604	1880
2	Emma	F	2003	1880
3	Elizabeth	F	1939	1880
4	Minnie	F	1746	1880

1. Find your own name within the data.¶

How many babies per year are born with your name?
What fraction of births each year have your name?

Note: there are multiple ways to do this, but the first part will use masking and pivot tables, while the second part might also throw-in a groupby.

In [3]:

# Plot the prevalence of my name over time
my_name = 'Jacob'
subset = names[names.name == my_name]

births = subset.pivot_table('births', index='year',
                            columns='gender', aggfunc='sum')
births.plot();

In [4]:

# Whoah... there are some female "Jacob"s? Let's look at this:
births['F'].fillna(0).plot();

In [5]:

# Now we'll normalize the births against the total for each year
def add_birth_frac(group):
    group['birth_frac'] = group.births / group.births.sum()
    return group

names = names.groupby(['year', 'gender']).apply(add_birth_frac)
names.head()

Out[5]:

	name	gender	births	year	birth_frac
0	Mary	F	7065	1880	0.077643
1	Anna	F	2604	1880	0.028618
2	Emma	F	2003	1880	0.022013
3	Elizabeth	F	1939	1880	0.021309
4	Minnie	F	1746	1880	0.019188

In [6]:

names[names.name == my_name].pivot_table('birth_frac', index='year',
                                         columns='gender', aggfunc='sum').plot();

My name (Jacob) seems to have dipped in popularity in the 1960s, only to make a profound rise around the time I was born (1981). I'll choose to presume that I was the trend-setter.

2. Find names which have switched genders.¶

This is a bit tricky: you might be tempted to use a groupby and apply over the multiple indices ['year', 'gender', 'name'], but if you try this you'll find that it's very computationally intensive.

I'd suggest doing the following:

Use a pivot table, and find the total number of births for each name before some early date (say, 1920) and after some later date (say, 1980).
Compute the percentage of males for each name within those groups.
Use masking to find which names have transitioned from a low percentage to a high percentage, and vice versa.

Is a name more likely to transition from female to male, or from male to female?

In [7]:

births = names.pivot_table('births', index=['year', 'name'], columns='gender')
births = births.fillna(0)
births.head()

Out[7]:

	gender	F	M
year	name
1880	Aaron	0	102
	Ab	0	5
	Abbie	71	0
	Abbott	0	5
	Abby	6	0

In [8]:

early_period = births[:1920].fillna(0).sum(axis=0, level=1)
late_period = births[1980:].fillna(0).sum(axis=0, level=1)

early_period.head()

Out[8]:

gender	F	M
name
Aaron	0	196
Ab	0	5
Abbie	152	0
Abbott	0	5
Abby	13	0

In [9]:

early_pct_M = 100 * early_period['M'] / early_period.sum(1)
late_pct_M = 100 * late_period['M'] / late_period.sum(1)
fractions = pd.DataFrame({'early_pct_M': early_pct_M,
                         'late_pct_M': late_pct_M})
fractions.head()

Out[9]:

	early_pct_M	late_pct_M
Aaban	NaN	100
Aabha	NaN	0
Aabid	NaN	100
Aabriella	NaN	0
Aadam	NaN	100

Names which switched from female to male:¶

In [10]:

fractions[(fractions.early_pct_M < 40) & (fractions.late_pct_M > 60)]

Out[10]:

	early_pct_M	late_pct_M
Byrd	0	74.163180
Donnie	0	89.382491
Texas	0	61.595547

Names which switched from male to female:¶

In [11]:

fractions[(fractions.late_pct_M < 40) & (fractions.early_pct_M > 60)]

Out[11]:

	early_pct_M	late_pct_M
Addison	100.000000	11.834706
Allison	100.000000	1.339702
Ashley	100.000000	1.829521
Aubrey	100.000000	27.222843
Avery	100.000000	35.322797
Bailey	100.000000	19.046721
Beverly	100.000000	1.213195
Billie	100.000000	23.801506
Dana	100.000000	21.765366
Dee	76.923077	31.270896
Edie	100.000000	1.747844
Hollie	100.000000	7.608124
Holly	100.000000	0.772741
Ivey	100.000000	38.790821
Jodie	100.000000	11.670106
Kelly	100.000000	14.757141
Leslie	90.804598	29.935140
Lindsay	100.000000	3.829148
Lindsey	100.000000	4.732257
Lonie	100.000000	38.973799
Lynn	85.714286	22.320630
Madison	100.000000	2.156303
Marion	62.171053	27.595220
Merida	100.000000	0.000000
Morgan	100.000000	16.627275
Nova	100.000000	7.898499
Orla	100.000000	29.629630
Ossie	64.285714	36.361646
Pat	100.000000	39.964689
Payton	100.000000	33.960323
Presley	100.000000	20.217981
Reese	100.000000	38.903715
Sandy	100.000000	13.159606
Shelby	100.000000	10.215817
Stacy	100.000000	11.978544
Sydney	100.000000	5.343434
Taylor	100.000000	26.055568
Tracy	100.000000	19.573779

Apparently it's much more common for a name to transition from male to female than the other way around!