Goal: demonstrate pandas groupby operations using 2010 Census state populations.
References:
Note the split-apply-combine framework of thinking about grouped computations.
# note version of pandas
import pandas
pandas.__version__
'0.13.0'
import us
import census
import settings
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
from itertools import islice
# Census API client authenticated with the key stored in settings
c = census.Census(settings.CENSUS_KEY)
def states(variables='NAME', year=2010):
    """Yield Census SF1 records for the 50 states + DC.

    variables: comma-separated Census variable names to fetch.
    year: decennial census year (default 2010, matching the original use).

    The API's state:* query also returns territories, so results are
    filtered against the FIPS codes of actual states from the us package.
    """
    geo = {'for': 'state:*'}
    # need to filter out non-states (e.g. Puerto Rico)
    states_fips = {state.fips for state in us.states.STATES}
    for r in c.sf1.get(variables, geo=geo, year=year):
        if r['state'] in states_fips:
            yield r
# build a DataFrame of state names and total populations from the 2010 Census
df = DataFrame(states('NAME,P0010001'))
df['P0010001'] = df['P0010001'].astype('int')
df.head()
NAME | P0010001 | state | |
---|---|---|---|
0 | Alabama | 4779736 | 01 |
1 | Alaska | 710231 | 02 |
2 | Arizona | 6392017 | 04 |
3 | Arkansas | 2915918 | 05 |
4 | California | 37253956 | 06 |
5 rows × 3 columns
# check that we have the right total population
# 308,745,538 is the official 2010 US resident population
df.P0010001.sum() == 308745538
True
# add a column holding the initial letter of each state NAME;
# we'll use this as the grouping key below
df['first_letter'] = df['NAME'].map(lambda name: name[0])
df.head()
NAME | P0010001 | state | first_letter | |
---|---|---|---|---|
0 | Alabama | 4779736 | 01 | A |
1 | Alaska | 710231 | 02 | A |
2 | Arizona | 6392017 | 04 | A |
3 | Arkansas | 2915918 | 05 | A |
4 | California | 37253956 | 06 | C |
5 rows × 4 columns
It is possible to do grouping in many ways — "the mapping can be specified many different ways".
We can also group by columns (axis=1), but such functionality is not demonstrated in this notebook.
# group by the first_letter column, passed explicitly as a Series
grouped = df.groupby(df['first_letter'])
grouped
<pandas.core.groupby.DataFrameGroupBy object at 0x106212350>
# shorthand for df.first_letter
# same thing as df.groupby(df.first_letter)
# passing the column label lets pandas look the Series up itself
grouped = df.groupby('first_letter')
grouped
<pandas.core.groupby.DataFrameGroupBy object at 0x106203f10>
# count the number of states with each first letter -- look at top of the resulting DataFrame
# count() tallies non-null entries per column within each group
grouped.count().head()
NAME | P0010001 | state | first_letter | |
---|---|---|---|---|
first_letter | ||||
A | 4 | 4 | 4 | 4 |
C | 3 | 3 | 3 | 3 |
D | 2 | 2 | 2 | 2 |
F | 1 | 1 | 1 | 1 |
G | 1 | 1 | 1 | 1 |
5 rows × 4 columns
# count the number of states with each first letter -- look at bottom of the resulting DataFrame
grouped.count().tail()
NAME | P0010001 | state | first_letter | |
---|---|---|---|---|
first_letter | ||||
S | 2 | 2 | 2 | 2 |
T | 2 | 2 | 2 | 2 |
U | 1 | 1 | 1 | 1 |
V | 2 | 2 | 2 | 2 |
W | 4 | 4 | 4 | 4 |
5 rows × 4 columns
# we didn't have to explicitly create a new column -- we could groupby on a dynamically generated Series
# note the use of str operations on df.NAME:
# http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods
# str.slice(0,1) extracts each NAME's first letter without adding a column
df.groupby(df.NAME.str.slice(0,1)).sum()
P0010001 | |
---|---|
NAME | |
A | 14797902 |
C | 45857249 |
D | 1499657 |
F | 18801310 |
G | 9687653 |
H | 1360301 |
I | 23928371 |
K | 7192485 |
L | 4533372 |
M | 38782746 |
N | 46280611 |
O | 19118929 |
P | 12702379 |
R | 1052567 |
S | 5439544 |
T | 31491666 |
U | 2763885 |
V | 8626765 |
W | 14828146 |
19 rows × 1 columns
# we can get groups of indexes
# .groups maps each group key to the list of row labels in that group
df.groupby('first_letter').groups
{u'A': [0, 1, 2, 3], u'C': [4, 5, 6], u'D': [7, 8], u'F': [9], u'G': [10], u'H': [11], u'I': [12, 13, 14, 15], u'K': [16, 17], u'L': [18], u'M': [19, 20, 21, 22, 23, 24, 25, 26], u'N': [27, 28, 29, 30, 31, 32, 33, 34], u'O': [35, 36, 37], u'P': [38], u'R': [39], u'S': [40, 41], u'T': [42, 43], u'U': [44], u'V': [45, 46], u'W': [47, 48, 49, 50]}
# loop through all groups --> here just the first one
# NOTE: the original mixed Python 2 print statements with print()-style
# calls; unified here to the parenthesized form (identical output on Py2)
for name, group in islice(df.groupby('first_letter'), 1):
    print(name)
    print(type(group))  # yes -- a DataFrame
    print(group.index)
    print(group),
A <class 'pandas.core.frame.DataFrame'> Int64Index([0, 1, 2, 3], dtype='int64') NAME P0010001 state first_letter 0 Alabama 4779736 01 A 1 Alaska 710231 02 A 2 Arizona 6392017 04 A 3 Arkansas 2915918 05 A [4 rows x 4 columns]
# how about accessing group 'C'?
# http://stackoverflow.com/a/14734627/7782
# http://stackoverflow.com/questions/19804282/in-pandas-is-there-something-like-a-groupby-get-group-but-with-an-optional-defa
# http://pandas.pydata.org/pandas-docs/dev/generated/pandas.core.groupby.GroupBy.get_group.html
# get_group pulls out the sub-DataFrame for a single key
grouped = df.groupby('first_letter')
grouped.get_group('C')
NAME | P0010001 | state | first_letter | |
---|---|---|---|---|
4 | California | 37253956 | 06 | C |
5 | Colorado | 5029196 | 08 | C |
6 | Connecticut | 3574097 | 09 | C |
3 rows × 4 columns
# total population of states starting with 'C'
# sum the P0010001 column of just that group
grouped.get_group('C').P0010001.sum()
45857249
# generate a Series of total populations by first letter
# selecting the column first, then summing, gives a Series keyed by group
grouped = df.groupby('first_letter')
s = grouped['P0010001'].sum()
s
first_letter A 14797902 C 45857249 D 1499657 F 18801310 G 9687653 H 1360301 I 23928371 K 7192485 L 4533372 M 38782746 N 46280611 O 19118929 P 12702379 R 1052567 S 5439544 T 31491666 U 2763885 V 8626765 W 14828146 Name: P0010001, dtype: int64
# sort the list to get the most populous groups
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort.html
# note sorting in place
# NOTE(review): Series.sort mutates s in place (pandas 0.13 API; later
# pandas versions replace it with the non-mutating sort_values)
s.sort(ascending=False)
s
first_letter N 46280611 C 45857249 M 38782746 T 31491666 I 23928371 O 19118929 F 18801310 W 14828146 A 14797902 P 12702379 G 9687653 V 8626765 K 7192485 S 5439544 L 4533372 U 2763885 D 1499657 H 1360301 R 1052567 Name: P0010001, dtype: int64
# the first letters, ordered by descending total population after the sort
s.index
Index([u'N', u'C', u'M', u'T', u'I', u'O', u'F', u'W', u'A', u'P', u'G', u'V', u'K', u'S', u'L', u'U', u'D', u'H', u'R'], dtype='object')
# first pass at pulling together the letter, the total population, and the list of corresponding states
for k in s.index:
print k, s[k], list(grouped.get_group(k).NAME)
N 46280611 [u'Nebraska', u'Nevada', u'New Hampshire', u'New Jersey', u'New Mexico', u'New York', u'North Carolina', u'North Dakota'] C 45857249 [u'California', u'Colorado', u'Connecticut'] M 38782746 [u'Maine', u'Maryland', u'Massachusetts', u'Michigan', u'Minnesota', u'Mississippi', u'Missouri', u'Montana'] T 31491666 [u'Tennessee', u'Texas'] I 23928371 [u'Idaho', u'Illinois', u'Indiana', u'Iowa'] O 19118929 [u'Ohio', u'Oklahoma', u'Oregon'] F 18801310 [u'Florida'] W 14828146 [u'Washington', u'West Virginia', u'Wisconsin', u'Wyoming'] A 14797902 [u'Alabama', u'Alaska', u'Arizona', u'Arkansas'] P 12702379 [u'Pennsylvania'] G 9687653 [u'Georgia'] V 8626765 [u'Vermont', u'Virginia'] K 7192485 [u'Kansas', u'Kentucky'] S 5439544 [u'South Carolina', u'South Dakota'] L 4533372 [u'Louisiana'] U 2763885 [u'Utah'] D 1499657 [u'Delaware', u'District of Columbia'] H 1360301 [u'Hawaii'] R 1052567 [u'Rhode Island']
# let's try this again and try to do this in a more idiomatic Pandas fashion
# apply lets each group be reduced to an arbitrary Python object --
# here, the list of state NAMEs in the group
# http://pandas.pydata.org/pandas-docs/dev/groupby.html#flexible-apply
# http://stackoverflow.com/questions/19530568/can-pandas-groupby-aggregate-into-a-list-rather-than-sum-mean-etc
df.groupby('first_letter').apply(lambda grp: list(grp['NAME']))
first_letter A [Alabama, Alaska, Arizona, Arkansas] C [California, Colorado, Connecticut] D [Delaware, District of Columbia] F [Florida] G [Georgia] H [Hawaii] I [Idaho, Illinois, Indiana, Iowa] K [Kansas, Kentucky] L [Louisiana] M [Maine, Maryland, Massachusetts, Michigan, Min... N [Nebraska, Nevada, New Hampshire, New Jersey, ... O [Ohio, Oklahoma, Oregon] P [Pennsylvania] R [Rhode Island] S [South Carolina, South Dakota] T [Tennessee, Texas] U [Utah] V [Vermont, Virginia] W [Washington, West Virginia, Wisconsin, Wyoming] dtype: object
# apply can be used to add up the populations by group
df.groupby('first_letter').apply(lambda grp: np.sum(grp['P0010001']))
first_letter A 14797902 C 45857249 D 1499657 F 18801310 G 9687653 H 1360301 I 23928371 K 7192485 L 4533372 M 38782746 N 46280611 O 19118929 P 12702379 R 1052567 S 5439544 T 31491666 U 2763885 V 8626765 W 14828146 dtype: int64
# make a tuple out of the list of names and the population
df.groupby('first_letter').apply(
    lambda grp: (list(grp['NAME']), np.sum(grp['P0010001'])))
first_letter A ([Alabama, Alaska, Arizona, Arkansas], 14797902) C ([California, Colorado, Connecticut], 45857249) D ([Delaware, District of Columbia], 1499657) F ([Florida], 18801310) G ([Georgia], 9687653) H ([Hawaii], 1360301) I ([Idaho, Illinois, Indiana, Iowa], 23928371) K ([Kansas, Kentucky], 7192485) L ([Louisiana], 4533372) M ([Maine, Maryland, Massachusetts, Michigan, Mi... N ([Nebraska, Nevada, New Hampshire, New Jersey,... O ([Ohio, Oklahoma, Oregon], 19118929) P ([Pennsylvania], 12702379) R ([Rhode Island], 1052567) S ([South Carolina, South Dakota], 5439544) T ([Tennessee, Texas], 31491666) U ([Utah], 2763885) V ([Vermont, Virginia], 8626765) W ([Washington, West Virginia, Wisconsin, Wyomin... dtype: object
# remind ourselves on how to turn a tuple into a Series with a small example
# the tuple's elements become values, paired positionally with the index labels
Series(([1,2],2), index=['one','two'])
one [1, 2] two 2 dtype: object
# we're ready to make a new DataFrame: when apply returns a Series per
# group, pandas assembles the results into rows of a DataFrame
df.groupby('first_letter').apply(
    lambda grp: Series((list(grp['NAME']), np.sum(grp['P0010001'])),
                       index=['states', 'total_pop']))
states | total_pop | |
---|---|---|
first_letter | ||
A | [Alabama, Alaska, Arizona, Arkansas] | 14797902 |
C | [California, Colorado, Connecticut] | 45857249 |
D | [Delaware, District of Columbia] | 1499657 |
F | [Florida] | 18801310 |
G | [Georgia] | 9687653 |
H | [Hawaii] | 1360301 |
I | [Idaho, Illinois, Indiana, Iowa] | 23928371 |
K | [Kansas, Kentucky] | 7192485 |
L | [Louisiana] | 4533372 |
M | [Maine, Maryland, Massachusetts, Michigan, Min... | 38782746 |
N | [Nebraska, Nevada, New Hampshire, New Jersey, ... | 46280611 |
O | [Ohio, Oklahoma, Oregon] | 19118929 |
P | [Pennsylvania] | 12702379 |
R | [Rhode Island] | 1052567 |
S | [South Carolina, South Dakota] | 5439544 |
T | [Tennessee, Texas] | 31491666 |
U | [Utah] | 2763885 |
V | [Vermont, Virginia] | 8626765 |
W | [Washington, West Virginia, Wisconsin, Wyoming] | 14828146 |
19 rows × 2 columns
# NOTE(review): sort_index(by=...) sorts rows by the total_pop column --
# a pandas 0.13 idiom; later versions renamed this to sort_values
df2 = df.groupby("first_letter").apply(lambda x:Series((list(x['NAME']), np.sum(x['P0010001'])),
index=['states','total_pop'])).sort_index(by='total_pop',ascending=False)
# make sure you understand the syntax here:
# .ix: http://pandas.pydata.org/pandas-docs/dev/indexing.html#advanced-indexing-with-ix
# ability to grab columns by name to return a new DataFrame
# NOTE(review): .ix was deprecated in later pandas -- .loc is the replacement
df2.ix['C'][['states','total_pop']]
states [California, Colorado, Connecticut] total_pop 45857249 Name: C, dtype: object
# Python 2 print statement: the list of states starting with 'C'
print "states that start with 'C'", df2.ix['C']['states']
states that start with 'C' [u'California', u'Colorado', u'Connecticut']
# ...and their combined population
print "total population of states that start with 'C'", df2.ix['C']['total_pop']
total population of states that start with 'C' 45857249