Groupby operations¶

Some imports:

In [ ]:

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
try:
    import seaborn
except ImportError:
    pass

pd.options.display.max_rows = 10

Some 'theory': the groupby operation (split-apply-combine)¶

The "group by" concept: we want to apply the same function on subsets of your dataframe, based on some key to split the dataframe in subsets

This operation is also referred to as the "split-apply-combine" operation, involving the following steps:

Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure

Similar to SQL GROUP BY

The example of the image in pandas syntax:

In [ ]:

df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df

Using the filtering and reductions operations we have seen in the previous notebooks, we could do something like:

df[df['key'] == "A"].sum()
df[df['key'] == "B"].sum()
...

But pandas provides the groupby method to do this:

In [ ]:

df.groupby('key').aggregate(np.sum)  # 'sum'

In [ ]:

df.groupby('key').sum()

And many more methods are available.

And now applying this on some real data¶

We go back to the titanic survival data:

In [ ]:

df = pd.read_csv("data/titanic.csv")

In [ ]:

df.head()

EXERCISE: Using groupby(), calculate the average age for each sex.

In [ ]:

EXERCISE: Calculate the average survival ratio for all passengers.

In [ ]:

EXERCISE: Calculate this survival ratio for all passengers younger that 25 (remember: filtering/boolean indexing).

In [ ]:

EXERCISE: Is there a difference in this survival ratio between the sexes? (tip: write the above calculation of the survival ratio as a function)

In [ ]:

EXERCISE: Make a bar plot of the survival ratio for the different classes ('Pclass' column).

In [ ]:

If you are ready, more groupby exercises can be found in the "Advanded groupby operations" notebook.