Some imports:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
try:
    import seaborn
except ImportError:
    pass
pd.options.display.max_rows = 10
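The "group by" concept: we want to apply the same function to subsets of a dataframe, where the subsets are defined by the values of some key.
This operation is also referred to as the "split-apply-combine" operation, involving the following steps: splitting the data into groups based on the key, applying a function to each group independently, and combining the results into a single data structure.
This is similar to GROUP BY in SQL.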
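The three steps can be spelled out by hand, which makes it easier to see what a grouped operation does. This is only a minimal sketch on a small made-up frame; the names `toy`, `pieces`, `applied` and `manual` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                    'data': [1, 2, 3, 4]})

# Split: one sub-frame per distinct key value
pieces = {k: toy[toy['key'] == k] for k in toy['key'].unique()}

# Apply: reduce each piece independently
applied = {k: piece['data'].sum() for k, piece in pieces.items()}

# Combine: gather the per-group results into one Series
manual = pd.Series(applied, name='data')
manual.index.name = 'key'

# groupby performs the same three steps in a single call
grouped = toy.groupby('key')['data'].sum()
```

Both `manual` and `grouped` hold one sum per key; the groupby version avoids the explicit loop over keys.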
The example of the image in pandas syntax:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df
Using the filtering and reduction operations we have seen in the previous notebooks, we could do something like:
df[df['key'] == "A"].sum()
df[df['key'] == "B"].sum()
...
But pandas provides the groupby method to do this:
df.groupby('key').aggregate('sum')  # np.sum also works, but recent pandas versions may warn about it
df.groupby('key').sum()
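The shape of the result depends on how the groupby is written. As a hedged sketch on the same small frame (the names `sums` and `flat` are illustrative), selecting a column before reducing gives a Series indexed by the key, while `as_index=False` keeps the key as a regular column, closer to what a SQL GROUP BY query returns.

```python
import pandas as pd

# Same small frame as defined earlier in this notebook
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

# Select the column first: the result is a Series indexed by 'key'
sums = df.groupby('key')['data'].sum()

# as_index=False: the result is a DataFrame with 'key' as an ordinary column
flat = df.groupby('key', as_index=False)['data'].sum()
```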
Many more aggregation methods are available on a groupby object, such as mean, median, min, max, count and size.
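A few of these, as a minimal sketch on the same small frame. The variable names are illustrative; `agg` with a list of method names is one way to compute several reductions at once.

```python
import pandas as pd

# Same small frame as defined earlier in this notebook
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

grouped = df.groupby('key')['data']

means = grouped.mean()    # average of 'data' per key
maxima = grouped.max()    # largest 'data' value per key
sizes = grouped.size()    # number of rows per key

# Several reductions at once: one output column per method name
summary = grouped.agg(['min', 'max', 'mean'])
```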
We now return to the Titanic survival data:
df = pd.read_csv("data/titanic.csv")
df.head()
If you are ready, more groupby exercises can be found in the "Advanced groupby operations" notebook.