from __future__ import division, print_function
from IPython.display import HTML, Image, display as disp
# Load the talk's custom stylesheet; the bare HTML(...) expression at the
# end of the cell is what actually renders it in the notebook.
with open("../css/css.css", "r") as css_file:
    style = css_file.read()
HTML(style)
!./shortlog.sh ../../pandas
2985 Wes McKinney 1149 jreback 765 y-p 607 Chang She 323 Phillip Cloud 315 Adam Klein 120 Jeffrey Tratner 109 Vytautas Jancauskas 89 Joris Van den Bossche 84 Andy Hayden
pandas fills the need for a Python library that is awesome at data munging. With pandas you don't need to go outside of Python to do data analysis. You can write a web app using django and have data dashboards that are backed by pandas.
pandas
is the library for tabular data you've always wished you had. It supports operations such as
groupby
merge
mean
, sum
, any
, all
, etc.)

pandas was designed to work with "medium" data, i.e., data that fit into memory.

disp(Image("../img/m.jpg"))
disp(Image("../img/u.jpg"))
pip install sh
urllib
et al. are less than desirable

# imports that I'll use throughout the talk
import numpy as np
import pandas as pd
from pandas import DataFrame, Series, Index
from numpy.random import randn, randint, rand, choice
import matplotlib.pyplot as plt
pd.options.display.max_rows = 10
try:
from mpltools import style
style.use('ggplot')
except ImportError:
pass
# because of our bg color
plt.rc('text', color='white')
plt.rc('axes', labelcolor='white')
plt.rc('xtick', color='white')
plt.rc('ytick', color='white')
%matplotlib inline
DataFrame
is the flagship data structure in pandas
df = pd.DataFrame(randn(10, 2), columns=list('ab'))
df
a | b | |
---|---|---|
0 | 1.525911 | -1.536062 |
1 | -0.773639 | 0.786538 |
2 | 0.673406 | 0.336026 |
3 | -0.778567 | 0.557999 |
4 | 0.932137 | -1.204426 |
5 | 0.139991 | 0.406078 |
6 | -0.687879 | -0.497632 |
7 | 0.271935 | -0.287440 |
8 | 0.240098 | 1.125080 |
9 | 0.235167 | -0.532541 |
# if you have boto installed and have set up credentials
# df = pd.read_csv('s3://nyqpug/tips.csv')
df = pd.read_csv('https://s3.amazonaws.com/nyqpug/tips.csv')
df
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
df.dtypes
total_bill float64 tip float64 sex object smoker object day object time object size int64 dtype: object
One of pandas' most useful and powerful features is its ability to slice and dice data in almost any way you can think of.

# column access by name
df['day'] # ⟵ that's a Series object
0 Sun 1 Sun 2 Sun ... 241 Sat 242 Sat 243 Thur Name: day, Length: 244, dtype: object
# by attribute
df.time
0 Dinner 1 Dinner 2 Dinner ... 241 Dinner 242 Dinner 243 Dinner Name: time, Length: 244, dtype: object
df.time.value_counts()
Dinner 176 Lunch 68 dtype: int64
# multiple columns
df[['tip', 'sex']]
tip | sex | |
---|---|---|
0 | 1.01 | Female |
1 | 1.66 | Male |
2 | 3.50 | Male |
3 | 3.31 | Male |
4 | 3.61 | Female |
... | ... | ... |
239 | 5.92 | Male |
240 | 2.00 | Female |
241 | 2.00 | Male |
242 | 1.75 | Male |
243 | 3.00 | Female |
244 rows × 2 columns
t = df.set_index('day')
t.head()
total_bill | tip | sex | smoker | time | size | |
---|---|---|---|---|---|---|
day | ||||||
Sun | 16.99 | 1.01 | Female | No | Dinner | 2 |
Sun | 10.34 | 1.66 | Male | No | Dinner | 3 |
Sun | 21.01 | 3.50 | Male | No | Dinner | 3 |
Sun | 23.68 | 3.31 | Male | No | Dinner | 2 |
Sun | 24.59 | 3.61 | Female | No | Dinner | 4 |
t.loc['Sun']
df.loc[df.day == 'Sun']
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
186 | 20.90 | 3.50 | Female | Yes | Sun | Dinner | 3 |
187 | 30.46 | 2.00 | Male | Yes | Sun | Dinner | 5 |
188 | 18.15 | 3.50 | Female | Yes | Sun | Dinner | 3 |
189 | 23.10 | 4.00 | Male | Yes | Sun | Dinner | 3 |
190 | 15.69 | 1.50 | Male | Yes | Sun | Dinner | 2 |
76 rows × 7 columns
df.loc[:, 'smoker']
0 No 1 No 2 No ... 241 Yes 242 No 243 No Name: smoker, Length: 244, dtype: object
df.iloc[:, 3] # same as df.loc[:, 'smoker']
0 No 1 No 2 No ... 241 Yes 242 No 243 No Name: smoker, Length: 244, dtype: object
# Tip as a fraction of the bill.
df['pct_tip'] = df.tip / df.total_bill
# Average price per person: use df['size'], NOT df.size. ``size`` is a
# built-in DataFrame attribute (total number of elements, rows * cols),
# so attribute access shadows the 'size' column and would divide every
# bill by a single scalar instead of by the party size.
df['avg_price'] = df.total_bill / df['size']
df.avg_price.hist(bins=20)
<matplotlib.axes._subplots.AxesSubplot at 0x4ae7a50>
del df['avg_price']
del df['pct_tip']
# multiple columns, multiple rows
df.loc[[0, 2], ['sex', 'tip']]
sex | tip | |
---|---|---|
0 | Female | 1.01 |
2 | Male | 3.50 |
df.loc[:10, ['total_bill', 'tip']] # note this is inclusive
total_bill | tip | |
---|---|---|
0 | 16.99 | 1.01 |
1 | 10.34 | 1.66 |
2 | 21.01 | 3.50 |
3 | 23.68 | 3.31 |
4 | 24.59 | 3.61 |
... | ... | ... |
6 | 8.77 | 2.00 |
7 | 26.88 | 3.12 |
8 | 15.04 | 1.96 |
9 | 14.78 | 3.23 |
10 | 10.27 | 1.71 |
11 rows × 2 columns
# and with iloc
df.iloc[:5] # exclusive endpoints
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Apply a function over the columns of a DataFrame
# np.ptp is peak-to-peak difference, i.e., range
df[['total_bill', 'tip', 'size']].apply(np.ptp)
total_bill 47.74 tip 9.00 size 5.00 dtype: float64
# Map day names to their position in the week (Sun == 0).
days = ['Sun', 'Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat']
# A dict gives O(1) lookup per row instead of list.index's linear scan.
# (Unlike list.index, an unknown label maps to NaN rather than raising —
# all labels in this column are present, so results are identical here.)
day_codes = {day: i for i, day in enumerate(days)}
df.day.map(day_codes)
0 0 1 0 2 0 ... 241 6 242 6 243 4 Name: day, Length: 244, dtype: int64
For example, the DataFrame.sum()
method results in a Series
object and by default sums across the rows. If you want to sum across the columns pass axis=1
, similar to the numpy
convention
In general, reduction operations result in a Series
object.
Some example reduction functions:
mean
, median
, and mode
count
std
, var
NOTE: These ignore NaN
s
df.sum() # whoa! + is defined for strings
total_bill 4827.77 tip 731.58 sex FemaleMaleMaleMaleFemaleMaleMaleMaleMaleMaleMa... smoker NoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNo... day SunSunSunSunSunSunSunSunSunSunSunSunSunSunSunS... time DinnerDinnerDinnerDinnerDinnerDinnerDinnerDinn... size 627 dtype: object
df.sum(numeric_only=True)
total_bill 4827.77 tip 731.58 size 627.00 dtype: float64
df.count() / df.shape[0] # we don't have any nans
total_bill 1 tip 1 sex 1 smoker 1 day 1 time 1 size 1 dtype: float64
df.var()
total_bill 79.252939 tip 1.914455 size 0.904591 dtype: float64
df.mean()
total_bill 19.785943 tip 2.998279 size 2.569672 dtype: float64
pandas' GroupBy functionality allows you to perform operations on subsets of a DataFrame and then combines the results for you at the end.
df.head()
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
gb = df.groupby('sex')
gb.mean()
total_bill | tip | size | |
---|---|---|---|
sex | |||
Female | 18.056897 | 2.833448 | 2.459770 |
Male | 20.744076 | 3.089618 | 2.630573 |
gb = df.groupby(['sex', 'smoker'])
gb.std()
total_bill | tip | size | ||
---|---|---|---|---|
sex | smoker | |||
Female | No | 7.286455 | 1.128425 | 1.073146 |
Yes | 9.189751 | 1.219916 | 0.613917 | |
Male | No | 8.726566 | 1.489559 | 0.989094 |
Yes | 9.911845 | 1.500120 | 0.892530 |
# can pass multiple reducers to agg
gb.agg(['mean', 'std', 'median'])
total_bill | tip | size | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
mean | std | median | mean | std | median | mean | std | median | ||
sex | smoker | |||||||||
Female | No | 18.105185 | 7.286455 | 16.69 | 2.773519 | 1.128425 | 2.68 | 2.592593 | 1.073146 | 2 |
Yes | 17.977879 | 9.189751 | 16.27 | 2.931515 | 1.219916 | 2.88 | 2.242424 | 0.613917 | 2 | |
Male | No | 19.791237 | 8.726566 | 18.24 | 3.113402 | 1.489559 | 2.74 | 2.711340 | 0.989094 | 2 |
Yes | 22.284500 | 9.911845 | 20.39 | 3.051167 | 1.500120 | 3.00 | 2.500000 | 0.892530 | 2 |
apply

GroupBy.apply() takes a callable and calls it on each group.
def stdize(x):
    """Return *x* rescaled to z-scores: mean 0, unit (sample) std."""
    centered = x - x.mean()
    return centered / x.std()
df['tb_std'] = gb.total_bill.apply(stdize)
df
total_bill | tip | sex | smoker | day | time | size | tb_std | |
---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | -0.153049 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | -1.083042 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 0.139661 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.445623 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.889982 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 1.058694 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 1.001346 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 0.038893 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | -0.225889 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 0.092612 |
244 rows × 8 columns
# Overlay a KDE on a normalized histogram of the group-standardized bills.
ax = df.tb_std.plot(kind='kde', lw=3)
# NOTE(review): `normed` was deprecated and then removed in matplotlib 3.x;
# on a modern matplotlib this call must use `density=True` instead.
df.tb_std.hist(ax=ax, normed=True)
ax.set_xlabel('$z$-score', fontsize=20)
ax.set_title(r'Total Bill Stdized Across Sex$\times$ Smoker')
ax.axis('tight')
plt.gcf().tight_layout()
Often times you have data sets of different shapes with a common column that you want to join on
pandas
does this a couple of ways, but the main entry point is pandas.merge
# Build two toy activity tables that share a user_id key for the merge demo.
# visits draws user ids from 0-8, likes from 0-5, so the inner join only
# keeps the overlapping users.
n = 2000
half_n = n // 2
visits = DataFrame({
    'page_visits_per_day': np.random.poisson(10, size=n),
    'user_id': randint(9, size=n),
})
likes = DataFrame({
    'likes_per_day': np.random.poisson(30, size=half_n),
    'user_id': randint(6, size=half_n),
})
visits
page_visits_per_day | user_id | |
---|---|---|
0 | 6 | 4 |
1 | 12 | 1 |
2 | 13 | 5 |
3 | 7 | 1 |
4 | 6 | 7 |
... | ... | ... |
1995 | 7 | 1 |
1996 | 8 | 5 |
1997 | 9 | 3 |
1998 | 10 | 6 |
1999 | 8 | 4 |
2000 rows × 2 columns
likes
likes_per_day | user_id | |
---|---|---|
0 | 27 | 0 |
1 | 33 | 4 |
2 | 23 | 1 |
3 | 31 | 5 |
4 | 26 | 0 |
... | ... | ... |
995 | 28 | 5 |
996 | 30 | 3 |
997 | 33 | 4 |
998 | 29 | 0 |
999 | 27 | 5 |
1000 rows × 2 columns
# Inner join on the shared 'user_id' column (merge's default behavior).
merg = pd.merge(visits, likes)
# DataFrame.sort() was deprecated in pandas 0.17 and removed in 0.20;
# sort_values is the supported spelling and returns the same result here.
merg.sort_values('user_id')
page_visits_per_day | user_id | likes_per_day | |
---|---|---|---|
152516 | 15 | 0 | 22 |
174447 | 11 | 0 | 21 |
174446 | 11 | 0 | 25 |
174445 | 11 | 0 | 25 |
174444 | 11 | 0 | 26 |
... | ... | ... | ... |
85623 | 17 | 5 | 24 |
85624 | 17 | 5 | 35 |
85625 | 17 | 5 | 34 |
85645 | 17 | 5 | 25 |
73182 | 10 | 5 | 34 |
227153 rows × 3 columns
s = df.day
s
0 Sun 1 Sun 2 Sun ... 241 Sat 242 Sat 243 Thur Name: day, Length: 244, dtype: object
is_weekend = s.str.startswith('S')
is_weekend
0 True 1 True 2 True ... 241 True 242 True 243 False Name: day, Length: 244, dtype: bool
correct = s[is_weekend].str.contains(r'^(?:Sat|Sun)$')
correct.all()
True
s.str.len()
0 3 1 3 2 3 ... 241 3 242 3 243 4 Name: day, Length: 244, dtype: int64
s.str[:2]
0 Su 1 Su 2 Su ... 241 Sa 242 Sa 243 Th Name: day, Length: 244, dtype: object
n = 10000
idx = pd.date_range(start='today', periods=n, freq='D')
idx
<class 'pandas.tseries.index.DatetimeIndex'> [2014-05-28, ..., 2041-10-12] Length: 10000, Freq: D, Timezone: None
s = Series(np.random.poisson(10, size=n), index=idx, name='login_count')
s
2014-05-28 6 2014-05-29 12 2014-05-30 11 ... 2041-10-10 6 2041-10-11 14 2041-10-12 8 Freq: D, Name: login_count, Length: 10000
resample

s.resample('W', how='sum')
2014-06-01 46 2014-06-08 56 2014-06-15 74 ... 2041-09-29 66 2041-10-06 77 2041-10-13 59 Freq: W-SUN, Name: login_count, Length: 1429
# Aggregate each weekly bin with several reducers at once.
# The old how=[...] keyword was deprecated in pandas 0.18 and later
# removed; .resample(...).agg([...]) is the supported spelling and
# produces the same mean/std/count columns.
rs = s.resample('W').agg(['mean', 'std', 'count'])
rs
rs
mean | std | count | |
---|---|---|---|
2014-06-01 | 9.200000 | 2.387467 | 5 |
2014-06-08 | 8.000000 | 2.768875 | 7 |
2014-06-15 | 10.571429 | 4.755949 | 7 |
2014-06-22 | 10.857143 | 3.132016 | 7 |
2014-06-29 | 9.285714 | 2.927700 | 7 |
... | ... | ... | ... |
2041-09-15 | 8.285714 | 2.984085 | 7 |
2041-09-22 | 10.000000 | 2.000000 | 7 |
2041-09-29 | 9.428571 | 3.154739 | 7 |
2041-10-06 | 11.000000 | 2.828427 | 7 |
2041-10-13 | 9.833333 | 2.857738 | 6 |
1429 rows × 3 columns
mu = rs['mean']
fig, ax = plt.subplots(figsize=(12, 6))
ax.step(mu.index[:50], mu.iloc[:50], lw=3, where='post')
fig.tight_layout()
ax.set_xlabel('Time')
ax.set_ylabel('Average Count / Week')
ax.set_title('Count vs. Time')
<matplotlib.text.Text at 0x58f0550>
%%writefile tmp.csv
a,b,c
1,d,3.0
4,e,6.28
2,f,4.4
Overwriting tmp.csv
df = pd.read_csv('tmp.csv')
df
a | b | c | |
---|---|---|---|
0 | 1 | d | 3.00 |
1 | 4 | e | 6.28 |
2 | 2 | f | 4.40 |
df.dtypes
a int64 b object c float64 dtype: object
df.to_csv('tmp.csv')
%cat tmp.csv # saves the index created by default
,a,b,c 0,1,d,3.0 1,4,e,6.28 2,2,f,4.4
# A small mixed-dtype frame: float ('a'), int ('b'), and string ('c') columns.
df = DataFrame({'a': randn(10),
                'b': randint(10, size=10),
                'c': choice(list('abc'), size=10)})
df
a | b | c | |
---|---|---|---|
0 | 0.311531 | 2 | a |
1 | 0.466764 | 3 | b |
2 | -0.134696 | 9 | a |
3 | -1.227535 | 4 | c |
4 | -0.572367 | 8 | a |
5 | 0.330043 | 7 | a |
6 | -0.684184 | 4 | c |
7 | -1.097351 | 9 | c |
8 | -0.694465 | 1 | b |
9 | -1.376861 | 5 | b |
df.to_hdf('tmp.h5', 'df', format='table')
pd.read_hdf('tmp.h5', 'df')
a | b | c | |
---|---|---|---|
0 | 0.311531 | 2 | a |
1 | 0.466764 | 3 | b |
2 | -0.134696 | 9 | a |
3 | -1.227535 | 4 | c |
4 | -0.572367 | 8 | a |
5 | 0.330043 | 7 | a |
6 | -0.684184 | 4 | c |
7 | -1.097351 | 9 | c |
8 | -0.694465 | 1 | b |
9 | -1.376861 | 5 | b |
!ptdump tmp.h5
/ (RootGroup) '' /df (Group) '' /df/table (Table(10,)) ''