From Rob Pike's Notes on Programming in C:
Rule 5. Data dominates. If you've chosen the right data structures and
organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
Pandas is built on a hierarchy of a few powerful data structures. Each of these structures is composed of, and designed to interoperate with, the simpler structures.
Index (1-dimensional immutable ordered hash table)
Series (1-dimensional labelled array)
DataFrame (2-dimensional labelled array)
Panel (3-dimensional labelled array)
# Tell IPython to display matplotlib plots inline.
%matplotlib inline
# Set default font attributes.
import matplotlib
font = {'family': 'normal',
        'weight': 'bold',
        'size': 13}
matplotlib.rc('font', **font)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
randn = np.random.randn
pd.set_option('display.mpl_style', 'default')
pd.set_option('display.max_rows', 15)
# Make a default figure size for later use.
DEFAULT_FIGSIZE = (12, 6)
Series
s = pd.Series([3, 5, 7, 2])
s
0    3
1    5
2    7
3    2
dtype: int64
# An important concept to understand when working with a `Series` is that it's
# actually composed of two pieces: an index array, and a data array.
print "The index is {0}.".format(s.index)
print "The values are {0}.".format(s.values)
The index is Int64Index([0, 1, 2, 3], dtype='int64').
The values are [3 5 7 2].
# You can explicitly pass your own labels to use as an index. If you don't,
# Pandas will construct a default index with integer labels.
pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd'])
a    1.185723
b   -0.180358
c    0.762084
d    1.277645
dtype: float64
# You can also construct a Series from a dictionary.
# The keys are used as the index, and the values are used as the Series' values.
pd.Series(
    {
        'a': 1,
        'b': 2,
        'c': 3,
    }
)
a    1
b    2
c    3
dtype: int64
# You get performance (and code clarity!) benefits if your Series'
# labels/values are homogeneously typed, but mixed-type arrays are supported.
pd.Series(
    [1, 2.6, 'a', {'a': 'b'}],
    index=[1, 'a', 2, 2.5],
)
1                 1
a               2.6
2                 a
2.5    {u'a': u'b'}
dtype: object
Indexing a Series with __getitem__ (aka [])
Pandas objects support a wide range of selection and filtering methods. An important idea to keep in mind is the following: if you have an N-dimensional object, selecting with a single scalar key drops the dimension by one, while selecting with a list or slice of keys gives you back another N-dimensional object.
s = pd.Series(range(10), index=list('ABCDEFGHIJ'))
s
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int64
# Lookups by key work as you'd expect.
s['E']
4
# We can look up multiple values at a time by passing a list of keys.
# The resulting value is a new `Series`.
s[['E', 'I', 'B']]
E    4
I    8
B    1
dtype: int64
# Because the Index is ordered, we can use Python's slicing syntax.
s['E':]
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int64
# Label-based slicing is inclusive of both endpoints. (Unlike positions,
# a label has no natural "one past the end" successor.)
s[:'I']
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
dtype: int64
s['E':'I']
E    4
F    5
G    6
H    7
I    8
dtype: int64
# Step arguments work just like Python lists.
s['E':'I':2]
E    4
G    6
I    8
dtype: int64
# If you don't know the label you want, but you do know the position, you can
# use `iloc`.
print "The first entry is: %d" % s.iloc[0]
print "The last entry is: %d" % s.iloc[-1]
The first entry is: 0
The last entry is: 9
# Slicing works with `iloc` as well.
# Note that, unlike with label-based slicing, integer-based slices are
# right-open intervals, i.e. doing s.iloc[X:Y] gives you elements with indices
# in [X, Y). This is the same as the semantics for list slicing.
s.iloc[5:]
F    5
G    6
H    7
I    8
J    9
dtype: int64
print s.iloc[:5]
A    0
B    1
C    2
D    3
E    4
dtype: int64
s.iloc[-3:]
H    7
I    8
J    9
dtype: int64
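One caveat worth a quick aside (our addition, with a made-up Series t): when the index itself holds integers, plain [] does label-based lookup, so iloc is the unambiguous way to ask for "the k-th element".
# Added sketch: with an integer index, [] means label, iloc means position.
t = pd.Series([10, 20, 30], index=[2, 1, 0])
t[0]       # -> 30, the value labelled 0
t.iloc[0]  # -> 10, the value in the first position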
# Create two Series objects containing 100 samples each of sine and cosine.
sine = pd.Series(np.sin(np.linspace(0, 3.14 * 2, 100)), name='sine')
cosine = pd.Series(np.cos(np.linspace(0, 3.14 * 2, 100)), name='cosine')
sine
0     0.000000
1     0.063392
2     0.126529
3     0.189156
4     0.251023
        ...
94   -0.314905
95   -0.254105
96   -0.192283
97   -0.129688
98   -0.066570
99   -0.003185
Name: sine, Length: 100, dtype: float64
cosine
0     1.000000
1     0.997989
2     0.991963
3     0.981947
4     0.967981
        ...
94    0.949123
95    0.967177
96    0.981339
97    0.991555
98    0.997782
99    0.999995
Name: cosine, Length: 100, dtype: float64
# Multiplying two Series objects produces a new Series by multiplying values that have the same keys.
product = cosine * sine
product
0     0.000000
1     0.063264
2     0.125512
3     0.185742
4     0.242986
        ...
94   -0.298884
95   -0.245765
96   -0.188695
97   -0.128592
98   -0.066423
99   -0.003185
Length: 100, dtype: float64
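Alignment is by label, not by position. A minimal sketch (ours, with two toy Series a and b, not from the original notebook) makes this visible:
# Arithmetic aligns on labels: 'b' matches 'b' even though the positions differ.
a = pd.Series([1, 2], index=['a', 'b'])
b = pd.Series([10, 20], index=['b', 'a'])
a + b
# a    21
# b    12
# dtype: int64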
# Adding or multiplying a Series by a scalar applies that operation to each value in the Series.
cosine_plus_one = cosine + 1
cosine_plus_one
0     2.000000
1     1.997989
2     1.991963
3     1.981947
4     1.967981
        ...
94    1.949123
95    1.967177
96    1.981339
97    1.991555
98    1.997782
99    1.999995
Name: cosine, Length: 100, dtype: float64
# Other binary operators work as you'd expect.
# Note how much cleaner and clearer this is
# compared to looping over two containers and
# performing multiple operations on elements
# from each.
identity = (sine ** 2) + (cosine ** 2)
identity
0     1
1     1
2     1
3     1
4     1
     ...
94    1
95    1
96    1
97    1
98    1
99    1
Length: 100, dtype: float64
pandas data structures have plot methods that provide a user-friendly interface to matplotlib.
# Plot our sine values.
trigplot = sine.plot(
    ylim=(-1.2, 1.2),
    legend=True,
    figsize=DEFAULT_FIGSIZE,
    linewidth=3,
    label='sine',
)
# Add our other Series to the same plot.
cosine.plot(ax=trigplot, legend=True, linewidth=3)
product.plot(ax=trigplot, legend=True, linewidth=3, label='product')
identity.plot(ax=trigplot, legend=True, linewidth=3, label='identity')
<matplotlib.axes.AxesSubplot at 0x1055dec50>
Transforming a Series using the apply method.
def tenths_place(N):
    s = str(N)
    return s[s.find('.') + 1]
product.apply(tenths_place)
0     0
1     0
2     1
3     1
4     2
     ...
94    2
95    2
96    1
97    1
98    0
99    0
Length: 100, dtype: object
# A major problem when working with real-world data is handling missing entries.
# Pandas handles missing data by taking the union of the indices and marking
# entries that are absent on either side as NaN.
s1 = pd.Series({'a': 1, 'b': 2, 'c': 3})
# s2 is missing an entry for 'b'
s2 = pd.Series({'a': 4, 'c': 5})
s1 + s2
a     5
b   NaN
c     8
dtype: float64
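If NaN is not the behavior you want, Series.add (the method form of +) accepts a fill_value. A minimal sketch using the s1 and s2 above:
# Treat s2's missing 'b' entry as 0 instead of producing NaN.
s1.add(s2, fill_value=0)
# a    5
# b    2
# c    8
# dtype: float64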
Comparing Series
s1 = pd.Series(
    {
        'A': 1,
        'B': 2,
        'C': 3,
        'D': 4,
        'E': 3,
        'F': 2,
        'G': 1,
    }
)
# You can create a constant Series by passing a scalar value and an index.
s2 = pd.Series(2, index=s1.index)
greater = s1 > s2
greater
A    False
B    False
C     True
D     True
E     True
F    False
G    False
dtype: bool
less = s1 < s2
less
A     True
B    False
C    False
D    False
E    False
F    False
G     True
dtype: bool
equal = s1 == s2
equal
A    False
B     True
C    False
D    False
E    False
F     True
G    False
dtype: bool
# Comparisons against scalars also work.
s1_equal_to_3 = s1 == 3
s1_equal_to_3
A    False
B    False
C     True
D    False
E     True
F    False
G    False
dtype: bool
# Summarize the comparisons in a single DataFrame.
pd.DataFrame({
    's1': s1,
    's2': s2,
    's1 > s2': greater,
    's1 == s2': equal,
    's1 < s2': less,
    's1 == 3': s1_equal_to_3,
}, columns=['s1', 's2', 's1 > s2', 's1 == s2', 's1 < s2', 's1 == 3'])
|   | s1 | s2 | s1 > s2 | s1 == s2 | s1 < s2 | s1 == 3 |
|---|----|----|---------|----------|---------|---------|
| A | 1  | 2  | False   | False    | True    | False   |
| B | 2  | 2  | False   | True     | False   | False   |
| C | 3  | 2  | True    | False    | False   | True    |
| D | 4  | 2  | True    | False    | False   | False   |
| E | 3  | 2  | True    | False    | False   | True    |
| F | 2  | 2  | False   | True     | False   | False   |
| G | 1  | 2  | False   | False    | True    | False   |
Boolean-valued Series can be used for slicing. You can think of this as marking particular index values as "keep" (True) or "drop" (False).
# Indexing into a series with a boolean Series masks away the values which were
# false in the passed Series.
s1[s1 > s2]
C    3
D    4
E    3
dtype: int64
# We can combine these operators to concisely express complex
# computations/filters.
s1[(s1 > 1) & ~(s1 > s2)]
B    2
F    2
dtype: int64
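Note that the parentheses above are required: Python's & binds more tightly than comparison operators, so without them s1 > 1 & ~(s1 > s2) would parse as s1 > (1 & ~(s1 > s2)), which is not what we mean.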
# Pandas has a special index class, `DatetimeIndex`, for representing
# time-series data.
start = pd.Timestamp('2014-01-01', tz='UTC')
end = pd.Timestamp('2014-01-09', tz='UTC')
# date_range is an easy way to construct a DatetimeIndex
daily_index = pd.date_range(start, end)
daily_index
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01, ..., 2014-01-09]
Length: 9, Freq: D, Timezone: UTC
# A DatetimeIndex has a notion of frequency.
from pandas.tseries.offsets import Day, Hour, BDay, Minute
hourly_index = pd.date_range(
    pd.Timestamp('2014-01-01', tz='UTC'),
    pd.Timestamp('2014-01-09', tz='UTC'),
    freq=Hour(),
)
hourly_index
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 00:00:00+00:00, ..., 2014-01-09 00:00:00+00:00]
Length: 193, Freq: H, Timezone: UTC
bihourly_index = pd.date_range(
    pd.Timestamp('2014-01-01', tz='UTC'),
    pd.Timestamp('2014-01-09', tz='UTC'),
    freq=Hour(2),
)
bihourly_index
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 00:00:00+00:00, ..., 2014-01-09 00:00:00+00:00]
Length: 97, Freq: 2H, Timezone: UTC
weekday_index = pd.date_range(
    pd.Timestamp('2014-01-01', tz='UTC'),
    pd.Timestamp('2014-01-09', tz='UTC'),
    freq=BDay(),
)
print weekday_index
[i for i in weekday_index]
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01, ..., 2014-01-09]
Length: 7, Freq: B, Timezone: UTC
[Timestamp('2014-01-01 00:00:00+0000', tz='UTC', offset='B'),
 Timestamp('2014-01-02 00:00:00+0000', tz='UTC', offset='B'),
 Timestamp('2014-01-03 00:00:00+0000', tz='UTC', offset='B'),
 Timestamp('2014-01-06 00:00:00+0000', tz='UTC', offset='B'),
 Timestamp('2014-01-07 00:00:00+0000', tz='UTC', offset='B'),
 Timestamp('2014-01-08 00:00:00+0000', tz='UTC', offset='B'),
 Timestamp('2014-01-09 00:00:00+0000', tz='UTC', offset='B')]
If your Series has a DatetimeIndex, then you immediately get access to sophisticated resampling tools.
ts = pd.Series(
    np.arange(30) ** 2,
    pd.date_range(
        start=pd.Timestamp('2014-01-01', tz='UTC'),
        freq='1D',
        periods=30,
    )
)
ts.plot()
<matplotlib.axes.AxesSubplot at 0x105610b50>
# By default, resampling to a lower frequency takes the mean of the entries
# that were downsampled.
resampled = ts.resample('5D')
resampled
2014-01-01 00:00:00+00:00      6
2014-01-06 00:00:00+00:00     51
2014-01-11 00:00:00+00:00    146
2014-01-16 00:00:00+00:00    291
2014-01-21 00:00:00+00:00    486
2014-01-26 00:00:00+00:00    731
Freq: 5D, dtype: int64
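As a quick sanity check on the default aggregation (our addition): the first 5-day bucket covers 2014-01-01 through 2014-01-05, whose values are the squares 0, 1, 4, 9, 16.
# The mean of the first bucket's values should match resampled.iloc[0].
np.mean([0, 1, 4, 9, 16])  # -> 6.0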
# We can customize this behavior though.
resampled_first = ts.resample('5D', how='first')
resampled_first
2014-01-01 00:00:00+00:00      0
2014-01-06 00:00:00+00:00     25
2014-01-11 00:00:00+00:00    100
2014-01-16 00:00:00+00:00    225
2014-01-21 00:00:00+00:00    400
2014-01-26 00:00:00+00:00    625
Freq: 5D, dtype: int64
resampled_last = ts.resample('5D', how='last')
resampled_last
2014-01-01 00:00:00+00:00     16
2014-01-06 00:00:00+00:00     81
2014-01-11 00:00:00+00:00    196
2014-01-16 00:00:00+00:00    361
2014-01-21 00:00:00+00:00    576
2014-01-26 00:00:00+00:00    841
Freq: 5D, dtype: int64
# We can even define our own custom sampling methods.
def geometric_mean(subseries):
    return np.product(subseries.values) ** (1.0 / len(subseries))
resampled_geometric = ts.resample('5D', how=geometric_mean)
print resampled_geometric
2014-01-01 00:00:00+00:00      0.000000
2014-01-06 00:00:00+00:00     46.970017
2014-01-11 00:00:00+00:00    141.990120
2014-01-16 00:00:00+00:00    286.995117
2014-01-21 00:00:00+00:00    481.997094
2014-01-26 00:00:00+00:00    726.998074
Freq: 5D, dtype: float64
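The first bucket's geometric mean is exactly 0 because that bucket contains 0 ** 2 = 0, and any product involving 0 is 0.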
pd.DataFrame(
    {
        "resampled": resampled,
        "resampled_first": resampled_first,
        "resampled_last": resampled_last,
        "resampled_geometric": resampled_geometric,
    }
).plot(linewidth=2, figsize=DEFAULT_FIGSIZE)
<matplotlib.axes.AxesSubplot at 0x105610c90>
# Upsampling creates missing data, which is represented by numpy.nan.
ts.resample('6H')
2014-01-01 00:00:00+00:00      0
2014-01-01 06:00:00+00:00    NaN
2014-01-01 12:00:00+00:00    NaN
2014-01-01 18:00:00+00:00    NaN
2014-01-02 00:00:00+00:00      1
                            ...
2014-01-28 18:00:00+00:00    NaN
2014-01-29 00:00:00+00:00    784
2014-01-29 06:00:00+00:00    NaN
2014-01-29 12:00:00+00:00    NaN
2014-01-29 18:00:00+00:00    NaN
2014-01-30 00:00:00+00:00    841
Freq: 6H, Length: 117
We can handle missing data in a variety of ways.
# We can fill empty values with fillna.
zero_filled = ts.resample('6H').fillna(0)
print zero_filled
2014-01-01 00:00:00+00:00      0
2014-01-01 06:00:00+00:00      0
2014-01-01 12:00:00+00:00      0
2014-01-01 18:00:00+00:00      0
2014-01-02 00:00:00+00:00      1
                            ...
2014-01-28 18:00:00+00:00      0
2014-01-29 00:00:00+00:00    784
2014-01-29 06:00:00+00:00      0
2014-01-29 12:00:00+00:00      0
2014-01-29 18:00:00+00:00      0
2014-01-30 00:00:00+00:00    841
Freq: 6H, Length: 117
# We can forward-fill with the last known prior value.
ffilled = ts.resample('6H').ffill()
print ffilled
2014-01-01 00:00:00+00:00      0
2014-01-01 06:00:00+00:00      0
2014-01-01 12:00:00+00:00      0
2014-01-01 18:00:00+00:00      0
2014-01-02 00:00:00+00:00      1
                            ...
2014-01-28 18:00:00+00:00    729
2014-01-29 00:00:00+00:00    784
2014-01-29 06:00:00+00:00    784
2014-01-29 12:00:00+00:00    784
2014-01-29 18:00:00+00:00    784
2014-01-30 00:00:00+00:00    841
Freq: 6H, Length: 117
# We can backfill with the next known value.
bfilled = ts.resample('6H').bfill()
print bfilled
2014-01-01 00:00:00+00:00      0
2014-01-01 06:00:00+00:00      1
2014-01-01 12:00:00+00:00      1
2014-01-01 18:00:00+00:00      1
2014-01-02 00:00:00+00:00      1
                            ...
2014-01-28 18:00:00+00:00    784
2014-01-29 00:00:00+00:00    784
2014-01-29 06:00:00+00:00    841
2014-01-29 12:00:00+00:00    841
2014-01-29 18:00:00+00:00    841
2014-01-30 00:00:00+00:00    841
Freq: 6H, Length: 117
# We can interpolate between known values.
# Note: `interpolate` is new as of pandas 0.14.0
# Quantopian is currently on pandas 0.12.0 due to breaking changes in the
# pandas API in 0.13.0.
linear_interpolated = ts.resample('6H').interpolate()
linear_interpolated
2014-01-01 00:00:00+00:00      0.00
2014-01-01 06:00:00+00:00      0.25
2014-01-01 12:00:00+00:00      0.50
2014-01-01 18:00:00+00:00      0.75
2014-01-02 00:00:00+00:00      1.00
                              ...
2014-01-28 18:00:00+00:00    770.25
2014-01-29 00:00:00+00:00    784.00
2014-01-29 06:00:00+00:00    798.25
2014-01-29 12:00:00+00:00    812.50
2014-01-29 18:00:00+00:00    826.75
2014-01-30 00:00:00+00:00    841.00
Freq: 6H, Length: 117
quadratic_interpolated = ts.resample('6H').interpolate('polynomial', order=2)
quadratic_interpolated
2014-01-01 00:00:00+00:00      0.0000
2014-01-01 06:00:00+00:00      0.0625
2014-01-01 12:00:00+00:00      0.2500
2014-01-01 18:00:00+00:00      0.5625
2014-01-02 00:00:00+00:00      1.0000
                               ...
2014-01-28 18:00:00+00:00    770.0625
2014-01-29 00:00:00+00:00    784.0000
2014-01-29 06:00:00+00:00    798.0625
2014-01-29 12:00:00+00:00    812.2500
2014-01-29 18:00:00+00:00    826.5625
2014-01-30 00:00:00+00:00    841.0000
Freq: 6H, Length: 117
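Because the underlying series is exactly quadratic (n squared on day n), polynomial interpolation of order 2 recovers the true intermediate values: 0.25 ** 2 = 0.0625, 0.5 ** 2 = 0.25, and 0.75 ** 2 = 0.5625, as seen above.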
pd.DataFrame(
    {
        "linear_interpolated": linear_interpolated,
        "quadratic_interpolated": quadratic_interpolated,
        "bfilled": bfilled,
        "ffilled": ffilled,
        "zero_filled": zero_filled,
    }
).plot(linewidth=2, figsize=DEFAULT_FIGSIZE)
<matplotlib.axes.AxesSubplot at 0x1072bf350>
DataFrame - 2D Tables of Interwoven Series
# Oftentimes we have more than one axis on which we want to store data.
from pandas.io.data import get_data_yahoo
spy = get_data_yahoo(
    symbols='SPY',
    start=pd.Timestamp('2011-01-01'),
    end=pd.Timestamp('2014-01-01'),
    adjust_price=True,
)
spy
| Date | Open | High | Low | Close | Volume | Adj_Ratio |
|---|---|---|---|---|---|---|
| 2011-01-03 | 117.923577 | 118.751861 | 116.983613 | 118.24 | 138725200 | 0.930657 |
| 2011-01-04 | 118.495717 | 118.532941 | 117.434811 | 118.17 | 137409700 | 0.930619 |
| 2011-01-05 | 117.803496 | 118.864453 | 117.691816 | 118.79 | 133975300 | 0.930664 |
| 2011-01-06 | 118.839206 | 118.969502 | 118.206340 | 118.56 | 122519000 | 0.930685 |
| 2011-01-07 | 118.710864 | 118.906295 | 117.398679 | 118.32 | 156034600 | 0.930628 |
| 2011-01-10 | 117.797752 | 118.337511 | 117.444117 | 118.17 | 122401700 | 0.930619 |
| 2011-01-11 | 118.599306 | 118.878495 | 118.143298 | 118.59 | 110287000 | 0.930629 |
| ... | ... | ... | ... | ... | ... | ... |
| 2013-12-20 | 179.037954 | 180.326069 | 178.919052 | 179.90 | 197087000 | 0.990857 |
| 2013-12-23 | 180.780732 | 180.968994 | 180.404209 | 180.86 | 85598000 | 0.990851 |
| 2013-12-24 | 180.873560 | 181.339270 | 180.863652 | 181.26 | 45368800 | 0.990871 |
| 2013-12-26 | 181.664751 | 182.279086 | 181.644934 | 182.18 | 63365000 | 0.990863 |
| 2013-12-27 | 182.417716 | 182.496984 | 181.981736 | 182.17 | 61814000 | 0.990862 |
| 2013-12-30 | 182.189543 | 182.338172 | 181.902193 | 182.14 | 56857000 | 0.990861 |
| 2013-12-31 | 182.385673 | 183.000000 | 182.246954 | 183.00 | 86119900 | 0.990850 |
754 rows × 6 columns
# Just plotting this DataFrame with the default arguments isn't very useful,
# because the scale of volume is so much greater than all the other columns.
spy.plot(figsize=DEFAULT_FIGSIZE)
<matplotlib.axes.AxesSubplot at 0x107331910>
# Let's make a more interesting plot.
# Create a figure
fig = plt.figure()
# Add a subplot for price.
price_subplot = fig.add_subplot('311', xlabel='Date', ylabel='Price')
spy['Close'].plot(ax=price_subplot, lw=2) # lw means "line width"
# Add another subplot for each day's spread.
spread_subplot = fig.add_subplot('312', xlabel='Date', ylabel='Spread')
spread = spy['High'] - spy['Low']
spread.plot(ax=spread_subplot, lw=2, color='r')
# And add a third plot for volume.
volume_subplot = fig.add_subplot('313', xlabel='Date', ylabel='Volume')
spy['Volume'].plot(ax=volume_subplot, lw=2)
# matplotlib.pyplot.gcf is short for "Get Current Figure". It provides an easy
# way to modify the last drawn plot.
plt.gcf().set_size_inches(*DEFAULT_FIGSIZE)
# Unsurprisingly, spread is strongly correlated with daily volume.
spread.corr(spy['Volume'])
0.787975095601951
DataFrames
A DataFrame has two indices, representing row labels and column labels. Since these are Index objects, we can use all the same slicing tools we used when working with Series.
# Default slicing acts on column labels.
# Passing a scalar value drops the dimension by one.
spy['Close'] # Returns a Series
Date
2011-01-03    118.24
2011-01-04    118.17
2011-01-05    118.79
2011-01-06    118.56
2011-01-07    118.32
               ...
2013-12-23    180.86
2013-12-24    181.26
2013-12-26    182.18
2013-12-27    182.17
2013-12-30    182.14
2013-12-31    183.00
Name: Close, Length: 754
# Passing a list filters the columns down to the supplied values.
spy[['Close', 'Volume']]
| Date | Close | Volume |
|---|---|---|
| 2011-01-03 | 118.24 | 138725200 |
| 2011-01-04 | 118.17 | 137409700 |
| 2011-01-05 | 118.79 | 133975300 |
| 2011-01-06 | 118.56 | 122519000 |
| 2011-01-07 | 118.32 | 156034600 |
| 2011-01-10 | 118.17 | 122401700 |
| 2011-01-11 | 118.59 | 110287000 |
| ... | ... | ... |
| 2013-12-20 | 179.90 | 197087000 |
| 2013-12-23 | 180.86 | 85598000 |
| 2013-12-24 | 181.26 | 45368800 |
| 2013-12-26 | 182.18 | 63365000 |
| 2013-12-27 | 182.17 | 61814000 |
| 2013-12-30 | 182.14 | 56857000 |
| 2013-12-31 | 183.00 | 86119900 |
754 rows × 2 columns
# Using .loc with one argument takes a slice of rows based on label.
spy.loc[pd.Timestamp('2013-02-01'):pd.Timestamp('2013-02-28')]
| Date | Open | High | Low | Close | Volume | Adj_Ratio |
|---|---|---|---|---|---|---|
| 2013-02-01 | 146.316970 | 147.064823 | 146.064448 | 146.89 | 131173000 | 0.971238 |
| 2013-02-04 | 145.997571 | 146.920254 | 145.133163 | 145.24 | 159073600 | 0.971245 |
| 2013-02-05 | 146.030113 | 147.127645 | 145.971836 | 146.71 | 113912400 | 0.971268 |
| 2013-02-06 | 146.188418 | 146.907122 | 146.081583 | 146.81 | 138762800 | 0.971223 |
| 2013-02-07 | 146.862813 | 146.998788 | 145.551624 | 146.62 | 162490000 | 0.971251 |
| 2013-02-08 | 146.876659 | 147.527415 | 146.876659 | 147.44 | 103133700 | 0.971278 |
| 2013-02-11 | 147.380862 | 147.536265 | 147.040917 | 147.41 | 73775000 | 0.971272 |
| ... | ... | ... | ... | ... | ... | ... |
| 2013-02-20 | 148.738262 | 148.786825 | 146.912299 | 146.99 | 160574800 | 0.971257 |
| 2013-02-21 | 146.614456 | 147.061214 | 145.623817 | 146.09 | 183257000 | 0.971214 |
| 2013-02-22 | 146.801290 | 147.520000 | 146.160279 | 147.52 | 106356600 | 0.971229 |
| 2013-02-25 | 148.245729 | 148.469122 | 144.720000 | 144.72 | 245824800 | 0.971275 |
| 2013-02-26 | 145.418619 | 145.884829 | 144.457061 | 145.71 | 186596200 | 0.971270 |
| 2013-02-27 | 145.578109 | 147.947918 | 145.451849 | 147.54 | 150781900 | 0.971233 |
| 2013-02-28 | 147.531660 | 148.473765 | 147.055752 | 147.25 | 126866000 | 0.971242 |
19 rows × 6 columns
# Using .loc with two arguments takes a slice of rows based on label, then a
# slice of columns based on name.
# Note the comma between the first slice and the second slice!
spy.loc[pd.Timestamp('2013-02-01'):pd.Timestamp('2013-02-28'), 'Open':'Low']
| Date | Open | High | Low |
|---|---|---|---|
| 2013-02-01 | 146.316970 | 147.064823 | 146.064448 |
| 2013-02-04 | 145.997571 | 146.920254 | 145.133163 |
| 2013-02-05 | 146.030113 | 147.127645 | 145.971836 |
| 2013-02-06 | 146.188418 | 146.907122 | 146.081583 |
| 2013-02-07 | 146.862813 | 146.998788 | 145.551624 |
| 2013-02-08 | 146.876659 | 147.527415 | 146.876659 |
| 2013-02-11 | 147.380862 | 147.536265 | 147.040917 |
| ... | ... | ... | ... |
| 2013-02-20 | 148.738262 | 148.786825 | 146.912299 |
| 2013-02-21 | 146.614456 | 147.061214 | 145.623817 |
| 2013-02-22 | 146.801290 | 147.520000 | 146.160279 |
| 2013-02-25 | 148.245729 | 148.469122 | 144.720000 |
| 2013-02-26 | 145.418619 | 145.884829 | 144.457061 |
| 2013-02-27 | 145.578109 | 147.947918 | 145.451849 |
| 2013-02-28 | 147.531660 | 148.473765 | 147.055752 |
19 rows × 3 columns
# We can use iloc when we want lookups by position.
spy.iloc[-20:-10, [0,2]]
| Date | Open | Low |
|---|---|---|
| 2013-12-03 | 177.327241 | 176.568422 |
| 2013-12-04 | 176.509114 | 175.769963 |
| 2013-12-05 | 176.813197 | 176.182461 |
| 2013-12-06 | 178.053910 | 177.541439 |
| 2013-12-09 | 178.838985 | 178.533480 |
| 2013-12-10 | 178.356666 | 178.021594 |
| 2013-12-11 | 178.199567 | 175.913188 |
| 2013-12-12 | 176.052613 | 175.185359 |
| 2013-12-13 | 175.914351 | 175.194925 |
| 2013-12-16 | 176.353917 | 176.304642 |
Boolean Series slicing is very useful with DataFrame.
# Get the days on which SPY closed higher than it opened.
up_days = spy['Close'] > spy['Open']
up_days
Date
2011-01-03     True
2011-01-04    False
2011-01-05     True
2011-01-06    False
2011-01-07    False
              ...
2013-12-23     True
2013-12-24     True
2013-12-26     True
2013-12-27    False
2013-12-30    False
2013-12-31     True
Length: 754
spy[up_days]
| Date | Open | High | Low | Close | Volume | Adj_Ratio |
|---|---|---|---|---|---|---|
| 2011-01-03 | 117.923577 | 118.751861 | 116.983613 | 118.24 | 138725200 | 0.930657 |
| 2011-01-05 | 117.803496 | 118.864453 | 117.691816 | 118.79 | 133975300 | 0.930664 |
| 2011-01-10 | 117.797752 | 118.337511 | 117.444117 | 118.17 | 122401700 | 0.930619 |
| 2011-01-12 | 119.315668 | 119.790288 | 118.617698 | 119.66 | 107929200 | 0.930627 |
| 2011-01-14 | 119.297005 | 120.357919 | 119.213248 | 120.33 | 117677900 | 0.930626 |
| 2011-01-18 | 120.223573 | 120.651680 | 120.083973 | 120.54 | 114401300 | 0.930667 |
| 2011-01-20 | 119.088320 | 119.497814 | 118.315865 | 119.20 | 175745700 | 0.930668 |
| ... | ... | ... | ... | ... | ... | ... |
| 2013-12-18 | 176.330239 | 179.099566 | 174.753398 | 179.07 | 234906000 | 0.985526 |
| 2013-12-19 | 178.554492 | 179.066957 | 178.091303 | 178.86 | 136531200 | 0.985509 |
| 2013-12-20 | 179.037954 | 180.326069 | 178.919052 | 179.90 | 197087000 | 0.990857 |
| 2013-12-23 | 180.780732 | 180.968994 | 180.404209 | 180.86 | 85598000 | 0.990851 |
| 2013-12-24 | 180.873560 | 181.339270 | 180.863652 | 181.26 | 45368800 | 0.990871 |
| 2013-12-26 | 181.664751 | 182.279086 | 181.644934 | 182.18 | 63365000 | 0.990863 |
| 2013-12-31 | 182.385673 | 183.000000 | 182.246954 | 183.00 | 86119900 | 0.990850 |
423 rows × 6 columns
# We can use .ix when we want mixed lookups.
spy.ix[-20:-10, 'Open':'High']
| Date | Open | High |
|---|---|---|
| 2013-12-03 | 177.327241 | 177.770707 |
| 2013-12-04 | 176.509114 | 177.869150 |
| 2013-12-05 | 176.813197 | 177.138421 |
| 2013-12-06 | 178.053910 | 178.487538 |
| 2013-12-09 | 178.838985 | 179.036085 |
| 2013-12-10 | 178.356666 | 178.731158 |
| 2013-12-11 | 178.199567 | 178.229132 |
| 2013-12-12 | 176.052613 | 176.269427 |
| 2013-12-13 | 175.914351 | 176.072033 |
| 2013-12-16 | 176.353917 | 177.201441 |
five_day_returns = spy['Close'].pct_change(5)
five_day_returns
Date
2011-01-03         NaN
2011-01-04         NaN
2011-01-05         NaN
2011-01-06         NaN
2011-01-07         NaN
                  ...
2013-12-23    0.024006
2013-12-24    0.029535
2013-12-26    0.017368
2013-12-27    0.018506
2013-12-30    0.012451
2013-12-31    0.011832
Name: Close, Length: 754
# Checking for equality of floating point numbers is a bad idea because of
# roundoff error. `numpy.allclose` does an appropriate epsilon test.
test_return = (spy['Close'].iloc[5] - spy['Close'].iloc[0]) / spy['Close'].iloc[0]
np.allclose(five_day_returns.iloc[5], test_return)
True
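A classic illustration of why exact equality fails for floats (our addition, not from the original notebook):
# Roundoff makes mathematically-equal expressions compare unequal.
(0.1 + 0.2) == 0.3           # -> False
np.allclose(0.1 + 0.2, 0.3)  # -> True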
thirty_day_forward_returns = (spy['Close'].shift(-30) - spy['Close']) / spy['Close']
test_return = (spy['Close'].iloc[30] - spy['Close'].iloc[0]) / spy['Close'].iloc[0]
np.allclose(thirty_day_forward_returns.iloc[0], test_return)
True
returns = pd.DataFrame(
    {
        'forward_30Day': thirty_day_forward_returns,
        'backA_2Day': spy['Close'].pct_change(2),
        'backB_5Day': spy['Close'].pct_change(5),
        'backD_50Day': spy['Close'].pct_change(50),
        'backE_100Day': spy['Close'].pct_change(100),
        'backF_200Day': spy['Close'].pct_change(200),
        'backG_300Day': spy['Close'].pct_change(300),
    }
).dropna(how='any')
returns.plot(figsize=DEFAULT_FIGSIZE)
<matplotlib.axes.AxesSubplot at 0x101508190>
# Pairwise correlation of forward and backward returns.
corr = returns.corr()
corr
|   | backA_2Day | backB_5Day | backD_50Day | backE_100Day | backF_200Day | backG_300Day | forward_30Day |
|---|---|---|---|---|---|---|---|
| backA_2Day | 1.000000 | 0.595697 | 0.188921 | 0.071506 | 0.162341 | 0.103689 | -0.069359 |
| backB_5Day | 0.595697 | 1.000000 | 0.303945 | 0.106169 | 0.209141 | 0.133395 | -0.132127 |
| backD_50Day | 0.188921 | 0.303945 | 1.000000 | 0.443526 | 0.299287 | 0.324473 | -0.364206 |
| backE_100Day | 0.071506 | 0.106169 | 0.443526 | 1.000000 | 0.116012 | 0.288164 | -0.533907 |
| backF_200Day | 0.162341 | 0.209141 | 0.299287 | 0.116012 | 1.000000 | 0.276626 | -0.016570 |
| backG_300Day | 0.103689 | 0.133395 | 0.324473 | 0.288164 | 0.276626 | 1.000000 | 0.135996 |
| forward_30Day | -0.069359 | -0.132127 | -0.364206 | -0.533907 | -0.016570 | 0.135996 | 1.000000 |
corr.ix['forward_30Day',:-1].plot(kind='bar', position=.5, xlim=(-1, 6))
plt.gcf().set_size_inches(9, 6)
Here we show how to load data for two securities, graph the data, and compute correlation of returns and volatility for each security over the specified period.
# Load data for Pepsi and Coca-Cola from Yahoo.
symbols = [
    'PEP',
    'KO',
]
cola_data = get_data_yahoo(symbols, adjust_price=True)
cola_data
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 1186 (major_axis) x 2 (minor_axis)
Items axis: Open to Adj_Ratio
Major_axis axis: 2010-01-04 00:00:00 to 2014-09-18 00:00:00
Minor_axis axis: KO to PEP
# Compute the 1-day log returns and raw returns for both securities' close prices.
closes = cola_data['Close']
yesterday_closes = cola_data['Close'].shift(1)
cola_log_returns = (closes / yesterday_closes).apply(np.log)
cola_raw_returns = closes.pct_change(1)
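One reason to prefer log returns, shown in a small added sketch: they are additive across time, so the daily log returns should sum to the log of the total return over the whole period.
# Sum of daily log returns equals log(final close / first close).
# (Series.sum skips the NaN in the first row by default.)
total_log_return = np.log(closes.iloc[-1] / closes.iloc[0])
np.allclose(cola_log_returns.sum(), total_log_return)  # -> True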
# Look at the data we just calculated by throwing it into a Panel and
# pulling out just the DataFrame for KO.
pd.Panel({
    'closes': closes,
    'prev_closes': yesterday_closes,
    'log_returns': cola_log_returns,
    'raw_returns': cola_raw_returns,
}).loc[:, :, 'KO']
| Date | closes | log_returns | prev_closes | raw_returns |
|---|---|---|---|---|
| 2010-01-04 | 24.84 | NaN | NaN | NaN |
| 2010-01-05 | 24.54 | -0.012151 | 24.84 | -0.012077 |
| 2010-01-06 | 24.53 | -0.000408 | 24.54 | -0.000407 |
| 2010-01-07 | 24.47 | -0.002449 | 24.53 | -0.002446 |
| 2010-01-08 | 24.02 | -0.018561 | 24.47 | -0.018390 |
| 2010-01-11 | 24.50 | 0.019786 | 24.02 | 0.019983 |
| 2010-01-12 | 24.77 | 0.010960 | 24.50 | 0.011020 |
| ... | ... | ... | ... | ... |
| 2014-09-10 | 41.86 | 0.005269 | 41.64 | 0.005283 |
| 2014-09-11 | 41.95 | 0.002148 | 41.86 | 0.002150 |
| 2014-09-12 | 41.46 | -0.011749 | 41.95 | -0.011681 |
| 2014-09-15 | 41.50 | 0.000964 | 41.46 | 0.000965 |
| 2014-09-16 | 41.64 | 0.003368 | 41.50 | 0.003373 |
| 2014-09-17 | 41.61 | -0.000721 | 41.64 | -0.000720 |
| 2014-09-18 | 41.79 | 0.004317 | 41.61 | 0.004326 |
1186 rows × 4 columns
# Pull the standard returns and the log returns into a single DataFrame using DataFrame.join.
closes.join(cola_log_returns, rsuffix='_lr')\
      .join(cola_raw_returns, rsuffix='_rr')\
      .dropna(how='any')
| Date | KO | PEP | KO_lr | PEP_lr | KO_rr | PEP_rr |
|---|---|---|---|---|---|---|
| 2010-01-05 | 24.54 | 53.80 | -0.012151 | 0.011967 | -0.012077 | 0.012039 |
| 2010-01-06 | 24.53 | 53.26 | -0.000408 | -0.010088 | -0.000407 | -0.010037 |
| 2010-01-07 | 24.47 | 52.93 | -0.002449 | -0.006215 | -0.002446 | -0.006196 |
| 2010-01-08 | 24.02 | 52.75 | -0.018561 | -0.003407 | -0.018390 | -0.003401 |
| 2010-01-11 | 24.50 | 52.69 | 0.019786 | -0.001138 | 0.019983 | -0.001137 |
| 2010-01-12 | 24.77 | 53.43 | 0.010960 | 0.013947 | 0.011020 | 0.014044 |
| 2010-01-13 | 24.84 | 53.86 | 0.002822 | 0.008016 | 0.002826 | 0.008048 |
| ... | ... | ... | ... | ... | ... | ... |
| 2014-09-10 | 41.86 | 91.79 | 0.005269 | 0.004039 | 0.005283 | 0.004047 |
| 2014-09-11 | 41.95 | 91.65 | 0.002148 | -0.001526 | 0.002150 | -0.001525 |
| 2014-09-12 | 41.46 | 90.87 | -0.011749 | -0.008547 | -0.011681 | -0.008511 |
| 2014-09-15 | 41.50 | 91.20 | 0.000964 | 0.003625 | 0.000965 | 0.003632 |
| 2014-09-16 | 41.64 | 92.57 | 0.003368 | 0.014910 | 0.003373 | 0.015022 |
| 2014-09-17 | 41.61 | 92.85 | -0.000721 | 0.003020 | -0.000720 | 0.003025 |
| 2014-09-18 | 41.79 | 93.37 | 0.004317 | 0.005585 | 0.004326 | 0.005600 |
1185 rows × 6 columns
# Create a figure with three 'slots' for subplots.
fig = plt.figure()
# "311" means: put the subplot in the 1st slot of a 3 x 1 grid.
# "312" and "313" place the subsequent plots in the 2nd and 3rd slots.
price_subplot = fig.add_subplot('311', xlabel='Date', ylabel='Price')
return_subplot_pep = fig.add_subplot('312', xlabel='Date', ylabel='PEP Log Returns')
return_subplot_ko = fig.add_subplot('313', xlabel='Date', ylabel='KO Log Returns')
cola_data['Close'].plot(ax=price_subplot, color=['purple', 'red'])
cola_log_returns['PEP'].plot(ax=return_subplot_pep, color='red')
cola_log_returns['KO'].plot(ax=return_subplot_ko, color='purple')
# Set the size of the whole plot array. gcf stands for `get_current_figure`.
plt.gcf().set_size_inches(14, 10)
# Compute the correlation of our log returns
correlation = (cola_log_returns['PEP']).corr(cola_log_returns['KO'])
correlation
0.60912363471867426
# Compute the column-wise standard deviation of daily returns and multiply by
# sqrt(252) to get annualized volatility.
volatility = cola_log_returns.std() * np.sqrt(252)
volatility
KO     0.151500
PEP    0.140033
dtype: float64
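The sqrt(252) factor comes from assuming roughly independent daily returns: variance then scales linearly with the number of periods, so standard deviation scales with the square root of time, and 252 is the conventional number of trading days in a year.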