From Rob Pike's Notes on Programming in C:
Rule 5. Data dominates. If you've chosen the right data structures and
organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
Pandas is built on a hierarchy of a few powerful data structures. Each of these structures is composed of, and designed to interoperate with, the simpler structures.
Index (1-dimensional immutable ordered hash table)
Series (1-dimensional labelled array)
DataFrame (2-dimensional labelled array)
Panel (3-dimensional labelled array)
# Tell IPython to display matplotlib plots inline.
%matplotlib inline
# Set default font attributes.
import matplotlib
font = {'family': 'normal',
        'weight': 'bold',
        'size': 13}
matplotlib.rc('font', **font)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
randn = np.random.randn
pd.set_option('display.mpl_style', 'default')
pd.set_option('display.max_rows', 15)
# Make a default figure size for later use.
DEFAULT_FIGSIZE = (12, 6)
Series
s = pd.Series([3, 5, 7, 2])
s
0    3
1    5
2    7
3    2
dtype: int64
# An important concept to understand when working with a `Series` is that it's
# actually composed of two pieces: an index array, and a data array.
print "The index is {0}.".format(s.index)
print "The values are {0}.".format(s.values)
The index is Int64Index([0, 1, 2, 3], dtype='int64').
The values are [3 5 7 2].
# You can explicitly pass your own labels to use as an index. If you don't,
# Pandas will construct a default index with integer labels.
pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd'])
a    1.185723
b   -0.180358
c    0.762084
d    1.277645
dtype: float64
# You can also construct a Series from a dictionary.
# The keys are used as the index, and the values are used as the Series' values.
pd.Series(
    {
        'a': 1,
        'b': 2,
        'c': 3,
    }
)
a    1
b    2
c    3
dtype: int64
# You get performance (and code clarity!) benefits if your Series'
# labels/values are homogeneously typed, but mixed-type arrays are supported.
pd.Series(
    [1, 2.6, 'a', {'a': 'b'}],
    index=[1, 'a', 2, 2.5],
)
1                 1
a               2.6
2                 a
2.5    {u'a': u'b'}
dtype: object
Indexing a Series with __getitem__ (aka [])
Pandas objects support a wide range of selection and filtering methods. An important idea to keep in mind is the following: if you have an N-dimensional object, selecting with a single scalar key drops the dimension by one, while selecting with a list or slice of keys gives you back another N-dimensional object.
s = pd.Series(range(10), index=list('ABCDEFGHIJ'))
s
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int64
# Lookups by key work as you'd expect.
s['E']
4
# We can look up multiple values at a time by passing a list of keys.
# The resulting value is a new `Series`.
s[['E', 'I', 'B']]
E    4
I    8
B    1
dtype: int64
# Because the Index is ordered, we can use Python's slicing syntax.
s['E':]
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int64
# Label-based slicing is inclusive of both endpoints. (Unlike positions,
# a label has no natural "one past the end" successor.)
s[:'I']
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
dtype: int64
s['E':'I']
E    4
F    5
G    6
H    7
I    8
dtype: int64
# Step arguments work just like Python lists.
s['E':'I':2]
E    4
G    6
I    8
dtype: int64
# If you don't know the label you want, but you do know the position, you can
# use `iloc`.
print "The first entry is: %d" % s.iloc[0]
print "The last entry is: %d" % s.iloc[-1]
The first entry is: 0
The last entry is: 9
# Slicing works with `iloc` as well.
# Note that, unlike with label-based slicing, integer-based slices are
# right-open intervals, i.e. doing s.iloc[X:Y] gives you elements with indices
# in [X, Y). This is the same as the semantics for list slicing.
s.iloc[5:]
F    5
G    6
H    7
I    8
J    9
dtype: int64
print s.iloc[:5]
A    0
B    1
C    2
D    3
E    4
dtype: int64
s.iloc[-3:]
H    7
I    8
J    9
dtype: int64
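One caveat worth a quick aside (our addition, with a made-up Series t): when the index itself holds integers, plain [] does label-based lookup, so iloc is the unambiguous way to ask for "the k-th element".
# Added sketch: with an integer index, [] means label, iloc means position.
t = pd.Series([10, 20, 30], index=[2, 1, 0])
t[0]       # -> 30, the value labelled 0
t.iloc[0]  # -> 10, the value in the first position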
# Create two Series objects containing 100 samples each of sine and cosine.
sine = pd.Series(np.sin(np.linspace(0, 3.14 * 2, 100)), name='sine')
cosine = pd.Series(np.cos(np.linspace(0, 3.14 * 2, 100)), name='cosine')
sine
0     0.000000
1     0.063392
2     0.126529
3     0.189156
4     0.251023
        ...
94   -0.314905
95   -0.254105
96   -0.192283
97   -0.129688
98   -0.066570
99   -0.003185
Name: sine, Length: 100, dtype: float64
cosine
0     1.000000
1     0.997989
2     0.991963
3     0.981947
4     0.967981
        ...
94    0.949123
95    0.967177
96    0.981339
97    0.991555
98    0.997782
99    0.999995
Name: cosine, Length: 100, dtype: float64
# Multiplying two Series objects produces a new Series by multiplying values that have the same keys.
product = cosine * sine
product
0     0.000000
1     0.063264
2     0.125512
3     0.185742
4     0.242986
        ...
94   -0.298884
95   -0.245765
96   -0.188695
97   -0.128592
98   -0.066423
99   -0.003185
Length: 100, dtype: float64
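Alignment is by label, not by position. A minimal sketch (ours, with two toy Series a and b, not from the original notebook) makes this visible:
# Arithmetic aligns on labels: 'b' matches 'b' even though the positions differ.
a = pd.Series([1, 2], index=['a', 'b'])
b = pd.Series([10, 20], index=['b', 'a'])
a + b
# a    21
# b    12
# dtype: int64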
# Adding or multiplying a Series by a scalar applies that operation to each value in the Series.
cosine_plus_one = cosine + 1
cosine_plus_one
0     2.000000
1     1.997989
2     1.991963
3     1.981947
4     1.967981
        ...
94    1.949123
95    1.967177
96    1.981339
97    1.991555
98    1.997782
99    1.999995
Name: cosine, Length: 100, dtype: float64
# Other binary operators work as you'd expect.
# Note how much cleaner and clearer this is
# compared to looping over two containers and
# performing multiple operations on elements
# from each.
identity = (sine ** 2) + (cosine ** 2)
identity
0     1
1     1
2     1
3     1
4     1
     ...
94    1
95    1
96    1
97    1
98    1
99    1
Length: 100, dtype: float64
pandas data structures have plot methods that provide a user-friendly interface to matplotlib.
# Plot our sine values.
trigplot = sine.plot(
    ylim=(-1.2, 1.2),
    legend=True,
    figsize=DEFAULT_FIGSIZE,
    linewidth=3,
    label='sine',
)
# Add our other Series to the same plot.
cosine.plot(ax=trigplot, legend=True, linewidth=3)
product.plot(ax=trigplot, legend=True, linewidth=3, label='product')
identity.plot(ax=trigplot, legend=True, linewidth=3, label='identity')
<matplotlib.axes.AxesSubplot at 0x1055dec50>
Transforming a Series using the apply method.
def tenths_place(N):
    s = str(N)
    return s[s.find('.') + 1]
product.apply(tenths_place)
0     0
1     0
2     1
3     1
4     2
     ...
94    2
95    2
96    1
97    1
98    0
99    0
Length: 100, dtype: object
# A major problem when working with real-world data is handling missing entries.
# Pandas handles missing data by taking the union of the indices and marking
# entries that are absent on either side as NaN.
s1 = pd.Series({'a': 1, 'b': 2, 'c': 3})
# s2 is missing an entry for 'b'
s2 = pd.Series({'a': 4, 'c': 5})
s1 + s2
a     5
b   NaN
c     8
dtype: float64
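If NaN is not the behavior you want, Series.add (the method form of +) accepts a fill_value. A minimal sketch using the s1 and s2 above:
# Treat s2's missing 'b' entry as 0 instead of producing NaN.
s1.add(s2, fill_value=0)
# a    5
# b    2
# c    8
# dtype: float64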
Comparing Series
s1 = pd.Series(
    {
        'A': 1,
        'B': 2,
        'C': 3,
        'D': 4,
        'E': 3,
        'F': 2,
        'G': 1,
    }
)
# You can create a constant Series by passing a scalar value and an index.
s2 = pd.Series(2, index=s1.index)
greater = s1 > s2
greater
A    False
B    False
C     True
D     True
E     True
F    False
G    False
dtype: bool
less = s1 < s2
less
A     True
B    False
C    False
D    False
E    False
F    False
G     True
dtype: bool
equal = s1 == s2
equal
A    False
B     True
C    False
D    False
E    False
F     True
G    False
dtype: bool
# Comparisons against scalars also work.
s1_equal_to_3 = s1 == 3
s1_equal_to_3
A    False
B    False
C     True
D    False
E     True
F    False
G    False
dtype: bool
# Summarize the comparisons in a single DataFrame.
pd.DataFrame({
    's1': s1,
    's2': s2,
    's1 > s2': greater,
    's1 == s2': equal,
    's1 < s2': less,
    's1 == 3': s1_equal_to_3,
}, columns=['s1', 's2', 's1 > s2', 's1 == s2', 's1 < s2', 's1 == 3'])
|   | s1 | s2 | s1 > s2 | s1 == s2 | s1 < s2 | s1 == 3 |
|---|----|----|---------|----------|---------|---------|
| A | 1  | 2  | False   | False    | True    | False   |
| B | 2  | 2  | False   | True     | False   | False   |
| C | 3  | 2  | True    | False    | False   | True    |
| D | 4  | 2  | True    | False    | False   | False   |
| E | 3  | 2  | True    | False    | False   | True    |
| F | 2  | 2  | False   | True     | False   | False   |
| G | 1  | 2  | False   | False    | True    | False   |
Boolean-valued Series can be used for slicing. You can think of this as marking particular index values as "keep" (True) or "drop" (False).
# Indexing into a series with a boolean Series masks away the values which were
# false in the passed Series.
s1[s1 > s2]
C    3
D    4
E    3
dtype: int64
# We can combine these operators to concisely express complex
# computations/filters.
s1[(s1 > 1) & ~(s1 > s2)]
B    2
F    2
dtype: int64
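Note that the parentheses above are required: Python's & binds more tightly than comparison operators, so without them s1 > 1 & ~(s1 > s2) would parse as s1 > (1 & ~(s1 > s2)), which is not what we mean.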
# Pandas has a special index class, `DatetimeIndex`, for representing
# time-series data.
start = pd.Timestamp('2014-01-01', tz='UTC')
end = pd.Timestamp('2014-01-09', tz='UTC')
# date_range is an easy way to construct a DatetimeIndex
daily_index = pd.date_range(start, end)
daily_index
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01, ..., 2014-01-09]
Length: 9, Freq: D, Timezone: UTC
# A DatetimeIndex has a notion of frequency.
from pandas.tseries.offsets import Day, Hour, BDay, Minute
hourly_index = pd.date_range(
    pd.Timestamp('2014-01-01', tz='UTC'),
    pd.Timestamp('2014-01-09', tz='UTC'),
    freq=Hour(),
)
hourly_index
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 00:00:00+00:00, ..., 2014-01-09 00:00:00+00:00]
Length: 193, Freq: H, Timezone: UTC
bihourly_index = pd.date_range(
    pd.Timestamp('2014-01-01', tz='UTC'),
    pd.Timestamp('2014-01-09', tz='UTC'),
    freq=Hour(2),
)
bihourly_index
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 00:00:00+00:00, ..., 2014-01-09 00:00:00+00:00]
Length: 97, Freq: 2H, Timezone: UTC
weekday_index = pd.date_range(
    pd.Timestamp('2014-01-01', tz='UTC'),
    pd.Timestamp('2014-01-09', tz='UTC'),
    freq=BDay(),
)
print weekday_index
[i for i in weekday_index]
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01, ..., 2014-01-09]
Length: 7, Freq: B, Timezone: UTC
[Timestamp('2014-01-01 00:00:00+0000', tz='UTC', offset='B'),
 Timestamp('2014-01-02 00:00:00+0000', tz='UTC', offset='B'),
 Timestamp('2014-01-03 00:00:00+0000', tz='UTC', offset='B'),
 Timestamp('2014-01-06 00:00:00+0000', tz='UTC', offset='B'),
 Timestamp('2014-01-07 00:00:00+0000', tz='UTC', offset='B'),
 Timestamp('2014-01-08 00:00:00+0000', tz='UTC', offset='B'),
 Timestamp('2014-01-09 00:00:00+0000', tz='UTC', offset='B')]
If your Series has a DatetimeIndex, then you immediately get access to sophisticated resampling tools.
ts = pd.Series(
    np.arange(30) ** 2,
    pd.date_range(
        start=pd.Timestamp('2014-01-01', tz='UTC'),
        freq='1D',
        periods=30,
    )
)
ts.plot()
<matplotlib.axes.AxesSubplot at 0x105610b50>
# By default, resampling to a lower frequency takes the mean of the entries
# that were downsampled.
resampled = ts.resample('5D')
resampled
2014-01-01 00:00:00+00:00      6
2014-01-06 00:00:00+00:00     51
2014-01-11 00:00:00+00:00    146
2014-01-16 00:00:00+00:00    291
2014-01-21 00:00:00+00:00    486
2014-01-26 00:00:00+00:00    731
Freq: 5D, dtype: int64
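As a quick sanity check on the default aggregation (our addition): the first 5-day bucket covers 2014-01-01 through 2014-01-05, whose values are the squares 0, 1, 4, 9, 16.
# The mean of the first bucket's values should match resampled.iloc[0].
np.mean([0, 1, 4, 9, 16])  # -> 6.0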
# We can customize this behavior though.
resampled_first = ts.resample('5D', how='first')
resampled_first
2014-01-01 00:00:00+00:00      0
2014-01-06 00:00:00+00:00     25
2014-01-11 00:00:00+00:00    100
2014-01-16 00:00:00+00:00    225
2014-01-21 00:00:00+00:00    400
2014-01-26 00:00:00+00:00    625
Freq: 5D, dtype: int64
resampled_last = ts.resample('5D', how='last')
resampled_last
2014-01-01 00:00:00+00:00     16
2014-01-06 00:00:00+00:00     81
2014-01-11 00:00:00+00:00    196
2014-01-16 00:00:00+00:00    361
2014-01-21 00:00:00+00:00    576
2014-01-26 00:00:00+00:00    841
Freq: 5D, dtype: int64
# We can even define our own custom sampling methods.
def geometric_mean(subseries):
    return np.product(subseries.values) ** (1.0 / len(subseries))
resampled_geometric = ts.resample('5D', how=geometric_mean)
print resampled_geometric
2014-01-01 00:00:00+00:00      0.000000
2014-01-06 00:00:00+00:00     46.970017
2014-01-11 00:00:00+00:00    141.990120
2014-01-16 00:00:00+00:00    286.995117
2014-01-21 00:00:00+00:00    481.997094
2014-01-26 00:00:00+00:00    726.998074
Freq: 5D, dtype: float64
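The first bucket's geometric mean is exactly 0 because that bucket contains 0 ** 2 = 0, and any product involving 0 is 0.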
pd.DataFrame(
    {
        "resampled": resampled,
        "resampled_first": resampled_first,
        "resampled_last": resampled_last,
        "resampled_geometric": resampled_geometric,
    }
).plot(linewidth=2, figsize=DEFAULT_FIGSIZE)
<matplotlib.axes.AxesSubplot at 0x105610c90>
# Upsampling creates missing data, which is represented by numpy.nan.
ts.resample('6H')
2014-01-01 00:00:00+00:00      0
2014-01-01 06:00:00+00:00    NaN
2014-01-01 12:00:00+00:00    NaN
2014-01-01 18:00:00+00:00    NaN
2014-01-02 00:00:00+00:00      1
                            ...
2014-01-28 18:00:00+00:00    NaN
2014-01-29 00:00:00+00:00    784
2014-01-29 06:00:00+00:00    NaN
2014-01-29 12:00:00+00:00    NaN
2014-01-29 18:00:00+00:00    NaN
2014-01-30 00:00:00+00:00    841
Freq: 6H, Length: 117
We can handle missing data in a variety of ways.
# We can fill empty values with fillna.
zero_filled = ts.resample('6H').fillna(0)
print zero_filled
2014-01-01 00:00:00+00:00      0
2014-01-01 06:00:00+00:00      0
2014-01-01 12:00:00+00:00      0
2014-01-01 18:00:00+00:00      0
2014-01-02 00:00:00+00:00      1
                            ...
2014-01-28 18:00:00+00:00      0
2014-01-29 00:00:00+00:00    784
2014-01-29 06:00:00+00:00      0
2014-01-29 12:00:00+00:00      0
2014-01-29 18:00:00+00:00      0
2014-01-30 00:00:00+00:00    841
Freq: 6H, Length: 117
# We can forward-fill with the last known prior value.
ffilled = ts.resample('6H').ffill()
print ffilled
2014-01-01 00:00:00+00:00      0
2014-01-01 06:00:00+00:00      0
2014-01-01 12:00:00+00:00      0
2014-01-01 18:00:00+00:00      0
2014-01-02 00:00:00+00:00      1
                            ...
2014-01-28 18:00:00+00:00    729
2014-01-29 00:00:00+00:00    784
2014-01-29 06:00:00+00:00    784
2014-01-29 12:00:00+00:00    784
2014-01-29 18:00:00+00:00    784
2014-01-30 00:00:00+00:00    841
Freq: 6H, Length: 117
# We can backfill with the next known value.
bfilled = ts.resample('6H').bfill()
print bfilled
2014-01-01 00:00:00+00:00      0
2014-01-01 06:00:00+00:00      1
2014-01-01 12:00:00+00:00      1
2014-01-01 18:00:00+00:00      1
2014-01-02 00:00:00+00:00      1
                            ...
2014-01-28 18:00:00+00:00    784
2014-01-29 00:00:00+00:00    784
2014-01-29 06:00:00+00:00    841
2014-01-29 12:00:00+00:00    841
2014-01-29 18:00:00+00:00    841
2014-01-30 00:00:00+00:00    841
Freq: 6H, Length: 117
# We can interpolate between known values.
# Note: `interpolate` is new as of pandas 0.14.0
# Quantopian is currently on pandas 0.12.0 due to breaking changes in the
# pandas API in 0.13.0.
linear_interpolated = ts.resample('6H').interpolate()
linear_interpolated
2014-01-01 00:00:00+00:00      0.00
2014-01-01 06:00:00+00:00      0.25
2014-01-01 12:00:00+00:00      0.50
2014-01-01 18:00:00+00:00      0.75
2014-01-02 00:00:00+00:00      1.00
                              ...
2014-01-28 18:00:00+00:00    770.25
2014-01-29 00:00:00+00:00    784.00
2014-01-29 06:00:00+00:00    798.25
2014-01-29 12:00:00+00:00    812.50
2014-01-29 18:00:00+00:00    826.75
2014-01-30 00:00:00+00:00    841.00
Freq: 6H, Length: 117
quadratic_interpolated = ts.resample('6H').interpolate('polynomial', order=2)
quadratic_interpolated
2014-01-01 00:00:00+00:00      0.0000
2014-01-01 06:00:00+00:00      0.0625
2014-01-01 12:00:00+00:00      0.2500
2014-01-01 18:00:00+00:00      0.5625
2014-01-02 00:00:00+00:00      1.0000
                               ...
2014-01-28 18:00:00+00:00    770.0625
2014-01-29 00:00:00+00:00    784.0000
2014-01-29 06:00:00+00:00    798.0625
2014-01-29 12:00:00+00:00    812.2500
2014-01-29 18:00:00+00:00    826.5625
2014-01-30 00:00:00+00:00    841.0000
Freq: 6H, Length: 117
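Because the underlying series is exactly quadratic (n squared on day n), polynomial interpolation of order 2 recovers the true intermediate values: 0.25 ** 2 = 0.0625, 0.5 ** 2 = 0.25, and 0.75 ** 2 = 0.5625, as seen above.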
pd.DataFrame(
    {
        "linear_interpolated": linear_interpolated,
        "quadratic_interpolated": quadratic_interpolated,
        "bfilled": bfilled,
        "ffilled": ffilled,
        "zero_filled": zero_filled,
    }
).plot(linewidth=2, figsize=DEFAULT_FIGSIZE)
<matplotlib.axes.AxesSubplot at 0x1072bf350>
DataFrame - 2D Tables of Interwoven Series
# Oftentimes we have more than one axis on which we want to store data.
from pandas.io.data import get_data_yahoo
spy = get_data_yahoo(
    symbols='SPY',
    start=pd.Timestamp('2011-01-01'),
    end=pd.Timestamp('2014-01-01'),
    adjust_price=True,
)
spy
| Date | Open | High | Low | Close | Volume | Adj_Ratio |
|---|---|---|---|---|---|---|
| 2011-01-03 | 117.923577 | 118.751861 | 116.983613 | 118.24 | 138725200 | 0.930657 |
| 2011-01-04 | 118.495717 | 118.532941 | 117.434811 | 118.17 | 137409700 | 0.930619 |
| 2011-01-05 | 117.803496 | 118.864453 | 117.691816 | 118.79 | 133975300 | 0.930664 |
| 2011-01-06 | 118.839206 | 118.969502 | 118.206340 | 118.56 | 122519000 | 0.930685 |
| 2011-01-07 | 118.710864 | 118.906295 | 117.398679 | 118.32 | 156034600 | 0.930628 |
| 2011-01-10 | 117.797752 | 118.337511 | 117.444117 | 118.17 | 122401700 | 0.930619 |
| 2011-01-11 | 118.599306 | 118.878495 | 118.143298 | 118.59 | 110287000 | 0.930629 |
| ... | ... | ... | ... | ... | ... | ... |
| 2013-12-20 | 179.037954 | 180.326069 | 178.919052 | 179.90 | 197087000 | 0.990857 |
| 2013-12-23 | 180.780732 | 180.968994 | 180.404209 | 180.86 | 85598000 | 0.990851 |
| 2013-12-24 | 180.873560 | 181.339270 | 180.863652 | 181.26 | 45368800 | 0.990871 |
| 2013-12-26 | 181.664751 | 182.279086 | 181.644934 | 182.18 | 63365000 | 0.990863 |
| 2013-12-27 | 182.417716 | 182.496984 | 181.981736 | 182.17 | 61814000 | 0.990862 |
| 2013-12-30 | 182.189543 | 182.338172 | 181.902193 | 182.14 | 56857000 | 0.990861 |
| 2013-12-31 | 182.385673 | 183.000000 | 182.246954 | 183.00 | 86119900 | 0.990850 |
754 rows × 6 columns
# Just plotting this DataFrame with the default arguments isn't very useful,
# because the scale of volume is so much greater than all the other columns.
spy.plot(figsize=DEFAULT_FIGSIZE)
<matplotlib.axes.AxesSubplot at 0x107331910>
# Let's make a more interesting plot.
# Create a figure
fig = plt.figure()
# Add a subplot for price.
price_subplot = fig.add_subplot('311', xlabel='Date', ylabel='Price')
spy['Close'].plot(ax=price_subplot, lw=2) # lw means "line width"
# Add another subplot for each day's spread.
spread_subplot = fig.add_subplot('312', xlabel='Date', ylabel='Spread')
spread = spy['High'] - spy['Low']
spread.plot(ax=spread_subplot, lw=2, color='r')
# And add a third plot for volume.
volume_subplot = fig.add_subplot('313', xlabel='Date', ylabel='Volume')
spy['Volume'].plot(ax=volume_subplot, lw=2)
# matplotlib.pyplot.gcf is short for "Get Current Figure". It provides an easy
# way to modify the last drawn plot.
plt.gcf().set_size_inches(*DEFAULT_FIGSIZE)
# Unsurprisingly, spread is strongly correlated with daily volume.
spread.corr(spy['Volume'])
0.787975095601951
DataFrames
A DataFrame has two indices, representing row labels and column labels. Since these are Index objects, we can use all the same slicing tools we used when working with Series.
# Default slicing acts on column labels.
# Passing a scalar value drops the dimension by one.
spy['Close'] # Returns a Series
Date
2011-01-03    118.24
2011-01-04    118.17
2011-01-05    118.79
2011-01-06    118.56
2011-01-07    118.32
               ...
2013-12-23    180.86
2013-12-24    181.26
2013-12-26    182.18
2013-12-27    182.17
2013-12-30    182.14
2013-12-31    183.00
Name: Close, Length: 754
# Passing a list filters the columns down to the supplied values.
spy[['Close', 'Volume']]
| Date | Close | Volume |
|---|---|---|
| 2011-01-03 | 118.24 | 138725200 |
| 2011-01-04 | 118.17 | 137409700 |
| 2011-01-05 | 118.79 | 133975300 |
| 2011-01-06 | 118.56 | 122519000 |
| 2011-01-07 | 118.32 | 156034600 |
| 2011-01-10 | 118.17 | 122401700 |
| 2011-01-11 | 118.59 | 110287000 |
| ... | ... | ... |
| 2013-12-20 | 179.90 | 197087000 |
| 2013-12-23 | 180.86 | 85598000 |
| 2013-12-24 | 181.26 | 45368800 |
| 2013-12-26 | 182.18 | 63365000 |
| 2013-12-27 | 182.17 | 61814000 |
| 2013-12-30 | 182.14 | 56857000 |
| 2013-12-31 | 183.00 | 86119900 |
754 rows × 2 columns
# Using .loc with one argument takes a slice of rows based on label.
spy.loc[pd.Timestamp('2013-02-01'):pd.Timestamp('2013-02-28')]
| Date | Open | High | Low | Close | Volume | Adj_Ratio |
|---|---|---|---|---|---|---|
| 2013-02-01 | 146.316970 | 147.064823 | 146.064448 | 146.89 | 131173000 | 0.971238 |
| 2013-02-04 | 145.997571 | 146.920254 | 145.133163 | 145.24 | 159073600 | 0.971245 |
| 2013-02-05 | 146.030113 | 147.127645 | 145.971836 | 146.71 | 113912400 | 0.971268 |
| 2013-02-06 | 146.188418 | 146.907122 | 146.081583 | 146.81 | 138762800 | 0.971223 |
| 2013-02-07 | 146.862813 | 146.998788 | 145.551624 | 146.62 | 162490000 | 0.971251 |
| 2013-02-08 | 146.876659 | 147.527415 | 146.876659 | 147.44 | 103133700 | 0.971278 |
| 2013-02-11 | 147.380862 | 147.536265 | 147.040917 | 147.41 | 73775000 | 0.971272 |
| ... | ... | ... | ... | ... | ... | ... |
| 2013-02-20 | 148.738262 | 148.786825 | 146.912299 | 146.99 | 160574800 | 0.971257 |
| 2013-02-21 | 146.614456 | 147.061214 | 145.623817 | 146.09 | 183257000 | 0.971214 |
| 2013-02-22 | 146.801290 | 147.520000 | 146.160279 | 147.52 | 106356600 | 0.971229 |
| 2013-02-25 | 148.245729 | 148.469122 | 144.720000 | 144.72 | 245824800 | 0.971275 |
| 2013-02-26 | 145.418619 | 145.884829 | 144.457061 | 145.71 | 186596200 | 0.971270 |
| 2013-02-27 | 145.578109 | 147.947918 | 145.451849 | 147.54 | 150781900 | 0.971233 |
| 2013-02-28 | 147.531660 | 148.473765 | 147.055752 | 147.25 | 126866000 | 0.971242 |
19 rows × 6 columns
# Using .loc with two arguments takes a slice of rows based on label, then a
# slice of columns based on name.
# Note the comma between the first slice and the second slice!
spy.loc[pd.Timestamp('2013-02-01'):pd.Timestamp('2013-02-28'), 'Open':'Low']
| Date | Open | High | Low |
|---|---|---|---|
| 2013-02-01 | 146.316970 | 147.064823 | 146.064448 |
| 2013-02-04 | 145.997571 | 146.920254 | 145.133163 |
| 2013-02-05 | 146.030113 | 147.127645 | 145.971836 |
| 2013-02-06 | 146.188418 | 146.907122 | 146.081583 |
| 2013-02-07 | 146.862813 | 146.998788 | 145.551624 |
| 2013-02-08 | 146.876659 | 147.527415 | 146.876659 |
| 2013-02-11 | 147.380862 | 147.536265 | 147.040917 |
| ... | ... | ... | ... |
| 2013-02-20 | 148.738262 | 148.786825 | 146.912299 |
| 2013-02-21 | 146.614456 | 147.061214 | 145.623817 |
| 2013-02-22 | 146.801290 | 147.520000 | 146.160279 |
| 2013-02-25 | 148.245729 | 148.469122 | 144.720000 |
| 2013-02-26 | 145.418619 | 145.884829 | 144.457061 |
| 2013-02-27 | 145.578109 | 147.947918 | 145.451849 |
| 2013-02-28 | 147.531660 | 148.473765 | 147.055752 |
19 rows × 3 columns
# We can use iloc when we want lookups by position.
spy.iloc[-20:-10, [0,2]]
| Date | Open | Low |
|---|---|---|
| 2013-12-03 | 177.327241 | 176.568422 |
| 2013-12-04 | 176.509114 | 175.769963 |
| 2013-12-05 | 176.813197 | 176.182461 |
| 2013-12-06 | 178.053910 | 177.541439 |
| 2013-12-09 | 178.838985 | 178.533480 |
| 2013-12-10 | 178.356666 | 178.021594 |
| 2013-12-11 | 178.199567 | 175.913188 |
| 2013-12-12 | 176.052613 | 175.185359 |
| 2013-12-13 | 175.914351 | 175.194925 |
| 2013-12-16 | 176.353917 | 176.304642 |
Boolean Series slicing is very useful with DataFrame.
# Get the days on which SPY closed higher than it opened.
up_days = spy['Close'] > spy['Open']
up_days
Date
2011-01-03     True
2011-01-04    False
2011-01-05     True
2011-01-06    False
2011-01-07    False
              ...
2013-12-23     True
2013-12-24     True
2013-12-26     True
2013-12-27    False
2013-12-30    False
2013-12-31     True
Length: 754
spy[up_days]
| Date | Open | High | Low | Close | Volume | Adj_Ratio |
|---|---|---|---|---|---|---|
| 2011-01-03 | 117.923577 | 118.751861 | 116.983613 | 118.24 | 138725200 | 0.930657 |
| 2011-01-05 | 117.803496 | 118.864453 | 117.691816 | 118.79 | 133975300 | 0.930664 |
| 2011-01-10 | 117.797752 | 118.337511 | 117.444117 | 118.17 | 122401700 | 0.930619 |
| 2011-01-12 | 119.315668 | 119.790288 | 118.617698 | 119.66 | 107929200 | 0.930627 |
| 2011-01-14 | 119.297005 | 120.357919 | 119.213248 | 120.33 | 117677900 | 0.930626 |
| 2011-01-18 | 120.223573 | 120.651680 | 120.083973 | 120.54 | 114401300 | 0.930667 |
| 2011-01-20 | 119.088320 | 119.497814 | 118.315865 | 119.20 | 175745700 | 0.930668 |
| ... | ... | ... | ... | ... | ... | ... |
| 2013-12-18 | 176.330239 | 179.099566 | 174.753398 | 179.07 | 234906000 | 0.985526 |
| 2013-12-19 | 178.554492 | 179.066957 | 178.091303 | 178.86 | 136531200 | 0.985509 |
| 2013-12-20 | 179.037954 | 180.326069 | 178.919052 | 179.90 | 197087000 | 0.990857 |
| 2013-12-23 | 180.780732 | 180.968994 | 180.404209 | 180.86 | 85598000 | 0.990851 |
| 2013-12-24 | 180.873560 | 181.339270 | 180.863652 | 181.26 | 45368800 | 0.990871 |
| 2013-12-26 | 181.664751 | 182.279086 | 181.644934 | 182.18 | 63365000 | 0.990863 |
| 2013-12-31 | 182.385673 | 183.000000 | 182.246954 | 183.00 | 86119900 | 0.990850 |
423 rows × 6 columns
# We can use .ix when we want mixed lookups.
spy.ix[-20:-10, 'Open':'High']
| Date | Open | High |
|---|---|---|
| 2013-12-03 | 177.327241 | 177.770707 |
| 2013-12-04 | 176.509114 | 177.869150 |
| 2013-12-05 | 176.813197 | 177.138421 |
| 2013-12-06 | 178.053910 | 178.487538 |
| 2013-12-09 | 178.838985 | 179.036085 |
| 2013-12-10 | 178.356666 | 178.731158 |
| 2013-12-11 | 178.199567 | 178.229132 |
| 2013-12-12 | 176.052613 | 176.269427 |
| 2013-12-13 | 175.914351 | 176.072033 |
| 2013-12-16 | 176.353917 | 177.201441 |
five_day_returns = spy['Close'].pct_change(5)
five_day_returns
Date
2011-01-03         NaN
2011-01-04         NaN
2011-01-05         NaN
2011-01-06         NaN
2011-01-07         NaN
                  ...
2013-12-23    0.024006
2013-12-24    0.029535
2013-12-26    0.017368
2013-12-27    0.018506
2013-12-30    0.012451
2013-12-31    0.011832
Name: Close, Length: 754
# Checking for equality of floating point numbers is a bad idea because of
# roundoff error. `numpy.allclose` does an appropriate epsilon test.
test_return = (spy['Close'].iloc[5] - spy['Close'].iloc[0]) / spy['Close'].iloc[0]
np.allclose(five_day_returns.iloc[5], test_return)
True
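A classic illustration of why exact equality fails for floats (our addition, not from the original notebook):
# Roundoff makes mathematically-equal expressions compare unequal.
(0.1 + 0.2) == 0.3           # -> False
np.allclose(0.1 + 0.2, 0.3)  # -> True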
thirty_day_forward_returns = (spy['Close'].shift(-30) - spy['Close']) / spy['Close']
test_return = (spy['Close'].iloc[30] - spy['Close'].iloc[0]) / spy['Close'].iloc[0]
np.allclose(thirty_day_forward_returns.iloc[0], test_return)
True
returns = pd.DataFrame(
    {
        'forward_30Day': thirty_day_forward_returns,
        'backA_2Day': spy['Close'].pct_change(2),
        'backB_5Day': spy['Close'].pct_change(5),
        'backD_50Day': spy['Close'].pct_change(50),
        'backE_100Day': spy['Close'].pct_change(100),
        'backF_200Day': spy['Close'].pct_change(200),
        'backG_300Day': spy['Close'].pct_change(300),
    }
).dropna(how='any')
returns.plot(figsize=DEFAULT_FIGSIZE)
<matplotlib.axes.AxesSubplot at 0x101508190>
# Pairwise correlation of forward and backward returns.
corr = returns.corr()
corr
|   | backA_2Day | backB_5Day | backD_50Day | backE_100Day | backF_200Day | backG_300Day | forward_30Day |
|---|---|---|---|---|---|---|---|
| backA_2Day | 1.000000 | 0.595697 | 0.188921 | 0.071506 | 0.162341 | 0.103689 | -0.069359 |
| backB_5Day | 0.595697 | 1.000000 | 0.303945 | 0.106169 | 0.209141 | 0.133395 | -0.132127 |
| backD_50Day | 0.188921 | 0.303945 | 1.000000 | 0.443526 | 0.299287 | 0.324473 | -0.364206 |
| backE_100Day | 0.071506 | 0.106169 | 0.443526 | 1.000000 | 0.116012 | 0.288164 | -0.533907 |
| backF_200Day | 0.162341 | 0.209141 | 0.299287 | 0.116012 | 1.000000 | 0.276626 | -0.016570 |
| backG_300Day | 0.103689 | 0.133395 | 0.324473 | 0.288164 | 0.276626 | 1.000000 | 0.135996 |
| forward_30Day | -0.069359 | -0.132127 | -0.364206 | -0.533907 | -0.016570 | 0.135996 | 1.000000 |
corr.ix['forward_30Day',:-1].plot(kind='bar', position=.5, xlim=(-1, 6))
plt.gcf().set_size_inches(9, 6)
Here we show how to load data for two securities, graph the data, and compute correlation of returns and volatility for each security over the specified period.
# Load data for Pepsi and Coca-Cola from Yahoo.
symbols = [
    'PEP',
    'KO',
]
cola_data = get_data_yahoo(symbols, adjust_price=True)
cola_data
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 1186 (major_axis) x 2 (minor_axis)
Items axis: Open to Adj_Ratio
Major_axis axis: 2010-01-04 00:00:00 to 2014-09-18 00:00:00
Minor_axis axis: KO to PEP
# Compute the 1-day log returns and raw returns for both securities' close prices.
closes = cola_data['Close']
yesterday_closes = cola_data['Close'].shift(1)
cola_log_returns = (closes / yesterday_closes).apply(np.log)
cola_raw_returns = closes.pct_change(1)
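One reason to prefer log returns, shown in a small added sketch: they are additive across time, so the daily log returns should sum to the log of the total return over the whole period.
# Sum of daily log returns equals log(final close / first close).
# (Series.sum skips the NaN in the first row by default.)
total_log_return = np.log(closes.iloc[-1] / closes.iloc[0])
np.allclose(cola_log_returns.sum(), total_log_return)  # -> True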
# Look at the data we just calculated by throwing it into a Panel and
# pulling out just the DataFrame for KO.
pd.Panel({
    'closes': closes,
    'prev_closes': yesterday_closes,
    'log_returns': cola_log_returns,
    'raw_returns': cola_raw_returns,
}).loc[:, :, 'KO']
| Date | closes | log_returns | prev_closes | raw_returns |
|---|---|---|---|---|
| 2010-01-04 | 24.84 | NaN | NaN | NaN |
| 2010-01-05 | 24.54 | -0.012151 | 24.84 | -0.012077 |
| 2010-01-06 | 24.53 | -0.000408 | 24.54 | -0.000407 |
| 2010-01-07 | 24.47 | -0.002449 | 24.53 | -0.002446 |
| 2010-01-08 | 24.02 | -0.018561 | 24.47 | -0.018390 |
| 2010-01-11 | 24.50 | 0.019786 | 24.02 | 0.019983 |
| 2010-01-12 | 24.77 | 0.010960 | 24.50 | 0.011020 |
| ... | ... | ... | ... | ... |
| 2014-09-10 | 41.86 | 0.005269 | 41.64 | 0.005283 |
| 2014-09-11 | 41.95 | 0.002148 | 41.86 | 0.002150 |
| 2014-09-12 | 41.46 | -0.011749 | 41.95 | -0.011681 |
| 2014-09-15 | 41.50 | 0.000964 | 41.46 | 0.000965 |
| 2014-09-16 | 41.64 | 0.003368 | 41.50 | 0.003373 |
| 2014-09-17 | 41.61 | -0.000721 | 41.64 | -0.000720 |
| 2014-09-18 | 41.79 | 0.004317 | 41.61 | 0.004326 |
1186 rows × 4 columns
# Pull the standard returns and the log returns into a single DataFrame using DataFrame.join.
closes.join(cola_log_returns, rsuffix='_lr')\
      .join(cola_raw_returns, rsuffix='_rr')\
      .dropna(how='any')
| Date | KO | PEP | KO_lr | PEP_lr | KO_rr | PEP_rr |
|---|---|---|---|---|---|---|
| 2010-01-05 | 24.54 | 53.80 | -0.012151 | 0.011967 | -0.012077 | 0.012039 |
| 2010-01-06 | 24.53 | 53.26 | -0.000408 | -0.010088 | -0.000407 | -0.010037 |
| 2010-01-07 | 24.47 | 52.93 | -0.002449 | -0.006215 | -0.002446 | -0.006196 |
| 2010-01-08 | 24.02 | 52.75 | -0.018561 | -0.003407 | -0.018390 | -0.003401 |
| 2010-01-11 | 24.50 | 52.69 | 0.019786 | -0.001138 | 0.019983 | -0.001137 |
| 2010-01-12 | 24.77 | 53.43 | 0.010960 | 0.013947 | 0.011020 | 0.014044 |
| 2010-01-13 | 24.84 | 53.86 | 0.002822 | 0.008016 | 0.002826 | 0.008048 |
| ... | ... | ... | ... | ... | ... | ... |
| 2014-09-10 | 41.86 | 91.79 | 0.005269 | 0.004039 | 0.005283 | 0.004047 |
| 2014-09-11 | 41.95 | 91.65 | 0.002148 | -0.001526 | 0.002150 | -0.001525 |
| 2014-09-12 | 41.46 | 90.87 | -0.011749 | -0.008547 | -0.011681 | -0.008511 |
| 2014-09-15 | 41.50 | 91.20 | 0.000964 | 0.003625 | 0.000965 | 0.003632 |
| 2014-09-16 | 41.64 | 92.57 | 0.003368 | 0.014910 | 0.003373 | 0.015022 |
| 2014-09-17 | 41.61 | 92.85 | -0.000721 | 0.003020 | -0.000720 | 0.003025 |
| 2014-09-18 | 41.79 | 93.37 | 0.004317 | 0.005585 | 0.004326 | 0.005600 |
1185 rows × 6 columns
# Create a figure with three 'slots' for subplots.
fig = plt.figure()
# "311" means: put the subplot in the 1st slot of a 3 x 1 grid.
# "312" and "313" place the subsequent plots in the 2nd and 3rd slots.
price_subplot = fig.add_subplot('311', xlabel='Date', ylabel='Price')
return_subplot_pep = fig.add_subplot('312', xlabel='Date', ylabel='PEP Log Returns')
return_subplot_ko = fig.add_subplot('313', xlabel='Date', ylabel='KO Log Returns')
cola_data['Close'].plot(ax=price_subplot, color=['purple', 'red'])
cola_log_returns['PEP'].plot(ax=return_subplot_pep, color='red')
cola_log_returns['KO'].plot(ax=return_subplot_ko, color='purple')
# Set the size of the whole plot array. gcf stands for `get_current_figure`.
plt.gcf().set_size_inches(14, 10)
# Compute the correlation of our log returns
correlation = (cola_log_returns['PEP']).corr(cola_log_returns['KO'])
correlation
0.60912363471867426
# Compute the column-wise standard deviation of daily returns and multiply by
# sqrt(252) to get annualized volatility.
volatility = cola_log_returns.std() * np.sqrt(252)
volatility
KO     0.151500
PEP    0.140033
dtype: float64
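The sqrt(252) factor comes from assuming roughly independent daily returns: variance then scales linearly with the number of periods, so standard deviation scales with the square root of time, and 252 is the conventional number of trading days in a year.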