import datetime
import numpy as np
import pandas as pd
from pandas import Series, DataFrame, Panel
https://github.com/pydata/pandas
Python for Data Analysis ( http://shop.oreilly.com/product/0636920023784.do )
Important topics:
values = [1, 2, 3]
labels = ['Cashews', 'Almonds', 'Peanuts']
s = Series(values, labels)
s
s.index
s.values
type(s.values)
Accessing Series elements
s['Peanuts']
s['Almonds']
s['Cashews']
Series from dicts
d = dict(zip(labels, values))
Series(d)
How come the order changed? Constructing a Series from a plain dict sorts the labels, since dicts carry no guaranteed order.
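If the original order matters, one way to pin it is to pass the index explicitly alongside the dict. A minimal sketch, reusing the labels/values from above:

```python
import pandas as pd

values = [1, 2, 3]
labels = ['Cashews', 'Almonds', 'Peanuts']

# Passing index= pins the label order, regardless of how the
# dict's keys would otherwise be ordered or sorted.
s = pd.Series(dict(zip(labels, values)), index=labels)
```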
df = DataFrame([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]],
index=['a', 'b', 'c'],
columns=labels)
df
df.index
df.columns
df.values
Accessing DataFrame components
df['Cashews']
type(df['Cashews'])
Columns can be accessed as attributes
df.Cashews
df.ix['a', :]
p = Panel({'x' : df, 'y' : df**2})
p
The axes are named slightly differently: a Panel has no .index
p.index # AttributeError; use items/major_axis/minor_axis instead
p.items
p.major_axis
p.minor_axis
Accessing Panel components
p['x']
type(p['x'])
p.ix['x', :, :]
p.ix[:, :, 'Cashews'] #items, major, minor
s
positional
s[0]
s[-1]
s[:2]
s[:-1]
list of labels
s[['Cashews', 'Peanuts', 'Almonds']]
s[0] = 5
df
Single row using position
df.ix[0]
Using label
df.ix['a', :]
Single column
df.ix[:, 'Cashews']
Single element access
df.ix[0, 1]
lists, tuples, slices, arrays, oh my!
df.ix[[0, 1], :-1]
boolean indexing
df.ix[df.Almonds > 4]
boolean indexing with a DataFrame
df[df > 3] # new in 0.9.1
Mutation can also happen via the indexer as well
df.ix[:, 0] = 1
df.ix['b', :] = Series([6, 1, 5], df.columns)
df
Exercise: create a 4-by-2 DataFrame with one string column and one numerical column, then:
Get all entries where the string column is 'A'
Get the entry at position (2, 1)
Get all entries from the numerical column where the string column is not 'A'
path = 'https://dl.dropbox.com/u/22164876/data.csv'
df = pd.read_csv(path)
df
df.shape
df.index
df.head()
The default index isn't very useful
df = pd.read_csv(path, index_col=['Date', 'Time'])
df
df.index[:5]
df.index[0]
We want pandas to interpret dates and times automatically
ticks = pd.read_csv(path, parse_dates={'ts': ['Date', 'Time']}, index_col='ts')
ticks
ticks.index
ticks.index[0]
isinstance(ticks.index[0], datetime.datetime)
Why did we make a subclass of datetime? To carry nanosecond resolution:
ticks.index[0].nanosecond
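The same point can be seen with a hand-built timestamp (the date here is made up): a pandas Timestamp is a datetime.datetime, but it also resolves the sub-microsecond part.

```python
import datetime
import pandas as pd

# A timestamp with one nanosecond past the half-second boundary
ts = pd.Timestamp('2011-11-01 09:30:00.000000001')

# Timestamp subclasses datetime.datetime, so it works anywhere a
# datetime does, while additionally exposing .nanosecond
```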
Let's do some simple operations
sq_price = ticks.Price**2
np.sqrt(sq_price.mean())
sq_vol = ticks.Volume**2
np.sqrt(sq_vol.mean())
ticks['Price'].std()
mean = ticks.Price.mean()
std = ticks.Price.std(ddof=0)
uncentered = np.sqrt(mean**2 + std**2)
uncentered
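The identity being used here is E[X^2] = mean^2 + population variance, so the "uncentered" quantity equals the root mean square. A quick check on made-up numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
mean = x.mean()
std = x.std(ddof=0)                 # population std, matching ddof=0 above
uncentered = np.sqrt(mean**2 + std**2)

# identical (up to float error) to the root mean square of x
rms = np.sqrt((x**2).mean())
```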
std = ticks.Price.std()
ticks['Price'].std()
read_csv and to_csv are good friends
ticks.to_csv('tmp.csv')
pd.read_csv('tmp.csv', index_col='ts')
import pandas.io.sql as sql
import sqlite3
con = sqlite3.connect(':memory:')
sql.write_frame(ticks, 'ticks', con)
sql.read_frame('select * from ticks', con)
store = pd.HDFStore('ticks.h5')
store['ticks'] = ticks
store['ticks']
store.close()
df = ticks.ix[:1000, ['Price', 'Volume']]
df
df.index[0]
Summary statistics about this DataFrame
df.describe()
Each of the summary stats can be computed separately
df.count()
df.mean()
df.std()
df.min()
df.max()
df.quantile(0.50)
df.median()
df.mean(axis=1)
df.ix[:10, :].std(axis=1)
apply: for column-wise operations
applymap: for element-wise operations
df.sum()
df.apply(lambda x: x.sum())
df.applymap(lambda x: x.sum()).head() # no effect
df.head()
df.head().applymap(lambda x: x**2)
df.ix[:5]
Scalar operations are done element-wise
df.ix[:5] * 10
df.ix[:5] + 100
DataFrame with DataFrame (or Series with Series) is element-by-element
df.ix[:5] - df.ix[:5]
means = df.mean()
means
df.ix[:5] - means
DataFrame with Series: the Series is aligned on the DataFrame's columns, and the operation is broadcast down the rows
result = df - means
result.mean()
The term broadcasting describes how arrays with different shapes are combined in computations. Starting from the last dimension, two dimensions are compatible for broadcasting if they are equal or if one of them has length 1.
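The rule can be seen directly in NumPy with made-up shapes: a (4,) row stretches across the rows of a (3, 4) array, and a (3, 1) column stretches across its columns.

```python
import numpy as np

a = np.ones((3, 4))                 # shape (3, 4)
row = np.arange(4)                  # shape (4,), treated as (1, 4)
col = np.arange(3).reshape(3, 1)    # shape (3, 1)

# trailing dims 4 == 4, so the row is repeated down the 3 rows
r1 = a + row

# (3, 1) vs (3, 4): the size-1 dimension stretches to 4
r2 = a + col
```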
df.cov()
df.dot(df.T) # oops, wrong orientation
df.T.dot(df)
df.corr()
df.ix[:5].abs()
df.kurt()
ticks.ix[:, :3].head()
Reimplement DataFrame.cov
df.cov()
demeaned = df - df.mean()
numer = demeaned.T.dot(demeaned)
denom = demeaned.count()
numer / denom
Sample cov!
numer / (denom - 1)
Now package it up as a function and test against DataFrame.cov
def cov(df):
    demeaned = df - df.mean()
    numer = demeaned.T.dot(demeaned)
    denom = demeaned.count()
    return numer / (denom - 1)
df2 = df.copy()
df2.ix[:10, 0] = np.nan
df2.ix[-10:, 1] = np.nan
cov(df2)
Oops: NAs propagate through the dot product, so restrict to complete rows first
cov(df2.dropna())
df2.cov()
Take-home: how would you implement this NA handling generically, for a DataFrame of any size?
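One generic answer to the take-home question is pairwise deletion: for each pair of columns, drop only the rows where either value is missing, then compute the sample covariance on what remains. A sketch of the idea (the function name is made up, and this is not pandas' internal implementation, though DataFrame.cov behaves the same way):

```python
import numpy as np
import pandas as pd

def pairwise_cov(df):
    # Sample covariance matrix using pairwise-complete observations.
    cols = df.columns
    out = pd.DataFrame(index=cols, columns=cols, dtype=float)
    for i in cols:
        for j in cols:
            s1, s2 = df[i], df[j]
            mask = s1.notna() & s2.notna()     # rows where both are present
            x = s1[mask] - s1[mask].mean()
            y = s2[mask] - s2[mask].mean()
            out.loc[i, j] = (x * y).sum() / (mask.sum() - 1)
    return out
```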
df.ix[:5]
df.ix[6:11]
df.ix[:5] + df.ix[6:11]
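The all-NaN result above comes from index alignment: arithmetic between two pandas objects first aligns their labels, and any label present on only one side gets NaN. A tiny example with partially overlapping indices (made-up values):

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
b = pd.Series([10.0, 20.0, 30.0], index=[2, 3, 4])

# The result is indexed by the union of labels; only label 2 is
# present on both sides, so every other entry is NaN.
total = a + b
</```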
df.head()
df.head().shift()
df.Price.pct_change()
every_other = df.ix[[0, 1, 2, 5, 7]]
every_other
missing = every_other.reindex(df.ix[:20].index)
missing
df = ticks.ix[:, ['Price', 'Volume']]
df
df = ticks.reindex(columns=['Price', 'Volume'])
df
missing
missing.mean()
missing.values
missing.values.mean(axis=0)
missing.Price
missing.Price.fillna(method='ffill')
missing.Price.fillna(method='bfill')
missing.Price.fillna(method='ffill', limit=3)
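The effect of limit is easiest to see on a toy series (made-up values; .ffill here is the modern spelling of fillna(method='ffill')): filling stops after the given number of consecutive NAs.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# forward-fill, but propagate each value at most 2 steps
filled = s.ffill(limit=2)
```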
Does the filling method matter? How? Which is appropriate for this data? for your data?
We can fill using constant values
missing.Price.fillna(missing.Price.mean())
We can interpolate
missing.Price.interpolate()
missing.apply(Series.interpolate)
We can ignore NAs
missing.dropna()
Let's read in the data again without combining columns
ticks = pd.read_csv(path, parse_dates=['Date'])
ticks
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14632 entries, 0 to 14631
Data columns:
Date                                            14632  non-null values
Time                                            14632  non-null values
Price                                           14632  non-null values
Volume                                          14632  non-null values
Exchange Code                                   14632  non-null values
Sales Condition                                 14632  non-null values
Correction Indicator                            14632  non-null values
Sequence Number                                 14632  non-null values
Trade Stop Indicator                            14632  non-null values
Source of Trade                                 14632  non-null values
MDS 127 / TRF (Trade Reporting Facility) (*)     2421  non-null values
Exclude Record Flag                                28  non-null values
Filtered Price                                      0  non-null values
dtypes: float64(2), int64(3), object(8)
df = ticks.ix[:, ['Price', 'Volume']]
df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14632 entries, 0 to 14631
Data columns:
Price     14632  non-null values
Volume    14632  non-null values
dtypes: float64(1), int64(1)
df['Returns'] = df.Price.pct_change()
grouped = df.groupby(ticks.Date)
grouped
<pandas.core.groupby.DataFrameGroupBy at 0x10a760b10>
grouped.Returns.mean()
Date
2011-11-01   -4.226326e-07
2011-11-02    3.672861e-09
2011-11-03   -1.658575e-06
Name: Returns
grouped.Volume.sum()
Date
2011-11-01    2391125
2011-11-02    1114754
2011-11-03     783055
Name: Volume
grouped.Price.std()
Date
2011-11-01    0.188347
2011-11-02    0.140807
2011-11-03    0.068991
Name: Price
Compounded returns
grouped.Returns.agg(lambda x: (1 + x).prod() - 1)
Date
2011-11-01   -3.156385e-03
2011-11-02   -8.326673e-15
2011-11-03   -3.933986e-03
Name: Returns
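The compounding identity behind this agg, prod(1 + r) - 1 equals the total return over the period, can be checked on a tiny made-up price series:

```python
import pandas as pd

prices = pd.Series([100.0, 101.0, 99.0, 102.0])
rets = prices.pct_change()          # first element is NaN, skipped by prod

# compounding per-period returns recovers the overall return
compounded = (1 + rets).prod() - 1
total = prices.iloc[-1] / prices.iloc[0] - 1
```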
Daily volume weighted average price
grouped.agg(lambda x: (x.Price * x.Volume).sum() / x.Volume.sum())
| Date | Price | Volume | Returns |
| --- | --- | --- | --- |
| 2011-11-01 | 104.250489 | 104.250489 | 104.250489 |
| 2011-11-02 | 104.139615 | 104.139615 | 104.139615 |
| 2011-11-03 | 103.870989 | 103.870989 | 103.870989 |
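Note the VWAP was replicated into every column above: agg works column by column, so a lambda that mixes Price and Volume is better expressed with apply over each group's sub-frame. A sketch on made-up trades:

```python
import pandas as pd

trades = pd.DataFrame({'date':   ['d1', 'd1', 'd2', 'd2'],
                       'price':  [10.0, 12.0, 20.0, 22.0],
                       'volume': [100, 300, 200, 200]})

# one volume-weighted average price per day
vwap = trades.groupby('date')[['price', 'volume']].apply(
    lambda g: (g['price'] * g['volume']).sum() / g['volume'].sum())
```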
Daily percent change in price
grouped.agg(lambda x: x.irow(-1) / x.irow(0) - 1)
| Date | Price | Volume | Returns |
| --- | --- | --- | --- |
| 2011-11-01 | -0.003156 | 2.000000 | NaN |
| 2011-11-02 | 0.002790 | 0.048333 | -1 |
| 2011-11-03 | 0.000385 | 19.570000 | -1 |
We could also have used first and last
grouped.last() / grouped.first() - 1
| Date | Price | Volume | Returns |
| --- | --- | --- | --- |
| 2011-11-01 | -0.003156 | 2.000000 | -1 |
| 2011-11-02 | 0.002790 | 0.048333 | -1 |
| 2011-11-03 | 0.000385 | 19.570000 | -1 |
The difference is that first/last take the first/last non-NA element
rs = grouped.transform(lambda x: (x - x.mean()) / x.std())
rs
rs.min()
rs.max()
rs.mean()
rs.std()
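The group-wise standardization above can be verified on a small made-up frame: after transform, each group has mean ~0 and (sample) std ~1.

```python
import pandas as pd

df_small = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'b', 'b'],
                         'val': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]})

# standardize within each group; transform keeps the original shape
z = df_small.groupby('key')['val'].transform(
    lambda x: (x - x.mean()) / x.std())
```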
ticks
df = ticks.ix[:, ['Date', 'Time', 'Price', 'Volume']]
df
df = df.set_index(['Date', 'Time'])
df
rets = df.Price / df.Price.shift(5) - 1
rets
grouped = df.groupby(level=0)
rs = grouped.transform(lambda x: (x / x.shift(5) - 1).cumsum())
returns = rs.Price
returns
Exercise: implement a function that computes each column's uncentered sample variance, grouped by date:
def raw_var(df, dates):
    return df.groupby(dates).agg(lambda x: (x**2).sum() / (x.count() - 1))
raw_var(df, ticks.Date)
| Date | Price | Volume | Returns |
| --- | --- | --- | --- |
| 2011-11-01 | 10861.093515 | 12393295.177649 | 2.817714e-08 |
| 2011-11-02 | 10844.555943 | 646318.922801 | 7.344315e-09 |
| 2011-11-03 | 10796.072772 | 6452776.031726 | 1.623339e-08 |
Are student solutions robust to NAs?