import datetime
import numpy as np
import pandas as pd
from pandas import Series, DataFrame, Panel
https://github.com/pydata/pandas
Python for Data Analysis ( http://shop.oreilly.com/product/0636920023784.do )
Important topics:
values = [1, 2, 3]
labels = ['Cashews', 'Almonds', 'Peanuts']
s = Series(values, labels)
s
s.index
s.values
type(s.values)
Accessing Series elements
s['Peanuts']
s['Almonds']
s['Cashews']
Series from dicts
d = dict(zip(labels, values))
Series(d)
How come the order changed? Constructing a Series from a plain dict sorts the labels, since dicts carry no guaranteed order.
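If the original order matters, one way to pin it is to pass the index explicitly alongside the dict. A minimal sketch, reusing the labels/values from above:

```python
import pandas as pd

values = [1, 2, 3]
labels = ['Cashews', 'Almonds', 'Peanuts']

# Passing index= pins the label order, regardless of how the
# dict's keys would otherwise be ordered or sorted.
s = pd.Series(dict(zip(labels, values)), index=labels)
```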
df = DataFrame([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]],
index=['a', 'b', 'c'],
columns=labels)
df
df.index
df.columns
df.values
Accessing DataFrame components
df['Cashews']
type(df['Cashews'])
Columns can be accessed as attributes
df.Cashews
df.ix['a', :]
p = Panel({'x' : df, 'y' : df**2})
p
The axes are named slightly differently: a Panel has no .index
p.index # AttributeError; use items/major_axis/minor_axis instead
p.items
p.major_axis
p.minor_axis
Accessing Panel components
p['x']
type(p['x'])
p.ix['x', :, :]
p.ix[:, :, 'Cashews'] #items, major, minor
s
positional
s[0]
s[-1]
s[:2]
s[:-1]
list of labels
s[['Cashews', 'Peanuts', 'Almonds']]
s[0] = 5
df
Single row using position
df.ix[0]
Using label
df.ix['a', :]
Single column
df.ix[:, 'Cashews']
Single element access
df.ix[0, 1]
lists, tuples, slices, arrays, oh my!
df.ix[[0, 1], :-1]
boolean indexing
df.ix[df.Almonds > 4]
boolean indexing with a DataFrame
df[df > 3] # new in 0.9.1
Mutation can also happen via the indexer as well
df.ix[:, 0] = 1
df.ix['b', :] = Series([6, 1, 5], df.columns)
df
Exercise: create a 4-by-2 DataFrame with one string column and one numerical column, then:
Get all entries where the string column is 'A'
Get the entry at position (2, 1)
Get all entries from the numerical column where the string column is not 'A'
path = 'https://dl.dropbox.com/u/22164876/data.csv'
df = pd.read_csv(path)
df
df.shape
df.index
df.head()
The default index isn't very useful
df = pd.read_csv(path, index_col=['Date', 'Time'])
df
df.index[:5]
df.index[0]
We want pandas to interpret dates and times automatically
ticks = pd.read_csv(path, parse_dates={'ts': ['Date', 'Time']}, index_col='ts')
ticks
ticks.index
ticks.index[0]
isinstance(ticks.index[0], datetime.datetime)
Why did we make a subclass of datetime? To carry nanosecond resolution:
ticks.index[0].nanosecond
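The same point can be seen with a hand-built timestamp (the date here is made up): a pandas Timestamp is a datetime.datetime, but it also resolves the sub-microsecond part.

```python
import datetime
import pandas as pd

# A timestamp with one nanosecond past the half-second boundary
ts = pd.Timestamp('2011-11-01 09:30:00.000000001')

# Timestamp subclasses datetime.datetime, so it works anywhere a
# datetime does, while additionally exposing .nanosecond
```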
Let's do some simple operations
sq_price = ticks.Price**2
np.sqrt(sq_price.mean())
sq_vol = ticks.Volume**2
np.sqrt(sq_vol.mean())
ticks['Price'].std()
mean = ticks.Price.mean()
std = ticks.Price.std(ddof=0)
uncentered = np.sqrt(mean**2 + std**2)
uncentered
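The identity being used here is E[X^2] = mean^2 + population variance, so the "uncentered" quantity equals the root mean square. A quick check on made-up numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
mean = x.mean()
std = x.std(ddof=0)                 # population std, matching ddof=0 above
uncentered = np.sqrt(mean**2 + std**2)

# identical (up to float error) to the root mean square of x
rms = np.sqrt((x**2).mean())
```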
std = ticks.Price.std()
ticks['Price'].std()
read_csv and to_csv are good friends
ticks.to_csv('tmp.csv')
pd.read_csv('tmp.csv', index_col='ts')
import pandas.io.sql as sql
import sqlite3
con = sqlite3.connect(':memory:')
sql.write_frame(ticks, 'ticks', con)
sql.read_frame('select * from ticks', con)
store = pd.HDFStore('ticks.h5')
store['ticks'] = ticks
store['ticks']
store.close()
df = ticks.ix[:1000, ['Price', 'Volume']]
df
df.index[0]
Summary statistics about this DataFrame
df.describe()
Each of the summary stats can be computed separately
df.count()
df.mean()
df.std()
df.min()
df.max()
df.quantile(0.50)
df.median()
df.mean(axis=1)
df.ix[:10, :].std(axis=1)
apply: for column-wise operations
applymap: for element-wise operations
df.sum()
df.apply(lambda x: x.sum())
df.applymap(lambda x: x.sum()).head() # no effect
df.head()
df.head().applymap(lambda x: x**2)
df.ix[:5]
Scalar operations are done element-wise
df.ix[:5] * 10
df.ix[:5] + 100
DataFrame with DataFrame (or Series with Series) is element-by-element
df.ix[:5] - df.ix[:5]
means = df.mean()
means
df.ix[:5] - means
DataFrame with Series: the Series is aligned on the DataFrame's columns, and the operation is broadcast down the rows
result = df - means
result.mean()
The term broadcasting describes how arrays with different shapes are combined in computations. Starting from the last dimension, two dimensions are compatible for broadcasting if they are equal or if one of them has length 1.
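The rule can be seen directly in NumPy with made-up shapes: a (4,) row stretches across the rows of a (3, 4) array, and a (3, 1) column stretches across its columns.

```python
import numpy as np

a = np.ones((3, 4))                 # shape (3, 4)
row = np.arange(4)                  # shape (4,), treated as (1, 4)
col = np.arange(3).reshape(3, 1)    # shape (3, 1)

# trailing dims 4 == 4, so the row is repeated down the 3 rows
r1 = a + row

# (3, 1) vs (3, 4): the size-1 dimension stretches to 4
r2 = a + col
```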
df.cov()
df.dot(df.T) # oops, wrong orientation
df.T.dot(df)
df.corr()
df.ix[:5].abs()
df.kurt()
ticks.ix[:, :3].head()
Reimplement DataFrame.cov
df.cov()
demeaned = df - df.mean()
numer = demeaned.T.dot(demeaned)
denom = demeaned.count()
numer / denom
Sample cov!
numer / (denom - 1)
Now package it up as a function and test against DataFrame.cov
def cov(df):
    demeaned = df - df.mean()
    numer = demeaned.T.dot(demeaned)
    denom = demeaned.count()
    return numer / (denom - 1)
df2 = df.copy()
df2.ix[:10, 0] = np.nan
df2.ix[-10:, 1] = np.nan
cov(df2)
Oops: NAs propagate through the dot product, so restrict to complete rows first
cov(df2.dropna())
df2.cov()
Take-home: how would you implement this NA handling generically, for a DataFrame of any size?
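One generic answer to the take-home question is pairwise deletion: for each pair of columns, drop only the rows where either value is missing, then compute the sample covariance on what remains. A sketch of the idea (the function name is made up, and this is not pandas' internal implementation, though DataFrame.cov behaves the same way):

```python
import numpy as np
import pandas as pd

def pairwise_cov(df):
    # Sample covariance matrix using pairwise-complete observations.
    cols = df.columns
    out = pd.DataFrame(index=cols, columns=cols, dtype=float)
    for i in cols:
        for j in cols:
            s1, s2 = df[i], df[j]
            mask = s1.notna() & s2.notna()     # rows where both are present
            x = s1[mask] - s1[mask].mean()
            y = s2[mask] - s2[mask].mean()
            out.loc[i, j] = (x * y).sum() / (mask.sum() - 1)
    return out
```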
df.ix[:5]
df.ix[6:11]
df.ix[:5] + df.ix[6:11]
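The all-NaN result above comes from index alignment: arithmetic between two pandas objects first aligns their labels, and any label present on only one side gets NaN. A tiny example with partially overlapping indices (made-up values):

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
b = pd.Series([10.0, 20.0, 30.0], index=[2, 3, 4])

# The result is indexed by the union of labels; only label 2 is
# present on both sides, so every other entry is NaN.
total = a + b
</```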
df.head()
df.head().shift()
df.Price.pct_change()
every_other = df.ix[[0, 1, 2, 5, 7]]
every_other
missing = every_other.reindex(df.ix[:20].index)
missing
df = ticks.ix[:, ['Price', 'Volume']]
df
df = ticks.reindex(columns=['Price', 'Volume'])
df
missing
missing.mean()
missing.values
missing.values.mean(axis=0)
missing.Price
missing.Price.fillna(method='ffill')
missing.Price.fillna(method='bfill')
missing.Price.fillna(method='ffill', limit=3)
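The effect of limit is easiest to see on a toy series (made-up values; .ffill here is the modern spelling of fillna(method='ffill')): filling stops after the given number of consecutive NAs.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# forward-fill, but propagate each value at most 2 steps
filled = s.ffill(limit=2)
```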
Does the filling method matter? How? Which is appropriate for this data? for your data?
We can fill using constant values
missing.Price.fillna(missing.Price.mean())
We can interpolate
missing.Price.interpolate()
missing.apply(Series.interpolate)
We can ignore NAs
missing.dropna()
Let's read in the data again without combining columns
ticks = pd.read_csv(path, parse_dates=['Date'])
ticks
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14632 entries, 0 to 14631
Data columns:
Date                                            14632  non-null values
Time                                            14632  non-null values
Price                                           14632  non-null values
Volume                                          14632  non-null values
Exchange Code                                   14632  non-null values
Sales Condition                                 14632  non-null values
Correction Indicator                            14632  non-null values
Sequence Number                                 14632  non-null values
Trade Stop Indicator                            14632  non-null values
Source of Trade                                 14632  non-null values
MDS 127 / TRF (Trade Reporting Facility) (*)     2421  non-null values
Exclude Record Flag                                28  non-null values
Filtered Price                                      0  non-null values
dtypes: float64(2), int64(3), object(8)
df = ticks.ix[:, ['Price', 'Volume']]
df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14632 entries, 0 to 14631
Data columns:
Price     14632  non-null values
Volume    14632  non-null values
dtypes: float64(1), int64(1)
df['Returns'] = df.Price.pct_change()
grouped = df.groupby(ticks.Date)
grouped
<pandas.core.groupby.DataFrameGroupBy at 0x10a760b10>
grouped.Returns.mean()
Date
2011-11-01   -4.226326e-07
2011-11-02    3.672861e-09
2011-11-03   -1.658575e-06
Name: Returns
grouped.Volume.sum()
Date
2011-11-01    2391125
2011-11-02    1114754
2011-11-03     783055
Name: Volume
grouped.Price.std()
Date
2011-11-01    0.188347
2011-11-02    0.140807
2011-11-03    0.068991
Name: Price
Compounded returns
grouped.Returns.agg(lambda x: (1 + x).prod() - 1)
Date
2011-11-01   -3.156385e-03
2011-11-02   -8.326673e-15
2011-11-03   -3.933986e-03
Name: Returns
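The compounding identity behind this agg, prod(1 + r) - 1 equals the total return over the period, can be checked on a tiny made-up price series:

```python
import pandas as pd

prices = pd.Series([100.0, 101.0, 99.0, 102.0])
rets = prices.pct_change()          # first element is NaN, skipped by prod

# compounding per-period returns recovers the overall return
compounded = (1 + rets).prod() - 1
total = prices.iloc[-1] / prices.iloc[0] - 1
```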
Daily volume weighted average price
grouped.agg(lambda x: (x.Price * x.Volume).sum() / x.Volume.sum())
| Date | Price | Volume | Returns |
| --- | --- | --- | --- |
| 2011-11-01 | 104.250489 | 104.250489 | 104.250489 |
| 2011-11-02 | 104.139615 | 104.139615 | 104.139615 |
| 2011-11-03 | 103.870989 | 103.870989 | 103.870989 |
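Note the VWAP was replicated into every column above: agg works column by column, so a lambda that mixes Price and Volume is better expressed with apply over each group's sub-frame. A sketch on made-up trades:

```python
import pandas as pd

trades = pd.DataFrame({'date':   ['d1', 'd1', 'd2', 'd2'],
                       'price':  [10.0, 12.0, 20.0, 22.0],
                       'volume': [100, 300, 200, 200]})

# one volume-weighted average price per day
vwap = trades.groupby('date')[['price', 'volume']].apply(
    lambda g: (g['price'] * g['volume']).sum() / g['volume'].sum())
```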
Daily percent change in price
grouped.agg(lambda x: x.irow(-1) / x.irow(0) - 1)
| Date | Price | Volume | Returns |
| --- | --- | --- | --- |
| 2011-11-01 | -0.003156 | 2.000000 | NaN |
| 2011-11-02 | 0.002790 | 0.048333 | -1 |
| 2011-11-03 | 0.000385 | 19.570000 | -1 |
We could also have used first and last
grouped.last() / grouped.first() - 1
| Date | Price | Volume | Returns |
| --- | --- | --- | --- |
| 2011-11-01 | -0.003156 | 2.000000 | -1 |
| 2011-11-02 | 0.002790 | 0.048333 | -1 |
| 2011-11-03 | 0.000385 | 19.570000 | -1 |
The difference is that first/last take the first/last non-NA element
rs = grouped.transform(lambda x: (x - x.mean()) / x.std())
rs
rs.min()
rs.max()
rs.mean()
rs.std()
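The group-wise standardization above can be verified on a small made-up frame: after transform, each group has mean ~0 and (sample) std ~1.

```python
import pandas as pd

df_small = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'b', 'b'],
                         'val': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]})

# standardize within each group; transform keeps the original shape
z = df_small.groupby('key')['val'].transform(
    lambda x: (x - x.mean()) / x.std())
```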
ticks
df = ticks.ix[:, ['Date', 'Time', 'Price', 'Volume']]
df
df = df.set_index(['Date', 'Time'])
df
rets = df.Price / df.Price.shift(5) - 1
rets
grouped = df.groupby(level=0)
rs = grouped.transform(lambda x: (x / x.shift(5) - 1).cumsum())
returns = rs.Price
returns
Exercise: implement a function that computes each column's uncentered sample variance, grouped by date:
def raw_var(df, dates):
    return df.groupby(dates).agg(lambda x: (x**2).sum() / (x.count() - 1))
raw_var(df, ticks.Date)
| Date | Price | Volume | Returns |
| --- | --- | --- | --- |
| 2011-11-01 | 10861.093515 | 12393295.177649 | 2.817714e-08 |
| 2011-11-02 | 10844.555943 | 646318.922801 | 7.344315e-09 |
| 2011-11-03 | 10796.072772 | 6452776.031726 | 1.623339e-08 |
Are student solutions robust to NAs?