For many of us in the business analytics world, working with dates and times is bread and butter stuff. We can work all kinds of magic with Excel's "serial" datetimes and worksheet functions like DATEVALUE(), WEEKDAY(), MONTH(), DATE() and many more. We can wield VBA functions like DateDiff with the best of them. We all know that, according to MS, time started on 1/1/1900 (or 1904 if you're on a Mac) and we just live with the fact that we can't do reliable datetime work before those magic epochs. I even wrote and released a free and open source MS Access add-in called Hillmaker many years ago that did statistical analysis of time interval data by day of week and time of day. That tool still finds lots of use in the healthcare world for analyzing occupancy of things like nursing units, emergency departments, recovery room, and many other places with entry and exit timestamp data.
So, when I started working with R and Python, one of the first things I did was see how they handled dates and times. I learned about the UTC epoch and the POSIX time standard. In R, I quickly learned to love the lubridate package. And then I started to learn about dates and times in Python and began to understand what programmers mean when they say "there be dragons in there". So, while I'm far from an expert, I did put together this little intro to Python dates and times aimed at business analytics types who are new to Python. It is by no means comprehensive, but it does cover the basics of the base Python datetime and timedelta classes along with the Numpy datetime64 and timedelta64 classes. Oh yeah, it also covers the pandas Timestamp and timedelta classes (confused yet?). I also show one particular dragon that bit me - conversion between Numpy datetimes and Pandas datetimes - even though Pandas uses Numpy under the hood. Anyhoo, here we go.
You can find the .ipynb file in my hselab-tutorials github repo. Clone or download a zip.
datetime and timedelta classes

import pandas as pd
import numpy as np
from pandas import Series, DataFrame
from datetime import datetime
from datetime import timedelta
import matplotlib.pyplot as plt
%matplotlib inline
Python has a module called datetime as part of its standard library.
https://docs.python.org/2/library/datetime.html
The datetime module supplies classes for manipulating dates and times in both simple and complex ways. While date and time arithmetic is supported, the focus of the implementation is on efficient attribute extraction for output formatting and manipulation.
Datetimes can be naive or aware
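Here's a quick sketch of the difference. It uses Python 3's datetime.timezone to build an aware object; the fixed UTC-4 offset below is just an illustration, not a real timezone database entry:

```python
from datetime import datetime, timezone, timedelta

# A naive datetime has no timezone information attached
naive_dt = datetime(2012, 5, 1, 8, 30)
print(naive_dt.tzinfo)  # None

# An aware datetime carries a tzinfo object (a fixed UTC-4 offset here,
# roughly EDT, purely for illustration)
edt = timezone(timedelta(hours=-4))
aware_dt = datetime(2012, 5, 1, 8, 30, tzinfo=edt)
print(aware_dt.utcoffset())
```

Naive datetimes are what you get from the plain datetime() constructor and are what we use throughout this tutorial.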
dates = [datetime(2012, 5, 1, 8, 30), datetime(2012, 5, 1, 9, 25), datetime(2012, 5, 1, 15, 30)]
dates
[datetime.datetime(2012, 5, 1, 8, 30), datetime.datetime(2012, 5, 1, 9, 25), datetime.datetime(2012, 5, 1, 15, 30)]
type(dates)
list
type(dates[0])
datetime.datetime
intime_dt = dates[0]
outtime_dt = dates[1]
type(intime_dt)
datetime.datetime
# The commented out lines represent attributes that ARE available in pandas Timestamp objects
# but not in Python datetime objects
print ('Datetime: {}'.format(intime_dt))
print ('Date: {}'.format(intime_dt.date()))
print ('Month: {}'.format(intime_dt.month))
print ('Day: {}'.format(intime_dt.day))
#print ('DayOfWeek: {}'.format(intime_dt.dayofweek))
print ('Weekday: {}'.format(intime_dt.weekday()))
#print ('DayOfYear: {}'.format(intime_dt.dayofyear))
#print ('WeekOfYear: {}'.format(intime_dt.weekofyear))
#print ('Quarter: {}'.format(intime_dt.quarter))
print ('Hour: {}'.format(intime_dt.hour))
print ('Minute: {}'.format(intime_dt.minute))
print ('Second: {}'.format(intime_dt.second))
print ('Microsecond: {}'.format(intime_dt.microsecond))
Datetime: 2012-05-01 08:30:00 Date: 2012-05-01 Month: 5 Day: 1 Weekday: 1 Hour: 8 Minute: 30 Second: 0 Microsecond: 0
los_td = outtime_dt - intime_dt
print (los_td)
type(los_td)
0:55:00
datetime.timedelta
atts = [att for att in dir(los_td) if '__' not in att]
print(atts)
['days', 'max', 'microseconds', 'min', 'resolution', 'seconds', 'total_seconds']
print ('Timedelta: {}'.format(los_td))
print ('Seconds: {}'.format(los_td.seconds))
print ('Total Seconds: {}'.format(los_td.total_seconds()))
print ('Microseconds: {}'.format(los_td.microseconds))
print ('Resolution: {}'.format(los_td.resolution))
print ('Min: {}'.format(los_td.min))
print ('Max: {}'.format(los_td.max))
Timedelta: 0:55:00 Seconds: 3300 Total Seconds: 3300.0 Microseconds: 0 Resolution: 0:00:00.000001 Min: -999999999 days, 0:00:00 Max: 999999999 days, 23:59:59.999999
So, you can use the total_seconds() method to get a duration in seconds from a timedelta object. If duration is < 1 second, it returns fractional seconds. To convert to other time units, just do the simple math.
timedelta(0,0,10)
datetime.timedelta(0, 0, 10)
timedelta(0,0,17).total_seconds()
1.7e-05
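To make the "simple math" explicit, here's a small sketch converting a 55 minute timedelta to minutes and hours:

```python
from datetime import timedelta

los_td = timedelta(minutes=55)

# total_seconds() is the only duration accessor, so convert by dividing
mins = los_td.total_seconds() / 60.0
hours = los_td.total_seconds() / 3600.0
print(mins, hours)
```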
atts = [att for att in dir(intime_dt) if '__' not in att]
print (atts)
['astimezone', 'combine', 'ctime', 'date', 'day', 'dst', 'fromordinal', 'fromtimestamp', 'hour', 'isocalendar', 'isoformat', 'isoweekday', 'max', 'microsecond', 'min', 'minute', 'month', 'now', 'replace', 'resolution', 'second', 'strftime', 'strptime', 'time', 'timestamp', 'timetuple', 'timetz', 'today', 'toordinal', 'tzinfo', 'tzname', 'utcfromtimestamp', 'utcnow', 'utcoffset', 'utctimetuple', 'weekday', 'year']
datetime64 and timedelta64 classes

The Numpy library (v 1.7.0) introduced a new data type called datetime64. Numpy uses this name because there is already a datetime data type in the Python datetime library. Like R, dates and times in Python are stored based on the notion of POSIX time - the number of seconds elapsed since Jan 1, 1970 00:00:00 GMT - called the UTC epoch. This doesn't mean that dates and times before that cannot be represented - we just use negative numbers. Representing and working with dates and times for computing is a pretty interesting topic, fraught with complexities like time zones, daylight saving time, and leap seconds. A few good places to start are:
Numpy docs - http://docs.scipy.org/doc/numpy-dev/reference/arrays.datetime.html
Pandas docs - http://pandas.pydata.org/pandas-docs/dev/timeseries.html
See http://cr.yp.to/proto/utctai.html for gory details on UTC and TAI and leap seconds and the problems with the POSIX standard.
http://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64
Here's the actual numpy proposal for datetime64 and timedelta64, authored by Travis Oliphant.
https://github.com/numpy/numpy/blob/master/doc/neps/datetime-proposal.rst
Let's create a Numpy datetime64 object.
np.datetime64('2012-05-01 08:30:00')
numpy.datetime64('2012-05-01T08:30:00-0400')
dates64 = [np.datetime64('2012-05-01 08:30:00'), np.datetime64('2012-05-01 09:25:00'), np.datetime64('2012-05-01 15:30:00')]
dates64
[numpy.datetime64('2012-05-01T08:30:00-0400'), numpy.datetime64('2012-05-01T09:25:00-0400'), numpy.datetime64('2012-05-01T15:30:00-0400')]
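One thing worth knowing: a datetime64 carries an explicit time unit, inferred from how much precision the string you pass in supplies. A quick sketch:

```python
import numpy as np

# The unit is inferred from the precision of the ISO 8601 string
d_day = np.datetime64('2012-05-01')           # day resolution
d_sec = np.datetime64('2012-05-01 08:30:00')  # second resolution
d_month = np.datetime64('2012-05')            # month resolution

print(d_day.dtype, d_sec.dtype, d_month.dtype)
```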
As you can see below, UTC is four hours ahead of EDT. In other words, the EDT timezone offset is UTC-4.
dt_utcnow = datetime.utcnow()
dt_utcnow
datetime.datetime(2014, 6, 25, 23, 36, 51, 924417)
# Can create datetime64's from base Python datetimes
dt64_utcnow = np.datetime64(dt_utcnow)
dt64_utcnow
numpy.datetime64('2014-06-25T19:36:51.924417-0400')
curdates64 = [np.datetime64('2014-06-01 08:30:00'), np.datetime64('2014-06-01 09:25:00'), np.datetime64('2014-06-01 15:30:00')]
curdates64
[numpy.datetime64('2014-06-01T08:30:00-0400'), numpy.datetime64('2014-06-01T09:25:00-0400'), numpy.datetime64('2014-06-01T15:30:00-0400')]
intime_dt64 = dates64[0]
outtime_dt64 = dates64[1]
print(type(intime_dt64))
print(intime_dt64)
intime_dt64
<class 'numpy.datetime64'> 2012-05-01T08:30:00-0400
numpy.datetime64('2012-05-01T08:30:00-0400')
# The commented out lines represent attributes that ARE available in pandas Timestamp objects
# but not in numpy datetime64 objects
print ('Datetime: {}'.format(str(intime_dt64)))
#print ('Date: {}'.format(intime_dt64.date()))
#print ('Month: {}'.format(intime_dt64.month))
#print ('Day: {}'.format(intime_dt.day))
#print ('DayOfWeek: {}'.format(intime_dt.dayofweek))
#print ('Weekday: {}'.format(intime_dt.weekday()))
#print ('DayOfYear: {}'.format(intime_dt.dayofyear))
#print ('WeekOfYear: {}'.format(intime_dt.weekofyear))
#print ('Quarter: {}'.format(intime_dt.quarter))
#print ('Hour: {}'.format(intime_dt.hour))
#print ('Minute: {}'.format(intime_dt.minute))
#print ('Second: {}'.format(intime_dt.second))
#print ('Microsecond: {}'.format(intime_dt.microsecond))
Datetime: 2012-05-01T08:30:00-0400
You can subtract two datetime64 objects and get a timedelta64 object in return.
los_dt64 = outtime_dt64 - intime_dt64
los_dt64
numpy.timedelta64(3300,'s')
# Coerce it to other units
np.timedelta64(los_dt64,'m')
numpy.timedelta64(55,'m')
Unfortunately, timedelta64 objects have almost no useful attributes in terms of date time math. On the other hand, the datetime.timedelta object (and Pandas) has a number of useful attributes.
Pandas prides itself on its handling of time series data (speed, flexibility, power) and is built on top of the numpy datetime64 and timedelta64 data types (instead of Python's base datetime and timedelta types). Pandas also wanted to be very time-zone aware and time-zone capable. So, Pandas has its own data type for timestamps called, well, Timestamp. It's based on numpy's datetime64 type but works like Python's datetime (I think). The following StackOverflow post (the "welcome to hell" post) is quite popular and relevant.
http://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64
Pandas time series indexes can be timestamp or time period based.
ts = Series(np.random.randint(1,100,3), index=dates)
ts
2012-05-01 08:30:00 86 2012-05-01 09:25:00 82 2012-05-01 15:30:00 34 dtype: int32
ts.index
<class 'pandas.tseries.index.DatetimeIndex'> [2012-05-01 08:30:00, ..., 2012-05-01 15:30:00] Length: 3, Freq: None, Timezone: None
If we have demand data by month, a period based index might make more sense.
from pandas import Period, PeriodIndex
periods = PeriodIndex([Period('2012-01'), Period('2012-02'),Period('2012-03')])
ts2 = Series(np.random.randint(1,100,3), index=periods)
ts2
2012-01 3 2012-02 59 2012-03 34 Freq: M, dtype: int32
ts2.index
<class 'pandas.tseries.period.PeriodIndex'> freq: M [2012-01, ..., 2012-03] length: 3
From the pandas docs:
Starting with 0.8, pandas allows you to capture both representations and convert between them. Under the hood, pandas represents timestamps using instances of Timestamp and sequences of timestamps using instances of DatetimeIndex. For regular time spans, pandas uses Period objects for scalar values and PeriodIndex for sequences of spans.
from pandas import Timestamp
Timestamp('20120501 08:30:00')
Timestamp('2012-05-01 08:30:00', tz=None)
timestamps = [Timestamp('20120501 08:30:00'), Timestamp('20120501 09:25:00'), Timestamp('20120501 15:30:00')]
intime_ts = timestamps[0]
outtime_ts = timestamps[1]
type(intime_ts)
pandas.tslib.Timestamp
los_ptd = outtime_ts - intime_ts
print (los_ptd)
type(los_ptd)
0:55:00
datetime.timedelta
# The pandas Timestamp data type has a number of useful attributes (dayofweek, dayofyear, weekofyear)
# that the base Python datetime type does not.
print ('Datetime: {}'.format(intime_ts))
print ('Date: {}'.format(intime_ts.date()))
print ('Month: {}'.format(intime_ts.month))
print ('Day: {}'.format(intime_ts.day))
print ('DayOfWeek: {}'.format(intime_ts.dayofweek))
print ('Weekday: {}'.format(intime_ts.weekday()))
print ('DayOfYear: {}'.format(intime_ts.dayofyear))
print ('WeekOfYear: {}'.format(intime_ts.weekofyear))
print ('Quarter: {}'.format(intime_ts.quarter))
print ('Hour: {}'.format(intime_ts.hour))
print ('Minute: {}'.format(intime_ts.minute))
print ('Second: {}'.format(intime_ts.second))
print ('Microsecond: {}'.format(intime_ts.microsecond))
Datetime: 2012-05-01 08:30:00 Date: 2012-05-01 Month: 5 Day: 1 DayOfWeek: 1 Weekday: 1 DayOfYear: 122 WeekOfYear: 18 Quarter: 2 Hour: 8 Minute: 30 Second: 0 Microsecond: 0
print ('Timedelta: {}'.format(los_ptd))
print ('Days: {}'.format(los_ptd.days))
print ('Seconds: {}'.format(los_ptd.seconds))
print ('Total Seconds: {}'.format(los_ptd.total_seconds()))
print ('Microseconds: {}'.format(los_ptd.microseconds))
print ('Resolution: {}'.format(los_ptd.resolution))
print ('Min: {}'.format(los_ptd.min))
print ('Max: {}'.format(los_ptd.max))
Timedelta: 0:55:00 Days: 0 Seconds: 3300 Total Seconds: 3300.0 Microseconds: 0 Resolution: 0:00:00.000001 Min: -999999999 days, 0:00:00 Max: 999999999 days, 23:59:59.999999
A big (potential) problem is trying to convert from numpy.datetime64 to pd.Timestamp. Seemingly strange things happened when I tried this using the tolist method of numpy.datetime64 objects. After much reading of the "Welcome to Hell" SO post, some experimenting, and a more careful reading of the numpy docs ("ISO 8601 specifies to use the local time zone if none is explicitly given:"), I think we have an answer.
So, by default, numpy assumes a naively specified datetime is in local time.
dt64 = np.datetime64('2014-06-01 08:30:00')
dt64
numpy.datetime64('2014-06-01T08:30:00-0400')
If we use its .tolist() method or pass it in to the pandas Timestamp() constructor, we get UTC!
dt64.tolist()
datetime.datetime(2014, 6, 1, 12, 30)
ts = pd.Timestamp(dt64)
ts
Timestamp('2014-06-01 12:30:00', tz=None)
If you were doing a bunch of bulk data transformation work, you might easily miss this subtlety and end up with your date time stats all staggered off by a few hours. Some more fun conversions are shown below.
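A heads up that this behavior is version-dependent: NumPy later removed timezone handling from datetime64 (around NumPy 1.11), so on newer versions a naive string parses with no offset and no shift occurs on conversion. Here's a quick roundtrip check you can run to see what your own NumPy does:

```python
import numpy as np
from datetime import datetime

dt64 = np.datetime64('2014-06-01 08:30:00')
roundtrip = dt64.tolist()  # gives a datetime.datetime

# On newer NumPy (timezone-naive datetime64) these match;
# on the older NumPy used in this post, tolist() shifted to UTC.
print(roundtrip == datetime(2014, 6, 1, 8, 30))
```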
dt = ts.to_datetime()
dt
datetime.datetime(2014, 6, 1, 12, 30)
... and if we try a "roundtrip" from numpy to Pandas to datetime and back, we actually end up where we started.
np.datetime64(dt)
numpy.datetime64('2014-06-01T08:30:00.000000-0400')
Appears this is quite a topic of discussion in the numpy community based on this discussion thread.
Generally, a time series in pandas is a Series object with a DatetimeIndex. Pandas has convenience functions for creating time series indexes. The default frequency for date_range is a calendar day, while the default for bdate_range is a business day.
index = pd.date_range('2000-1-1', periods=1000, freq='M')
index
<class 'pandas.tseries.index.DatetimeIndex'> [2000-01-31, ..., 2083-04-30] Length: 1000, Freq: M, Timezone: None
start = datetime(2014, 1, 1)
end = datetime(2014, 6, 12)
rng = pd.date_range(start, end)
rng
<class 'pandas.tseries.index.DatetimeIndex'> [2014-01-01, ..., 2014-06-12] Length: 163, Freq: D, Timezone: None
pd.date_range(start, end, freq='W')
<class 'pandas.tseries.index.DatetimeIndex'> [2014-01-05, ..., 2014-06-08] Length: 23, Freq: W-SUN, Timezone: None
data = np.random.randint(1, 100, size=163)
df = DataFrame(data, index=rng)
df.head()
0 | |
---|---|
2014-01-01 | 18 |
2014-01-02 | 9 |
2014-01-03 | 69 |
2014-01-04 | 44 |
2014-01-05 | 75 |
5 rows × 1 columns
# Date index slicing is easy but, BE CAREFUL, the end point is included (unlike usual Python slicing behavior).
df['2014-01-02':'2014-01-12']
0 | |
---|---|
2014-01-02 | 9 |
2014-01-03 | 69 |
2014-01-04 | 44 |
2014-01-05 | 75 |
2014-01-06 | 96 |
2014-01-07 | 87 |
2014-01-08 | 2 |
2014-01-09 | 69 |
2014-01-10 | 76 |
2014-01-11 | 90 |
2014-01-12 | 31 |
11 rows × 1 columns
df[datetime(2014,1,2):datetime(2014,1,12)]
0 | |
---|---|
2014-01-02 | 9 |
2014-01-03 | 69 |
2014-01-04 | 44 |
2014-01-05 | 75 |
2014-01-06 | 96 |
2014-01-07 | 87 |
2014-01-08 | 2 |
2014-01-09 | 69 |
2014-01-10 | 76 |
2014-01-11 | 90 |
2014-01-12 | 31 |
11 rows × 1 columns
DateOffset objects

Usually base frequencies are specified with a string alias such as 'W' or 'M'. However, the underlying DateOffset objects, such as Hour and Minute, can also be created and combined directly.
rng2 = pd.date_range(start, end, freq='M')
len(rng2)
5
data[:len(rng2)]
array([18, 9, 69, 44, 75])
df2 = DataFrame(data[:len(rng2)], index=rng2)
df2
0 | |
---|---|
2014-01-31 | 18 |
2014-02-28 | 9 |
2014-03-31 | 69 |
2014-04-30 | 44 |
2014-05-31 | 75 |
5 rows × 1 columns
from pandas.tseries.offsets import Hour, Minute
#We rarely use these but they are available
one_hour = Hour()
four_hours = Hour(4)
one_hour + four_hours
<5 * Hours>
one_hour + Minute(35)
<95 * Minutes>
rng = pd.date_range('9/1/2012','1/1/2014',freq='M')
rng
<class 'pandas.tseries.index.DatetimeIndex'> [2012-09-30, ..., 2013-12-31] Length: 16, Freq: M, Timezone: None
list(rng)
[Timestamp('2012-09-30 00:00:00', tz=None), Timestamp('2012-10-31 00:00:00', tz=None), Timestamp('2012-11-30 00:00:00', tz=None), Timestamp('2012-12-31 00:00:00', tz=None), Timestamp('2013-01-31 00:00:00', tz=None), Timestamp('2013-02-28 00:00:00', tz=None), Timestamp('2013-03-31 00:00:00', tz=None), Timestamp('2013-04-30 00:00:00', tz=None), Timestamp('2013-05-31 00:00:00', tz=None), Timestamp('2013-06-30 00:00:00', tz=None), Timestamp('2013-07-31 00:00:00', tz=None), Timestamp('2013-08-31 00:00:00', tz=None), Timestamp('2013-09-30 00:00:00', tz=None), Timestamp('2013-10-31 00:00:00', tz=None), Timestamp('2013-11-30 00:00:00', tz=None), Timestamp('2013-12-31 00:00:00', tz=None)]
rng = pd.date_range('9/1/2012','1/1/2014',freq='WOM-2TUE')
list(rng)
[Timestamp('2012-09-11 00:00:00', tz=None), Timestamp('2012-10-09 00:00:00', tz=None), Timestamp('2012-11-13 00:00:00', tz=None), Timestamp('2012-12-11 00:00:00', tz=None), Timestamp('2013-01-08 00:00:00', tz=None), Timestamp('2013-02-12 00:00:00', tz=None), Timestamp('2013-03-12 00:00:00', tz=None), Timestamp('2013-04-09 00:00:00', tz=None), Timestamp('2013-05-14 00:00:00', tz=None), Timestamp('2013-06-11 00:00:00', tz=None), Timestamp('2013-07-09 00:00:00', tz=None), Timestamp('2013-08-13 00:00:00', tz=None), Timestamp('2013-09-10 00:00:00', tz=None), Timestamp('2013-10-08 00:00:00', tz=None), Timestamp('2013-11-12 00:00:00', tz=None), Timestamp('2013-12-10 00:00:00', tz=None)]
Move data back and forth through time. Series and DataFrame objects have a shift method for doing naive shifts forward and backward, leaving the index unchanged.
ts = Series(np.random.randint(1,10,4),index = pd.date_range('1/2/2013',periods=4,freq='M'))
ts
2013-01-31 3 2013-02-28 9 2013-03-31 1 2013-04-30 1 Freq: M, dtype: int32
ts.shift(periods = 2)
2013-01-31 NaN 2013-02-28 NaN 2013-03-31 3 2013-04-30 9 Freq: M, dtype: float64
# Common use is pct change
ts/ts.shift(1) - 1.0
2013-01-31 NaN 2013-02-28 2.000000 2013-03-31 -0.888889 2013-04-30 0.000000 Freq: M, dtype: float64
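Pandas also has a built-in pct_change method that does this same shift-and-divide for you. A quick sketch with a small hard-coded series (daily index here just to keep the sketch version-agnostic):

```python
import pandas as pd

ts_demo = pd.Series([3, 9, 1, 1],
                    index=pd.date_range('1/2/2013', periods=4, freq='D'))

# Manual percent change vs the built-in method
manual = ts_demo / ts_demo.shift(1) - 1.0
builtin = ts_demo.pct_change()
print(manual.equals(builtin))
```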
# Since naive shifts leave index unchanged, some data is lost. To shift both data and index, pass in the frequency
ts.shift(1)
2013-01-31 NaN 2013-02-28 3 2013-03-31 9 2013-04-30 1 Freq: M, dtype: float64
ts.shift(1, freq='M')
2013-02-28 3 2013-03-31 9 2013-04-30 1 2013-05-31 1 Freq: M, dtype: int32
Periods represent time spans like months or quarters or years.
p = pd.Period('6/1/2014',freq='M')
p
Period('2014-06', 'M')
p+6
Period('2014-12', 'M')
# Ranges of periods are created much like ranges of dates
rng = pd.period_range('1/1/2010','6/1/2014',freq='Q')
rng
<class 'pandas.tseries.period.PeriodIndex'> freq: Q-DEC [2010Q1, ..., 2014Q2] length: 18
list(rng)
[Period('2010Q1', 'Q-DEC'), Period('2010Q2', 'Q-DEC'), Period('2010Q3', 'Q-DEC'), Period('2010Q4', 'Q-DEC'), Period('2011Q1', 'Q-DEC'), Period('2011Q2', 'Q-DEC'), Period('2011Q3', 'Q-DEC'), Period('2011Q4', 'Q-DEC'), Period('2012Q1', 'Q-DEC'), Period('2012Q2', 'Q-DEC'), Period('2012Q3', 'Q-DEC'), Period('2012Q4', 'Q-DEC'), Period('2013Q1', 'Q-DEC'), Period('2013Q2', 'Q-DEC'), Period('2013Q3', 'Q-DEC'), Period('2013Q4', 'Q-DEC'), Period('2014Q1', 'Q-DEC'), Period('2014Q2', 'Q-DEC')]
p.asfreq('M','start')
Period('2014-06', 'M')
p.asfreq('M','end')
Period('2014-06', 'M')
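The 'start'/'end' distinction matters more when converting a coarser period, like a quarter, down to months. A sketch:

```python
import pandas as pd

q = pd.Period('2014Q2', freq='Q-DEC')

# Q2 of a Q-DEC year spans April through June
print(q.asfreq('M', 'start'))  # Period('2014-04', 'M')
print(q.asfreq('M', 'end'))    # Period('2014-06', 'M')
```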
Resampling - convert time series from one frequency to another.
Take a daily time series and convert to monthly.
rng = pd.date_range('9/1/2012',periods=100,freq='D')
ts = Series(np.random.randint(1,25,len(rng)),index=rng)
ts
2012-09-01 13 2012-09-02 18 2012-09-03 7 2012-09-04 21 2012-09-05 21 2012-09-06 17 2012-09-07 20 2012-09-08 24 2012-09-09 20 2012-09-10 4 2012-09-11 23 2012-09-12 18 2012-09-13 24 2012-09-14 24 2012-09-15 2 ... 2012-11-25 3 2012-11-26 6 2012-11-27 2 2012-11-28 2 2012-11-29 10 2012-11-30 18 2012-12-01 19 2012-12-02 24 2012-12-03 15 2012-12-04 3 2012-12-05 13 2012-12-06 14 2012-12-07 13 2012-12-08 2 2012-12-09 21 Freq: D, Length: 100
ts.resample('M',how='mean',kind='period')
2012-09 12.666667 2012-10 12.806452 2012-11 10.633333 2012-12 13.777778 Freq: M, dtype: float64
ts.resample('M',how='sum',kind='period')
2012-09 380 2012-10 397 2012-11 319 2012-12 124 Freq: M, dtype: int32
ts.resample('M',how='sum')
2012-09-30 380 2012-10-31 397 2012-11-30 319 2012-12-31 124 Freq: M, dtype: int32
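A heads up: in later pandas versions the how= and kind= arguments were removed; resample became a deferred, groupby-like operation and you call the aggregation as a method (and in pandas 2.2+ the 'M' alias is spelled 'ME'; the sketch below uses 'M' for compatibility with older versions):

```python
import pandas as pd
import numpy as np

rng = pd.date_range('9/1/2012', periods=100, freq='D')
ts = pd.Series(np.random.randint(1, 25, len(rng)), index=rng)

# Modern equivalents of resample('M', how='mean') and how='sum'
monthly_mean = ts.resample('M').mean()
monthly_sum = ts.resample('M').sum()
print(len(monthly_sum))
```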
Whenever you are downsampling, you need to think about two bin boundary issues: which side of each bin interval is closed, and which bin edge is used as the label. With resampling from days to months, the question of left or right closure isn't really an issue, but consider changing from 15 minute time bins to 30 minute time bins.
rng15 = pd.date_range('9/1/2012',periods=96,freq='15T')
ts15 = Series(np.random.randint(1,25,len(rng15)),index=rng15)
ts15
2012-09-01 00:00:00 16 2012-09-01 00:15:00 7 2012-09-01 00:30:00 4 2012-09-01 00:45:00 7 2012-09-01 01:00:00 18 2012-09-01 01:15:00 1 2012-09-01 01:30:00 4 2012-09-01 01:45:00 7 2012-09-01 02:00:00 7 2012-09-01 02:15:00 22 2012-09-01 02:30:00 24 2012-09-01 02:45:00 6 2012-09-01 03:00:00 22 2012-09-01 03:15:00 5 2012-09-01 03:30:00 1 ... 2012-09-01 20:15:00 11 2012-09-01 20:30:00 2 2012-09-01 20:45:00 7 2012-09-01 21:00:00 20 2012-09-01 21:15:00 1 2012-09-01 21:30:00 13 2012-09-01 21:45:00 9 2012-09-01 22:00:00 5 2012-09-01 22:15:00 21 2012-09-01 22:30:00 17 2012-09-01 22:45:00 20 2012-09-01 23:00:00 21 2012-09-01 23:15:00 18 2012-09-01 23:30:00 19 2012-09-01 23:45:00 21 Freq: 15T, Length: 96
Now, if we resample to 30 minute bins and take the mean across each set of two 15 minute bins, we need to specify whether times on the half hour, such as 22:30 in the original series, should go in the bin (22:15, 22:30] or [22:30, 22:45). The first is referred to as closed on the right and the second as closed on the left. The default is closed on the right. You can also control whether the left or right edge of the bin is used as the label (independent of which side "closes" the interval).
ts15.resample('30min',how='mean',closed='right',label='left')
2012-08-31 23:30:00 16.0 2012-09-01 00:00:00 5.5 2012-09-01 00:30:00 12.5 2012-09-01 01:00:00 2.5 2012-09-01 01:30:00 7.0 2012-09-01 02:00:00 23.0 2012-09-01 02:30:00 14.0 2012-09-01 03:00:00 3.0 2012-09-01 03:30:00 10.5 2012-09-01 04:00:00 11.5 2012-09-01 04:30:00 20.0 2012-09-01 05:00:00 19.5 2012-09-01 05:30:00 15.0 2012-09-01 06:00:00 17.0 2012-09-01 06:30:00 20.0 2012-09-01 07:00:00 9.0 2012-09-01 07:30:00 13.0 2012-09-01 08:00:00 18.5 2012-09-01 08:30:00 12.0 2012-09-01 09:00:00 7.0 2012-09-01 09:30:00 20.5 2012-09-01 10:00:00 15.0 2012-09-01 10:30:00 8.5 2012-09-01 11:00:00 16.0 2012-09-01 11:30:00 11.5 2012-09-01 12:00:00 12.0 2012-09-01 12:30:00 17.0 2012-09-01 13:00:00 12.0 2012-09-01 13:30:00 4.5 2012-09-01 14:00:00 16.0 2012-09-01 14:30:00 11.5 2012-09-01 15:00:00 7.5 2012-09-01 15:30:00 12.0 2012-09-01 16:00:00 7.5 2012-09-01 16:30:00 5.0 2012-09-01 17:00:00 20.0 2012-09-01 17:30:00 11.0 2012-09-01 18:00:00 13.0 2012-09-01 18:30:00 14.0 2012-09-01 19:00:00 15.5 2012-09-01 19:30:00 14.5 2012-09-01 20:00:00 6.5 2012-09-01 20:30:00 13.5 2012-09-01 21:00:00 7.0 2012-09-01 21:30:00 7.0 2012-09-01 22:00:00 19.0 2012-09-01 22:30:00 20.5 2012-09-01 23:00:00 18.5 2012-09-01 23:30:00 21.0 Freq: 30T, dtype: float64
Notice how in the example above, the first record is labelled with '2012-08-31' since it's the left boundary and we picked 'right' for closing intervals and 'left' for labelling. We might want to close and label on the left.
ts30 = ts15.resample('30min',how='mean',closed='left',label='left')
ts30
2012-09-01 00:00:00 11.5 2012-09-01 00:30:00 5.5 2012-09-01 01:00:00 9.5 2012-09-01 01:30:00 5.5 2012-09-01 02:00:00 14.5 2012-09-01 02:30:00 15.0 2012-09-01 03:00:00 13.5 2012-09-01 03:30:00 2.0 2012-09-01 04:00:00 13.0 2012-09-01 04:30:00 17.5 2012-09-01 05:00:00 22.0 2012-09-01 05:30:00 17.5 2012-09-01 06:00:00 13.5 2012-09-01 06:30:00 16.5 2012-09-01 07:00:00 20.0 2012-09-01 07:30:00 4.0 2012-09-01 08:00:00 20.0 2012-09-01 08:30:00 20.0 2012-09-01 09:00:00 3.0 2012-09-01 09:30:00 13.0 2012-09-01 10:00:00 18.0 2012-09-01 10:30:00 14.5 2012-09-01 11:00:00 11.5 2012-09-01 11:30:00 14.0 2012-09-01 12:00:00 12.0 2012-09-01 12:30:00 14.0 2012-09-01 13:00:00 11.0 2012-09-01 13:30:00 12.5 2012-09-01 14:00:00 9.5 2012-09-01 14:30:00 16.5 2012-09-01 15:00:00 7.0 2012-09-01 15:30:00 3.5 2012-09-01 16:00:00 12.5 2012-09-01 16:30:00 8.5 2012-09-01 17:00:00 14.5 2012-09-01 17:30:00 18.5 2012-09-01 18:00:00 13.0 2012-09-01 18:30:00 9.5 2012-09-01 19:00:00 13.5 2012-09-01 19:30:00 12.0 2012-09-01 20:00:00 15.5 2012-09-01 20:30:00 4.5 2012-09-01 21:00:00 10.5 2012-09-01 21:30:00 11.0 2012-09-01 22:00:00 13.0 2012-09-01 22:30:00 18.5 2012-09-01 23:00:00 19.5 2012-09-01 23:30:00 20.0 Freq: 30T, dtype: float64
A common financial aggregation is to compute the first (open), last (close), max (high), and min (low) values for each bucket. If you use 'ohlc' for the how argument, you get a DataFrame with these aggregations computed automatically.
ts60 = ts15.resample('60min',how='ohlc',closed='left',label='left')
ts60
open | high | low | close | |
---|---|---|---|---|
2012-09-01 00:00:00 | 16 | 16 | 4 | 7 |
2012-09-01 01:00:00 | 18 | 18 | 1 | 7 |
2012-09-01 02:00:00 | 7 | 24 | 6 | 6 |
2012-09-01 03:00:00 | 22 | 22 | 1 | 3 |
2012-09-01 04:00:00 | 18 | 20 | 8 | 20 |
2012-09-01 05:00:00 | 20 | 24 | 15 | 20 |
2012-09-01 06:00:00 | 10 | 17 | 10 | 16 |
2012-09-01 07:00:00 | 24 | 24 | 2 | 6 |
2012-09-01 08:00:00 | 20 | 23 | 17 | 23 |
2012-09-01 09:00:00 | 1 | 17 | 1 | 17 |
2012-09-01 10:00:00 | 24 | 24 | 11 | 11 |
2012-09-01 11:00:00 | 6 | 17 | 6 | 13 |
2012-09-01 12:00:00 | 10 | 18 | 10 | 18 |
2012-09-01 13:00:00 | 16 | 18 | 6 | 7 |
2012-09-01 14:00:00 | 2 | 18 | 2 | 18 |
2012-09-01 15:00:00 | 5 | 9 | 1 | 1 |
2012-09-01 16:00:00 | 23 | 23 | 2 | 4 |
2012-09-01 17:00:00 | 6 | 23 | 6 | 20 |
2012-09-01 18:00:00 | 2 | 24 | 2 | 17 |
2012-09-01 19:00:00 | 11 | 16 | 9 | 9 |
2012-09-01 20:00:00 | 20 | 20 | 2 | 7 |
2012-09-01 21:00:00 | 20 | 20 | 1 | 9 |
2012-09-01 22:00:00 | 5 | 21 | 5 | 20 |
2012-09-01 23:00:00 | 21 | 21 | 18 | 21 |
24 rows × 4 columns
Upsampling brings an entirely different set of challenges. If we increase the frequency of the samples, how do we fill in these new time series values? For example, consider the reverse of the last example. Then check out the fill_method options.
ts30
2012-09-01 00:00:00 11.5 2012-09-01 00:30:00 5.5 2012-09-01 01:00:00 9.5 2012-09-01 01:30:00 5.5 2012-09-01 02:00:00 14.5 2012-09-01 02:30:00 15.0 2012-09-01 03:00:00 13.5 2012-09-01 03:30:00 2.0 2012-09-01 04:00:00 13.0 2012-09-01 04:30:00 17.5 2012-09-01 05:00:00 22.0 2012-09-01 05:30:00 17.5 2012-09-01 06:00:00 13.5 2012-09-01 06:30:00 16.5 2012-09-01 07:00:00 20.0 2012-09-01 07:30:00 4.0 2012-09-01 08:00:00 20.0 2012-09-01 08:30:00 20.0 2012-09-01 09:00:00 3.0 2012-09-01 09:30:00 13.0 2012-09-01 10:00:00 18.0 2012-09-01 10:30:00 14.5 2012-09-01 11:00:00 11.5 2012-09-01 11:30:00 14.0 2012-09-01 12:00:00 12.0 2012-09-01 12:30:00 14.0 2012-09-01 13:00:00 11.0 2012-09-01 13:30:00 12.5 2012-09-01 14:00:00 9.5 2012-09-01 14:30:00 16.5 2012-09-01 15:00:00 7.0 2012-09-01 15:30:00 3.5 2012-09-01 16:00:00 12.5 2012-09-01 16:30:00 8.5 2012-09-01 17:00:00 14.5 2012-09-01 17:30:00 18.5 2012-09-01 18:00:00 13.0 2012-09-01 18:30:00 9.5 2012-09-01 19:00:00 13.5 2012-09-01 19:30:00 12.0 2012-09-01 20:00:00 15.5 2012-09-01 20:30:00 4.5 2012-09-01 21:00:00 10.5 2012-09-01 21:30:00 11.0 2012-09-01 22:00:00 13.0 2012-09-01 22:30:00 18.5 2012-09-01 23:00:00 19.5 2012-09-01 23:30:00 20.0 Freq: 30T, dtype: float64
ts30.resample('15min')
2012-09-01 00:00:00 11.5 2012-09-01 00:15:00 NaN 2012-09-01 00:30:00 5.5 2012-09-01 00:45:00 NaN 2012-09-01 01:00:00 9.5 2012-09-01 01:15:00 NaN 2012-09-01 01:30:00 5.5 2012-09-01 01:45:00 NaN 2012-09-01 02:00:00 14.5 2012-09-01 02:15:00 NaN 2012-09-01 02:30:00 15.0 2012-09-01 02:45:00 NaN 2012-09-01 03:00:00 13.5 2012-09-01 03:15:00 NaN 2012-09-01 03:30:00 2.0 ... 2012-09-01 20:00:00 15.5 2012-09-01 20:15:00 NaN 2012-09-01 20:30:00 4.5 2012-09-01 20:45:00 NaN 2012-09-01 21:00:00 10.5 2012-09-01 21:15:00 NaN 2012-09-01 21:30:00 11.0 2012-09-01 21:45:00 NaN 2012-09-01 22:00:00 13.0 2012-09-01 22:15:00 NaN 2012-09-01 22:30:00 18.5 2012-09-01 22:45:00 NaN 2012-09-01 23:00:00 19.5 2012-09-01 23:15:00 NaN 2012-09-01 23:30:00 20.0 Freq: 15T, Length: 95
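With the newer method-chaining API, you pick the fill strategy explicitly, e.g. .resample(...).ffill() instead of the fill_method argument. A small sketch with a hard-coded 30 minute series:

```python
import pandas as pd

rng30 = pd.date_range('9/1/2012', periods=4, freq='30min')
ts30_demo = pd.Series([11.5, 5.5, 9.5, 5.5], index=rng30)

# Forward-fill: each new 15 minute slot repeats the prior 30 minute value
filled = ts30_demo.resample('15min').ffill()
print(filled)
```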
Pandas tries to make it a little easier to create nice looking time series plots using matplotlib than if you just used matplotlib yourself. Let's read in some stock price data from Yahoo! Finance that has already been saved to a csv file.
close_px_all = pd.read_csv('stock_px.csv', parse_dates=True, index_col=0)
close_px_all.head()
AAPL | MSFT | XOM | SPX | |
---|---|---|---|---|
2003-01-02 | 7.40 | 21.11 | 29.22 | 909.03 |
2003-01-03 | 7.45 | 21.14 | 29.24 | 908.59 |
2003-01-06 | 7.45 | 21.52 | 29.96 | 929.01 |
2003-01-07 | 7.43 | 21.93 | 28.95 | 922.93 |
2003-01-08 | 7.28 | 21.31 | 28.83 | 909.93 |
5 rows × 4 columns
close_px = close_px_all[['AAPL','MSFT','XOM']]
close_px[1:10]
AAPL | MSFT | XOM | |
---|---|---|---|
2003-01-03 | 7.45 | 21.14 | 29.24 |
2003-01-06 | 7.45 | 21.52 | 29.96 |
2003-01-07 | 7.43 | 21.93 | 28.95 |
2003-01-08 | 7.28 | 21.31 | 28.83 |
2003-01-09 | 7.34 | 21.93 | 29.44 |
2003-01-10 | 7.36 | 21.97 | 29.03 |
2003-01-13 | 7.32 | 22.16 | 28.91 |
2003-01-14 | 7.30 | 22.39 | 29.17 |
2003-01-15 | 7.22 | 22.11 | 28.77 |
9 rows × 3 columns
close_px['AAPL'].plot()
<matplotlib.axes.AxesSubplot at 0x8097990>
close_px.plot()
<matplotlib.axes.AxesSubplot at 0x72ad3d0>
# And here's what it looks like if we just plot one year. Notice the auto x-axis formatting.
close_px.ix['2009'].plot()
<matplotlib.axes.AxesSubplot at 0x814ad70>
close_px.index
<class 'pandas.tseries.index.DatetimeIndex'> [2003-01-02, ..., 2011-10-14] Length: 2214, Freq: None, Timezone: None
list(close_px.index)[1:25]
[Timestamp('2003-01-03 00:00:00', tz=None), Timestamp('2003-01-06 00:00:00', tz=None), Timestamp('2003-01-07 00:00:00', tz=None), Timestamp('2003-01-08 00:00:00', tz=None), Timestamp('2003-01-09 00:00:00', tz=None), Timestamp('2003-01-10 00:00:00', tz=None), Timestamp('2003-01-13 00:00:00', tz=None), Timestamp('2003-01-14 00:00:00', tz=None), Timestamp('2003-01-15 00:00:00', tz=None), Timestamp('2003-01-16 00:00:00', tz=None), Timestamp('2003-01-17 00:00:00', tz=None), Timestamp('2003-01-21 00:00:00', tz=None), Timestamp('2003-01-22 00:00:00', tz=None), Timestamp('2003-01-23 00:00:00', tz=None), Timestamp('2003-01-24 00:00:00', tz=None), Timestamp('2003-01-27 00:00:00', tz=None), Timestamp('2003-01-28 00:00:00', tz=None), Timestamp('2003-01-29 00:00:00', tz=None), Timestamp('2003-01-30 00:00:00', tz=None), Timestamp('2003-01-31 00:00:00', tz=None), Timestamp('2003-02-03 00:00:00', tz=None), Timestamp('2003-02-04 00:00:00', tz=None), Timestamp('2003-02-05 00:00:00', tz=None), Timestamp('2003-02-06 00:00:00', tz=None)]
close_px['AAPL'].ix['01-2011':'03-2011'].plot()
<matplotlib.axes.AxesSubplot at 0x81a9c90>
Pandas does a particularly nice job with quarterly data.
appl_q = close_px['AAPL'].resample('Q-DEC')
appl_q
2003-03-31 7.289180 2003-06-30 8.299365 2003-09-30 10.499844 2003-12-31 10.893906 2004-03-31 12.031613 2004-06-30 14.335645 2004-09-30 16.609219 2004-12-31 27.792813 2005-03-31 39.535574 2005-06-30 37.876719 2005-09-30 45.719219 2005-12-31 63.500476 2006-03-31 70.159194 2006-06-30 64.030000 2006-09-30 66.210635 2006-12-31 82.832381 2007-03-31 88.520164 2007-06-30 108.308254 2007-09-30 135.619683 2007-12-31 178.586875 2008-03-31 139.138689 2008-06-30 173.566719 2008-09-30 160.141250 2008-12-31 94.936250 2009-03-31 93.574098 2009-06-30 129.413333 2009-09-30 164.225156 2009-12-31 197.276406 2010-03-31 210.855082 2010-06-30 254.738889 2010-09-30 259.905156 2010-12-31 311.494688 2011-03-31 345.683226 2011-06-30 337.612381 2011-09-30 380.510312 2011-12-31 389.424000 Freq: Q-DEC, Name: AAPL, dtype: float64
appl_q.ix['2009':].plot()
<matplotlib.axes.AxesSubplot at 0x82417d0>
Pandas makes it easy to do all kinds of moving window operations like moving average, cumulative sums, and even things like exponential moving averages ("exponential smoothing").
close_px.AAPL.plot()
pd.rolling_mean(close_px.AAPL,250).plot()
<matplotlib.axes.AxesSubplot at 0x82ac050>
pd.rolling_mean(close_px.AAPL,250)
2003-01-02 NaN 2003-01-03 NaN 2003-01-06 NaN 2003-01-07 NaN 2003-01-08 NaN 2003-01-09 NaN 2003-01-10 NaN 2003-01-13 NaN 2003-01-14 NaN 2003-01-15 NaN 2003-01-16 NaN 2003-01-17 NaN 2003-01-21 NaN 2003-01-22 NaN 2003-01-23 NaN ... 2011-09-26 342.82340 2011-09-27 343.28544 2011-09-28 343.74340 2011-09-29 344.19112 2011-09-30 344.56064 2011-10-03 344.90228 2011-10-04 345.23540 2011-10-05 345.57212 2011-10-06 345.90016 2011-10-07 346.18520 2011-10-10 346.53988 2011-10-11 346.93180 2011-10-12 347.28160 2011-10-13 347.64332 2011-10-14 348.09336 Length: 2214
This example shows one problem with a moving average - by default you can't compute it until you have data for the number of periods corresponding to the window size (250 in this case). Pandas lets you specify a smaller minimum number of data points to get around this problem.
pd.rolling_mean(close_px.AAPL,250,min_periods=10)
2003-01-02 NaN 2003-01-03 NaN 2003-01-06 NaN 2003-01-07 NaN 2003-01-08 NaN 2003-01-09 NaN 2003-01-10 NaN 2003-01-13 NaN 2003-01-14 NaN 2003-01-15 7.355000 2003-01-16 7.350909 2003-01-17 7.325833 2003-01-21 7.301538 2003-01-22 7.275714 2003-01-23 7.263333 ... 2011-09-26 342.82340 2011-09-27 343.28544 2011-09-28 343.74340 2011-09-29 344.19112 2011-09-30 344.56064 2011-10-03 344.90228 2011-10-04 345.23540 2011-10-05 345.57212 2011-10-06 345.90016 2011-10-07 346.18520 2011-10-10 346.53988 2011-10-11 346.93180 2011-10-12 347.28160 2011-10-13 347.64332 2011-10-14 348.09336 Length: 2214
close_px.AAPL.plot()
pd.rolling_mean(close_px.AAPL,250,min_periods=10).plot()
<matplotlib.axes.AxesSubplot at 0x84f6690>
# Things like rolling standard deviation are also possible
appl_std250 = pd.rolling_std(close_px.AAPL,250,min_periods=10)
appl_std250.plot()
<matplotlib.axes.AxesSubplot at 0x851deb0>
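A heads up: the pd.rolling_* functions used above were later removed in favor of a .rolling() method on Series/DataFrame. A sketch of the modern equivalents:

```python
import pandas as pd
import numpy as np

s = pd.Series(np.arange(20, dtype=float))

# Modern spelling of pd.rolling_mean(s, 5, min_periods=2) and rolling_std
roll_mean = s.rolling(window=5, min_periods=2).mean()
roll_std = s.rolling(window=5, min_periods=2).std()
print(roll_mean.iloc[4])
```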
Another intro tutorial is http://earthpy.org/pandas-basics.html
Wes McKinney's 2012 PyCon talk on pandas and timeseries - http://pyvideo.org/video/1198/time-series-data-analysis-with-pandas