For many of us in the business analytics world, working with dates and times is bread and butter stuff. We can work all kinds of magic with Excel's "serial" datetimes and worksheet functions like DATEVALUE(), WEEKDAY(), MONTH(), DATE() and many more. We can wield VBA functions like DateDiff with the best of them. We all know that, according to MS, time started on 1/1/1900 (or 1904 if you're on a Mac) and we just live with the fact that we can't do reliable datetime work before those magic epochs. I even wrote and released a free and open source MS Access add-in called Hillmaker many years ago that did statistical analysis of time interval data by day of week and time of day. That tool still finds lots of use in the healthcare world for analyzing occupancy of things like nursing units, emergency departments, recovery room, and many other places with entry and exit timestamp data.
So, when I started working with R and Python, one of the first things I did was see how they handled dates and times. I learned about the UTC epoch and the POSIX time standard. In R, I quickly learned to love the lubridate package. And then I started to learn about dates and times in Python and began to understand what programmers mean when they say "there be dragons in there". So, while I'm far from an expert, I did put together this little intro to Python dates and times aimed at business analytics types who are new to Python. It is by no means comprehensive, but it does cover the basics of the base Python datetime and timedelta classes along with the Numpy datetime64 and timedelta64 classes. Oh yeah, it also covers the pandas Timestamp and timedelta classes (confused yet?). I also show one particular dragon that bit me - conversion between Numpy datetimes and Pandas datetimes - even though Pandas uses Numpy under the hood. Anyhoo, here we go.
You can find the .ipynb file in my hselab-tutorials github repo. Clone or download a zip.
datetime and timedelta classes

import pandas as pd
import numpy as np
from pandas import Series, DataFrame
from datetime import datetime
from datetime import timedelta
import matplotlib.pyplot as plt
%matplotlib inline
Python has a module called datetime as part of its standard library.
https://docs.python.org/2/library/datetime.html
The datetime module supplies classes for manipulating dates and times in both simple and complex ways. While date and time arithmetic is supported, the focus of the implementation is on efficient attribute extraction for output formatting and manipulation.
Datetimes can be naive or aware
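Here's a quick sketch of the difference. It uses Python 3's datetime.timezone to build an aware object; the fixed UTC-4 offset below is just an illustration, not a real timezone database entry:

```python
from datetime import datetime, timezone, timedelta

# A naive datetime has no timezone information attached
naive_dt = datetime(2012, 5, 1, 8, 30)
print(naive_dt.tzinfo)  # None

# An aware datetime carries a tzinfo object (a fixed UTC-4 offset here,
# roughly EDT, purely for illustration)
edt = timezone(timedelta(hours=-4))
aware_dt = datetime(2012, 5, 1, 8, 30, tzinfo=edt)
print(aware_dt.utcoffset())
```

Naive datetimes are what you get from the plain datetime() constructor and are what we use throughout this tutorial.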
dates = [datetime(2012, 5, 1, 8, 30), datetime(2012, 5, 1, 9, 25), datetime(2012, 5, 1, 15, 30)]
dates
[datetime.datetime(2012, 5, 1, 8, 30), datetime.datetime(2012, 5, 1, 9, 25), datetime.datetime(2012, 5, 1, 15, 30)]
type(dates)
list
type(dates[0])
datetime.datetime
intime_dt = dates[0]
outtime_dt = dates[1]
type(intime_dt)
datetime.datetime
# The commented out lines represent attributes that ARE available in pandas Timestamp objects
# but not in Python datetime objects
print ('Datetime: {}'.format(intime_dt))
print ('Date: {}'.format(intime_dt.date()))
print ('Month: {}'.format(intime_dt.month))
print ('Day: {}'.format(intime_dt.day))
#print ('DayOfWeek: {}'.format(intime_dt.dayofweek))
print ('Weekday: {}'.format(intime_dt.weekday()))
#print ('DayOfYear: {}'.format(intime_dt.dayofyear))
#print ('WeekOfYear: {}'.format(intime_dt.weekofyear))
#print ('Quarter: {}'.format(intime_dt.quarter))
print ('Hour: {}'.format(intime_dt.hour))
print ('Minute: {}'.format(intime_dt.minute))
print ('Second: {}'.format(intime_dt.second))
print ('Microsecond: {}'.format(intime_dt.microsecond))
Datetime: 2012-05-01 08:30:00 Date: 2012-05-01 Month: 5 Day: 1 Weekday: 1 Hour: 8 Minute: 30 Second: 0 Microsecond: 0
los_td = outtime_dt - intime_dt
print (los_td)
type(los_td)
0:55:00
datetime.timedelta
atts = [att for att in dir(los_td) if '__' not in att]
print(atts)
['days', 'max', 'microseconds', 'min', 'resolution', 'seconds', 'total_seconds']
print ('Timedelta: {}'.format(los_td))
print ('Seconds: {}'.format(los_td.seconds))
print ('Total Seconds: {}'.format(los_td.total_seconds()))
print ('Microseconds: {}'.format(los_td.microseconds))
print ('Resolution: {}'.format(los_td.resolution))
print ('Min: {}'.format(los_td.min))
print ('Max: {}'.format(los_td.max))
Timedelta: 0:55:00 Seconds: 3300 Total Seconds: 3300.0 Microseconds: 0 Resolution: 0:00:00.000001 Min: -999999999 days, 0:00:00 Max: 999999999 days, 23:59:59.999999
So, you can use the total_seconds() method to get a duration in seconds from a timedelta object. If duration is < 1 second, it returns fractional seconds. To convert to other time units, just do the simple math.
timedelta(0,0,10)
datetime.timedelta(0, 0, 10)
timedelta(0,0,17).total_seconds()
1.7e-05
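To make the "simple math" explicit, here's a small sketch converting a 55 minute timedelta to minutes and hours:

```python
from datetime import timedelta

los_td = timedelta(minutes=55)

# total_seconds() is the only duration accessor, so convert by dividing
mins = los_td.total_seconds() / 60.0
hours = los_td.total_seconds() / 3600.0
print(mins, hours)
```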
atts = [att for att in dir(intime_dt) if '__' not in att]
print (atts)
['astimezone', 'combine', 'ctime', 'date', 'day', 'dst', 'fromordinal', 'fromtimestamp', 'hour', 'isocalendar', 'isoformat', 'isoweekday', 'max', 'microsecond', 'min', 'minute', 'month', 'now', 'replace', 'resolution', 'second', 'strftime', 'strptime', 'time', 'timestamp', 'timetuple', 'timetz', 'today', 'toordinal', 'tzinfo', 'tzname', 'utcfromtimestamp', 'utcnow', 'utcoffset', 'utctimetuple', 'weekday', 'year']
datetime64 and timedelta64 classes

The Numpy library (v 1.7.0) introduced a new data type called datetime64. Numpy uses this name because there is already a datetime data type in the Python datetime library. Like R, dates and times in Python are stored based on the notion of POSIX time - the number of seconds elapsed since Jan 1, 1970 00:00:00 GMT - called the UTC epoch. This doesn't mean that dates and times before that cannot be represented - we just use negative numbers. Representing and working with dates and times for computing is a pretty interesting topic, fraught with complexities like time zones, daylight saving time, and leap seconds. A few good places to start are:
Numpy docs - http://docs.scipy.org/doc/numpy-dev/reference/arrays.datetime.html
Pandas docs - http://pandas.pydata.org/pandas-docs/dev/timeseries.html
See http://cr.yp.to/proto/utctai.html for gory details on UTC and TAI and leap seconds and the problems with the POSIX standard.
http://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64
Here's the actual numpy proposal for datetime64 and timedelta64, authored by Travis Oliphant.
https://github.com/numpy/numpy/blob/master/doc/neps/datetime-proposal.rst
Let's create a Numpy datetime64 object.
np.datetime64('2012-05-01 08:30:00')
numpy.datetime64('2012-05-01T08:30:00-0400')
dates64 = [np.datetime64('2012-05-01 08:30:00'), np.datetime64('2012-05-01 09:25:00'), np.datetime64('2012-05-01 15:30:00')]
dates64
[numpy.datetime64('2012-05-01T08:30:00-0400'), numpy.datetime64('2012-05-01T09:25:00-0400'), numpy.datetime64('2012-05-01T15:30:00-0400')]
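One thing worth knowing: a datetime64 carries an explicit time unit, inferred from how much precision the string you pass in supplies. A quick sketch:

```python
import numpy as np

# The unit is inferred from the precision of the ISO 8601 string
d_day = np.datetime64('2012-05-01')           # day resolution
d_sec = np.datetime64('2012-05-01 08:30:00')  # second resolution
d_month = np.datetime64('2012-05')            # month resolution

print(d_day.dtype, d_sec.dtype, d_month.dtype)
```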
As you can see below, UTC is four hours ahead of EDT. In other words, the EDT timezone offset is UTC-4.
dt_utcnow = datetime.utcnow()
dt_utcnow
datetime.datetime(2014, 6, 25, 23, 36, 51, 924417)
# Can create datetime64's from base Python datetimes
dt64_utcnow = np.datetime64(dt_utcnow)
dt64_utcnow
numpy.datetime64('2014-06-25T19:36:51.924417-0400')
curdates64 = [np.datetime64('2014-06-01 08:30:00'), np.datetime64('2014-06-01 09:25:00'), np.datetime64('2014-06-01 15:30:00')]
curdates64
[numpy.datetime64('2014-06-01T08:30:00-0400'), numpy.datetime64('2014-06-01T09:25:00-0400'), numpy.datetime64('2014-06-01T15:30:00-0400')]
intime_dt64 = dates64[0]
outtime_dt64 = dates64[1]
print(type(intime_dt64))
print(intime_dt64)
intime_dt64
<class 'numpy.datetime64'> 2012-05-01T08:30:00-0400
numpy.datetime64('2012-05-01T08:30:00-0400')
# The commented out lines represent attributes that ARE available in pandas Timestamp objects
# but not in numpy datetime64 objects
print ('Datetime: {}'.format(str(intime_dt64)))
#print ('Date: {}'.format(intime_dt64.date()))
#print ('Month: {}'.format(intime_dt64.month))
#print ('Day: {}'.format(intime_dt.day))
#print ('DayOfWeek: {}'.format(intime_dt.dayofweek))
#print ('Weekday: {}'.format(intime_dt.weekday()))
#print ('DayOfYear: {}'.format(intime_dt.dayofyear))
#print ('WeekOfYear: {}'.format(intime_dt.weekofyear))
#print ('Quarter: {}'.format(intime_dt.quarter))
#print ('Hour: {}'.format(intime_dt.hour))
#print ('Minute: {}'.format(intime_dt.minute))
#print ('Second: {}'.format(intime_dt.second))
#print ('Microsecond: {}'.format(intime_dt.microsecond))
Datetime: 2012-05-01T08:30:00-0400
You can subtract two datetime64 objects and get a timedelta64 object in return.
los_dt64 = outtime_dt64 - intime_dt64
los_dt64
numpy.timedelta64(3300,'s')
# Coerce it to other units
np.timedelta64(los_dt64,'m')
numpy.timedelta64(55,'m')
Unfortunately, timedelta64 objects have almost no useful attributes in terms of date time math. On the other hand, the datetime.timedelta object (and Pandas) has a number of useful attributes.
Pandas prides itself on its handling of time series data (speed, flexibility, power) and is built on top of the numpy datetime64 and timedelta64 data types (instead of Python's base datetime and timedelta types). Pandas also wanted to be very time-zone aware and time-zone capable. So, Pandas has its own data type for timestamps called, well, Timestamp. It's based on numpy's datetime64 type but works like Python's datetime (I think). The following StackOverflow post (the "welcome to hell" post) is quite popular and relevant.
http://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64
Pandas time series indexes can be timestamp or time period based.
ts = Series(np.random.randint(1,100,3), index=dates)
ts
2012-05-01 08:30:00 86 2012-05-01 09:25:00 82 2012-05-01 15:30:00 34 dtype: int32
ts.index
<class 'pandas.tseries.index.DatetimeIndex'> [2012-05-01 08:30:00, ..., 2012-05-01 15:30:00] Length: 3, Freq: None, Timezone: None
If we have demand data by month, a period based index might make more sense.
from pandas import Period, PeriodIndex
periods = PeriodIndex([Period('2012-01'), Period('2012-02'),Period('2012-03')])
ts2 = Series(np.random.randint(1,100,3), index=periods)
ts2
2012-01 3 2012-02 59 2012-03 34 Freq: M, dtype: int32
ts2.index
<class 'pandas.tseries.period.PeriodIndex'> freq: M [2012-01, ..., 2012-03] length: 3
From the pandas docs:
Starting with 0.8, pandas allows you to capture both representations and convert between them. Under the hood, pandas represents timestamps using instances of Timestamp and sequences of timestamps using instances of DatetimeIndex. For regular time spans, pandas uses Period objects for scalar values and PeriodIndex for sequences of spans.
from pandas import Timestamp
Timestamp('20120501 08:30:00')
Timestamp('2012-05-01 08:30:00', tz=None)
timestamps = [Timestamp('20120501 08:30:00'), Timestamp('20120501 09:25:00'), Timestamp('20120501 15:30:00')]
intime_ts = timestamps[0]
outtime_ts = timestamps[1]
type(intime_ts)
pandas.tslib.Timestamp
los_ptd = outtime_ts - intime_ts
print (los_ptd)
type(los_ptd)
0:55:00
datetime.timedelta
# The pandas Timestamp data type has a number of useful attributes (dayofweek, dayofyear, weekofyear)
# that the base Python datetime type does not.
print ('Datetime: {}'.format(intime_ts))
print ('Date: {}'.format(intime_ts.date()))
print ('Month: {}'.format(intime_ts.month))
print ('Day: {}'.format(intime_ts.day))
print ('DayOfWeek: {}'.format(intime_ts.dayofweek))
print ('Weekday: {}'.format(intime_ts.weekday()))
print ('DayOfYear: {}'.format(intime_ts.dayofyear))
print ('WeekOfYear: {}'.format(intime_ts.weekofyear))
print ('Quarter: {}'.format(intime_ts.quarter))
print ('Hour: {}'.format(intime_ts.hour))
print ('Minute: {}'.format(intime_ts.minute))
print ('Second: {}'.format(intime_ts.second))
print ('Microsecond: {}'.format(intime_ts.microsecond))
Datetime: 2012-05-01 08:30:00 Date: 2012-05-01 Month: 5 Day: 1 DayOfWeek: 1 Weekday: 1 DayOfYear: 122 WeekOfYear: 18 Quarter: 2 Hour: 8 Minute: 30 Second: 0 Microsecond: 0
print ('Timedelta: {}'.format(los_ptd))
print ('Days: {}'.format(los_ptd.days))
print ('Seconds: {}'.format(los_ptd.seconds))
print ('Total Seconds: {}'.format(los_ptd.total_seconds()))
print ('Microseconds: {}'.format(los_ptd.microseconds))
print ('Resolution: {}'.format(los_ptd.resolution))
print ('Min: {}'.format(los_ptd.min))
print ('Max: {}'.format(los_ptd.max))
Timedelta: 0:55:00 Days: 0 Seconds: 3300 Total Seconds: 3300.0 Microseconds: 0 Resolution: 0:00:00.000001 Min: -999999999 days, 0:00:00 Max: 999999999 days, 23:59:59.999999
A big (potential) problem is trying to convert from numpy.datetime64 to pd.Timestamp. Seemingly strange things happened when I tried this using the tolist method of numpy.datetime64 objects. After much reading of the "Welcome to Hell" SO post, some experimenting, and a more careful reading of the numpy docs ("ISO 8601 specifies to use the local time zone if none is explicitly given:"), I think we have an answer.
So, by default, numpy assumes a naively specified datetime is in local time.
dt64 = np.datetime64('2014-06-01 08:30:00')
dt64
numpy.datetime64('2014-06-01T08:30:00-0400')
If we use its .tolist() method or pass it in to the pandas Timestamp() constructor, we get UTC!
dt64.tolist()
datetime.datetime(2014, 6, 1, 12, 30)
ts = pd.Timestamp(dt64)
ts
Timestamp('2014-06-01 12:30:00', tz=None)
If you were doing a bunch of bulk data transformation work, you might easily miss this subtlety and end up with your date time stats all staggered off by a few hours. Some more fun conversions are shown below.
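A heads up that this behavior is version-dependent: NumPy later removed timezone handling from datetime64 (around NumPy 1.11), so on newer versions a naive string parses with no offset and no shift occurs on conversion. Here's a quick roundtrip check you can run to see what your own NumPy does:

```python
import numpy as np
from datetime import datetime

dt64 = np.datetime64('2014-06-01 08:30:00')
roundtrip = dt64.tolist()  # gives a datetime.datetime

# On newer NumPy (timezone-naive datetime64) these match;
# on the older NumPy used in this post, tolist() shifted to UTC.
print(roundtrip == datetime(2014, 6, 1, 8, 30))
```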
dt = ts.to_datetime()
dt
datetime.datetime(2014, 6, 1, 12, 30)
... and if we try a "roundtrip" from numpy to Pandas to datetime and back, we actually end up where we started.
np.datetime64(dt)
numpy.datetime64('2014-06-01T08:30:00.000000-0400')
Appears this is quite a topic of discussion in the numpy community based on this discussion thread.
Generally, a time series in pandas is a Series object with a DatetimeIndex. Pandas has convenience functions for creating time series indexes. The default frequency for date_range is a calendar day, while the default for bdate_range is a business day.
index = pd.date_range('2000-1-1', periods=1000, freq='M')
index
<class 'pandas.tseries.index.DatetimeIndex'> [2000-01-31, ..., 2083-04-30] Length: 1000, Freq: M, Timezone: None
start = datetime(2014, 1, 1)
end = datetime(2014, 6, 12)
rng = pd.date_range(start, end)
rng
<class 'pandas.tseries.index.DatetimeIndex'> [2014-01-01, ..., 2014-06-12] Length: 163, Freq: D, Timezone: None
pd.date_range(start, end, freq='W')
<class 'pandas.tseries.index.DatetimeIndex'> [2014-01-05, ..., 2014-06-08] Length: 23, Freq: W-SUN, Timezone: None
data = np.random.randint(1, 100, size=163)
df = DataFrame(data, index=rng)
df.head()
0 | |
---|---|
2014-01-01 | 18 |
2014-01-02 | 9 |
2014-01-03 | 69 |
2014-01-04 | 44 |
2014-01-05 | 75 |
5 rows × 1 columns
# Date index slicing is easy but, BE CAREFUL, the end point is included (unlike usual Python slicing behavior).
df['2014-01-02':'2014-01-12']
0 | |
---|---|
2014-01-02 | 9 |
2014-01-03 | 69 |
2014-01-04 | 44 |
2014-01-05 | 75 |
2014-01-06 | 96 |
2014-01-07 | 87 |
2014-01-08 | 2 |
2014-01-09 | 69 |
2014-01-10 | 76 |
2014-01-11 | 90 |
2014-01-12 | 31 |
11 rows × 1 columns
df[datetime(2014,1,2):datetime(2014,1,12)]
0 | |
---|---|
2014-01-02 | 9 |
2014-01-03 | 69 |
2014-01-04 | 44 |
2014-01-05 | 75 |
2014-01-06 | 96 |
2014-01-07 | 87 |
2014-01-08 | 2 |
2014-01-09 | 69 |
2014-01-10 | 76 |
2014-01-11 | 90 |
2014-01-12 | 31 |
11 rows × 1 columns
DateOffset objects

Usually base frequencies are specified with a string alias such as 'W' or 'M'. However, the underlying DateOffset objects, such as Hour and Minute, can also be created and combined directly.
rng2 = pd.date_range(start, end, freq='M')
len(rng2)
5
data[:len(rng2)]
array([18, 9, 69, 44, 75])
df2 = DataFrame(data[:len(rng2)], index=rng2)
df2
0 | |
---|---|
2014-01-31 | 18 |
2014-02-28 | 9 |
2014-03-31 | 69 |
2014-04-30 | 44 |
2014-05-31 | 75 |
5 rows × 1 columns
from pandas.tseries.offsets import Hour, Minute
#We rarely use these but they are available
one_hour = Hour()
four_hours = Hour(4)
one_hour + four_hours
<5 * Hours>
one_hour + Minute(35)
<95 * Minutes>
rng = pd.date_range('9/1/2012','1/1/2014',freq='M')
rng
<class 'pandas.tseries.index.DatetimeIndex'> [2012-09-30, ..., 2013-12-31] Length: 16, Freq: M, Timezone: None
list(rng)
[Timestamp('2012-09-30 00:00:00', tz=None), Timestamp('2012-10-31 00:00:00', tz=None), Timestamp('2012-11-30 00:00:00', tz=None), Timestamp('2012-12-31 00:00:00', tz=None), Timestamp('2013-01-31 00:00:00', tz=None), Timestamp('2013-02-28 00:00:00', tz=None), Timestamp('2013-03-31 00:00:00', tz=None), Timestamp('2013-04-30 00:00:00', tz=None), Timestamp('2013-05-31 00:00:00', tz=None), Timestamp('2013-06-30 00:00:00', tz=None), Timestamp('2013-07-31 00:00:00', tz=None), Timestamp('2013-08-31 00:00:00', tz=None), Timestamp('2013-09-30 00:00:00', tz=None), Timestamp('2013-10-31 00:00:00', tz=None), Timestamp('2013-11-30 00:00:00', tz=None), Timestamp('2013-12-31 00:00:00', tz=None)]
rng = pd.date_range('9/1/2012','1/1/2014',freq='WOM-2TUE')
list(rng)
[Timestamp('2012-09-11 00:00:00', tz=None), Timestamp('2012-10-09 00:00:00', tz=None), Timestamp('2012-11-13 00:00:00', tz=None), Timestamp('2012-12-11 00:00:00', tz=None), Timestamp('2013-01-08 00:00:00', tz=None), Timestamp('2013-02-12 00:00:00', tz=None), Timestamp('2013-03-12 00:00:00', tz=None), Timestamp('2013-04-09 00:00:00', tz=None), Timestamp('2013-05-14 00:00:00', tz=None), Timestamp('2013-06-11 00:00:00', tz=None), Timestamp('2013-07-09 00:00:00', tz=None), Timestamp('2013-08-13 00:00:00', tz=None), Timestamp('2013-09-10 00:00:00', tz=None), Timestamp('2013-10-08 00:00:00', tz=None), Timestamp('2013-11-12 00:00:00', tz=None), Timestamp('2013-12-10 00:00:00', tz=None)]
Move data back and forth through time. Series and DataFrame objects have a shift method for doing naive shifts forward and backward, leaving the index unchanged.
ts = Series(np.random.randint(1,10,4),index = pd.date_range('1/2/2013',periods=4,freq='M'))
ts
2013-01-31 3 2013-02-28 9 2013-03-31 1 2013-04-30 1 Freq: M, dtype: int32
ts.shift(periods = 2)
2013-01-31 NaN 2013-02-28 NaN 2013-03-31 3 2013-04-30 9 Freq: M, dtype: float64
# Common use is pct change
ts/ts.shift(1) - 1.0
2013-01-31 NaN 2013-02-28 2.000000 2013-03-31 -0.888889 2013-04-30 0.000000 Freq: M, dtype: float64
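Pandas also has a built-in pct_change method that does this same shift-and-divide for you. A quick sketch with a small hard-coded series (daily index here just to keep the sketch version-agnostic):

```python
import pandas as pd

ts_demo = pd.Series([3, 9, 1, 1],
                    index=pd.date_range('1/2/2013', periods=4, freq='D'))

# Manual percent change vs the built-in method
manual = ts_demo / ts_demo.shift(1) - 1.0
builtin = ts_demo.pct_change()
print(manual.equals(builtin))
```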
# Since naive shifts leave index unchanged, some data is lost. To shift both data and index, pass in the frequency
ts.shift(1)
2013-01-31 NaN 2013-02-28 3 2013-03-31 9 2013-04-30 1 Freq: M, dtype: float64
ts.shift(1, freq='M')
2013-02-28 3 2013-03-31 9 2013-04-30 1 2013-05-31 1 Freq: M, dtype: int32
Periods represent time spans like months or quarters or years.
p = pd.Period('6/1/2014',freq='M')
p
Period('2014-06', 'M')
p+6
Period('2014-12', 'M')
# Ranges of periods are created much like ranges of dates
rng = pd.period_range('1/1/2010','6/1/2014',freq='Q')
rng
<class 'pandas.tseries.period.PeriodIndex'> freq: Q-DEC [2010Q1, ..., 2014Q2] length: 18
list(rng)
[Period('2010Q1', 'Q-DEC'), Period('2010Q2', 'Q-DEC'), Period('2010Q3', 'Q-DEC'), Period('2010Q4', 'Q-DEC'), Period('2011Q1', 'Q-DEC'), Period('2011Q2', 'Q-DEC'), Period('2011Q3', 'Q-DEC'), Period('2011Q4', 'Q-DEC'), Period('2012Q1', 'Q-DEC'), Period('2012Q2', 'Q-DEC'), Period('2012Q3', 'Q-DEC'), Period('2012Q4', 'Q-DEC'), Period('2013Q1', 'Q-DEC'), Period('2013Q2', 'Q-DEC'), Period('2013Q3', 'Q-DEC'), Period('2013Q4', 'Q-DEC'), Period('2014Q1', 'Q-DEC'), Period('2014Q2', 'Q-DEC')]
p.asfreq('M','start')
Period('2014-06', 'M')
p.asfreq('M','end')
Period('2014-06', 'M')
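The 'start'/'end' distinction matters more when converting a coarser period, like a quarter, down to months. A sketch:

```python
import pandas as pd

q = pd.Period('2014Q2', freq='Q-DEC')

# Q2 of a Q-DEC year spans April through June
print(q.asfreq('M', 'start'))  # Period('2014-04', 'M')
print(q.asfreq('M', 'end'))    # Period('2014-06', 'M')
```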
Resampling - convert time series from one frequency to another.
Take a daily time series and convert to monthly.
rng = pd.date_range('9/1/2012',periods=100,freq='D')
ts = Series(np.random.randint(1,25,len(rng)),index=rng)
ts
2012-09-01 13 2012-09-02 18 2012-09-03 7 2012-09-04 21 2012-09-05 21 2012-09-06 17 2012-09-07 20 2012-09-08 24 2012-09-09 20 2012-09-10 4 2012-09-11 23 2012-09-12 18 2012-09-13 24 2012-09-14 24 2012-09-15 2 ... 2012-11-25 3 2012-11-26 6 2012-11-27 2 2012-11-28 2 2012-11-29 10 2012-11-30 18 2012-12-01 19 2012-12-02 24 2012-12-03 15 2012-12-04 3 2012-12-05 13 2012-12-06 14 2012-12-07 13 2012-12-08 2 2012-12-09 21 Freq: D, Length: 100
ts.resample('M',how='mean',kind='period')
2012-09 12.666667 2012-10 12.806452 2012-11 10.633333 2012-12 13.777778 Freq: M, dtype: float64
ts.resample('M',how='sum',kind='period')
2012-09 380 2012-10 397 2012-11 319 2012-12 124 Freq: M, dtype: int32
ts.resample('M',how='sum')
2012-09-30 380 2012-10-31 397 2012-11-30 319 2012-12-31 124 Freq: M, dtype: int32
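A heads up: in later pandas versions the how= and kind= arguments were removed; resample became a deferred, groupby-like operation and you call the aggregation as a method (and in pandas 2.2+ the 'M' alias is spelled 'ME'; the sketch below uses 'M' for compatibility with older versions):

```python
import pandas as pd
import numpy as np

rng = pd.date_range('9/1/2012', periods=100, freq='D')
ts = pd.Series(np.random.randint(1, 25, len(rng)), index=rng)

# Modern equivalents of resample('M', how='mean') and how='sum'
monthly_mean = ts.resample('M').mean()
monthly_sum = ts.resample('M').sum()
print(len(monthly_sum))
```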
Whenever you are downsampling, you need to think about two bin boundary issues: which side of each bin interval is closed, and which bin edge is used as the label. With resampling from days to months, the question of left or right closure isn't really an issue, but consider changing from 15 minute time bins to 30 minute time bins.
rng15 = pd.date_range('9/1/2012',periods=96,freq='15T')
ts15 = Series(np.random.randint(1,25,len(rng15)),index=rng15)
ts15
2012-09-01 00:00:00 16 2012-09-01 00:15:00 7 2012-09-01 00:30:00 4 2012-09-01 00:45:00 7 2012-09-01 01:00:00 18 2012-09-01 01:15:00 1 2012-09-01 01:30:00 4 2012-09-01 01:45:00 7 2012-09-01 02:00:00 7 2012-09-01 02:15:00 22 2012-09-01 02:30:00 24 2012-09-01 02:45:00 6 2012-09-01 03:00:00 22 2012-09-01 03:15:00 5 2012-09-01 03:30:00 1 ... 2012-09-01 20:15:00 11 2012-09-01 20:30:00 2 2012-09-01 20:45:00 7 2012-09-01 21:00:00 20 2012-09-01 21:15:00 1 2012-09-01 21:30:00 13 2012-09-01 21:45:00 9 2012-09-01 22:00:00 5 2012-09-01 22:15:00 21 2012-09-01 22:30:00 17 2012-09-01 22:45:00 20 2012-09-01 23:00:00 21 2012-09-01 23:15:00 18 2012-09-01 23:30:00 19 2012-09-01 23:45:00 21 Freq: 15T, Length: 96
Now, if we resample to 30 minute bins and take the mean across each set of two 15 minute bins, we need to specify whether times on the half hour, such as 22:30 in the original series, should go in the bin (22:15, 22:30] or [22:30, 22:45). The first is referred to as closed on the right and the second as closed on the left. The default is closed on the right. You can also control whether the left or right edge of the bin is used as the label (independent of which side "closes" the interval).
ts15.resample('30min',how='mean',closed='right',label='left')
2012-08-31 23:30:00 16.0 2012-09-01 00:00:00 5.5 2012-09-01 00:30:00 12.5 2012-09-01 01:00:00 2.5 2012-09-01 01:30:00 7.0 2012-09-01 02:00:00 23.0 2012-09-01 02:30:00 14.0 2012-09-01 03:00:00 3.0 2012-09-01 03:30:00 10.5 2012-09-01 04:00:00 11.5 2012-09-01 04:30:00 20.0 2012-09-01 05:00:00 19.5 2012-09-01 05:30:00 15.0 2012-09-01 06:00:00 17.0 2012-09-01 06:30:00 20.0 2012-09-01 07:00:00 9.0 2012-09-01 07:30:00 13.0 2012-09-01 08:00:00 18.5 2012-09-01 08:30:00 12.0 2012-09-01 09:00:00 7.0 2012-09-01 09:30:00 20.5 2012-09-01 10:00:00 15.0 2012-09-01 10:30:00 8.5 2012-09-01 11:00:00 16.0 2012-09-01 11:30:00 11.5 2012-09-01 12:00:00 12.0 2012-09-01 12:30:00 17.0 2012-09-01 13:00:00 12.0 2012-09-01 13:30:00 4.5 2012-09-01 14:00:00 16.0 2012-09-01 14:30:00 11.5 2012-09-01 15:00:00 7.5 2012-09-01 15:30:00 12.0 2012-09-01 16:00:00 7.5 2012-09-01 16:30:00 5.0 2012-09-01 17:00:00 20.0 2012-09-01 17:30:00 11.0 2012-09-01 18:00:00 13.0 2012-09-01 18:30:00 14.0 2012-09-01 19:00:00 15.5 2012-09-01 19:30:00 14.5 2012-09-01 20:00:00 6.5 2012-09-01 20:30:00 13.5 2012-09-01 21:00:00 7.0 2012-09-01 21:30:00 7.0 2012-09-01 22:00:00 19.0 2012-09-01 22:30:00 20.5 2012-09-01 23:00:00 18.5 2012-09-01 23:30:00 21.0 Freq: 30T, dtype: float64
Notice how in the example above, the first record is labelled with '2012-08-31' since it's the left boundary and we picked 'right' for closing intervals and 'left' for labelling. We might want to close and label on the left.
ts30 = ts15.resample('30min',how='mean',closed='left',label='left')
ts30
2012-09-01 00:00:00 11.5 2012-09-01 00:30:00 5.5 2012-09-01 01:00:00 9.5 2012-09-01 01:30:00 5.5 2012-09-01 02:00:00 14.5 2012-09-01 02:30:00 15.0 2012-09-01 03:00:00 13.5 2012-09-01 03:30:00 2.0 2012-09-01 04:00:00 13.0 2012-09-01 04:30:00 17.5 2012-09-01 05:00:00 22.0 2012-09-01 05:30:00 17.5 2012-09-01 06:00:00 13.5 2012-09-01 06:30:00 16.5 2012-09-01 07:00:00 20.0 2012-09-01 07:30:00 4.0 2012-09-01 08:00:00 20.0 2012-09-01 08:30:00 20.0 2012-09-01 09:00:00 3.0 2012-09-01 09:30:00 13.0 2012-09-01 10:00:00 18.0 2012-09-01 10:30:00 14.5 2012-09-01 11:00:00 11.5 2012-09-01 11:30:00 14.0 2012-09-01 12:00:00 12.0 2012-09-01 12:30:00 14.0 2012-09-01 13:00:00 11.0 2012-09-01 13:30:00 12.5 2012-09-01 14:00:00 9.5 2012-09-01 14:30:00 16.5 2012-09-01 15:00:00 7.0 2012-09-01 15:30:00 3.5 2012-09-01 16:00:00 12.5 2012-09-01 16:30:00 8.5 2012-09-01 17:00:00 14.5 2012-09-01 17:30:00 18.5 2012-09-01 18:00:00 13.0 2012-09-01 18:30:00 9.5 2012-09-01 19:00:00 13.5 2012-09-01 19:30:00 12.0 2012-09-01 20:00:00 15.5 2012-09-01 20:30:00 4.5 2012-09-01 21:00:00 10.5 2012-09-01 21:30:00 11.0 2012-09-01 22:00:00 13.0 2012-09-01 22:30:00 18.5 2012-09-01 23:00:00 19.5 2012-09-01 23:30:00 20.0 Freq: 30T, dtype: float64
A common financial aggregation is to compute the first (open), last (close), max (high), and min (low) values for each bucket. If you use 'ohlc' for the how argument, you get a DataFrame with these aggregations computed automatically.
ts60 = ts15.resample('60min',how='ohlc',closed='left',label='left')
ts60
open | high | low | close | |
---|---|---|---|---|
2012-09-01 00:00:00 | 16 | 16 | 4 | 7 |
2012-09-01 01:00:00 | 18 | 18 | 1 | 7 |
2012-09-01 02:00:00 | 7 | 24 | 6 | 6 |
2012-09-01 03:00:00 | 22 | 22 | 1 | 3 |
2012-09-01 04:00:00 | 18 | 20 | 8 | 20 |
2012-09-01 05:00:00 | 20 | 24 | 15 | 20 |
2012-09-01 06:00:00 | 10 | 17 | 10 | 16 |
2012-09-01 07:00:00 | 24 | 24 | 2 | 6 |
2012-09-01 08:00:00 | 20 | 23 | 17 | 23 |
2012-09-01 09:00:00 | 1 | 17 | 1 | 17 |
2012-09-01 10:00:00 | 24 | 24 | 11 | 11 |
2012-09-01 11:00:00 | 6 | 17 | 6 | 13 |
2012-09-01 12:00:00 | 10 | 18 | 10 | 18 |
2012-09-01 13:00:00 | 16 | 18 | 6 | 7 |
2012-09-01 14:00:00 | 2 | 18 | 2 | 18 |
2012-09-01 15:00:00 | 5 | 9 | 1 | 1 |
2012-09-01 16:00:00 | 23 | 23 | 2 | 4 |
2012-09-01 17:00:00 | 6 | 23 | 6 | 20 |
2012-09-01 18:00:00 | 2 | 24 | 2 | 17 |
2012-09-01 19:00:00 | 11 | 16 | 9 | 9 |
2012-09-01 20:00:00 | 20 | 20 | 2 | 7 |
2012-09-01 21:00:00 | 20 | 20 | 1 | 9 |
2012-09-01 22:00:00 | 5 | 21 | 5 | 20 |
2012-09-01 23:00:00 | 21 | 21 | 18 | 21 |
24 rows × 4 columns
Upsampling brings an entirely different set of challenges. If we increase the frequency of the samples, how do we fill in these new time series values? For example, consider the reverse of the last example. Then check out the fill_method options.
ts30
2012-09-01 00:00:00 11.5 2012-09-01 00:30:00 5.5 2012-09-01 01:00:00 9.5 2012-09-01 01:30:00 5.5 2012-09-01 02:00:00 14.5 2012-09-01 02:30:00 15.0 2012-09-01 03:00:00 13.5 2012-09-01 03:30:00 2.0 2012-09-01 04:00:00 13.0 2012-09-01 04:30:00 17.5 2012-09-01 05:00:00 22.0 2012-09-01 05:30:00 17.5 2012-09-01 06:00:00 13.5 2012-09-01 06:30:00 16.5 2012-09-01 07:00:00 20.0 2012-09-01 07:30:00 4.0 2012-09-01 08:00:00 20.0 2012-09-01 08:30:00 20.0 2012-09-01 09:00:00 3.0 2012-09-01 09:30:00 13.0 2012-09-01 10:00:00 18.0 2012-09-01 10:30:00 14.5 2012-09-01 11:00:00 11.5 2012-09-01 11:30:00 14.0 2012-09-01 12:00:00 12.0 2012-09-01 12:30:00 14.0 2012-09-01 13:00:00 11.0 2012-09-01 13:30:00 12.5 2012-09-01 14:00:00 9.5 2012-09-01 14:30:00 16.5 2012-09-01 15:00:00 7.0 2012-09-01 15:30:00 3.5 2012-09-01 16:00:00 12.5 2012-09-01 16:30:00 8.5 2012-09-01 17:00:00 14.5 2012-09-01 17:30:00 18.5 2012-09-01 18:00:00 13.0 2012-09-01 18:30:00 9.5 2012-09-01 19:00:00 13.5 2012-09-01 19:30:00 12.0 2012-09-01 20:00:00 15.5 2012-09-01 20:30:00 4.5 2012-09-01 21:00:00 10.5 2012-09-01 21:30:00 11.0 2012-09-01 22:00:00 13.0 2012-09-01 22:30:00 18.5 2012-09-01 23:00:00 19.5 2012-09-01 23:30:00 20.0 Freq: 30T, dtype: float64
ts30.resample('15min')
2012-09-01 00:00:00 11.5 2012-09-01 00:15:00 NaN 2012-09-01 00:30:00 5.5 2012-09-01 00:45:00 NaN 2012-09-01 01:00:00 9.5 2012-09-01 01:15:00 NaN 2012-09-01 01:30:00 5.5 2012-09-01 01:45:00 NaN 2012-09-01 02:00:00 14.5 2012-09-01 02:15:00 NaN 2012-09-01 02:30:00 15.0 2012-09-01 02:45:00 NaN 2012-09-01 03:00:00 13.5 2012-09-01 03:15:00 NaN 2012-09-01 03:30:00 2.0 ... 2012-09-01 20:00:00 15.5 2012-09-01 20:15:00 NaN 2012-09-01 20:30:00 4.5 2012-09-01 20:45:00 NaN 2012-09-01 21:00:00 10.5 2012-09-01 21:15:00 NaN 2012-09-01 21:30:00 11.0 2012-09-01 21:45:00 NaN 2012-09-01 22:00:00 13.0 2012-09-01 22:15:00 NaN 2012-09-01 22:30:00 18.5 2012-09-01 22:45:00 NaN 2012-09-01 23:00:00 19.5 2012-09-01 23:15:00 NaN 2012-09-01 23:30:00 20.0 Freq: 15T, Length: 95
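With the newer method-chaining API, you pick the fill strategy explicitly, e.g. .resample(...).ffill() instead of the fill_method argument. A small sketch with a hard-coded 30 minute series:

```python
import pandas as pd

rng30 = pd.date_range('9/1/2012', periods=4, freq='30min')
ts30_demo = pd.Series([11.5, 5.5, 9.5, 5.5], index=rng30)

# Forward-fill: each new 15 minute slot repeats the prior 30 minute value
filled = ts30_demo.resample('15min').ffill()
print(filled)
```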
Pandas tries to make it a little easier to create nice looking time series plots using matplotlib than if you just used matplotlib yourself. Let's read in some stock price data from Yahoo! Finance that has already been saved to a csv file.
close_px_all = pd.read_csv('stock_px.csv', parse_dates=True, index_col=0)
close_px_all.head()
AAPL | MSFT | XOM | SPX | |
---|---|---|---|---|
2003-01-02 | 7.40 | 21.11 | 29.22 | 909.03 |
2003-01-03 | 7.45 | 21.14 | 29.24 | 908.59 |
2003-01-06 | 7.45 | 21.52 | 29.96 | 929.01 |
2003-01-07 | 7.43 | 21.93 | 28.95 | 922.93 |
2003-01-08 | 7.28 | 21.31 | 28.83 | 909.93 |
5 rows × 4 columns
close_px = close_px_all[['AAPL','MSFT','XOM']]
close_px[1:10]
AAPL | MSFT | XOM | |
---|---|---|---|
2003-01-03 | 7.45 | 21.14 | 29.24 |
2003-01-06 | 7.45 | 21.52 | 29.96 |
2003-01-07 | 7.43 | 21.93 | 28.95 |
2003-01-08 | 7.28 | 21.31 | 28.83 |
2003-01-09 | 7.34 | 21.93 | 29.44 |
2003-01-10 | 7.36 | 21.97 | 29.03 |
2003-01-13 | 7.32 | 22.16 | 28.91 |
2003-01-14 | 7.30 | 22.39 | 29.17 |
2003-01-15 | 7.22 | 22.11 | 28.77 |
9 rows × 3 columns
close_px['AAPL'].plot()
<matplotlib.axes.AxesSubplot at 0x8097990>
close_px.plot()
<matplotlib.axes.AxesSubplot at 0x72ad3d0>
# And here's what it looks like if we just plot one year. Notice the auto x-axis formatting.
close_px.ix['2009'].plot()
<matplotlib.axes.AxesSubplot at 0x814ad70>
close_px.index
<class 'pandas.tseries.index.DatetimeIndex'> [2003-01-02, ..., 2011-10-14] Length: 2214, Freq: None, Timezone: None
list(close_px.index)[1:25]
[Timestamp('2003-01-03 00:00:00', tz=None), Timestamp('2003-01-06 00:00:00', tz=None), Timestamp('2003-01-07 00:00:00', tz=None), Timestamp('2003-01-08 00:00:00', tz=None), Timestamp('2003-01-09 00:00:00', tz=None), Timestamp('2003-01-10 00:00:00', tz=None), Timestamp('2003-01-13 00:00:00', tz=None), Timestamp('2003-01-14 00:00:00', tz=None), Timestamp('2003-01-15 00:00:00', tz=None), Timestamp('2003-01-16 00:00:00', tz=None), Timestamp('2003-01-17 00:00:00', tz=None), Timestamp('2003-01-21 00:00:00', tz=None), Timestamp('2003-01-22 00:00:00', tz=None), Timestamp('2003-01-23 00:00:00', tz=None), Timestamp('2003-01-24 00:00:00', tz=None), Timestamp('2003-01-27 00:00:00', tz=None), Timestamp('2003-01-28 00:00:00', tz=None), Timestamp('2003-01-29 00:00:00', tz=None), Timestamp('2003-01-30 00:00:00', tz=None), Timestamp('2003-01-31 00:00:00', tz=None), Timestamp('2003-02-03 00:00:00', tz=None), Timestamp('2003-02-04 00:00:00', tz=None), Timestamp('2003-02-05 00:00:00', tz=None), Timestamp('2003-02-06 00:00:00', tz=None)]
close_px['AAPL'].ix['01-2011':'03-2011'].plot()
<matplotlib.axes.AxesSubplot at 0x81a9c90>
Pandas does a particularly nice job with quarterly data.
appl_q = close_px['AAPL'].resample('Q-DEC')
appl_q
2003-03-31 7.289180 2003-06-30 8.299365 2003-09-30 10.499844 2003-12-31 10.893906 2004-03-31 12.031613 2004-06-30 14.335645 2004-09-30 16.609219 2004-12-31 27.792813 2005-03-31 39.535574 2005-06-30 37.876719 2005-09-30 45.719219 2005-12-31 63.500476 2006-03-31 70.159194 2006-06-30 64.030000 2006-09-30 66.210635 2006-12-31 82.832381 2007-03-31 88.520164 2007-06-30 108.308254 2007-09-30 135.619683 2007-12-31 178.586875 2008-03-31 139.138689 2008-06-30 173.566719 2008-09-30 160.141250 2008-12-31 94.936250 2009-03-31 93.574098 2009-06-30 129.413333 2009-09-30 164.225156 2009-12-31 197.276406 2010-03-31 210.855082 2010-06-30 254.738889 2010-09-30 259.905156 2010-12-31 311.494688 2011-03-31 345.683226 2011-06-30 337.612381 2011-09-30 380.510312 2011-12-31 389.424000 Freq: Q-DEC, Name: AAPL, dtype: float64
appl_q.ix['2009':].plot()
<matplotlib.axes.AxesSubplot at 0x82417d0>
Pandas makes it easy to do all kinds of moving window operations like moving average, cumulative sums, and even things like exponential moving averages ("exponential smoothing").
close_px.AAPL.plot()
pd.rolling_mean(close_px.AAPL,250).plot()
<matplotlib.axes.AxesSubplot at 0x82ac050>
pd.rolling_mean(close_px.AAPL,250)
2003-01-02 NaN 2003-01-03 NaN 2003-01-06 NaN 2003-01-07 NaN 2003-01-08 NaN 2003-01-09 NaN 2003-01-10 NaN 2003-01-13 NaN 2003-01-14 NaN 2003-01-15 NaN 2003-01-16 NaN 2003-01-17 NaN 2003-01-21 NaN 2003-01-22 NaN 2003-01-23 NaN ... 2011-09-26 342.82340 2011-09-27 343.28544 2011-09-28 343.74340 2011-09-29 344.19112 2011-09-30 344.56064 2011-10-03 344.90228 2011-10-04 345.23540 2011-10-05 345.57212 2011-10-06 345.90016 2011-10-07 346.18520 2011-10-10 346.53988 2011-10-11 346.93180 2011-10-12 347.28160 2011-10-13 347.64332 2011-10-14 348.09336 Length: 2214
This example shows one problem with a moving average - by default you can't compute it until you have data for the number of periods corresponding to the window size (250 in this case). Pandas lets you specify a smaller minimum number of data points to get around this problem.
pd.rolling_mean(close_px.AAPL,250,min_periods=10)
2003-01-02 NaN 2003-01-03 NaN 2003-01-06 NaN 2003-01-07 NaN 2003-01-08 NaN 2003-01-09 NaN 2003-01-10 NaN 2003-01-13 NaN 2003-01-14 NaN 2003-01-15 7.355000 2003-01-16 7.350909 2003-01-17 7.325833 2003-01-21 7.301538 2003-01-22 7.275714 2003-01-23 7.263333 ... 2011-09-26 342.82340 2011-09-27 343.28544 2011-09-28 343.74340 2011-09-29 344.19112 2011-09-30 344.56064 2011-10-03 344.90228 2011-10-04 345.23540 2011-10-05 345.57212 2011-10-06 345.90016 2011-10-07 346.18520 2011-10-10 346.53988 2011-10-11 346.93180 2011-10-12 347.28160 2011-10-13 347.64332 2011-10-14 348.09336 Length: 2214
close_px.AAPL.plot()
pd.rolling_mean(close_px.AAPL,250,min_periods=10).plot()
<matplotlib.axes.AxesSubplot at 0x84f6690>
# Things like rolling standard deviation are also possible
appl_std250 = pd.rolling_std(close_px.AAPL,250,min_periods=10)
appl_std250.plot()
<matplotlib.axes.AxesSubplot at 0x851deb0>
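A heads up: the pd.rolling_* functions used above were later removed in favor of a .rolling() method on Series/DataFrame. A sketch of the modern equivalents:

```python
import pandas as pd
import numpy as np

s = pd.Series(np.arange(20, dtype=float))

# Modern spelling of pd.rolling_mean(s, 5, min_periods=2) and rolling_std
roll_mean = s.rolling(window=5, min_periods=2).mean()
roll_std = s.rolling(window=5, min_periods=2).std()
print(roll_mean.iloc[4])
```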
Another intro tutorial is http://earthpy.org/pandas-basics.html
Wes McKinney's 2012 PyCon talk on pandas and timeseries - http://pyvideo.org/video/1198/time-series-data-analysis-with-pandas