import addutils.toc ; addutils.toc.js(ipy_notebook=True)
For further documentation see Time Series / Date functionality in the pandas documentation.
import numpy as np
import pandas as pd
from numpy import NaN
from IPython.display import (display, HTML)
from pandas.tseries.offsets import *
from addutils import side_by_side2
from addutils import css_notebook
css_notebook()
to_datetime
converts a list of date-like objects to Timestamps. If the list is homogeneous, infer_datetime_format=True
can give a great speed-up. Otherwise, with format
it is possible to define the format of the strings a-priori.
pd.to_datetime(['10-11-2000 17:52:13.7731+02:00', '3-5-2014 11:57:18.7754+01:00'], dayfirst=True, utc=True)
DatetimeIndex(['2000-11-10 15:52:13.773100+00:00', '2014-05-03 10:57:18.775400+00:00'], dtype='datetime64[ns, UTC]', freq=None)
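As a quick sketch of the format parameter mentioned above (the date strings here are made up for illustration), an explicit strftime-style format skips per-element inference entirely:

```python
import pandas as pd

# With format= every string must match the given pattern exactly;
# pandas then parses without guessing the layout
ts = pd.to_datetime(['2000/11/10 17:52:13', '2014/05/03 11:57:18'],
                    format='%Y/%m/%d %H:%M:%S')
print(ts)
```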
It’s also possible to convert integer or float epoch times. The value is interpreted as a Unix timestamp; the default unit of measure is nanoseconds, but a different unit (D, s, ms, us, ns) can be specified:
print (pd.to_datetime([0]))
print (pd.to_datetime([1]))
print (pd.to_datetime([1], unit='D'))
DatetimeIndex(['1970-01-01'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['1970-01-01 00:00:00.000000001'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['1970-01-02'], dtype='datetime64[ns]', freq=None)
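A common real-world case for the unit parameter is converting Unix epoch seconds, as in this small sketch:

```python
import pandas as pd

# 1,000,000,000 seconds after the epoch, interpreted with unit='s'
idx = pd.to_datetime([1_000_000_000], unit='s')
print(idx)
```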
Index of Timestamps: use date_range
and bdate_range
to create regular frequency timestamp indexes. These functions return DatetimeIndex
objects that are arrays of Timestamps:
rng = pd.date_range('1/1/2012 15:53:25.335',
periods=6,
freq='2d3h35min42s115ms',
tz='Europe/Rome')
for date in rng:
print (date)
2012-01-01 15:53:25.335000+01:00
2012-01-03 19:29:07.450000+01:00
2012-01-05 23:04:49.565000+01:00
2012-01-08 02:40:31.680000+01:00
2012-01-10 06:16:13.795000+01:00
2012-01-12 09:51:55.910000+01:00
rng[0]
Timestamp('2012-01-01 15:53:25.335000+0100', tz='Europe/Rome', freq='185742115L')
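bdate_range, mentioned above but not shown, works like date_range but defaults to business-day frequency, skipping weekends. A minimal sketch (the start date is chosen here just so the range spans a weekend):

```python
import pandas as pd

# 2012-01-06 is a Friday: the next business days are Mon 9 and Tue 10
rng = pd.bdate_range('2012-01-06', periods=3)
print(rng)
```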
This can be used as an index for Series, DataFrames and Panels:
data = np.random.randint(0,99,(len(rng),3))
d1 = pd.DataFrame(data, index=rng, columns=list('ABC'))
display(d1)
A | B | C | |
---|---|---|---|
2012-01-01 15:53:25.335000+01:00 | 25 | 4 | 32 |
2012-01-03 19:29:07.450000+01:00 | 20 | 15 | 13 |
2012-01-05 23:04:49.565000+01:00 | 68 | 65 | 13 |
2012-01-08 02:40:31.680000+01:00 | 51 | 53 | 4 |
2012-01-10 06:16:13.795000+01:00 | 2 | 41 | 53 |
2012-01-12 09:51:55.910000+01:00 | 26 | 5 | 6 |
Pandas provides timezone conversion. Here we produce new data by converting the previous time series from the Rome timezone to the Eastern US timezone. When data with different timezones are combined together (as in d3 = d1+d2
) the results are given in UTC time, which can in turn be converted to any timezone. In this example the UTC timezone is shown with the notation +00:00
d2 = d1.tz_convert('US/Eastern')
HTML(side_by_side2(d1, d2))
A | B | C | |
---|---|---|---|
2012-01-01 15:53:25.335000+01:00 | 25 | 4 | 32 |
2012-01-03 19:29:07.450000+01:00 | 20 | 15 | 13 |
2012-01-05 23:04:49.565000+01:00 | 68 | 65 | 13 |
2012-01-08 02:40:31.680000+01:00 | 51 | 53 | 4 |
2012-01-10 06:16:13.795000+01:00 | 2 | 41 | 53 |
2012-01-12 09:51:55.910000+01:00 | 26 | 5 | 6 |
A | B | C | |
---|---|---|---|
2012-01-01 09:53:25.335000-05:00 | 25 | 4 | 32 |
2012-01-03 13:29:07.450000-05:00 | 20 | 15 | 13 |
2012-01-05 17:04:49.565000-05:00 | 68 | 65 | 13 |
2012-01-07 20:40:31.680000-05:00 | 51 | 53 | 4 |
2012-01-10 00:16:13.795000-05:00 | 2 | 41 | 53 |
2012-01-12 03:51:55.910000-05:00 | 26 | 5 | 6 |
d3 = d1+d2
d3
A | B | C | |
---|---|---|---|
2012-01-01 14:53:25.335000+00:00 | 50 | 8 | 64 |
2012-01-03 18:29:07.450000+00:00 | 40 | 30 | 26 |
2012-01-05 22:04:49.565000+00:00 | 136 | 130 | 26 |
2012-01-08 01:40:31.680000+00:00 | 102 | 106 | 8 |
2012-01-10 05:16:13.795000+00:00 | 4 | 82 | 106 |
2012-01-12 08:51:55.910000+00:00 | 52 | 10 | 12 |
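Note that tz_convert only works on timezone-aware data; a naive timestamp must first be localized with tz_localize. A minimal sketch of the distinction:

```python
import pandas as pd

# A naive timestamp carries no timezone information
ts = pd.Timestamp('2012-01-01 15:53:25')

# tz_localize attaches a timezone; tz_convert then translates between zones
ts_rome = ts.tz_localize('Europe/Rome')
ts_utc = ts_rome.tz_convert('UTC')   # Rome is UTC+1 in January
print(ts_utc)
```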
In the previous example we used a time frequency string (with Offset Aliases) freq='2d3h35min42s115ms'
with date_range
to create a DatetimeIndex
These frequency strings are translated into instances of the pandas DateOffset
, which represents a regular frequency increment. Specific offset logic like “month”, “business day”, or “one hour” is represented in its various subclasses.
A key feature of a DateOffset
object is that it can be added to dates and combined with other offsets:
from pandas.tseries.offsets import *
print (pd.datetime(2014, 1, 1) + Week() + Hour(13) + Minute()*15 + Milli()*350)
print (pd.datetime(2014, 1, 1) + MonthEnd())
print (pd.datetime(2014, 1, 1) + MonthBegin())
print (pd.datetime(2014, 1, 1) + BQuarterBegin(2))
2014-01-08 13:15:00.350000
2014-01-31 00:00:00
2014-02-01 00:00:00
2014-06-02 00:00:00
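Besides addition, anchored offsets also provide rollforward and rollback, which snap a date to the next or previous offset date. A short sketch with BMonthEnd (business month end):

```python
import pandas as pd
from pandas.tseries.offsets import BMonthEnd

offset = BMonthEnd()
d = pd.Timestamp(2014, 1, 15)

# rollforward moves to the next business month end, rollback to the previous one
fwd = offset.rollforward(d)    # 2014-01-31 (a Friday)
back = offset.rollback(d)      # 2013-12-31 (a Tuesday)
print(fwd, back)
```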
Offset Aliases can be used to define time series frequencies:
print (pd.date_range('2014/1/1', periods=2, freq=Minute(4)))
print (pd.date_range('2014/1/1', periods=2, freq='4T'))
print (pd.date_range('2014/1/1', periods=2, freq='4min'))
DatetimeIndex(['2014-01-01 00:00:00', '2014-01-01 00:04:00'], dtype='datetime64[ns]', freq='4T')
DatetimeIndex(['2014-01-01 00:00:00', '2014-01-01 00:04:00'], dtype='datetime64[ns]', freq='4T')
DatetimeIndex(['2014-01-01 00:00:00', '2014-01-01 00:04:00'], dtype='datetime64[ns]', freq='4T')
start = pd.datetime(2014, 1, 1, 12, 30)
end = pd.datetime(2014, 1, 12)
idx2 = pd.date_range(start, end, freq='1B1H10T')
for i in idx2:
print (i)
2014-01-01 12:30:00
2014-01-02 13:40:00
2014-01-03 14:50:00
2014-01-06 16:00:00
2014-01-07 17:10:00
2014-01-08 18:20:00
2014-01-09 19:30:00
As the name suggests, DatetimeIndex objects can be used to index DataFrames. Try uncommenting some of the following lines to see alternative ways of selecting dates:
data = np.random.randint(0,99,(len(idx2),3))
d2 = pd.DataFrame(data, index=idx2, columns=list('ABC'))
d2
A | B | C | |
---|---|---|---|
2014-01-01 12:30:00 | 65 | 8 | 4 |
2014-01-02 13:40:00 | 95 | 15 | 90 |
2014-01-03 14:50:00 | 71 | 92 | 2 |
2014-01-06 16:00:00 | 92 | 45 | 74 |
2014-01-07 17:10:00 | 48 | 65 | 59 |
2014-01-08 18:20:00 | 45 | 33 | 91 |
2014-01-09 19:30:00 | 94 | 28 | 46 |
# Remember that in Pandas endpoints are included
d2.loc['2014/01/02':'2014/01/08', ['A', 'C']]
A | C | |
---|---|---|
2014-01-02 13:40:00 | 95 | 90 |
2014-01-03 14:50:00 | 71 | 2 |
2014-01-06 16:00:00 | 92 | 74 |
2014-01-07 17:10:00 | 48 | 59 |
2014-01-08 18:20:00 | 45 | 91 |
d2.loc['2014/01/06 12:00':'2014/01/08 12:00', ['A', 'C']]
A | C | |
---|---|---|
2014-01-06 16:00:00 | 92 | 74 |
2014-01-07 17:10:00 | 48 | 59 |
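Partial string indexing also works with a single date string, which selects every row falling within that period. A self-contained sketch on an hourly index built for illustration:

```python
import numpy as np
import pandas as pd

# 48 hourly rows spanning two days
idx = pd.date_range('2014-01-01', periods=48, freq='h')
df = pd.DataFrame({'A': np.arange(48)}, index=idx)

# A bare date string selects all 24 rows of that day
day = df.loc['2014-01-02']
print(len(day))
```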
It is possible to manipulate indexes using offset objects. BDay means business day (be aware that Christmas is considered a business day).
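A quick sketch of BDay arithmetic (the date is chosen here just to land on a Friday):

```python
import pandas as pd
from pandas.tseries.offsets import BDay

# 2014-01-03 is a Friday: adding one business day skips the weekend
friday = pd.Timestamp('2014-01-03')
nxt = friday + BDay()
print(nxt)
```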
We can convert all TimeSeries to specified frequency using DateOffset objects. Optionally we can provide a fill method to handle missing values.
HTML(side_by_side2(d2,
d2.asfreq(Day(), method='ffill'),
d2.asfreq(Day(), method=None)))
A | B | C | |
---|---|---|---|
2014-01-01 12:30:00 | 65 | 8 | 4 |
2014-01-02 13:40:00 | 95 | 15 | 90 |
2014-01-03 14:50:00 | 71 | 92 | 2 |
2014-01-06 16:00:00 | 92 | 45 | 74 |
2014-01-07 17:10:00 | 48 | 65 | 59 |
2014-01-08 18:20:00 | 45 | 33 | 91 |
2014-01-09 19:30:00 | 94 | 28 | 46 |
A | B | C | |
---|---|---|---|
2014-01-01 12:30:00 | 65 | 8 | 4 |
2014-01-02 12:30:00 | 65 | 8 | 4 |
2014-01-03 12:30:00 | 95 | 15 | 90 |
2014-01-04 12:30:00 | 71 | 92 | 2 |
2014-01-05 12:30:00 | 71 | 92 | 2 |
2014-01-06 12:30:00 | 71 | 92 | 2 |
2014-01-07 12:30:00 | 92 | 45 | 74 |
2014-01-08 12:30:00 | 48 | 65 | 59 |
2014-01-09 12:30:00 | 45 | 33 | 91 |
A | B | C | |
---|---|---|---|
2014-01-01 12:30:00 | 65.0 | 8.0 | 4.0 |
2014-01-02 12:30:00 | NaN | NaN | NaN |
2014-01-03 12:30:00 | NaN | NaN | NaN |
2014-01-04 12:30:00 | NaN | NaN | NaN |
2014-01-05 12:30:00 | NaN | NaN | NaN |
2014-01-06 12:30:00 | NaN | NaN | NaN |
2014-01-07 12:30:00 | NaN | NaN | NaN |
2014-01-08 12:30:00 | NaN | NaN | NaN |
2014-01-09 12:30:00 | NaN | NaN | NaN |
We are going to see some ways to let pandas fill NaN
values in a DataFrame.
The first method is called forward filling and consists of using the first non-NaN
value above each hole.
d3 = d2.copy()
d3.iloc[2:5, 0] = np.nan
d3.iloc[0:4, 1] = np.nan
d3.iloc[4:, 2] = np.nan
cols = ['A', 'B', 'C']
HTML(side_by_side2(d3, d3[cols].fillna(method='ffill')))
A | B | C | |
---|---|---|---|
2014-01-01 12:30:00 | 65.0 | NaN | 4.0 |
2014-01-02 13:40:00 | 95.0 | NaN | 90.0 |
2014-01-03 14:50:00 | NaN | NaN | 2.0 |
2014-01-06 16:00:00 | NaN | NaN | 74.0 |
2014-01-07 17:10:00 | NaN | 65.0 | NaN |
2014-01-08 18:20:00 | 45.0 | 33.0 | NaN |
2014-01-09 19:30:00 | 94.0 | 28.0 | NaN |
A | B | C | |
---|---|---|---|
2014-01-01 12:30:00 | 65.0 | NaN | 4.0 |
2014-01-02 13:40:00 | 95.0 | NaN | 90.0 |
2014-01-03 14:50:00 | 95.0 | NaN | 2.0 |
2014-01-06 16:00:00 | 95.0 | NaN | 74.0 |
2014-01-07 17:10:00 | 95.0 | 65.0 | 74.0 |
2014-01-08 18:20:00 | 45.0 | 33.0 | 74.0 |
2014-01-09 19:30:00 | 94.0 | 28.0 | 74.0 |
Notice that on column 'B' the NaN
values are at the beginning, so the method wasn't able to fill those holes. The backward filling method is complementary to the one above: it fills gaps using the first non-NaN
value below the cell.
HTML(side_by_side2(d3, d3[cols].fillna(method='bfill')))
A | B | C | |
---|---|---|---|
2014-01-01 12:30:00 | 65.0 | NaN | 4.0 |
2014-01-02 13:40:00 | 95.0 | NaN | 90.0 |
2014-01-03 14:50:00 | NaN | NaN | 2.0 |
2014-01-06 16:00:00 | NaN | NaN | 74.0 |
2014-01-07 17:10:00 | NaN | 65.0 | NaN |
2014-01-08 18:20:00 | 45.0 | 33.0 | NaN |
2014-01-09 19:30:00 | 94.0 | 28.0 | NaN |
A | B | C | |
---|---|---|---|
2014-01-01 12:30:00 | 65.0 | 65.0 | 4.0 |
2014-01-02 13:40:00 | 95.0 | 65.0 | 90.0 |
2014-01-03 14:50:00 | 45.0 | 65.0 | 2.0 |
2014-01-06 16:00:00 | 45.0 | 65.0 | 74.0 |
2014-01-07 17:10:00 | 45.0 | 65.0 | NaN |
2014-01-08 18:20:00 | 45.0 | 33.0 | NaN |
2014-01-09 19:30:00 | 94.0 | 28.0 | NaN |
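Besides the directional methods, fillna also accepts a scalar (or a per-column dict) to replace every hole with a fixed value. A minimal sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0]})

# A scalar argument replaces every NaN with that value
filled = df.fillna(0)
print(filled)
```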
Visit www.add-for.com for more tutorials and updates.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.