Pandas: Working with time series data
© 2016, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons
%matplotlib notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_rows = 8
datetime
module¶Standard Python contains the datetime
module to handle with date and time data:
import datetime
dt = datetime.datetime(year=2016, month=12, day=19, hour=13, minute=30)
dt
print(dt) # .day,...
print(dt.strftime("%d %B %Y"))
Timestamp
object¶Pandas has its own date and time objects, which are compatible with the standard datetime
objects, but provide some more functionality to work with.
The Timestamp
object can also be constructed from a string:
ts = pd.Timestamp('2016-12-19')
ts
Like with datetime.datetime
objects, there are several useful attributes available on the Timestamp
. For example, we can get the month:
ts.month
ts + pd.Timedelta('5 days')
Unfortunately, when working with real world data, you encounter many different datetime
formats. Most of the time when you have to deal with them, they come in text format, e.g. from a CSV
file. To work with those data in Pandas, we first have to parse the strings to actual Timestamp
objects.
from string formatted dates to Timestamp objects: to_datetime
function
pd.to_datetime("2016-12-09")
pd.to_datetime("09/12/2016")
pd.to_datetime("09/12/2016", dayfirst=True)
pd.to_datetime("09/12/2016", format="%d/%m/%Y")
A detailed overview of how to specify the format
string, see the table in the python documentation: https://docs.python.org/3.5/library/datetime.html#strftime-and-strptime-behavior
Timestamp
data in a Series or DataFrame column¶s = pd.Series(['2016-12-09 10:00:00', '2016-12-09, 11:00:00', '2016-12-09 12:00:00'])
The to_datetime
function can also be used to convert a full series of strings:
ts = pd.to_datetime(s)
ts
Notice the data type of this series: the datetime64[ns]
dtype. This indicates that we have a series of actual datetime values.
The same attributes as on single Timestamp
s are also available on a Series with datetime data, using the .dt
accessor:
ts.dt.hour
ts.dt.weekday
To quickly construct some regular time series data, the pd.date_range
function comes in handy:
pd.Series(pd.date_range(start="2016-01-01", periods=10, freq='3H'))
Timestamp
in the index¶For the following demonstration of the time series functionality, we use a sample of discharge data of the Maarkebeek (Flanders) with 3 hour averaged values, derived from the Waterinfo website.
data = pd.read_csv("data/flowdata.csv")
data.head()
We already know how to parse a date column with Pandas:
data['Time'] = pd.to_datetime(data['Time'])
With set_index('datetime')
, we set the column with datetime values as the index, which can be done by both Series
and DataFrame
.
data = data.set_index("Time")
data
The steps above are provided as built-in functionality of read_csv
:
data = pd.read_csv("data/flowdata.csv", index_col=0, parse_dates=True)
pd.read_csv
provides a lot of built-in functionality to support this kind of transactions when reading in a file! Check the help of the read_csv function...
When we ensure the DataFrame has a DatetimeIndex
, time-series related functionality becomes available:
data.index
Similar to a Series with datetime data, there are some attributes of the timestamp values available:
data.index.day
data.index.dayofyear
data.index.year
The plot
method will also adapt it's labels (when you zoom in, you can see the different levels of detail of the datetime labels):
data.plot()
We have to much data to sensibly plot on one figure. Let's see how we can easily select part of the data or aggregate the data to other time resolutions in the next sections.
We can use label based indexing on a timeseries as expected:
data[pd.Timestamp("2012-01-01 09:00"):pd.Timestamp("2012-01-01 19:00")]
But, for convenience, indexing a time series also works with strings:
data["2012-01-01 09:00":"2012-01-01 19:00"]
A nice feature is "partial string" indexing, where we can do implicit slicing by providing a partial datetime string.
E.g. all data of 2013:
data['2013']
Normally you would expect this to access a column named '2013', but as for a DatetimeIndex, pandas also tries to interprete it as a datetime slice.
Or all data of January up to March 2012:
data['2012-01':'2012-03']
EXERCISE:
# %load snippets/05 - Time series data36.py
EXERCISE:
# %load snippets/05 - Time series data37.py
EXERCISE:
# %load snippets/05 - Time series data38.py
data = data.drop("months", axis=1)
EXERCISE:
# %load snippets/05 - Time series data40.py
# %load snippets/05 - Time series data41.py
resample
¶A very powerfull method is resample
: converting the frequency of the time series (e.g. from hourly to daily data).
The time series has a frequency of 1 hour. I want to change this to daily:
data.resample('D').mean().head()
NOTE:
data.resample('D').mean()
was expressed as data.resample('D', how='mean')
.Other mathematical methods can also be specified:
data.resample('D').max().head()
The string to specify the new time frequency: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases <br><br>
These strings can also be combined with numbers, eg `'10D'`...
data.resample('A').mean().plot() # 10D
EXERCISE:
# %load snippets/05 - Time series data45.py
EXERCISE:
**Note** <br>You can create a new figure with `fig, ax = plt.subplots()` and add each of the plots to the created `ax` object (see documentation of pandas plot function)
# %load snippets/05 - Time series data46.py
# %load snippets/05 - Time series data47.py
EXERCISE:
# %load snippets/05 - Time series data48.py
# %load snippets/05 - Time series data49.py
# %load snippets/05 - Time series data50.py