Some imports:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
try:
    import seaborn
except ImportError:
    pass
pd.options.display.max_rows = 8
AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe.
from IPython.display import HTML
HTML('<iframe src=http://www.eea.europa.eu/data-and-maps/data/airbase-the-european-air-quality-database-8#tab-data-by-country width=900 height=350></iframe>')
I downloaded and preprocessed some of the data (python-airbase): data/airbase_data.csv. This file includes the hourly concentrations of NO2 for 4 different measurement stations:
Import the csv file:
!head -5 data/airbase_data.csv
As you can see, the missing values are indicated by -9999. These can be recognized by read_csv by passing the na_values keyword:
data = pd.read_csv('data/airbase_data.csv', index_col=0, parse_dates=True, na_values=[-9999])
Some useful methods: head and tail
data.head(3)
data.tail()
info()
data.info()
Getting some basic summary statistics about the data with describe:
data.describe()
Quickly visualizing the data
data.plot(kind='box', ylim=[0,250])
data['BETR801'].plot(kind='hist', bins=50)
data.plot(figsize=(12,6))
This does not tell us much. We can select part of the data (e.g. the last 500 data points):
data[-500:].plot(figsize=(12,6))
Or we can use some more advanced time series features, covered in the next section!
When we ensure the DataFrame has a DatetimeIndex, time-series related functionality becomes available:
data.index
Indexing a time series works with strings:
data["2010-01-01 09:00":"2010-01-01 12:00"]
A nice feature is "partial string" indexing, where we can do implicit slicing by providing a partial datetime string.
E.g. all data of 2012:
data['2012']
Normally you would expect this to access a column named '2012', but for a DataFrame with a DatetimeIndex, pandas also tries to interpret it as a datetime slice.
Or all data of January up to March 2012:
data['2012-01':'2012-03']
Time and date components can be accessed from the index:
data.index.hour
data.index.year
data = data['1999':]
data[data.index.month == 1]
data['months'] = data.index.month
data[data['months'].isin([1, 2, 3])]
data[(data.index.hour >= 8) & (data.index.hour < 20)]
data.between_time('08:00', '20:00')
data = pd.read_csv('data/airbase_data.csv', index_col=0, parse_dates=True, na_values=[-9999])
data = data['1999':]
resample
A very powerful method is resample: converting the frequency of the time series (e.g. from hourly to daily data).
The time series has a frequency of 1 hour. I want to change this to daily:
data.resample('D').mean().head()
Note: in older versions of pandas (before 0.18), data.resample('D').mean() was expressed as data.resample('D', how='mean').
Similar to groupby
, other methods can also be specified:
data.resample('D').max().head()
The strings to specify the new time frequency ("offset aliases") are listed at http://pandas.pydata.org/pandas-docs/dev/timeseries.html#offset-aliases. These strings can also be combined with numbers, e.g. '10D'.
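As a quick illustration of combining a number with an offset alias, here is a minimal sketch on a small synthetic daily series (the names idx and s are made up for illustration, not part of the airbase data):

```python
import pandas as pd
import numpy as np

# A synthetic daily series: 30 days with values 0, 1, ..., 29
idx = pd.date_range('2012-01-01', periods=30, freq='D')
s = pd.Series(np.arange(30, dtype=float), index=idx)

# Resample into 10-day bins and take the mean of each bin
ten_daily = s.resample('10D').mean()
print(ten_daily)  # 3 bins: means of days 0-9, 10-19, 20-29
```

With 30 daily values this yields three 10-day bins, with means 4.5, 14.5 and 24.5.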
Further exploring the data:
data.resample('M').mean().plot() # 'A'
# data['2012'].resample('D').mean().plot()
# %load snippets/05 - Time series data29.py
# %load snippets/05 - Time series data30.py
# %load snippets/05 - Time series data31.py
# %load snippets/05 - Time series data32.py
# %load snippets/05 - Time series data33.py
# %load snippets/05 - Time series data34.py
resample can actually be seen as a specific kind of groupby. E.g. taking annual means with data.resample('A').mean() is equivalent to data.groupby(data.index.year).mean() (only the result of resample still has a DatetimeIndex).
data.groupby(data.index.year).mean().plot()
But groupby is more flexible and can also do groupings that do not result in a new continuous time series, e.g. grouping by the hour of the day to get the diurnal cycle.
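For example, a diurnal cycle can be computed by grouping on the hour of the day. A minimal sketch on a synthetic hourly series (the series s is invented for illustration; its value simply equals the hour of day, so the result is easy to check):

```python
import pandas as pd
import numpy as np

# One week of hourly timestamps (built with to_timedelta to stay version-neutral)
idx = pd.Timestamp('2012-01-01') + pd.to_timedelta(np.arange(24 * 7), unit='h')
# Value equals the hour of day (0-23), repeated for 7 days
s = pd.Series(idx.hour.astype(float), index=idx)

# Group by hour of day: the mean per hour is the typical diurnal profile
diurnal = s.groupby(s.index.hour).mean()
print(diurnal.head())
```

Note that the result is indexed by hour (0 to 23), not by a DatetimeIndex, which is exactly the kind of grouping resample cannot do.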
1. Add a column to the dataframe that indicates the month (integer value of 1 to 12):
# %load snippets/05 - Time series data36.py
2. Now, we can calculate the mean of each month over the different years:
# %load snippets/05 - Time series data37.py
3. Plot the typical monthly profile of the different stations:
# %load snippets/05 - Time series data38.py
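A possible sketch of the three steps above, demonstrated on a small synthetic two-station DataFrame (the column names BETR801 and BETN029 and the random values are placeholders, not the real measurements):

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for the NO2 DataFrame: two "stations", two years of daily data
idx = pd.date_range('2011-01-01', '2012-12-31', freq='D')
rng = np.random.RandomState(0)
data = pd.DataFrame({'BETR801': rng.rand(len(idx)) * 100,
                     'BETN029': rng.rand(len(idx)) * 100}, index=idx)

# 1. add a column with the month (integer 1 to 12)
data['month'] = data.index.month

# 2. mean of each month over the different years
monthly = data.groupby('month').mean()

# 3. plot the typical monthly profile of the different stations
# monthly.plot()

print(monthly.shape)  # 12 months x 2 stations
```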
df2011 = data['2011'].dropna()
# %load snippets/05 - Time series data40.py
# %load snippets/05 - Time series data41.py
data = data.drop('month', axis=1)
# %load snippets/05 - Time series data43.py
# %load snippets/05 - Time series data44.py
# %load snippets/05 - Time series data45.py
# %load snippets/05 - Time series data46.py
# %load snippets/05 - Time series data47.py
# %load snippets/05 - Time series data48.py
# %load snippets/05 - Time series data49.py
# %load snippets/05 - Time series data50.py
Tip: have a look at the rolling method to perform moving window operations.
Note: this is not an actual limit value for NO2, but a nice exercise to introduce the rolling method. Other pollutants, such as O3, do have such limit values.
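A minimal sketch of a moving window mean with rolling, again on a synthetic daily series (the 30-day window is an arbitrary choice for illustration):

```python
import pandas as pd
import numpy as np

# Synthetic daily series with values 0, 1, ..., 99
idx = pd.date_range('2012-01-01', periods=100, freq='D')
s = pd.Series(np.arange(100, dtype=float), index=idx)

# 30-day moving average: the first 29 entries are NaN
# because a full window is not yet available
smooth = s.rolling(30).mean()
print(smooth.iloc[29])  # mean of 0..29 = 14.5
```

The smoothed series can then be plotted alongside the original to see the effect of the window size.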
# %load snippets/05 - Time series data52.py
# %load snippets/05 - Time series data53.py
# %load snippets/05 - Time series data54.py