Some imports:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
try:
    import seaborn
except ImportError:
    pass
pd.options.display.max_rows = 8
AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe.
from IPython.display import HTML
HTML('<iframe src=http://www.eea.europa.eu/data-and-maps/data/airbase-the-european-air-quality-database-8#tab-data-by-country width=900 height=350></iframe>')
I downloaded and preprocessed some of the data (python-airbase): data/airbase_data.csv. This file includes the hourly concentrations of NO2 for 4 different measurement stations:
Import the csv file:
!head -5 data/airbase_data.csv
As you can see, the missing values are indicated by -9999. These can be recognized by read_csv by passing the na_values keyword:
data = pd.read_csv('data/airbase_data.csv', index_col=0, parse_dates=True, na_values=[-9999])
Some useful methods: head and tail
data.head(3)
data.tail()
info()
data.info()
Getting some basic summary statistics about the data with describe:
data.describe()
Quickly visualizing the data
data.plot(kind='box', ylim=[0,250])
data['BETR801'].plot(kind='hist', bins=50)
data.plot(figsize=(12,6))
This does not tell us much. We can select part of the data (e.g. the last 500 data points):
data[-500:].plot(figsize=(12,6))
Or we can use some more advanced time series features, covered in the next section!
When we ensure the DataFrame has a DatetimeIndex, time-series related functionality becomes available:
data.index
Indexing a time series works with strings:
data["2010-01-01 09:00":"2010-01-01 12:00"]
A nice feature is "partial string" indexing, where we can do implicit slicing by providing a partial datetime string.
E.g. all data of 2012:
data['2012']
Normally you would expect this to access a column named '2012', but for a DataFrame with a DatetimeIndex, pandas also tries to interpret it as a datetime slice.
Or all data of January up to March 2012:
data['2012-01':'2012-03']
Time and date components can be accessed from the index:
data.index.hour
data.index.year
data = data['1999':]
data[data.index.month == 1]
data['months'] = data.index.month
data[data['months'].isin([1, 2, 3])]
data[(data.index.hour >= 8) & (data.index.hour < 20)]
data.between_time('08:00', '20:00')
data = pd.read_csv('data/airbase_data.csv', index_col=0, parse_dates=True, na_values=[-9999])
data = data['1999':]
resample
A very powerful method is resample: converting the frequency of the time series (e.g. from hourly to daily data).
The time series has a frequency of 1 hour. I want to change this to daily:
data.resample('D').mean().head()
Note: in older versions of pandas (before 0.18), data.resample('D').mean() was expressed as data.resample('D', how='mean').
Similar to groupby
, other methods can also be specified:
data.resample('D').max().head()
The strings to specify the new time frequency ("offset aliases") are listed at http://pandas.pydata.org/pandas-docs/dev/timeseries.html#offset-aliases. These strings can also be combined with numbers, e.g. '10D'.
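As a quick illustration of combining a number with an offset alias, here is a minimal sketch on a small synthetic daily series (the names idx and s are made up for illustration, not part of the airbase data):

```python
import pandas as pd
import numpy as np

# A synthetic daily series: 30 days with values 0, 1, ..., 29
idx = pd.date_range('2012-01-01', periods=30, freq='D')
s = pd.Series(np.arange(30, dtype=float), index=idx)

# Resample into 10-day bins and take the mean of each bin
ten_daily = s.resample('10D').mean()
print(ten_daily)  # 3 bins: means of days 0-9, 10-19, 20-29
```

With 30 daily values this yields three 10-day bins, with means 4.5, 14.5 and 24.5.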
Further exploring the data:
data.resample('M').mean().plot() # 'A'
# data['2012'].resample('D').mean().plot()
# %load snippets/05 - Time series data29.py
# %load snippets/05 - Time series data30.py
# %load snippets/05 - Time series data31.py
# %load snippets/05 - Time series data32.py
# %load snippets/05 - Time series data33.py
# %load snippets/05 - Time series data34.py
resample can actually be seen as a specific kind of groupby. E.g. taking annual means with data.resample('A').mean() is equivalent to data.groupby(data.index.year).mean() (only the result of resample still has a DatetimeIndex).
data.groupby(data.index.year).mean().plot()
But groupby is more flexible and can also do groupings that do not result in a new continuous time series, e.g. grouping by the hour of the day to get the diurnal cycle.
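For example, a diurnal cycle can be computed by grouping on the hour of the day. A minimal sketch on a synthetic hourly series (the series s is invented for illustration; its value simply equals the hour of day, so the result is easy to check):

```python
import pandas as pd
import numpy as np

# One week of hourly timestamps (built with to_timedelta to stay version-neutral)
idx = pd.Timestamp('2012-01-01') + pd.to_timedelta(np.arange(24 * 7), unit='h')
# Value equals the hour of day (0-23), repeated for 7 days
s = pd.Series(idx.hour.astype(float), index=idx)

# Group by hour of day: the mean per hour is the typical diurnal profile
diurnal = s.groupby(s.index.hour).mean()
print(diurnal.head())
```

Note that the result is indexed by hour (0 to 23), not by a DatetimeIndex, which is exactly the kind of grouping resample cannot do.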
1. Add a column to the dataframe that indicates the month (integer value of 1 to 12):
# %load snippets/05 - Time series data36.py
2. Now, we can calculate the mean of each month over the different years:
# %load snippets/05 - Time series data37.py
3. Plot the typical monthly profile of the different stations:
# %load snippets/05 - Time series data38.py
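A possible sketch of the three steps above, demonstrated on a small synthetic two-station DataFrame (the column names BETR801 and BETN029 and the random values are placeholders, not the real measurements):

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for the NO2 DataFrame: two "stations", two years of daily data
idx = pd.date_range('2011-01-01', '2012-12-31', freq='D')
rng = np.random.RandomState(0)
data = pd.DataFrame({'BETR801': rng.rand(len(idx)) * 100,
                     'BETN029': rng.rand(len(idx)) * 100}, index=idx)

# 1. add a column with the month (integer 1 to 12)
data['month'] = data.index.month

# 2. mean of each month over the different years
monthly = data.groupby('month').mean()

# 3. plot the typical monthly profile of the different stations
# monthly.plot()

print(monthly.shape)  # 12 months x 2 stations
```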
df2011 = data['2011'].dropna()
# %load snippets/05 - Time series data40.py
# %load snippets/05 - Time series data41.py
data = data.drop('month', axis=1)
# %load snippets/05 - Time series data43.py
# %load snippets/05 - Time series data44.py
# %load snippets/05 - Time series data45.py
# %load snippets/05 - Time series data46.py
# %load snippets/05 - Time series data47.py
# %load snippets/05 - Time series data48.py
# %load snippets/05 - Time series data49.py
# %load snippets/05 - Time series data50.py
Tip: have a look at the rolling method to perform moving window operations.
Note: this is not an actual limit value for NO2, but a nice exercise to introduce the rolling method. Other pollutants, such as O3, do have such limit values.
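A minimal sketch of a moving window mean with rolling, again on a synthetic daily series (the 30-day window is an arbitrary choice for illustration):

```python
import pandas as pd
import numpy as np

# Synthetic daily series with values 0, 1, ..., 99
idx = pd.date_range('2012-01-01', periods=100, freq='D')
s = pd.Series(np.arange(100, dtype=float), index=idx)

# 30-day moving average: the first 29 entries are NaN
# because a full window is not yet available
smooth = s.rolling(30).mean()
print(smooth.iloc[29])  # mean of 0..29 = 14.5
```

The smoothed series can then be plotted alongside the original to see the effect of the window size.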
# %load snippets/05 - Time series data52.py
# %load snippets/05 - Time series data53.py
# %load snippets/05 - Time series data54.py