Time Series Practicals for GEOGG121

1. Data

Today, we will use the in situ CO2 measurements from Mauna Loa Station (http://www.esrl.noaa.gov/gmd/obop/mlo/programs/esrl/co2/co2.html). The data can be downloaded from: ftp://aftp.cmdl.noaa.gov/data/trace_gases/co2/in-situ/surface/mlo/co2_mlo_surface-insitu_1_ccgg_month.txt. If you haven't used this data set before, start by examining the data file (read the header information, etc.). Then download the file to your preferred data directory.

In [ ]:

!wget ftp://aftp.cmdl.noaa.gov/data/trace_gases/co2/in-situ/surface/mlo/co2_mlo_surface-insitu_1_ccgg_month.txt
!mv co2_mlo_surface-insitu_1_ccgg_month.txt mlo_co2.txt 

How many columns are in the file? Let's use a column-friendly function from numpy, the loadtxt function, to read this file into an array.

In [2]:

import numpy as np
co2 = np.loadtxt("mlo_co2.txt", comments="#", usecols=(3,), unpack=False)[4:] #skiprows=4, 
yearmonth = np.loadtxt("mlo_co2.txt", comments="#", usecols=(1,2), dtype=int, unpack=True)[:,4:]  #skiprows=4, 

In [3]:

print co2.shape, yearmonth.shape

(464,) (2, 464)
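To see what `usecols` and `unpack` do without touching the real file, here is a minimal sketch on a made-up three-row table. The values and the in-memory `StringIO` "file" are illustrative assumptions, not the Mauna Loa data.

```python
import io
import numpy as np

# Hypothetical miniature file: site, year, month, value (made-up numbers)
text = u"""# comment line, skipped because of comments='#'
MLO 1974 5 333.1
MLO 1974 6 332.0
MLO 1974 7 330.9
"""

# usecols picks columns by zero-based index; column 3 holds the value
values = np.loadtxt(io.StringIO(text), comments="#", usecols=(3,))

# unpack=True transposes the result: one row per requested column
ym = np.loadtxt(io.StringIO(text), comments="#", usecols=(1, 2),
                dtype=int, unpack=True)

print(values.shape, ym.shape)
```

With `unpack=True` the year and month arrive as `ym[0]` and `ym[1]`, which is why the real `yearmonth` array has shape (2, 464) rather than (464, 2).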

Now we can plot the CO2 data. Before we do that, we convert the year and month columns into datetime format, so that we can label the x-axis with dates.

In [4]:

import datetime as DT
# the file only gives year and month, so assume the first day of each month
days = np.ones_like(co2, dtype=int)

dates = []

for i in np.arange(len(days)):
    dates.append(DT.datetime(yearmonth[0][i], yearmonth[1][i], days[i]))
    
print "Mauna Loa CO2 series starts from %s, and ends at %s"%(dates[0], dates[-1])

Mauna Loa CO2 series starts from 1974-05-01 00:00:00, and ends at 2012-12-01 00:00:00

In [5]:

plt.figure(figsize=(20,5)) #make figure wider
plt.plot(dates,co2)

Out[5]:

[<matplotlib.lines.Line2D at 0x53b7730>]

First, we can fit a straight line to our CO2 data using scipy.stats (http://docs.scipy.org/doc/scipy/reference/stats.html).

In [6]:

import scipy.stats as stats

index = np.arange(len(co2))

co2_lin = stats.linregress(index, co2)
co2_predict = co2_lin[0] * index + co2_lin[1]

#figure(figsize=(20,5))
plot(dates, co2)
plot(dates, co2_predict)
title('Mauna Loa CO2 Concentration')
xlabel('Year')
ylabel('CO2 (ppm)')
figtext(0.15, 0.8, 'r = '+str(co2_lin[2]), size='x-large')
figtext(0.15, 0.75, 'intercept = '+str(co2_lin[1]), size='large')
figtext(0.15, 0.7, 'slope = '+str(co2_lin[0]), size='large')

Out[6]:

<matplotlib.text.Text at 0x5a41250>

It looks like the trend is not perfectly linear. This is more obvious when we plot the residuals:

In [7]:

#calculate residuals (observed minus fitted straight line)
co2_lin_resid = co2 - co2_predict
figure(figsize=(20,5))
plot(dates, co2_lin_resid)

Out[7]:

[<matplotlib.lines.Line2D at 0x5a157f0>]
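One way to check the impression that the trend is curved is to compare a straight-line fit with a quadratic fit. The sketch below does this on a made-up series (an accelerating trend plus a 12-month cycle; all coefficients are assumptions chosen for illustration), using `np.polyfit` rather than `stats.linregress`.

```python
import numpy as np

# Made-up series: slightly accelerating trend plus a 12-month cycle
t = np.arange(240, dtype=float)
y = 330.0 + 0.10 * t + 0.0002 * t**2 + 3.0 * np.sin(2 * np.pi * t / 12)

def rms_residual(degree):
    """Fit a polynomial of the given degree and return the RMS residual."""
    coeffs = np.polyfit(t, y, degree)
    return np.sqrt(np.mean((y - np.polyval(coeffs, t)) ** 2))

rms_lin = rms_residual(1)
rms_quad = rms_residual(2)
print("RMS residual: linear %.3f, quadratic %.3f" % (rms_lin, rms_quad))
```

The same comparison can be tried on the real data by replacing `t` and `y` with `index` and `co2`.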

2. New libraries

A new library we will use today is statsmodels (https://pypi.python.org/pypi/statsmodels). Statsmodels provides a range of statistical and time series tools, including OLS, AR, MA, ARMA, ARIMA, FFT, etc. Some example uses can be found here: http://jarrodmillman.com/scipy2011/pdfs/statsmodels.pdf

In [61]:

import statsmodels.api as sm
tsa = sm.tsa
import statsmodels.formula.api as smf
index_const = sm.add_constant(index)
co2_model = smf.GLM(co2, index_const).fit()
print co2_model.summary()
co2_pr = co2_model.predict()
#plot(dates, co2)
#plot(dates, co2_pr)

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                  464
Model:                            GLM   Df Residuals:                      462
Model Family:                Gaussian   Df Model:                            1
Link Function:               identity   Scale:                   6.35942699131
Method:                          IRLS   Log-Likelihood:                -1086.6
Date:                Thu, 12 Dec 2013   Deviance:                       2938.1
Time:                        21:25:06   Pearson chi2:                 2.94e+03
No. Iterations:                     3                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        327.6770      0.234   1401.740      0.000       327.219   328.135
x1             0.1384      0.001    158.347      0.000         0.137     0.140
==============================================================================
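Because `index` counts months, the `x1` coefficient in the summary above is a rate in ppm per month. A small worked conversion, using only the two coefficients printed in the table (rounded as shown there):

```python
# Coefficients copied from the GLM summary above
intercept = 327.6770        # const: fitted value at index 0
slope_per_month = 0.1384    # x1: ppm per month

# Convert the monthly rate to an annual rate
slope_per_year = slope_per_month * 12

# Fitted value of the straight line at the last index (464 points -> index 463)
fitted_last = intercept + slope_per_month * 463

print("trend: %.4f ppm per year" % slope_per_year)
print("fitted value at the last month: %.2f ppm" % fitted_last)
```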

2.1 AR & MA

Now, let's explore this series with classical time series analysis methods.
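Before fitting anything, it can help to see what AR(1) and MA(1) processes look like when simulated. The following is a minimal numpy sketch; the coefficients (0.8), the seed and the helper `lag_corr` are assumptions made for illustration and are not part of statsmodels.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the run is repeatable
n = 500
e = rng.normal(size=n)           # white-noise innovations

phi, theta = 0.8, 0.8            # illustrative coefficients (assumptions)

# AR(1): x[t] = phi * x[t-1] + e[t]
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = phi * ar1[t - 1] + e[t]

# MA(1): x[t] = e[t] + theta * e[t-1]
ma1 = e.copy()
ma1[1:] += theta * e[:-1]

def lag_corr(x, k):
    """Sample correlation between x[t] and x[t-k]."""
    return np.corrcoef(x[k:], x[:-k])[0, 1]

print("AR(1) lag 1, 2:", lag_corr(ar1, 1), lag_corr(ar1, 2))
print("MA(1) lag 1, 2:", lag_corr(ma1, 1), lag_corr(ma1, 2))
```

In theory the AR(1) correlation decays geometrically with lag, while the MA(1) correlation vanishes beyond lag 1; the simulated values should roughly follow that pattern.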

2.2 ARMA

In [9]:

import statsmodels.api as sm
tsa = sm.tsa
help(tsa.ARMA)

In [10]:

#Testing ARMA(1,1) for Original Series

arma_mod = tsa.ARMA(co2, order=(1,1))
arma_res = arma_mod.fit(trend='c', disp=-1)
arma_pred = arma_res.predict()

print arma_res.params
print "AIC = %f, BIC = %f"%(arma_res.aic,arma_res.bic)

plt.figure(figsize=(20,5))
plot(dates,arma_pred)
plot(dates,co2)

[ 362.86827751    0.99825572    0.69159428]
AIC = 1240.736430, BIC = 1257.295968

Out[10]:

[<matplotlib.lines.Line2D at 0x543e290>]

Take a look at the ARMA manual.

2.3 ARIMA

Now, let's define a function to test and plot different ARIMA scenarios:

In [18]:

from statsmodels.tsa.arima_model import ARIMA

def Plot_ARIMA(p,d,q):
    arima = ARIMA(co2, [p, d, q], exog=None, dates=dates, freq='M', missing='none')
    arima_results = arima.fit(trend='c', disp=False)
    co2_ARIMA = arima_results.predict(exog=None, dynamic=False)
    
    if d>0:
        co2_d = co2[d:]-co2[:len(co2)-d]
    else:
        co2_d = co2

    plt.figure(figsize=(20,5))
    plt.plot(dates[d:],co2_d)
    plt.plot(dates[d:],co2_ARIMA)
    return arima_results.aic

In [19]:

aic_001 = Plot_ARIMA(0,0,1)   #MA(1), OR ARMA(0,1)

This is essentially an MA(1) model; the same model can be fitted with tsa.ARMA in statsmodels using order=(0,1). Next, we can test the AR(1) model, the ARMA(1,1) model, and so on.

In [20]:

aic_100 = Plot_ARIMA(1,0,0)   #AR(1), or ARMA(1,0)

In [21]:

aic_101 = Plot_ARIMA(1,0,1)   # ARMA(1,1)

In [22]:

aic_001, aic_100, aic_101

Out[22]:

(3415.9859841534144, 1540.2555026390642, 1240.7364300383967)

What do these AIC values mean? Which model is better?
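AIC trades goodness of fit against the number of parameters, and among models fitted to the same series a lower value is preferred. As a small sketch, the three values from Out[22] can be ranked like this (the dictionary labels are just names chosen here):

```python
# AIC values copied from Out[22] above
aic = {"ARIMA(0,0,1)": 3415.9859841534144,
       "ARIMA(1,0,0)": 1540.2555026390642,
       "ARIMA(1,0,1)": 1240.7364300383967}

# The preferred model is the one with the smallest AIC
best = min(aic, key=aic.get)

# List the models from best to worst with their distance to the best one
for name in sorted(aic, key=aic.get):
    print("%s  AIC = %.1f  delta = %.1f" % (name, aic[name], aic[name] - aic[best]))
```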

Try Plot_ARIMA(0,1,0), i.e. a random walk, and ARIMA(1,0,1). What happened, and why? Test a few more combinations of p, d, q.

2.4 ACF & PACF

Now, let's test the co2 series for autocorrelation and partial autocorrelation.

In [23]:

acf, ci, Q, pvalue = tsa.acf(co2, nlags=24, alpha=0.05, qstat=True, unbiased=True)
subplot(211)
title('ACF')
plot(acf,'b^')

pacf, cip = tsa.pacf(co2, nlags=24, alpha=0.05)
subplot(212)
title('PACF')
plot(pacf,'b^')
vlines(np.arange(len(pacf)),[-0.4],pacf,'b')
#tight_layout()

Out[23]:

<matplotlib.collections.LineCollection at 0x7e3ae50>

A sharp cut-off in the PACF and a gradual decay in the ACF are observed here. This pattern points towards an AR model rather than an MA model. However, keep in mind that we haven't done any preprocessing yet (e.g. differencing, transformation).
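To make the "gradual decay" concrete, here is a small numpy sketch of the sample autocorrelation applied to a simulated AR(1) series. The function `sample_acf` is a hypothetical helper written for this example (it uses the biased estimator, unlike the `unbiased=True` call above), and the coefficient 0.9 is an assumption.

```python
import numpy as np

def sample_acf(x, nlags):
    """Biased sample autocorrelation for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[k:], x[:len(x) - k]) / denom
                     for k in range(nlags + 1)])

# Illustrative AR(1) series: x[t] = 0.9 * x[t-1] + noise
rng = np.random.RandomState(1)
x = np.zeros(1000)
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + rng.normal()

acf = sample_acf(x, 5)
print(np.round(acf, 2))
```

The printed values should fall off slowly from 1 at lag 0, which is the ACF signature of an autoregressive process.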

3. Differencing & Transformation

The previous correlograms indicated strong autocorrelation. Now try to difference our time series at lag 1 (using the numpy.roll function) and see what happens.

In [24]:

# diff by 1 month
# Differencing by one month forces us to drop the first (or last) 1 value. 
df1_co2 = (co2 - np.roll(co2,1))[1:]
plt.figure(figsize=(20,5))
plt.plot(dates[1:],df1_co2)
#df1_log_co2 = (log_co2 - np.roll(log_co2,1))[1:]
#plt.plot(df1_log_co2)

Out[24]:

[<matplotlib.lines.Line2D at 0x7ce5990>]
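The `np.roll` idiom in the cell above shifts the array by one place and wraps the last value round to the front, so the first element of the difference is meaningless and is dropped. A tiny sketch on made-up numbers shows that this is the same as a slice difference or `np.diff`:

```python
import numpy as np

x = np.array([3.0, 5.0, 4.0, 8.0])   # made-up values

rolled = (x - np.roll(x, 1))[1:]     # approach used in the cell above
sliced = x[1:] - x[:-1]              # same thing with plain slices
diffed = np.diff(x)                  # numpy's built-in first difference

print(rolled, sliced, diffed)
```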

In [25]:

acf, ci, Q, pvalue = tsa.acf(df1_co2, nlags=24, alpha=0.05, qstat=True, unbiased=True)
subplot(211)
plot(acf,'b^')

pacf, cip = tsa.pacf(df1_co2, nlags=24, alpha=0.05)
subplot(212)
plot(pacf,'b^')
vlines(np.arange(len(pacf)),[-0.6],pacf,'b')

Out[25]:

<matplotlib.collections.LineCollection at 0x80ceff0>

Now it seems that the trend component has been removed through differencing. What is the middle spike in our ACF correlogram?

Now, let's try to take the seasonality into account. Can you understand the following code?

In [29]:

# Differencing by the 12 months forces us to drop the first 12 values. 
df12_co2 = (co2 - np.roll(co2,12))[12:]
df12_dates = dates[12:]

p = 1
d = 0
q = 1

arima = ARIMA(df12_co2, [p, d, q], exog=None, dates=df12_dates, freq='M', missing='none')
df12_arima_results = arima.fit(trend='c', disp=False)
predicted_df12_arima = df12_arima_results.predict(exog=None, dynamic=False)

predicted_df12_arima_co2 = np.roll(co2,12)[12+d:] + np.roll(df12_co2,d)[d:] + predicted_df12_arima

plt.figure(figsize=(20,5))
plt.plot(df12_dates,df12_co2) 
plt.plot(df12_dates,predicted_df12_arima) # predicted 

Out[29]:

[<matplotlib.lines.Line2D at 0x5a8f130>]
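The idea behind the lag-12 difference can be sketched on a made-up monthly series (a linear trend plus a fixed annual cycle; all numbers are assumptions). Subtracting the value from 12 months earlier cancels a repeating seasonal cycle and leaves the year-on-year change, and adding the earlier value back recovers the original series.

```python
import numpy as np

# Made-up monthly series: linear trend plus a fixed 12-month cycle
t = np.arange(48)
x = 330.0 + 0.1 * t + 3.0 * np.sin(2 * np.pi * t / 12)

lag = 12
d12 = x[lag:] - x[:-lag]             # same as (x - np.roll(x, lag))[lag:]

# The fixed cycle cancels, leaving the 12-month trend increment (0.1 * 12)
print(d12.min(), d12.max())

# Undo the differencing: x[t] = x[t-12] + d12[t]
x_back = x[:-lag] + d12
```

The same inversion is what is needed to turn predictions of the differenced series back into CO2 concentrations.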

In [30]:

df12_arima_results.summary(alpha=0.05)

Out[30]:

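                              ARMA Model Results                              
==============================================================================
Dep. Variable:                      y   No. Observations:                  452
Model:                     ARMA(1, 1)   Log Likelihood                -213.130
Method:                       css-mle   S.D. of innovations              0.387
Date:                Thu, 12 Dec 2013   AIC                            434.260
Time:                        21:04:20   BIC                            450.715
Sample:                    05-01-1975   HQIC                           440.745
                         - 12-01-2012                                         
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          1.6828      0.118     14.273      0.000         1.452     1.914
ar.L1.y        0.8990      0.027     33.706      0.000         0.847     0.951
ma.L1.y       -0.3345      0.061     -5.481      0.000        -0.454    -0.215
                                    Roots                                     
==============================================================================
                 Real           Imaginary           Modulus          Frequency
------------------------------------------------------------------------------
AR.1           1.1123            +0.0000j            1.1123             0.0000
MA.1           2.9900            +0.0000j            2.9900             0.0000
------------------------------------------------------------------------------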
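How does this AIC value compare with the ones we produced earlier? Normally, we want to choose the model that produces the smallest AIC value, keeping in mind that AIC values are only directly comparable between models fitted to the same series. Modify the above code, and test the AICs with different models.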
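The AIC and BIC in the summary can be related to the log-likelihood through the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L. The sketch below plugs in the reported log-likelihood and sample size; the parameter count k = 4 (const, ar.L1, ma.L1 and the innovation variance) is an assumption about how the parameters are counted, and `aic_bic` is a helper defined here.

```python
import math

def aic_bic(loglik, k, n):
    """AIC = 2k - 2 lnL, BIC = k ln(n) - 2 lnL."""
    return 2 * k - 2 * loglik, k * math.log(n) - 2 * loglik

# Values from the summary above: log-likelihood -213.130, 452 observations.
# k = 4 is an assumption: const, ar.L1, ma.L1 and the innovation variance.
aic, bic = aic_bic(-213.130, 4, 452)
print("AIC = %.3f, BIC = %.3f" % (aic, bic))
```

Each extra parameter adds 2 to the AIC, so a more complex model is only preferred if it raises the log-likelihood by more than 1 per added parameter.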

Exercise 3.1 Log Transformation

Now, let's take a look at a different data series. Try to plot the data, and see whether we need to apply any transformation to it.

In [31]:

# International airline passengers: monthly totals in thousands. 
# from Jan 1949 to Dec 1960
air = [112,118,132,129,121,135,148,148,136,119,104,118,\
       115,126,141,135,125,149,170,170,158,133,114,140,\
    145,150,178,163,172,178,199,199,184,162,146,166,\
    171,180,193,181,183,218,230,242,209,191,172,194,\
    196,196,236,235,229,243,264,272,237,211,180,201,\
    204,188,235,227,234,264,302,293,259,229,203,229,\
    242,233,267,269,270,315,364,347,312,274,237,278,\
    284,277,317,313,318,374,413,405,355,306,271,306,\
    315,301,356,348,355,422,465,467,404,347,305,336,\
    340,318,362,348,363,435,491,505,404,359,310,337,\
    360,342,406,396,420,472,548,559,463,407,362,405,\
    417,391,419,461,472,535,622,606,508,461,390,432]

In [32]:

plot(air)

Out[32]:

[<matplotlib.lines.Line2D at 0x8b05cb0>]

Is this data series stationary? The variance seems to increase with the mean. How can we make this series stationary? By transforming it?

In [34]:

#log transformation
log_air = np.log(air)
subplot(211)
plot(log_air)
     
#1st order differencing 
df1_log_air = log_air[1:] -log_air[:-1]
subplot(212)
plot(df1_log_air)

Out[34]:

[<matplotlib.lines.Line2D at 0x8cb5790>]
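A quick numerical check of what the log transformation does to the spread: the sketch below compares the within-year standard deviation of the first and last years of the passenger series, before and after taking logs. The two rows are copied from the `air` list above; the variable names are chosen here.

```python
import numpy as np

# First (1949) and last (1960) years of the passenger series above
y1949 = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
                 dtype=float)
y1960 = np.array([417, 391, 419, 461, 472, 535, 622, 606, 508, 461, 390, 432],
                 dtype=float)

# How much larger is the seasonal spread in 1960 than in 1949?
ratio_raw = y1960.std() / y1949.std()
ratio_log = np.log(y1960).std() / np.log(y1949).std()

print("std ratio 1960/1949: raw %.2f, log %.2f" % (ratio_raw, ratio_log))
```

A ratio closer to 1 after the transformation indicates that the seasonal swing has become more nearly constant over time.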

How does the series look now? What do the ACF and PACF look like?

In [57]:

def plot_acf_pacf(data):
    acf, ci, Q, pvalue = tsa.acf(data, nlags=24, alpha=0.05, qstat=True, unbiased=True)
    subplot(211)
    plot(acf,'b^')
    
    pacf, cip = tsa.pacf(data, nlags=24, alpha=0.05)
    subplot(212)
    plot(pacf,'b^')
    vlines(np.arange(len(pacf)),[-0.6],pacf,'b')

plot_acf_pacf(df1_log_air)

In [45]:

# Differencing by the 12 months forces us to drop the first 12 values. 
df12_air = (df1_log_air - np.roll(df1_log_air,12))[12:]

p = 1
d = 0
q = 1

arima = ARIMA(df12_air, [p, d, q], exog=None, missing='none')
df12_arima_results = arima.fit(trend='c', disp=False)
predicted_df12_arima = df12_arima_results.predict(exog=None, dynamic=False)

predicted_df12_arima_air = np.roll(df12_air,d)[d:] + predicted_df12_arima

plt.figure(figsize=(20,5))
plt.plot(df12_air) 
plt.plot(predicted_df12_arima_air) # predicted 

df12_arima_results.summary()

Out[45]:

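                              ARMA Model Results                              
==============================================================================
Dep. Variable:                      y   No. Observations:                  131
Model:                     ARMA(1, 1)   Log Likelihood                 227.133
Method:                       css-mle   S.D. of innovations              0.043
Date:                Thu, 12 Dec 2013   AIC                           -446.266
Time:                        21:17:33   BIC                           -434.765
Sample:                             0   HQIC                          -441.593
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.0003      0.002      0.124      0.901        -0.004     0.004
ar.L1.y        0.1448      0.245      0.590      0.556        -0.336     0.626
ma.L1.y       -0.5190      0.218     -2.382      0.019        -0.946    -0.092
                                    Roots                                     
==============================================================================
                 Real           Imaginary           Modulus          Frequency
------------------------------------------------------------------------------
AR.1           6.9063            +0.0000j            6.9063             0.0000
MA.1           1.9268            +0.0000j            1.9268             0.0000
------------------------------------------------------------------------------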

4. Decomposition (Exercise)

From here, let's try to manually decompose a time series, the Australian Beer Production data. Copy the data from: https://github.com/qwu-hab/geogg121/blob/master/Beer.txt

In [52]:

import numpy as np
import pylab as plt

beer = np.genfromtxt("beer.txt",delimiter="\n")
year = np.arange(1,beer.shape[-1]+1)

#plt.plot(year,beer)

Plot the data. Do we need to transform the series, or should we difference it?

In [60]:

# example code to calculate the 12-month moving average
# (each value averages the current month and the following 11 months)
months = np.arange(12) 
beer_12mon_avg = np.zeros_like(beer)[:-len(months)]
for i in months:
    beer_12mon_avg += beer[i:(i-len(months))]
beer_12mon_avg /= len(months)
#plot(beer_12mon_avg)
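The loop in the cell above can also be written as a convolution. The sketch below compares the two on a made-up series (the data and variable names are assumptions); it also shows that the loop version stops one window short of what `np.convolve` returns with `mode="valid"`.

```python
import numpy as np

x = np.arange(30, dtype=float) ** 2      # made-up series
w = 12

# Loop version, as in the cell above: average of x[j], ..., x[j+11]
loop_avg = np.zeros(len(x) - w)
for i in range(w):
    loop_avg += x[i:i - w]
loop_avg /= w

# Vectorised version; 'valid' keeps only windows that fit entirely in x
conv_avg = np.convolve(x, np.ones(w) / w, mode="valid")

print(len(loop_avg), len(conv_avg))
```

Note that both versions average forward from each month, so the result is not centred on the month it is stored against; a centred moving average would need an offset of about six months.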

Is this an additive or a multiplicative model? Which of the following two detrending methods would work better?

In [55]:

beer_detrend_additive = beer[:-len(months)] - beer_12mon_avg
plot(beer_detrend_additive)

In [56]:

beer_detrend_multi = beer[:-len(months)] / beer_12mon_avg
plot(beer_detrend_multi)
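One way to tell the two cases apart is to look at whether the seasonal swing of the detrended series stays constant. The sketch below builds a made-up multiplicative series (trend times seasonal factor; every number is an assumption, and this is not the beer data) and detrends it both ways with the same forward 12-month average used above.

```python
import numpy as np

# Made-up multiplicative series: trend times a 12-month seasonal factor
t = np.arange(120)
trend = 100.0 + 2.0 * t
season = 1.0 + 0.2 * np.sin(2 * np.pi * t / 12)
x = trend * season

w = 12
# Forward 12-point average, trimmed to the same length as x[:-w]
avg = np.convolve(x, np.ones(w) / w, mode="valid")[:-1]
base = x[:-w]

additive = base - avg
multiplicative = base / avg

def swing(y):
    """Peak-to-peak range of the first and last 24 points."""
    return y[:24].max() - y[:24].min(), y[-24:].max() - y[-24:].min()

print("additive swing (start, end):", swing(additive))
print("multiplicative swing (start, end):", swing(multiplicative))
```

If the swing grows over time after subtraction but stays roughly constant after division, the multiplicative form is the better description of the series.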

Next, examine the residuals (irregular terms). Is there anything else we can do to make the residuals more random?

"To visualise how well the decompositions have performed we can compute the ACF of the two random components. If the random components are purely random processes then the ACF will show no significant autocorrelations"

5. Wavelet (Extra taste)

Example wavelet code from: http://www.phy.uct.ac.za/courses/python/examples/Wavelets.py

In [2]:

run Wavelets.py

C:\Python27\lib\site-packages\matplotlib\image.py:349: UserWarning: Images are not supported on non-linear axes.
  warnings.warn("Images are not supported on non-linear axes.")

Can you figure out how to read this scalogram? If so, try to modify the code and apply the wavelet scalogram to our time series data.

Hope we had fun playing with the data! Enjoy your Xmas!
