#!/usr/bin/env python
# coding: utf-8

# # Relationship between WTI Crude Oil and Henry Hub Natural Gas

# #### Author:
# Rajan Subramanian
# Commodity Quant Research Intern
# Energy/Metals Group

# #### Introduction

# This project examines the time series relationship between the Henry Hub natural gas price and the West Texas Intermediate (WTI) crude oil price. This relationship has traditionally been approached using correlation and deterministic trends. Correlation is based on asset returns, and any correlation analysis typically begins by detrending the price data. De-trending removes the long-term trend, so it is impossible to base any decision on common trends in prices.
#
# In the past, many studies have established that crude oil and gas prices are cointegrated. Yet recent studies have noted that the two price series have 'decoupled': the prices have appeared to move independently of each other. Natural gas prices are more volatile (roughly twice as volatile) than crude oil prices. According to recent literature, the cointegrating relationship does not appear stable over time: from 1989-2005 the two prices seemed to trend upwards together and be cointegrated, but after 2005 the two prices seem to have decoupled.
#
# Economic theory suggests that the two commodities are related: they tend to be substitutes in production and consumption. Past changes in crude oil prices drove changes in natural gas prices, but the converse did not occur. This is because the crude oil price is determined in a world market while the natural gas market is regionally segmented. The domestic natural gas market is much smaller than the global crude oil market and hence less likely to influence the global price of oil.
#
# The following economic factors link the natural gas and crude oil markets:
#
# 1) An increase in the crude oil price motivates consumers to substitute natural gas for petroleum products in consumption, which increases natural gas demand and hence prices.
# 2) An increase in the crude oil price resulting from an increase in crude oil demand may increase the amount of natural gas produced as a co-product of oil, which would decrease natural gas prices.
# 3) An increase in crude oil prices resulting from an increase in crude oil demand may lead to increased costs of natural gas production and development, putting upward pressure on natural gas prices.
# 4) An increase in crude oil prices resulting from an increase in crude oil demand may lead to more drilling and development of natural gas projects, which would tend to increase production and decrease natural gas prices.

# #### Purpose
# The purpose of this project is to better characterize the relationship between crude oil and natural gas. Specifically, we will:
#
# 1) Examine the time series properties of natural gas and crude oil.
# 2) Estimate a VAR model to identify the fundamental tie (the cointegrating equation). This is the long-run relationship that is generally re-established once the two prices move away from each other.
# 3) Determine the error correction mechanism. Whenever the natural gas price has been pulled away from the fundamental tie, the price will predictably drift back to it; this step estimates the rate at which the drift occurs.

# #### Data Munging and Gathering
# The data for daily crude oil and natural gas prices are obtained from the Bloomberg Terminal.
# Let's load the necessary libraries.

# In[1]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels.graphics.tsaplots as tsaplots
import statsmodels.api as sm
from pylab import rcParams

#some useful R import functions
import rpy2.robjects as robj
import pandas.rpy.common as com
from rpy2.robjects.packages import importr
import scipy.stats as sp

stats = importr('stats')
TSA = importr('TSA')
urca = importr('urca')
VARS = importr('vars')
get_ipython().run_line_magic('pylab', 'inline')
import stat2models as st2
get_ipython().run_line_magic('load_ext', 'rpy2.ipython')

# Now we will import the necessary data, which is stored in an Excel file. Before we plot, let's munge the data to make the analysis easier.

# In[2]:

#read data from excel file
natgas = pd.read_excel('commodity_prices.xlsx', sheetname='nat_gas', index_col=0)
wti = pd.read_excel('commodity_prices.xlsx', sheetname='wti', index_col=0)

# Plot of WTI Crude Oil and Natural Gas Raw Prices

# In[3]:

rcParams['figure.figsize'] = 12, 4
axe = wti.plot(color='black')
we = natgas.plot(secondary_y=True, style='orange', ax=axe)
we.legend(loc='best')
axe.set_ylabel('Price (Dollars per Barrel)')
axe.right_ax.set_ylabel('Price (Dollars per MMBtu)')
axe.set_title('Daily WTI/Nat Gas Prices')
axe.legend(loc='upper left')

# #### Analysis
# From the figure above, we can see that natural gas is more volatile than WTI. Just from looking at the graphs, the correlation between WTI and natural gas appears high and positive. Since the data consists of daily prices, there may be too much noise at the daily frequency; one option is to resample to monthly (e.g. taking the first price of each month for each commodity), which still contains enough information to capture short-run movements over time. (The resampling step is left commented out below, so the analysis that follows works with daily prices.) Also, a log transformation is applied to both WTI and natural gas. The log transformation is used because:
#
# 1) under Geometric Brownian Motion,
#
# $\log(S_t / S_0) \sim N(\mu t, \sigma^2 t)$,
#
# that is, it is the log returns that are normally distributed;
# 2) log returns are easier to work with, since an h-period log return is the sum of h consecutive one-period returns, which leads to the square-root-of-time rule; and
# 3) log transformations can remove scale effects and reduce possible effects of heteroskedasticity.
#
# Finally, as crude prices are generally higher than natural gas prices by at least a factor of 2, we will scale down the log price of WTI by subtracting the mean of the log price of natural gas. This allows direct comparison of the two prices.
# That is:
#
# $\log(WTI_t) = \log(WTI_t) - \mathrm{mean}(\log(NatGas_t))$

# In[4]:

#resampling to monthly frequencies
#wti = wti.resample('W-MON',how='mean')
#natgas = natgas.resample('W-MON',how='mean')

#subtracting mean of log(natgas price) from log(wti)
Gt_wti = np.log(wti['wti_price']) - np.mean(np.log(natgas['ng1_price']))
xt_wti = Gt_wti.diff().dropna()   #stores the log returns of wti

Gt_natgas = np.log(natgas['ng1_price'])
xt_natgas = Gt_natgas.diff().dropna()   #stores the log returns of nat gas

# In[5]:

#plotting the data
rcParams['figure.figsize'] = 12, 4
plt.plot(Gt_wti.index, Gt_wti.values, color='black', label='WTI Crude Oil log(Price)')
plt.plot(Gt_natgas.index, Gt_natgas.values, color='orange', label='Henry Hub Natural Gas log(Price)')
plt.legend(loc='best')
plt.ylabel('log price (Dollars)')
plt.xlabel('Date')
plt.title('Daily Nat Gas Log Prices & Mean-Adjusted log(WTI), 01/29/1998-5/20/2015')

# ###### Analysis
# From visual inspection, we can see that natural gas prices are more volatile than the crude oil price.
# The two commodity prices appear to move together from 1998 until mid-2008; after 2008 a decoupling effect is visible. The two series appear positively correlated, and the price of Henry Hub seems to lag the price of WTI crude oil. The two prices were trending upwards from the early 2000s, with a brief divergence around 2001: although WTI was relatively stable around this period, natural gas declined further. There is a steady increase in the crude oil price from 2003, while natural gas prices were more variable over the same period. Since 2011, the mean-adjusted log price of WTI stabilized around 4.5, whereas natural gas prices remained more variable. Both commodities show a steady decline in prices since mid-2014.
#
# WTI has been trending upwards since 1998 and its mean does not appear to be constant. This is an indication of a stochastic trend (a time series has a stochastic trend if it is non-stationary but its first difference is stationary). Hence, WTI is not stationary.
#
# Natural gas does not wander in any particular direction. There was a general upward trend from early 1999, but after 2005 the trend has been generally downwards.
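# To make the stochastic-trend definition above concrete, here is a minimal illustrative sketch (not part of the original analysis): we simulate a pure random walk, confirm the ADF test fails to reject a unit root on the level, and confirm the first difference is stationary.

# In[ ]:

from statsmodels.tsa.stattools import adfuller

np.random.seed(0)
eps = np.random.normal(size=2000)
rw = np.cumsum(eps)                 #random walk: non-stationary, has a stochastic trend

adf_level = adfuller(rw)            #ADF on the level
adf_diff = adfuller(np.diff(rw))    #ADF on the first difference

print 'level: stat = %.2f, p-value = %.2f' % (adf_level[0], adf_level[1])   #unit root not rejected
print 'diff : stat = %.2f, p-value = %.2f' % (adf_diff[0], adf_diff[1])     #unit root strongly rejected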
# #### Part 1: EDA - Time Series Properties of WTI Crude Oil/Henry Hub Natural Gas

# In[14]:

from scipy.stats import norm
(mu, sigma) = norm.fit(xt_wti.values)
n, bins, patches = plt.hist(xt_wti.values, 60, normed=1, color='darkkhaki', alpha=0.75, histtype='bar')
import matplotlib.mlab as mlab
y = mlab.normpdf(bins, mu, sigma)
l = plt.plot(bins, y, 'blue', linewidth=2, label='Normal Density')
sns.kdeplot(xt_wti.values, color='r', label='Empirical Density')
plt.ylabel('Density')
plt.title('WTI Daily Returns')

# In[13]:

(mu, sigma) = norm.fit(xt_natgas.values)
n, bins, patches = plt.hist(xt_natgas.values, 60, normed=1, color='darkkhaki', alpha=0.75, histtype='bar')
y = mlab.normpdf(bins, mu, sigma)
l = plt.plot(bins, y, 'blue', linewidth=2, label='Normal Density')
sns.kdeplot(xt_natgas.values, color='r', label='Empirical Density')
plt.ylabel('Density')
plt.title('NatGas Daily Returns')

# In[16]:

pd.rolling_corr(xt_natgas, xt_wti, window=255).plot(title='WTI/NatGas Return Rolling Yearly Correlation', color='firebrick')

# In[12]:

skew = pd.rolling_skew(xt_wti, window=255).dropna()
n, bins, patches = plt.hist(skew.values, bins=30, normed=1, color='cadetblue')
plt.title('Histogram of Yearly WTI Return Skews')

# In[15]:

skew.plot(title='Rolling Yearly Skewness of WTI Returns', color='firebrick')

# In[22]:

skew = pd.rolling_skew(xt_natgas, window=255).dropna()
n, bins, patches = plt.hist(skew.values, bins=30, normed=1, color='cadetblue')
plt.title('Histogram of Yearly NatGas Return Skews')

# In[23]:

skew.plot(title='Rolling Yearly Skewness of NatGas Returns', color='firebrick')

# #### a) ACF and PACF Plots of the Two Commodities
# Next we show, for each commodity, the original series with its ACF and the differenced series with its PACF.

# ### A) Crude Oil

# In[24]:

rcParams['figure.figsize'] = 13, 8
fig, axes = plt.subplots(2, 2)
axes[0,0].plot(Gt_wti.index, Gt_wti.values, color='black')
axes[0,0].set_title('log(wti price)')
tsaplots.plot_acf(Gt_wti.values, ax=axes[0,1], lags=24);
axes[1,0].plot(xt_wti.index, xt_wti.values, color='black')
axes[1,0].set_title('log(wti returns)')
tsaplots.plot_pacf(xt_wti.values, ax=axes[1,1], lags=24);
axes[0,1].set_title('ACF log(wti price)')
axes[1,1].set_title('PACF log(wti return)')

# ##### Analysis
# The top left figure shows the time series plot of log(WTI) at the daily frequency. From visual inspection, the series trends upwards, with a drastic drop over 2008-2009, after which the series reverted to trending upwards; since 2014, prices have been dropping again. The ACF of the log(WTI) series is plotted in the top right graph: it is high, positive, close to 1, and slowly decaying. This indicates that the root of an AR process is close to 1, so the series is clearly non-stationary.
#
# The plot of the first difference and the corresponding PACF is shown at the bottom. The differenced series looks stationary and wanders around a fixed level. The sample PACF drops to 0 very quickly, a property of a stationary series, similar to white noise. A low-order AR model looks like it would fit the differenced data well.
#
# Hence, the level of the series is non-stationary and differencing renders it stationary, i.e. the original series is I(1).
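# As a quick numerical companion to the visual ACF claim (an illustrative check, using the series defined above): the level ACF should sit near 1 even at long lags, while the return ACF should sit near 0.

# In[ ]:

from statsmodels.tsa.stattools import acf

#sample autocorrelations of the log price (level) vs the log returns (difference)
acf_level = acf(Gt_wti.values, nlags=24)
acf_ret = acf(xt_wti.values, nlags=24)

print 'level ACF at lags 1, 12, 24 : %.3f %.3f %.3f' % (acf_level[1], acf_level[12], acf_level[24])   #near 1, slow decay
print 'return ACF at lags 1, 12, 24: %.3f %.3f %.3f' % (acf_ret[1], acf_ret[12], acf_ret[24])         #near 0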
# Below are similar plots for the natural gas prices:

# ### B) Henry Hub Natural Gas

# In[25]:

rcParams['figure.figsize'] = 13, 8
fig, axes = plt.subplots(2, 2)
axes[0,0].plot(Gt_natgas.index, Gt_natgas.values, color='orange')
axes[0,0].set_title('log(NatGas price)')
tsaplots.plot_acf(Gt_natgas.values, ax=axes[0,1], lags=24);
axes[1,0].plot(xt_natgas.index, xt_natgas.values, color='orange')
axes[1,0].set_title('log(NatGas returns)')
tsaplots.plot_pacf(xt_natgas.values, ax=axes[1,1], lags=24);
axes[0,1].set_title('ACF log(NatGas price)')
axes[1,1].set_title('PACF log(NatGas return)')

# #### Analysis
# The top left figure shows the time series plot of log(Natural Gas) at the daily frequency. From visual inspection, there is seasonality in this data, with many repeating up-and-down patterns, and it is hard to identify any single trend: from 1998-2005 there is an upward trend, but after 2008 the trend has been generally downwards, with a drastic drop over 2008-2009 and prices dropping again since 2014. The ACF of the log(NatGas) series is plotted in the top right graph: it is high, positive, and decaying slowly, indicating long memory in the price series.
#
# The plot of the first difference and the corresponding PACF is shown at the bottom. The differenced series looks stationary and wanders around a fixed level. On the basis of the sample PACF, an AR model of roughly order 9 or 10 might fit the differenced data.
#
# Let's run some statistical tests to check whether the series are stationary.

# ### a) 1-sample t-test on mean of log(WTI returns)

# In[26]:

#run the one sample t-test to check if sample mean is 0
print("Log(WTI) Returns")
st2.print_ttest(xt_wti.values)

# ### b) 1-sample t-test on mean of log(Natural Gas returns)

# In[27]:

print('\n')
print("Log(NatGas) Returns")
#run the one sample t-test for nat gas
st2.print_ttest(xt_natgas.values)

# ### Observation
# Since the p-values are not significant, we fail to reject the null that the sample mean of the log returns is 0 for both series. We will therefore specify models with no constant:

# In[28]:

r_df_wti = com.convert_to_r_dataframe(pd.DataFrame(xt_wti))
r_df_ng = com.convert_to_r_dataframe(pd.DataFrame(xt_natgas))
get_ipython().run_line_magic('Rpush', 'r_df_wti')
get_ipython().run_line_magic('Rpush', 'r_df_ng')
get_ipython().run_line_magic('R', 'y_wti_ret = ts(r_df_wti);')
get_ipython().run_line_magic('R', 'y_ng_ret = ts(r_df_ng);')
get_ipython().run_line_magic('R', "cat('wti order select', ar(y_wti_ret,demean=FALSE)$order)")
get_ipython().run_line_magic('R', "cat('ng1 order select', ar(y_ng_ret,demean=FALSE)$order)")

# The automatic order selection picks an AR(6) for the WTI log returns and an AR(1) for the NatGas log returns, which does not exactly match the PACF-based reading. Below we fit the PACF-based orders, AR(9) for WTI and AR(10) for NatGas, and check the residuals of the models:

# In[29]:

rcParams['figure.figsize'] = 11, 8
fig, axes = plt.subplots(2, 1)
model_wti1 = sm.tsa.ARMA(xt_wti.values, order=(9,0)).fit(trend='nc')   #based on PACF criteria;
#AIC-based selection leaves residual correlation at lag 13
model_natgas = sm.tsa.ARMA(xt_natgas.values, order=(10,0)).fit(trend='nc')   #based on PACF only
tsaplots.plot_acf(model_wti1.resid, lags=24, ax=axes[0]);
tsaplots.plot_acf(model_natgas.resid, lags=24, ax=axes[1]);
axes[0].set_title('ACF log(WTI returns) residuals')
axes[1].set_title('ACF log(NatGas returns) residuals')

# ##### Analysis
# The two models for WTI and NatGas fit well: the ACFs of the residuals show no significant lags at the 5% level. For WTI there is a minor significant spike around lag 13, but this is to be expected; at the 5% level, roughly 1 in 20 sample autocorrelations will appear significant by chance alone.

# In[30]:

#ljung box test on the residuals
print "ljung box test on wti residuals \n"
st2.boxTest(model_wti1.resid)
print '\n'
print "ljung box test on nat gas residuals \n"
st2.boxTest(model_natgas.resid)

# #### ADF Test & Determining the Trend and Intercept for WTI and NatGas
# We perform two forms of the ADF test, motivated by visual observation of the data. The forms differ in the deterministic component of the regression:
#
# $\Delta x_t = \beta_0 + \phi_1 x_{t-1} + \sum\limits_{i=1}^{p}\gamma_i \Delta x_{t-i} + \epsilon_t$
#
# $\Delta x_t = \beta_0 + \beta_1 t + \phi_1 x_{t-1} + \sum\limits_{i=1}^{p}\gamma_i \Delta x_{t-i} + \epsilon_t$

# We follow Elder and Kennedy's methodology to determine whether there is a trend or intercept in the data.
#
# #### Growth of the series is not known:
# Assume our regression is of the form:
#
# $\Delta x_t = \beta_0 + \beta_1 t + \phi_1 x_{t-1} + \sum\limits_{i=1}^{p}\gamma_i \Delta x_{t-i} + a_t$
#
# Ignoring for the moment whether $\beta_1$ is 0, we conduct the ADF test for stationarity.
#
# If the data is not stationary, $\beta_1 = 0$ follows (with a unit root, a non-zero $\beta_1$ would imply a quadratic trend in the level, which we do not observe).
#
# Since there is a unit root, we then test for the presence of an intercept by regressing $\Delta x_t$ against the remaining variables on the right-hand side of the above equation.

# ##### ADF test

# In[31]:

st2.print_adfuller(Gt_wti.values, title='Log(WTI)-based on PACF', maxlag=9, regression='ct')   #based on PACF observation
print('\n')
st2.print_adfuller(Gt_wti.values, title='Log(WTI)-based on t-stat', regression='ct', autolag='t-stat')   #based on t-stat
print('\n')
st2.print_adfuller(Gt_wti.values, title='Log(WTI)-based on AIC', regression='ct', autolag='AIC')   #based on AIC
print('\n')
st2.print_adfuller(xt_wti.values, title='Log(WTI) Returns', maxlag=9, regression='nc')   #checking if returns are stationary
print('\n')
st2.print_adfuller(Gt_natgas.values, title='Log(NatGas) based on PACF', maxlag=10, regression='ct')   #based on PACF observation
print('\n')
st2.print_adfuller(Gt_natgas.values, title='Log(NatGas) based on t-stat', regression='ct', autolag='t-stat')
print('\n')
st2.print_adfuller(Gt_natgas.values, title='Log(NatGas) based on AIC', regression='ct', autolag='AIC')
print('\n')
st2.print_adfuller(xt_natgas.values, title='Log(NatGas) Returns', maxlag=10, regression='nc')

# ##### Analysis
# The presence of a unit root is not rejected, i.e. we fail to reject that the series is non-stationary. Our testing of stationarity is done, and hence $\beta_1 = 0$. Next, we test for the presence of an intercept by regressing the change in the series against an intercept:
#
# $\Delta x_t = \beta_0 + \sum\limits_{i=1}^{p}\gamma_i \Delta x_{t-i} + a_t$
#
# Notice that this is just a traditional AR model on the differences. Hence, we fit an AR(9) model for WTI and an AR(10) for natural gas to determine the t-statistic on the constant.
# We can run this regression because the log returns are stationary.

# In[32]:

m1 = st2.ar_ols(xt_wti.values, lags=9)
m2 = st2.ar_ols(xt_natgas.values, lags=10)
print 'wti_const = ', m1.params[0]
print 'wti_p value = ', m1.pvalues[0]
print '\n'
print 'natgas_const = ', m2.params[0]
print 'natgas_p value = ', m2.pvalues[0]

# Hence we accept that the drift is 0 and that neither series has a trend: both series are I(1) with no drift or trend terms.

# In[33]:

#finally, using the Phillips-Perron test to confirm the unit-root findings:
from arch.unitroot import PhillipsPerron
print "wti prices"
print(PhillipsPerron(Gt_wti, trend='nc', lags=9))
print '\n'
print 'nat gas prices'
print(PhillipsPerron(Gt_natgas, trend='nc', lags=10))

# The Phillips-Perron test also confirms that both original series are I(1).

# ## Part II: Co-Integration Analysis
# Two time series $X_t$ and $Y_t$ are cointegrated if each is I(1) and there exists a $\lambda$ such that
#
# $Z_t = X_t - \lambda Y_t$
#
# is stationary. The process $Z_t$ is called the disequilibrium because it captures deviations from the long-term equilibrium.
#
# Generally, price data is detrended before any analysis begins, so the long-term trend is removed and it is impossible to base any decision on common stochastic trends. In cointegration analysis, we instead first test whether there is a common stochastic trend in the variables.
#
# If there is a common stochastic trend in a set of prices, they must have a long-term equilibrium relationship. The second goal of cointegration analysis is to capture this equilibrium in a dynamic correlation analysis. Hence, we have a two-step procedure:
#
# a) The long-term equilibrium relationship between the prices is established. Statistical tests for cointegration are applied, and if cointegration is present, we identify the stationary linear combination of prices that best describes the long-term equilibrium relationship between them.
#
# b) The long-term equilibrium is used in an error correction model (ECM) of returns. ECMs explain how short-term deviations from equilibrium are corrected.

# ##### a) Establishing the Long-Term Relationship (Engle-Granger Methodology)
# From the very first figure at the top, the two series appear cointegrated. We regress log(nat gas price) against log(crude oil price):
#
# $\log(natgas_t) = \beta_0 + \beta_1 \log(wti_t) + \epsilon_t$
#
# This regression helps determine whether the two series are cointegrated because, after fitting the model, we can rearrange:
#
# $\log(natgas_t) - \beta_1 \log(wti_t) - \beta_0 = \epsilon_t$
#
# Hence $\beta_1$ is the parameter needed to form a stationary combination of the two series, and from the right-hand side of the equation, inspection of the residuals indicates whether that combination is stationary or not.
#
# Engle and Granger propose applying a unit root test to the residuals from the static regression above, where the unit root test regression on the residuals is run without the deterministic terms (constant and trend). According to Phillips and Ouliaris, the ADF and Phillips-Perron unit root tests applied to the residuals do not follow the usual Dickey-Fuller distribution. Due to the spurious regression phenomenon, the ADF and PP statistics have asymptotic distributions that are functionals of a Wiener process and depend on the deterministic terms and on the number of variables regressed on the dependent variable.
# Hence, we use the Phillips-Ouliaris test on the residuals to determine whether the two series are cointegrated. Following Hansen (1992), the Phillips-Ouliaris distributions of the ADF and PP tests applied to the residuals depend on the trend behavior as follows:
#
# 1) If the two series are both I(1) without drift, use a regression with just a constant, with dimension parameter n-1.
#
# 2) If the independent variable is I(1) with drift, while the dependent one may or may not be I(1) with drift, use a regression with constant plus trend in the PO test, with dimension parameter n-2. If n-2 = 0, use the DF distribution adjusted for constant and trend.
#
# 3) If the dependent variable is I(1) with drift and the independent variable is I(1) without drift, use constant plus trend in the regression and the PO test with dimension n-1.
#
# Since we already know that both our series are I(1) without drift, we use the first case and specify just a constant:

# In[34]:

prices = ['log(natgas)', 'log(wti)']   #empirically, wti causes nat gas
Gt_data = pd.DataFrame(np.column_stack([Gt_natgas.values, Gt_wti.values]), index=Gt_natgas.index, columns=prices)
xt_data = Gt_data.diff().dropna()
returns = ['log(natgas)return', 'log(wti)return']
xt_data.columns = returns
r_df = com.convert_to_r_dataframe(Gt_data)
get_ipython().run_line_magic('Rpush', 'r_df')
get_ipython().run_line_magic('R', 'y = ts(r_df)')
get_ipython().run_line_magic('R', "print(summary(ca.po(y,lag='long',type='Pz',demean='constant')))")

# According to the Phillips-Ouliaris test, the two series are not cointegrated: the PO test statistic of 24.5729 is below the 10% critical value of 44.8877, so we fail to reject the null of no cointegration, at least over this data period.

# Next we plot the residuals from our static regression along with their ACF and PACF:

# In[35]:

rcParams['figure.figsize'] = 14, 3
fig, axes = plt.subplots(1, 3)
n = len(Gt_natgas)
t = np.arange(0, n)
X = np.column_stack((np.ones(n), Gt_wti.values))
res = sm.OLS(Gt_natgas.values, X).fit()   #includes a constant: y(t) = a + b*x(t) + e_t
axes[0].plot(res.resid, color='firebrick')
axes[0].set_title('residual plot')
tsaplots.plot_acf(res.resid, lags=24, ax=axes[1]);
axes[1].set_title('ACF residuals')
tsaplots.plot_pacf(res.resid, lags=24, ax=axes[2]);
axes[2].set_title('PACF residuals')

# #### Analysis
# The graphs above show the residuals from the cointegrating regression (with a constant) together with their ACF and PACF. The ACF of the residuals declines slowly while the PACF drops off sharply. From these plots, the residual series does not look stationary.

# In[36]:

from statsmodels.stats.stattools import durbin_watson
print "Rsquared: ", res.rsquared
print "DW statistic: ", durbin_watson(res.resid)

# ### Analysis
# We observe a small DW statistic and a low R-squared, and the ACF plot of the residuals shows slowly declining correlation.
# The Phillips-Ouliaris test shows that the data is not cointegrated; hence the parameters obtained from the static regression are spurious (nonsense).
#
# This is, in essence, the Engle-Granger methodology:
#
# 1) Test the individual variables $X_t$ and $Y_t$ for unit roots.
#
# 2) Run the static cointegrating regression (see the Python sketch below):
#
# $Y_t = \beta_0 + \beta_1 X_t + \epsilon_t$
#
# 3) Test for no cointegration by testing for a unit root in the residuals $\epsilon_t$. The unit root regression for the residuals is run without deterministic terms (no constant, or constant and trend).
#
# 4) If no-cointegration is rejected, proceed to a dynamic ECM model.
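# As an illustrative aside (not part of the original R workflow): steps 2-3 of the Engle-Granger methodology can be sketched directly in Python. statsmodels ships a `coint` helper that runs the static regression and tests the residuals for a unit root using adjusted (MacKinnon) rather than standard Dickey-Fuller critical values. A minimal sketch, using the Gt_natgas/Gt_wti series defined above:

# In[ ]:

from statsmodels.tsa.stattools import coint

#Engle-Granger steps 2-3 in one call: static regression of log(natgas) on log(wti),
#then a unit root test on the residuals with adjusted critical values
t_stat, p_value, crit = coint(Gt_natgas.values, Gt_wti.values)
print 'EG residual t-stat = %.3f, p-value = %.3f' % (t_stat, p_value)
print 'critical values (1%, 5%, 10%):', crit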
# We apply the OLS regression of one integrated variable on another and apply the Engle-Granger test to the residuals. However, Phillips and Ouliaris show that ADF unit root tests applied to the estimated cointegrating residual do not have the usual Dickey-Fuller distribution under the null of no cointegration; they follow the Phillips-Ouliaris (PO) distributions. We can therefore confirm our results using the Phillips-Ouliaris cointegration test (an implementation of, and improvement over, the Engle-Granger methodology). Finally, before proceeding to these tests, we cover the Johansen procedure, another form of cointegration testing.

# #### Johansen Procedure
# The basic steps are the following (a Python-side sketch of this test appears just before the sub-period analysis below):
#
# 1) Specify and estimate a VAR(p) model for $Y_t$.
#
# 2) Construct likelihood ratio tests for the rank of the coefficient matrix to determine the number of cointegrating vectors.
#
# 3) Impose normalization and identifying restrictions on the cointegrating vectors.
#
# 4) Given the cointegrating vectors, estimate the cointegrated VECM by maximum likelihood.
#
# Again, there are special cases depending on whether the data has a restricted constant or a trend. Refer to Eric Zivot's chapter 12, Cointegration.

# In[37]:

get_ipython().run_line_magic('R', "print(summary(ca.jo(y,type='eigen',ecdet='const')))")
get_ipython().run_line_magic('R', "print(summary(ca.jo(y,type='trace',ecdet='const')))")

# #### Analysis
# There is no cointegration over 1998-2015. Note that one cannot apply the standard ADF critical values to the residuals; testing stationarity of the residuals follows the Phillips-Ouliaris distribution. The PO test above also has the benefit that the choice of dependent variable in the regression equation does not matter. We can clearly see that including a constant in our regression still leads to no cointegration. We conclude that, for the period 1998-2015, WTI crude and natural gas are not cointegrated.
#
# It appears that structural breaks in the data may affect the outcome of the test. From around 2010 there looks to be a structural break where the two prices have diverged. Perhaps more time needs to pass to see whether the relationship eventually re-cointegrates.
#
# Also, tests that apply ADF-type tests to the residuals are sensitive to near-unit roots. Recall that an AR(1),
#
# $x_t = \beta_0 + \beta_1 x_{t-1} + \epsilon_t$,
#
# is stationary if and only if $|\beta_1| < 1$. Subtracting $x_{t-1}$ from both sides gives
#
# $\Delta x_t = \beta_0 + \pi x_{t-1} + \epsilon_t$,
#
# where $\pi = \beta_1 - 1$. Hence, testing for a unit root is equivalent to testing $\pi = 0$. If the estimate is very close to 0 (say 0.000056), there is little power to distinguish it from 0. The bottom line is that if roots are close to unity, some tests may reject while others fail to reject.

# #### TimeSpan
# According to the literature, WTI crude oil and natural gas were cointegrated in the past. Looking at the figure, a possible relationship might have existed from 1998-2009. After 2009, the spread between the two commodities has increased.
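# Before narrowing to sub-periods, a brief aside: the Johansen rank test referenced above can also be sketched in Python. This is an assumption-laden alternative to R's `ca.jo`; it assumes a statsmodels version new enough to ship `coint_johansen` (0.9+), and its deterministic-term handling differs slightly from `ca.jo`'s `ecdet='const'`.

# In[ ]:

from statsmodels.tsa.vector_ar.vecm import coint_johansen

#det_order=0: include a constant; k_ar_diff=1: one lagged difference (comparable to ca.jo's default K=2)
jres = coint_johansen(Gt_data.values, det_order=0, k_ar_diff=1)

print 'trace statistics         :', jres.lr1   #tests r=0, then r<=1
print 'trace crit (90/95/99)    :', jres.cvt
print 'max-eig statistics       :', jres.lr2
print 'max-eig crit (90/95/99)  :', jres.cvm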
# Let's run our tests on these date ranges:
#
# 1998-2006
#
# 1998-01/2009
#
# 2009-current

# In[38]:

def filter_data(wti, natgas, start, end):
    #return the wti and natgas log prices corresponding to the start and end dates
    x = np.log(wti.ix[start:end, 'wti_price']) - np.mean(np.log(natgas.ix[start:end, 'ng1_price']))
    y = np.log(natgas.ix[start:end, 'ng1_price'])
    return x, y

# # Eye Test

# In[39]:

plt.plot(Gt_data.values)
plt.plot((3045,3045), (0,4), 'r-')
plt.plot((2620,2620), (0,4), color='orange')
plt.plot((2760,2760), (0,4), 'b-')
print(Gt_data.index[3045])
print(Gt_data.index[2620])
print(Gt_data.index[2760])

# The red line marks the end of the cointegrating relationship (found through trial and error).
# The blue line indicates the beginning of the decoupling effect.
# The orange line marks the beginning of the financial crisis.

# In[40]:

crude, ng1 = filter_data(wti, natgas, start='1998', end='03-04-2010')   #filter the time period. From 06-29-2009: no cointegration
fig, ax = plt.subplots(1, 2)
rcParams['figure.figsize'] = 12, 5
tsaplots.plot_pacf(crude.diff().dropna(), lags=24, ax=ax[0]);   #plot the pacf for the two commodities
tsaplots.plot_pacf(ng1.diff().dropna(), lags=24, ax=ax[1]);

# In[41]:

from arch.unitroot import ADF
#based on the Elder/Kennedy method to determine if the data has a drift + trend
print(ADF(crude.values, trend='ct', lags=9))   #based on pacf
print '\n'
print(ADF(ng1.values, trend='ct', lags=5))

# In[42]:

#we accept that the data is integrated. Hence beta_1 = 0. Now we regress the difference on just a constant:
m1 = st2.ar_ols(crude.diff().dropna(), 9)
m2 = st2.ar_ols(ng1.diff().dropna(), 5)
print 'wti_const = ', m1.params[0]
print 'wti_p value = ', m1.pvalues[0]
print '\n'
print 'natgas_const = ', m2.params[0]
print 'natgas_p value = ', m2.pvalues[0]

# In[43]:

#accept that there is no intercept. Next we run the Ljung-Box test to check the validity of the model:
st2.boxTest(m1.resid)

# In[44]:

st2.boxTest(m2.resid)

# Hence there is no trend and no constant in our model, and we accept that the data is integrated during this time period as well. The Ljung-Box tests on the residuals show no correlation at either the shorter or longer lags, so we accept that there is no correlation in the residuals. We confirm using the Phillips-Perron unit root test:

# In[45]:

print(PhillipsPerron(crude.values, lags=9, trend='nc'))
print '\n'
print(PhillipsPerron(ng1.values, lags=5, trend='nc'))

# In[46]:

#next we perform the Phillips-Ouliaris test and Johansen's test:
df_pre = pd.DataFrame(np.column_stack([ng1.values, crude.values]), index=ng1.index, columns=prices)
r_df = com.convert_to_r_dataframe(df_pre)
get_ipython().run_line_magic('Rpush', 'r_df')
get_ipython().run_line_magic('R', 'y.pre = ts(r_df)')
get_ipython().run_line_magic('R', "print(summary(ca.po(y.pre,lag='long',type='Pz',demean='constant'))) #phillips ouliaris test")

# The Phillips-Ouliaris test gives a test statistic of 49.1058, against 10% and 5% critical values of 47.5877 and 55.2202 respectively. The PO test therefore rejects the null of no cointegration at the 10% level, so we can be 90% confident that the two series were cointegrated in this time period; however, the test fails at the 5% and 1% levels.

# In[47]:

#Johansen test:
get_ipython().run_line_magic('R', "print(summary(ca.jo(y.pre,type='eigen',ecdet='const')))")
get_ipython().run_line_magic('R', "print(summary(ca.jo(y.pre,type='trace',ecdet='const')))")

# #### Analysis
# Johansen's maximal eigenvalue test gives a statistic of 9.41 and hence fails to reject no cointegration, and the trace test likewise fails to support cointegration.
# It is quite possible that cointegration existed during this period, but if so it was weak: the PO test rejects no-cointegration only at the 10% level, while the Johansen tests do not reject it at all.

# ## 04/2010-05/20/2015

# In[48]:

crude, ng1 = filter_data(wti, natgas, start='04-2010', end='05-20-2015')   #filter the time period
fig, ax = plt.subplots(1, 2)
rcParams['figure.figsize'] = 12, 4
tsaplots.plot_pacf(crude.diff().dropna(), lags=24, ax=ax[0]);   #plot the pacf for the two commodities
tsaplots.plot_pacf(ng1.diff().dropna(), lags=24, ax=ax[1]);

# In[49]:

print(ADF(crude.values, trend='ct', lags=1))   #based on pacf
print '\n'
print(ADF(ng1.values, trend='ct', lags=1))

# In[50]:

#hence we accept that the two time series are integrated. Next we specify the model with a constant:
m1 = st2.ar_ols(crude.diff().dropna(), 1)
m2 = st2.ar_ols(ng1.diff().dropna(), 1)
print 'wti_const = ', m1.params[0]
print 'wti_p value = ', m1.pvalues[0]
print '\n'
print 'natgas_const = ', m2.params[0]
print 'natgas_p value = ', m2.pvalues[0]

# In[51]:

#again, no constant and no trend in our data. Next we perform the Phillips-Perron unit root test:
print(PhillipsPerron(crude.values, trend='nc'))
print '\n'
print(PhillipsPerron(ng1.values, trend='nc'))

# In[52]:

st2.boxTest(m1.resid)

# In[53]:

st2.boxTest(m2.resid)

# In[54]:

#perform the Phillips-Ouliaris test and Johansen test
df_post = pd.DataFrame(np.column_stack([ng1.values, crude.values]), index=ng1.index, columns=prices)
r_df = com.convert_to_r_dataframe(df_post)
get_ipython().run_line_magic('Rpush', 'r_df')
get_ipython().run_line_magic('R', 'y.post = ts(r_df)')
get_ipython().run_line_magic('R', "print(summary(ca.po(y.post,lag='long',type='Pz',demean='constant'))) #phillips ouliaris test")

# #### Observation
#
# The PO test fails to reject the null of no cointegration, i.e. the data is not cointegrated during this time period. Next we perform the Johansen test:

# In[55]:

#Johansen test:
get_ipython().run_line_magic('R', "print(summary(ca.jo(y.post,type='eigen',ecdet='const')))")
get_ipython().run_line_magic('R', "print(summary(ca.jo(y.post,type='trace',ecdet='const')))")

# #### Observation
# After 03-2010, the Johansen test also fails to reject no cointegration. All the tests conclude that there is no longer a cointegrating relationship: a decoupling effect has taken over since 06-29-2009.

# ## Vector Autoregression (VAR)
# VAR models help characterize the joint behavior of a collection of variables and are generally used for structural inference and policy analysis. In structural analysis, certain assumptions about the causal structure of the data under investigation are imposed, and the resulting causal impacts of unexpected shocks or innovations to specified variables on the variables in the model are summarized. These causal impacts are usually summarized with impulse response functions and forecast error variance decompositions.
#
# Since we have two time series, we will use a VAR(p) model of the form:
#
# $X_t = \beta_{x,0} + \beta_{x,x_1}X_{t-1} + \beta_{x,x_2}X_{t-2} + \dots + \beta_{x,x_p}X_{t-p} + \beta_{x,y_1}Y_{t-1} + \beta_{x,y_2}Y_{t-2} + \dots + \beta_{x,y_p}Y_{t-p} + v_t^{x}$
#
# $Y_t = \beta_{y,0} + \beta_{y,y_1}Y_{t-1} + \beta_{y,y_2}Y_{t-2} + \dots + \beta_{y,y_p}Y_{t-p} + \beta_{y,x_1}X_{t-1} + \beta_{y,x_2}X_{t-2} + \dots + \beta_{y,x_p}X_{t-p} + v_t^{y}$
# where $\beta_{x,y_p}$ is the coefficient on $Y_{t-p}$ in the equation for $X_t$,
# and $\beta_{y,x_p}$ is the coefficient on $X_{t-p}$ in the equation for $Y_t$.

# #### Granger Causality
# If lagged values of one return series $X_t$ help predict current and future values of another variable $Y_t$ better than lags of $Y_t$ alone, we say that $X_t$ Granger causes $Y_t$. This is the lead-lag relationship between the two returns.
#
# To test for Granger causality from one variable to another, we test the joint significance of all lagged-$Y$ terms in the equation for $X$:
#
# $Y$ Granger causes $X$ if and only if $H_0: \beta_{x,y_1} = \beta_{x,y_2} = \dots = \beta_{x,y_p} = 0$ is rejected.

# #### Toda and Yamamoto Procedure to Determine Granger Causality
# 1) Test each of the time series to determine their order of integration.
#
# 2) Let the maximum order of integration for the group of time series be m.
#
# 3) Set up a VAR model in the levels (not the differences) of the data, regardless of the orders of integration of the various time series.
#
# 4) Determine the appropriate maximum lag length for the variables in the VAR, say p, using the usual methods.
#
# 5) Make sure that the VAR is well specified.
#
# 6) If two or more of the time series have the same order of integration at step 1, test whether they are cointegrated.
#
# 7) Whatever you conclude about cointegration at step 6, it does not affect what follows; it just provides a possible cross-check on the validity of your results at the very end of the analysis.
#
# 8) Take the preferred VAR model and add m additional lags of each of the variables into each of the equations.
#
# 9) Test for Granger non-causality as follows. For expository purposes, suppose the VAR has two equations, one for X and one for Y. Test the hypothesis that the coefficients of (only) the first p lagged values of X are zero in the Y equation, using a standard Wald test. Then do the same for the coefficients of the lagged values of Y in the X equation. (A plain-Granger-test sketch in Python appears below.)
#
# 10) Do not include the coefficients of the 'extra' m lags when performing the Wald tests.
#
# 11) The Wald test statistics are asymptotically chi-square distributed with p degrees of freedom under the null.
#
# 12) Rejection of the null implies rejection of Granger non-causality. Finally, look back at what you concluded at step 6 about cointegration.

# #### Analysis
# We know from the earlier ADF tests that each of the time series is I(1), so the maximum order of integration is m = 1. We also note that there is a structural break around 2009-2010, and causality tests are very sensitive to structural breaks in data. Hence we split the sample, considering the post-break period 04/2010-05/2015 and the pre-break period 1998-03/2010.

# #### 04/2010 - Current

# In[60]:

#plot of the data
fig, axes = plt.subplots(1, 2)
df_post['log(wti)'].plot(title='log(WTI)', ax=axes[0], color='black')
df_post['log(natgas)'].plot(title='log(NatGas)', ax=axes[1], color='orange')

# #### Analysis
# The spread between the two series has definitely increased compared to the past. I chose 04-2010 as the starting point to test Granger causality because that is when the relationship between the two series appears to decouple.
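# As an aside, the plain (non-augmented) Granger test of step 9 can be sketched in Python with statsmodels' VAR API. This is a hypothetical illustration, not the Toda-Yamamoto version run in R below: `test_causality` tests all p lags of the fitted VAR, rather than excluding the extra m lags.

# In[ ]:

from statsmodels.tsa.api import VAR

#fit a levels VAR(2) on the post-break data and run Wald-type causality tests
var_res = VAR(df_post).fit(2)
print var_res.test_causality('log(natgas)', ['log(wti)'], kind='wald')   #H0: wti does not Granger-cause natgas
print var_res.test_causality('log(wti)', ['log(natgas)'], kind='wald')   #H0: natgas does not Granger-cause wti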
# In[61]:

get_ipython().run_line_magic('R', "print(VARselect(y.post,lag=20,type='const')$selection) #step 4) determining maximum lag length")

# In[62]:

#either lag 1 or 2
#step 3) specifying a VAR model on the levels
#step 5) making sure the VAR model is well specified:
get_ipython().run_line_magic('R', "v_1 <- VAR(y.post,p=1,type='const')")
get_ipython().run_line_magic('R', "v_2 <- VAR(y.post,p=2,type='const') #lag 2")
get_ipython().run_line_magic('R', "print(serial.test(v_1,lags.bg=16,type='BG'))")
get_ipython().run_line_magic('R', "print(serial.test(v_2,lags.bg=16,type='BG'))")

# #### Analysis
# Both the lag-1 and lag-2 models show no serial correlation in the residuals, as confirmed by the Breusch-Godfrey Lagrange multiplier test with p-values of 0.3577 and 0.6197. Based on the AIC criterion, I choose the lag-2 model:

# In[65]:

get_ipython().run_line_magic('R', 'print(coef(v_2))')

# In[66]:

#stability analysis
print(robj.r('1/roots(v_2)[[2]]'))   # > 1
print(robj.r('1/roots(v_2)[[1]]'))   # > 1
#alternative stability analysis:
#computes an empirical fluctuation process
get_ipython().run_line_magic('R', 'stability_v2 <- stability(v_2)')
get_ipython().run_line_magic('R', 'plot(stability_v2)')

# In[70]:

get_ipython().run_line_magic('R', "plot(stability(v_2,type='ME'))")

# In[97]:

strucchange = importr('strucchange')
get_ipython().run_line_magic('R', 'wt <- breakpoints(y.post[,2] ~ 1)')
get_ipython().run_line_magic('R', 'print(wt)')
get_ipython().run_line_magic('R', 'print(summary(wt))')

# In[98]:

robj.r('''ocus2 <- efp(y.post[,2]~1,type='OLS-CUSUM')''')
robj.r('''ocus1 <- efp(y.post[,1]~1,type='OLS-CUSUM')''')
get_ipython().run_line_magic('R', 'plot(ocus2)')

# In[99]:

get_ipython().run_line_magic('R', 'plot(ocus1)')

# In[105]:

rcParams['figure.figsize'] = 9, 4
get_ipython().run_line_magic('R', 'fm0 <- lm(y.post[,2] ~ 1)')
get_ipython().run_line_magic('R', 'fm1 <- lm(y.post[,2]~breakfactor(wt,breaks=4))')
fitted_fm0 = get_ipython().run_line_magic('R', 'fitted(fm0)')
fitted_fm1 = get_ipython().run_line_magic('R', 'fitted(fm1)')
df_post['log(wti)'].plot(color='black')
plt.plot(df_post.index, fitted_fm0, color='cadetblue')
plt.plot(df_post.index, fitted_fm1, color='orange')
fir = pd.to_datetime('2011-02-22')
sec = pd.to_datetime('2012-05-15')
third = pd.to_datetime('2013-06-13')
fourth = pd.to_datetime('2014-08-13')
plt.plot((fir,fir), (2.4,3.5), 'b--')
plt.plot((sec,sec), (2.4,3.5), 'b--')
plt.plot((third,third), (2.4,3.5), 'b--')
plt.plot((fourth,fourth), (2.4,3.5), 'b--')
plt.title('wti post')

# In[106]:

get_ipython().run_line_magic('R', 'wt <- breakpoints(y.pre[,2] ~ 1)')
get_ipython().run_line_magic('R', 'print(wt)')
get_ipython().run_line_magic('R', 'print(summary(wt))')

# In[107]:

df_pre.index[[456,1220,1676,2361]]

# In[111]:

get_ipython().run_line_magic('R', 'fm0 <- lm(y.pre[,2] ~ 1)')
get_ipython().run_line_magic('R', 'fm1 <- lm(y.pre[,2]~breakfactor(wt,breaks=4))')
fitted_fm0 = get_ipython().run_line_magic('R', 'fitted(fm0)')
fitted_fm1 = get_ipython().run_line_magic('R', 'fitted(fm1)')
df_pre['log(wti)'].plot(color='black')
plt.plot(df_pre.index, fitted_fm0, color='green')
plt.plot(df_pre.index, fitted_fm1, color='red')
fir = pd.to_datetime('1999-10-27')
sec = pd.to_datetime('2002-11-20')
third = pd.to_datetime('2004-09-21')
fourth = pd.to_datetime('2007-06-15')
plt.plot((fir,fir), (1.0,3.5), 'b--')
plt.plot((sec,sec), (1.0,3.5), 'b--')
plt.plot((third,third), (1.0,3.5), 'b--')
plt.plot((fourth,fourth), (1.0,3.5), 'b--')
plt.plot(df_pre.index, df_pre['log(natgas)'].values, color='orange')

# In[112]:

get_ipython().run_line_magic('R', 'wt_entire <- breakpoints(y[,2]~1)')
get_ipython().run_line_magic('R', 'print(wt_entire)')

# In[113]:

get_ipython().run_line_magic('R', 'print(summary(wt_entire))')

# In[114]:

Gt_data.index[[654,1634,2360,3214]]

# In[115]:

get_ipython().run_line_magic('R', 'fm0 <- lm(y[,2] ~ 1)')
get_ipython().run_line_magic('R', 'fm1 <- lm(y[,2]~breakfactor(wt_entire,breaks=4))')
fitted_fm0 = get_ipython().run_line_magic('R', 'fitted(fm0)')
fitted_fm1 = get_ipython().run_line_magic('R', 'fitted(fm1)')
Gt_data['log(wti)'].plot(color='black')
plt.plot(Gt_data.index, fitted_fm0, color='green')
plt.plot(Gt_data.index, fitted_fm1, color='red')
fir = pd.to_datetime('2000-08-15')
sec = pd.to_datetime('2004-07-22')
third = pd.to_datetime('2007-06-14')
fourth = pd.to_datetime('2010-11-02')
plt.plot((fir,fir), (1.0,3.5), 'b--')
plt.plot((sec,sec), (1.0,3.5), 'b--')
plt.plot((third,third), (1.0,3.5), 'b--')
plt.plot((fourth,fourth), (1.0,3.5), 'b--')
plt.plot(Gt_data.index, Gt_data['log(natgas)'].values, color='orange')

# ##### Analysis
# OLS-CUSUM is an empirical fluctuation process test: it plots the cumulative sums of the OLS residuals, which should fluctuate around zero under parameter stability, with crossings of the boundary lines indicating a structural change. The stability() calculation applies this idea to the VAR equations.
#
# We can see that the model looks stable, as verified by the plots and the stability calculation.

# In[116]:

#steps 6-7 have already been done before.
#step 8: adding in m=1 additional lag:
get_ipython().run_line_magic('R', "v_3 <- VAR(y.post,p=3,type='const') #lag 3")
get_ipython().run_line_magic('R', 'print(v_3$varresult)')

# In[117]:

#steps 9-12:
#Wald test for the first 2 lags.
#The test can be done directly with the VAR model, but selecting the correct
#coefficients is a little tricky.
aod = importr('aod')
#VAR model, lag=3 (the additional lag is not tested)
#Wald test (H0: log(WTI) does not Granger-cause log(NatGas))
get_ipython().run_line_magic('R', 'print(wald.test(b=coef(v_3$varresult[[1]]), Sigma=vcov(v_3$varresult[[1]]), Terms=c(2,4))) #tests the first 2 lags only')
# Failure to reject (X2=0.55; p=0.76)

# In[69]:

#Wald test (H0: log(NatGas) does not Granger-cause log(WTI))
get_ipython().run_line_magic('R', 'print(wald.test(b=coef(v_3$varresult[[2]]), Sigma=vcov(v_3$varresult[[2]]), Terms=c(1,3))) #tests the first 2 lags only')
#Failure to reject (X2=1.5, p=0.47)

# ##### Analysis/Conclusion for 2010-Current
# We fail to reject the null hypothesis that log(WTI) does not Granger-cause log(NatGas) (X2 = 0.55, p = 0.76), and we likewise fail to reject that log(NatGas) does not Granger-cause log(WTI) (X2 = 1.5, p = 0.47).
#
# Since there is no Granger causality in either direction, the two series cannot be cointegrated in this time span. This is consistent with our earlier cointegration tests, which found no cointegrating behavior in this period.

# ##### 1998 - 03/04/2010
# We already know that the two series are cointegrated in this time span, and if two time series are cointegrated, there must be at least one Granger-causal flow in the system.
# Hence, the Toda and Yamamoto procedure provides a cross-check confirming Granger causality in this time frame:

# In[119]:

get_ipython().run_line_magic('R', "print(VARselect(y.pre,lag=20,type='both')$selection) #step 4) determining maximum lag length")

# In[120]:

#either lag 1, 2 or 3
#step 3) specifying a VAR model on the levels
#step 5) making sure the VAR model is well specified:
get_ipython().run_line_magic('R', "v_1 <- VAR(y.pre,p=1,type='const') #lag 1")
get_ipython().run_line_magic('R', "v_2 <- VAR(y.pre,p=2,type='const') #lag 2")
get_ipython().run_line_magic('R', "v_3 <- VAR(y.pre,p=3,type='const') #lag 3")
get_ipython().run_line_magic('R', "print(serial.test(v_1,lags.bg=16,type='BG'))")
get_ipython().run_line_magic('R', "print(serial.test(v_2,lags.bg=16,type='BG'))")
get_ipython().run_line_magic('R', "print(serial.test(v_3,lags.bg=16,type='BG'))")

# ##### Analysis
# The Breusch-Godfrey LM test indicates residual correlation for the lag-1 model, with a p-value of 0.01209, so we reject the null of no correlation for that model. Only the lag-2 and lag-3 models lead to acceptance of the null of no serial correlation. Based on AIC, I choose the lag-3 model:

# In[121]:

#stability analysis
get_ipython().run_line_magic('R', 'print(1/roots(v_3)[[1]]) # > 1')
#alternative stability analysis:
#computes an empirical fluctuation process
get_ipython().run_line_magic('R', 'plot(stability(v_3))')

# In[122]:

#steps 6-7 have already been done before.
#step 8: adding in m=1 additional lag:
get_ipython().run_line_magic('R', "v_4 <- VAR(y.pre,p=4,type='const') #lag 4, but the test covers only up to lag 3")
get_ipython().run_line_magic('R', 'print(v_4$varresult)')

# In[123]:

#steps 9-12:
#Wald test for the first 3 lags.
#The test can be done directly with the VAR model, but selecting the correct
#coefficients is a little tricky.
#VAR model, lag=4 (the additional lag is not tested)
#Wald test (H0: log(WTI) does not Granger-cause log(NatGas))
get_ipython().run_line_magic('R', 'print(wald.test(b=coef(v_4$varresult[[1]]), Sigma=vcov(v_4$varresult[[1]]), Terms=c(2,4,6))) #tests the first 3 lags only')
# reject (X2=8.6; p=0.035)

# In[124]:

#Wald test (H0: log(NatGas) does not Granger-cause log(WTI))
get_ipython().run_line_magic('R', 'print(wald.test(b=coef(v_4$varresult[[2]]), Sigma=vcov(v_4$varresult[[2]]), Terms=c(1,3,5)))')
#Failure to reject (X2=3.0, p=0.39)

# ##### Analysis
# We can confirm that there is a Granger-causal flow in the system: specifically, WTI Granger-causes NatGas. The Wald test statistic has a p-value of 0.035, which rejects the null that all coefficients on the WTI regressors are 0. However, NatGas does not Granger-cause WTI; the p-value is high at 0.39, so we accept that log(NatGas) does not Granger-cause log(WTI). This provides a cross-check: since we found earlier that the two series are cointegrated in this time frame, there must be at least one Granger-causal flow in the system.
#
# Hence, log(WTI) Granger-causes log(NatGas).

# ##### Entire Data Generating Process
# Finally, we test for causality over the entire sample:

# In[125]:

get_ipython().run_line_magic('R', "print(VARselect(y,lag=20,type='both')$selection) #step 4) determining maximum lag length")

# In[126]:

#the selected lag length doesn't make much sense; there are too many structural breaks to treat the entire
#sample as one data generating process. Prices from 10 weeks ago cannot plausibly determine today's price.
# In[127]:

#either lag 1 or 2
#step 3) specifying a VAR model on the levels
#step 5) making sure the VAR model is well specified:
get_ipython().run_line_magic('R', "v_1 <- VAR(y,p=1,type='const') #lag 1")
get_ipython().run_line_magic('R', "v_2 <- VAR(y,p=2,type='const') #lag 2")
get_ipython().run_line_magic('R', "v_3 <- VAR(y,p=3,type='const') #lag 3")
get_ipython().run_line_magic('R', "print(serial.test(v_1,lags.bg=16,type='BG'))")
get_ipython().run_line_magic('R', "print(serial.test(v_2,type='BG',lags.bg=16))")
get_ipython().run_line_magic('R', "print(serial.test(v_3,type='BG',lags.bg=16))")

# #### Analysis
# The lag-1 model is inadequate, as the Breusch-Godfrey LM test rejects the null of no correlation. A lag-2 or lag-3 model might work, since the test accepts no correlation at those lags. I choose lag 3 as it has the lowest AIC:

# In[128]:

#stability analysis
get_ipython().run_line_magic('R', 'print(1/roots(v_3)[[1]]) # > 1')
get_ipython().run_line_magic('R', 'print(1/roots(v_3)[[2]]) # > 1')
#alternative stability analysis:
#computes an empirical fluctuation process
get_ipython().run_line_magic('R', 'plot(stability(v_3))')

# In[129]:

#steps 6-7 have already been done before.
#step 8: adding in m=1 additional lag:
get_ipython().run_line_magic('R', "v_4 <- VAR(y,p=4,type='const') #lag 4, but the test covers only up to lag 3")
get_ipython().run_line_magic('R', 'print(v_4$varresult)')

# In[130]:

#steps 9-12:
#Wald test for the first 3 lags.
#Wald test (H0: log(WTI) does not Granger-cause log(NatGas))
get_ipython().run_line_magic('R', 'print(wald.test(b=coef(v_4$varresult[[1]]),Sigma=vcov(v_4$varresult[[1]]),Terms=c(2,4,6)))')

# In[131]:

#Wald test (H0: log(NatGas) does not Granger-cause log(WTI))
get_ipython().run_line_magic('R', 'print(wald.test(b=coef(v_4$varresult[[2]]), Sigma=vcov(v_4$varresult[[2]]), Terms=c(1,3,5)))')
#Failure to reject (X2=3.5, p=0.32)

# ##### Analysis
# We conclude that there is no Granger causality over the full sample. The Wald test fails to reject that log(WTI) does not Granger-cause log(NatGas), with a p-value of 0.059, and log(NatGas) does not Granger-cause log(WTI), as the Wald test fails to reject no causality with a p-value of 0.32.

# ### Impulse Response Functions and Forecasting
# Having tested for Granger causality, we know that the two series are not cointegrated over the full sample. Hence, we fit a VAR model to the differences for IRF/forecasting purposes:

# In[132]:

#for post 2010 to current:
from statsmodels.tsa import vector_ar
model = vector_ar.var_model.VAR(df_post.diff().dropna()).fit(2)
impulse = model.irf(10)
impulse.plot(orth=True)

# #### Analysis
# The top row shows the responses of the change in log(natgas) to structural shocks, while the bottom row shows the responses of the change in log(wti). For the first shock, $v_t^y$, a unit shock in natgas causes natgas returns to increase initially, but the response drops quickly to 0 within a day. It also causes wti returns to increase initially by about 0.22% before decaying to 0 within 3 days. For the second shock, $v_t^x$, there is initially no response in the change in log(natgas) returns, which then builds over a period of 2 days, peaks on the second day, and quickly declines to 0 by the third day.
# Similarly, a bump in the change in log(wti) causes log(wti) to increase initially, but the response quickly drops to 0 within a day.

# In[133]:

model.fevd(20).plot()

# ## Hidden Markov Models and Regime Switching

# In[135]:

depmixS4 = importr('depmixS4')

# In[136]:

r_df = com.convert_to_r_dataframe(xt_data)
get_ipython().run_line_magic('Rpush', 'r_df')
get_ipython().run_line_magic('R', 'set.seed(1)')
robj.r('mod_1 <- depmix(log.wti.return~1,family=gaussian(),nstates=3,data=r_df)')
robj.r('fm2.wti <- fit(mod_1,verbose=FALSE)')
get_ipython().run_line_magic('R', 'print(summary(fm2.wti))')

# In[137]:

robj.r('mod_2 <- depmix(log.natgas.return~1,family=gaussian(),nstates=2,data=r_df)')
robj.r('fm2.ng1 <- fit(mod_2,verbose=FALSE)')
get_ipython().run_line_magic('R', 'print(summary(fm2.ng1))')

# In[138]:

robj.r('probs_wti <- posterior(fm2.wti)')   #compute the probability of being in each state
robj.r('probs_ng1 <- posterior(fm2.ng1)')
get_ipython().run_line_magic('R', 'print(head(probs_wti))')

# In[139]:

get_ipython().run_line_magic('R', 'print(head(probs_ng1))')

# In[140]:

regimes_wti = get_ipython().run_line_magic('R', 'probs_wti[,1]')
p_bear_wti = get_ipython().run_line_magic('R', 'probs_wti[,4]')
p_neutral_wti = get_ipython().run_line_magic('R', 'probs_wti[,3]')
p_bull_wti = get_ipython().run_line_magic('R', 'probs_wti[,2]')

# In[141]:

regimes_ng1 = get_ipython().run_line_magic('R', 'probs_ng1[,1]')
p_bull_ng1 = get_ipython().run_line_magic('R', 'probs_ng1[,2]')
p_bear_ng1 = get_ipython().run_line_magic('R', 'probs_ng1[,3]')

# In[149]:

fig, axes = plt.subplots(3, 1)
rcParams['figure.figsize'] = 12, 7
axes[0].plot(xt_data.index, p_bear_wti, color='firebrick')
axes[0].set_title('Bear Market probabilities')
axes[1].plot(xt_data.index, p_neutral_wti, color='steelblue')
axes[1].set_title('Neutral Market probabilities')
axes[2].plot(xt_data.index, p_bull_wti, color='seagreen')
axes[2].set_title('Bull Market probabilities')
fig.subplots_adjust(hspace=0.5)

# In[167]:

fig, axes = plt.subplots(3, 1)
rcParams['figure.figsize'] = 12, 8
axes[0].plot(Gt_data.index, Gt_natgas.values, color='orange')
axes[1].plot(xt_data.index, p_bear_ng1, color='firebrick')
axes[1].set_title('Bear Market probabilities')
axes[2].plot(xt_data.index, p_bull_ng1, color='seagreen')
axes[2].set_title('Bull Market probabilities')
fig.subplots_adjust(hspace=0.5)

# In[168]:

#treat both series as a multivariate response:
get_ipython().run_line_magic('R', 'set.seed(1)')
robj.r('mod_mv <- depmix(list(log.wti.return~1,log.natgas.return~1),family=list(gaussian(),gaussian()),nstates=3,data=r_df)')
robj.r('fm2.mv <- fit(mod_mv,verbose=FALSE)')
get_ipython().run_line_magic('R', 'print(summary(fm2.mv))')

# In[169]:

get_ipython().run_line_magic('R', 'print(fm2.mv)')

# In[170]:

robj.r('probs_mv <- posterior(fm2.mv)');

# In[171]:

get_ipython().run_line_magic('R', 'print(head(probs_mv))')

# In[172]:

regimes_mv = get_ipython().run_line_magic('R', 'probs_mv[,1]')
p_bear_mv = get_ipython().run_line_magic('R', 'probs_mv[,4]')
p_neutral_mv = get_ipython().run_line_magic('R', 'probs_mv[,3]')
p_bull_mv = get_ipython().run_line_magic('R', 'probs_mv[,2]')

# In[182]:

fig, axes = plt.subplots(4, 1)
rcParams['figure.figsize'] = 12, 15
axes[0].plot(Gt_data.index, Gt_wti.values, color='black')
axes[0].plot(Gt_data.index, Gt_natgas.values, color='orange')
axes[1].plot(xt_data.index, p_bear_mv, color='firebrick')
axes[1].set_title('Bear Market probabilities')
axes[2].plot(xt_data.index, p_neutral_mv, color='steelblue')
axes[2].set_title('Neutral Market probabilities')
axes[3].plot(xt_data.index, p_bull_mv, color='seagreen')
axes[3].set_title('Bull Market probabilities')
fig.subplots_adjust(hspace=0.5)

# In[ ]:

#takes too long to execute: roughly 15-20 minutes
robj.r('wt <- breakpoints(log.natgas.return ~ log.wti.return,data=r_df,h=1)')
get_ipython().run_line_magic('R', 'print(wt)')
get_ipython().run_line_magic('R', 'print(summary(wt))')

# In[126]:

xt_data.index[[758,1416,2076,2729]]

# In[129]:

rcParams['figure.figsize'] = 12, 3
#%R fm0 <- lm(y[,2] ~ 1)
get_ipython().run_line_magic('R', 'fm1 <- lm(r_df[,2]~breakfactor(wt,breaks=3))')
#fitted_fm0 = %R fitted(fm0)
fitted_fm1 = get_ipython().run_line_magic('R', 'fitted(fm1)')
Gt_data['log(wti)'].plot(color='maroon')
#plt.plot(Gt_data.index,fitted_fm0)
#plt.plot(Gt_data.index,fitted_fm1)
fir = pd.to_datetime('2001-01-17')
sec = pd.to_datetime('2003-09-05')
third = pd.to_datetime('2006-04-28')
fourth = pd.to_datetime('2008-12-01')
plt.plot((fir,fir), (1.0,3.5), 'b--')
plt.plot((sec,sec), (1.0,3.5), 'b--')
plt.plot((third,third), (1.0,3.5), 'b--')
plt.plot((fourth,fourth), (1.0,3.5), 'b--')
plt.plot(Gt_data.index, Gt_data['log(natgas)'].values, color='orange')
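# As a closing sketch tying back to step 3 of the Purpose: for the pre-break period, where the tests suggested (weak) cointegration, the error correction mechanism could be estimated along the following lines. This is a hypothetical two-step Engle-Granger ECM illustration under those assumptions, not a result reported above.

# In[ ]:

#step 1 (long-run tie): static regression log(natgas) = b0 + b1*log(wti) + z
X_pre = sm.add_constant(df_pre['log(wti)'].values)
long_run = sm.OLS(df_pre['log(natgas)'].values, X_pre).fit()
z = long_run.resid                                   #disequilibrium Z_t

#step 2 (error correction): regress natgas returns on the lagged disequilibrium
#and one lag of each return; alpha measures the speed of drift back to the tie
dng = np.diff(df_pre['log(natgas)'].values)
dwti = np.diff(df_pre['log(wti)'].values)
Z = np.column_stack((np.ones(len(dng) - 1), z[1:-1], dng[:-1], dwti[:-1]))
ecm = sm.OLS(dng[1:], Z).fit()
print 'error-correction coefficient alpha = %.4f (t = %.2f)' % (ecm.params[1], ecm.tvalues[1])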