ARIMA and Seasonal ARIMA¶

Autoregressive Integrated Moving Averages¶

일반적으로 다음 프로세스를 따라 ARIMA 모델을 만듭니다:

시계열 데이터 시각화
시계열 데이터를 stationary 특성화
Correlation 과 AutoCorrelation 챠트 플로팅
ARIMA 모델 생성
만들어진 모델로 예측 수행

순서대로 다뤄봅시다.

Step 1: 데이터 읽기와 포매팅¶

월별 우유 소비 데이터를 사용합니다. 데이터는 Data Market 에서 다운받아 저장해놓은 CSV 파일을 읽습니다.

In [1]:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:

df = pd.read_csv('data/monthly-milk-production.csv', index_col='Month', parse_dates=True)

In [3]:

df.head()

Out[3]:

	pounds per cow
Month
1962-01-01	589
1962-02-01	561
1962-03-01	640
1962-04-01	656
1962-05-01	727

Step 2: Visualize the Data¶

pandas 내장 플로팅 함수들로 시계열 데이터를 시각화 해봅니다.

In [4]:

timeseries = df['pounds per cow']

In [5]:

timeseries.rolling(12).mean().plot(label='12 Month Rolling Mean')
timeseries.plot()
plt.legend()

Out[5]:

<matplotlib.legend.Legend at 0x7f97a85de6a0>

In [6]:

timeseries.rolling(12).mean().plot(label='12 Month Rolling Mean')
timeseries.rolling(12).std().plot(label='12 Month Rolling Std')
timeseries.plot()
plt.legend()

Out[6]:

<matplotlib.legend.Legend at 0x7f97a8158860>

Decomposition¶

ETS decomposition 을 통해서 시계열 데이터를 구성하는 요소들을 확인할 수 있습니다.

In [7]:

from statsmodels.tsa.seasonal import seasonal_decompose

In [8]:

decomposition = seasonal_decompose(df['pounds per cow'])

In [9]:

fig = plt.figure()
fig = decomposition.plot()
fig.set_size_inches(15,8)

<Figure size 432x288 with 0 Axes>

Testing for Stationarity¶

시계열 데이터의 Stationarity 특성을 시험하기 위해 Augmented Dickey-Fuller unit root test 를 사용합니다.

Augmented Dicky-Fuller (ADF) test 는 시계열 데이터 샘플에 단위근(unit root)이 존재한다는 귀무가설을 시험하므로써 대립가인 stationarity 또는 trend-stationarity 여부를 확인합니다.

시계열 데이터에 unit root 가 있어 non-stationary 하다는 귀무가설 H0 를 accept 할지, 또는 이를 reject 하면서 시계열 데이터에 unit root 가 없고 stationary 하다는 대립가설을 채택할지를 실험하는 것입니다.

결국 결과값 p-value 로 결정하게 됩니다.

p-value 값이 작을 때 (≤ 0.05) 귀무가설을 reject 하고 데이터는 stationary 함
p-value 값이 클 (> 0.05) 귀무가설을 accept 하고 데이터는 non-stationary 함

가공한 데이터에 ADF 테스트를 해봅시다:

In [10]:

from statsmodels.tsa.stattools import adfuller

In [11]:

result = adfuller(df['pounds per cow'])

In [12]:

result

Out[12]:

(-1.3038115874221246,
 0.627426708603034,
 13,
 154,
 {'1%': -3.473542528196209,
  '5%': -2.880497674144038,
  '10%': -2.576878053634677},
 1115.1730447395112)

In [13]:

# 반복적인 사용을 위해 함수화 합니다.
def adf_check(ts):
    result = adfuller(ts)
    if result[1] <= 0.05:
        print('Stationary {}'.format(result[1]))
    else:
        print('Non-Stationary {}'.format(result[1]))

In [14]:

adf_check(df['pounds per cow'])

Non-Stationary 0.627426708603034

Differencing¶

시계열 데이터에 대한 1차(first) difference 는 다음 스탭으로의 한 차원(시간단위) 만큼의 변화값의 series 입니다. pandas 를 이용하면 아주 differencing 할 수 있습니다. 2차, 3차, 그보다 더 높은 차원의 differencing 을 시도해보고 stationary 한 차원을 찾습니다.

First Difference¶

In [15]:

df['1st diff'] = df['pounds per cow'] - df['pounds per cow'].shift(1)

In [16]:

df.head()

Out[16]:

	pounds per cow	1st diff
Month
1962-01-01	589	NaN
1962-02-01	561	-28.0
1962-03-01	640	79.0
1962-04-01	656	16.0
1962-05-01	727	71.0

In [17]:

adf_check(df['1st diff'].dropna())

Stationary 0.03006800400178688

In [18]:

df['1st diff'].plot()

Out[18]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f97a7c50a58>

Second Differencing¶

In [19]:

df['2nd diff'] = df['1st diff'] - df['1st diff'].shift(1)

In [20]:

adf_check(df['2nd diff'].dropna())

Stationary 1.1126989332083069e-26

In [21]:

df['2nd diff'].plot()

Out[21]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f97a7eb07b8>

Seasonal Difference¶

In [22]:

df['seasonal diff'] = df['pounds per cow'] - df['pounds per cow'].shift(12)

In [23]:

df['seasonal diff'].plot()

Out[23]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f97a7e2ca20>

In [24]:

adf_check(df['seasonal diff'].dropna())

Non-Stationary 0.16079880527711304

결과상 seasonality 만 아니라 trend 도 가지고 있으므로 1차 differencing 한 데이터에 대한 seasonal differencing 을 수행해봅니다.

Seasonal First Difference¶

In [25]:

df['seasonal 1st diff'] = df['1st diff'] - df['1st diff'].shift(12)

In [26]:

df['seasonal 1st diff'].plot()

Out[26]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f97a7db3908>

In [27]:

adf_check(df['seasonal 1st diff'].dropna())

Stationary 1.86542343187882e-05

이와 같은 과정에서 d=1, D=1 의 파라메터를 찾았습니다.

Autocorrelation and Partial Autocorrelation Plots¶

ARIMA 의 p,q 파라메터를 찾기 위해 PACF 와 ACF 플롯을 참조합니다.

플롯에 대한 해석은 간단하지만은 않은데, 여기 와 여기 에 ACF 와 PACF 를 해석하는 방법에 대한 정보가 자세히 나와있으니 참고하십시오.

Autocorrelation Interpretation¶

In [28]:

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

In [29]:

plot_acf(df['1st diff'].dropna());

In [30]:

plot_acf(df['seasonal 1st diff'].dropna());

Partial Autocorrelation¶

In [31]:

plot_pacf(df['1st diff'].dropna(), method='ywm');

In [32]:

plot_pacf(df['seasonal 1st diff'].dropna(), method='ywm');

Autocorrelation 과 Partial Autocorrelation 에 대한 요약¶

AR 모델 판별은 대체로 PACF 를 통해 확인할 수 있습니다.
- 이론적으로 PACF 는 AR 모델의 차수 이후 "shut off" 를 보입니다. "shut off" 되었다는 것은 partial autocorrelation 이 해당 지점 이후 0 과 같다는 의미입니다. 다른 말로, 0 아닌 partial autocorrelation 이 존재하는 차수가 AR 모델의 차수가 됩니다.
MA 모델에 대한 판별은 PACF 보다 ACF 더 명확히 확인됩니다.
- MA 모델에 있어서 PACF 는 "shut off" 하지 않지만 0 을 향해 점진적으로 줄어듭니다. ACF 에 MA 모델이 좀 더 분명히 나타나는데, 모델과 관련된 lag 에 대해서만 0 아닌 autocorrelation 이 나타납니다.

Seasonal ARIMA 모델 사용하기¶

In [33]:

model = sm.tsa.statespace.SARIMAX(df['pounds per cow'],
                                 order=(0,1,0),
                                 seasonal_order=(1,1,1,12))

/home/lyle/anaconda3/envs/tsa/lib/python3.7/site-packages/statsmodels/tsa/base/tsa_model.py:171: ValueWarning: No frequency information was provided, so inferred frequency MS will be used.
  % freq, ValueWarning)

In [34]:

result = model.fit()

In [35]:

print(result.summary())

                                 Statespace Model Results                                 
==========================================================================================
Dep. Variable:                     pounds per cow   No. Observations:                  168
Model:             SARIMAX(0, 1, 0)x(1, 1, 1, 12)   Log Likelihood                -534.065
Date:                            Thu, 31 Jan 2019   AIC                           1074.131
Time:                                    19:14:21   BIC                           1083.261
Sample:                                01-01-1962   HQIC                          1077.839
                                     - 12-01-1975                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.S.L12      -0.0449      0.106     -0.422      0.673      -0.253       0.163
ma.S.L12      -0.5860      0.102     -5.762      0.000      -0.785      -0.387
sigma2        55.5118      5.356     10.365      0.000      45.015      66.009
===================================================================================
Ljung-Box (Q):                       33.48   Jarque-Bera (JB):                32.04
Prob(Q):                              0.76   Prob(JB):                         0.00
Heteroskedasticity (H):               0.69   Skew:                             0.77
Prob(H) (two-sided):                  0.18   Kurtosis:                         4.60
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

In [36]:

result.resid.plot()

Out[36]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f97a4cfe438>

In [37]:

result.resid.plot(kind='kde')

Out[37]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f97a4297748>

Prediction of Future Values¶

일단 생성된 모델이 이미 알고 있는 결과 대비 얼마나 좋은 예측 성능을 보여주는지 확인해봅시다:

In [38]:

df['forecast'] = result.predict(start=150, end=168, dynamic=True)

df[['pounds per cow', 'forecast']].plot(figsize=(12,8))

Out[38]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f97a41faa20>