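Sometimes time series data needs to be moved by a fixed interval along its index. Data that has been time shifted in this way is called lagged data, and many time series analysis algorithms take such lagged data as input. pandas provides functions for this, which makes the operation straightforward.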
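As a minimal sketch of what "lagged" means, the snippet below builds a small synthetic daily series and lines it up against copies of itself shifted by one and two rows. The values and the column names `lag_1` and `lag_2` are made up for illustration and do not come from the dataset used in this post.

```python
import pandas as pd

# Synthetic data, for illustration only
idx = pd.date_range("2020-01-01", periods=5, freq="D")
s = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], index=idx, name="value")

# Each lag column holds the value observed 1 or 2 rows earlier
lagged = pd.DataFrame({
    "value": s,
    "lag_1": s.shift(1),
    "lag_2": s.shift(2),
})
print(lagged)

# Rows without a full history contain NaN and are usually dropped before modeling
print(lagged.dropna())
```

The first rows of each lag column are NaN because there is no earlier observation to pull from.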
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Set the Date column as the index when reading the CSV
df = pd.read_csv('data/apple_stock.csv', index_col='Date')
df.index
Index(['2009-12-31', '2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08', '2010-01-11', '2010-01-12', '2010-01-13', '2010-01-14', ... '2018-12-17', '2018-12-18', '2018-12-19', '2018-12-20', '2018-12-21', '2018-12-24', '2018-12-26', '2018-12-27', '2018-12-28', '2018-12-31'], dtype='object', name='Date', length=2265)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2265 entries, 2009-12-31 to 2018-12-31
Data columns (total 6 columns):
High         2265 non-null float64
Low          2265 non-null float64
Open         2265 non-null float64
Close        2265 non-null float64
Volume       2265 non-null float64
Adj Close    2265 non-null float64
dtypes: float64(6)
memory usage: 123.9+ KB
The index has dtype object (plain strings), so convert it to datetime.
df.index = pd.to_datetime(df.index)
df.index
DatetimeIndex(['2009-12-31', '2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08', '2010-01-11', '2010-01-12', '2010-01-13', '2010-01-14', ... '2018-12-17', '2018-12-18', '2018-12-19', '2018-12-20', '2018-12-21', '2018-12-24', '2018-12-26', '2018-12-27', '2018-12-28', '2018-12-31'], dtype='datetime64[ns]', name='Date', length=2265, freq=None)
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2265 entries, 2009-12-31 to 2018-12-31
Data columns (total 6 columns):
High         2265 non-null float64
Low          2265 non-null float64
Open         2265 non-null float64
Close        2265 non-null float64
Volume       2265 non-null float64
Adj Close    2265 non-null float64
dtypes: float64(6)
memory usage: 123.9 KB
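The separate conversion step can likely be folded into the read itself. The sketch below assumes the dates are in a format pandas can parse and uses `parse_dates=True`, which asks `read_csv` to parse the index column. To stay runnable without the CSV file, it reads a three-row inline sample (the variable names `csv_text` and `sample` are hypothetical; the rows are copied from the `Close` column shown in this post).

```python
import io
import pandas as pd

# Inline stand-in for data/apple_stock.csv, trimmed to one column
csv_text = """Date,Close
2009-12-31,30.104286
2010-01-04,30.572857
2010-01-05,30.625713
"""

# parse_dates=True parses the index column while reading
sample = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

print(type(sample.index))
print(sample.index.dtype)
```

With the real file, the same two keyword arguments would be passed along with the file path instead of the `StringIO` object.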
df.head()
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2009-12-31 | 30.478571 | 30.080000 | 30.447144 | 30.104286 | 88102700.0 | 20.159719 |
2010-01-04 | 30.642857 | 30.340000 | 30.490000 | 30.572857 | 123432400.0 | 20.473503 |
2010-01-05 | 30.798571 | 30.464285 | 30.657143 | 30.625713 | 150476200.0 | 20.508902 |
2010-01-06 | 30.747143 | 30.107143 | 30.625713 | 30.138571 | 138040000.0 | 20.182680 |
2010-01-07 | 30.285715 | 29.864286 | 30.250000 | 30.082857 | 119282800.0 | 20.145369 |
df.tail()
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2018-12-24 | 151.550003 | 146.589996 | 148.149994 | 146.830002 | 37169200.0 | 146.830002 |
2018-12-26 | 157.229996 | 146.720001 | 148.300003 | 157.169998 | 58582500.0 | 157.169998 |
2018-12-27 | 156.770004 | 150.070007 | 155.839996 | 156.149994 | 53117100.0 | 156.149994 |
2018-12-28 | 158.520004 | 154.550003 | 157.500000 | 156.229996 | 42291400.0 | 156.229996 |
2018-12-31 | 159.360001 | 156.479996 | 158.529999 | 157.740005 | 35003500.0 | 157.740005 |
df.shift(1).head()
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2009-12-31 | NaN | NaN | NaN | NaN | NaN | NaN |
2010-01-04 | 30.478571 | 30.080000 | 30.447144 | 30.104286 | 88102700.0 | 20.159719 |
2010-01-05 | 30.642857 | 30.340000 | 30.490000 | 30.572857 | 123432400.0 | 20.473503 |
2010-01-06 | 30.798571 | 30.464285 | 30.657143 | 30.625713 | 150476200.0 | 20.508902 |
2010-01-07 | 30.747143 | 30.107143 | 30.625713 | 30.138571 | 138040000.0 | 20.182680 |
# The last row is pushed past the end of the index and is lost.
df.shift(1).tail()
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2018-12-24 | 158.160004 | 149.630005 | 156.860001 | 150.729996 | 95744600.0 | 150.729996 |
2018-12-26 | 151.550003 | 146.589996 | 148.149994 | 146.830002 | 37169200.0 | 146.830002 |
2018-12-27 | 157.229996 | 146.720001 | 148.300003 | 157.169998 | 58582500.0 | 157.169998 |
2018-12-28 | 156.770004 | 150.070007 | 155.839996 | 156.149994 | 53117100.0 | 156.149994 |
2018-12-31 | 158.520004 | 154.550003 | 157.500000 | 156.229996 | 42291400.0 | 156.229996 |
df.shift(-1).head()
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2009-12-31 | 30.642857 | 30.340000 | 30.490000 | 30.572857 | 123432400.0 | 20.473503 |
2010-01-04 | 30.798571 | 30.464285 | 30.657143 | 30.625713 | 150476200.0 | 20.508902 |
2010-01-05 | 30.747143 | 30.107143 | 30.625713 | 30.138571 | 138040000.0 | 20.182680 |
2010-01-06 | 30.285715 | 29.864286 | 30.250000 | 30.082857 | 119282800.0 | 20.145369 |
2010-01-07 | 30.285715 | 29.865715 | 30.042856 | 30.282858 | 111902700.0 | 20.279305 |
df.shift(-1).tail()
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2018-12-24 | 157.229996 | 146.720001 | 148.300003 | 157.169998 | 58582500.0 | 157.169998 |
2018-12-26 | 156.770004 | 150.070007 | 155.839996 | 156.149994 | 53117100.0 | 156.149994 |
2018-12-27 | 158.520004 | 154.550003 | 157.500000 | 156.229996 | 42291400.0 | 156.229996 |
2018-12-28 | 159.360001 | 156.479996 | 158.529999 | 157.740005 | 35003500.0 | 157.740005 |
2018-12-31 | NaN | NaN | NaN | NaN | NaN | NaN |
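A common reason to shift is to compare each row with its neighbor. The sketch below is one possible use, not something the original notebook does: it takes the first five `Close` values shown in `df.head()` and computes the day-over-day change with `shift(1)`, plus a "next day" column with `shift(-1)`. The names `close`, `change`, `pct_change`, and `next_close` are chosen here for illustration.

```python
import pandas as pd

# First five Close values from df.head() above
dates = pd.to_datetime(
    ["2009-12-31", "2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07"]
)
close = pd.Series(
    [30.104286, 30.572857, 30.625713, 30.138571, 30.082857],
    index=dates,
    name="Close",
)

# shift(1) aligns each date with the previous row's value
change = close - close.shift(1)          # same result as close.diff()
pct_change = close / close.shift(1) - 1  # simple return versus the previous row

# shift(-1) aligns each date with the following row's value
next_close = close.shift(-1)

print(pd.DataFrame({"Close": close, "change": change,
                    "pct_change": pct_change, "next_close": next_close}))
```

The first row of `change` and the last row of `next_close` are NaN, for the same reason the shifted tables above start or end with a row of NaN.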
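# Move the index forward by 12 month-ends; the values stay the same, only the dates change
# (tshift was removed in pandas 2.0; shift with freq is the replacement, and pandas 2.2+ prefers the alias 'ME' over 'M')
df.shift(periods=12, freq='M').head()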
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2010-12-31 | 30.478571 | 30.080000 | 30.447144 | 30.104286 | 88102700.0 | 20.159719 |
2010-12-31 | 30.642857 | 30.340000 | 30.490000 | 30.572857 | 123432400.0 | 20.473503 |
2010-12-31 | 30.798571 | 30.464285 | 30.657143 | 30.625713 | 150476200.0 | 20.508902 |
2010-12-31 | 30.747143 | 30.107143 | 30.625713 | 30.138571 | 138040000.0 | 20.182680 |
2010-12-31 | 30.285715 | 29.864286 | 30.250000 | 30.082857 | 119282800.0 | 20.145369 |
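When a `freq` is given, the index labels move and the values stay attached to their rows, so no NaN rows appear. `tshift` was deprecated in pandas 1.1 and removed in 2.0; `shift` with a `freq` argument covers the same case. The sketch below uses three synthetic values on the first three dates of the dataset and passes offset objects (`pd.offsets.MonthEnd`, `pd.DateOffset`) instead of string aliases, on the assumption that this avoids alias differences between pandas versions. It also contrasts the month-end anchor with a plain calendar-month offset.

```python
import pandas as pd

# Synthetic values on the first three dates of the dataset
dates = pd.to_datetime(["2009-12-31", "2010-01-04", "2010-01-05"])
s = pd.Series([1.0, 2.0, 3.0], index=dates)

# Anchored offset: every date rolls forward 12 month-ends,
# so all three dates collapse onto the same label
month_end = s.shift(periods=12, freq=pd.offsets.MonthEnd())
print(month_end)

# Calendar offset: each date moves by one month and keeps its day of month
one_month = s.shift(periods=1, freq=pd.DateOffset(months=1))
print(one_month)
```

The anchored version explains why every row in the table above carries the date 2010-12-31. If the goal is to keep the dates distinct, a calendar offset such as `pd.DateOffset(months=12)` is probably the better fit.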