(watch section)
(read the wiki)
import QSTK.qstkutil.qsdateutil as du
import QSTK.qstkutil.tsutil as tsu
import QSTK.qstkutil.DataAccess as da
import datetime as dt
import matplotlib.pyplot as plt
import pandas as pd
Some notebook magic, please ignore
%matplotlib inline
We'll be using historical adjusted close data. QSTK has a DataAccess class designed to quickly read this data into pandas DataFrame object. We must first select which symbols we're interested in, and for which time periods. Note that the wiki tutorial script uses 2006, but in the video tutorial, we use 2010. The end date differs to so as we have 2 weeks of data. SPX is the S&P 500.
ls_symbols = ["AAPL", "GLD", "GOOG", "$SPX", "XOM"]
dt_start = dt.datetime(2010, 1, 1)
dt_end = dt.datetime(2010, 1, 15)
The time at which the market closes is 16:00.
dt_timeofday = dt.timedelta(hours=16)
The function getNYSEdays(dt_start, dt_end, dt_timeofday)
returns the days of the interval for which the New York stock exchange was open. It adds the time of day given as param to each datetime member of the result list.
ldt_timestamps = du.getNYSEdays(dt_start, dt_end, dt_timeofday)
ldt_timestamps
[Timestamp('2010-01-04 16:00:00', tz=None), Timestamp('2010-01-05 16:00:00', tz=None), Timestamp('2010-01-06 16:00:00', tz=None), Timestamp('2010-01-07 16:00:00', tz=None), Timestamp('2010-01-08 16:00:00', tz=None), Timestamp('2010-01-11 16:00:00', tz=None), Timestamp('2010-01-12 16:00:00', tz=None), Timestamp('2010-01-13 16:00:00', tz=None), Timestamp('2010-01-14 16:00:00', tz=None)]
Note than Jan 4 was the first open day of the interval, and Jan 14, the last.
Create an object that will be ready to read from our Yahoo data source
c_dataobj = da.DataAccess('Yahoo')
c_dataobj.get_data
creates a list of dataframe objects. The dict-zipping converts this list into a dictionary for easier access.
ls_keys = ['open', 'high', 'low', 'close', 'volume', 'actual_close']
ldf_data = c_dataobj.get_data(ldt_timestamps, ls_symbols, ls_keys)
d_data = dict(zip(ls_keys, ldf_data))
Note that in the list of keys, 'close'
refers to adjusted close, and actual_close is the raw close price. Looking at the type of the object created by get_data
ldf_data.__class__, ldf_data[1].__class__
(list, pandas.core.frame.DataFrame)
d_data['close']
AAPL | GLD | GOOG | $SPX | XOM | |
---|---|---|---|---|---|
2010-01-04 16:00:00 | 213.10 | 109.80 | 626.75 | 1132.99 | 64.55 |
2010-01-05 16:00:00 | 213.46 | 109.70 | 623.99 | 1136.52 | 64.80 |
2010-01-06 16:00:00 | 210.07 | 111.51 | 608.26 | 1137.14 | 65.36 |
2010-01-07 16:00:00 | 209.68 | 110.82 | 594.10 | 1141.69 | 65.15 |
2010-01-08 16:00:00 | 211.07 | 111.37 | 602.02 | 1144.98 | 64.89 |
2010-01-11 16:00:00 | 209.21 | 112.85 | 601.11 | 1146.98 | 65.62 |
2010-01-12 16:00:00 | 206.83 | 110.49 | 590.48 | 1136.22 | 65.29 |
2010-01-13 16:00:00 | 209.75 | 111.54 | 587.09 | 1145.68 | 65.03 |
2010-01-14 16:00:00 | 208.53 | 112.03 | 589.85 | 1148.46 | 65.04 |
We pull the close prices we cant to plot out of the pandas dataframe, into a 2D numpy array.
na_price = d_data['close'].values
Now, plotting with matplotlib. Here we first clear the plot (clf
), plot the data, and then modify the plot to add a legend and some labels.
plt.clf()
plt.plot(ldt_timestamps, na_price)
plt.legend(ls_symbols)
plt.ylabel('Adjusted Close')
plt.xlabel('Date')
<matplotlib.text.Text at 0x4a9c2d0>
To save a PDF file containing the plot:
plt.savefig('adjustedclose.pdf', format='pdf')
<matplotlib.figure.Figure at 0x4923310>
To better see the variations, we normalize the data with respect to the first day's price (so we divide all rows by the first row).
na_normalized_price = na_price / na_price[0, :]
Plotting:
plt.clf()
plt.plot(ldt_timestamps, na_normalized_price)
plt.legend(ls_symbols)
plt.ylabel('Normalised Adjusted Close')
plt.xlabel('Date')
<matplotlib.text.Text at 0x4d04110>
(From here, there are no video tutorials, only the wiki)
The QSTK function returnize0
computes the daily return of stocks (remember, ret(t)=price(t)price(t−1)−1). Beware, returnize0
mutates the array. Make a copy first:
na_rets = na_normalized_price.copy()
tsu.returnize0(na_rets)
array([[ 0. , 0. , 0. , 0. , 0. ], [ 0.00168935, -0.00091075, -0.00440367, 0.00311565, 0.00387297], [-0.0158812 , 0.01649954, -0.02520874, 0.00054552, 0.00864198], [-0.00185652, -0.00618779, -0.02327952, 0.00400127, -0.00321297], [ 0.00662915, 0.004963 , 0.01333109, 0.00288169, -0.00399079], [-0.00881224, 0.01328904, -0.00151158, 0.00174676, 0.01124981], [-0.01137613, -0.02091272, -0.01768395, -0.00938116, -0.00502895], [ 0.01411787, 0.00950312, -0.00574109, 0.00832585, -0.00398223], [-0.00581645, 0.00439304, 0.00470115, 0.00242651, 0.00015378]])
plt.clf()
plt.plot(ldt_timestamps[0:50], na_rets[0:50, 3]) # $SPX 50 days
plt.plot(ldt_timestamps[0:50], na_rets[0:50, 4]) # XOM 50 days
plt.axhline(y=0, color='r')
plt.legend(['$SPX', 'XOM'])
plt.ylabel('Daily Returns')
plt.xlabel('Date')
<matplotlib.text.Text at 0x4923a90>
To estimate the correlation between two securities. Let's work with more data first.
dt_start = dt.datetime(2006, 1, 1)
dt_end = dt.datetime(2010, 12, 31)
ldt_timestamps = du.getNYSEdays(dt_start, dt_end, dt_timeofday)
d_data = dict(zip(ls_keys, c_dataobj.get_data(ldt_timestamps, ls_symbols, ls_keys)))
Filling the data for NAN: we apply the filling functions in decreasing order of preference.
for s_key in ls_keys:
d_data[s_key] = d_data[s_key].fillna(method='ffill')
d_data[s_key] = d_data[s_key].fillna(method='bfill')
d_data[s_key] = d_data[s_key].fillna(1.0)
Getting the numpy ndarray of close prices.
na_price = d_data['close'].values
na_normalized_price = na_price / na_price[0, :]
na_rets = na_normalized_price.copy()
tsu.returnize0(na_rets);
XOM vs. S&P 500:
plt.clf()
plt.scatter(na_rets[:, 3], na_rets[:, 4], c='blue')
plt.ylabel('XOM')
plt.xlabel('$SPX')
<matplotlib.text.Text at 0x4fe8850>
Plotting the scatter plot of daily returns between $SPX VS GLD
plt.clf()
plt.scatter(na_rets[:, 3], na_rets[:, 1], c='yellow')
plt.ylabel('GLD')
plt.xlabel('$SPX')
<matplotlib.text.Text at 0x5003f50>
Using the daily returns we can reconstruct cumulative daily returns. Note that in general the cumulative daily return for day t is defined as follows (this is NOT Python code, it is an equation):
daily_cum_ret(t) = daily_cum_ret(t-1) * (1 + daily_ret(t))
I don't provide the code for this, as it is a programming assignment. If you plot the result, it should look exactly like the normalized returns plot above.
import numpy as np
def daily_cum_ret(np_arr):
"Reconstructs the cumulative returns from daily returns."
assert np_arr.size != 0
dcr = np.empty_like(np_arr, dtype=float)
dcr[0] = 1
for ri in xrange(1, np_arr.shape[0]):
dcr[ri] = dcr[ri-1] * (1+np_arr[ri])
return dcr
# For comparison
plt.clf()
plt.figure(figsize=(8,16))
plt.subplot(311)
plt.plot(ldt_timestamps, na_normalized_price)
plt.ylabel('Normalised Adjusted Close')
plt.xlabel('Date')
# Our reconstruction
plt.subplot(312)
plt.plot(ldt_timestamps, daily_cum_ret(na_rets))
plt.ylabel('Reconstructed Normalised Adjusted Close')
plt.xlabel('Date')
# floating poing operations cumulative error
err = na_normalized_price - daily_cum_ret(na_rets)
plt.subplot(313)
plt.plot(ldt_timestamps, err)
plt.ylabel('Reconstruction error')
plt.xlabel('Date')
<matplotlib.text.Text at 0x5824790>
<matplotlib.figure.Figure at 0x564f190>
Suppose we want to estimate the performance of a portfolio composed of 75% GLD and 25% SPY? We can approximate the daily returns in an equation as follows:
portfolio_daily_rets = 0.75 * GLD_daily_rets + 0.25 * SPY_daily_rets
Then, using the equation above for cumulative daily returns, we can plot the performance of the combined portfolio. This also is an assignment in my course, so I don't list the code for it here.
As I am lazy, I approximate SPX≈SPY
p_rets = na_rets[:, 3] * 0.25 + na_rets[:, 1] * 0.75
na_plot_cumul_rets = daily_cum_ret(np.concatenate([na_rets[:, [1, 3]], p_rets[:, np.newaxis]], axis=1))
plt.clf()
plt.plot(ldt_timestamps, na_plot_cumul_rets)
plt.legend(["GLD", "S&P 500", "portf."])
plt.ylabel('Cumulative returns')
plt.xlabel('Date')
<matplotlib.text.Text at 0x5c08350>
Finally, we revisit the scatterplots above that reveal visually how closely correlated (related) the daily movement of two stocks are. It's even better if can quantify this correlation by fitting a line to them using linear regression. Note the red line in the figure on the right; this was computed using one of NumPy's linear regression tools. The value of the slope of the line is reported as "corr" which is technically not correct.
Wikipedia has a nice discussion of correlation.
Again, I'm not going to show the code here, but I will tell you that the code is not very complex, and I used the following functions: polyfit(), polyval(), and sort().
Note that there is no figure on the wiki. So I choose to fit the XOM scatterplot. I also don't use the same functions as the instructor, but my version works too.
spx_rets, xom_rets = na_rets[:, 3], na_rets[:, 4]
coefs = np.polyfit(x=spx_rets, y=xom_rets, deg=1)
fit_f = np.poly1d(coefs)
fit_x = [np.amin(spx_rets), np.amax(spx_rets)]
fit_y = fit_f(fit_x)
plt.clf()
plt.scatter(spx_rets, xom_rets, c='blue')
plt.plot(fit_x, fit_y, c='red')
plt.ylabel('XOM')
plt.xlabel('$SPX')
<matplotlib.text.Text at 0x5e701d0>