There are several ways of making money in the stock market; here we explore a few of them using techniques from data science and finance. We explore alpha generation from the perspectives of leveraged ETFs, signal processing (moving average crossovers and the relative strength index), returns analysis and portfolio construction, and sentiment analysis of Twitter data.
Thus our initial question is: which of the above methods will generate alpha, and, of course, what are the caveats?
The motivation of this project is to apply knowledge from 209 to finance; in short, the motivation is to make money. We do not just want to maximize returns, but also to minimize volatility. You will see through the statistics we employ that we focused on these two ideas.
In terms of related work, many of our functions are derived from other works (like the Butterworth filter); we have cited them in our .py files. In addition, we have used what we learned on Bayesian Tomatoes quite extensively. Most of the other work comes from our own experience.
We collected the initial data from Quandl and have included it as files in the .zip. In addition, we collected Twitter data; that process is described below.
Grappling with timeseries data is essential to our task of stock price prediction, since a stock traces out a series of datapoints with regard to time. Fortunately, pandas includes an API for time-series manipulations. In the following, we introduce some exploratory statistics and methodology and begin to analyze stock data from a returns perspective.
The pandas time-series API includes the tools we use below: date parsing, resampling, frequency conversion, and moving-window statistics.
Let's import some code. From datetime, we need the standard datetime module, and from pandas, we need the time-series tools.
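Those import cells are not reproduced above; as a sketch, inferred from how datetime, Series, DataFrame, ols, rolling_std, and ewmstd are used later (some of these may instead come in via the helper file), they likely looked something like:

import datetime                                        # datetime.datetime(...) is used below
from pandas import Series, DataFrame                   # used directly, e.g. Series(...), DataFrame(...)
from pandas.stats.api import ols                       # regression helper used as ols(y=..., x=...)
from pandas.stats.moments import rolling_std, ewmstd   # rolling / exponentially weighted volatility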
In addition we need:
# Code to allow inline plotting, derived from psets
import warnings
warnings.filterwarnings('ignore')
from helperfile import *
plt.rc('figure', figsize=(10, 6))
%matplotlib inline
import json
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 30)
# set some nicer defaults for matplotlib
from matplotlib import rcParams
#these colors come from colorbrewer2.org. Each is an RGB triplet
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
(0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
(0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
(0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
(0.4, 0.6509803921568628, 0.11764705882352941),
(0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
(0.6509803921568628, 0.4627450980392157, 0.11372549019607843),
(0.4, 0.4, 0.4)]
rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.grid'] = False
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'none'
def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chartjunk by stripping out unnecessary plot borders and axis ticks

    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)

    # turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')

    # now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()
Let's sample some data from the SPX.csv data file, sourced from Yahoo Finance S&P 500.
with open('SPX.csv', 'r') as ff:
    print ff.readline()  # headers
    print ff.readline()  # first row
DATE,OPEN,HIGH,LOW,CLOSE,VOLUME,ADJ 2013-09-13,169.13,169.46,168.74,169.33,72459700,169.33
We can pass lists or dicts to the parse_dates argument for parsing dates.
data = pd.read_csv('SPX.csv',parse_dates={'Timestamp': ['DATE']},index_col='Timestamp')
data
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 5195 entries, 2013-09-13 00:00:00 to 1993-01-29 00:00:00 Data columns (total 6 columns): OPEN 5195 non-null values HIGH 5195 non-null values LOW 5195 non-null values CLOSE 5195 non-null values VOLUME 5195 non-null values ADJ 5195 non-null values dtypes: float64(5), int64(1)
Let's take a look at some of these more recent values:
history = data.ix[:, ['OPEN', 'VOLUME']]
history.head()
OPEN | VOLUME | |
---|---|---|
Timestamp | ||
2013-09-13 | 169.13 | 72459700 |
2013-09-12 | 169.34 | 83098200 |
2013-09-11 | 168.64 | 94272700 |
2013-09-10 | 168.64 | 102432800 |
2013-09-09 | 166.45 | 83392900 |
Compute a VWAP (Volume Weighted Average Price, a weighted average of price with volume as the weights, often used as an anchoring quantity by algorithmic traders) using resample.
volume = history.VOLUME.resample('1w', how='sum') # weekly vwap
value = history.prod(axis=1).resample('1w', how='sum')
vwap = value / volume
print vwap
Timestamp 1993-01-31 43.970000 1993-02-07 44.554778 1993-02-14 44.845201 1993-02-21 44.021129 1993-02-28 43.816993 1993-03-07 44.786620 1993-03-14 45.346889 1993-03-21 45.133627 1993-03-28 44.827772 1993-04-04 45.121409 1993-04-11 44.453826 1993-04-18 44.844166 1993-04-25 44.560004 1993-05-02 43.723734 1993-05-09 44.391409 ... 2013-06-09 163.200245 2013-06-16 163.581478 2013-06-23 162.626114 2013-06-30 159.262948 2013-07-07 161.399900 2013-07-14 165.689620 2013-07-21 168.265996 2013-07-28 169.031879 2013-08-04 169.393906 2013-08-11 169.890286 2013-08-18 167.850913 2013-08-25 165.400218 2013-09-01 164.397174 2013-09-08 165.588047 2013-09-15 168.435810 Freq: W-SUN, Length: 1077, dtype: float64
We can conveniently index financial time series data as follows.
vwap.ix['1993-01-31':'2013-09-15']
Timestamp 1993-01-31 43.970000 1993-02-07 44.554778 1993-02-14 44.845201 1993-02-21 44.021129 1993-02-28 43.816993 1993-03-07 44.786620 1993-03-14 45.346889 1993-03-21 45.133627 1993-03-28 44.827772 1993-04-04 45.121409 1993-04-11 44.453826 1993-04-18 44.844166 1993-04-25 44.560004 1993-05-02 43.723734 1993-05-09 44.391409 ... 2013-06-09 163.200245 2013-06-16 163.581478 2013-06-23 162.626114 2013-06-30 159.262948 2013-07-07 161.399900 2013-07-14 165.689620 2013-07-21 168.265996 2013-07-28 169.031879 2013-08-04 169.393906 2013-08-11 169.890286 2013-08-18 167.850913 2013-08-25 165.400218 2013-09-01 164.397174 2013-09-08 165.588047 2013-09-15 168.435810 Freq: W-SUN, Length: 1077, dtype: float64
selection = vwap.between_time('1993-01-31', '2013-09-15')
vol = pd.Series(volume.between_time('1993-01-31', '2013-09-15'),dtype=float)
#vol.head(20)
# attempt to fill possible missing data
selection.ix['1993-01-31':'2013-09-15'].head(20)
filled = selection.fillna(method='pad', limit=1)
filled.ix['1993-01-31':'2013-09-15'].head(20)
vol = vol.fillna(0.)
#vol.head(20)
# now for visualization, plot the volume weighted average price, along with volume, with respect to time.
vol.plot(style='y')
filled.plot(secondary_y=True, style='r')
remove_border()
BB = history.OPEN.resample('1w', how='ohlc')
week_range = BB.high - BB.low
#week_range.describe()
week_return = BB.close / BB.open - 1
#week_return.describe()
#week_return.head()
MM = week_return.between_time('1993-01-31', '2013-09-15')
#lagged = MM.shift(1)
lagged = week_return.tshift(1, 'w').between_time('1993-01-31', '2013-09-15')
# Ordinary least squares
pd.ols(y=MM, x=lagged)
-------------------------Summary of Regression Analysis------------------------- Formula: Y ~ <x> + <intercept> Number of Observations: 1076 Number of Degrees of Freedom: 2 R-squared: 0.0012 Adj R-squared: 0.0003 Rmse: 0.0222 F-stat (1, 1074): 1.2743, p-value: 0.2592 Degrees of Freedom: model 1, resid 1074 -----------------------Summary of Estimated Coefficients------------------------ Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5% -------------------------------------------------------------------------------- x -0.0344 0.0305 -1.13 0.2592 -0.0942 0.0254 intercept 0.0014 0.0007 2.07 0.0383 0.0001 0.0027 ---------------------------------End of Summary---------------------------------
MM = vwap / BB.open - 1
MM = MM.between_time('1993-01-31', '2013-09-15')
lagged = MM.tshift(1, 'w').between_time('1993-01-31', '2013-09-15')
pd.ols(y=MM, x=lagged)
-------------------------Summary of Regression Analysis------------------------- Formula: Y ~ <x> + <intercept> Number of Observations: 1076 Number of Degrees of Freedom: 2 R-squared: 0.0018 Adj R-squared: 0.0009 Rmse: 0.0122 F-stat (1, 1074): 1.9175, p-value: 0.1664 Degrees of Freedom: model 1, resid 1074 -----------------------Summary of Estimated Coefficients------------------------ Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5% -------------------------------------------------------------------------------- x -0.0422 0.0305 -1.38 0.1664 -0.1020 0.0175 intercept 0.0003 0.0004 0.91 0.3610 -0.0004 0.0011 ---------------------------------End of Summary---------------------------------
So the S&P did very well this year. Why can't we just invest in a 3x ETF and make 60 percent a year?
What are leveraged ETFs? First, ETFs (Exchange Traded Funds) are financial instruments that trade similarly to stocks and attempt to replicate the return of a basket of other assets. In principle, it would require many costly transactions for an investor to replicate the exposure of an ETF, so they are seen as convenient cost-saving tools (example: one SPY purchase gives you the same exposure as buying every stock in the S&P 500). Some ETFs purport to replicate a multiple of the return of the underlying asset; these are known as leveraged ETFs. Leveraged ETFs have an interesting behavior: many market participants have noted their long-term downward tendency. This is because they have an asymmetric return distribution with a large chance of decay, yet while the chance of decay is large, the expected value at each timestep is the same as for a similar nonleveraged instrument. Let's attempt to demonstrate this through a Monte Carlo approach.
We want an ETF-like security with normally distributed historical returns: an average daily return of 0, with a standard deviation of 1%.
# numpy/matplotlib
%pylab inline
history = randn(10000)
plot(randn(10000)) #plot the 10000-day historical price vector.
title('returns')
xlabel('day')
ylabel('return [%]')
remove_border()
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['load', 'info', 'mpl', 'datetime', 'save', 'set_printoptions', 'unique'] `%pylab --no-import-all` prevents importing * from pylab and numpy
h = hist(history,100)
print 'Mean:' , history.mean()
print 'Standard deviation:%.2f' % history.std()
Mean: -0.00286332404091 Standard deviation:1.01
This is very near mean 0, standard deviation 1.
Starting with an investment of $k$ dollars, on day $n$ our investment would then have a value of $k (1+r_1)(1+r_2)\cdots(1+r_n)$, where $r_t$ is the (fractional) return on day $t$. We can simulate these returns by using the cumulative product function.
price = 100*(history/100+1).cumprod() # normalized price with a starting asset base of $100.
plot(price) # cumulative-product price path
title('normalized price')
xlabel('day number')
ylabel('price [$]')
remove_border()
Now employ a Monte Carlo approach, creating many simultaneous price paths.
history = randn(1000,5000)
prices = 100*(history/100+1).cumprod(axis=0) # normalized price along axis
h = plot(prices[:,:300]) # plot first 300
xlabel('day number')
ylabel('price [$]')
remove_border()
Let's make a histogram of the final prices. This is the thousandth day, which is simply the last row of the prices matrix.
finalPrice = prices[-1,:] # get last row of prices table
h = hist(finalPrice,50) # plot histogram
xlabel('price [$]')
ylabel('frequency [#]')
print 'Mean:', finalPrice.mean()
print 'Std:', finalPrice.std()
Mean: 99.3375039092 Std: 32.1103764376
history3x = history*3. # leveraged returns
prices3x = 100*(history3x/100+1).cumprod(axis=0) # normalized price
finalPrice3x = prices3x[-1,:] # last
h1 = hist(finalPrice3x,500) # histogram 1
h2 = hist(finalPrice,100) # histogram 2
xlim([0,400]) # setting limit of x axis
legend(['3 times','1 times'])
print 'Mean 3x:', finalPrice3x.mean()
Mean 3x: 98.0867193128
The mean is close to our initial asset base, but the distribution is skewed and no longer normal. So an equal probability of going up or down in percentage terms does not change the long-term expected value; the resulting distribution is skewed such that a price decline happens more often, but that is compensated by the fat tail on the upside.
Note what is happening here: the fat tail of the leveraged ETF is so large that while the peak of the distribution is below 50, the long-term mean is still near 100. This suggests that the naive strategy of shorting leveraged ETFs is not likely to be as profitable over the long term as it might at first appear.
So our first approach did not work; on to the next one.
warnings.filterwarnings('ignore')
import signal_processing as sp
sp.example()
[2013-12-12 20:04] DEBUG: Transform: Running StatefulTransform [short_mavg] [2013-12-12 20:04] DEBUG: Transform: Running StatefulTransform [long_mavg] [2013-12-12 20:04] DEBUG: Transform: Finished StatefulTransform [long_mavg] [2013-12-12 20:04] DEBUG: Transform: Finished StatefulTransform [short_mavg] [2013-12-12 20:04] INFO: Performance: Simulated 253 trading days out of 253. [2013-12-12 20:04] INFO: Performance: first open: 1990-01-02 14:31:00+00:00 [2013-12-12 20:04] INFO: Performance: last close: 1990-12-31 21:00:00+00:00
(Plot: IBM)
The concept of a dual moving average crossover (DMAC) is fairly straightforward. Calculate two moving averages of the price of a security (in our case, an equity index). One average is short term (ST), strictly relative to the other moving average, and the other is long term (LT). Mathematically speaking, the long-term moving average (LTMA) will have a lower variance and will move in the same direction as the short-term moving average, but at a different rate. The different rates induce points where the values of the two moving averages may equal or cross one another. These points are called the crossover points.
In the dual moving average crossover trading strategy, these crossovers are points of decision to buy or sell the desired product. What these crossover points imply depends on the approach the investor has in their strategy. There are two schools of thought: Technical and Value.
The Technical Approach suggests that when the Short Term Moving Average (STMA) moves above the LTMA, that represents a Buy (or Long) signal. (Conversely, when the STMA moves below the LTMA, the Technical Approach indicates a Sell (or Short) signal.) The intuition behind this strategy can be explained in terms of momentum. Basically, the principle of momentum states that a price that is moving up (or down) during period t is likely to continue to move up (or down) in period t+1 unless evidence exists to the contrary. When the STMA moves above the LTMA, this provides a lagged indicator that the price is moving upward relative to the historical price. Buy high, sell higher.
The Value Approach offers the opposite trading signals to the Technical Approach. The Value Approach claims that when the STMA crosses from below to above the LTMA, the product is now overvalued and should be sold. Conversely, when the STMA moves below the LTMA, the product is undervalued and should be bought. The intuition behind the Value Approach can be thought of simply as mean reversion. Buy low (value), sell high (overvalued).
Both strategies try to achieve the same goal, but they do so in opposing ways. In this paper, we will analyze both the technical and value strategies as applied to the SPY with periods of 2 and 5.
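To make the mechanics concrete, here is a minimal sketch of a crossover signal, assuming the moving averages are simple rolling means (this is our own illustration, not the sp.dmac implementation from signal_processing.py):

import pandas as pd

def dmac_signal(prices, short_period=2, long_period=5, technical=True):
    # short- and long-term moving averages of the price series
    stma = pd.rolling_mean(prices, short_period)
    ltma = pd.rolling_mean(prices, long_period)
    # Technical approach: long (+1) when the STMA is above the LTMA, short (-1) otherwise
    position = (stma > ltma).astype(float) * 2 - 1
    # Value approach: take the opposite side of each crossover
    return position if technical else -position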
Let us better understand what we are doing. We are looking for the underlying signal in a stock's price; we then take a value-based approach and a technical approach using these signals. But what is a signal? Signals are tools for estimating the future value of an asset based on its previous behavior, e.g. predicting the price of AAPL stock based on its previous price movements over the past hour, day, or month. But what does this look like? Consider below:
# make a new dataframe from the original dataset
prices = pd.DataFrame(data.reindex( index=data.index[ ::-1 ] ).OPEN)
prices.plot()
remove_border()
Notice how choppy the graph appears. It is hard to tell which direction the S&P is moving. Now consider applying a simple moving average (see included python files: filters.py).
prices['SMA50'] = sp.sma(prices.OPEN,50)
prices['SMA200'] = sp.sma(prices.OPEN,200)
prices.plot()
remove_border()
Now what you see is a smoother set of curves. Notice how the 50-period SMA follows the actual prices more closely than the 200-period SMA, which follows them more loosely. Why? The 200-period curve averages prices up to 200 days back, versus the 50-period curve, which only looks 50 days back. That is why the SMA 50 reacts more quickly.
Now we understand what the true values/underlying signals of the market are, but how do we use them? We can implement a DMAC algorithm that, based on these signals, will produce equity curves (how much we would have made). Consider the code below.
equity_curves = pd.DataFrame(sp.dmac(prices.OPEN,prices.OPEN,sp.sma,2, 5, 5, True),columns =['Value SMA'],index=prices.index)
equity_curves.plot(); remove_border()
Well hot dog! Looks like we have a winner. We should just go invest in this! It gives us 10x returns! Well, wait, let's first put this in perspective.
equity_curves['Technical SMA'] = sp.dmac(prices.OPEN,prices.OPEN,sp.sma,2, 5, 5, False)
equity_curves['Baseline'] = 100*prices.OPEN/prices.OPEN[0]
equity_curves.plot(); remove_border()
Okay, this is more reasonable. It seems that we have beaten the baseline with our Value-based approach (described above) and lost nearly all of our money with our Technical approach. But generally speaking, just eyeballing our results is not the best strategy; there are certain statistics that we should view:
We will begin by examining the statistics on our baseline.
stats = sp.performance(equity_curves['Baseline'], time_frame= 251)
pd.DataFrame([stats.values()],columns = stats.keys(),index=['Baseline'])
sharp | std | sortino | car3 | car1 | cret | pos | car5 | aret | max | ave | ytd | 1000 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Baseline | 0.4506 | 0.169088 | 0.868786 | 14.479322 | 17.134151 | 284.648624 | 0.526083 | 5.947035 | 6.967836 | 56.70044 | 0.033569 | 17.134151 | 3846.486241 |
So the baseline model (meaning we just invest in the S&P 500) does pretty well; it gives us nearly 4x! But what else? The maximum drawdown was nearly 57%. Pretty bad: during the mortgage crisis the market really sank. The annualized return was about 7% a year. The Sharpe ratio (where 0.5 is usually considered good) was a bit above 0.45.
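The full set of statistics comes from sp.performance in signal_processing.py; as our own rough sketch (not that code), two of the numbers just discussed, the annualized Sharpe ratio and the maximum drawdown, can be computed from an equity curve like this:

import numpy as np

def sharpe_and_drawdown(equity, time_frame=251):
    returns = equity / equity.shift(1) - 1
    sharpe = np.sqrt(time_frame) * returns.mean() / returns.std()   # annualized Sharpe ratio
    drawdown = 100 * (1 - equity / equity.cummax()).max()           # maximum drawdown, in percent
    return sharpe, drawdown

Now let's look at the statistics of all three strategies combined.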
stats = []
for strategy in ['Baseline','Value SMA','Technical SMA']:
    stats.append(sp.performance(equity_curves[strategy], time_frame=251))
pd.DataFrame([stat.values() for stat in stats],columns = stats[0].keys(),index=['Baseline','Value SMA','Technical SMA'])
sharp | std | sortino | car3 | car1 | cret | pos | car5 | aret | max | ave | ytd | 1000 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Baseline | 0.450600 | 0.169088 | 0.868786 | 14.479322 | 17.134151 | 284.648624 | 0.526083 | 5.947035 | 6.967836 | 56.700440 | 0.033569 | 17.134151 | 3846.486241 |
Value SMA | 0.857395 | 0.159222 | 16.518197 | 12.628268 | 4.673404 | 1021.574147 | 0.337055 | 14.389879 | 12.847359 | 24.042366 | 0.051072 | 4.673404 | 11215.741466 |
Technical SMA | -1.141516 | 0.109234 | -1.163120 | -12.432710 | -5.102461 | -94.240220 | 0.307796 | -15.226719 | -13.299761 | 20.924733 | -0.050313 | -5.102461 | 57.597798 |
Now we can make some comparisons. First, we notice that the Technical SMA did quite poorly compared to the other strategies in almost all regards; it lost money. Next is the Value SMA: it made more money, and in fact it is less risky, with less volatility. We can see this by looking at the Sharpe and Sortino ratios. But then two questions arise: did we choose the correct period, and did we choose the correct filter?
Let us deal with the first of these questions. We have done research, and periods of 2 and 5 were recommended, but let us check a broader range of periods.
from mpl_toolkits.mplot3d import Axes3D
values = []
for short_period in range(1,5):
    for long_period in range(short_period + 1, short_period + 5):
        values.append([short_period, long_period, sp.dmac(prices.OPEN, prices.OPEN, sp.sma, short_period, long_period, 5, True, end_value=True)])
X,Y,Z = zip(*values)
print max(values,key=lambda x:x[2])
fig = plt.figure()
ax = Axes3D(fig)
ax.plot_trisurf(X,Y,Z)
[2, 5, 1121.5741465850183]
<mpl_toolkits.mplot3d.art3d.Poly3DCollection at 0x1632d8d0>
So yes, our research worked out; in fact, the periods of 2 and 5 lead to the best overall gain. So the final question we will ask is: is our filter right? We will focus on four different types of filters (all described and coded in signal_processing.py): the simple moving average (SMA), the exponential moving average (EMA), a Gaussian filter, and a Butterworth filter.
Let us then do a naive optimization over all of these filters to find their optimal periods, and then view some statistics.
for filt, name in [(sp.ema,'EMA'), (sp.gaussian,'Gaussian'), (sp.butterworth,'Butterworth')]:
    values = []
    for short_period in range(1,5):
        for long_period in range(short_period + 1, short_period + 5):
            values.append([short_period, long_period, sp.dmac(prices.OPEN, prices.OPEN, filt, short_period, long_period, 5, True, end_value=True)])
    optimal = max(values, key=lambda x: x[2])
    print "%s Value, Returns: %d, Periods: %d, %d" % (name, optimal[2], optimal[0], optimal[1])
    values = []
    for short_period in range(1,5):
        for long_period in range(short_period + 1, short_period + 5):
            values.append([short_period, long_period, sp.dmac(prices.OPEN, prices.OPEN, filt, short_period, long_period, 5, False, end_value=True)])
    optimal = max(values, key=lambda x: x[2])
    print "%s Technical, Returns: %d, Periods: %d, %d" % (name, optimal[2], optimal[0], optimal[1])
EMA Value, Returns: 499, Periods: 1, 5 EMA Technical, Returns: 59, Periods: 4, 8 Gaussian Value, Returns: 457, Periods: 2, 3 Gaussian Technical, Returns: 42, Periods: 2, 6 Butterworth Value, Returns: 101, Periods: 1, 2 Butterworth Technical, Returns: 98, Periods: 1, 2
So are we done? Not quite yet. It may seem like the SMA performs the best of all of the filters above, but we have not looked at their statistics. Let us view the best of the best and compare the SMA, EMA, and Gaussian against the baseline.
equity_curves['Value EMA'] = sp.dmac(prices.OPEN,prices.OPEN,sp.ema,1, 5, 5, True)
equity_curves['Value Gaussian'] = sp.dmac(prices.OPEN,prices.OPEN,sp.gaussian,2, 3, 5, True)
del equity_curves['Technical SMA']
equity_curves.plot()
remove_border()
Before we move on to the stats, let's take a look at what we have here. The most curious thing to note is that the EMA did not do too much better than the baseline. Oddly enough, the EMA seemed to do best in times of economic downturn. Now let's move on to the statistics, including comparative statistics.
stats = []
for strategy in ['Baseline','Value SMA','Value EMA','Value Gaussian']:
    stats.append(sp.performance(equity_curves[strategy], time_frame=251))
pd.DataFrame([stat.values() for stat in stats],columns = stats[0].keys(),index=['Baseline','Value SMA','Value EMA','Value Gaussian'])
sharp | std | sortino | car3 | car1 | cret | pos | car5 | aret | max | ave | ytd | 1000 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Baseline | 0.450600 | 0.169088 | 0.868786 | 14.479322 | 17.134151 | 284.648624 | 0.526083 | 5.947035 | 6.967836 | 56.700440 | 0.033569 | 17.134151 | 3846.486241 |
Value SMA | 0.857395 | 0.159222 | 16.518197 | 12.628268 | 4.673404 | 1021.574147 | 0.337055 | 14.389879 | 12.847359 | 24.042366 | 0.051072 | 4.673404 | 11215.741466 |
Value EMA | 0.509205 | 0.185759 | 1.388655 | 8.647352 | 8.518586 | 399.564057 | 0.327045 | 11.128043 | 8.375112 | 48.362182 | 0.035617 | 8.518586 | 4995.640570 |
Value Gaussian | 0.537531 | 0.168085 | 1.455399 | 4.046805 | 4.218977 | 357.432135 | 0.300866 | 6.198407 | 7.898730 | 30.130000 | 0.033532 | 4.218977 | 4574.321353 |
stats = []
for strategy in ['Value SMA','Value EMA','Value Gaussian']:
    stats.append(sp.stats(equity_curves[strategy], equity_curves['Baseline'], time_frame=251))
pd.DataFrame([stat.values() for stat in stats],columns = stats[0].keys(),index=['Value SMA','Value EMA','Value Gaussian'])
alpha | beta | r | return | correlation | |
---|---|---|---|---|---|
Value SMA | 12.124887 | 0.021216 | 0.000268 | 1021.574147 | 0.016368 |
Value EMA | 6.986131 | 0.019568 | 0.000234 | 399.564057 | 0.015300 |
Value Gaussian | 6.260903 | 0.038330 | 0.000822 | 357.432135 | 0.028669 |
So this tells the story. The Value SMA has lower variance (meaning a high Sharpe and Sortino ratio, and a low maximum drawdown and standard deviation), and it has a higher return. All of the strategies yield positive alpha and have very low beta.
One could much more thoroughly examine different filters and different periods in order to find the optimum, but we should be wary of overfitting. There are other statistical tests that we could perform to judge the value of a strategy.
RSI, or relative strength index, is a measure of whether a stock is overbought or oversold. Thus if the RSI (implemented and explained further in signal_processing.py) breaches its upper or lower bound of 70 or 30, we should expect the stock to revert to the mean.
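As a point of reference (our own sketch, not the implementation in signal_processing.py), a simple moving-average variant of the RSI can be computed as RSI = 100 - 100/(1 + RS), where RS is the ratio of average gains to average losses over the look-back period:

import pandas as pd

def rsi_sketch(prices, period=14):
    delta = prices.diff()
    gains = delta.copy()
    losses = -delta.copy()
    gains[gains < 0] = 0.0     # keep only up moves
    losses[losses < 0] = 0.0   # keep only down moves (as positive numbers)
    rs = pd.rolling_mean(gains, period) / pd.rolling_mean(losses, period)
    return 100.0 - 100.0 / (1.0 + rs)

But what does this look like on the S&P?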
equity_curves = pd.DataFrame(sp.rsi(prices.OPEN,50),columns =['RSI'],index=prices.index)
equity_curves.plot(); remove_border()
Well, this does not tell us much by itself. The principle of RSI is: sell when above 70 and buy when below 30. So let's add some more graphics to the chart.
equity_curves['Upper Bound'] = [70]*len(equity_curves.RSI)
equity_curves['Lower Bound'] = [30]*len(equity_curves.RSI)
# normalize the original s&p
equity_curves['Baseline'] = 100*prices.OPEN/prices.OPEN[-1]
equity_curves.plot(); remove_border(); plt.legend(loc='upper left')
<matplotlib.legend.Legend at 0x1be9fb00>
Ah, now we can see the strength of RSI. Notice that when the RSI breaks the bounds of 70/30, the market then tends to move in the other direction. RSI tells us if the stock is overbought or oversold. Thus we can see that it works, to an extent. Let's try to find the best implementation of this strategy.
values = []
for period in range(1,101):
    values.append([period, sp.equity_curve(sp.rsi(prices.OPEN, period), prices.OPEN, [70]*len(prices.OPEN), [30]*len(prices.OPEN), 5, end_value=True)])
print max(values,key=lambda x:x[1])
periods, returns = zip(*values)
plt.plot(returns)
[13, 114.08440469332842]
[<matplotlib.lines.Line2D at 0x1babc518>]
So as one can see, there is a trend to RSI: adjusting the period up or down can lead to consistently better results. Notice that as the period increases, we get less volatility, because we make fewer trades. We have found a generally optimal period, so let us view some statistics on this curve versus the baseline and finish off with our conclusions on RSI.
equity_curves = pd.DataFrame(sp.equity_curve(sp.rsi(prices.OPEN,13), prices.OPEN, [70]*len(prices.OPEN), [30]*len(prices.OPEN), 5),columns =['RSI'],index=prices.index)
# normalize the original s&p
equity_curves['Baseline'] = 100*prices.OPEN/prices.OPEN[1]
equity_curves.plot(); remove_border();# plt.legend(loc='upper left')
So RSI might not be what we are looking for. RSI is a naive approach to the stock market dating from the 1970s, and it is not surprising that such a naive strategy might not work. Further implementations might change the holding period, set stops, or, in short, develop a more comprehensive RSI tool.
Signal processing can get us far. It can make us money, ensure low volatility, and even beat the stock market. But sometimes it just does not work. Our naive approach to signal processing has given us insight into how traders might look at the market and how to evaluate a strategy; however, there remain problems with overfitting and a potentially limitless supply of different signals.
Extensions and future work for signal processing include running more advanced statistics, trying a wider range of techniques, and developing a more in-depth trading system (including stops, limits, commissions, and slippage).
This is the most classic way to go about making money on the market: buying some stocks and putting them in your portfolio.
Returns are defined as: $$r_t = \frac{p_t-p_{t-1}}{p_{t-1}}$$
We define the iterative return index:
$$i_t = (1 + r_t) \cdot i_{t-1}, \quad i_0 = 1$$
RR = close_px/close_px.shift(1) - 1
daily_index = (1 + RR).cumprod()
daily_index['MSFT'].plot()
remove_border()
# Let's sample and plot monthly returns...
monthly_index = daily_index.asfreq('EOM', method='ffill')
monthly_RR = monthly_index / monthly_index.shift(1) - 1
monthly_RR.plot()
remove_border()
We can plot multiple timeseries in a dataframe using matplotlib:
subRR = RR.ix[:, ['AAPL', 'IBM', 'XOM', 'MSFT']]
(1 + subRR).cumprod().plot()
remove_border()
We can look at the distributions of stock and ETF (Exchange Traded Fund) returns:
RR[-250:].hist(bins=50)
remove_border()
RR.ix[-1].plot(kind='bar')
remove_border()
We can use a sliding window of price to calculate certain functions of timeseries, such as rolling volatility and moving averages.
# We have the following plot for Apple:
px = close_px['AAPL']
close_px['AAPL'].plot()
remove_border()
And the corresponding rolling volatility:
vol = rolling_std(px / px.shift(1) - 1, 250) * np.sqrt(250)
vol.plot()
remove_border()
We calculated the rolling volatility using a sliding window of price; similarly, we can calculate an exponentially weighted (smoothed) estimate of the volatility to more easily recognize changes in trend.
evol = ewmstd(RR, span=250) * np.sqrt(250)
vol.plot()
evol.plot()
plt.xlim([datetime.datetime(1991, 1, 1), datetime.datetime(2012, 1, 1)])
(726833.0, 734503.0)
In the variable $RR$, we have the returns of the various S&P 500 equities (along with SPX itself). These returns may be correlated. We can look at the correlation in a grid:
RR.corr()
AA | AAPL | GE | IBM | JNJ | MSFT | PEP | SPX | XOM | |
---|---|---|---|---|---|---|---|---|---|
AA | 1.000000 | 0.216917 | 0.454180 | 0.320836 | 0.239261 | 0.332316 | 0.231136 | 0.599806 | 0.433569 |
AAPL | 0.216917 | 1.000000 | 0.309258 | 0.353119 | 0.153393 | 0.365921 | 0.129998 | 0.439518 | 0.189228 |
GE | 0.454180 | 0.309258 | 1.000000 | 0.413646 | 0.374792 | 0.414954 | 0.332428 | 0.744141 | 0.392965 |
IBM | 0.320836 | 0.353119 | 0.413646 | 1.000000 | 0.247022 | 0.426098 | 0.215850 | 0.575682 | 0.278011 |
JNJ | 0.239261 | 0.153393 | 0.374792 | 0.247022 | 1.000000 | 0.292051 | 0.372419 | 0.506015 | 0.351107 |
MSFT | 0.332316 | 0.365921 | 0.414954 | 0.426098 | 0.292051 | 1.000000 | 0.249435 | 0.618319 | 0.314164 |
PEP | 0.231136 | 0.129998 | 0.332428 | 0.215850 | 0.372419 | 0.249435 | 1.000000 | 0.450798 | 0.306896 |
SPX | 0.599806 | 0.439518 | 0.744141 | 0.575682 | 0.506015 | 0.618319 | 0.450798 | 1.000000 | 0.601648 |
XOM | 0.433569 | 0.189228 | 0.392965 | 0.278011 | 0.351107 | 0.314164 | 0.306896 | 0.601648 | 1.000000 |
Or preferably, we can use a colorful visualization to see at a glance which equities exhibit stronger correlation with each other:
plot_corr(RR.corr())
We can also plot the correlation of the monthly returns.
plot_corr(monthly_RR.corr())
Leave-One-Out Regression (LOOCV):
# Note: the function leave_one_out is included in the helper file.
leave_one_out(RR)
{'ex-AA': GE -0.056442 IBM 0.196518 JNJ -0.138967 MSFT 0.175767 PEP -0.113942 SPX 1.142164 XOM -0.178632 intercept 0.000844 dtype: float64, 'ex-GE': AA -0.079250 IBM 0.194958 JNJ -0.150925 MSFT 0.174841 PEP -0.118740 SPX 1.170754 XOM -0.157575 intercept 0.000841 dtype: float64, 'ex-IBM': AA -0.082702 GE -0.060464 JNJ -0.162179 MSFT 0.187395 PEP -0.127989 SPX 1.445041 XOM -0.185627 intercept 0.000924 dtype: float64, 'ex-JNJ': AA -0.069560 GE -0.054263 IBM 0.199524 MSFT 0.173160 PEP -0.141957 SPX 1.153692 XOM -0.173623 intercept 0.000799 dtype: float64, 'ex-MSFT': AA -0.086019 GE -0.075172 IBM 0.211504 JNJ -0.155034 PEP -0.125363 SPX 1.479190 XOM -0.183643 intercept 0.000965 dtype: float64, 'ex-PEP': AA -0.074750 GE -0.054790 IBM 0.199032 JNJ -0.175586 MSFT 0.174323 SPX 1.175799 XOM -0.168719 intercept 0.000829 dtype: float64, 'ex-SPX': AA 0.026372 GE 0.214939 IBM 0.332856 JNJ -0.028252 MSFT 0.319514 PEP -0.031255 XOM 0.026058 intercept 0.000598 dtype: float64, 'ex-XOM': AA -0.090336 GE -0.035589 IBM 0.205419 JNJ -0.161027 MSFT 0.180873 PEP -0.123881 SPX 1.095649 intercept 0.000791 dtype: float64}
We can organize this output into a dataframe.
DataFrame(leave_one_out(RR))
ex-AA | ex-GE | ex-IBM | ex-JNJ | ex-MSFT | ex-PEP | ex-SPX | ex-XOM | |
---|---|---|---|---|---|---|---|---|
AA | NaN | -0.079250 | -0.082702 | -0.069560 | -0.086019 | -0.074750 | 0.026372 | -0.090336 |
GE | -0.056442 | NaN | -0.060464 | -0.054263 | -0.075172 | -0.054790 | 0.214939 | -0.035589 |
IBM | 0.196518 | 0.194958 | NaN | 0.199524 | 0.211504 | 0.199032 | 0.332856 | 0.205419 |
JNJ | -0.138967 | -0.150925 | -0.162179 | NaN | -0.155034 | -0.175586 | -0.028252 | -0.161027 |
MSFT | 0.175767 | 0.174841 | 0.187395 | 0.173160 | NaN | 0.174323 | 0.319514 | 0.180873 |
PEP | -0.113942 | -0.118740 | -0.127989 | -0.141957 | -0.125363 | NaN | -0.031255 | -0.123881 |
SPX | 1.142164 | 1.170754 | 1.445041 | 1.153692 | 1.479190 | 1.175799 | NaN | 1.095649 |
XOM | -0.178632 | -0.157575 | -0.185627 | -0.173623 | -0.183643 | -0.168719 | 0.026058 | NaN |
intercept | 0.000844 | 0.000841 | 0.000924 | 0.000799 | 0.000965 | 0.000829 | 0.000598 | 0.000791 |
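The helper's leave_one_out is not reproduced in this notebook; judging from the output above (AAPL never appears as a regressor, and each 'ex-X' column omits X), a sketch of what it likely does is the following (our own reconstruction, which may differ from the helper file):

def leave_one_out_sketch(returns, target='AAPL'):
    # Regress the target's returns on the remaining columns, dropping one
    # explanatory column at a time.
    explanatory = [c for c in returns.columns if c != target]
    results = {}
    for col in explanatory:
        x = returns.ix[:, [c for c in explanatory if c != col]]
        results['ex-' + col] = pd.ols(y=returns[target], x=x).beta
    return results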
We can conveniently use the function 'groupby' to group the data by year.
grouped = RR.groupby(lambda x: x.year)
for year, group in grouped:
    print year
    print group[:5]
1990 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 1990-02-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1990-02-02 0.012048 0.017812 0.000000 0.005956 0.023419 0 0.008278 0.006478 0.019608 1990-02-05 0.005952 0.022500 0.000000 0.025459 -0.006865 0 -0.006568 0.002810 0.001603 1990-02-06 -0.011834 -0.007335 0.003484 0.013857 -0.004608 0 0.016529 -0.006599 -0.003200 1990-02-07 0.005988 -0.043103 0.010417 0.021071 0.013889 0 0.003252 0.012407 0.016051 1991 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 1991-01-02 0.006250 0.011799 -0.014545 -0.007874 -0.025554 0.000000 -0.008526 -0.011417 -0.019746 1991-01-03 -0.006211 -0.011662 -0.022140 0.003472 -0.010490 0.000000 -0.029484 -0.013907 0.005755 1991-01-04 -0.022917 0.005900 -0.011321 -0.003460 -0.007067 0.012195 -0.005063 -0.002827 0.010014 1991-01-07 -0.006397 0.000000 -0.015267 -0.016865 -0.030249 -0.012048 -0.025445 -0.017321 -0.012748 1991-01-08 -0.015021 0.000000 0.007752 -0.011100 0.009174 -0.024390 0.005222 -0.001712 -0.002869 1992 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 1992-01-02 -0.005445 0.054978 0.000000 0.014328 -0.019958 0.027473 0.000000 0.000408 -0.014874 1992-01-03 0.010949 -0.008451 -0.002646 0.001177 -0.004287 -0.010695 0.015654 0.004985 0.002323 1992-01-06 -0.009025 -0.017045 -0.005305 0.020576 0.009688 0.032432 -0.015413 -0.003291 -0.008111 1992-01-07 0.000000 0.019509 -0.005333 0.025922 0.012793 0.026178 0.000000 -0.001340 -0.005841 1992-01-08 0.001821 0.023388 -0.002681 -0.023582 0.018947 0.035714 0.041436 0.001677 -0.012926 1993 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 1993-01-04 0.019108 -0.025017 0.000000 -0.004980 -0.019883 -0.004762 -0.008889 -0.000757 0.004348 1993-01-05 0.001562 0.017106 0.011494 -0.024024 -0.022673 0.014354 -0.009716 -0.002389 0.009740 1993-01-06 -0.006240 0.042046 -0.004545 -0.018462 -0.030525 0.033019 -0.015094 0.000414 -0.004287 1993-01-07 -0.001570 -0.012105 -0.002283 -0.020899 -0.026448 -0.022831 -0.028352 -0.008722 -0.002153 1993-01-08 -0.028302 0.020422 -0.002288 -0.010672 0.024580 0.000000 0.016562 -0.003900 -0.016181 1994 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 1994-01-03 0.019293 0.021038 -0.007299 0.020636 0.007702 -0.005051 -0.006662 -0.002165 0.009054 1994-01-04 -0.001577 0.054945 -0.003676 0.023589 -0.002548 0.005076 -0.011923 0.003115 0.007976 1994-01-05 0.023697 0.071615 -0.007380 0.008230 -0.002554 0.020202 -0.009804 0.001414 0.005935 1994-01-06 -0.004630 -0.030377 0.009294 -0.017143 -0.020487 0.029703 -0.012186 -0.000920 -0.005900 1994-01-07 0.004651 0.012531 0.005525 0.006645 0.011765 0.004808 0.022359 0.005951 -0.007913 1995 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 1995-01-03 -0.013854 -0.015560 0.000000 0.003251 -0.004111 -0.013333 -0.006552 -0.000348 -0.001996 1995-01-04 0.003831 0.025290 0.000000 0.008425 0.009288 0.006757 -0.014015 0.003485 0.004000 1995-01-05 -0.007634 -0.012333 0.003643 -0.004499 -0.005112 -0.016779 -0.020903 -0.000803 0.001992 1995-01-06 0.029487 0.080125 -0.005445 0.014848 -0.006166 0.017065 0.000000 0.000739 0.000000 1995-01-09 0.009963 -0.019268 -0.009124 0.005089 -0.011375 -0.006711 0.003416 0.000326 -0.007952 1996 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 1996-01-02 0.028340 0.007528 0.021330 -0.005691 -0.014848 0.023202 -0.002611 0.007793 -0.006442 1996-01-03 0.015748 0.000000 0.001229 -0.017690 0.039974 -0.031746 0.002618 0.000950 0.002882 1996-01-04 0.008721 -0.017435 -0.011043 -0.026483 -0.009452 0.004684 -0.002611 -0.005826 0.006466 1996-01-05 -0.006724 0.084918 0.001241 0.020131 0.000000 -0.011655 -0.002094 -0.001603 0.024982 1996-01-08 -0.054159 0.011682 0.007435 0.005333 0.012723 0.000000 
0.000000 0.002838 0.011838 1997 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 1997-01-02 0.021470 0.005747 -0.013405 0.011413 0.000000 -0.012315 0.008358 -0.005036 0.003425 1997-01-03 0.044462 0.036190 0.015399 0.038426 0.010388 0.036160 0.008289 0.014952 0.002844 1997-01-06 -0.002322 -0.178309 -0.000892 0.013216 0.007576 -0.002407 -0.016441 -0.000508 0.013613 1997-01-07 -0.005431 -0.020134 0.020536 0.013333 0.002148 0.007238 0.008358 0.007463 0.002798 1997-01-08 -0.001560 0.006849 -0.013998 -0.021453 0.007503 -0.019162 0.004388 -0.006399 -0.007812 1998 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 1998-01-02 0.012546 0.237805 0.008289 0.009545 -0.012215 0.014173 -0.006844 0.004750 0.011111 1998-01-05 0.017493 -0.022167 0.018203 0.007696 -0.000824 -0.005435 0.013783 0.002082 -0.010989 1998-01-06 -0.022923 0.193955 -0.013264 -0.011128 -0.015264 0.005464 -0.036136 -0.010736 -0.036000 1998-01-07 0.002199 -0.075949 0.008182 -0.009488 0.009636 -0.011646 0.019673 -0.002669 0.031812 1998-01-08 -0.042429 0.038813 -0.009275 -0.000668 0.011618 0.007070 0.000000 -0.008247 -0.021448 1999 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 1999-01-04 -0.010981 0.007820 -0.014274 -0.007505 -0.013889 0.016514 -0.004727 -0.000919 -0.006899 1999-01-05 0.020819 0.050436 0.021295 0.036295 0.003841 0.038989 0.001583 0.013582 -0.008775 1999-01-06 0.039429 -0.036011 0.019600 -0.004621 0.011798 0.032662 0.026241 0.022140 0.039838 1999-01-07 -0.012426 0.077586 -0.016769 0.007575 -0.009455 -0.005047 -0.025570 -0.002051 -0.001419 1999-01-08 0.101987 0.000000 -0.005824 -0.013823 0.000636 -0.004058 0.015492 0.004221 -0.006039 2000 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 2000-01-03 -0.024605 0.089105 -0.030891 0.075319 -0.011504 -0.001526 0.046093 -0.009549 -0.027742 2000-01-04 0.004360 -0.084673 -0.039774 -0.033934 -0.036617 -0.033843 -0.025473 -0.038345 -0.019244 2000-01-05 0.057674 0.014832 -0.001763 0.035125 0.010607 0.010621 -0.024373 0.001922 0.054465 2000-01-06 -0.013193 -0.086538 0.013243 -0.017214 0.031487 -0.033542 0.044895 0.000956 0.051652 2000-01-07 -0.002971 0.047579 0.038629 -0.004429 0.042397 0.013188 0.027027 0.027090 -0.002746 2001 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 2001-01-02 -0.037407 0.000000 -0.087551 -0.002150 -0.029224 0.000000 -0.003787 -0.028032 0.025161 2001-01-03 0.013467 0.100806 0.092764 0.115690 -0.031875 0.104985 -0.058287 0.050099 -0.043664 2001-01-04 0.032650 0.041514 0.005251 -0.015210 -0.021688 0.010085 -0.045748 -0.010552 -0.027753 2001-01-05 -0.018382 -0.039859 -0.015670 0.008703 0.013088 0.014188 0.021151 -0.026242 0.004604 2001-01-08 0.014981 0.010989 -0.036851 -0.004618 -0.001318 -0.003627 0.016570 -0.001918 -0.004583 2002 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 2002-01-02 0.003434 0.063927 0.021575 0.004416 -0.006894 0.011909 0.009647 0.005740 0.007600 2002-01-03 0.013005 0.012017 -0.008381 0.017775 -0.000217 0.032650 -0.007292 0.009180 0.001571 2002-01-04 0.032095 0.005089 0.008452 0.015718 -0.009330 -0.004779 -0.008865 0.006213 0.008472 2002-01-07 0.023241 -0.033755 -0.038887 -0.012398 -0.004599 -0.004802 -0.005878 -0.006499 -0.008712 2002-01-08 -0.021753 -0.012227 -0.010464 0.005315 -0.006601 0.011878 -0.005913 -0.003588 0.001255 2003 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 2003-01-02 0.033578 0.033520 0.046729 0.039604 0.030168 0.039389 0.021092 0.033200 0.015288 2003-01-03 0.010660 0.006757 -0.003151 0.013445 0.026788 0.001421 0.006791 -0.000484 0.000684 2003-01-06 0.024611 0.000000 0.025817 0.023770 0.009507 0.017975 -0.010118 0.022474 0.024624 2003-01-07 0.000000 -0.002685 -0.006163 0.028888 -0.022558 0.019052 
-0.018171 -0.006545 -0.033712 2003-01-08 -0.103922 -0.020188 -0.015504 -0.021123 -0.005826 -0.028272 0.012435 -0.014086 -0.004145 2004 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 2004-01-02 -0.011967 -0.004677 0.004203 -0.012204 0.000000 0.002765 -0.003352 -0.003094 -0.008929 2004-01-05 0.032609 0.042293 0.015069 0.016391 0.005971 0.025276 0.017076 0.012395 0.023249 2004-01-06 -0.007218 -0.003607 -0.006598 0.000120 -0.004511 0.003586 0.006360 0.001292 -0.006816 2004-01-07 -0.007573 0.022624 0.009548 -0.003008 0.000477 -0.001340 -0.012892 0.002367 -0.007149 2004-01-08 0.011905 0.033628 0.018092 0.002776 0.004529 -0.001342 -0.019974 0.004963 -0.002592 2005 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 2005-01-03 -0.013120 -0.017081 0.002787 -0.008455 -0.008205 0.000845 -0.004984 -0.008119 -0.022717 2005-01-04 -0.018095 0.010111 -0.012161 -0.010687 -0.003078 0.003797 -0.007058 -0.011671 -0.006904 2005-01-05 -0.006017 0.008758 -0.005980 -0.002068 -0.000772 -0.002522 0.001147 -0.003628 -0.005098 2005-01-06 0.004162 0.000930 0.007785 -0.003109 0.002897 -0.000843 0.007100 0.003506 0.012579 2005-01-07 0.010173 0.072491 -0.005969 -0.004390 -0.003659 -0.002952 0.008642 -0.001431 -0.006441 2006 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 2006-01-03 0.010985 0.039783 0.009191 -0.001741 0.025469 0.026428 0.011600 0.016430 0.041046 2006-01-04 0.005620 0.002943 -0.001401 -0.001342 0.015402 0.004568 -0.000583 0.003673 0.001739 2006-01-05 0.008197 -0.007870 -0.002455 0.006717 -0.004171 0.000827 -0.005834 0.000016 -0.005016 2006-01-06 -0.003326 0.025813 0.006681 0.029624 0.004570 -0.002891 0.003717 0.009399 0.019779 2006-01-09 0.011865 -0.003277 -0.002445 -0.014256 0.006255 -0.002071 -0.005067 0.003656 -0.000570 2007 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 2007-01-03 -0.022661 -0.012258 0.020692 0.001230 0.005795 0.000000 0.002734 -0.001199 -0.032981 2007-01-04 -0.007479 0.022196 -0.006018 0.010721 0.012395 -0.001472 0.006908 0.001228 -0.018699 2007-01-05 -0.012057 -0.007121 -0.004780 -0.009061 -0.008967 -0.005898 -0.003069 -0.006085 0.007165 2007-01-08 -0.009535 0.004938 -0.000320 0.015165 -0.001740 0.010011 0.002173 0.002220 -0.008022 2007-01-09 0.001540 0.083070 0.000000 0.011863 -0.003660 0.000734 0.004156 -0.000517 -0.007782 2008 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 2008-01-02 -0.011481 -0.016357 -0.008186 -0.031513 -0.011862 -0.010658 -0.008092 -0.014438 -0.001862 2008-01-03 0.001489 0.000462 0.000952 0.001944 0.000171 0.004001 0.006526 0.000000 0.003382 2008-01-04 -0.036574 -0.076335 -0.020615 -0.035948 -0.001372 -0.027897 -0.003095 -0.024552 -0.018596 2008-01-07 -0.050309 -0.013385 0.003886 -0.010699 0.015625 0.006623 0.023503 0.003223 -0.009356 2008-01-08 -0.063698 -0.035972 -0.021613 -0.024521 0.001183 -0.033521 0.007943 -0.018352 -0.012791 2009 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 2009-01-02 0.076137 0.063269 0.053999 0.038154 0.013774 0.046002 0.021860 0.031608 0.022773 2009-01-05 -0.020708 0.042204 -0.025940 -0.006386 -0.009964 0.009424 -0.006418 -0.004668 -0.000131 2009-01-06 0.022026 -0.016494 0.013981 0.027771 -0.006038 0.011411 0.001762 0.007817 -0.016374 2009-01-07 -0.101724 -0.021608 -0.044649 -0.016047 -0.009389 -0.060000 -0.032630 -0.030010 -0.025436 2009-01-08 0.043186 0.018569 0.002062 -0.006955 -0.001858 0.031097 -0.011715 0.003397 0.010659 2010 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 2010-01-04 0.032746 0.015565 0.020891 0.011908 0.004126 0.015700 0.007282 0.016043 0.014102 2010-01-05 -0.031098 0.001729 0.005457 -0.012079 -0.011506 0.000336 0.012048 0.003116 0.003930 2010-01-06 0.052234 -0.015906 -0.005427 -0.006547 
0.008148 -0.006382 -0.010034 0.000546 0.008582 2010-01-07 -0.021531 -0.001849 0.051842 -0.003414 -0.007257 -0.010142 -0.006356 0.004001 -0.003135 2010-01-08 0.025061 0.006648 0.021401 0.010039 0.003489 0.006831 -0.003285 0.002882 -0.003893 2011 AA AAPL GE IBM JNJ MSFT PEP SPX XOM 2011-01-03 0.026797 0.021732 0.000000 0.004900 0.015773 0.002556 0.006421 0.011315 0.019624 2011-01-04 0.045831 0.005219 0.017937 0.001099 0.008336 0.004006 -0.005135 -0.001313 0.004641 2011-01-05 0.002435 0.008180 0.001652 -0.003979 -0.000648 -0.003264 0.017988 0.005007 -0.002717 2011-01-06 -0.012143 -0.000808 -0.004398 0.010951 -0.001460 0.029476 0.003841 -0.002123 0.006540 2011-01-07 0.003688 0.007161 -0.007178 -0.004973 -0.009747 -0.007777 -0.006735 -0.001845 0.005414
import statsmodels.api as sm  # assumed import; sm may already be provided by the helper file

def get_beta(RR):
    RR = RR.dropna()
    RR['intercept'] = 1.
    model = sm.OLS(RR['MSFT'], RR.ix[:, ['AAPL', 'intercept']]).fit()
    return model.params
#get_beta(RR), for AAPL
grouped = RR.groupby([lambda x: x.year, lambda x: x.month])
beta_by_ym = grouped.apply(get_beta)
beta_by_ym.unstack(0)['AAPL']
1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | NaN | 0.586836 | 0.392003 | 0.337470 | 0.034171 | 0.240722 | 0.064221 | 0.131182 | 0.036281 | 0.182694 | 0.242851 | 0.410385 | 0.313725 | 0.906993 | 0.261317 | -0.036247 | 0.144353 | -0.024860 | 0.250590 | 0.310269 | 0.548070 | 0.305503 |
2 | -0.002426 | 0.398197 | 0.597157 | -0.191693 | 0.004739 | 0.166778 | 0.069597 | 0.002899 | 0.167381 | 0.504353 | 0.367281 | 0.387085 | 0.285311 | 0.781728 | 0.207407 | 0.128596 | 0.187659 | 0.532232 | 0.232091 | 0.649430 | 0.618771 | 0.332002 |
3 | 0.205730 | 0.604625 | 0.229554 | 0.467642 | 0.157241 | -0.130381 | 0.234062 | 0.161575 | 0.127244 | 0.330306 | 0.215007 | 0.625861 | 0.562043 | 1.137488 | 0.227146 | 0.020839 | 0.093893 | 0.550877 | 0.505030 | 0.998851 | 0.070641 | 0.466862 |
4 | 0.406723 | -0.055939 | 0.766337 | 0.645863 | 0.337084 | -0.097639 | -0.166418 | -0.092456 | 0.015637 | 0.124685 | 0.445522 | 0.421632 | 0.598160 | 0.468462 | 0.044444 | 0.272978 | -0.009106 | 0.059114 | 0.463362 | 0.321722 | 0.061764 | -0.232956 |
5 | 0.185421 | 0.034390 | -0.118079 | 0.700196 | 0.287318 | 0.097011 | 0.110495 | 0.474461 | 0.117327 | 0.193558 | 0.406288 | 0.305603 | 0.927363 | 0.160195 | 0.242406 | 0.115821 | 0.227492 | 0.144143 | 0.415390 | 0.229804 | 0.548791 | 0.517262 |
6 | 0.173833 | 0.124975 | 0.218792 | 0.117013 | 0.386103 | 0.221485 | 0.266115 | 0.464815 | 0.500223 | 0.319529 | 0.422132 | 0.227274 | 0.232655 | 0.315496 | 0.106455 | 0.032854 | 0.126484 | 0.054197 | 0.275618 | 0.171693 | 0.743858 | 0.465253 |
7 | 0.130995 | -0.022488 | 0.507049 | 0.156466 | 0.475714 | 0.886968 | 0.098349 | 0.223480 | 0.260195 | 0.274491 | 0.259399 | 0.216299 | 0.486249 | 0.380805 | 0.019886 | 0.241723 | 0.007129 | 0.037335 | 0.492093 | 0.512913 | 0.207223 | 0.380265 |
8 | 0.902234 | 0.307631 | 0.345300 | 0.214452 | 0.278754 | 0.673664 | 0.054254 | 0.054791 | 0.389318 | 0.113426 | -0.037612 | 0.708398 | 0.919043 | 0.061068 | 0.295398 | 0.241400 | -0.019745 | 0.338582 | 0.562713 | 0.589150 | 0.382959 | 0.847749 |
9 | 0.904113 | 0.422216 | 0.399828 | 0.161215 | 0.382306 | 0.229729 | 0.074385 | -0.009814 | 0.482903 | 0.287425 | 0.024519 | 0.528889 | 0.766441 | 0.725022 | 0.215218 | -0.022490 | -0.007107 | 0.219036 | 0.480952 | 0.453823 | 0.514019 | 0.678709 |
10 | 0.362579 | -0.027998 | 0.159585 | -0.069302 | 0.109282 | 0.280408 | -0.004444 | 0.142951 | 0.349357 | 0.186547 | -0.122512 | 0.348552 | 0.804821 | 0.299456 | 0.035525 | 0.146915 | 0.091303 | 0.234525 | 0.592720 | 0.249662 | 0.405615 | 0.232681 |
11 | 0.569319 | 0.233042 | 0.781442 | 0.219592 | 0.502614 | 0.075520 | 0.183464 | 0.153227 | 0.212188 | 0.005807 | 0.466090 | 0.337734 | 0.519945 | 0.348925 | -0.057999 | 0.009591 | 0.126967 | 0.320820 | 0.763112 | 0.493005 | 0.473650 | NaN |
12 | 0.204879 | 0.179746 | 0.305322 | 0.035841 | 0.194147 | 0.255704 | 0.067983 | 0.244797 | 0.316726 | 0.138289 | 0.425003 | 0.308701 | 0.435298 | 0.047410 | 0.102319 | 0.053251 | 0.109467 | 0.705481 | 0.711442 | 0.158556 | 0.903940 | NaN |
RR = close_px / close_px.shift(1) - 1
y = RR['AAPL']
x = RR.ix[:, ['MSFT']]
model = ols(y=y, x=x) # OLS
model
-------------------------Summary of Regression Analysis------------------------- Formula: Y ~ <MSFT> + <intercept> Number of Observations: 5471 Number of Degrees of Freedom: 2 R-squared: 0.1339 Adj R-squared: 0.1337 Rmse: 0.0289 F-stat (1, 5469): 845.4989, p-value: 0.0000 Degrees of Freedom: model 1, resid 5469 -----------------------Summary of Estimated Coefficients------------------------ Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5% -------------------------------------------------------------------------------- MSFT 0.5227 0.0180 29.08 0.0000 0.4874 0.5579 intercept 0.0007 0.0004 1.84 0.0665 -0.0000 0.0015 ---------------------------------End of Summary---------------------------------
# rolling least squares over a 250-day window
model = ols(y=y, x=x, window=250)
model.beta.info()
model.beta['MSFT'].plot()
remove_border()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 5222 entries, 1991-01-29 00:00:00 to 2011-10-14 00:00:00 Data columns (total 2 columns): MSFT 5222 non-null values intercept 5222 non-null values dtypes: float64(2)
Beta of Microsoft relative to the S&P 500 as a function of time.
def RB(returns, benchmark, window=250):
    betas = {}
    for col, ts in returns.iteritems():
        # use the passed-in benchmark and window rather than module-level globals
        betas[col] = _r_b(ts, benchmark, window=window)
    return DataFrame(betas)

def _r_b(returns, benchmark, window=250):
    model = ols(y=returns, x={'bmk': benchmark}, window=window, min_periods=100)
    return model.beta['bmk']

def GMR(daily_returns):
    daily_index = T(daily_returns)
    monthly_index = daily_index.asfreq('EOM', method='pad')
    return TR(monthly_index)

def T(returns):
    return (1 + returns).cumprod()

def TR(prices):
    return prices / prices.shift(1) - 1
rets = GMR(TR(close_px))
spx = rets.pop('SPX')
betas =RB(rets, spx, window=36)
betas['MSFT'].plot()
remove_border()
The tracking error is defined as:
$TE = \omega =\sqrt{\operatorname{Var}(r_p - r_b)} = \sqrt{{E}[(r_p-r_b)^2]-({E}[r_p - r_b])^2}$
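The helper function calc_te used below is not shown in this notebook; a minimal sketch consistent with the definition above (portfolio returns from fixed weights, then the standard deviation of the active returns; the helper may differ) would be:

def calc_te_sketch(weights, returns, benchmark):
    port_returns = (returns * weights).sum(1)   # weighted portfolio return each day
    return (port_returns - benchmark).std()     # daily tracking error (annualize separately)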
# benchmark is the S&P 500
px = close_px.ix[datetime.datetime(1992, 1, 1):]
RR = px / px.shift(1) - 1
annualizer = np.sqrt(250)
bmk = RR.pop('SPX')
tickers = RR.columns
weights = Series(1. / len(tickers), index=tickers)
weights
# plot weighted portfolio versus the benchmark
port_returns = (RR * weights).sum(1)
plot_returns(port_returns, bmk)
Annualized:
calc_te(weights, RR, bmk) * annualizer
0.092509298588231245
Let's minimize tracking error for the entire sample of data, which uses hindsight:
clean_RR = RR.fillna(0)
clean_bmk = bmk.fillna(0)
opt_weights = get_opt_holdings_opt(clean_RR, clean_bmk)
calc_te(opt_weights, RR, bmk) * annualizer
0.077470788727965992
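get_opt_holdings_opt also lives in the helper file; as a hedged sketch of one way to do this minimization (scipy's minimize with a full-investment constraint; the helper's approach may differ), consider:

import numpy as np
from pandas import Series
from scipy.optimize import minimize

def min_te_weights_sketch(returns, benchmark):
    n = len(returns.columns)
    def te(w):
        # daily tracking error of the weighted portfolio versus the benchmark
        return (returns.values.dot(w) - benchmark.values).std()
    cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)  # weights sum to one
    res = minimize(te, np.repeat(1.0 / n, n), constraints=cons)
    return Series(res.x, index=returns.columns)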
import sys  # for the progress-marker printing below

def rolling_min_te_portfolio(UNIRET, TRARET, window=250, T_intervals=100):
    ports = {}
    for i, cur_date in enumerate(UNIRET.index):
        if i < T_intervals:
            continue
        if i % 100 == 0:
            print >> sys.stdout, '*',
        UNI = UNIRET[max(0, i - window + 1):i]
        TRA = TRARET[max(0, i - window + 1):i]
        UNI = UNI.ix[:, UNI.count() > 0]
        UNI = UNI.fillna(0)
        TRA = TRA.fillna(0)
        ports[cur_date] = get_opt_holdings_opt(UNI, TRA)
    return DataFrame(ports).T
ports = rolling_min_te_portfolio(RR, bmk)
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
ports
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 4888 entries, 1992-05-26 00:00:00 to 2011-10-14 00:00:00 Data columns (total 8 columns): AA 4888 non-null values AAPL 4888 non-null values GE 4888 non-null values IBM 4888 non-null values JNJ 4888 non-null values MSFT 4888 non-null values PEP 4888 non-null values XOM 4888 non-null values dtypes: float64(8)
ports['JNJ'].plot()
remove_border()
port_return = (ports.shift(1) * RR).sum(1)
plot_returns(port_return, bmk)
The Sharpe ratio of this portfolio is quite high!
def sharpe(RR, ann=250):
    return np.sqrt(ann) * RR.mean() / RR.std()
sharpe(port_return)
0.66534966791660943
Particularly in relation to the benchmark...
sharpe(bmk)
0.37931166467025773
This reflects the concept that the lowest risk stocks in an index actually deliver the highest performance, as opposed to the common perception that one must take on more risk to attain a chance at higher reward.
This also worked very well. But let's do some sentiment analysis to round out our analysis.
First, we load in some standard packages and a couple of packages we wrote ourselves.
#ours
import twitter3 as tw3
import performance1 as perf1
import performance2 as perf2
import sentiment1 as sent
#standard
import pandas as pd
import numpy as np
import datetime
from pattern.web import Element
import requests
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
#import ystockquote as ysq
The next two cells get the ticker symbols of the S&P 500 companies (the stock universe we'll be using) by scraping Wikipedia, and also prepare the list of dates we'll use for our analysis. Notably, the Twitter API only allows us to search back about a week. We've commented out the scraping cell and just load the symbols from a csv file in the second cell.
# spy = requests.get('http://en.wikipedia.org/wiki/List_of_S&P_500_companies').text
# tickers = []
# dom = Element(spy)
# for a in dom('tr td:first-child a'):
#     ticker = a.content
#     if len(ticker) < 5:
#         tickers.append(str(ticker))
# tickers.append('BRK.B')
# tickers.append('CMCSA')
# tickers.append('DISCA')
# tickers = np.sort(tickers)
#tickerlist = tickers
spx = pd.read_csv('SPXSymbolsPD.csv', delimiter=',')
tickerlist = list(spx['Ticker'])
datelist = [datetime.date(2013,12,5),datetime.date(2013,12,6),datetime.date(2013,12,7),datetime.date(2013,12,8),datetime.date(2013,12,9),datetime.date(2013,12,10)]
weekdays_datetime = [datetime.date(2013,12,5),datetime.date(2013,12,6),datetime.date(2013,12,9),datetime.date(2013,12,10)]
weekdays_str = [str(day) for day in weekdays_datetime]
weekend_datetime = [datetime.date(2013,12,7),datetime.date(2013,12,8)]
weekend_str = [str(day) for day in weekend_datetime]
Now we need to use the Twitter API to search for tweets including each stock ticker symbol on each day. We take all the tweets we can find, or 100 tweets, whichever is smaller. We wrote all of the functions in tw3, although we do use other packages (twython and twitter) designed to interface with the Twitter API. We initially made extensive use of twitter and then switched to twython, as it gave us better error handling. Since we want to build this entire data set in one call, we have to handle a wide variety of errors, such as hitting our rate limits and many of the small issues that can crop up during a single API call. We also print extensively so we can monitor the function as it runs. We've commented it out and loaded the data from csv so as to avoid the roughly 1.5-hour process of performing the searches.
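The actual collection logic lives in twitter3.py (tw3); as a rough sketch of the core call, without the retry and rate-limit bookkeeping described above and with placeholder credential names, a single twython search looks roughly like this:

from twython import Twython, TwythonRateLimitError

# placeholder credentials; the real keys are configured inside tw3
APP_KEY = 'YOUR_APP_KEY'
APP_SECRET = 'YOUR_APP_SECRET'
OAUTH_TOKEN = 'YOUR_OAUTH_TOKEN'
OAUTH_TOKEN_SECRET = 'YOUR_OAUTH_TOKEN_SECRET'

api = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
try:
    results = api.search(q='$MMM', count=100, until='2013-12-06')
    tweets = [status['text'] for status in results['statuses']]
except TwythonRateLimitError:
    # tw3.safemultisearch waits and retries when the rate limit is hit
    tweets = []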
#loadfull = tw3.safemultisearch(tickerlist,datelist)
#loadfull.to_csv('data.csv')
data = pd.read_csv('data.csv')
#remove weekend days as the market is closed
data = data[data['date']!=weekend_str[0]]
data = data[data['date']!=weekend_str[1]]
Next we need to check the stock performance on each day, so we wrote another function to add a 0 wherever the stock went down and a 1 wherever it went up. We use the ystockquote package to assist in gathering the stock data. Again, we comment this out to limit ourselves to a single round of data collection and load the results from csv.
#data,performance = perf1.check_performance(data)
#data = data[data['performance']>-1]#only keep if performance is 0 or 1, not NaN
#data.to_csv('data_perf.csv')
data = pd.read_csv('data_perf.csv')
We have a data set! We can go ahead and print part of our dataframe to see what sort of information we've stored, namely the symbol of the company, the date in question, text from the twitter searches, and a performance indicator. We made the choice to treat the entire twitter corpus for each stock from each day as a single bag of words, so the text here is the concatenated tweets (converted to ascii and without punctuation) of the entire result of each particular search. We'll turn it into a proper bag of words a bit later on.
This differs from Rotten Tomatoes, where we kept each review separate; but in that case we also had a "fresh" or "rotten" indicator for each review. Here we only have the stock's performance, which obviously doesn't change depending on which tweet we're looking at, so we just combine all of the tweets into a single bag of words.
print data.irow(0)
print data.irow(0)['text']
Unnamed: 0 0 Unnamed: 0.1 MMM company MMM date 2013-12-05 text MMM Could 3M Dare Trump GE on Dividend Hikes h... performance 0 Name: 0, dtype: object MMM Could 3M Dare Trump GE on Dividend Hikes httptcoG7ukoGT5kECould 3M Dare Trump GE on Dividend Hikes MMM httptco4QBxAilSRDRT MScharts TSN MAT MMM LH Portfolios Pops amp Drops httptco9hMj9HZuCGBarclays has estimated MMM target price at 121 defensive on 3Ms ability to maintain margins httptcofvmByxwClFLaMonicaBuzz MMM Buy on the dip httptcoxVMwKU07SsGeoffrey2313819 Ha Did 3M sponsor that tweet MMMSampP100 Stocks Performance INTC BA SPG NSC EBAY DVN UNP EMC MA GILD F MMM LMT ABBV AAPL more httptcon4QZIDxy7wTSN MAT MMM LH Portfolios Pops amp Drops httptco9hMj9HZuCGOne more small order filled Sold 1 Dec13 MMM 120 put for 37 centsTSN MAT MMM LH Portfolios Pops amp Drops httptcoeB1gLFQgbDTSN MAT MMM LH Portfolios Pops amp Drops httptcokqZTxc7ofATSN MAT MMM LH Portfolios Pops amp Drops httptco2z5afpz5SrAnalyst estimate MMM EPS at 672 which is 2 above second quarter estimates httptcofvmByxwClF httptco5wqB2bnXnFMMM mean target price is 127 as Barclays Credit Suisse and Citi gave stock neutral ratings httptcoDtPqdKgHjrOptionSniper1 On my list to look at GOGO SLW P AAPL IBM OXY MMM IYR TWTR Didnt get to any yday Wanted to sell MSFT callsMMM breaking out3M Co The stock is testing its highs MMM httptcodS64sIMoSi httptcoZE6oMDONO0Pennystock Research on DCIN MMM GFIG XPO SHOR AIN View now httptcotCMpiAEdzGLooking for winners like ENV STXS RNIN BPZ MMM Got to see httptcoxeMnQv7YcRMMM above 12760 can gain steam Volume not impressing meTSN MAT QQQ XLE XLV XOP MS MMM ChartsinPlay Portfolio changes in stops and sell advice article later httptcoCW4YD3lpfBTSN MAT QQQ XLE XLV XOP MS MMM ChartsinPlay Portfolio changes in stops and sell advice article later httptco70ZRkrIBhJTSN MAT QQQ XLE XLV XOP MS MMM ChartsinPlay Portfolio changes in stops and sell advice article later httptcofyoFkbgRZcTSN MAT QQQ XLE XLV XOP MS MMM ChartsinPlay Portfolio changes in stops and sell advice article later httptcoWy7KgCw9YzMMM 3M Analyst Is Right Limited Short To MediumTerm Appeal gt httptcoTuJoiXQMAf stock stocks MMMMMM 3M Co MMM 3M Analyst Is Right Limited Short To MediumTerm Appeal httptcotT8fZdmp0O3M Co MMM 3 Reasons To Buy 3M E I Du Pont De Nemours And Co MMM httptcox9HB2qh8Xs3M Co MMM 3M Analyst Is Right Limited Short To MediumTerm Appeal MMM httptco0ZrdXLtIZB3M Analyst Is Right Limited Short To MediumTerm Appeal httptcoeu5vDkFlui MMM3M Analyst Is Right Limited Short To MediumTerm Appeal httptcoW8SwGHrC3E MMM
Now we can begin making a proper bag of words and performing sentiment analysis. Many of the functions for this are contained in the file sentiment1.py, which we wrote and imported at the top. We start with a make_xy function: X is a matrix in which each row represents a (stock, date) combination and each column represents a particular word, with each element giving the frequency of that word in the bag of words. Y is just the performance indicator (1 for up on the day, 0 for down) that we built and attached to our dataframe earlier.
X, Y = sent.make_xy(data,vectorizer=None)
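For clarity, make_xy follows the same pattern as in the problem set; the following is only a minimal sketch, and the real version in sentiment1.py may differ in its defaults and details.
# Hypothetical sketch of sent.make_xy: bag-of-words matrix X plus performance labels Y
def sketch_make_xy(df, vectorizer=None):
    if vectorizer is None:
        vectorizer = CountVectorizer()  # the real helper may choose different options
    X = vectorizer.fit_transform(df['text'])  # rows = (stock, date) bags of words, columns = words
    X = X.tocsc()
    Y = df['performance'].values
    return X, Y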
Now we use a Naive Bayes classifier as we did in the Bayesian Tomatoes problem set, split our data into a training set and a testing set, and check our accuracy on each.
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.33,random_state=22)
clf = MultinomialNB()
clf.fit(X_train,Y_train)
clf.predict(X_test)
print "accuracy on testing set",1-float(sum(abs(Y_test-clf.predict(X_test))))/float(len(Y_test))
print "accuracy on training set",1-float(sum(abs(Y_train-clf.predict(X_train))))/float(len(Y_train))
accuracy on testing set 0.516279069767
accuracy on training set 0.973221117062
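(Equivalently, since the labels are binary, the same mean accuracies can be read off with the classifier's built-in scorer.)
# Equivalent accuracy calculation using sklearn's built-in mean-accuracy scorer
print "accuracy on testing set", clf.score(X_test, Y_test)
print "accuracy on training set", clf.score(X_train, Y_train)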
So how did we do? At first glance, 51.6% is not very good. On the other hand, even a small edge in finance can be useful. But do we have an edge?
print "fraction that are positive on the day",float(len(Y_test[Y_test==1]))/float(len(Y_test))
fraction that are positive on the day 0.517829457364
We actually did worse than just assuming every stock went up! It turns out that we didn't gain any edge at all. Particularly considering how high our accuracy was on the training set and how poor it is on the testing set, it looks like we have serious over-fitting issues, likely due to the small size of our data set, the sparsity of our word frequencies, and a potentially weak underlying connection between tweets and stock performance. For example, someone might tweet "$GOOG is doing great today!", but not all tweets are as clear. Unlike movie reviews, the explicit purpose of a tweet is not to say whether a stock is doing well or poorly, the way a review is explicit about whether a movie is good or bad.
We'll try fitting for the best alpha and min_df parameters as we did in the problem set to see if this resolves these issues, although if the underlying problems are in the data we wouldn't expect this to work.
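The grid search below relies on two helpers from sentiment1.py, cv_score and log_likelihood, adapted from the problem set. The following is only a rough sketch of what they do; the real versions may differ slightly.
# Hypothetical sketch of the cross-validation helpers in sentiment1.py
from sklearn.cross_validation import KFold

def sketch_log_likelihood(clf, x, y):
    # total log P(observed label | text), summed over documents (binary up/down labels)
    prob = clf.predict_log_proba(x)
    down = (y == 0)
    up = ~down
    return prob[down, 0].sum() + prob[up, 1].sum()

def sketch_cv_score(clf, x, y, scorefunc, n_folds=5):
    # average the score over held-out folds
    result = 0.0
    for train, test in KFold(len(y), n_folds=n_folds):
        clf.fit(x[train], y[train])
        result += scorefunc(clf, x[test], y[test])
    return result / n_folds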
alphas = [0, .1, 1, 5, 10, 50, 100, 150, 200]
min_dfs = [1e-6, 1e-7, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
#Find the best value for alpha and min_df, and the best classifier
best_alpha = None
best_min_df = None
max_loglike = -np.inf
for alpha in alphas:
    print alpha  # this takes a while to run, so print alpha to monitor progress
    clf = MultinomialNB(alpha=alpha)  # depends only on alpha, so build it once per outer iteration
    for min_df in min_dfs:
        vectorizer = CountVectorizer(min_df=min_df)
        X, Y = sent.make_xy(data, vectorizer)
        loglike = sent.cv_score(clf, X, Y, sent.log_likelihood)
        #print loglike
        if loglike > max_loglike:
            max_loglike = loglike
            print "max_loglike", max_loglike, "alpha", alpha, "min_df", min_df
            best_alpha = alpha
            best_min_df = min_df
print "best max_loglike",max_loglike,"best alpha",best_alpha,"best min_df",best_min_df
0
max_loglike -701.863599147 alpha 0 min_df 0.1
0.1
max_loglike -701.698320488 alpha 0.1 min_df 0.1
1
max_loglike -700.218818186 alpha 1 min_df 0.1
5
max_loglike -693.812946439 alpha 5 min_df 0.1
10
max_loglike -686.175301985 alpha 10 min_df 0.1
50
max_loglike -636.827815439 alpha 50 min_df 0.1
100
max_loglike -595.120648905 alpha 100 min_df 0.1
150
max_loglike -566.893075242 alpha 150 min_df 0.1
200
max_loglike -547.192426876 alpha 200 min_df 0.1
best max_loglike -547.192426876 best alpha 200 best min_df 0.1
We can see that the best-fit parameters are absurd, particularly the very high best_min_df, so it looks like we can't salvage the model this way. But is there anything else we can do?
In the case of movie reviews, Rotten Tomatoes compresses "thumbs up", "four out of five stars", and other such positive reviews down to just "fresh". Similarly, we took all positive days and converted them to "up", and all negative days to "down". Seeing as we're up against a shortage of data, it makes sense to drop this strategy (which is what allowed us to do binary classification earlier) and instead keep more detailed performance information.
We rebuild the performance indicator portion of our data set, this time looking at the fractional gain, clipping it from -1 to 1, and then transforming it to the range 0 to 1; under this scaling a gain of 3% comes out to .56. Again, we comment out the code to build the data set and just read the results in from our previous efforts (a sketch of the scaling appears after the next cell).
#data = pd.read_csv('data.csv')
#data = data[data['date']!=weekend_str[0]]
#data = data[data['date']!=weekend_str[1]]
#data,performance = perf2.check_performance_scaled(data)
#data = data[data['performance']>=0]#only keep if performance is 0 to 1, not NaN
##data.to_csv('data_perf_scaled.csv')
data = pd.read_csv('data_perf_scaled.csv')
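For concreteness, here is a hypothetical sketch of the kind of scaling perf2.check_performance_scaled applies. The numbers quoted in this write-up (.56 for a 3% gain, .52 for a 1% gain) are consistent with centering flat days at 0.5 and clipping large moves, but the exact constants live in perf2.py.
# Hypothetical sketch of the scaled performance indicator (exact constants may differ)
def sketch_scaled_performance(frac_gain):
    # map fractional gain to [0, 1]: 0.5 = flat day, extreme moves saturate at 0 or 1
    scaled = np.clip(2.0 * frac_gain, -0.5, 0.5)
    return scaled + 0.5

print sketch_scaled_performance(0.01)  # 0.52
print sketch_scaled_performance(0.03)  # 0.56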
As in our previous model, we now run make_xy. Again, X will be our bag-of-words matrix, and Y will be our performance indicators (which, recall, are now continuous and based on the gain/loss).
X, Y = sent.make_xy(data,vectorizer=None)
In the binary classification model, we ran this through a multinomial naive Bayes model to predict the probability of up/down for each stock. Since we aren't giving our model a binary label anymore, that framing no longer applies. However, we can still use a MultinomialNB model: scikit-learn will treat each distinct performance value it sees in training as its own class, so instead of predicting the probability of 1/0, the model outputs a prediction of performance itself (such as, say, .52 for a 1% gain) based on the elements of our bag of words. We illustrate this just after the next cell.
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.33,random_state=22)
clf = MultinomialNB()
clf.fit(X_train,Y_train)
#clf.predict(X_test)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
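Under the hood, scikit-learn treats each distinct performance value seen in training as its own class; a quick, purely illustrative way to convince yourself of that:
# Each distinct performance value in Y_train becomes its own class
print "number of classes:", len(clf.classes_)
print "smallest few class values:", clf.classes_[:5]  # values near 0 are the biggest losers
print "sample predictions:", clf.predict(X_test[:3])  # each prediction is one of those observed values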
To check the accuracy of our model, we select the stocks for which it predicts some movement (the threshold below is any nonzero predicted move; it could be tightened to, say, a .2% movement) and then check whether the predicted direction is correct.
# testing set: keep only the stocks with a nonzero predicted move
predvec = clf.predict(X_test)
indexpred = np.abs(predvec-.5)>0
# collapse actual and predicted performance to a direction: 1 = up, 0 = down
actual = np.divide(Y_test[indexpred]-.5,np.abs(Y_test[indexpred]-.5))*.5+.5
predict = np.divide(predvec[indexpred]-.5,np.abs(predvec[indexpred]-.5))*.5+.5
print "accuracy on testing set",1-float(sum(abs(actual-predict)))/float(len(actual))
print "portion of stocks with motion going up in testing set",float(len(predict[predict>0]))/float(len(predict))
# repeat the same check on the training set
predvec = clf.predict(X_train)
indexpred = np.abs(predvec-.5)>0
actual = np.divide(Y_train[indexpred]-.5,np.abs(Y_train[indexpred]-.5))*.5+.5
predict = np.divide(predvec[indexpred]-.5,np.abs(predvec[indexpred]-.5))*.5+.5
print "accuracy on training set",1-float(sum(abs(actual-predict)))/float(len(actual))
accuracy on testing set 0.646341463415
portion of stocks with motion going up in testing set 0.40243902439
accuracy on training set 0.981389578164
And we've succeeded! We still have very high accuracy on our training set, indicating overfitting issues, but the model nonetheless gives us useful results on our testing set, enabling us to gain an edge. The edge is not massive: we get our predictions right about 65% of the time, against a 40/60 split between up stocks and down stocks (among the stocks with sufficient movement), so we have gained an edge of several percentage points, which is considered very significant in the realm of finance.
In our binary classification model, we could at this point have checked how over-fit the model was by binning its predictions (10%-20% up, 20%-30% up, and so on) and seeing whether our accuracy in each bin matched the predicted probabilities. Unfortunately, one cost of using our Multinomial Naive Bayes model in this way is that we no longer have probability predictions, just direction predictions. In essence, we've lost our error term, which is always something we'd like to avoid when possible in data analysis. This also makes it difficult to tune the model, for example by carefully adjusting the min_df or alpha parameters. Still, if losing tunability and error analysis is the price of going from pure noise to a useful prediction, we'll take it.
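For reference, the calibration check we would have run on the earlier binary up/down model looks roughly like the sketch below; it is hypothetical, and the clf, X, and Y passed in would be the earlier binary classifier and its bag-of-words data.
# Hypothetical calibration check for the binary model: bin predicted P(up) and
# compare each bin's prediction with the observed fraction of up days
def sketch_calibration(clf, x, y, n_bins=10):
    p_up = clf.predict_proba(x)[:, 1]
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p_up >= lo) & (p_up < hi)
        if mask.sum() == 0:
            continue
        print "predicted %.0f%%-%.0f%% up:" % (100*lo, 100*hi), \
              "observed up fraction", y[mask].mean(), "n =", mask.sum()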
One way we can check the reasonableness of our model, though, is by checking some words with strong negative and positive indications, as we did with Rotten Tomatoes.
vec = CountVectorizer(min_df = 1e-3)
#vec = CountVectorizer()
text = [words for i,words in data.text.iteritems()]
vec.fit(text)
words = np.array(vec.get_feature_names())
singles = np.eye(len(words))
X_df, Y_df = sent.make_xy(data,vectorizer=vec)
X_train_df,X_test_df,Y_train_df,Y_test_df = train_test_split(X_df,Y_df,test_size=0.33,random_state=22)
clf_df = MultinomialNB()
clf_df.fit(X_train_df,Y_train_df)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
#clf_df = clf
negativeindices = clf_df.predict_proba(sparse.csc_matrix(singles))[:,0].argsort()
positiveindices = clf_df.predict_proba(sparse.csc_matrix(singles))[:,1].argsort()
badwords=words[negativeindices]
badprobs = clf_df.predict_proba(singles)[negativeindices,1]
badprobs=badprobs[::-1]
badwords=badwords[::-1]
goodwords=words[positiveindices]
goodwords=goodwords[::-1]
goodprobs = clf_df.predict_proba(singles)[positiveindices,1]
goodprobs = goodprobs[::-1]
print "negative words",[(badwords[i],badprobs[i]) for i in range(10)]
print "positive words",[(goodwords[i],goodprobs[i]) for i in range(10)]
negative words [(u'dva', 0.00065429865287221971), (u'davita', 0.00070301231619910383), (u'healthcare', 0.0006835894031860572), (u'partners', 0.0013194339424065626), (u'dialysis', 0.00072809514936371226), (u'dvadva', 0.00072592477320190742), (u'scan', 0.00072375005946349695), (u'rated', 0.00071782864597029678), (u'patients', 0.00071095170136868311), (u'health', 0.00066552469716559491)]
positive words [(u'cpb', 0.032144008476641339), (u'soup', 0.013377734519007296), (u'options', 0.011244340670651403), (u'campbell', 0.010575715690765052), (u'day', 0.008966277117938088), (u'today', 0.0081946750241469728), (u'keeneonmarket', 0.007881447433107246), (u'over', 0.0078077819981858913), (u'iv', 0.00693413576281382), (u'up', 0.0068814076007903853)]
On both sides, it looks like many of the words were specific to just one or two stocks in our set. We also have to recall that we're not predicting probabilities of going up or down but a magnitude in a direction, so one stock that really tanked on one day could easily dominate our most extreme words. This seems to be the case with the many pharmaceutical- and health-related words in our negative category. On the positive side, in addition to a few words that seem stock specific, we do see some general terms we might have expected, such as "options" and "up".
This is a long way from a predictive tool that could make money in real time, and it's not clear how it stacks up against the other options we explored. We also had the advantage of compiling all of the tweets from each day, including tweets about stock behavior that had already happened. Still, even in its current form it provides an edge of several percentage points, and it shows potential to be honed into a more real-time tool should one desire.
We looked at an array of different solutions. We showed how we collected data, we showed some of our attempts that failed, we applied a slew of statistical analyses to each, and we showed three great ways to make money (and one not-so-good one).
In sum, each of our solutions can be expanded, but we seem to have stumbled upon a few good ones, and we are happy with our results.