INTRODUCTION¶

There are several ways of making money in the stock market; here we explore a few of them using techniques from data science and finance. We explore alpha generation from the perspectives of:

leveraged finance (ETF decay) and similar tricks of the trade
digital signal processing
portfolio construction using MPT (Modern Portfolio Theory)
bayesian Natural Language Processing using Twitter data, attempting to perform sentiment analysis

Initial Question¶

Thus our initial question is: Which of the above methods will generate alpha, and of course what are the caveats.

Motivation¶

The motivation of this project is to apply knowledge from 209 to finance. The motivation is to make money. We not only want to just maximize returns, but also minimize volitility. You will see through the statistics we employ, that we focused on these two ideas. That is our motivation.

In terms of related work. Many of our funcitons are derived from other works (like the butterworth filter). We have sited them in our .py files. In addition we have used what we learned on bayesian tomatoes quite extensively. Most of the other work we have done, comes from our own experience.

Data Munging¶

We colleted the initial data from Quandl, and have included it as files in the .zip. In addition, we collected twitter data. That process is descibed below.

Exploratory Analysis¶

Grappling with timeseries data is essential to our task of stock price prediction, since a stock traces out a series of datapoints with regard to time. Fortunately, pandas includes an API for time-series manipulations. In the following, we introduce some exploratory statistics and methodology and begin to analyze stock data from a returns perspective.

The pandas time-series API includes:

1.) Date ranges from files and from scratch
2.) Shift, resample, and filter manipulations
3.) Field accessors
4.) Plotting
5.) Time zones (localization and conversion)
6.) Dual representations (point-in-time vs interval)

Let's import some code. From datetime, we need

datetime
date
time

From pandas, let's get the timeseries tools:

Series
DataFrame
Panel

In addition we need:

matplotlib
sys

In [1]:

# Code to allow inline plotting, derived from psets

import warnings
warnings.filterwarnings('ignore')
from helperfile import *
plt.rc('figure', figsize=(10, 6))

%matplotlib inline

import json

import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 30)

# set some nicer defaults for matplotlib
from matplotlib import rcParams

#these colors come from colorbrewer2.org. Each is an RGB triplet
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
                (0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
                (0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
                (0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
                (0.4, 0.6509803921568628, 0.11764705882352941),
                (0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
                (0.6509803921568628, 0.4627450980392157, 0.11372549019607843),
                (0.4, 0.4, 0.4)]

rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.grid'] = False
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'none'


def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chartjunk by stripping out unnecessary plot borders and axis ticks
    
    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    
    #turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    
    #now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()

Let's sample some data from the SPX.csv data file, sourced from Yahoo Finance S&P 500.

In [2]:

with open('SPX.csv', 'r') as ff:
    print ff.readline() # headers
    print ff.readline() # first row

DATE,OPEN,HIGH,LOW,CLOSE,VOLUME,ADJ

2013-09-13,169.13,169.46,168.74,169.33,72459700,169.33

We can use lists or dicts for parsing dates.

In [3]:

data = pd.read_csv('SPX.csv',parse_dates={'Timestamp': ['DATE']},index_col='Timestamp')
data

Out[3]:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5195 entries, 2013-09-13 00:00:00 to 1993-01-29 00:00:00
Data columns (total 6 columns):
OPEN      5195  non-null values
HIGH      5195  non-null values
LOW       5195  non-null values
CLOSE     5195  non-null values
VOLUME    5195  non-null values
ADJ       5195  non-null values
dtypes: float64(5), int64(1)

Let's take a look at some of these more recent values:

In [4]:

history = data.ix[:, ['OPEN', 'VOLUME']]
history.head()

Out[4]:

	OPEN	VOLUME
Timestamp
2013-09-13	169.13	72459700
2013-09-12	169.34	83098200
2013-09-11	168.64	94272700
2013-09-10	168.64	102432800
2013-09-09	166.45	83392900

Compute a VWAP (Volume Weighted Avereage Price, a weighted avereage of price with volume as the weights, often used as an anchoring quantity for algorithmic traders) using resample.

In [5]:

volume = history.VOLUME.resample('1w', how='sum') # weekly vwap
value = history.prod(axis=1).resample('1w', how='sum')
vwap = value / volume
print vwap

Timestamp
1993-01-31    43.970000
1993-02-07    44.554778
1993-02-14    44.845201
1993-02-21    44.021129
1993-02-28    43.816993
1993-03-07    44.786620
1993-03-14    45.346889
1993-03-21    45.133627
1993-03-28    44.827772
1993-04-04    45.121409
1993-04-11    44.453826
1993-04-18    44.844166
1993-04-25    44.560004
1993-05-02    43.723734
1993-05-09    44.391409
...
2013-06-09    163.200245
2013-06-16    163.581478
2013-06-23    162.626114
2013-06-30    159.262948
2013-07-07    161.399900
2013-07-14    165.689620
2013-07-21    168.265996
2013-07-28    169.031879
2013-08-04    169.393906
2013-08-11    169.890286
2013-08-18    167.850913
2013-08-25    165.400218
2013-09-01    164.397174
2013-09-08    165.588047
2013-09-15    168.435810
Freq: W-SUN, Length: 1077, dtype: float64

We can conveniently index financial time series data as follows.

In [6]:

vwap.ix['1993-01-31':'2013-09-15']

Out[6]:

Timestamp
1993-01-31    43.970000
1993-02-07    44.554778
1993-02-14    44.845201
1993-02-21    44.021129
1993-02-28    43.816993
1993-03-07    44.786620
1993-03-14    45.346889
1993-03-21    45.133627
1993-03-28    44.827772
1993-04-04    45.121409
1993-04-11    44.453826
1993-04-18    44.844166
1993-04-25    44.560004
1993-05-02    43.723734
1993-05-09    44.391409
...
2013-06-09    163.200245
2013-06-16    163.581478
2013-06-23    162.626114
2013-06-30    159.262948
2013-07-07    161.399900
2013-07-14    165.689620
2013-07-21    168.265996
2013-07-28    169.031879
2013-08-04    169.393906
2013-08-11    169.890286
2013-08-18    167.850913
2013-08-25    165.400218
2013-09-01    164.397174
2013-09-08    165.588047
2013-09-15    168.435810
Freq: W-SUN, Length: 1077, dtype: float64

In [7]:

selection = vwap.between_time('1993-01-31', '2013-09-15')
vol = pd.Series(volume.between_time('1993-01-31', '2013-09-15'),dtype=float)
#vol.head(20)
# attempt to fill possible missing data
selection.ix['1993-01-31':'2013-09-15'].head(20)
filled = selection.fillna(method='pad', limit=1)
filled.ix['1993-01-31':'2013-09-15'].head(20)
vol = vol.fillna(0.)
#vol.head(20)
# now for visualization, plot the volume weighted avereage price, along with volume, with respect to time. 
vol.plot(style='y')
filled.plot(secondary_y=True, style='r')
remove_border()

Regression analysis of S&P 500 data¶

In [8]:

BB = history.OPEN.resample('1w', how='ohlc')
week_range = BB.high - BB.low
#week_range.describe()
week_return = BB.close / BB.open - 1
#week_return.describe()
#week_return.head()
MM = week_return.between_time('1993-01-31', '2013-09-15')
#lagged = MM.shift(1)
lagged = week_return.tshift(1, 'w').between_time('1993-01-31', '2013-09-15')
# Ordinary least squares
pd.ols(y=MM, x=lagged)

Out[8]:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         1076
Number of Degrees of Freedom:   2

R-squared:         0.0012
Adj R-squared:     0.0003

Rmse:              0.0222

F-stat (1, 1074):     1.2743, p-value:     0.2592

Degrees of Freedom: model 1, resid 1074

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -0.0344     0.0305      -1.13     0.2592    -0.0942     0.0254
     intercept     0.0014     0.0007       2.07     0.0383     0.0001     0.0027
---------------------------------End of Summary---------------------------------

And VWAP:¶

In [9]:

MM = vwap / BB.open - 1
MM = MM.between_time('1993-01-31', '2013-09-15')
lagged = MM.tshift(1, 'w').between_time('1993-01-31', '2013-09-15')
pd.ols(y=MM, x=lagged)

Out[9]:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         1076
Number of Degrees of Freedom:   2

R-squared:         0.0018
Adj R-squared:     0.0009

Rmse:              0.0122

F-stat (1, 1074):     1.9175, p-value:     0.1664

Degrees of Freedom: model 1, resid 1074

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -0.0422     0.0305      -1.38     0.1664    -0.1020     0.0175
     intercept     0.0003     0.0004       0.91     0.3610    -0.0004     0.0011
---------------------------------End of Summary---------------------------------

Intro to Leveraged Finance: Decaying 3X Leveraged ETFs.¶

So the S&P did very well this year. Why can't we just invest into a 3x ETF and make 60 percent a year!

What are leveraged ETFs? First, ETFs (Exchange Traded Funds) and financial instruments that trade similarly to stocks, and attempt to replicate the return of a basket of other assets. In principle, it would require many costly transactions for an investor to replicate the exposure of an ETF, so they are seen as convenient cost saving tools (Example: one SPY purchase gives you the same exposure as buying every stock in the S&P 500). Some ETFs purport to replicate a multiple of the return of the underlying asset; these are known as Leveraged ETFs. Leveraged ETFs have an interesting behavior; many market participants have noted their long-term downward tendency. This is because they have an asymmetric distribution with a large chance of decay. While the chance of decay is large, the expected value of each timestep is the same as for a similar nonleveraged instrument. Let's attempt to demonstrate this through a Monte-Carlo approach.

Use randn to simulate historical data###¶

We want an etf security with normal historical returns. Average daily return should therefore be 0, with a standard deviation of 1%.

In [10]:

# numpy/matplotlib
%pylab inline
history = randn(10000)
plot(randn(10000)) #plot the 10000-day historical price vector.
title('returns') 
xlabel('day')
ylabel('return [%]')
remove_border()

Populating the interactive namespace from numpy and matplotlib

WARNING: pylab import has clobbered these variables: ['load', 'info', 'mpl', 'datetime', 'save', 'set_printoptions', 'unique']
`%pylab --no-import-all` prevents importing * from pylab and numpy

In [11]:

h = hist(history,100)
print 'Mean:' , history.mean()
print 'Standard deviation:%.2f' % history.std()

Mean: -0.00286332404091
Standard deviation:1.01

This is very near mean 0, standard deviation 1.

Starting with an investmnt of k dollars, on day n our investment would then have a value of $k*r[t]*r[t+1]*...r[n]$. We can simulate these returns by using the cumulative product function.

In [12]:

price = 100*(history/100+1).cumprod() # normalized price with a starting asset base of $100.
plot(price) # cumulative sum
title('normalized price')
xlabel('day number') 
ylabel('price [$]')
remove_border()

Now employ a monte carlo approach, creating many simultaneous price paths.

In [13]:

history = randn(1000,5000)
prices = 100*(history/100+1).cumprod(axis=0) # normalized price along axis
h = plot(prices[:,:300]) # plot first 300
xlabel('day number')
ylabel('price [$]')
remove_border()

Let's make a histogram of the last day of returns. This is the thousandth day, which is simply the end of the prices matrix.

In [14]:

finalPrice = prices[-1,:] # get last row of prices table
h = hist(finalPrice,50) # plot histogram
xlabel('price [$]')
ylabel('frequency [#]')
print 'Mean:', finalPrice.mean()
print 'Std:', finalPrice.std()

Mean: 99.3375039092
Std: 32.1103764376

What about a leverage multiplier?¶

In [15]:

history3x = history*3. # leveraged returns
prices3x = 100*(history3x/100+1).cumprod(axis=0) # normalized price
finalPrice3x = prices3x[-1,:] # last
h1 = hist(finalPrice3x,500) # histogram 1
h2 = hist(finalPrice,100) # histogram 2
xlim([0,400]) # setting limit of x axis
legend(['3 times','1 times'])
print 'Mean 3x:', finalPrice3x.mean()

Mean 3x: 98.0867193128

The mean is close to our initial asset base, but the distribution is skewed to the right. No longer a normal distribution. Therefore, an equal probability of going up or down in percentage terms does not change long term expected value. The resulting distribution is however skewed involving a price decline more often, but it is compensated by the fat tails on the up side.

Note what is happening here: the fat tail of the leveraged ETF is so enormous that while the peak of the distribution is below 50, the long term mean is still at 100. This suggests that the naive strategy of shorting levereaged ETFs is not likely to be as profitable over the long term as it might at first appear.

So okay, our first approach did not work, but now to our next approach.

Signal Processing¶

In [16]:

warnings.filterwarnings('ignore')
import signal_processing as sp
sp.example()

[2013-12-12 20:04] DEBUG: Transform: Running StatefulTransform [short_mavg]
[2013-12-12 20:04] DEBUG: Transform: Running StatefulTransform [long_mavg]
[2013-12-12 20:04] DEBUG: Transform: Finished StatefulTransform [long_mavg]
[2013-12-12 20:04] DEBUG: Transform: Finished StatefulTransform [short_mavg]
[2013-12-12 20:04] INFO: Performance: Simulated 253 trading days out of 253.
[2013-12-12 20:04] INFO: Performance: first open: 1990-01-02 14:31:00+00:00
[2013-12-12 20:04] INFO: Performance: last close: 1990-12-31 21:00:00+00:00

IBM

Dual Moving Average Cross (DMAC)¶

The concept of a DMAC is fairly straightforward. Calculate two moving averages of the price of a security, or in this case exchange rates of a currency. One average would be the short term (ST) (strictly relative to the other moving average) and the other long term (LT). Mathematically speaking, the long term moving average (LTMA) will have a lower variance and will move in the same direction as the short term moving average but at a different rate. The different rates of direction, induces points where the values of the two moving averages may equal and or cross one another. These points are called the crossover points.

In the dual moving average crossover trading strategy, these crossovers are points of decision to buy or sell the desired product. What these crossover points imply depends on the approach the investor has in their strategy. There are two schools of thought: Technical and Value.

The Technical Approach suggests that when the Short Term Moving Average (STMA) moves above the LTMA, that represents a Buy (or Long) signal. (Conversely, when the STMA moves below the LTMA, the Technical Approach indicates a Sell (or Short) signal.) The intuition behind this strategy can be explained in terms of momentum. Basically, the principle of momentum states that a price that is moving up (or down) during period t is likely to continue to move up (or down) in period t+1 unless evidence exists to the contrary. When the STMA moves above the LTMA, this provides a lagged indicator that the price is moving upward relative to the historical price. Buy high, sell higher.

The Value Approach offers the opposite trading signals to the Technical Approach. The Value Approach claims that when the STMA crosses from below to above the LTMA, that the product is now overvalued, and should be sold. Conversely when the currency STMA moves below the LTMA then the product is undervalued it should be bought. The intuition behind the Value Approach can be thought simply as a mean reversion approach. Buy low (value), sell high (overvalued).

Both strategies try to achieve the same goal, but do it in opposing ways to one another. In this paper, we will analyze both the technical and value strategies as applied to the SPY with the periods of 2 and 5.

Let us better understand what we are doing. We are looking for the underlying signal from a stock's price. We are then taking a value based approach and a technical approach using these signals. But what is a signal. Signals are tools for estimating future value of an asset based on its previous behavior, e.g. predict the price of AAPL stock based on its previous price movements for that hour, day or month. But what does it do? Consider below:

In [17]:

# make a new dataframe from the original dataset 
prices = pd.DataFrame(data.reindex( index=data.index[ ::-1 ] ).OPEN)
prices.plot()
remove_border()

Notice how choppy the graph appears. It is hard to tell which direction the S&P is moving. Now consider applying a simple moving average (see included python files: filters.py).

In [18]:

prices['SMA50'] = sp.sma(prices.OPEN,50)
prices['SMA200'] = sp.sma(prices.OPEN,200)
prices.plot()
remove_border()

Now what you see is a smoother set of curves. Notice how the 50 period SMA follows the actual prices more closely than the SMA 200 curve. And the 200 SMA follows it more loosely. Why? The 200 curve looks a prices up to 200 days ago, vs the 50 which only looks 50 days back. That is why the SMA 50 reacts more quickly.

Now that we understand what the true values/underlying signals of the market are. But how do we use this? Now we can implement a DMAC algorithm that based on these signals will produce equity curves (how much we will have made). Consider the code below.

In [19]:

equity_curves = pd.DataFrame(sp.dmac(prices.OPEN,prices.OPEN,sp.sma,2, 5, 5, True),columns =['Value SMA'],index=prices.index)
equity_curves.plot(); remove_border()

Well hot dog! Looks like we have a winner. Well we should just go invest in this! It gives us 10x returns! Well wait, let's first put this in prespective

In [20]:

equity_curves['Technical SMA'] = sp.dmac(prices.OPEN,prices.OPEN,sp.sma,2, 5, 5, False)
equity_curves['Baseline'] = 100*prices.OPEN/prices.OPEN[0]

equity_curves.plot(); remove_border()

Okay, this is more reasonalble. It seems that we have beaten the baseline in our Value base approach (described above) and lost nearly all of our money with our Technical appraoch. But generally speaking just eyeballing our results is not the best strategy, there are certain statistics that we should veiw:

Statistics

Standard and Poor's (SPY) 500 Total Return Index is a capitalization-weighted index of 500 stocks. The index is designed to measure performance of the broad domestic economy through changes in the aggregate market value of 500 stocks representing all major industries. The index was developed with a base value of 10 for the 1941-1943 base period.

Annualized Standard Deviation is a statistical measure of the degree to which an individual value in a probability distribution tends to vary from the mean of the distribution. It is widely applied to Modern Portfolio Theory for example, where the past performance of securities is used to determine the range of possible future performances and a probability is attached to each performance. The standard deviation of performance can then be calculated for each security and for the portfolio as a whole. The greater the dispersion, the greater the risk.

Sharpe Ratio is the amount of reward per unit of risk. The higher the Sharpe Ratio, the more incremental return is added per increase in risk as measured by standard deviation.

Sortino Ratio is a ratio that differentiates between good and bad volatility in the Sharpe ratio. This differentiation of upwards and downwards volatility allows the calculation to provide a risk-adjusted measure of a security or fund's performance without penalizing it for upward price changes.

Alpha is a measure of excess return, or the amount by which a portfolio outperforms a benchmark, such as the T-bill rate. For example, an alpha of 1.20 indicates that a fund is expected to rise 20% during a given time period when the return on the market (and the fund's beta) are zero.

Beta is the measure of a fund's volatility relative to the market. (Many fund managers correlate themselves to the SPY 500.) A beta of greater than 1.0 indicates that the fund is more volatile than the market, and less than 1.0 is less volatile than the market. For example, if the market rises 1% and a fund has a beta equal to 2.5, then the fund is likely to rise faster than the market ( and conversely fall faster than the market when the market falls).

Correlation is the degree to which the movements of two variables are related. The correlation coefficient is expressed as a value between -1 and +1. For example,a correlation coefficient of 1 would suggest that two asset classes are perfectly correlated, a coefficient of zero suggests no relationship, and a coefficient of -1 suggests an inverse relationship.

R-squared is a statistical measure that represents the percentage of a fund or security's movements that can be explained by movements in a benchmark index.

Maximum Drawdown is the maximum peak-to-trough percentage decline in value experienced during the given period.

We will begin by examining the statistics on our baseline.

In [21]:

stats = sp.performance(equity_curves['Baseline'], time_frame= 251)
pd.DataFrame([stats.values()],columns = stats.keys(),index=['Baseline'])

Out[21]:

	sharp	std	sortino	car3	car1	cret	pos	car5	aret	max	ave	ytd	1000
Baseline	0.4506	0.169088	0.868786	14.479322	17.134151	284.648624	0.526083	5.947035	6.967836	56.70044	0.033569	17.134151	3846.486241

So the baseline model (meaning let's just invest in the S&P 500) does pretty well. It gives us nearly 4x! But what else... The max drawdown was 54%. Pretty bad, during the morgage crisis the market really sank. The anuallized return was 7% a year! The sharp ratio (usually looked to be good at .5) was a bit above .45. But lets look at the statistics of all three combined.

In [22]:

stats = []
for strategy in ['Baseline','Value SMA','Technical SMA']:
    stats.append( sp.performance(equity_curves[strategy], time_frame= 251))
    
pd.DataFrame([stat.values() for stat in stats],columns = stats[0].keys(),index=['Baseline','Value SMA','Technical SMA'])

Out[22]:

	sharp	std	sortino	car3	car1	cret	pos	car5	aret	max	ave	ytd	1000
Baseline	0.450600	0.169088	0.868786	14.479322	17.134151	284.648624	0.526083	5.947035	6.967836	56.700440	0.033569	17.134151	3846.486241
Value SMA	0.857395	0.159222	16.518197	12.628268	4.673404	1021.574147	0.337055	14.389879	12.847359	24.042366	0.051072	4.673404	11215.741466
Technical SMA	-1.141516	0.109234	-1.163120	-12.432710	-5.102461	-94.240220	0.307796	-15.226719	-13.299761	20.924733	-0.050313	-5.102461	57.597798

Now we can make some comparisons. First we notice that the Technical SMA did quite poorly as compared to the other strategies. In almost all regards. It lost money. Next is the Value SMA. It made more money. And in fact it is less risky, there is less volitility. We can see this by looking at the sharp and sortino ratios. But then two questions arise... Did we choose the correct period, and did we choose the correct filter. Let's deal with the second question first.

Let us deal with the first of these questions. We have done research, and the periods of 2 and 5 were recommended, but let us check a braoder range of periods.

In [23]:

from mpl_toolkits.mplot3d import Axes3D

values = []
for short_period in range(1,5):
    for long_period in range(short_period + 1, short_period + 5):
        values.append([short_period, long_period,sp.dmac(prices.OPEN,prices.OPEN,sp.sma,short_period, long_period, 5, True,end_value=True)])
        
X,Y,Z = zip(*values)

print max(values,key=lambda x:x[2])

fig = plt.figure()
ax = Axes3D(fig)
ax.plot_trisurf(X,Y,Z)

[2, 5, 1121.5741465850183]

Out[23]:

<mpl_toolkits.mplot3d.art3d.Poly3DCollection at 0x1632d8d0>

So yes, our research worked out, in fact the 2 and 5 periods lead to the best overall gain. So the final question we will ask is: Is our filter right? We will focus on four different types of filters (all described and coded in signal_processing.py). The filters are

SMA (simple moving average): This is the coarsest method of finding a trend. It weights past events the same as current events, which means it is relativly unresponsive
EMA (exponential moving average): This uses exponential weighting. Thus more recent events are weighted more heavily.
Gaussian Filter: This is a n order exponential moving average and is better able to capture price displacment because of events
Butterworth Filter: More sophisticated infinite impulse response filter. It is a higher oder gaussian with two poles.

Let us then do a naive optimization over all of these filters to find their optimal periods, and then veiw some statistics.

In [24]:

for filt,name in [(sp.ema,'EMA'),(sp.gaussian,'Gaussian'),(sp.butterworth,'Butterworth')]:
    values = []
    for short_period in range(1,5):
        for long_period in range(short_period + 1, short_period + 5):
            values.append([short_period, long_period,sp.dmac(prices.OPEN,prices.OPEN,filt,short_period, long_period, 5, True,end_value=True)])

    optimal = max(values,key=lambda x:x[2])
            
    print "%s Value, Returns: %d, Periods: %d, %d" % (name, optimal[2], optimal[0], optimal[1])
    
    values = []
    for short_period in range(1,5):
        for long_period in range(short_period + 1, short_period + 5):
            values.append([short_period, long_period,sp.dmac(prices.OPEN,prices.OPEN,filt,short_period, long_period, 5, False,end_value=True)])

    optimal = max(values,key=lambda x:x[2])
            
    print "%s Technical, Returns: %d, Periods: %d, %d" % (name, optimal[2], optimal[0], optimal[1])
            

EMA Value, Returns: 499, Periods: 1, 5
EMA Technical, Returns: 59, Periods: 4, 8
Gaussian Value, Returns: 457, Periods: 2, 3
Gaussian Technical, Returns: 42, Periods: 2, 6
Butterworth Value, Returns: 101, Periods: 1, 2
Butterworth Technical, Returns: 98, Periods: 1, 2

So we are done right? Not quite yet. So it may seem like SMA preforms the best out of all of the filters above. But we have not looked at their statistics. Let us veiw the best of the best, and compare SMA, EMA, and the baseline.

In [25]:

    equity_curves['Value EMA'] = sp.dmac(prices.OPEN,prices.OPEN,sp.ema,1, 5, 5, True)
equity_curves['Value Gaussian'] = sp.dmac(prices.OPEN,prices.OPEN,sp.gaussian,2, 3, 5, True)
del equity_curves['Technical SMA']
equity_curves.plot()
remove_border()

Before we move on to the stats, let's take a look at what we have here. The most curious thing to note is that the EMA did not do too much better than the baseline. Oddly enough the EMA seemed to do best in times of economic downturn. Well, lets move onto the statistics, and comparative statistics.

In [26]:

stats = []
for strategy in ['Baseline','Value SMA','Value EMA','Value Gaussian']:
    stats.append( sp.performance(equity_curves[strategy], time_frame= 251))
    
pd.DataFrame([stat.values() for stat in stats],columns = stats[0].keys(),index=['Baseline','Value SMA','Value EMA','Value Gaussian'])

Out[26]:

	sharp	std	sortino	car3	car1	cret	pos	car5	aret	max	ave	ytd	1000
Baseline	0.450600	0.169088	0.868786	14.479322	17.134151	284.648624	0.526083	5.947035	6.967836	56.700440	0.033569	17.134151	3846.486241
Value SMA	0.857395	0.159222	16.518197	12.628268	4.673404	1021.574147	0.337055	14.389879	12.847359	24.042366	0.051072	4.673404	11215.741466
Value EMA	0.509205	0.185759	1.388655	8.647352	8.518586	399.564057	0.327045	11.128043	8.375112	48.362182	0.035617	8.518586	4995.640570
Value Gaussian	0.537531	0.168085	1.455399	4.046805	4.218977	357.432135	0.300866	6.198407	7.898730	30.130000	0.033532	4.218977	4574.321353

In [27]:

stats = []
for strategy in ['Value SMA','Value EMA','Value Gaussian']:
    stats.append( sp.stats(equity_curves[strategy], equity_curves['Baseline'],time_frame= 251))
    
pd.DataFrame([stat.values() for stat in stats],columns = stats[0].keys(),index=['Value SMA','Value EMA','Value Gaussian'])

Out[27]:

	alpha	beta	r	return	correlation
Value SMA	12.124887	0.021216	0.000268	1021.574147	0.016368
Value EMA	6.986131	0.019568	0.000234	399.564057	0.015300
Value Gaussian	6.260903	0.038330	0.000822	357.432135	0.028669

So this tells the story. The Value SMA has lower varience (meaning high sharp and sortino, and low maximum drawdown and standard deviation). And it has higher in terms of return. All of the strategies yeild positive alpha and have very low beta.

DMAC Further Work¶

One can much more thruoghouly examine different filters and different periods in order to find the optimal. But we should be wary of overfitting. There are other statistical test that we could preform to judge the value of a strategy.

RSI¶

RSI or relative strength index is a measure of a certain stock being overbought or oversold. Thus if the RSI (implemented and explained further in signal_processing.py) breaches upper and lower bounds of 70 and 30, we should assume that the stock will revert to the mean. But what does this look like?

In [28]:

equity_curves = pd.DataFrame(sp.rsi(prices.OPEN,50),columns =['RSI'],index=prices.index)
equity_curves.plot(); remove_border()

Well this does not tell us much. First the principal of RSI is: Sell wehn above 70 and buy when below 30. So lets add some more graphics to the chart.

In [29]:

equity_curves['Upper Bound'] = [70]*len(equity_curves.RSI)
equity_curves['Lower Bound'] = [30]*len(equity_curves.RSI)
# normalize the original s&p
equity_curves['Baseline'] = 100*prices.OPEN/prices.OPEN[-1]
equity_curves.plot(); remove_border(); plt.legend(loc='upper left')

Out[29]:

<matplotlib.legend.Legend at 0x1be9fb00>

Ah, now we can see the strength of RSI. Notice that when RSI breaks the bounds of 70/30, the morket will then move in the other direction. RSI tells us if the stock is overbought or oversold. Thus we can see that it works, to an extent. Let's try to find the best implementation of this strategy.

In [30]:

values = []
for period in range(1,101):
    values.append([period,sp.equity_curve(sp.rsi(prices.OPEN,period), prices.OPEN, [70]*len(prices.OPEN), [30]*len(prices.OPEN), 5,end_value=True)])
                  
print max(values,key=lambda x:x[1])

periods, returns = zip(*values)

plt.plot(returns)

[13, 114.08440469332842]

Out[30]:

[<matplotlib.lines.Line2D at 0x1babc518>]

So as one can see, there is a trend to RSI such that by increasing or decreasing the period can lead to consistantly better results. Notice that as the period increases, we get less volitility, because we get fewer trades. But we have found a generally optimal strategy. Let us view some statistics on this curve vs. the baseline and finish off with our conclusions on RSI.

In [31]:

equity_curves = pd.DataFrame(sp.equity_curve(sp.rsi(prices.OPEN,13), prices.OPEN, [70]*len(prices.OPEN), [30]*len(prices.OPEN), 5),columns =['RSI'],index=prices.index)
# normalize the original s&p
equity_curves['Baseline'] = 100*prices.OPEN/prices.OPEN[1]
equity_curves.plot(); remove_border();# plt.legend(loc='upper left')

So RSI might not be what we are looking for. RSI is a naive approach to handle the stock market from the 1970s and it is not surprising that such a naive strategy might not work. Further implemntations might be to change the holding period, to set stops, or in short develop a more comprehensive RSI tool.

Conclusions¶

Signal processing can get us far. It can make us money, ensure low volitility, and even beat the stock market. But sometimes it just does not work. Our naive approach to signal processing has given us insight on how traders might look at the market and how to evaluate a strategy, however there exist problems with overfitting and a potentially limitless supply of different signals.

Extensions and future work for signal processing include: run more advanced statistics, to try a wider range of techniques, and develop a more in depth trading system (including stops, limits, commissions, and slippage).

Portfolio Construction and Analysis¶

This is the most classice way to go about making money on the market: buying some stocks and putting them in your portfolio.

Calculate returns from price data.¶

Returns defined as: $$ $$ $$r_t = \frac{p_t-p_{t-1}}{p_{t-1}} $$

We define the iterative return index:

$$i_t = (1 + r_t) \cdot i_{t-1}, \quad i_0 = 1$$

In [32]:

RR = close_px/close_px.shift(1) - 1
daily_index = (1 + RR).cumprod()
daily_index['MSFT'].plot()
remove_border()

In [33]:

# Let's sample and plot monthly returns...
monthly_index = daily_index.asfreq('EOM', method='ffill')
monthly_RR = monthly_index / monthly_index.shift(1) - 1
monthly_RR.plot()
remove_border()

We can plot multiple timeseries in a dataframe using matplotlib:

In [34]:

subRR = RR.ix[:, ['AAPL', 'IBM', 'XOM', 'MSFT']]
(1 + subRR).cumprod().plot()
remove_border()

We can look at the distributions of stock and ETF (Exchange Traded Fund) returns:

In [35]:

RR[-250:].hist(bins=50)
remove_border()

In [36]:

RR.ix[-1].plot(kind='bar')
remove_border()

We can use a sliding window of price to calculate certain functions of timeseries, such as rolling volatility and moving averages.

In [37]:

# We have the following plot for Apple:
px = close_px['AAPL']
close_px['AAPL'].plot()
remove_border()

And the corresponding rolling volatility:

In [38]:

vol = rolling_std(px / px.shift(1) - 1, 250) * np.sqrt(250)
vol.plot()
remove_border()

We calculated the rolling volatility using a sliding window of price; similarly we can calculate a smoothed exponential moving average of the rolling volatility to more easily recognize changes in trend.

In [39]:

evol = ewmstd(RR, span=250) * np.sqrt(250)
vol.plot()
evol.plot()
plt.xlim([datetime.datetime(1991, 1, 1), datetime.datetime(2012, 1, 1)])

Out[39]:

(726833.0, 734503.0)

Correlation of Equity Returns¶

In the variable $RR$, we have the returns of the various $SPX Equities. These returns may be correlated. We can look at the correlation in a grid:

In [40]:

RR.corr()

Out[40]:

	AA	AAPL	GE	IBM	JNJ	MSFT	PEP	SPX	XOM
AA	1.000000	0.216917	0.454180	0.320836	0.239261	0.332316	0.231136	0.599806	0.433569
AAPL	0.216917	1.000000	0.309258	0.353119	0.153393	0.365921	0.129998	0.439518	0.189228
GE	0.454180	0.309258	1.000000	0.413646	0.374792	0.414954	0.332428	0.744141	0.392965
IBM	0.320836	0.353119	0.413646	1.000000	0.247022	0.426098	0.215850	0.575682	0.278011
JNJ	0.239261	0.153393	0.374792	0.247022	1.000000	0.292051	0.372419	0.506015	0.351107
MSFT	0.332316	0.365921	0.414954	0.426098	0.292051	1.000000	0.249435	0.618319	0.314164
PEP	0.231136	0.129998	0.332428	0.215850	0.372419	0.249435	1.000000	0.450798	0.306896
SPX	0.599806	0.439518	0.744141	0.575682	0.506015	0.618319	0.450798	1.000000	0.601648
XOM	0.433569	0.189228	0.392965	0.278011	0.351107	0.314164	0.306896	0.601648	1.000000

Or preferably, we can use a colorful visualization to see at a glance which equities exhibit stronger correlation with each other:

In [41]:

plot_corr(RR.corr())

We can also plot the correlation of the monthly returns.

In [42]:

plot_corr(monthly_RR.corr())

Leave-One-Out Regression (LOOCV):

In [43]:

# Note: the function leave_one_out is included in the helper file.
leave_one_out(RR)

Out[43]:

{'ex-AA': GE          -0.056442
IBM          0.196518
JNJ         -0.138967
MSFT         0.175767
PEP         -0.113942
SPX          1.142164
XOM         -0.178632
intercept    0.000844
dtype: float64,
 'ex-GE': AA          -0.079250
IBM          0.194958
JNJ         -0.150925
MSFT         0.174841
PEP         -0.118740
SPX          1.170754
XOM         -0.157575
intercept    0.000841
dtype: float64,
 'ex-IBM': AA          -0.082702
GE          -0.060464
JNJ         -0.162179
MSFT         0.187395
PEP         -0.127989
SPX          1.445041
XOM         -0.185627
intercept    0.000924
dtype: float64,
 'ex-JNJ': AA          -0.069560
GE          -0.054263
IBM          0.199524
MSFT         0.173160
PEP         -0.141957
SPX          1.153692
XOM         -0.173623
intercept    0.000799
dtype: float64,
 'ex-MSFT': AA          -0.086019
GE          -0.075172
IBM          0.211504
JNJ         -0.155034
PEP         -0.125363
SPX          1.479190
XOM         -0.183643
intercept    0.000965
dtype: float64,
 'ex-PEP': AA          -0.074750
GE          -0.054790
IBM          0.199032
JNJ         -0.175586
MSFT         0.174323
SPX          1.175799
XOM         -0.168719
intercept    0.000829
dtype: float64,
 'ex-SPX': AA           0.026372
GE           0.214939
IBM          0.332856
JNJ         -0.028252
MSFT         0.319514
PEP         -0.031255
XOM          0.026058
intercept    0.000598
dtype: float64,
 'ex-XOM': AA          -0.090336
GE          -0.035589
IBM          0.205419
JNJ         -0.161027
MSFT         0.180873
PEP         -0.123881
SPX          1.095649
intercept    0.000791
dtype: float64}

We can organize this output into a dataframe.

In [44]:

DataFrame(leave_one_out(RR))

Out[44]:

	ex-AA	ex-GE	ex-IBM	ex-JNJ	ex-MSFT	ex-PEP	ex-SPX	ex-XOM
AA	NaN	-0.079250	-0.082702	-0.069560	-0.086019	-0.074750	0.026372	-0.090336
GE	-0.056442	NaN	-0.060464	-0.054263	-0.075172	-0.054790	0.214939	-0.035589
IBM	0.196518	0.194958	NaN	0.199524	0.211504	0.199032	0.332856	0.205419
JNJ	-0.138967	-0.150925	-0.162179	NaN	-0.155034	-0.175586	-0.028252	-0.161027
MSFT	0.175767	0.174841	0.187395	0.173160	NaN	0.174323	0.319514	0.180873
PEP	-0.113942	-0.118740	-0.127989	-0.141957	-0.125363	NaN	-0.031255	-0.123881
SPX	1.142164	1.170754	1.445041	1.153692	1.479190	1.175799	NaN	1.095649
XOM	-0.178632	-0.157575	-0.185627	-0.173623	-0.183643	-0.168719	0.026058	NaN
intercept	0.000844	0.000841	0.000924	0.000799	0.000965	0.000829	0.000598	0.000791

We can conveniently use the function 'groupby' to group the data by year.

In [45]:

grouped = RR.groupby(lambda x: x.year)
for year, group in grouped:
    print year
    print group[:5]

1990
                  AA      AAPL        GE       IBM       JNJ  MSFT       PEP       SPX       XOM
1990-02-01       NaN       NaN       NaN       NaN       NaN   NaN       NaN       NaN       NaN
1990-02-02  0.012048  0.017812  0.000000  0.005956  0.023419     0  0.008278  0.006478  0.019608
1990-02-05  0.005952  0.022500  0.000000  0.025459 -0.006865     0 -0.006568  0.002810  0.001603
1990-02-06 -0.011834 -0.007335  0.003484  0.013857 -0.004608     0  0.016529 -0.006599 -0.003200
1990-02-07  0.005988 -0.043103  0.010417  0.021071  0.013889     0  0.003252  0.012407  0.016051
1991
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
1991-01-02  0.006250  0.011799 -0.014545 -0.007874 -0.025554  0.000000 -0.008526 -0.011417 -0.019746
1991-01-03 -0.006211 -0.011662 -0.022140  0.003472 -0.010490  0.000000 -0.029484 -0.013907  0.005755
1991-01-04 -0.022917  0.005900 -0.011321 -0.003460 -0.007067  0.012195 -0.005063 -0.002827  0.010014
1991-01-07 -0.006397  0.000000 -0.015267 -0.016865 -0.030249 -0.012048 -0.025445 -0.017321 -0.012748
1991-01-08 -0.015021  0.000000  0.007752 -0.011100  0.009174 -0.024390  0.005222 -0.001712 -0.002869
1992
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
1992-01-02 -0.005445  0.054978  0.000000  0.014328 -0.019958  0.027473  0.000000  0.000408 -0.014874
1992-01-03  0.010949 -0.008451 -0.002646  0.001177 -0.004287 -0.010695  0.015654  0.004985  0.002323
1992-01-06 -0.009025 -0.017045 -0.005305  0.020576  0.009688  0.032432 -0.015413 -0.003291 -0.008111
1992-01-07  0.000000  0.019509 -0.005333  0.025922  0.012793  0.026178  0.000000 -0.001340 -0.005841
1992-01-08  0.001821  0.023388 -0.002681 -0.023582  0.018947  0.035714  0.041436  0.001677 -0.012926
1993
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
1993-01-04  0.019108 -0.025017  0.000000 -0.004980 -0.019883 -0.004762 -0.008889 -0.000757  0.004348
1993-01-05  0.001562  0.017106  0.011494 -0.024024 -0.022673  0.014354 -0.009716 -0.002389  0.009740
1993-01-06 -0.006240  0.042046 -0.004545 -0.018462 -0.030525  0.033019 -0.015094  0.000414 -0.004287
1993-01-07 -0.001570 -0.012105 -0.002283 -0.020899 -0.026448 -0.022831 -0.028352 -0.008722 -0.002153
1993-01-08 -0.028302  0.020422 -0.002288 -0.010672  0.024580  0.000000  0.016562 -0.003900 -0.016181
1994
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
1994-01-03  0.019293  0.021038 -0.007299  0.020636  0.007702 -0.005051 -0.006662 -0.002165  0.009054
1994-01-04 -0.001577  0.054945 -0.003676  0.023589 -0.002548  0.005076 -0.011923  0.003115  0.007976
1994-01-05  0.023697  0.071615 -0.007380  0.008230 -0.002554  0.020202 -0.009804  0.001414  0.005935
1994-01-06 -0.004630 -0.030377  0.009294 -0.017143 -0.020487  0.029703 -0.012186 -0.000920 -0.005900
1994-01-07  0.004651  0.012531  0.005525  0.006645  0.011765  0.004808  0.022359  0.005951 -0.007913
1995
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
1995-01-03 -0.013854 -0.015560  0.000000  0.003251 -0.004111 -0.013333 -0.006552 -0.000348 -0.001996
1995-01-04  0.003831  0.025290  0.000000  0.008425  0.009288  0.006757 -0.014015  0.003485  0.004000
1995-01-05 -0.007634 -0.012333  0.003643 -0.004499 -0.005112 -0.016779 -0.020903 -0.000803  0.001992
1995-01-06  0.029487  0.080125 -0.005445  0.014848 -0.006166  0.017065  0.000000  0.000739  0.000000
1995-01-09  0.009963 -0.019268 -0.009124  0.005089 -0.011375 -0.006711  0.003416  0.000326 -0.007952
1996
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
1996-01-02  0.028340  0.007528  0.021330 -0.005691 -0.014848  0.023202 -0.002611  0.007793 -0.006442
1996-01-03  0.015748  0.000000  0.001229 -0.017690  0.039974 -0.031746  0.002618  0.000950  0.002882
1996-01-04  0.008721 -0.017435 -0.011043 -0.026483 -0.009452  0.004684 -0.002611 -0.005826  0.006466
1996-01-05 -0.006724  0.084918  0.001241  0.020131  0.000000 -0.011655 -0.002094 -0.001603  0.024982
1996-01-08 -0.054159  0.011682  0.007435  0.005333  0.012723  0.000000  0.000000  0.002838  0.011838
1997
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
1997-01-02  0.021470  0.005747 -0.013405  0.011413  0.000000 -0.012315  0.008358 -0.005036  0.003425
1997-01-03  0.044462  0.036190  0.015399  0.038426  0.010388  0.036160  0.008289  0.014952  0.002844
1997-01-06 -0.002322 -0.178309 -0.000892  0.013216  0.007576 -0.002407 -0.016441 -0.000508  0.013613
1997-01-07 -0.005431 -0.020134  0.020536  0.013333  0.002148  0.007238  0.008358  0.007463  0.002798
1997-01-08 -0.001560  0.006849 -0.013998 -0.021453  0.007503 -0.019162  0.004388 -0.006399 -0.007812
1998
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
1998-01-02  0.012546  0.237805  0.008289  0.009545 -0.012215  0.014173 -0.006844  0.004750  0.011111
1998-01-05  0.017493 -0.022167  0.018203  0.007696 -0.000824 -0.005435  0.013783  0.002082 -0.010989
1998-01-06 -0.022923  0.193955 -0.013264 -0.011128 -0.015264  0.005464 -0.036136 -0.010736 -0.036000
1998-01-07  0.002199 -0.075949  0.008182 -0.009488  0.009636 -0.011646  0.019673 -0.002669  0.031812
1998-01-08 -0.042429  0.038813 -0.009275 -0.000668  0.011618  0.007070  0.000000 -0.008247 -0.021448
1999
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
1999-01-04 -0.010981  0.007820 -0.014274 -0.007505 -0.013889  0.016514 -0.004727 -0.000919 -0.006899
1999-01-05  0.020819  0.050436  0.021295  0.036295  0.003841  0.038989  0.001583  0.013582 -0.008775
1999-01-06  0.039429 -0.036011  0.019600 -0.004621  0.011798  0.032662  0.026241  0.022140  0.039838
1999-01-07 -0.012426  0.077586 -0.016769  0.007575 -0.009455 -0.005047 -0.025570 -0.002051 -0.001419
1999-01-08  0.101987  0.000000 -0.005824 -0.013823  0.000636 -0.004058  0.015492  0.004221 -0.006039
2000
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
2000-01-03 -0.024605  0.089105 -0.030891  0.075319 -0.011504 -0.001526  0.046093 -0.009549 -0.027742
2000-01-04  0.004360 -0.084673 -0.039774 -0.033934 -0.036617 -0.033843 -0.025473 -0.038345 -0.019244
2000-01-05  0.057674  0.014832 -0.001763  0.035125  0.010607  0.010621 -0.024373  0.001922  0.054465
2000-01-06 -0.013193 -0.086538  0.013243 -0.017214  0.031487 -0.033542  0.044895  0.000956  0.051652
2000-01-07 -0.002971  0.047579  0.038629 -0.004429  0.042397  0.013188  0.027027  0.027090 -0.002746
2001
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
2001-01-02 -0.037407  0.000000 -0.087551 -0.002150 -0.029224  0.000000 -0.003787 -0.028032  0.025161
2001-01-03  0.013467  0.100806  0.092764  0.115690 -0.031875  0.104985 -0.058287  0.050099 -0.043664
2001-01-04  0.032650  0.041514  0.005251 -0.015210 -0.021688  0.010085 -0.045748 -0.010552 -0.027753
2001-01-05 -0.018382 -0.039859 -0.015670  0.008703  0.013088  0.014188  0.021151 -0.026242  0.004604
2001-01-08  0.014981  0.010989 -0.036851 -0.004618 -0.001318 -0.003627  0.016570 -0.001918 -0.004583
2002
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
2002-01-02  0.003434  0.063927  0.021575  0.004416 -0.006894  0.011909  0.009647  0.005740  0.007600
2002-01-03  0.013005  0.012017 -0.008381  0.017775 -0.000217  0.032650 -0.007292  0.009180  0.001571
2002-01-04  0.032095  0.005089  0.008452  0.015718 -0.009330 -0.004779 -0.008865  0.006213  0.008472
2002-01-07  0.023241 -0.033755 -0.038887 -0.012398 -0.004599 -0.004802 -0.005878 -0.006499 -0.008712
2002-01-08 -0.021753 -0.012227 -0.010464  0.005315 -0.006601  0.011878 -0.005913 -0.003588  0.001255
2003
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
2003-01-02  0.033578  0.033520  0.046729  0.039604  0.030168  0.039389  0.021092  0.033200  0.015288
2003-01-03  0.010660  0.006757 -0.003151  0.013445  0.026788  0.001421  0.006791 -0.000484  0.000684
2003-01-06  0.024611  0.000000  0.025817  0.023770  0.009507  0.017975 -0.010118  0.022474  0.024624
2003-01-07  0.000000 -0.002685 -0.006163  0.028888 -0.022558  0.019052 -0.018171 -0.006545 -0.033712
2003-01-08 -0.103922 -0.020188 -0.015504 -0.021123 -0.005826 -0.028272  0.012435 -0.014086 -0.004145
2004
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
2004-01-02 -0.011967 -0.004677  0.004203 -0.012204  0.000000  0.002765 -0.003352 -0.003094 -0.008929
2004-01-05  0.032609  0.042293  0.015069  0.016391  0.005971  0.025276  0.017076  0.012395  0.023249
2004-01-06 -0.007218 -0.003607 -0.006598  0.000120 -0.004511  0.003586  0.006360  0.001292 -0.006816
2004-01-07 -0.007573  0.022624  0.009548 -0.003008  0.000477 -0.001340 -0.012892  0.002367 -0.007149
2004-01-08  0.011905  0.033628  0.018092  0.002776  0.004529 -0.001342 -0.019974  0.004963 -0.002592
2005
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
2005-01-03 -0.013120 -0.017081  0.002787 -0.008455 -0.008205  0.000845 -0.004984 -0.008119 -0.022717
2005-01-04 -0.018095  0.010111 -0.012161 -0.010687 -0.003078  0.003797 -0.007058 -0.011671 -0.006904
2005-01-05 -0.006017  0.008758 -0.005980 -0.002068 -0.000772 -0.002522  0.001147 -0.003628 -0.005098
2005-01-06  0.004162  0.000930  0.007785 -0.003109  0.002897 -0.000843  0.007100  0.003506  0.012579
2005-01-07  0.010173  0.072491 -0.005969 -0.004390 -0.003659 -0.002952  0.008642 -0.001431 -0.006441
2006
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
2006-01-03  0.010985  0.039783  0.009191 -0.001741  0.025469  0.026428  0.011600  0.016430  0.041046
2006-01-04  0.005620  0.002943 -0.001401 -0.001342  0.015402  0.004568 -0.000583  0.003673  0.001739
2006-01-05  0.008197 -0.007870 -0.002455  0.006717 -0.004171  0.000827 -0.005834  0.000016 -0.005016
2006-01-06 -0.003326  0.025813  0.006681  0.029624  0.004570 -0.002891  0.003717  0.009399  0.019779
2006-01-09  0.011865 -0.003277 -0.002445 -0.014256  0.006255 -0.002071 -0.005067  0.003656 -0.000570
2007
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
2007-01-03 -0.022661 -0.012258  0.020692  0.001230  0.005795  0.000000  0.002734 -0.001199 -0.032981
2007-01-04 -0.007479  0.022196 -0.006018  0.010721  0.012395 -0.001472  0.006908  0.001228 -0.018699
2007-01-05 -0.012057 -0.007121 -0.004780 -0.009061 -0.008967 -0.005898 -0.003069 -0.006085  0.007165
2007-01-08 -0.009535  0.004938 -0.000320  0.015165 -0.001740  0.010011  0.002173  0.002220 -0.008022
2007-01-09  0.001540  0.083070  0.000000  0.011863 -0.003660  0.000734  0.004156 -0.000517 -0.007782
2008
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
2008-01-02 -0.011481 -0.016357 -0.008186 -0.031513 -0.011862 -0.010658 -0.008092 -0.014438 -0.001862
2008-01-03  0.001489  0.000462  0.000952  0.001944  0.000171  0.004001  0.006526  0.000000  0.003382
2008-01-04 -0.036574 -0.076335 -0.020615 -0.035948 -0.001372 -0.027897 -0.003095 -0.024552 -0.018596
2008-01-07 -0.050309 -0.013385  0.003886 -0.010699  0.015625  0.006623  0.023503  0.003223 -0.009356
2008-01-08 -0.063698 -0.035972 -0.021613 -0.024521  0.001183 -0.033521  0.007943 -0.018352 -0.012791
2009
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
2009-01-02  0.076137  0.063269  0.053999  0.038154  0.013774  0.046002  0.021860  0.031608  0.022773
2009-01-05 -0.020708  0.042204 -0.025940 -0.006386 -0.009964  0.009424 -0.006418 -0.004668 -0.000131
2009-01-06  0.022026 -0.016494  0.013981  0.027771 -0.006038  0.011411  0.001762  0.007817 -0.016374
2009-01-07 -0.101724 -0.021608 -0.044649 -0.016047 -0.009389 -0.060000 -0.032630 -0.030010 -0.025436
2009-01-08  0.043186  0.018569  0.002062 -0.006955 -0.001858  0.031097 -0.011715  0.003397  0.010659
2010
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
2010-01-04  0.032746  0.015565  0.020891  0.011908  0.004126  0.015700  0.007282  0.016043  0.014102
2010-01-05 -0.031098  0.001729  0.005457 -0.012079 -0.011506  0.000336  0.012048  0.003116  0.003930
2010-01-06  0.052234 -0.015906 -0.005427 -0.006547  0.008148 -0.006382 -0.010034  0.000546  0.008582
2010-01-07 -0.021531 -0.001849  0.051842 -0.003414 -0.007257 -0.010142 -0.006356  0.004001 -0.003135
2010-01-08  0.025061  0.006648  0.021401  0.010039  0.003489  0.006831 -0.003285  0.002882 -0.003893
2011
                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
2011-01-03  0.026797  0.021732  0.000000  0.004900  0.015773  0.002556  0.006421  0.011315  0.019624
2011-01-04  0.045831  0.005219  0.017937  0.001099  0.008336  0.004006 -0.005135 -0.001313  0.004641
2011-01-05  0.002435  0.008180  0.001652 -0.003979 -0.000648 -0.003264  0.017988  0.005007 -0.002717
2011-01-06 -0.012143 -0.000808 -0.004398  0.010951 -0.001460  0.029476  0.003841 -0.002123  0.006540
2011-01-07  0.003688  0.007161 -0.007178 -0.004973 -0.009747 -0.007777 -0.006735 -0.001845  0.005414

Further, we can use groupby to do a linear regression, calculating beta.¶

In [46]:

def get_beta(RR):
    RR = RR.dropna()
    RR['intercept'] = 1.
    model = sm.OLS(RR['MSFT'], RR.ix[:, ['AAPL', 'intercept']]).fit()
    return model.params

#get_beta(RR), for AAPL

grouped = RR.groupby([lambda x: x.year, lambda x: x.month])
beta_by_ym = grouped.apply(get_beta)
beta_by_ym.unstack(0)['AAPL']

Out[46]:

	1990	1991	1992	1993	1994	1995	1996	1997	1998	1999	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011
1	NaN	0.586836	0.392003	0.337470	0.034171	0.240722	0.064221	0.131182	0.036281	0.182694	0.242851	0.410385	0.313725	0.906993	0.261317	-0.036247	0.144353	-0.024860	0.250590	0.310269	0.548070	0.305503
2	-0.002426	0.398197	0.597157	-0.191693	0.004739	0.166778	0.069597	0.002899	0.167381	0.504353	0.367281	0.387085	0.285311	0.781728	0.207407	0.128596	0.187659	0.532232	0.232091	0.649430	0.618771	0.332002
3	0.205730	0.604625	0.229554	0.467642	0.157241	-0.130381	0.234062	0.161575	0.127244	0.330306	0.215007	0.625861	0.562043	1.137488	0.227146	0.020839	0.093893	0.550877	0.505030	0.998851	0.070641	0.466862
4	0.406723	-0.055939	0.766337	0.645863	0.337084	-0.097639	-0.166418	-0.092456	0.015637	0.124685	0.445522	0.421632	0.598160	0.468462	0.044444	0.272978	-0.009106	0.059114	0.463362	0.321722	0.061764	-0.232956
5	0.185421	0.034390	-0.118079	0.700196	0.287318	0.097011	0.110495	0.474461	0.117327	0.193558	0.406288	0.305603	0.927363	0.160195	0.242406	0.115821	0.227492	0.144143	0.415390	0.229804	0.548791	0.517262
6	0.173833	0.124975	0.218792	0.117013	0.386103	0.221485	0.266115	0.464815	0.500223	0.319529	0.422132	0.227274	0.232655	0.315496	0.106455	0.032854	0.126484	0.054197	0.275618	0.171693	0.743858	0.465253
7	0.130995	-0.022488	0.507049	0.156466	0.475714	0.886968	0.098349	0.223480	0.260195	0.274491	0.259399	0.216299	0.486249	0.380805	0.019886	0.241723	0.007129	0.037335	0.492093	0.512913	0.207223	0.380265
8	0.902234	0.307631	0.345300	0.214452	0.278754	0.673664	0.054254	0.054791	0.389318	0.113426	-0.037612	0.708398	0.919043	0.061068	0.295398	0.241400	-0.019745	0.338582	0.562713	0.589150	0.382959	0.847749
9	0.904113	0.422216	0.399828	0.161215	0.382306	0.229729	0.074385	-0.009814	0.482903	0.287425	0.024519	0.528889	0.766441	0.725022	0.215218	-0.022490	-0.007107	0.219036	0.480952	0.453823	0.514019	0.678709
10	0.362579	-0.027998	0.159585	-0.069302	0.109282	0.280408	-0.004444	0.142951	0.349357	0.186547	-0.122512	0.348552	0.804821	0.299456	0.035525	0.146915	0.091303	0.234525	0.592720	0.249662	0.405615	0.232681
11	0.569319	0.233042	0.781442	0.219592	0.502614	0.075520	0.183464	0.153227	0.212188	0.005807	0.466090	0.337734	0.519945	0.348925	-0.057999	0.009591	0.126967	0.320820	0.763112	0.493005	0.473650	NaN
12	0.204879	0.179746	0.305322	0.035841	0.194147	0.255704	0.067983	0.244797	0.316726	0.138289	0.425003	0.308701	0.435298	0.047410	0.102319	0.053251	0.109467	0.705481	0.711442	0.158556	0.903940	NaN

We can emply the statistical Ordinary Least Squares (OLS) technique to do a moving window regression:¶

In [47]:

RR = close_px / close_px.shift(1) - 1
y = RR['AAPL']
x = RR.ix[:, ['MSFT']]
model = ols(y=y, x=x) # OLS
model

Out[47]:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <MSFT> + <intercept>

Number of Observations:         5471
Number of Degrees of Freedom:   2

R-squared:         0.1339
Adj R-squared:     0.1337

Rmse:              0.0289

F-stat (1, 5469):   845.4989, p-value:     0.0000

Degrees of Freedom: model 1, resid 5469

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
          MSFT     0.5227     0.0180      29.08     0.0000     0.4874     0.5579
     intercept     0.0007     0.0004       1.84     0.0665    -0.0000     0.0015
---------------------------------End of Summary---------------------------------

In [48]:

# more least squares
model = ols(y=y, x=x, window=250)
model.beta.info()
model.beta['MSFT'].plot()
remove_border()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5222 entries, 1991-01-29 00:00:00 to 2011-10-14 00:00:00
Data columns (total 2 columns):
MSFT         5222  non-null values
intercept    5222  non-null values
dtypes: float64(2)

Beta of Microsoft relative to the S&P 500 as a function of time.

In [49]:

def RB(returns, benchmark, window=250):
    betas = {}
    for col, ts in returns.iteritems():
        betas[col] = _r_b(ts, spx, window=250)
    return DataFrame(betas)
def _r_b(returns, benchmark, window=250):
    model = ols(y=returns, x={'bmk' : benchmark}, window=window,min_periods=100)
    return model.beta['bmk']
def GMR(daily_returns):
    daily_index = T(daily_returns)
    monthly_index = daily_index.asfreq('EOM', method='pad')
    return TR(monthly_index)
def T(returns):
    return (1 + returns).cumprod()
def TR(prices):
    return prices / prices.shift(1) - 1



rets = GMR(TR(close_px))
spx = rets.pop('SPX')
betas =RB(rets, spx, window=36)
betas['MSFT'].plot()
remove_border()

And finally the most important application: discover the portfolio of S&P 500 stocks that minimizes tracking error; (the 'Minimum Variance Portfolio') and visualize this portfolio's performance relative to the benchmark.¶

tracking error:

$TE = \omega =\sqrt{\operatorname{Var}(r_p - r_b)} = \sqrt{{E}[(r_p-r_b)^2]-({E}[r_p - r_b])^2}$

In [50]:

# benchmark is the S&P 500

px = close_px.ix[datetime.datetime(1992, 1, 1):]
RR = px / px.shift(1) - 1
annualizer = np.sqrt(250)
bmk = RR.pop('SPX')

tickers = RR.columns
weights = Series(1. / len(tickers), index=tickers)
weights

# plot weighted portfolio versus the benchmark
port_returns = (RR * weights).sum(1)
plot_returns(port_returns, bmk)

Annualized:

In [51]:

calc_te(weights, RR, bmk) * annualizer

Out[51]:

0.092509298588231245

Let's minimize tracking error for the entire sample of data, which uses hindsight:

In [52]:

clean_RR = RR.fillna(0)
clean_bmk = bmk.fillna(0)
opt_weights = get_opt_holdings_opt(clean_RR, clean_bmk)
calc_te(opt_weights, RR, bmk) * annualizer

Out[52]:

0.077470788727965992

We can even use convex optimization to minimize tracking error over a rolling two year period.¶

In [53]:

def rolling_min_te_portfolio(UNIRET, TRARET, window=250,T_intervals=100):
    
    ports = {}
    for i, cur_date in enumerate(UNIRET.index):
        if i < T_intervals:
            continue
        if i % 100 == 0:
            print >> sys.stdout, '*',
        UNI = UNIRET[max(0,i-window+1):i]
        TRA = TRARET[max(0,i-window+1):i]
        UNI = UNI.ix[:, UNI.count() > 0]
        UNI = UNI.fillna(0)
        TRA = TRA.fillna(0)
        ports[cur_date] = get_opt_holdings_opt(UNI,TRA)
    return DataFrame(ports).T
ports = rolling_min_te_portfolio(RR, bmk)

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

In [54]:

ports

Out[54]:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4888 entries, 1992-05-26 00:00:00 to 2011-10-14 00:00:00
Data columns (total 8 columns):
AA      4888  non-null values
AAPL    4888  non-null values
GE      4888  non-null values
IBM     4888  non-null values
JNJ     4888  non-null values
MSFT    4888  non-null values
PEP     4888  non-null values
XOM     4888  non-null values
dtypes: float64(8)

In [55]:

ports['JNJ'].plot()
remove_border()

In [56]:

port_return = (ports.shift(1) * RR).sum(1)
plot_returns(port_return, bmk)

The Sharpe ratio of this portfolio is quite high!

In [57]:

def sharpe(RR, ann=250):
    return np.sqrt(ann) * RR.mean() / RR.std()
sharpe(port_return)

Out[57]:

0.66534966791660943

Particularly in relation to the benchmark...

In [58]:

sharpe(bmk)

Out[58]:

0.37931166467025773

Conclusion¶

This reflects the concept that the lowest risk stocks in an index actually deliver the highest performance, as opposed to the common perception that one must take on more risk to attain a chance at higher reward.

This also worked very well. But let's do some sentiment analysis to round out our analysis.

In [58]:

Bayesian Tweets: Twitter Sentiment Analysis¶

First, we load in some standard packages and a couple of packages we wrote ourselves.

In [59]:

#ours
import twitter3 as tw3
import performance1 as perf1
import performance2 as perf2
import sentiment1 as sent
#standard
import pandas as pd
import numpy as np
import datetime
from pattern.web import Element
import requests
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
#import ystockquote as ysq

Load the Data¶

The next two cells scrape wikipedia for the ticker symbols of the S&P 500 companies, our stock data set we'll be using, and also prepare the list of dates we'll try to use for our analysis. Notably, the Twitter API only allows us to go back about a week. We've commented out the first cell and just loaded the symbols from a csv file in the second cell

In [60]:

# spy = requests.get('http://en.wikipedia.org/wiki/List_of_S&P_500_companies').text
# tickers = []
# dom = Element(spy)
# for a in dom('tr td:first-child a'):
#     ticker = a.content
#     if len(ticker) < 5:
#         tickers.append(str(ticker))
# tickers.append('BRK.B')
# tickers.append('CMCSA')
# tickers.append('DISCA')
# tickers = np.sort(tickers)
#tickerlist = tickers

In [61]:

spx = pd.read_csv('SPXSymbolsPD.csv', delimiter=',')
tickerlist = list(spx['Ticker'])
datelist = [datetime.date(2013,12,5),datetime.date(2013,12,6),datetime.date(2013,12,7),datetime.date(2013,12,8),datetime.date(2013,12,9),datetime.date(2013,12,10)]
weekdays_datetime = [datetime.date(2013,12,5),datetime.date(2013,12,6),datetime.date(2013,12,9),datetime.date(2013,12,10)]
weekdays_str = [str(day) for day in weekdays_datetime]
weekend_datetime = [datetime.date(2013,12,7),datetime.date(2013,12,8)]
weekend_str = [str(day) for day in weekend_datetime]

Now we need to use the Twitter API to search for the tweets including each stock ticker symbol on each day. We take all the tweets we can find, or 100 tweets, whichever is smaller. We wrote all of the functions in tw3, although we do use other packages (twython and twitter) designed to interface with the Twitter API. We initially made extensive use of twitter and then switched to twython as it gave us better error handling. As we want to do build this entire data set in one call, we have to be able to handle a wide variety of errors, such as dealing with our rate limits and many of the small issues that could crop up during a single API call. We also make use of extensive printing so we can monitor the function as it runs. We've commented it out and loaded the data from csv so as to avoid the 1.5 hour process of performing the searches.

In [62]:

#loadfull = tw3.safemultisearch(tickerlist,datelist)
#loadfull.to_csv('data.csv')
data = pd.read_csv('data.csv')
#remove weekend days as the market is closed
data = data[data['date']!=weekend_str[0]]
data = data[data['date']!=weekend_str[1]]

Next we need to check the stock performance on each day, so we wrote another function to add a 0 wherever the stock went down and a 1 wherever it went up. We use the ystockquote package to assist in gathering the stock data. Again, we comment this out to limit ourselves to a single round of data collection and load the results from csv.

In [63]:

#data,performance = perf1.check_performance(data)
#data = data[data['performance']>-1]#only keep if performance is 0 or 1, not NaN
#data.to_csv('data_perf.csv')
data = pd.read_csv('data_perf.csv')

We have a data set! We can go ahead and print part of our dataframe to see what sort of information we've stored, namely the symbol of the company, the date in question, text from the twitter searches, and a performance indicator. We made the choice to treat the entire twitter corpus for each stock from each day as a single bag of words, so the text here is the concatenated tweets (converted to ascii and without punctuation) of the entire result of each particular search. We'll turn it into a proper bag of words a bit later on.

This differs from rotten tomatos where we kept each review separate, but in that case we also had a "fresh" or "rotten" indicator for each review. Here we only have the stock performance, which is obviously doesn't change depending on which tweet we're looking at, so we just combine all of the tweets into a single bag of words.

In [64]:

print data.irow(0)
print data.irow(0)['text']

Unnamed: 0                                                      0
Unnamed: 0.1                                                  MMM
company                                                       MMM
date                                                   2013-12-05
text            MMM Could 3M Dare Trump GE on Dividend Hikes h...
performance                                                     0
Name: 0, dtype: object
MMM Could 3M Dare Trump GE on Dividend Hikes httptcoG7ukoGT5kECould 3M Dare Trump GE on Dividend Hikes MMM httptco4QBxAilSRDRT MScharts TSN MAT MMM LH Portfolios Pops amp Drops  httptco9hMj9HZuCGBarclays has estimated MMM target price at 121 defensive on 3Ms ability to maintain margins  httptcofvmByxwClFLaMonicaBuzz MMM  Buy on the dip httptcoxVMwKU07SsGeoffrey2313819 Ha Did 3M sponsor that tweet MMMSampP100 Stocks Performance INTC BA SPG NSC EBAY DVN UNP EMC MA GILD F MMM LMT ABBV AAPL  more httptcon4QZIDxy7wTSN MAT MMM LH Portfolios Pops amp Drops  httptco9hMj9HZuCGOne more small order filled Sold 1 Dec13 MMM 120 put for 37 centsTSN MAT MMM LH Portfolios Pops amp Drops  httptcoeB1gLFQgbDTSN MAT MMM LH Portfolios Pops amp Drops  httptcokqZTxc7ofATSN MAT MMM LH Portfolios Pops amp Drops  httptco2z5afpz5SrAnalyst estimate MMM EPS at 672 which is 2 above second quarter estimates  httptcofvmByxwClF httptco5wqB2bnXnFMMM mean target price is 127 as Barclays Credit Suisse and Citi gave stock neutral ratings  httptcoDtPqdKgHjrOptionSniper1 On my list to look at GOGO SLW P AAPL IBM OXY MMM IYR TWTR Didnt get to any yday Wanted to sell MSFT callsMMM breaking out3M Co The stock is testing its highs MMM httptcodS64sIMoSi httptcoZE6oMDONO0Pennystock Research on DCIN MMM GFIG XPO SHOR AIN View now httptcotCMpiAEdzGLooking for winners like ENV STXS RNIN BPZ MMM Got to see httptcoxeMnQv7YcRMMM  above 12760 can gain steam  Volume not impressing meTSN MAT QQQ XLE XLV XOP MS MMM ChartsinPlay Portfolio  changes in stops and sell advice  article later httptcoCW4YD3lpfBTSN MAT QQQ XLE XLV XOP MS MMM ChartsinPlay Portfolio  changes in stops and sell advice  article later httptco70ZRkrIBhJTSN MAT QQQ XLE XLV XOP MS MMM ChartsinPlay Portfolio  changes in stops and sell advice  article later httptcofyoFkbgRZcTSN MAT QQQ XLE XLV XOP MS MMM ChartsinPlay Portfolio  changes in stops and sell advice  article later httptcoWy7KgCw9YzMMM  3M  Analyst Is Right Limited Short To MediumTerm Appeal gt httptcoTuJoiXQMAf stock stocks MMMMMM 3M Co MMM 3M  Analyst Is Right Limited Short To MediumTerm Appeal httptcotT8fZdmp0O3M Co MMM 3 Reasons To Buy 3M E I Du Pont De Nemours And Co  MMM httptcox9HB2qh8Xs3M Co MMM 3M  Analyst Is Right Limited Short To MediumTerm Appeal MMM httptco0ZrdXLtIZB3M  Analyst Is Right Limited Short To MediumTerm Appeal httptcoeu5vDkFlui MMM3M  Analyst Is Right Limited Short To MediumTerm Appeal httptcoW8SwGHrC3E MMM

Naive Bayesian Classifier¶

Now we can begin making a proper bag of words and performing sentiment analysis. Many of the functions for this are contained in the file sentiment1.py, which we wrote and which we imported at the top. We start with a make_xy function, where X is a matrix where each row represents a (stock,date) combination and each column represents a particular word, with the elements representing the frequency of that word in the bag of words. The Y is just are performance indicator (1 for up for the day, 0 for down) which we made and attached to our dataframe earlier.

In [65]:

X, Y = sent.make_xy(data,vectorizer=None)

Now we use a Naive Bayes classifier as we did in the Bayesian Tomatoes problem set, split our data into a traininig set and a testing set, and check our accuracy on the training set and on the testing set.

In [66]:

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.33,random_state=22)
clf = MultinomialNB()
clf.fit(X_train,Y_train)
clf.predict(X_test)
print "accuracy on testing set",1-float(sum(abs(Y_test-clf.predict(X_test))))/float(len(Y_test))
print "accuracy on training set",1-float(sum(abs(Y_train-clf.predict(X_train))))/float(len(Y_train))

accuracy on testing set 0.516279069767
accuracy on training set 0.973221117062

So how did we do? At first glance, 51.6% is not very good. On the other hand, even a small edge in finance can be useful. But do we have an edge?

In [67]:

print "fraction that are positive on the day",float(len(Y_test[Y_test==1]))/float(len(Y_test))

fraction that are positive on the day 0.517829457364

We actually did worse than just assuming every stock went up! It turns out that we didn't gain any edge at all. Particularly considering how high our accuracy was on the training set and how poor it is on the testing set, it looks like we have serious over-fitting issues, likely due to the smaller size of our data set, the sparsity of our word frequency, and a potentially weak underlying connnection from tweets to stock performance. For example, people sometimes might tweet "$GOOG is doing great today!", but not all tweets are necessarily as clear. Unlike movie reviews, the explicit purpose of a tweet is not to determine whether a stock is doing well or poorly, the way the movie reviews are explicit about whether a movie is good or bad.

We'll try fitting for the best alpha and min_df parameters as we did in the problem set to see if this resolves these issues, although if the underlying problems are in the data we wouldn't expect this to work.

In [68]:

alphas = [0, .1, 1, 5, 10, 50, 100, 150, 200]
min_dfs = [1e-6, 1e-7, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]

#Find the best value for alpha and min_df, and the best classifier
best_alpha = None
best_min_df = None
max_loglike = -np.inf

#for alpha in alphas:
for alpha in alphas:
    print alpha#I print alpha as this takes a bit to run and I like to be able to see the progress
    for min_df in min_dfs:         
        vectorizer = CountVectorizer(min_df = min_df)       
        X, Y = sent.make_xy(data, vectorizer)
        clf = MultinomialNB(alpha=alpha)#should move outside of inner for loop
        loglike=sent.cv_score(clf,X,Y,sent.log_likelihood)
        #print loglike
        if loglike>max_loglike:
            max_loglike=loglike
            print "max_loglike",max_loglike,"alpha",alpha,"min_df",min_df
            best_alpha = alpha
            best_min_df = min_df
print "best max_loglike",max_loglike,"best alpha",best_alpha,"best min_df",best_min_df

0
max_loglike -701.863599147 alpha 0 min_df 0.1
0.1
max_loglike -701.698320488 alpha 0.1 min_df 0.1
1
max_loglike -700.218818186 alpha 1 min_df 0.1
5
max_loglike -693.812946439 alpha 5 min_df 0.1
10
max_loglike -686.175301985 alpha 10 min_df 0.1
50
max_loglike -636.827815439 alpha 50 min_df 0.1
100
max_loglike -595.120648905 alpha 100 min_df 0.1
150
max_loglike -566.893075242 alpha 150 min_df 0.1
200
max_loglike -547.192426876 alpha 200 min_df 0.1
best max_loglike -547.192426876 best alpha 200 best min_df 0.1

We can see that the maximal parameters are absurd, particularly the very high best_min_df, so it looks like we can't salvage our model this way. But is there anything else we can do?

Beyond Binary Classification¶

In the case of movie reviews, Rotten Tomatoes compresses down "thumbs up", "four out of five stars", and other such positive reviews to just "fresh". Similarly, we took all positive days and just converted them to "up", and all negative days and converting them to "down". Seeing as how we're up against a shortage of data, it's sensible for us to avoid this strategy (which is what allowed us to do binary classification earlier) and instead keep more detailed performance information.

We rebuild the performance indicator portion of our data set, instead looking at the fractional gain, clipping it from -1 to 1, then transforming it to the range 0 to 1. So now a gain of 3% would be .56. Again, we comment out the code to build the data set and just read it in from our previuos efforts.

In [69]:

#data = pd.read_csv('data.csv')
#data = data[data['date']!=weekend_str[0]]
#data = data[data['date']!=weekend_str[1]]
#data,performance = perf2.check_performance_scaled(data)
#data = data[data['performance']>=0]#only keep if performance is 0 to 1, not NaN
##data.to_csv('data_perf_scaled.csv')
data = pd.read_csv('data_perf_scaled.csv')

As in our previous model, we now run make_xy. Again, the X will be our bag of words results in matrix form, and our Y will be our performance indicators (which, recall, are continuous and based on the gain/loss.)

In [70]:

X, Y = sent.make_xy(data,vectorizer=None)

In the binary classification model, we then ran this through a multinomial naive Bayes model to predict the probability of up/down for each stock. Since we aren't giving our model a binary classification anymore, this will no longer work. However, we can still use a MultinomialNB model. Now, instead of predicting the probability of 1/0, this will actually build a prediction of performance (such as, say, .52 for a 1% gain) based on the elements of our bag of words.

In [71]:

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.33,random_state=22)
clf = MultinomialNB()
clf.fit(X_train,Y_train)
#clf.predict(X_test)

Out[71]:

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

To check the accuracy of our model, we see which stocks it predicts a nontrivial gain for (using a tolerance of .2% movement) and then see if the direction predicted is correct.

In [72]:

predvec = clf.predict(X_test)
indexpred = np.abs(predvec-.5)>0
actual = np.divide(Y_test[indexpred]-.5,np.abs(Y_test[indexpred]-.5))*.5+.5
predict = np.divide(clf.predict(X_test)[indexpred]-.5,np.abs(clf.predict(X_test)[indexpred]-.5))*.5+.5
print "accuracy on testing set",1-float(sum(abs(actual-predict)))/float(len(actual))
print "portion of stocks with motion going up in testing set",float(len(predict[predict>0]))/float(len(predict))
predvec = clf.predict(X_train)
indexpred = np.abs(predvec-.5)>0
actual = np.divide(Y_train[indexpred]-.5,np.abs(Y_train[indexpred]-.5))*.5+.5
predict = np.divide(clf.predict(X_train)[indexpred]-.5,np.abs(clf.predict(X_train)[indexpred]-.5))*.5+.5
print "accuracy on training set",1-float(sum(abs(actual-predict)))/float(len(actual))

accuracy on testing set 0.646341463415
portion of stocks with motion going up in testing set 0.40243902439
accuracy on training set 0.981389578164

And we've succeeded! We still have a very high probability on our training set indicating over fitting issues, but it works out to give us accurate results on our testing set, enabling us to gain an edge. The edge is not massive, as we are getting our predictions right about 65% of the time and the actual split between up stocks and downs tocks (among the stocks with sufficient movement) is 40/60, we have gained a several percent edge, which is considered very significant in the realm of finance.

In our binary classification models, we could from here check the level of overfitting of our model by binning up predictions (10%-20% up, 20%-30% up, and so on) and seeing if our accuracy matches our predictions. Unfortunately, one cost of using our Multinomial Naive Bayes model in this way is that we no longer have probability predictions, just direction predictions. In essense, we've lost our error term, which is always something we'd like to avoid when possible in data analysis. This also makes it difficult to tune our model, such as by carefully adjusting min_df or alpha parameters. Still, if the benefit of losing tunability and error analysis is going from just noise to a useful prediction, we'll take it.

One way we can check the reasonableness of our model, though, is by checking some words with strong negative and positive indications, as we did with Rotten Tomatoes.

In [73]:

vec = CountVectorizer(min_df = 1e-3)
#vec = CountVectorizer()
text = [words for i,words in data.text.iteritems()]
vec.fit(text)
words = np.array(vec.get_feature_names())
singles = np.eye(len(words))

In [74]:

X_df, Y_df = sent.make_xy(data,vectorizer=vec)
X_train_df,X_test_df,Y_train_df,Y_test_df = train_test_split(X_df,Y_df,test_size=0.33,random_state=22)
clf_df = MultinomialNB()
clf_df.fit(X_train_df,Y_train_df)

Out[74]:

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [75]:

#clf_df = clf
negativeindices = clf_df.predict_proba(sparse.csc_matrix(singles))[:,0].argsort()
positiveindices = clf_df.predict_proba(sparse.csc_matrix(singles))[:,1].argsort()

In [76]:

badwords=words[negativeindices]
badprobs = clf_df.predict_proba(singles)[negativeindices,1]
badprobs=badprobs[::-1]
badwords=badwords[::-1]
goodwords=words[positiveindices]
goodwords=goodwords[::-1]
goodprobs = clf_df.predict_proba(singles)[positiveindices,1]
goodprobs = goodprobs[::-1]
print "negative words",[(badwords[i],badprobs[i]) for i in range(10)]
print "positive words",[(goodwords[i],goodprobs[i]) for i in range(10)]

negative words [(u'dva', 0.00065429865287221971), (u'davita', 0.00070301231619910383), (u'healthcare', 0.0006835894031860572), (u'partners', 0.0013194339424065626), (u'dialysis', 0.00072809514936371226), (u'dvadva', 0.00072592477320190742), (u'scan', 0.00072375005946349695), (u'rated', 0.00071782864597029678), (u'patients', 0.00071095170136868311), (u'health', 0.00066552469716559491)]
positive words [(u'cpb', 0.032144008476641339), (u'soup', 0.013377734519007296), (u'options', 0.011244340670651403), (u'campbell', 0.010575715690765052), (u'day', 0.008966277117938088), (u'today', 0.0081946750241469728), (u'keeneonmarket', 0.007881447433107246), (u'over', 0.0078077819981858913), (u'iv', 0.00693413576281382), (u'up', 0.0068814076007903853)]

On both sides, it looks like there are a lot of words that were specific to just one or two stocks in our set. We again have to recall that we're not predicting probabilities of going up or down but a magnitude in the direction, so one stock that really tanked on one day could easily have a large effect on our most extreme words. This seems to be the case with so many pharmaceutical and health related words in our negative category. On the positive side, in addition to a few that seem stock specific, we do seem some general terms we might have suspected such as "options" and "up".

This is a long way from a predictive tool that could make money in real time, and it's not clear how good of a choice it is up against some of our other options we explored. We did have the advantage of compiling all of the tweets for each day, including tweets that were about stock behavior that already happened, but this does in its current form provide a several percent edge and shows potential to be honed into a more realtime tool should one desire.

Complete Conclusion¶

We looked at an array of different solutions. We showed how we collected data, we showed some of our tried and failed attempts, we used a slew of statistical analysis on each, and we showed three great ways to make money (and one not so good one).

In sum, each of our solutions can be expanded. But it seems that we stumbled upon a few good solutions, and are happy with our results.

In [ ]: