First, we load in some standard packages and a couple of packages we wrote ourselves.
#ours
import twitter3 as tw3
import performance1 as perf1
import performance2 as perf2
import sentiment1 as sent
#standard
import pandas as pd
import numpy as np
import datetime
from pattern.web import Element
import requests
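try:
    from sklearn.model_selection import train_test_split #scikit-learn 0.18 and later
except ImportError:
    from sklearn.cross_validation import train_test_split #older releases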
from sklearn.naive_bayes import MultinomialNB
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
#import ystockquote as ysq
The next two cells scrape Wikipedia for the ticker symbols of the S&P 500 companies, which make up the stock data set we'll be using, and prepare the list of dates we'll try to use for our analysis. Notably, the Twitter API only allows us to go back about a week. We've commented out the first cell and instead load the symbols from a csv file in the second cell.
# spy = requests.get('http://en.wikipedia.org/wiki/List_of_S&P_500_companies').text
# tickers = []
# dom = Element(spy)
# for a in dom('tr td:first-child a'):
#     ticker = a.content
#     if len(ticker) < 5:
#         tickers.append(str(ticker))
# tickers.append('BRK.B')
# tickers.append('CMCSA')
# tickers.append('DISCA')
# tickers = np.sort(tickers)
#tickerlist = tickers
spx = pd.read_csv('SPXSymbolsPD.csv', delimiter=',')
tickerlist = list(spx['Ticker'])
datelist = [datetime.date(2013,12,5),datetime.date(2013,12,6),datetime.date(2013,12,7),datetime.date(2013,12,8),datetime.date(2013,12,9),datetime.date(2013,12,10)]
weekdays_datetime = [datetime.date(2013,12,5),datetime.date(2013,12,6),datetime.date(2013,12,9),datetime.date(2013,12,10)]
weekdays_str = [str(day) for day in weekdays_datetime]
weekend_datetime = [datetime.date(2013,12,7),datetime.date(2013,12,8)]
weekend_str = [str(day) for day in weekend_datetime]
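The weekday and weekend lists above are typed out by hand. As a minimal sketch (the helper name `split_by_weekday` is ours for illustration and is not part of the original notebook), the same split can be derived from the date range with `datetime.date.weekday()`, which returns 5 and 6 for Saturday and Sunday. It only separates weekends and knows nothing about market holidays.

```python
import datetime

def split_by_weekday(start, end):
    """Return (weekdays, weekend) lists of dates from start to end inclusive."""
    weekdays, weekend = [], []
    day = start
    while day <= end:
        # weekday() is 0 for Monday through 6 for Sunday
        if day.weekday() < 5:
            weekdays.append(day)
        else:
            weekend.append(day)
        day += datetime.timedelta(days=1)
    return weekdays, weekend

weekdays_sketch, weekend_sketch = split_by_weekday(datetime.date(2013, 12, 5),
                                                   datetime.date(2013, 12, 10))
weekend_sketch_str = [str(day) for day in weekend_sketch]
```

For this date range the sketch reproduces the hand-written lists, so it could replace them if the window were ever extended.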
Now we need to use the Twitter API to search for the tweets that include each stock ticker symbol on each day. We take all the tweets we can find, up to a maximum of 100. We wrote all of the functions in tw3, although we do use other packages (twython and twitter) designed to interface with the Twitter API. We initially made extensive use of twitter and then switched to twython, as it gave us better error handling. Because we want to build this entire data set in one call, we have to handle a wide variety of errors, such as hitting our rate limits and the many small issues that can crop up during a single API call. We also print extensively so we can monitor the function as it runs. We've commented the search out and loaded the data from csv to avoid the 1.5 hour process of performing the searches.
#loadfull = tw3.safemultisearch(tickerlist,datelist)
#loadfull.to_csv('data.csv')
data = pd.read_csv('data.csv')
#remove weekend days as the market is closed
data = data[data['date']!=weekend_str[0]]
data = data[data['date']!=weekend_str[1]]
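The error handling described for the search step is not shown in this notebook, so here is a rough sketch of the general pattern only: retry a failing call with a growing wait, and print progress along the way. The names `call_with_retries` and `flaky_search` are hypothetical, the toy function merely imitates a rate-limit error, and none of this is the code in tw3.

```python
import time

def call_with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, wait and try again with a doubling delay.

    Hypothetical helper for illustration; it is not the code in tw3.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as err:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            print("attempt %d failed (%s); retrying in %.1f s" % (attempt, err, delay))
            sleep(delay)
            delay *= 2

# Toy stand-in for an API call that fails twice before succeeding.
calls = {"n": 0}

def flaky_search():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limit")
    return ["tweet one", "tweet two"]

waits = []  # record the requested waits instead of actually sleeping
result = call_with_retries(flaky_search, sleep=waits.append)
```

Passing `sleep` as a parameter keeps the example instant to run; a real search would use the default `time.sleep` and a delay suited to the API's rate-limit window.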
Next we need to check the stock performance on each day, so we wrote another function to add a 0 wherever the stock went down and a 1 wherever it went up. We use the ystockquote package to assist in gathering the stock data. Again, we comment this out to limit ourselves to a single round of data collection and load the results from csv.
#data,performance = perf1.check_performance(data)
#data = data[data['performance']>-1]#only keep if performance is 0 or 1, not NaN
#data.to_csv('data_perf.csv')
data = pd.read_csv('data_perf.csv')
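Since perf1.check_performance is not shown, the following is only a sketch of the labeling idea, under the assumption that "up" means the close was above the open on that day and that flat days count as 0. The function name, the made-up quotes and the use of `None` for missing data are all ours for illustration.

```python
def label_direction(open_price, close_price):
    """Return 1 if the close is above the open, 0 otherwise, None if data is missing.

    Assumption for illustration: direction is measured open-to-close.
    """
    if open_price is None or close_price is None:
        return None
    return 1 if close_price > open_price else 0

# Made-up quotes: (symbol, open, close)
quotes = [("AAA", 10.0, 10.5), ("BBB", 20.0, 19.0), ("CCC", None, 5.0)]
labels = [(sym, label_direction(o, c)) for sym, o, c in quotes]

# Drop rows with no usable label, mirroring the filter on 'performance' above.
kept = [(sym, lab) for sym, lab in labels if lab is not None]
```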
We have a data set! We can go ahead and print part of our dataframe to see what sort of information we've stored, namely the symbol of the company, the date in question, text from the twitter searches, and a performance indicator. We made the choice to treat the entire twitter corpus for each stock from each day as a single bag of words, so the text here is the concatenated tweets (converted to ascii and without punctuation) of the entire result of each particular search. We'll turn it into a proper bag of words a bit later on.
This differs from Rotten Tomatoes, where we kept each review separate, but in that case we also had a "fresh" or "rotten" indicator for each review. Here we only have the stock performance, which obviously doesn't change depending on which tweet we're looking at, so we just combine all of the tweets into a single bag of words.
#iloc replaces irow, which newer pandas releases no longer provide
print data.iloc[0]
print data.iloc[0]['text']
Unnamed: 0      0
Unnamed: 0.1    MMM
company         MMM
date            2013-12-05
text            MMM Could 3M Dare Trump GE on Dividend Hikes h...
performance     0
Name: 0, dtype: object
MMM Could 3M Dare Trump GE on Dividend Hikes httptcoG7ukoGT5kECould 3M Dare Trump GE on Dividend Hikes MMM httptco4QBxAilSRDRT MScharts TSN MAT MMM LH Portfolios Pops amp Drops httptco9hMj9HZuCGBarclays has estimated MMM target price at 121 defensive on 3Ms ability to maintain margins httptcofvmByxwClFLaMonicaBuzz MMM Buy on the dip httptcoxVMwKU07SsGeoffrey2313819 Ha Did 3M sponsor that tweet MMMSampP100 Stocks Performance INTC BA SPG NSC EBAY DVN UNP EMC MA GILD F MMM LMT ABBV AAPL more httptcon4QZIDxy7wTSN MAT MMM LH Portfolios Pops amp Drops httptco9hMj9HZuCGOne more small order filled Sold 1 Dec13 MMM 120 put for 37 centsTSN MAT MMM LH Portfolios Pops amp Drops httptcoeB1gLFQgbDTSN MAT MMM LH Portfolios Pops amp Drops httptcokqZTxc7ofATSN MAT MMM LH Portfolios Pops amp Drops httptco2z5afpz5SrAnalyst estimate MMM EPS at 672 which is 2 above second quarter estimates httptcofvmByxwClF httptco5wqB2bnXnFMMM mean target price is 127 as Barclays Credit Suisse and Citi gave stock neutral ratings httptcoDtPqdKgHjrOptionSniper1 On my list to look at GOGO SLW P AAPL IBM OXY MMM IYR TWTR Didnt get to any yday Wanted to sell MSFT callsMMM breaking out3M Co The stock is testing its highs MMM httptcodS64sIMoSi httptcoZE6oMDONO0Pennystock Research on DCIN MMM GFIG XPO SHOR AIN View now httptcotCMpiAEdzGLooking for winners like ENV STXS RNIN BPZ MMM Got to see httptcoxeMnQv7YcRMMM above 12760 can gain steam Volume not impressing meTSN MAT QQQ XLE XLV XOP MS MMM ChartsinPlay Portfolio changes in stops and sell advice article later httptcoCW4YD3lpfBTSN MAT QQQ XLE XLV XOP MS MMM ChartsinPlay Portfolio changes in stops and sell advice article later httptco70ZRkrIBhJTSN MAT QQQ XLE XLV XOP MS MMM ChartsinPlay Portfolio changes in stops and sell advice article later httptcofyoFkbgRZcTSN MAT QQQ XLE XLV XOP MS MMM ChartsinPlay Portfolio changes in stops and sell advice article later httptcoWy7KgCw9YzMMM 3M Analyst Is Right Limited Short To MediumTerm Appeal gt httptcoTuJoiXQMAf stock stocks MMMMMM 3M Co MMM 3M Analyst Is Right Limited Short To MediumTerm Appeal httptcotT8fZdmp0O3M Co MMM 3 Reasons To Buy 3M E I Du Pont De Nemours And Co MMM httptcox9HB2qh8Xs3M Co MMM 3M Analyst Is Right Limited Short To MediumTerm Appeal MMM httptco0ZrdXLtIZB3M Analyst Is Right Limited Short To MediumTerm Appeal httptcoeu5vDkFlui MMM3M Analyst Is Right Limited Short To MediumTerm Appeal httptcoW8SwGHrC3E MMM
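The cleaning and concatenation step that produced the text above lives in tw3 and is not shown here, so the following is just a sketch of how such a step might work. The function names and sample tweets are made up for illustration.

```python
import string

def clean_tweet(text):
    """Drop non-ASCII characters and ASCII punctuation from one tweet."""
    ascii_only = text.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only if ch not in string.punctuation)

def combine_tweets(tweets):
    """Concatenate cleaned tweets into one document for a (stock, date) pair."""
    # Join with a space so the last word of one tweet cannot fuse with the next.
    return " ".join(clean_tweet(t) for t in tweets)

sample = ["$MMM: Buy on the dip? http://t.co/abc",
          u"3M Co. is testing its highs \u2026"]
combined = combine_tweets(sample)
```

One design point: in the printed text above, neighboring tweets run together (for example "centsTSN"), which creates tokens that never appeared in any tweet. Joining on a space, as in this sketch, would avoid that.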
Now we can begin making a proper bag of words and performing sentiment analysis. Many of the functions for this are contained in the file sentiment1.py, which we wrote and imported at the top. We start with a make_xy function, where X is a matrix in which each row represents a (stock, date) combination and each column represents a particular word, with the elements giving the frequency of that word in the bag of words. The Y is just our performance indicator (1 for up for the day, 0 for down), which we made and attached to our dataframe earlier.
X, Y = sent.make_xy(data,vectorizer=None)
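To make the shape of X concrete, here is a small pure-Python sketch of a word-count matrix. It is a simplification and not the code in sentiment1.py: it lowercases and splits on whitespace, whereas CountVectorizer applies its own tokenization rules (for instance, by default it ignores one-character tokens). The name `make_count_matrix` is ours.

```python
from collections import Counter

def make_count_matrix(texts):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in texts[i]."""
    counts = [Counter(text.lower().split()) for text in texts]
    vocabulary = sorted(set(word for c in counts for word in c))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows

# Two toy documents, one per (stock, date) combination.
texts = ["MMM breaking out MMM", "sold MMM put"]
vocab, X_sketch = make_count_matrix(texts)
# vocab    -> ['breaking', 'mmm', 'out', 'put', 'sold']
# X_sketch -> [[1, 2, 1, 0, 0], [0, 1, 0, 1, 1]]
```

With about 500 tickers and a vocabulary in the thousands, most entries of the real matrix are zero, which is why a sparse representation is the usual choice.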
Now we use a Naive Bayes classifier as we did in the Bayesian Tomatoes problem set, split our data into a training set and a testing set, and check our accuracy on the training set and on the testing set.
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.33,random_state=22)
clf = MultinomialNB()
clf.fit(X_train,Y_train)
clf.predict(X_test)
print "accuracy on testing set",1-float(sum(abs(Y_test-clf.predict(X_test))))/float(len(Y_test))
print "accuracy on training set",1-float(sum(abs(Y_train-clf.predict(X_train))))/float(len(Y_train))
accuracy on testing set 0.516279069767
accuracy on training set 0.973221117062
So how did we do? At first glance, 51.6% is not very good. On the other hand, even a small edge in finance can be useful. But do we have an edge?
print "fraction that are positive on the day",float(len(Y_test[Y_test==1]))/float(len(Y_test))
fraction that are positive on the day 0.517829457364
We actually did worse than just assuming every stock went up! It turns out that we didn't gain any edge at all. Particularly considering how high our accuracy was on the training set and how poor it is on the testing set, it looks like we have serious over-fitting issues, likely due to the small size of our data set, the sparsity of our word frequencies, and a potentially weak underlying connection from tweets to stock performance. For example, people sometimes might tweet "$GOOG is doing great today!", but not all tweets are necessarily as clear. Unlike a movie review, whose explicit purpose is to say whether a movie is good or bad, a tweet is not written to declare whether a stock is doing well or poorly.
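The comparison we just made by eye can be written down as a small check: a classifier only has an edge if it beats the accuracy of always guessing the more common class. The sketch below uses made-up labels, and the helper names are ours.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true 0/1 labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return float(matches) / len(y_true)

def majority_baseline(y_true):
    """Accuracy of always predicting the more common class."""
    up_share = float(sum(y_true)) / len(y_true)
    return max(up_share, 1.0 - up_share)

# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 1, 1]

model_acc = accuracy(y_true, y_pred)        # 0.5
baseline_acc = majority_baseline(y_true)    # 0.625
edge = model_acc - baseline_acc             # negative: no edge
```

Plugging in our own numbers gives 0.5163 - 0.5178, which is slightly negative.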
We'll try fitting for the best alpha and min_df parameters as we did in the problem set to see if this resolves these issues, although if the underlying problems are in the data we wouldn't expect this to work.
alphas = [0, .1, 1, 5, 10, 50, 100, 150, 200]
min_dfs = [1e-6, 1e-7, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
#Find the best value for alpha and min_df, and the best classifier
best_alpha = None
best_min_df = None
max_loglike = -np.inf
for alpha in alphas:
    print alpha #printed to show progress, as this takes a while to run
    for min_df in min_dfs:
        vectorizer = CountVectorizer(min_df = min_df)
        X, Y = sent.make_xy(data, vectorizer)
        clf = MultinomialNB(alpha=alpha) #could be moved outside of the inner for loop
        loglike = sent.cv_score(clf,X,Y,sent.log_likelihood)
        #print loglike
        if loglike > max_loglike:
            max_loglike = loglike
            print "max_loglike",max_loglike,"alpha",alpha,"min_df",min_df
            best_alpha = alpha
            best_min_df = min_df
print "best max_loglike",max_loglike,"best alpha",best_alpha,"best min_df",best_min_df
0
max_loglike -701.863599147 alpha 0 min_df 0.1
0.1
max_loglike -701.698320488 alpha 0.1 min_df 0.1
1
max_loglike -700.218818186 alpha 1 min_df 0.1
5
max_loglike -693.812946439 alpha 5 min_df 0.1
10
max_loglike -686.175301985 alpha 10 min_df 0.1
50
max_loglike -636.827815439 alpha 50 min_df 0.1
100
max_loglike -595.120648905 alpha 100 min_df 0.1
150
max_loglike -566.893075242 alpha 150 min_df 0.1
200
max_loglike -547.192426876 alpha 200 min_df 0.1
best max_loglike -547.192426876 best alpha 200 best min_df 0.1
We can see that the best-scoring parameters are absurd, particularly the very high best_min_df (a min_df of 0.1 ignores every word that appears in fewer than 10% of our documents), so it looks like we can't salvage our model this way. But is there anything else we can do?
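One more thing worth noticing in the search output is that both winning values sit at the largest setting we offered. That often means the grid, rather than the data, decided the answer. Below is a sketch of the same search written as a reusable function, plus a check for that situation. The scoring function is a made-up stand-in for the cross-validated log-likelihood, and all names are ours.

```python
import itertools

def grid_search(alphas, min_dfs, score):
    """Return (best_score, best_alpha, best_min_df) over every combination."""
    best = (float("-inf"), None, None)
    for alpha, min_df in itertools.product(alphas, min_dfs):
        value = score(alpha, min_df)
        if value > best[0]:
            best = (value, alpha, min_df)
    return best

def on_grid_edge(value, grid):
    """True if the chosen value is the smallest or largest one we tried."""
    return value == min(grid) or value == max(grid)

# Made-up score with a peak inside the grid, for illustration only.
def toy_score(alpha, min_df):
    return -(alpha - 5) ** 2 - 100 * (min_df - 0.01) ** 2

toy_alphas = [0.1, 1, 5, 10]
toy_min_dfs = [1e-3, 1e-2, 1e-1]
best_score, best_alpha_sketch, best_min_df_sketch = grid_search(
    toy_alphas, toy_min_dfs, toy_score)

toy_alpha_on_edge = on_grid_edge(best_alpha_sketch, toy_alphas)  # False here
# For the run above, alpha 200 and min_df 0.1 are both on the edge of their grids.
run_alpha_on_edge = on_grid_edge(200, [0, .1, 1, 5, 10, 50, 100, 150, 200])
run_min_df_on_edge = on_grid_edge(1e-1, [1e-6, 1e-7, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1])
```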
In the case of movie reviews, Rotten Tomatoes compresses "thumbs up", "four out of five stars", and other such positive reviews down to just "fresh". Similarly, we took all positive days and converted them to "up", and all negative days and converted them to "down". Seeing as we're up against a shortage of data, it's sensible for us to avoid this strategy (which is what allowed us to do binary classification earlier) and instead keep more detailed performance information.
We rebuild the performance indicator portion of our data set, this time looking at the fractional gain, clipping it from -1 to 1, and then transforming it to the range 0 to 1. So now a gain of 3% would be .56. Again, we comment out the code to build the data set and just read it in from our previous efforts.
#data = pd.read_csv('data.csv')
#data = data[data['date']!=weekend_str[0]]
#data = data[data['date']!=weekend_str[1]]
#data,performance = perf2.check_performance_scaled(data)
#data = data[data['performance']>=0]#only keep if performance is 0 to 1, not NaN
##data.to_csv('data_perf_scaled.csv')
data = pd.read_csv('data_perf_scaled.csv')
As in our previous model, we now run make_xy. Again, the X will be our bag of words results in matrix form, and our Y will be our performance indicators (which, recall, are continuous and based on the gain/loss.)
X, Y = sent.make_xy(data,vectorizer=None)
In the binary classification model, we then ran this through a multinomial naive Bayes model to predict the probability of up/down for each stock. Since we aren't giving our model a binary classification anymore, this will no longer work. However, we can still use a MultinomialNB model. Now, instead of predicting the probability of 1/0, this will actually build a prediction of performance (such as, say, .52 for a 1% gain) based on the elements of our bag of words.
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.33,random_state=22)
clf = MultinomialNB()
clf.fit(X_train,Y_train)
#clf.predict(X_test)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
To check the accuracy of our model, we see which stocks it predicts a nontrivial gain for (using a tolerance of .2% movement) and then see if the direction predicted is correct.
predvec = clf.predict(X_test)
indexpred = np.abs(predvec-.5)>0
actual = np.divide(Y_test[indexpred]-.5,np.abs(Y_test[indexpred]-.5))*.5+.5
predict = np.divide(clf.predict(X_test)[indexpred]-.5,np.abs(clf.predict(X_test)[indexpred]-.5))*.5+.5
print "accuracy on testing set",1-float(sum(abs(actual-predict)))/float(len(actual))
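#note: predict holds 0/1 directions, so this is the share of predicted ups; use actual in place of predict for the realized share
print "portion of stocks with motion going up in testing set",float(len(predict[predict>0]))/float(len(predict))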
predvec = clf.predict(X_train)
indexpred = np.abs(predvec-.5)>0
actual = np.divide(Y_train[indexpred]-.5,np.abs(Y_train[indexpred]-.5))*.5+.5
predict = np.divide(clf.predict(X_train)[indexpred]-.5,np.abs(clf.predict(X_train)[indexpred]-.5))*.5+.5
print "accuracy on training set",1-float(sum(abs(actual-predict)))/float(len(actual))
accuracy on testing set 0.646341463415
portion of stocks with motion going up in testing set 0.40243902439
accuracy on training set 0.981389578164
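The np.divide expressions above are compact but a little cryptic: (y - .5)/|y - .5| is the sign of the move, and multiplying by .5 and adding .5 maps -1 to 0 and +1 to 1. Here is a plain-Python sketch of the same idea with made-up numbers. The function names are ours, and counting an actual value of exactly .5 as a miss is our choice for the sketch.

```python
def direction(scaled):
    """Map a scaled performance value to 1 (up) or 0 (down); None when exactly 0.5."""
    if scaled == 0.5:
        return None
    return 1 if scaled > 0.5 else 0

def directional_accuracy(y_true, y_pred):
    """Share of correct directions among rows where the model predicts a move."""
    # Keep only rows where the prediction is not exactly 0.5, as indexpred does.
    pairs = [(direction(t), direction(p)) for t, p in zip(y_true, y_pred)
             if direction(p) is not None]
    # A flat actual value (None) never equals 0 or 1, so it counts as a miss.
    hits = sum(1 for t, p in pairs if t == p)
    return float(hits) / len(pairs)

# Made-up scaled values, for illustration only.
y_true = [0.52, 0.47, 0.56, 0.50, 0.44]
y_pred = [0.53, 0.52, 0.50, 0.48, 0.45]
acc = directional_accuracy(y_true, y_pred)  # 2 correct out of 4 scored rows
```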
And we've succeeded! We still have a very high accuracy on our training set, indicating over-fitting issues, but the model gives us accurate results on our testing set, enabling us to gain an edge. The edge is not massive: we are getting our predictions right about 65% of the time, while the split between up stocks and down stocks reported above (among the stocks with sufficient movement) is 40/60, so always guessing "down" would be right about 60% of the time. That leaves us an edge of several percent, which is considered very significant in the realm of finance.
In our binary classification models, we could check the level of overfitting from here by binning up predictions (10%-20% up, 20%-30% up, and so on) and seeing if our accuracy matches our predictions. Unfortunately, one cost of using our Multinomial Naive Bayes model in this way is that we no longer have probability predictions, just direction predictions. In essence, we've lost our error term, which is always something we'd like to avoid when possible in data analysis. This also makes it difficult to tune our model, such as by carefully adjusting the min_df or alpha parameters. Still, if the price of going from just noise to a useful prediction is losing tunability and error analysis, we'll take it.
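For reference, the binning check described above, which only applies when the model gives probability predictions, could look roughly like the sketch below. The data is made up and the function name is ours. A well-calibrated model would show an observed rate close to the mean prediction in every bin.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Group predicted probabilities into equal-width bins and compare with outcomes.

    Returns one tuple per non-empty bin:
    (bin_low, bin_high, mean_predicted, observed_rate, count)
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    table = []
    for i, members in enumerate(bins):
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / float(len(members))
        rate = sum(y for _, y in members) / float(len(members))
        table.append((i / float(n_bins), (i + 1) / float(n_bins),
                      mean_p, rate, len(members)))
    return table

# Made-up predicted probabilities of "up" and the 0/1 outcomes.
probs = [0.1, 0.15, 0.45, 0.55, 0.9, 0.95, 1.0]
outcomes = [0, 0, 1, 0, 1, 1, 1]
table = calibration_table(probs, outcomes)
```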
One way we can check the reasonableness of our model, though, is by checking some words with strong negative and positive indications, as we did with Rotten Tomatoes.
vec = CountVectorizer(min_df = 1e-3)
#vec = CountVectorizer()
text = [words for i,words in data.text.iteritems()]
vec.fit(text)
words = np.array(vec.get_feature_names())
singles = np.eye(len(words))
X_df, Y_df = sent.make_xy(data,vectorizer=vec)
X_train_df,X_test_df,Y_train_df,Y_test_df = train_test_split(X_df,Y_df,test_size=0.33,random_state=22)
clf_df = MultinomialNB()
clf_df.fit(X_train_df,Y_train_df)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
#clf_df = clf
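#predict_proba columns follow clf_df.classes_, so columns 0 and 1 are the two smallest labels found in Y_train_df
negativeindices = clf_df.predict_proba(sparse.csc_matrix(singles))[:,0].argsort()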
positiveindices = clf_df.predict_proba(sparse.csc_matrix(singles))[:,1].argsort()
badwords=words[negativeindices]
badprobs = clf_df.predict_proba(singles)[negativeindices,1]
badprobs=badprobs[::-1]
badwords=badwords[::-1]
goodwords=words[positiveindices]
goodwords=goodwords[::-1]
goodprobs = clf_df.predict_proba(singles)[positiveindices,1]
goodprobs = goodprobs[::-1]
print "negative words",[(badwords[i],badprobs[i]) for i in range(10)]
print "positive words",[(goodwords[i],goodprobs[i]) for i in range(10)]
negative words [(u'dva', 0.00065429865287221971), (u'davita', 0.00070301231619910383), (u'healthcare', 0.0006835894031860572), (u'partners', 0.0013194339424065626), (u'dialysis', 0.00072809514936371226), (u'dvadva', 0.00072592477320190742), (u'scan', 0.00072375005946349695), (u'rated', 0.00071782864597029678), (u'patients', 0.00071095170136868311), (u'health', 0.00066552469716559491)]
positive words [(u'cpb', 0.032144008476641339), (u'soup', 0.013377734519007296), (u'options', 0.011244340670651403), (u'campbell', 0.010575715690765052), (u'day', 0.008966277117938088), (u'today', 0.0081946750241469728), (u'keeneonmarket', 0.007881447433107246), (u'over', 0.0078077819981858913), (u'iv', 0.00693413576281382), (u'up', 0.0068814076007903853)]
On both sides, it looks like there are a lot of words that were specific to just one or two stocks in our set. We again have to recall that we're not predicting probabilities of going up or down but a magnitude in the direction, so one stock that really tanked on one day could easily have a large effect on our most extreme words. This seems to be the case with so many pharmaceutical and health related words in our negative category. On the positive side, in addition to a few that seem stock specific, we do see some general terms we might have expected, such as "options" and "up".
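The identity-matrix trick used above deserves a word of explanation: each row of np.eye(len(words)) is a pretend document containing exactly one word once, so predict_proba on it returns the class probabilities the model assigns to that word alone. The sketch below shows the same probe on a tiny hand-rolled multinomial naive Bayes with two classes. The toy data and all function names are ours, and this is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit a small multinomial naive Bayes model on tokenized docs."""
    vocab = sorted(set(w for doc in docs for w in doc))
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {"vocab": vocab, "classes": sorted(class_counts),
             "log_prior": {}, "log_like": {}}
    for c in model["classes"]:
        # Additive smoothing: every word gets alpha extra counts in every class.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        model["log_prior"][c] = math.log(class_counts[c] / float(len(labels)))
        model["log_like"][c] = dict(
            (w, math.log((word_counts[c][w] + alpha) / total)) for w in vocab)
    return model

def single_word_posterior(model, word):
    """P(class | a document containing only `word`), i.e. one identity-matrix row."""
    scores = dict((c, model["log_prior"][c] + model["log_like"][c][word])
                  for c in model["classes"])
    norm = math.log(sum(math.exp(s) for s in scores.values()))
    return dict((c, math.exp(s - norm)) for c, s in scores.items())

# Toy tokenized documents with 1 = up and 0 = down.
docs = [["up", "buy", "up"], ["down", "sell"], ["up", "today"], ["down", "down", "today"]]
labels = [1, 0, 1, 0]
model = train_nb(docs, labels)

# Rank words from most to least associated with the "up" class.
ranked = sorted(model["vocab"],
                key=lambda w: single_word_posterior(model, w)[1], reverse=True)
# ranked -> ['up', 'buy', 'today', 'sell', 'down']
```

With only two classes the two posteriors sum to 1, so one ranking serves for both the positive and negative lists. With more classes, each column has to be read as the probability of one specific label.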
This is a long way from a predictive tool that could make money in real time, and it's not clear how good a choice it is compared with some of the other options we explored. We did have the advantage of compiling all of the tweets for each day, including tweets about stock behavior that had already happened. Still, in its current form the model provides an edge of several percent and shows potential to be honed into a more real-time tool, should one desire.