I reanalyze the data from the paper Quantifying Trading Behavior in Financial Markets Using Google Trends, by Preis, Moat and Stanley. The objective of this reanalysis is to check whether the Google Trends data show detectable predictive ability for future DJIA returns, or whether the published results are consistent with data-mining bias.
This analysis is possible because the original authors have provided the Google Trends and DJIA data and the code used in their paper.
While this is an iPython notebook, all the work is done in R via the rmagic extension.
# throat clearing
%load_ext rmagic
%%R
# my notebook is wider than the default 80 chars, so set it here:
options(width=300)
# load some packages (install them first if you do not have them):
require(xts)
Loading required package: xts
Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric
Here I load the data from GitHub, then split the trends data and the DJIA data into separate xts objects.
%%R
# Read data file; if you have it locally, do this:
# dat <- read.csv("PreisMoatStanley_ScientificReports_3_1684_2013.dat",sep=" ")
# else, workaround for curl/ssl download from github (see http://stackoverflow.com/a/4127133/164611 )
temporaryFile <- tempfile()
download.file("https://raw.github.com/leesanglo/Replicating_google_trends/master/PreisMoatStanley_ScientificReports_3_1684_2013.dat",
destfile=temporaryFile, method="curl")
dat <- read.csv(temporaryFile,sep=" ")
# peel off DJIA data, and make into an xts
DJIA.data <- dat[,names(dat) %in% c("DJIA.Date","DJIA.Closing.Price")]
dat <- dat[,!(names(dat) %in% names(DJIA.data))]
# add +1 day to the date to round up to midnight.
DJIA.xts <- xts(DJIA.data[,"DJIA.Closing.Price"],
order.by=as.POSIXct(DJIA.data[,"DJIA.Date"])+86400)
# peel off dates and make into an xts
google.dates <- dat[,names(dat) %in% c("Google.Start.Date","Google.End.Date")]
dat <- dat[,!(names(dat) %in% names(google.dates))]
search.terms.xts <- xts(dat,order.by=as.POSIXct(google.dates[,"Google.End.Date"]))
# clean up
rm(list=c("DJIA.data","google.dates","dat"))
Now, for every search term, I center the search frequency by subtracting the mean of the previous delta.t weeks' values. This follows the methodology of the original authors, but is applied via a vectorized function. I spot check my work.

After this, I break ties arbitrarily: any centered term less than 1e-5 in absolute value is moved to the tie-breaker value. Later we will use the sign of the signal to determine whether to take long or short positions, so this has the effect of taking a long position in the case of a tie. This arguably gives the strategy a mild tailwind given the long bias of the market, but actually has little effect. If you would like, you can change the TIE.BREAKER and TIE.LIMIT values to explore.
%%R
# this function takes a vector, and returns the difference
# between each value and the mean value over the previous
# (up to) lag values.
running.center <- function(x,lag=10) {
x.cum <- cumsum(x)
x.dif <- c(x.cum[1:lag],diff(x.cum,lag))
x.df <- pmin(1:length(x.cum),lag)
x.mu <- x.dif / x.df
x.ret <- c(NaN,x[2:length(x)] - x.mu[1:(length(x)-1)])
return(x.ret)
}
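# a quick toy illustration (my addition, not part of the original):
# running.center(c(1,2,3,4,5),lag=3) gives NaN 1.0 1.5 2.0 2.0,
# i.e. each value minus the mean of the (up to) lag preceding values.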
# follow the authors in using a 3 week window:
delta.t <- 3
# make the detrended 'signal'
signal.xts <- xts(apply(search.terms.xts,2,running.center,lag=delta.t),
order.by=time(search.terms.xts))
# at this point, do a spot check to make sure our function worked OK
my.err <- signal.xts[delta.t+5,10] -
(search.terms.xts[delta.t+5,10] - mean(search.terms.xts[5:(delta.t+4),10]))
if (abs(my.err) > 1e-8)
stop("fancy function miscomputes the running mean")
# chop off the first delta.t rows
signal.xts <- signal.xts[-c(1:delta.t)]
mkt.xts <- DJIA.xts[-c(1:delta.t)] # and for the market
# trading signal; the original authors 'short' the trend:
trade.xts <- - signal.xts
# break ties arbitrarily. anything smaller than a certain absolute
# value gets moved to the tie-breaker.
TIE.BREAKER <- 1e-5
TIE.LIMIT <- abs(TIE.BREAKER)
trade.xts[abs(trade.xts) <= TIE.LIMIT] <- TIE.BREAKER
Now I 'backtest' the signal. This is a very simplistic backtesting function, and should not be used for real evaluation of strategies (insert standard legal disclosure here). For the purposes of evaluating a weekly-rebalancing, single instrument strategy where the cost to short is essentially zero, market impact is low, etc., this is a reasonable, if slightly optimistic, estimate of trading performance. It does assume you can trade on the index (instead of an ETF), and also assumes your positions are perfectly sized, and you pay no commissions. If a strategy looked profitable based on this backtest, you would want to go to the next finer level of backtest fidelity, although I doubt it will be warranted in this case.
If you would like to test the signal as a magnitude, you can uncomment one line below.
%%R
# braindead 'backtest' function
dumb.bt <- function(sig.xts,mkt.xts) {
if (dim(sig.xts)[1] != dim(mkt.xts)[1])
stop("wrong row sizes")
mkt.lret <- diff(log(mkt.xts),lag=1)
mkt.rret <- exp(mkt.lret) - 1
mkt.rret <- as.matrix(mkt.rret[-1]) # chop the first
sig.xts <- sig.xts[-dim(sig.xts)[1]] # chop the last
bt.rets <- xts(apply(sig.xts,2,function(v) { v * mkt.rret }),
order.by=time(sig.xts))
return(bt.rets)
}
# backtest the sign:
bt.rets <- dumb.bt(sign(trade.xts),mkt.xts)
# instead, backtest the magnitude:
#bt.rets <- dumb.bt(4 * trade.xts,mkt.xts)
bt.lrets <- log(1 + bt.rets) # compute log returns
bt.mtm <- exp(apply(bt.lrets,2,cumsum))
Now I take the log returns from the 98 tested search terms' backtests, and perform t-tests on each of them. I am testing against a two-sided alternative. This seems reasonable, since the trading scheme is so oddly defined: take the sign of the centered search data, then short it. I suspect that the more obvious version of going long this signal was first tested, and found to be lacking. That is, we can assume that one would be happy to either trade on any of these strategies if they looked profitable, or short any of them if doing so also looked profitable. Thus a two-sided alternative.
Just using the vanilla t-test ignores possible autocorrelation and heteroskedasticity, which tend to inflate the achieved type I rate. This is not of great concern, since I suspect we will not reject the null anyway.
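A quick check on that caveat would be to redo the per-term tests with heteroskedasticity- and autocorrelation-consistent (HAC) standard errors. Here is a minimal sketch, assuming the sandwich and lmtest packages are installed (my addition, not part of the original methodology):
require(sandwich)
require(lmtest)
# HAC-robust p-values for the zero-mean null, one per search term;
# regress each return series on an intercept, with Newey-West errors:
hac.pvals <- apply(bt.lrets,2,function(x) {
  mod <- lm(x ~ 1)
  coeftest(mod,vcov.=NeweyWest(mod))[1,"Pr(>|t|)"]
})
# qqunif(hac.pvals) # can be Q-Q plotted just like the vanilla p-values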
I then Q-Q plot the p-values from the 98 t-tests against a uniform law. Under the null hypothesis that the Google Trends data is independent of future DJIA returns (and ignoring the fact that the backtests are correlated with each other!), the Q-Q plot should fall along the $y=x$ line, which I plot in red here. To my eye, this just looks like data mining (the bad kind).
%%R -w 500 -h 500
# first: apply a t-test to every column, get the p-values
ttest.pvals <- apply(bt.lrets,2,function(x) {
t.res <- t.test(x,alternative="two.sided")
p.v <- t.res$p.value
})
# function for Q-Q plot against uniformity
qqunif <- function(x,xlab="Theoretical Quantiles under Uniformity",
ylab=NULL,...) {
if (is.null(ylab))
ylab=paste("Sample Quantiles (",deparse(substitute(x)),")",sep="")
qqplot(qunif(ppoints(length(x))),x,xlab=xlab,ylab=ylab,...)
abline(0,1,col='red')
}
qqunif(ttest.pvals)
Here I spawn an equal number of totally random strategies, backtest them in the same way, perform a t-test then Q-Q plot the p-values. I expect the results to look just like the above plot. Indeed they do. This suggests that the mild deviation from the $y=x$ line seen above is 'normal'.
%%R -w 500 -h 500
set.seed(12345) # remind me to change my luggage combo ;)
rand.xts <- xts(matrix(rnorm(prod(dim(trade.xts))),nrow=dim(trade.xts)[1]),
order.by=time(trade.xts))
# backtest the sign:
rand.rets <- dumb.bt(sign(rand.xts),mkt.xts)
rand.lrets <- log(1 + rand.rets) # compute log returns
ttest.rand.pvals <- apply(rand.lrets,2,function(x) {
t.res <- t.test(x,alternative="two.sided")
p.v <- t.res$p.value
})
qqunif(ttest.rand.pvals)
Here I perform another kind of 'backtest': For a given signal, I construct the $2\times 2$ contingency table based on the sign of the centered Google Trends signal, and the sign of the leading DJIA weekly return. I then perform an odds-ratio test, and compute the p-value. Again, ignoring correlation across search terms, under the null these p-values should fall near the $y=x$ line when Q-Q plotted versus uniformity. Which they do.
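To make the construction concrete, here is a minimal sketch of the $2\times 2$ table for the first search term alone (my illustration; the backtest function below does the equivalent via epitools):
# tabulate the sign of the signal against the sign of the
# leading weekly market return, for the first search term:
v <- sign(as.numeric(trade.xts[,1]))[-nrow(trade.xts)] # chop the last
r <- sign(as.numeric(diff(log(mkt.xts),lag=1))[-1]) # chop the first
print(table(signal=v,market=r))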
%%R -w 500 -h 500
# perform oddsratio tests:
require(epitools)
# braindead 'odds-ratio backtest' function
dumb.odds.bt <- function(sig.xts,mkt.xts) {
if (dim(sig.xts)[1] != dim(mkt.xts)[1])
stop("wrong row sizes")
mkt.lret <- diff(log(mkt.xts),lag=1)
mkt.rret <- exp(mkt.lret) - 1
mkt.rret <- as.matrix(mkt.rret[-1]) # chop the first
sig.xts <- sig.xts[-dim(sig.xts)[1]] # chop the last
bt.rets <- apply(sig.xts,2,function(v) {
or.tst <- oddsratio(x=factor(sign(v)),y=factor(sign(mkt.rret)))
or.tst$p.value[2,"fisher.exact"]})
return(bt.rets)
}
# odds-ratio backtest the sign:
bt.odds <- dumb.odds.bt(sign(trade.xts),mkt.xts)
qqunif(bt.odds)
Loading required package: epitools
To deal with possible correlation among the returns of the various search terms' implied strategies, I use Hotelling's test, which is the multivariate generalization of the t-test. Essentially I am testing whether the $98$-vector of weekly log returns has mean zero (i.e., equals the zero vector). If that were the case, then all linear combinations (i.e. portfolios) of the implied strategies would also have zero mean. There is a fascinating connection between Markowitz optimization, the Sharpe ratio, and Hotelling's test, but I digress.
In this case the sample optimal Markowitz portfolio has in-sample Sharpe of around $4.5\mbox{yr}^{-1/2}$, with corresponding $T^2$ value of around 140. The corresponding p-value under the null of zero mean is around 0.35, meaning there is little evidence to suggest the returns are not zero mean. 95% confidence intervals on the population-maximal Sharpe (essentially inverting the non-central F distribution for non-centrality parameter) contain zero. Using the Kubokawa-Robert-Saleh ('KRS') method to estimate the population optimal Sharpe yields a value of around $0.9\mbox{yr}^{-1/2}$. Note however that one cannot be certain to achieve this Sharpe because of mis-estimation of the Markowitz portfolio.
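For intuition, the $T^2$ statistic can also be computed directly from the returns. A minimal sketch, assuming bt.lrets has no missing rows, using the identity that $T^2$ is $n$ times the squared (per-week) sample maximal Sharpe:
X <- as.matrix(bt.lrets)
n <- nrow(X); p <- ncol(X)
xbar <- colMeans(X)
# Hotelling's T^2 is n times the squared sample maximal Sharpe:
T2 <- n * as.numeric(t(xbar) %*% solve(cov(X),xbar))
# under the null of zero mean (and Gaussian returns), a rescaled
# T^2 follows a central F law:
Fstat <- ((n - p) / (p * (n - 1))) * T2
pval <- pf(Fstat,df1=p,df2=n - p,lower.tail=FALSE)
sr.opt <- sqrt(T2 / n) * sqrt(52) # annualized, for weekly data
print(c(T2=T2,p.value=pval,SR.opt=sr.opt))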
%%R
require(SharpeR)
# under the latest github version of the package, this is legit,
# but bonks under current CRAN version:
#srs <- as.sr(bt.lrets)
#sharpe.test <- sr_test(srs,alternative="two.sided")
# this gives the same plot as the t-test plot above, so skip it.
#qqunif(sharpe.test$p.value)
# Hotelling's test
big.sr <- as.sropt(bt.lrets)
print(big.sr)
print(confint(big.sr))
print(inference(big.sr,type="KRS"))
Loading required package: SharpeR
       SR/sqrt(yr) T^2 value Pr(>T^2)
Sharpe       4.474     141.6     0.35
     2.5 %   97.5 %
[1,]     0 2.566544
          [,1]
[1,] 0.8804619
One way to evaluate the (backtested) returns of a bunch of strategies is to split the historical data into an 'in-sample' and 'out-of-sample' period, and see how consistent the performance is across the divide. The rationale is that one would, at the cut time, observe the in-sample data, select a portfolio of the strategies to trade upon and experience the returns of the out-of-sample period.
Rather than get fancy, here I split the data into two equal-sized epochs, and scatter the Sharpe ratio in the in- and out-of-sample periods. I suspect this test is really no different from the Hotelling $T^2$ test, but is expressible in terms more easily understood by quant practitioners (or their bosses).
In this case, the cross-validation scatter is a blob; regression from in-sample to out-of-sample does not have a significant slope. If we selected the strategy with the highest in-sample Sharpe (around $2.0\mbox{yr}^{-1/2}$), we would have been disappointed with its performance out of sample (Sharpe of $-0.5\mbox{yr}^{-1/2}$).
I should also note that the best in-sample performance is associated with the search term home, while the worst is associated with the term fond. Perhaps someone with a better imagination than I have can spin a story around these; they certainly are not as suggestive as the term debt, which gives the best performance in the entire sample.
%%R -w 500 -h 500
# split em.
n.row <- dim(bt.lrets)[1]
n.split <- floor(n.row/2)
# you need the latest version of this:
# require(devtools)
# install_github('SharpeR',username='shabbychef')
require(SharpeR)
srs.is <- as.sr(bt.lrets[1:n.split,])
srs.oos <- as.sr(bt.lrets[(n.split+1):n.row,])
i.v.o <- lm(srs.oos$sr ~ srs.is$sr)
print(summary(i.v.o))
plot(srs.is$sr,srs.oos$sr)
abline(i.v.o)
cat(sprintf("best in-sample strategy: '%s'\n",rownames(srs.is$sr)[which.max(srs.is$sr)]))
cat(sprintf("worst in-sample strategy: '%s'\n",rownames(srs.is$sr)[which.min(srs.is$sr)]))
Call:
lm(formula = srs.oos$sr ~ srs.is$sr)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.36416 -0.31939  0.00189  0.33204  1.08633 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.15266    0.05808   2.628  0.00999 **
srs.is$sr    0.11175    0.11156   1.002  0.31899   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4984 on 96 degrees of freedom
Multiple R-squared: 0.01034, Adjusted R-squared: 3.572e-05
F-statistic: 1.003 on 1 and 96 DF, p-value: 0.319

best in-sample strategy: 'home'
worst in-sample strategy: 'fond'
Here I use the Leung and Wong test for equality of Sharpe ratios, as well as the Wright, Yam, Yung variant. Both of these tests reject the null that all 98 search terms' backtests have the same Sharpe. I do not have a lot of experience with these tests (and may not have implemented them correctly!), but suspect they can reject for reasons which are not interesting. On the other hand, these tests might actually be very powerful, but suggest a difference so small that we are unlikely to capture it in the real world (again, due to error in selecting the optimal portfolio).
%%R
require(SharpeR)
# these do *not* have equal SR; but the test
# can reject for weird reasons...
all.eq <- sr_equality_test(as.matrix(bt.lrets),type="F")
print(all.eq)
all.eq <- sr_equality_test(as.matrix(bt.lrets),type="chisq")
print(all.eq)
	test for equality of Sharpe ratio, via F test

data:  as.matrix(bt.lrets)
T2 = 243.6777, contrasts = 97, p-value = 5.141e-05
alternative hypothesis: true sum squared contrasts of SNR is not equal to 0

	test for equality of Sharpe ratio, via chisq test

data:  as.matrix(bt.lrets)
T2 = 243.6777, contrasts = 97, p-value = 1.317e-14
alternative hypothesis: true sum squared contrasts of SNR is not equal to 0
The tests conducted here suggest there is no detectable predictive ability of Google Trends search data on the future returns of the DJIA when processed in the form suggested by Preis et al. The results seen by those authors are entirely consistent with data-mining bias.