import datetime as dt
import matplotlib.pyplot as plt
import numpy as np
import numpy.linalg as la
import pandas as pd
import scipy.cluster.hierarchy as sch
import seaborn as sns
from pandas.io.data import DataReader
from scipy.stats import mode
from scipy.stats.stats import pearsonr
from nearest_correlation import nearcorr
from IPython.display import Image
%matplotlib inline
Outline of Talk:
Download and wrangle S&P 500 daily price data from Yahoo! Finance.
Simulate heavy-tailed time series. Estimate their correlation matrix and examine the distribution of its eigenvalues.
Review a theorem of Marchenko and Pastur related to the functional form of this eigenvalue distribution.
Apply these results to the S&P 500 return data and use them to develop an eigenvalue based noise filtering method.
Apply the noise filter, reconstruct the correlation matrix, and find the nearest correlation matrix to the filtered matrix.
Cluster and examine the associated sector structure of the filtered correlation matrix.
# Gather Data
#
# A list of S&P 500 tickers is available at:
# http://data.okfn.org/data/core/s-and-p-500-companies/r/constituents.csv
#
# One can also build up a potential list of tickers for
# stocks traded on the NASDAQ, NYSE, and AMEX exchanges from:
# http://www.nasdaq.com/screening/company-list.aspx
tickerdf = pd.read_csv('SandP500.csv') # read in S&P 500 tickers
tickers = list(tickerdf['Symbol'].values) # extract list of tickers from dataframe
verbose_flag = False # flag that turns on ticker print statements
start_date = dt.datetime(2010, 1, 1) # date when to start downloading data
ticker_df_list = [] # initialize list of dataframes for each ticker
for ticker in tickers:
try:
r = DataReader(ticker, "yahoo", start=start_date) # download price data from yahoo
r['Ticker'] = ticker
ticker_df_list.append(r)
if verbose_flag:
print "Obtained data for ticker %s" % ticker
except:
if verbose_flag:
print "No data for ticker %s" % ticker
df = pd.concat(ticker_df_list) # build single df of all data
cell= df[['Ticker','Adj Close']] # extract ticker and close price information
cell.reset_index().sort(['Ticker', 'Date'], ascending=[1,0]).set_index('Ticker')
cell.to_pickle('close_price.pkl') # pickle data
We first simulate time series that have differences drawn from a heavy tailed distribution. Specifically, we:
Sample 2000 draws from a Student's t distribution with degree of freedom parameter 5 for 500 paths
Compute the cumulative sum of each time series and plot the results in the first figure below
Plot the distribution of differences of one of the time series in the second plot below
Estimate the correlation matrix between the different series, compute its eigenvalues, and plot their histogram in the third plot
deg_freedom = 5 # Student's t distribution degree of freedom parameter
num_srs = 500 # number of time series to generate
num_pts = 2000 # number of points per time series
rpaths = np.random.standard_t(deg_freedom,size=(num_pts,num_srs)) # generate paths
evalplot = la.eig(pd.DataFrame(rpaths).corr())[0] # compute eigenvalues
plt.figure(figsize=(16,5))
plt.plot(rpaths.cumsum(axis=0),alpha=0.5)
plt.show()
plt.figure(figsize=(16,5))
plt.hist(rpaths.T[1],bins=50);
plt.figure(figsize=(16,5))
plt.hist(evalplot,bins=50);
Note that the eigenvalue distribution has an interesting shape. We paraphrase a theorem of Marchenko and Pastur that describes this shape:
Let $X$ be a random $n\times r$ matrix whose entries are i.i.d. and have finite variance $\sigma^2$. Then the eigenvalues of the correlation matrix corresponding to $X$ are distributed according to $$ f(\lambda)=\frac{q}{2\pi\lambda}\sqrt{(\lambda_+ - \lambda)(\lambda-\lambda_-)} \quad \mathrm{where\:\:} \lambda_{\pm} = 1+\frac{1}{q}+2\sqrt{\frac{1}{q}} \:\: \mathrm{and} \:\: q = \frac{n}{r} $$
We plot this function overlaid on the eigenvalue histogram below.
def eigden(lam,n,r):
'''
Definition of the Marchenko Pastur density
'''
q = float(n)/r
lplus = 1+1/q+2*np.sqrt(1/q)
lminus = 1+1/q-2*np.sqrt(1/q)
return q/(2*np.pi*lam)*np.sqrt((lplus-lam)*(lam-lminus))
lamvls = np.linspace(0.001,3,1000)
plt.figure(figsize=(14,5))
plt.hist(evalplot,bins=75,normed=True)
plt.plot(lamvls,[eigden(lam,num_pts,num_srs) for lam in lamvls])
plt.show()
We now construct a noise filtering algorithm based on one developed by Bouchaud and Potters that utilizes this density and apply it to the S&P 500. Specifically, we:
Specify a start and end date for the timeseries that we want to consider and compute the mode of the lengths of these series.
We keep series that have lengths equal to this mode, compute their daily returns, normalize the variance of each time series to one, and store the result in a dataframe.
Next, we compute the correlation matrix associated with these time series and compute its eigenvalues and eigenvectors.
We then replace each eigenvalue that is less than the upper eigenvalue threshold in the Marchenko Pastur density with the mean of these eigenvalues so that the trace of the correlation matrix is preserved.
We next reconstruct a filtered candidate correlation matrix from the filtered eigenvalues and our original eigenvectors.
We next use an algorithm of Mike Croucher available at: https://github.com/mikecroucher/nearest_correlation
to find the nearest correlation matrix to our filtered candidate correlation matrix. Here we search over the space of positive definite, symmetric matrices with ones on the main diagonal to find the one that most closely matches the candidate matrix in the Frobenius norm.
def lamplus(n,r):
'''
Upper eigenvalue limit of the Marchenko Pastur density
'''
q = float(n)/float(r)
return (1+1/q+2*np.sqrt(1/q))
# Read in closing price data if it is not already a local variable
if 'cell' not in locals():
df = pd.read_pickle('close_price.pkl')
else:
df = cell
dte1 = '2010-07-01'
dte2 = '2015-07-01'
tickers = sorted(list(set(df['Ticker'].values))) # sorted list of unique tickers
tkrlens = [len(df[df.Ticker==tkr][dte1:dte2]) for tkr in tickers] # find lengths of times series for each ticker
tkrmode = mode(tkrlens)[0][0] # find mode of time series lengths
# idenfity tickers whose lenghts equal the mode and call these good tickers
good_tickers = [tickers[i] for i,tkr in enumerate(tkrlens) if tkrlens[i]==tkrmode]
rtndf = pd.DataFrame() # initialize a return dataframe
# Normalize all time series to have mean zero and variance one and compute their returns
for tkr in good_tickers:
tmpdf = df[df.Ticker==tkr]['Adj Close'][dte1:dte2]
tmprtndf = ((tmpdf-tmpdf.shift(1))/tmpdf).dropna()
rsdf = (tmprtndf-tmprtndf.mean())/tmprtndf.std()
rtndf = pd.concat([rtndf, rsdf],axis=1)
rtndf = rtndf.dropna()
rtndf.columns = good_tickers
t,m = rtndf.shape
cmat = rtndf.corr() # compute correlation matrix
evls, evcs = la.eig(cmat) # compute eigenvalue/vector decomposition of matrix
evallst = map(abs,evls) # take abs of evals (very small imag parts)
filtvals = [val for val in evallst if val < lamplus(t,m)] # filter eigenvalues
sevlist = [np.mean(filtvals)]*len(filtvals)
feval = evallst[:(len(evallst)-len(sevlist))] + sevlist # build list of new eigenvalues
rcmat = abs(np.dot(np.dot(evcs,np.diag(feval)),la.inv(evcs))) # reconstruct candidate correlation matrix
rcmat = (rcmat + rcmat.T)/2 # symmetrize the candidate matrix
ncorr = nearcorr(rcmat, max_iterations=1000) # find nearest correlation matrix
ncorrdf = pd.DataFrame(ncorr,columns=good_tickers,index=good_tickers)
We plot the top ten eigenvalues of our S&P 500 correlation matrix as well as a histogram of the smaller eigenvalues overlaid with the Marchenko Pastur density below
print evls[0:10]
plt.figure(figsize=(14,5))
plt.hist([val for val in evls if val < 3] ,bins=100,normed=True);
plt.plot(lamvls,[eigden(lam,len(rtndf),len(rtndf.columns)) for lam in lamvls]);
[ 197.3194315 15.58781937 10.52743889 7.10374275 5.47001199 4.67135602 3.87457002 3.47408963 3.21331472 2.70143365]
-- We now cluster the filtered correlation matrix using a hierarchical clustering method from SciPy.
-- This require us to first define a distance matrix whose entries define a measure of distance between the time series that we are considering.
-- We define this distance to be 1 less the correlation matrix.
-- We next define a notion of distance between multiple cluster of time series to be the mean distance between the clusters. Specifically, if $A$,$B$ are two sets of time series, then we define the distance between $A$ and $B$ to be
$$d(A,B) = \frac{1}{|A||B|}\sum_{a\in A,b\in B}(1-\rho(a,b))$$
-- We then treat each time series as a separate unit, find the pair that has smallest distance between them, then join them together
-- This process is then repeated but now on a single cluster of two time series and the remaining single time series. The process repeats until only one cluster remains
-- We display the results of this clustering method below
-- Note that there is structure in the associated ordered correlation matrix
sns.clustermap(1-ncorrdf, figsize=(23,23)); # all the work is done by seaborn/scipy!
<seaborn.matrix.ClusterGrid at 0x7f64a4682f50>
We drill down into some of the rectangles that encode sector structure in this matrix. We note in our first example, that all the names are energy/power companies, e.g.
PEG - Public Service Enterprise Group (New Jersey)
Image(filename='corrfig1.png')
We also find a rectangle off the main diagonal of this correlation matrix where utilities are clustered together with another set of utiliy stocks.
Image(filename='corrfig2.png')
Here is an off diagonal group of real estate related stocks:
-- EQR: Equity Residential
-- AVB: Avalon Bay Communities
-- HCN: Real Estate Investment Trust
Image(filename='corrfig3.png')
The main rectangle consist of bank stocks. The slightly lighter vertical strips (e.g. SCHW and ACE, ALL and XL) are insurance companies of a brokerage in the case of SCHW
Image(filename='corrfig5.png')
References:
Email: steve98654-at-gmail-dot-com (currently seeking a new position!!)
Talk on GitHub: https://github.com/steve98654/PyTalk/ (file corr_filt.ipynb)
Bouchaud and Potters: Financial Applications of Random Matrix Theory (a short review) http://arxiv.org/pdf/0910.1205.pdf
Marchenko,V. A., Pastur, L. A. (1967) "Distribution of eigenvalues for some sets of random matrices", Mat. Sb. (N.S.), 72(114):4, 507–536