Bayesian data analysis with PyMC3

Thomas Wiecki

Quantopian Inc.

About me¶

PhD candidate at Brown studying decision making using Bayesian modeling.
Quantitative Researcher at Quantopian Inc: Building the world's first algorithmic trading platform in the web browser.

Why should you care about Bayesian Data Analysis?¶

In [1]:

from IPython.display import Image
import prettyplotlib as ppl
from prettyplotlib import plt
import numpy as np
from matplotlib import rc
rc('font',**{'family':'sans-serif','sans-serif':['Helvetica'], 'size': 22})
rc('xtick', labelsize=14) 
rc('ytick', labelsize=14)
## for Palatino and other serif fonts use:
#rc('font',**{'family':'serif','serif':['Palatino']})
rc('text', usetex=True)
%matplotlib inline

In [2]:

Image('backbox_ml.png')

Out[2]:

Blackbox models not good at conveying what they have learned.

In [4]:

Image('openbox_pp.png')

Out[4]:

Probabilistic Programming¶

Model unknown causes of a phenomenon as random variables.
Write a programmatic story of how unknown causes result in observable data.
Use Bayes formula to invert generative model to infer unknown causes.

Random Variables as Probability Distributions¶

Represents our beliefs about an unknown state.
Probability distribution assigns a probability to each possible state.
Not a single number (e.g. most likely state).

Coin-flipping experiment.¶

Given multiple flips, what is probability of getting heads?
Maximum Likelihood answer:

$$\frac{\# \text{heads}}{\text{total throws}}$$

However:

$$\frac{50}{100} = \frac{1}{2}$$

Clearly something is missing!
Quantification of uncertainty.

Moreover...

Consider a single flip which comes up heads:

$$ P(\text{heads}) = \frac{1}{1} = 1 $$

Again, this doesn't seem right.
Incorporate prior knowledge.

In [2]:

from scipy import stats
# set every possibility to be equally possible
x_coin = np.linspace(0, 1, 101)

$$ \text{Express probability of heads as random variable } \theta$$¶

In [4]:

import prettyplotlib as ppl
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel=r'Hypothesis for chance of heads', 
            ylabel=r'Probability of hypothesis', 
            title=r'Prior probability distribution after no coin tosses')
ppl.plot(ax, x_coin, stats.beta(2, 2).pdf(x_coin), linewidth=3.)
ax.set_xticklabels([r'0\%', r'20\%', r'40\%', r'60\%', r'80\%', r'100\%']);
# fig.savefig('coin1.png')

Out[4]:

[<matplotlib.text.Text at 0x48432d0>,
 <matplotlib.text.Text at 0x4841210>,
 <matplotlib.text.Text at 0x486e590>,
 <matplotlib.text.Text at 0x486ec50>,
 <matplotlib.text.Text at 0x4872350>,
 <matplotlib.text.Text at 0x4872a10>]

/home/wiecki/envs/pymc3/local/lib/python2.7/site-packages/IPython/core/formatters.py:239: FormatterWarning: Exception in image/png formatter: LaTeX was not able to process the following string:
'0\\%'
Here is the full report generated by LaTeX: 

This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian)
 restricted \write18 enabled.
entering extended mode
(./73e233a9842fdd7f7961a32eb5d9efbb.tex
LaTeX2e <2011/06/27>
Babel <3.9h> and hyphenation patterns for 2 languages loaded.
(/usr/share/texlive/texmf-dist/tex/latex/base/article.cls
Document Class: article 2007/10/19 v1.4h Standard LaTeX document class
(/usr/share/texlive/texmf-dist/tex/latex/base/size10.clo))
(/usr/share/texlive/texmf-dist/tex/latex/type1cm/type1cm.sty)
(/usr/share/texlive/texmf-dist/tex/latex/psnfss/helvet.sty
(/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty))
(/usr/share/texlive/texmf-dist/tex/latex/psnfss/courier.sty)
(/usr/share/texlive/texmf-dist/tex/latex/base/textcomp.sty
(/usr/share/texlive/texmf-dist/tex/latex/base/ts1enc.def))
(/usr/share/texlive/texmf-dist/tex/latex/geometry/geometry.sty
(/usr/share/texlive/texmf-dist/tex/generic/oberdiek/ifpdf.sty)
(/usr/share/texlive/texmf-dist/tex/generic/oberdiek/ifvtex.sty)
(/usr/share/texlive/texmf-dist/tex/generic/ifxetex/ifxetex.sty)

Package geometry Warning: Over-specification in `h'-direction.
    `width' (5058.9pt) is ignored.


Package geometry Warning: Over-specification in `v'-direction.
    `height' (5058.9pt) is ignored.

)
No file 73e233a9842fdd7f7961a32eb5d9efbb.aux.
(/usr/share/texlive/texmf-dist/tex/latex/base/ts1cmr.fd)
(/usr/share/texlive/texmf-dist/tex/latex/psnfss/ot1pnc.fd)
! Font OT1/pnc/m/n/10=pncr7t at 10.0pt not loadable: Metric (TFM) file not foun
d.
<to be read again> 
                   relax 
l.11 \begin{document}
                     
*geometry* driver: auto-detecting
*geometry* detected driver: dvips
(/usr/share/texlive/texmf-dist/tex/latex/psnfss/ot1phv.fd)
! Font OT1/phv/m/n/14=phvr7t at 14.0pt not loadable: Metric (TFM) file not foun
d.
<to be read again> 
                   relax 
l.12 \fontsize{14.000000}{17.500000}{\sffamily
                                               0\%}
! Font OT1/pnc/m/n/14=pncr7t at 14.0pt not loadable: Metric (TFM) file not foun
d.
<to be read again> 
                   relax 
l.13 \end{document}
                   
[1] (./73e233a9842fdd7f7961a32eb5d9efbb.aux) )
(see the transcript file for additional information)
Output written on 73e233a9842fdd7f7961a32eb5d9efbb.dvi (1 page, 188 bytes).
Transcript written on 73e233a9842fdd7f7961a32eb5d9efbb.log.

  FormatterWarning,

<matplotlib.figure.Figure at 0x41dd610>

In [98]:

stats.beta(2,2).rvs(1)

Out[98]:

array([ 0.37833632])

In [21]:

import random
import numpy as np

def choose_coin():
    return stats.beta(2, 2).rvs(1) # random.uniform(0,1) # np.random.normal(0,1) # 
# pylab.hist([choose_coin() for dummy in range(100000)], normed=True, bins=100)
successes = []
returns = 0.
for i in range(10000):
    prob_heads = choose_coin()
    results = [random.uniform(0,1) < prob_heads for dummy in range(10)]
    if results.count(True) == 9: 
        successes.append(prob_heads)
        if random.uniform(0,1) < prob_heads:
            returns += -1
        else:
            returns += 1
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
r = ax.hist(np.array(successes), normed=True, bins=20)
print len(successes)
print "Average return", returns / len(successes)

719
Average return -0.613351877608

$$ \theta \sim \text{Beta}(2, 2) $$$$ P(\theta) = \text{Beta}(2, 2) $$

In [108]:

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel='Hypothesis for chance of heads', 
            ylabel='Probability of hypothesis', 
            title='Posterior probability distribution after first heads')
ppl.plot(ax, x_coin, stats.beta(1, 1).pdf(x_coin)*stats.beta(90, 10).pdf(x_coin), linewidth=3.)
ax.set_xticklabels([r'0\%', r'20\%', r'40\%', r'60\%', r'80\%', r'100\%']);
plt.savefig('coin2.png')

In [ ]:

$$ P(\theta | h=1) = \text{Beta}(3, 2) $$

In [10]:

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel='Hypothesis for chance of heads', 
            ylabel='Probability of hypothesis', 
            title='Posterior probability distribution after 1 head, 1 tail')
ppl.plot(ax, x_coin, stats.beta(3, 3).pdf(x_coin), linewidth=3.)
ax.set_xticklabels(['0\%', '20\%', '40\%', '60\%', '80\%', '100\%']);
fig.savefig('coin3.png')

In [11]:

Image('coin3.png')

Out[11]:

$$ P(\theta | [h=1, t=1]) = \text{Beta}(3, 3) $$

In [12]:

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel='Hypothesis for chance of heads', 
            ylabel='Probability of hypothesis', 
            title='Posterior probability distribution after 20 heads and 20 tails')
ppl.plot(ax, x_coin, stats.beta(22, 22).pdf(x_coin), linewidth=3.)
ax.set_xticklabels(['0\%', '20\%', '40\%', '60\%', '80\%', '100\%']);
fig.savefig('coin4.png')

In [13]:

Image('coin4.png')

Out[13]:

$$ P(\theta | [h=20, t=20]) = \text{Beta}(22, 22) $$

Bayes Formula¶

$$P(\theta| \text{data}) \propto P(\theta) \ast P(\text{data} |\theta)$$$$\text{posterior} \propto \text{prior} \ast \text{likelihood}$$

$\theta$: Parameters of model (chance of getting heads)).

Except in simple cases, posterior impossible to compute analytically.
Blackbox approximation algorithm: Markov-Chain Monte Carlo (MCMC) instead draws samples from the posterior.

In [14]:

from scipy import stats
fig = ppl.plt.figure(figsize=(14, 6))
ax1 = fig.add_subplot(121, title='What we want', ylim=(0, .5), xlabel=r'\theta', ylabel=r'P(\theta)')
ppl.plot(ax1, np.linspace(-4, 4, 100), stats.norm.pdf(np.linspace(-3, 3, 100)), linewidth=4.)
ax2 = fig.add_subplot(122, title='What we get', xlim=(-4, 4), ylim=(0, 1800), xlabel=r'\theta', ylabel='\# of samples')
ax2.hist(np.random.randn(10000), bins=20);

In [14]:

Approximating the posterior with MCMC sampling¶

In [15]:

Image('wantget.png')

Out[15]:

PyMC3¶

Probabilistic Programming framework written in Python.
Allows for construction of probabilistic models using intuitive syntax.
Features advanced MCMC samplers.
Fast: Just-in-time compiled by Theano.
Extensible: easily incorporates custom MCMC algorithms and unusual probability distributions.

Linear Models¶

Assumes a linear relationship between two variables.
E.g. stock price between Gold and Gold Miners.

In [16]:

size = 200
true_intercept = 1
true_slope = 2

x = np.linspace(0, 1, size)
# y = a + b*x
true_regression_line = true_intercept + true_slope * x
# add noise
y = true_regression_line + np.random.normal(scale=.5, size=size)

data = dict(x=x, y=y)

In [17]:

fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, xlabel='Value of gold', ylabel='Value of gold miners', title='Synthetic data and underlying model')
ppl.scatter(ax, x, y, label='sampled data')
ppl.plot(ax, x, true_regression_line, label='true regression line', linewidth=4.)
ax.legend(loc=2);
fig.savefig('synth_data.png')

In [18]:

Image('synth_data.png')

Out[18]:

Linear Regression¶

$$ y_i = \alpha + \beta \ast x_i + \epsilon $$

with $$ \epsilon \sim \mathcal{N}(0, \sigma^2) $$

Probabilistic Reformulation¶

$$ y_i \sim \mathcal{N}(\alpha + \beta \ast x_i, \sigma^2) $$

Priors¶

$$ \alpha \sim \mathcal{N}(0, 20^2) $$$$ \beta \sim \mathcal{N}(0, 20^2) $$$$ \sigma \sim \mathcal{U}(0, 20) $$

Constructing model in PyMC3¶

In [19]:

import pymc as pm

with pm.Model() as model: # model specifications in PyMC3 are wrapped in a with-statement
    # Define priors
    alpha = pm.Normal('alpha', mu=0, sd=20)
    beta = pm.Normal('beta', mu=0, sd=20)
    sigma = pm.Uniform('sigma', lower=0, upper=20)
    
    # Define linear regression
    y_est = alpha + beta * x
    
    # Define likelihood
    likelihood = pm.Normal('y', mu=y_est, sd=sigma, observed=y)
    
    # Inference!
    start = pm.find_MAP() # Find starting value by optimization
    step = pm.NUTS(state=start) # Instantiate MCMC sampling algorithm
    trace = pm.sample(2000, step, start=start, progressbar=False) # draw 2000 posterior samples using NUTS sampling

/Users/alexcoventry/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/theano/scan_module/scan_perform_ext.py:85: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
  from scan_perform.scan_perform import *

Covenience function glm()¶

In [20]:

with pm.Model() as model:
    # specify glm and pass in data. The resulting linear model, its likelihood and 
    # and all its parameters are automatically added to our model.
    pm.glm.glm('y ~ x', data)
    step = pm.NUTS() # Instantiate MCMC sampling algorithm
    trace = pm.sample(2000, step, progressbar=False) # draw 2000 posterior samples using NUTS sampling

Posterior¶

In [21]:

fig = pm.traceplot(trace, lines={'alpha': 1, 'beta': 2, 'sigma': .5});

In [22]:

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, xlabel='Value of gold', ylabel='Value of gold miners', title='Posterior predictive regression lines')
ppl.scatter(ax, x, y, label='data')
from pymc import glm
glm.plot_posterior_predictive(trace, samples=100, 
                              label='posterior predictive regression lines')
ppl.plot(ax, x, true_regression_line, label='true regression line', linewidth=5.)
ax.legend(loc=0);
fig.savefig('ppc1.png')

In [22]:

In [23]:

Image('ppc1.png')

Out[23]:

Robust Regression¶

In [24]:

# Add outliers
x_out = np.append(x, [.1, .15, .2, .25, .25])
y_out = np.append(y, [8, 6, 9, 7, 9])

data_out = dict(x=x_out, y=y_out)
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111,  xlabel='Value of gold', ylabel='Value of gold miners', title='Posterior predictive regression lines')
ppl.scatter(ax, x_out, y_out, label='data')

Out[24]:

<matplotlib.axes.AxesSubplot at 0x118ba0150>

In [25]:

with pm.Model() as model:
    glm.glm('y ~ x', data_out)
    trace = pm.sample(2000, pm.NUTS(), progressbar=False)

In [26]:

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111,  xlabel='Value of gold', ylabel='Value of gold miners', title='Posterior predictive regression lines')
ppl.scatter(ax, x_out, y_out, label='data')
glm.plot_posterior_predictive(trace, samples=100, 
                              label='posterior predictive regression lines')
ppl.plot(ax, x, true_regression_line, 
         label='true regression line', linewidth=5.)

plt.legend(loc=0);
fig.savefig('ppc2.png')

In [27]:

Image('ppc2.png')

Out[27]:

In [28]:

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
normal_dist = pm.Normal.dist(mu=0, sd=1)
t_dist = pm.T.dist(mu=0, lam=1, nu=1)
x_eval = np.linspace(-8, 8, 300)
ppl.plot(ax, x_eval, pm.theano.tensor.exp(normal_dist.logp(x_eval)).eval(), label='Normal', linewidth=2.)
ppl.plot(ax, x_eval, pm.theano.tensor.exp(t_dist.logp(x_eval)).eval(), label='Student T', linewidth=2.)
plt.xlabel('x')
plt.ylabel('Probability density')
plt.legend();
fig.savefig('t-dist.png')

Fit strongly biased by outliers¶

Normal distribution has very light tails.
-> sensitive to outliers.
Instead, use Student T distribution with heavier tails.

In [29]:

Image('t-dist.png')

Out[29]:

In [30]:

with pm.Model() as model_robust:
    family = pm.glm.families.T()
    pm.glm.glm('y ~ x', data_out, family=family)
    
    trace_robust = pm.sample(2000, pm.NUTS(), progressbar=False)

In [31]:

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, xlabel='Value of gold', ylabel='Value of gold miners', title='Posterior predictive regression lines')
ppl.scatter(ax, x_out, y_out)
glm.plot_posterior_predictive(trace_robust, samples=100,
                              label='posterior predictive regression lines')
ppl.plot(ax, x, true_regression_line, 
         label='true regression line', linewidth=5.)
plt.legend();
fig.savefig('ppc3.png')

In [32]:

Image('ppc3.png')

Out[32]:

Real-world example: Algorithmic Trading¶

Pairtrading is a famous technique that plays two stocks against each other.
For this to work, stocks must be correlated (cointegrated).
One common example is the price of gold (GLD) and the price of gold mining operations (GDX).

In [33]:

import zipline
import pytz
from datetime import datetime
fig = plt.figure(figsize=(8, 4))

prices = zipline.data.load_from_yahoo(stocks=['GLD', 'GDX'], 
                                 end=datetime(2013, 8, 1, 0, 0, 0, 0, pytz.utc)).dropna()[:1000]
prices.plot();

GLD
GDX

<matplotlib.figure.Figure at 0x123e0d210>

In [34]:

fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111, xlabel='Price GDX in \$', ylabel='Price GLD in \$')
colors = np.linspace(0.1, 1, len(prices))
mymap = plt.get_cmap("winter")
sc = ax.scatter(prices.GDX, prices.GLD, c=colors, cmap=mymap, lw=0)
cb = plt.colorbar(sc)
cb.ax.set_yticklabels([str(p.date()) for p in prices[::len(prices)//10].index]);
fig.savefig('price_corr.png')

In [35]:

Image('price_corr.png')

Out[35]:

Naive model assumes constant linear regression.¶

In [36]:

with pm.Model() as model_reg:
    family = pm.glm.families.Normal()
    pm.glm.glm('GLD ~ GDX', prices, family=family)
    trace_reg = pm.sample(2000, pm.NUTS(), progressbar=False)

Hm... kinda unsatisfying...¶

In [37]:

fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111, xlabel='Price GDX in \$', ylabel='Price GLD in \$', 
            title='Posterior predictive regression lines')
sc = ax.scatter(prices.GDX, prices.GLD, c=colors, cmap=mymap, lw=0)
glm.plot_posterior_predictive(trace_reg, samples=100, 
                              label='posterior predictive regression lines',
                              lm=lambda x, sample: sample['Intercept'] + sample['GDX'] * x,
                              eval=np.linspace(prices.GDX.min(), prices.GDX.max(), 100))
cb = plt.colorbar(sc)
cb.ax.set_yticklabels([str(p.date()) for p in prices[::len(prices)//10].index]);
ax.legend(loc=0);
fig.savefig('ppc4.png')

In [38]:

Image('ppc4.png')

Out[38]:

Clearly the regression between GDX and GLD changes over time.
But it does so gradually.
Can we build a model that allows for gradual changes in the coefficients?
YES!

Improved model¶

Assumes that intercept and slope follow a random walk.
At each time-point, the coefficients can move a step from their previous values.
This allows the coefficients to track the regression as it changes over time.

$$ \alpha_t \sim \mathcal{N}(\alpha_{t-1}, \sigma_\alpha^2) $$$$ \beta_t \sim \mathcal{N}(\beta_{t-1}, \sigma_\beta^2) $$

In [39]:

from pymc.distributions.timeseries import *
from theano.tensor import repeat

$$\text{Priors for }\sigma_{\alpha}\text{ and }\sigma_{\beta}$$¶

In [40]:

model_randomwalk = pm.Model()
with model_randomwalk:
    # std of random walk, best sampled in log space.
    sigma_alpha, log_sigma_alpha = model_randomwalk.TransformedVar(
                            'sigma_alpha', 
                            pm.Exponential.dist(1./.02, testval = .1), 
                            pm.logtransform
    )
    sigma_beta, log_sigma_beta = model_randomwalk.TransformedVar(
                            'sigma_beta', 
                            pm.Exponential.dist(1./.02, testval = .1),
                            pm.logtransform
    )

Define regression coefficients to follow a random walk.¶

In [41]:

# To make the model simpler, we will apply the same coefficient for 50 data points at a time
subsample_alpha = 50
subsample_beta = 50

with model_randomwalk:
    alpha = GaussianRandomWalk('alpha', sigma_alpha**-2, 
                               shape=len(prices) / subsample_alpha)
    beta = GaussianRandomWalk('beta', sigma_beta**-2, 
                              shape=len(prices) / subsample_beta)
    
    # Make coefficients have the same length as prices
    alpha_r = repeat(alpha, subsample_alpha)
    beta_r = repeat(beta, subsample_beta)    

Define regression and likelihood¶

In [42]:

with model_randomwalk:
    # Define regression
    regression = alpha_r + beta_r * prices.GDX.values
    
    # Assume prices are Normally distributed, the mean comes from the regression.
    sd = pm.Uniform('sd', 0, 20)
    likelihood = pm.Normal('y', 
                           mu=regression, 
                           sd=sd, 
                           observed=prices.GLD.values)

Inference!¶

In [43]:

from scipy import optimize
with model_randomwalk:
    # First optimize random walk
    start = pm.find_MAP(vars=[alpha, beta], fmin=optimize.fmin_l_bfgs_b)
    
    # Sample
    step = pm.NUTS(scaling=start)
    trace_rw = pm.sample(2000, step, start=start, progressbar=False)

intercept changes over time.¶

In [44]:

fig = plt.figure(figsize=(8, 6))
ax = plt.subplot(111, xlabel='time', ylabel='alpha', title='Change of alpha over time.')
ppl.plot(ax, trace_rw[-1000:][alpha].T, 'r', alpha=.05);
ax.set_xticklabels([str(p.date()) for p in prices[::len(prices)//5].index]);
fig.savefig('rwalk_alpha.png')

In [45]:

Image('rwalk_alpha.png')

Out[45]:

Slope changes over time.¶

In [46]:

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel='time', ylabel='beta', title='Change of beta over time')
ppl.plot(ax, trace_rw[-1000:][beta].T, 'b', alpha=.05);
ax.set_xticklabels([str(p.date()) for p in prices[::len(prices)//5].index]);
fig.savefig('rwalk_beta.png')

In [47]:

Image('rwalk_beta.png')

Out[47]:

Regression slowly adapts to best fit current data¶

In [48]:

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel='Price GDX in \$', ylabel='Price GLD in \$', 
            title='Posterior predictive regression lines')

colors = np.linspace(0.1, 1, len(prices))
colors_sc = np.linspace(0.1, 1, len(trace_rw[-500::10]['alpha'].T))
mymap = plt.get_cmap('winter')
mymap_sc = plt.get_cmap('winter')

xi = np.linspace(prices.GDX.min(), prices.GDX.max(), 50)
for i, (alpha, beta) in enumerate(zip(trace_rw[-500::10]['alpha'].T, trace_rw[-500::10]['beta'].T)):
    for a, b in zip(alpha, beta):
        ax.plot(xi, a + b*xi, alpha=.05, lw=1, c=mymap_sc(colors_sc[i]))
        
sc = ax.scatter(prices.GDX, prices.GLD, label='data', cmap=mymap, c=colors)
cb = plt.colorbar(sc)
cb.ax.set_yticklabels([str(p.date()) for p in prices[::len(prices)//10].index]);
fig.savefig('ppc5.png')

In [49]:

Image('ppc5.png')

Out[49]:

Conclusions¶

Probabilistic Programming allows you to tell a genarative story.
Blackbox inference algorithms allow estimation of complex models.
PyMC3 puts advanced samplers at your fingertips.

Outstanding Issues¶

Scalability

Variational Inference
see also Max Welling's work for scaling MCMC

Usability

still too difficult to use
wanted: library on top of PyMC3 with common models