# Bayesian data analysis with PyMC3

## Quantopian Inc.

First navigate down and then right to view all slides.

• PhD candidate at Brown studying decision making using Bayesian modeling.
• Quantitative Researcher at Quantopian Inc: Building the world's first algorithmic trading platform in the web browser.

# Why should you care about Bayesian Data Analysis?

In [2]:
from IPython.display import Image
import prettyplotlib as ppl
from prettyplotlib import plt
import numpy as np
from matplotlib import rc
rc('font',**{'family':'sans-serif','sans-serif':['Helvetica'], 'size': 22})
rc('xtick', labelsize=14)
rc('ytick', labelsize=14)
## for Palatino and other serif fonts use:
#rc('font',**{'family':'serif','serif':['Palatino']})
rc('text', usetex=True)
%matplotlib inline

In [3]:
Image('backbox_ml.png')

Out[3]:
• Blackbox models are poor at conveying what they have learned.
In [4]:
Image('openbox_pp.png')

Out[4]:

# Probabilistic Programming

• Model unknown causes of a phenomenon as random variables.
• Write a programmatic story of how unknown causes result in observable data.
• Use Bayes formula to invert generative model to infer unknown causes.
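As a minimal sketch of such a programmatic story (the function name and prior are illustrative, not from the slides): an unknown coin bias is drawn from a prior, and flips are simulated from it.

```python
import numpy as np

rng = np.random.RandomState(0)

def generate_flips(n_flips=10):
    """Generative story: an unknown cause (the bias) produces observable flips."""
    theta = rng.beta(2, 2)             # unknown cause, modeled as a random variable
    flips = rng.rand(n_flips) < theta  # observable data: True = heads
    return theta, flips

theta, flips = generate_flips()
```

Inference then runs this story "backwards": given only `flips`, infer which values of `theta` plausibly produced them.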

# Random Variables as Probability Distributions

• Represents our beliefs about an unknown state.
• Probability distribution assigns a probability to each possible state.
• Not a single number (e.g. most likely state).
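Concretely (a small illustration using scipy, not code from the slides): a belief is a whole density over possible states, and collapsing it to the single most likely state discards the uncertainty.

```python
import numpy as np
from scipy import stats

states = np.linspace(0, 1, 101)               # possible values of the unknown state
beliefs = stats.beta(2, 2).pdf(states)        # probability density assigned to each state
point_estimate = states[np.argmax(beliefs)]   # a single number: the most likely state
# The density also tells us how spread out (uncertain) the belief is;
# the point estimate alone does not.
```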

# Coin-flipping experiment.

• Given multiple flips, what is the probability of getting heads?

$$\frac{\# \text{heads}}{\text{total flips}}$$

• However:

$$\frac{50}{100} = \frac{1}{2}$$

• Clearly something is missing!
• Quantification of uncertainty.

Moreover...

• Consider a single flip which comes up heads:

$$P(\text{heads}) = \frac{1}{1} = 1$$

• Again, this doesn't seem right.
• Incorporate prior knowledge.
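A sketch of how incorporating prior knowledge tempers the single-flip estimate (using the Beta(2, 2) prior introduced on the next slides): the posterior after one head is Beta(3, 2), whose mean is 0.6 rather than 1.

```python
from scipy import stats

prior_a, prior_b = 2, 2    # Beta(2, 2) prior on the probability of heads
heads, tails = 1, 0        # a single flip that came up heads

# Conjugate update: Beta(a, b) prior + coin-flip data -> Beta(a + heads, b + tails)
posterior = stats.beta(prior_a + heads, prior_b + tails)
print(posterior.mean())    # 0.6: pulled toward 1/2 by the prior, not all the way to 1
```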
In [3]:
from scipy import stats
# grid of possible values for the probability of heads
x_coin = np.linspace(0, 1, 101)


## $$\text{Express probability of heads as random variable } \theta$$

In [4]:
import prettyplotlib as ppl
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111,
                     ylabel=r'Probability of hypothesis',
                     title=r'Prior probability distribution after no coin tosses')
ppl.plot(ax, x_coin, stats.beta(2, 2).pdf(x_coin), linewidth=3.)
ax.set_xticklabels([r'0\%', r'20\%', r'40\%', r'60\%', r'80\%', r'100\%']);
# fig.savefig('coin1.png')

In [98]:
stats.beta(2,2).rvs(1)

Out[98]:
array([ 0.37833632])
In [21]:
import random
import numpy as np

def choose_coin():
    # draw a coin whose bias for heads is itself random
    return stats.beta(2, 2).rvs(1)

successes = []
returns = 0.
for i in range(10000):
    prob_heads = choose_coin()
    results = [random.uniform(0, 1) < prob_heads for dummy in range(10)]
    if results.count(True) == 9:
        # having seen 9 heads out of 10, bet that the next flip comes up tails
        successes.append(prob_heads)
        if random.uniform(0, 1) < prob_heads:
            returns += -1
        else:
            returns += 1
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
r = ax.hist(np.array(successes), normed=True, bins=20)
print len(successes)
print "Average return", returns / len(successes)

719
Average return -0.613351877608


$$\theta \sim \text{Beta}(2, 2)$$ $$P(\theta) = \text{Beta}(2, 2)$$

In [108]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111,
                     ylabel='Probability of hypothesis',
                     title='Posterior probability distribution after first heads')
ppl.plot(ax, x_coin, stats.beta(3, 2).pdf(x_coin), linewidth=3.)
ax.set_xticklabels([r'0\%', r'20\%', r'40\%', r'60\%', r'80\%', r'100\%']);
plt.savefig('coin2.png')


$$P(\theta | h=1) = \text{Beta}(3, 2)$$

In [10]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111,
                     ylabel='Probability of hypothesis',
                     title='Posterior probability distribution after 1 head, 1 tail')
ppl.plot(ax, x_coin, stats.beta(3, 3).pdf(x_coin), linewidth=3.)
ax.set_xticklabels([r'0\%', r'20\%', r'40\%', r'60\%', r'80\%', r'100\%']);
fig.savefig('coin3.png')

In [11]:
Image('coin3.png')

Out[11]:

$$P(\theta | [h=1, t=1]) = \text{Beta}(3, 3)$$

In [12]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111,
                     ylabel='Probability of hypothesis',
                     title='Posterior probability distribution after 20 heads and 20 tails')
ppl.plot(ax, x_coin, stats.beta(22, 22).pdf(x_coin), linewidth=3.)
ax.set_xticklabels([r'0\%', r'20\%', r'40\%', r'60\%', r'80\%', r'100\%']);
fig.savefig('coin4.png')

In [13]:
Image('coin4.png')

Out[13]:

$$P(\theta | [h=20, t=20]) = \text{Beta}(22, 22)$$

# Bayes Formula

$$P(\theta| \text{data}) \propto P(\theta) \ast P(\text{data} |\theta)$$ $$\text{posterior} \propto \text{prior} \ast \text{likelihood}$$

$\theta$: Parameters of the model (chance of getting heads).
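The formula can be checked numerically on the coin example (a sketch; the grid size and counts are illustrative): multiply the Beta(2, 2) prior by the binomial likelihood on a grid and compare with the analytic conjugate posterior Beta(22, 22).

```python
import numpy as np
from scipy import stats

grid = np.linspace(0, 1, 1001)
h, t = 20, 20                               # observed heads and tails

prior = stats.beta(2, 2).pdf(grid)          # P(theta)
likelihood = grid**h * (1 - grid)**t        # P(data | theta), up to a constant
posterior = prior * likelihood              # proportional to P(theta | data)
posterior /= posterior.sum() * (grid[1] - grid[0])  # normalize numerically

analytic = stats.beta(2 + h, 2 + t).pdf(grid)       # conjugate result: Beta(22, 22)
# The numeric and analytic posteriors agree to within grid error.
```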

• Except in simple cases, the posterior is impossible to compute analytically.
• A blackbox approximation algorithm, Markov-Chain Monte Carlo (MCMC), instead draws samples from the posterior.
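As a toy illustration of the idea (a generic random-walk Metropolis sketch, much simpler than the NUTS sampler PyMC3 actually uses): proposals are accepted or rejected so that the collected samples approximate the target posterior, here Beta(22, 22).

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(42)
logp = stats.beta(22, 22).logpdf     # target log-density (unnormalized would work too)

theta = 0.5
samples = []
for _ in range(20000):
    proposal = theta + rng.normal(scale=0.1)          # random-walk proposal
    # Accept with probability min(1, p(proposal) / p(theta))
    if np.log(rng.rand()) < logp(proposal) - logp(theta):
        theta = proposal
    samples.append(theta)

samples = np.array(samples[5000:])   # discard burn-in
print(samples.mean())                # close to the posterior mean, 0.5
```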
In [14]:
from scipy import stats
fig = ppl.plt.figure(figsize=(14, 6))
ax1 = fig.add_subplot(121, title='What we want', ylim=(0, .5), xlabel=r'\theta', ylabel=r'P(\theta)')
ppl.plot(ax1, np.linspace(-4, 4, 100), stats.norm.pdf(np.linspace(-3, 3, 100)), linewidth=4.)
ax2 = fig.add_subplot(122, title='What we get', xlim=(-4, 4), ylim=(0, 1800), xlabel=r'\theta', ylabel='\# of samples')
ax2.hist(np.random.randn(10000), bins=20);


## Approximating the posterior with MCMC sampling

In [15]:
Image('wantget.png')

Out[15]:

# PyMC3

• Probabilistic Programming framework written in Python.
• Allows for construction of probabilistic models using intuitive syntax.
• Fast: Just-in-time compiled by Theano.
• Extensible: easily incorporates custom MCMC algorithms and unusual probability distributions.

# Linear Models

• Assumes a linear relationship between two variables.
• E.g. the stock prices of gold (GLD) and gold miners (GDX).
In [16]:
size = 200
true_intercept = 1
true_slope = 2

x = np.linspace(0, 1, size)
# y = a + b*x
true_regression_line = true_intercept + true_slope * x
y = true_regression_line + np.random.normal(scale=.5, size=size)

data = dict(x=x, y=y)

In [17]:
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, xlabel='Value of gold', ylabel='Value of gold miners', title='Synthetic data and underlying model')
ppl.scatter(ax, x, y, label='sampled data')
ppl.plot(ax, x, true_regression_line, label='true regression line', linewidth=4.)
ax.legend(loc=2);
fig.savefig('synth_data.png')

In [18]:
Image('synth_data.png')

Out[18]:

# Linear Regression

$$y_i = \alpha + \beta \ast x_i + \epsilon$$

with $$\epsilon \sim \mathcal{N}(0, \sigma^2)$$

## Probabilistic Reformulation

$$y_i \sim \mathcal{N}(\alpha + \beta \ast x_i, \sigma^2)$$

### Priors

$$\alpha \sim \mathcal{N}(0, 20^2)$$ $$\beta \sim \mathcal{N}(0, 20^2)$$ $$\sigma \sim \mathcal{U}(0, 20)$$

## Constructing model in PyMC3

In [19]:
import pymc as pm

with pm.Model() as model:  # model specifications in PyMC3 are wrapped in a with-statement
    # Define priors
    alpha = pm.Normal('alpha', mu=0, sd=20)
    beta = pm.Normal('beta', mu=0, sd=20)
    sigma = pm.Uniform('sigma', lower=0, upper=20)

    # Define linear regression
    y_est = alpha + beta * x

    # Define likelihood
    likelihood = pm.Normal('y', mu=y_est, sd=sigma, observed=y)

    # Inference!
    start = pm.find_MAP()  # Find starting value by optimization
    step = pm.NUTS(state=start)  # Instantiate MCMC sampling algorithm
    trace = pm.sample(2000, step, start=start, progressbar=False)  # draw 2000 posterior samples using NUTS sampling



# Convenience function glm()

In [20]:
with pm.Model() as model:
    # Specify glm and pass in data. The resulting linear model, its likelihood
    # and all its parameters are automatically added to our model.
    pm.glm.glm('y ~ x', data)
    step = pm.NUTS()  # Instantiate MCMC sampling algorithm
    trace = pm.sample(2000, step, progressbar=False)  # draw 2000 posterior samples using NUTS sampling


# Posterior

In [21]:
fig = pm.traceplot(trace, lines={'alpha': 1, 'beta': 2, 'sigma': .5});

In [22]:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, xlabel='Value of gold', ylabel='Value of gold miners', title='Posterior predictive regression lines')
ppl.scatter(ax, x, y, label='data')
from pymc import glm
glm.plot_posterior_predictive(trace, samples=100,
label='posterior predictive regression lines')
ppl.plot(ax, x, true_regression_line, label='true regression line', linewidth=5.)
ax.legend(loc=0);
fig.savefig('ppc1.png')

In [23]:
Image('ppc1.png')

Out[23]:

# Robust Regression

In [24]:
# Add outliers
x_out = np.append(x, [.1, .15, .2, .25, .25])
y_out = np.append(y, [8, 6, 9, 7, 9])

data_out = dict(x=x_out, y=y_out)
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111,  xlabel='Value of gold', ylabel='Value of gold miners', title='Posterior predictive regression lines')
ppl.scatter(ax, x_out, y_out, label='data')

Out[24]:
<matplotlib.axes.AxesSubplot at 0x118ba0150>
In [25]:
with pm.Model() as model:
    glm.glm('y ~ x', data_out)
    trace = pm.sample(2000, pm.NUTS(), progressbar=False)

In [26]:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111,  xlabel='Value of gold', ylabel='Value of gold miners', title='Posterior predictive regression lines')
ppl.scatter(ax, x_out, y_out, label='data')
glm.plot_posterior_predictive(trace, samples=100,
label='posterior predictive regression lines')
ppl.plot(ax, x, true_regression_line,
label='true regression line', linewidth=5.)

plt.legend(loc=0);
fig.savefig('ppc2.png')

In [27]:
Image('ppc2.png')

Out[27]:
In [28]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
normal_dist = pm.Normal.dist(mu=0, sd=1)
t_dist = pm.T.dist(mu=0, lam=1, nu=1)
x_eval = np.linspace(-8, 8, 300)
ppl.plot(ax, x_eval, pm.theano.tensor.exp(normal_dist.logp(x_eval)).eval(), label='Normal', linewidth=2.)
ppl.plot(ax, x_eval, pm.theano.tensor.exp(t_dist.logp(x_eval)).eval(), label='Student T', linewidth=2.)
plt.xlabel('x')
plt.ylabel('Probability density')
plt.legend();
fig.savefig('t-dist.png')


# Fit strongly biased by outliers

• Normal distribution has very light tails.
• It is therefore very sensitive to outliers.
• Instead, use Student T distribution with heavier tails.
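The difference in tail weight is easy to quantify with scipy (a sketch; the four-sigma cutoff is arbitrary):

```python
from scipy import stats

# Probability of a point falling beyond +4 under each distribution
normal_tail = stats.norm.sf(4)    # N(0, 1) survival function, ~3e-5
t_tail = stats.t.sf(4, df=1)      # Student T with 1 degree of freedom, ~0.08

# Under the normal, a 4-sigma outlier is essentially impossible, so the fitted
# line must move to accommodate it; the T distribution expects occasional
# outliers and can leave the line where the bulk of the data is.
```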
In [29]:
Image('t-dist.png')

Out[29]:
In [30]:
with pm.Model() as model_robust:
    family = pm.glm.families.T()
    pm.glm.glm('y ~ x', data_out, family=family)

    trace_robust = pm.sample(2000, pm.NUTS(), progressbar=False)

In [31]:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, xlabel='Value of gold', ylabel='Value of gold miners', title='Posterior predictive regression lines')
ppl.scatter(ax, x_out, y_out)
glm.plot_posterior_predictive(trace_robust, samples=100,
label='posterior predictive regression lines')
ppl.plot(ax, x, true_regression_line,
label='true regression line', linewidth=5.)
plt.legend();
fig.savefig('ppc3.png')

In [32]:
Image('ppc3.png')

Out[32]:

• Pair trading is a well-known technique that plays two stocks against each other.
• For this to work, the stocks' prices must move together (be cointegrated).
• One common example is the price of gold (GLD) and the price of gold mining operations (GDX).
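A minimal sketch of the pair-trading idea on synthetic data (the series, hedge ratio, and thresholds are all made up, not real GLD/GDX prices): trade the z-scored spread between two cointegrated series.

```python
import numpy as np

rng = np.random.RandomState(0)

# Two synthetic cointegrated price series: gdx tracks gld plus stationary noise.
gld = 100 + np.cumsum(rng.normal(0, 1, 500))
gdx = 0.8 * gld + rng.normal(0, 2, 500)

spread = gdx - 0.8 * gld                              # stationary if cointegrated
zscore = (spread - spread.mean()) / spread.std()

# Signal: short the spread when it is stretched high, long when stretched low.
position = np.where(zscore > 1, -1, np.where(zscore < -1, 1, 0))
```

When the spread mean-reverts, the position profits; the Bayesian regression machinery above could estimate the hedge ratio (fixed at 0.8 here) with uncertainty.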
In [33]:
import zipline
import pytz
from datetime import datetime
fig = plt.figure(figsize=(8, 4))


(Figure: GLD price plot)