#!/usr/bin/env python
# coding: utf-8

# [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fonnesbeck/Bios8366/blob/master/notebooks/Section4_3-Introduction-to-PyMC.ipynb)
# 
# # Introduction to PyMC
# 
# Probabilistic programming (PP) allows flexible specification of Bayesian statistical models in code. PyMC is a new, open-source PP framework with an intuitive and readable, yet powerful, syntax that is close to the natural syntax statisticians use to describe models. It features next-generation Markov chain Monte Carlo (MCMC) sampling algorithms such as the No-U-Turn Sampler (NUTS; Hoffman, 2014), a self-tuning variant of Hamiltonian Monte Carlo (HMC; Duane, 1987). This class of samplers works well on high-dimensional and complex posterior distributions and allows many complex models to be fit without specialized knowledge about fitting algorithms. HMC and NUTS take advantage of gradient information from the likelihood to achieve much faster convergence than traditional sampling methods, especially for larger models. NUTS also has several self-tuning strategies for adaptively setting the tunable parameters of Hamiltonian Monte Carlo, which means you usually don't need to have specialized knowledge about how the algorithms work.
# 
# ### PyMC Features
# 
# Probabilistic programming in Python confers a number of advantages, including multi-platform compatibility, an expressive yet clean and readable syntax, easy integration with other scientific libraries, and extensibility via C, C++, Fortran or Cython. These features make it relatively straightforward to write and use custom statistical distributions, samplers and transformation functions, as required by Bayesian analysis.
# 
# PyMC's feature set helps to make Bayesian analysis as painless as possible. Here is a short list of some of its features:
# 
# - Fits Bayesian statistical models with Markov chain Monte Carlo, variational inference and other algorithms.
# - Includes a large suite of well-documented statistical distributions.
# - Leverages `arviz` for convergence diagnostics, model checking methods and creating summaries including tables and plots.
# - Extensible: easily incorporates custom step methods and unusual probability distributions.
# - Non-parametric Bayesian methods, including Gaussian processes and Dirichlet processes, are supported.
# 
# Here, we present a primer on the use of PyMC for solving general Bayesian statistical inference and prediction problems. We will first see the basics of how to use PyMC, motivated by a simple example: installation, data creation, model definition, model fitting and posterior analysis. Then we will cover two case studies and use them to show how to define and fit more sophisticated models. Finally we will show how to extend PyMC and discuss other useful features: the Generalized Linear Models subpackage, custom distributions, custom transformations and alternative storage backends.
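# Before diving into the case studies, here is a minimal sketch of the basic workflow just described (simulate or load data, define a model, fit it, examine the posterior). This toy example is illustrative only and is not part of the course material; all variable names are made up, and it assumes a working PyMC installation:
# 
# ```python
# import numpy as np
# import pymc as pm
# 
# # Simulate some data from a normal distribution with known parameters
# y = np.random.default_rng(42).normal(loc=1.0, scale=2.0, size=100)
# 
# with pm.Model() as toy_model:
#     mu = pm.Normal('mu', mu=0, sigma=10)               # prior for the mean
#     sigma = pm.HalfNormal('sigma', sigma=10)           # prior for the standard deviation
#     pm.Normal('obs', mu=mu, sigma=sigma, observed=y)   # likelihood of the observed data
#     idata = pm.sample(1000)                            # draw posterior samples with MCMC
# ```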
# In[ ]:

get_ipython().run_line_magic('load', 'https://raw.githubusercontent.com/fonnesbeck/Bios8366/master/data/melanoma_data.py')


# In[ ]:

get_ipython().run_line_magic('matplotlib', 'inline')
import seaborn as sns; sns.set_context('notebook')
import warnings
warnings.simplefilter('ignore', FutureWarning)

import pymc as pm

with pm.Model() as melanoma_survival:

    # Convert censoring indicators to indicators for failure event
    failure = (censored==0).astype(int)

    # Parameters (intercept and treatment effect) for survival rate
    beta = pm.Normal('beta', mu=0.0, sigma=1e5, shape=2)

    # Survival rates, as a function of treatment
    lam = pm.math.exp(beta[0] + beta[1]*treat)

    # Survival likelihood, accounting for censoring
    def logp(value, t, lam):
        return (value * pm.math.log(lam) - lam * t).sum()

    exp_surv = pm.DensityDist('exp_surv', t, lam, logp=logp, observed=failure)


# This example will generate 1000 posterior samples.

# In[ ]:

with melanoma_survival:
    trace = pm.sample(1000)


# In[ ]:

import arviz as az

az.plot_trace(trace, var_names=['beta']);


# In[ ]:

az.summary(trace, var_names=['beta'])


# ## Motivating Example: Coal mining disasters
# 
# Recall the time series of recorded coal mining disasters in the UK from 1851 to 1962 (Jarrett, 1979), introduced in a previous lecture, where the number of disasters is thought to have been affected by changes in safety regulations during this period.
# 
# Let's build a model for this series and attempt to estimate when the change occurred.

# In[ ]:

import numpy as np
import matplotlib.pyplot as plt

disasters_data = np.array([4, 5, 4, 0, 1, 4, 3, 4, 0, 6, 3, 3, 4, 0, 2, 6,
                           3, 3, 5, 4, 5, 3, 1, 4, 4, 1, 5, 5, 3, 4, 2, 5,
                           2, 2, 3, 4, 2, 1, 3, 2, 2, 1, 1, 1, 1, 3, 0, 0,
                           1, 0, 1, 1, 0, 0, 3, 1, 0, 3, 2, 2, 0, 1, 1, 1,
                           0, 1, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 1, 1, 0, 2,
                           3, 3, 1, 1, 2, 1, 1, 1, 1, 2, 4, 2, 0, 0, 1, 4,
                           0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
n_years = len(disasters_data)

plt.figure(figsize=(12.5, 3.5))
plt.bar(np.arange(1851, 1962), disasters_data, color="#348ABD")
plt.xlabel("Year")
plt.ylabel("Disasters")
plt.title("UK coal mining disasters, 1851-1962")
plt.xlim(1851, 1962);


# We represent our conceptual model formally as a statistical model:
# 
# $$\begin{array}{ccc}
# (y_t | \tau, \lambda_1, \lambda_2) \sim \text{Poisson}\left(r_t\right), & r_t=\left\{
# \begin{array}{lll}
# \lambda_1 &\text{if}& t \lt \tau \\
# \lambda_2 &\text{if}& t \ge \tau
# \end{array}\right.,&t\in[t_l,t_h] \\
# \tau \sim \text{DiscreteUniform}(t_l, t_h) \\
# \lambda_1\sim \text{Exponential}(a) \\
# \lambda_2\sim \text{Exponential}(b)
# \end{array}$$
# 
# Because we have defined $y$ by its dependence on $\tau$, $\lambda_1$ and $\lambda_2$, the latter three are known as the *parents* of $y$, and $y$ is called their *child*. Similarly, the parents of $\tau$ are $t_l$ and $t_h$, and $\tau$ is the child of $t_l$ and $t_h$.
# 
# ## Implementing a PyMC Model
# 
# At the model-specification stage (before the data are observed), $y$, $\tau$, $\lambda_1$, and $\lambda_2$ are all random variables. Bayesian "random" variables have not necessarily arisen from a physical random process. The Bayesian interpretation of probability is **epistemic**, meaning random variable $x$'s probability distribution $p(x)$ represents our knowledge and uncertainty about $x$'s value. Candidate values of $x$ for which $p(x)$ is high are relatively more probable, given what we know.
# 
# We can generally divide the variables in a Bayesian model into two types: **stochastic** and **deterministic**.
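# Before classifying the variables in this model as stochastic or deterministic, it may help to see the generative process written out directly. The following is a minimal NumPy sketch (not part of the PyMC implementation below) that draws one simulated disaster series from the model above, using prior parameters $a = b = 1$:
# 
# ```python
# rng = np.random.default_rng(1)
# 
# t_l, t_h = 0, n_years                               # years indexed from 1851
# tau = rng.integers(t_l, t_h, endpoint=True)         # draw from DiscreteUniform(t_l, t_h)
# lambda_1 = rng.exponential(1.0)                     # draws from the Exponential(1) priors
# lambda_2 = rng.exponential(1.0)
# r = np.where(np.arange(n_years) < tau, lambda_1, lambda_2)  # piecewise rate
# y_sim = rng.poisson(r)                              # simulated disaster counts
# ```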
# The only deterministic variable in this model is $r$. If we knew the values of $r$'s parents, we could compute the value of $r$ exactly. A deterministic variable like $r$ is defined by a mathematical function that returns its value given values for its parents. Deterministic variables are sometimes called the *systemic* part of the model. The nomenclature is a bit confusing, because these objects usually represent random variables; since the parents of $r$ are random, $r$ is random also.
# 
# On the other hand, even if the values of the parents of the variables `switchpoint`, `disasters` (before observing the data), `early_mean` or `late_mean` were known, we would still be uncertain of their values. These variables are stochastic, characterized by probability distributions that express how plausible their candidate values are, given values for their parents.
# 
# Let's begin by defining the unknown switchpoint as a discrete uniform random variable:

# In[ ]:

with pm.Model() as disaster_model:

    switchpoint = pm.DiscreteUniform('switchpoint', lower=0, upper=n_years)


# We have done two things here. First, we have created a `Model` object; a `Model` is a Python object that encapsulates all of the variables that comprise our theoretical model, keeping them in a single container so that they may be used as a unit. After a `Model` is created, we can populate it with all of the model components that we specified when we wrote the model down.
# 
# Notice that the `Model` above was declared using a `with` statement. This statement invokes a Python idiom known as a **context manager**. Context managers, in general, are used to manage resources of some kind within a program. In this case, our resource is a `Model`, and we would like to add variables to it so that we can fit our statistical model. The key characteristic of the context manager is that the resources it manages are only defined within the indented block corresponding to the `with` statement. PyMC uses this idiom to automatically add defined variables to a model. Thus, any variable we define inside the block is automatically added to the `Model`, without having to explicitly add it. This avoids the repetitive syntax of `add` methods/functions that you see in some machine learning packages:
# 
# ```python
# model.add(a_variable)
# model.add(another_variable)
# model.add(yet_another_variable)
# model.add(and_again)
# model.add(please_kill_me_now)
# ...
# ```
# 
# In fact, PyMC variables cannot be defined without a corresponding `Model`:

# In[ ]:

foo = pm.DiscreteUniform('foo', lower=0, upper=10)


# However, variables can be explicitly added to models without the use of a context manager, via the variable's optional `model` argument.
# 
# ```python
# disaster_model = Model()
# 
# switchpoint = DiscreteUniform('switchpoint', lower=0, upper=110, model=disaster_model)
# ```
# 
# Or, if we just want a discrete uniform distribution and do not necessarily need to use it in a PyMC model, we can create one using the `dist` classmethod.

# In[ ]:

x = pm.DiscreteUniform.dist(lower=0, upper=100)

type(x)


# `DiscreteUniform` is an object that represents uniformly-distributed discrete variables. Use of this distribution suggests that we have no preference *a priori* regarding the location of the switchpoint; all values are equally likely.

# In[ ]:

switchpoint


# PyMC includes most of the common random variable **distributions** used for statistical modeling.
# In[ ]:

pm.distributions.__all__


# By having a library of variables that represent statistical distributions, users are relieved of having to code distributions themselves.
# 
# Similarly, we can create the exponentially-distributed variables `early_mean` and `late_mean` for the early and late Poisson rates, respectively (also in the context of the model `disaster_model`):

# In[ ]:

with disaster_model:

    early_mean = pm.Exponential('early_mean', 1)
    late_mean = pm.Exponential('late_mean', 1)


# Notice that we have "re-opened" the `disaster_model` in a new context manager, which allows us to add additional variables.
# 
# Next, we define the variable `rate`, which selects the early rate `early_mean` for times before `switchpoint` and the late rate `late_mean` for times after `switchpoint`. We create `rate` using the `switch` function, which returns `early_mean` when the switchpoint is larger than (or equal to) a particular year, and `late_mean` otherwise.

# In[ ]:

with disaster_model:

    rate = pm.math.switch(switchpoint >= np.arange(n_years), early_mean, late_mean)


# In[ ]:

rate


# The last step is to define the **data likelihood**, or sampling distribution. In this case, our measured outcome is the number of disasters in each year, `disasters`. This is a stochastic variable, but unlike `early_mean` and `late_mean` we have *observed* its value. To express this, we set the argument `observed` to the observed sequence of disasters. This tells PyMC that this distribution's value is fixed, and should not be changed:

# In[ ]:

with disaster_model:

    disasters = pm.Poisson('disasters', mu=rate, observed=disasters_data)


# The model that we specified at the top of the page has now been fully implemented in PyMC. Let's have a look at the model's attributes to see what we have.
# 
# The stochastic nodes in the model are identified in the `unobserved_RVs` attribute:

# In[ ]:

disaster_model.unobserved_RVs


# The last two variables are the log-transformed versions of the early and late rate parameters.
# 
# What about `rate`, which is a deterministic component of the model? In this model, `rate` has not been given a name or a formal PyMC data structure. It is essentially an **intermediate calculation** in the model, implying that we are not interested in its value when it comes to summarizing the output from the model. Most PyMC objects have a name assigned; these names are used for storage and post-processing:
# 
# - as keys in output databases,
# - as axis labels in plots of traces,
# - as table labels in summary statistics.
# 
# If we wish to include `rate` in our output, we need to make it a `Deterministic` object, and give it a name:

# In[ ]:

with pm.Model() as disaster_model:

    early_mean = pm.Exponential('early_mean', 1)
    late_mean = pm.Exponential('late_mean', 1)

    switchpoint = pm.DiscreteUniform('switchpoint', lower=0, upper=n_years)

    rate = pm.Deterministic('rate',
                pm.math.switch(switchpoint >= np.arange(n_years), early_mean, late_mean))

    disasters = pm.Poisson('disasters', mu=rate, observed=disasters_data)


# Now, `rate` is included in the `Model`'s deterministics list, and its samples will be retained during MCMC sampling.

# In[ ]:

disaster_model.deterministics


# We can visualize the directed acyclic graph (DAG) of our model using GraphViz (you will need to have the `python-graphviz` package installed in your environment).

# In[ ]:

pm.model_to_graphviz(disaster_model)
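# With the model fully specified, we can also simulate data from it before ever conditioning on the observations. As a minimal sketch (using the `disaster_model` defined above and PyMC's `sample_prior_predictive` function), we can draw disaster series from the prior predictive distribution:
# 
# ```python
# with disaster_model:
#     prior_idata = pm.sample_prior_predictive(500)
# 
# # Simulated disaster counts, one series per prior draw
# prior_idata.prior_predictive['disasters']
# ```
# 
# This generative view of the model is helpful for understanding the question addressed next.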
# > ### Why are data and unknown variables represented by the same object?
# 
# > Since it is represented by a PyMC random variable object, `disasters` is defined by its dependence on its parent `rate` even though its value is **fixed**. This isn't just a quirk of PyMC's syntax; Bayesian hierarchical notation itself makes no distinction between random variables and data. The reason is simple: to use Bayes' theorem to compute the posterior, we require the likelihood. Even though `disasters`'s value is known and fixed, we need to formally assign it a *probability distribution* as if it were a random variable. Remember, the likelihood and the probability function are essentially the same, except that the former is regarded as a function of the parameters and the latter as a function of the data. This point can be counterintuitive at first, as many people's instinct is to regard data as fixed a priori and unknown variables as dependent on the data.
# 
# > One way to understand this is to think of statistical models as predictive models for data, or as models of the processes that gave rise to data. Before observing the value of `disasters`, we could have sampled from its prior predictive distribution $p(y)$ (*i.e.* the marginal distribution of the data) as follows:
# 
# > - Sample `early_mean`, `switchpoint` and `late_mean` from their priors.
# > - Sample `disasters` conditional on these values.
# 
# > Even after we observe the value of `disasters`, we need to use this process model to make inferences about `early_mean`, `switchpoint` and `late_mean`, because it is the only information we have about how the variables are related.
# 
# > We will see later that we can sample from this fixed stochastic random variable, to obtain predictions after having observed our data.

# ## PyMC Variables
# 
# Each of the built-in statistical distributions is a subclass of the generic `Distribution` class in PyMC. A `Distribution` carries relevant **attributes** about the probability distribution, such as the data type (`dtype`), shape (`shape`, see below), and the ability to evaluate or draw values (*e.g.* via `eval()`).

# In[ ]:

disasters.dtype


# In[ ]:

early_mean.eval()


# PyMC's built-in distribution variables can also be used to generate **random values** from that variable. For example, the `switchpoint`, which is a discrete uniform random variable, can generate random draws:

# In[ ]:

pm.draw(switchpoint, draws=10)


# As we noted earlier, some variables have undergone **transformations** prior to sampling. Such variables will have `transform` attributes that specify which transformation was used.

# In[ ]:

early_mean.tag.value_var.tag.transform


# As with all Python objects, the underlying type of a variable can be exposed with the `type()` function:

# In[ ]:

type(switchpoint)


# as well as an associated `dtype` attribute:

# In[ ]:

switchpoint.dtype


# We will learn more about these types in an upcoming section.

# ## Variable log-probabilities
# 
# All PyMC stochastic variables can evaluate their probability mass or density functions at a particular value, given the values of their parents. The **logarithm** of a stochastic object's probability mass or density can be accessed via the `logp` function.

# In[ ]:

pm.logp(switchpoint, 42).eval()


# Note that the returned value is a `TensorVariable`, so it must be evaluated.
# 
# For **vector-valued** variables like `disasters`, the `logp` function returns an array of the logarithms of the joint probability or density of all elements of the value.
# In[ ]:

pm.logp(disasters, disasters_data).eval()


# ### Custom variables
# 
# Though we created the variables in `disaster_model` using well-known probability distributions that are available in PyMC, it's possible to create custom distributions by **wrapping** functions that compute an arbitrary log-probability using the `DensityDist` function. For example, our initial example showed an exponential survival function, which accounts for censored data. If we pass this function as the `logp` argument for `DensityDist`, we can use it as the data likelihood in a survival model:
# 
# ```python
# def logp(value, t, lam):
#     return (value * pm.math.log(lam) - lam * t).sum()
# 
# exp_surv = pm.DensityDist('exp_surv', t, lam, logp=logp, observed=failure)
# ```
# 
# Users are thus not limited to the set of statistical distributions provided by PyMC.

# ## Fitting the model with MCMC
# 
# PyMC's `sample` function will fit probability models (linked collections of variables) like ours using Markov chain Monte Carlo (MCMC) sampling. Unless we manually assign particular algorithms to variables in our model, PyMC will assign algorithms that it deems appropriate (it usually does a decent job of this):

# In[ ]:

with disaster_model:
    trace = pm.sample(1000, tune=1000)


# This returns the Markov chain of draws, conventionally called a "trace", from the model in an `InferenceData` object. This object is from the `arviz` library that we have been using for plotting, and is an extension of the [xarray](https://xarray.pydata.org/en/stable/) data structure.

# In[ ]:

trace


# The first argument to `sample()` is `draws`, which specifies how many samples to draw. However, there are a number of additional optional arguments that are worth knowing about:

# In[ ]:

help(pm.sample)


# The `step` argument is what allows users to manually override the sampling algorithms used to fit the model. For example, if we wanted to use a **slice sampler** to sample the `early_mean` and `late_mean` variables, we could specify it:

# In[ ]:

with disaster_model:
    trace = pm.sample(1000, step=pm.Slice(vars=[early_mean, late_mean]))


# ### Accessing the samples
# 
# The output of the `sample` function is an `InferenceData` object, which stores the sequence of samples for each variable in the model. Since there are a variety of outputs that may be derived from a PyMC model, our samples are stored in the `posterior` group of the trace.

# In[ ]:

trace.posterior['late_mean']


# The returned array has dimensions chain by draw, in this case 4 by 1000 (the default number of chains corresponds to the number of cores on your particular machine).
# 
# The trace can also be sliced positionally, like a NumPy array; here we take the last five draws from each chain:

# In[ ]:

trace.posterior['late_mean'][:, -5:]


# If selections of chains and draws are required, they can be specified using the `sel` method:

# In[ ]:

trace.posterior['late_mean'].sel(chain=1, draw=slice(0, 5))


# ### Sampling output
# 
# You can examine the marginal posterior of any variable by plotting a histogram of its trace:

# In[ ]:

sns.histplot(trace.posterior['late_mean'].sel(chain=0))


# However, we recommend using the `arviz` library, which is dedicated to plotting MCMC output.
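# Because the posterior samples are stored in an xarray `Dataset`, it is also straightforward to compute quantities of interest directly from them. As a minimal sketch (using the `trace` object obtained above), the posterior probability that the early disaster rate exceeds the late rate can be estimated by the proportion of posterior draws in which it does:
# 
# ```python
# post = trace.posterior
# 
# # Monte Carlo estimate of P(early_mean > late_mean | data), averaged over chains and draws
# p_decline = float((post['early_mean'] > post['late_mean']).mean())
# p_decline
# ```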
# With `arviz`, for example, we can obtain a time series plot of the trace and a histogram using `plot_trace`:

# In[ ]:

az.plot_trace(trace, var_names=['early_mean', 'late_mean', 'switchpoint'], combined=True,
              backend_kwargs=dict(constrained_layout=True));


# Notice that the `combined=True` argument concatenates all of the chains into a single array for plotting. Additionally, the `backend_kwargs` keyword argument allows us to pass additional arguments to the plotting backend, in this case `matplotlib`.
# 
# The right-hand pane of each figure shows the temporal series of the samples from each parameter, while the left-hand pane shows a kernel density estimate (continuous variables) or histogram (discrete variables) of the trace. The trace is useful for evaluating and diagnosing the algorithm's performance, whereas the KDE/histogram is useful for visualizing the posterior.
# 
# For a non-graphical summary of the posterior, the `summary` function is available.

# In[ ]:

az.summary(trace, var_names=['early_mean', 'late_mean'])


# ### Exercise
# 
# Experiment with an alternative parameterization of the coal mining disasters model. For example, you could try a model where the `switchpoint` is a continuous variable, or a model where `early_mean` and `late_mean` have different priors.

# In[ ]:

# Write your answer here


# ---
# 
# ## References
# 
# 1. Hoffman MD, Gelman A. [The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo](https://arxiv.org/abs/1111.4246). The Journal of Machine Learning Research. 2014;15(1):1593-1623.
# 2. Salvatier J, Wiecki TV, Fonnesbeck C. [Probabilistic programming in Python using PyMC3](https://peerj.com/articles/cs-55/). PeerJ Comput Sci. 2016;2(2):e55. doi:10.7717/peerj-cs.55.