#!/usr/bin/env python
# coding: utf-8

# > This is one of the 100 recipes of the [IPython Cookbook](http://ipython-books.github.io/), the definitive guide to high-performance scientific computing and data science in Python.
# 

# # 7.7. Fitting a Bayesian model by sampling from a posterior distribution with a Markov Chain Monte Carlo method

# You can find the instructions to install PyMC on the package's website. (http://pymc-devs.github.io/pymc/)
# 
# You also need the *Storms* dataset from the book's website. (http://ipython-books.github.io)

# 1. Let's import the standard packages and PyMC.

# In[ ]:


import numpy as np
import pandas as pd
import pymc
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')


# 2. Let's import the data with Pandas.

# In[ ]:


# http://www.ncdc.noaa.gov/ibtracs/index.php?name=wmo-data
df = pd.read_csv("data/Allstorms.ibtracs_wmo.v03r05.csv",
                 delim_whitespace=False)


# 3. With Pandas, it only takes a single line of code to get the annual number of storms in the North Atlantic Ocean. We first select the storms in that basin (`NA`), then we group the rows by year (`Season`), and we take the number of unique storm (`Serial_Num`) as each storm can span several days (`nunique` method).

# In[ ]:


cnt = df[df['Basin'] == ' NA'].groupby('Season') \
      ['Serial_Num'].nunique()
years = cnt.index
y0, y1 = years[0], years[-1]
arr = cnt.values
plt.figure(figsize=(8,4));
plt.plot(years, arr, '-ok');
plt.xlim(y0, y1);
plt.xlabel("Year");
plt.ylabel("Number of storms");


# 4. Now, we define our probabilistic model. We assume that storms arise following a time-dependent Poisson process with a deterministic rate. We assume this rate is a piecewise-constant function that takes a first value `early_mean` before a certain switch point, and a second value `late_mean` after that point. These three unknown parameters are treated as random variables (we will describe them more in the *How it works...* section).

# A [Poisson process](http://en.wikipedia.org/wiki/Poisson_process) is a particular **point process**, that is, a stochastic process describing the random occurence of instantaneous events. The Poisson process is fully random: the events occur independently at a given rate.

# In[ ]:


switchpoint = pymc.DiscreteUniform('switchpoint',
                                   lower=0, upper=len(arr))
early_mean = pymc.Exponential('early_mean', beta=1)
late_mean = pymc.Exponential('late_mean', beta=1)


# 5. We define the piecewise-constant rate as a Python function.

# In[ ]:


@pymc.deterministic(plot=False)
def rate(s=switchpoint, e=early_mean, l=late_mean):
    out = np.empty(len(arr))
    out[:s] = e
    out[s:] = l
    return out


# 6. Finally, the observed variable is the annual number of storms. It follows a Poisson variable with a random mean (the rate of the underlying Poisson process). This fact is a known mathematical property of Poisson processes.

# In[ ]:


storms = pymc.Poisson('storms', mu=rate, value=arr, observed=True)


# 7. Now, we use the MCMC method to sample from the posterior distribution, given the observed data. The `sample` method launches the fitting iterative procedure.

# In[ ]:


model = pymc.Model([switchpoint, early_mean, late_mean, rate, storms])


# In[ ]:


mcmc = pymc.MCMC(model)
mcmc.sample(iter=10000, burn=1000, thin=10)


# 8. Let's plot the sampled Markov chains. Their stationary distribution corresponds to the posterior distribution we want to characterize.

# In[ ]:


plt.figure(figsize=(8,8))
plt.subplot(311);
plt.plot(mcmc.trace('switchpoint')[:]);
plt.ylabel("Switch point"); 
plt.subplot(312);
plt.plot(mcmc.trace('early_mean')[:]);
plt.ylabel("Early mean");
plt.subplot(313);
plt.plot(mcmc.trace('late_mean')[:]);
plt.xlabel("Iteration");
plt.ylabel("Late mean");


# 9. We also plot the distribution of the samples: they correspond to the posterior distributions of our parameters, after the data points have been taken into account.

# In[ ]:


plt.figure(figsize=(14,3))
plt.subplot(131);
plt.hist(mcmc.trace('switchpoint')[:] + y0, 15);
plt.xlabel("Switch point")
plt.ylabel("Distribution")
plt.subplot(132);
plt.hist(mcmc.trace('early_mean')[:], 15);
plt.xlabel("Early mean");
plt.subplot(133);
plt.hist(mcmc.trace('late_mean')[:], 15);
plt.xlabel("Late mean");


# 10. Taking the sample mean of these distributions, we get posterior estimates for the three unknown parameters, including the year where the frequency of storms suddenly increased.

# In[ ]:


yp = y0 + mcmc.trace('switchpoint')[:].mean()
em = mcmc.trace('early_mean')[:].mean()
lm = mcmc.trace('late_mean')[:].mean()
print((yp, em, lm))


# 11. Now we can plot the estimated rate on top of the observations.

# In[ ]:


plt.figure(figsize=(8,4));
plt.plot(years, arr, '-ok');
plt.axvline(yp, color='k', ls='--');
plt.plot([y0, yp], [em, em], '-b', lw=3);
plt.plot([yp, y1], [lm, lm], '-r', lw=3);
plt.xlim(y0, y1);
plt.xlabel("Year");
plt.ylabel("Number of storms");


# For a possible scientific interpretation of the data considered here, see http://www.gfdl.noaa.gov/global-warming-and-hurricanes.

# > You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).
# 
# > [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages).