#!/usr/bin/env python # coding: utf-8 # > This is one of the 100 recipes of the [IPython Cookbook](http://ipython-books.github.io/), the definitive guide to high-performance scientific computing and data science in Python. # # # 7.7. Fitting a Bayesian model by sampling from a posterior distribution with a Markov Chain Monte Carlo method # You can find the instructions to install PyMC on the package's website. (http://pymc-devs.github.io/pymc/) # # You also need the *Storms* dataset from the book's website. (http://ipython-books.github.io) # 1. Let's import the standard packages and PyMC. # In[ ]: import numpy as np import pandas as pd import pymc import matplotlib.pyplot as plt get_ipython().run_line_magic('matplotlib', 'inline') # 2. Let's import the data with Pandas. # In[ ]: # http://www.ncdc.noaa.gov/ibtracs/index.php?name=wmo-data df = pd.read_csv("data/Allstorms.ibtracs_wmo.v03r05.csv", delim_whitespace=False) # 3. With Pandas, it only takes a single line of code to get the annual number of storms in the North Atlantic Ocean. We first select the storms in that basin (`NA`), then we group the rows by year (`Season`), and we take the number of unique storm (`Serial_Num`) as each storm can span several days (`nunique` method). # In[ ]: cnt = df[df['Basin'] == ' NA'].groupby('Season') \ ['Serial_Num'].nunique() years = cnt.index y0, y1 = years[0], years[-1] arr = cnt.values plt.figure(figsize=(8,4)); plt.plot(years, arr, '-ok'); plt.xlim(y0, y1); plt.xlabel("Year"); plt.ylabel("Number of storms"); # 4. Now, we define our probabilistic model. We assume that storms arise following a time-dependent Poisson process with a deterministic rate. We assume this rate is a piecewise-constant function that takes a first value `early_mean` before a certain switch point, and a second value `late_mean` after that point. These three unknown parameters are treated as random variables (we will describe them more in the *How it works...* section). # A [Poisson process](http://en.wikipedia.org/wiki/Poisson_process) is a particular **point process**, that is, a stochastic process describing the random occurence of instantaneous events. The Poisson process is fully random: the events occur independently at a given rate. # In[ ]: switchpoint = pymc.DiscreteUniform('switchpoint', lower=0, upper=len(arr)) early_mean = pymc.Exponential('early_mean', beta=1) late_mean = pymc.Exponential('late_mean', beta=1) # 5. We define the piecewise-constant rate as a Python function. # In[ ]: @pymc.deterministic(plot=False) def rate(s=switchpoint, e=early_mean, l=late_mean): out = np.empty(len(arr)) out[:s] = e out[s:] = l return out # 6. Finally, the observed variable is the annual number of storms. It follows a Poisson variable with a random mean (the rate of the underlying Poisson process). This fact is a known mathematical property of Poisson processes. # In[ ]: storms = pymc.Poisson('storms', mu=rate, value=arr, observed=True) # 7. Now, we use the MCMC method to sample from the posterior distribution, given the observed data. The `sample` method launches the fitting iterative procedure. # In[ ]: model = pymc.Model([switchpoint, early_mean, late_mean, rate, storms]) # In[ ]: mcmc = pymc.MCMC(model) mcmc.sample(iter=10000, burn=1000, thin=10) # 8. Let's plot the sampled Markov chains. Their stationary distribution corresponds to the posterior distribution we want to characterize. # In[ ]: plt.figure(figsize=(8,8)) plt.subplot(311); plt.plot(mcmc.trace('switchpoint')[:]); plt.ylabel("Switch point"); plt.subplot(312); plt.plot(mcmc.trace('early_mean')[:]); plt.ylabel("Early mean"); plt.subplot(313); plt.plot(mcmc.trace('late_mean')[:]); plt.xlabel("Iteration"); plt.ylabel("Late mean"); # 9. We also plot the distribution of the samples: they correspond to the posterior distributions of our parameters, after the data points have been taken into account. # In[ ]: plt.figure(figsize=(14,3)) plt.subplot(131); plt.hist(mcmc.trace('switchpoint')[:] + y0, 15); plt.xlabel("Switch point") plt.ylabel("Distribution") plt.subplot(132); plt.hist(mcmc.trace('early_mean')[:], 15); plt.xlabel("Early mean"); plt.subplot(133); plt.hist(mcmc.trace('late_mean')[:], 15); plt.xlabel("Late mean"); # 10. Taking the sample mean of these distributions, we get posterior estimates for the three unknown parameters, including the year where the frequency of storms suddenly increased. # In[ ]: yp = y0 + mcmc.trace('switchpoint')[:].mean() em = mcmc.trace('early_mean')[:].mean() lm = mcmc.trace('late_mean')[:].mean() print((yp, em, lm)) # 11. Now we can plot the estimated rate on top of the observations. # In[ ]: plt.figure(figsize=(8,4)); plt.plot(years, arr, '-ok'); plt.axvline(yp, color='k', ls='--'); plt.plot([y0, yp], [em, em], '-b', lw=3); plt.plot([yp, y1], [lm, lm], '-r', lw=3); plt.xlim(y0, y1); plt.xlabel("Year"); plt.ylabel("Number of storms"); # For a possible scientific interpretation of the data considered here, see http://www.gfdl.noaa.gov/global-warming-and-hurricanes. # > You'll find all the explanations, figures, references, and much more in the book (to be released later this summer). # # > [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages).