In this notebook we're going to give a few examples of statistical data modelling: first using scipy to fit data to a particular distribution, then using statsmodels to model the relationship between a dependent variable and an independent variable (linear regression).
This notebook's examples are mostly borrowed from the series of excellent notebooks written by Chris Fonnesbeck and available here.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy import stats
# type stats.distributions.<TAB> in the notebook to list the available distributions
# create a (continuous) random variable with a normal distribution
Y = stats.distributions.norm()
x = np.linspace(-5,5,100)
fig, axes = plt.subplots(3,1, sharex=True, figsize=(10,10))
# plot the probability density function (PDF)
axes[0].plot(x, Y.pdf(x))
axes[0].set_title('PDF')
# plot the cumulative distribution function (CDF)
axes[1].plot(x, Y.cdf(x));
axes[1].set_title('CDF')
# plot histogram of 1000 random realizations of the variable Y ~ N(0,1)
axes[2].hist(Y.rvs(size=1000), bins=20);
axes[2].set_title('Random Variable');
A recurring statistical problem is finding estimates of the relevant parameters that correspond to the distribution that best represents our data. In parametric inference, we specify a priori a suitable distribution, then choose the parameters that best fit the data.
We're going to see two approaches: the *method of moments* and *maximum likelihood estimation (MLE)*.

A real-life example: Monthly rainfall amounts at Auckland Airport station
import pandas as pd
data = pd.read_excel('./data/AKL_aero_rain_monthly.xlsx', sheetname='AKL',index_col=['date'], parse_dates=True)
data.head()
| date | Station | rain |
|---|---|---|
| 1963-01-01 | Auckland Aero | 58.3 |
| 1963-02-01 | Auckland Aero | 93.1 |
| 1963-03-01 | Auckland Aero | 105.8 |
| 1963-04-01 | Auckland Aero | 72.7 |
| 1963-05-01 | Auckland Aero | 68.0 |

5 rows × 2 columns
data['rain'].hist(normed=True, bins=20, grid=False)
There are a few possible choices, but one suitable candidate is the gamma distribution. There are three different parametrizations in common use; we're going to use the one below, with a shape parameter $\alpha$ and a scale parameter $\beta$ (the inverse of the scale, $1/\beta$, is called the rate parameter):

$$f(x \mid \alpha, \beta) = \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)}$$
The *method of moments* simply equates the empirical mean and variance to their theoretical counterparts, so that we can solve for the parameters.
So, for the gamma distribution, it turns out (see the relevant section in the wikipedia article) that the mean and variance are:

$$\mu = \alpha\beta, \qquad \sigma^2 = \alpha\beta^2$$
So, if we solve for $\alpha$ and $\beta$ using the sample mean ($\bar{X}$) and variance ($S^2$), we can use a gamma distribution to describe our data:

$$\hat{\alpha} = \frac{\bar{X}^2}{S^2}, \qquad \hat{\beta} = \frac{S^2}{\bar{X}}$$
first step: we calculate the sample mean and variance, using pandas convenience methods
precip_mean = data['rain'].mean() # sample mean
precip_var = data['rain'].var() # sample variance
second step: we calculate the parameters $\alpha$ and $\beta$ using the relations above
alpha_mom = precip_mean ** 2 / precip_var  # alpha = mean**2 / variance
beta_mom = precip_var / precip_mean  # beta (the scale) = variance / mean
from scipy.stats.distributions import gamma
The implementation of the gamma distribution in scipy takes an additional *location* parameter, which we leave at zero; the third (scale) parameter corresponds to our $\beta$:
gmom = gamma(alpha_mom, 0, beta_mom)
#or gmom = gamma(alpha_mom, scale=beta_mom)
plt.hist(data['rain'], normed=True, bins=20)
plt.plot(np.linspace(0, 300, 100), gmom.pdf(np.linspace(0, 300, 100)), lw=3)
plt.grid(False)
shape,loc,scale = gamma.fit(data['rain'].values)
shape,loc,scale
(5.4226744288635231, -23.424708406178365, 21.440307265117291)
alpha_mom, beta_mom
(3.5688298364400119, 26.013875195669325)
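Note that gamma.fit also estimates a location parameter (negative here), so the MLE shape and scale are not directly comparable to the moment estimates above.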
gmle = gamma(shape,loc,scale)
# type gmle.<TAB> to inspect the methods of the frozen distribution (pdf, cdf, rvs, ...)
f, (ax0, ax1) = plt.subplots(1,2,figsize=(13,5))
ax0.hist(data['rain'], normed=True, bins=20, color='0.6')
ax0.plot(np.linspace(0, 300, 100), gmle.pdf(np.linspace(0, 300, 100)), 'b-', lw=2, label='MLE')
ax0.plot(np.linspace(0, 300, 100), gmom.pdf(np.linspace(0, 300, 100)), 'r-', lw=2, label='Moments')
ax0.legend()
ax0.set_title('Histogram and fitted PDFs')
ax1.plot(np.linspace(0, 300, 100), gmle.cdf(np.linspace(0, 300, 100)), 'b-', lw=2, label='MLE')
ax1.plot(np.linspace(0, 300, 100), gmom.cdf(np.linspace(0, 300, 100)), 'r-', lw=2, label='Moments')
ax1.legend()
ax1.set_title('CDFs');
Goodness-of-fit tests are available in scipy.stats, e.g.:

- scipy.stats.kstest
- scipy.stats.anderson

from scipy.stats import kstest
kstest?
The null hypothesis is that the data follow the specified distribution; you reject it if the p-value is less than your chosen significance level.
kstest(data['rain'].values, 'gamma', (shape, loc, scale))
(0.042221063815765791, 0.23225004179264119)
kstest(data['rain'].values, 'gamma', (alpha_mom, 0, beta_mom))
(0.053854461458692715, 0.060756371540928722)
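Both p-values are above 0.05, so at the 5% level we cannot reject the hypothesis that the rainfall data come from the fitted gamma distributions, although the method-of-moments fit (p ≈ 0.06) is borderline.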
Here is some material to work through on your own, coming again from the excellent series of notebooks created by Christopher Fonnesbeck and available at his github repo: you're going to see how the Maximum Likelihood Estimation (MLE) method works from first principles, using the optimization routines in scipy.optimize.
Remember that the formula for the gamma distribution PDF, written here in terms of the rate $\beta$ (the inverse of the scale used earlier), is:

$$f(x \mid \alpha, \beta) = \frac{\beta^{\alpha} x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)}$$
Going back to the rainfall data, if we are using a gamma distribution we need to maximize:
$$\begin{align}l(\alpha,\beta) &= \sum_{i=1}^n \log[\beta^{\alpha} x_i^{\alpha-1} e^{-x_i\beta}\Gamma(\alpha)^{-1}] \cr &= n[(\alpha-1)\overline{\log(x)} - \bar{x}\beta + \alpha\log(\beta) - \log\Gamma(\alpha)]\end{align}$$

(It's usually easier to work on the log scale.)
where $n = 596$ (some values are missing in the time-series) and the bar indicates an average over all i. We choose $\alpha$ and $\beta$ to maximize $l(\alpha,\beta)$.
To find the maximum of any function, we typically take the derivative with respect to the variable to be maximized, set it to zero and solve for that variable.
$$\frac{\partial l(\alpha,\beta)}{\partial \beta} = n\left(\frac{\alpha}{\beta} - \bar{x}\right) = 0$$

which can be solved as $\beta = \alpha/\bar{x}$. However, plugging this into the derivative with respect to $\alpha$ yields:

$$\frac{\partial l(\alpha,\beta)}{\partial \alpha} = \log(\alpha) + \overline{\log(x)} - \log(\bar{x}) - \frac{\Gamma(\alpha)'}{\Gamma(\alpha)} = 0$$

This has no closed-form solution. We must use *numerical optimization*!
Numerical optimization algorithms take an initial "guess" at the solution and iteratively improve it until it gets "close enough" to the answer.
Here, we will use the Newton-Raphson algorithm:

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

which is available in scipy.optimize as newton:
from scipy.optimize import newton
Below is a simple illustration of how the Newton-Raphson algorithm works, using an arbitrary function:
# some function
func = lambda x: 3./(1 + 400*np.exp(-2*x)) - 1
xvals = np.linspace(0, 6)
# plot
f, ax = plt.subplots(figsize=(8,8))
ax.plot(xvals, func(xvals), lw=2)
ax.text(5.3, 2.1, '$f(x)$', fontsize=18)
# zero line
ax.plot([0,6], [0,0], 'k-', lw=2)
# value at step n
ax.plot([4,4], [0,func(4)], 'k--', lw=2)
ax.text(4, -.2, '$x_n$', fontsize=18)
# tangent line
tanline = lambda x: -0.858 + 0.626*x
ax.plot(xvals, tanline(xvals), 'r--', lw=2)
# point at step n+1
xprime = 0.858/0.626
ax.plot([xprime, xprime], [tanline(xprime), func(xprime)], 'k--', lw=2)
ax.text(xprime+.1, -.2, '$x_{n+1}$', fontsize=18);
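Here is a minimal sketch of one way to apply newton to the gamma problem above (a possible solution to the exercise, not the only one; the starting guess of 4 is an arbitrary choice, and scipy.special.psi provides the digamma function $\Gamma'(\alpha)/\Gamma(\alpha)$):

from scipy.special import psi  # digamma function: Gamma'(a) / Gamma(a)
# sample statistics of the rainfall data (dropna guards against missing values)
rain = data['rain'].dropna().values
log_mean = np.log(rain).mean()  # average of log(x)
mean_log = np.log(rain.mean())  # log of the average
# derivative of the log-likelihood with respect to alpha,
# after substituting beta = alpha / xbar (see above)
dl_dalpha = lambda a: np.log(a) + log_mean - mean_log - psi(a)
alpha_mle = newton(dl_dalpha, 4)    # arbitrary starting guess
beta_mle = alpha_mle / rain.mean()  # the rate; scipy's scale is 1 / beta_mle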
In some instances, we may not be interested in the parameters of a particular distribution of data, but just a smoothed representation of the data. In this case, we can estimate the distribution non-parametrically (i.e. making no assumptions about the form of the underlying distribution) using Kernel Density Estimation.
If you are interested in an excellent discussion of the various options in Python for performing kernel density estimation, have a look at this post on Jake VanderPlas' blog, Pythonic Perambulations. More material on KDE is available here, from Michael Lerner.

In a nutshell, (Gaussian) kernel density estimation works 'simply' by summing up Gaussian functions (PDFs) centered on each data point. Each Gaussian function is characterized by a location parameter (the value of the data point) and a scale parameter ($\sigma$), which is related to the 'bandwidth' of your (Gaussian) kernel.
# Some random data
y = np.random.random(15) * 10
y
array([ 4.57958298, 7.2015776 , 7.49649344, 9.41071943, 1.34311129, 5.83075621, 5.01346904, 9.57097713, 0.14417557, 4.52080071, 1.59014109, 4.84566603, 0.08943273, 6.77033755, 5.33810483])
x = np.linspace(0, 10, 100)
# Smoothing parameter
s = 0.4
# Calculate the kernels
kernels = np.transpose([stats.distributions.norm.pdf(x, yi, s) for yi in y])
plt.plot(x, kernels, 'k:')
plt.plot(x, kernels.sum(1))  # sum of the individual kernels (not normalized: divide by len(y) for a proper density)
plt.plot(y, np.zeros(len(y)), 'ro', ms=10)
Kernel density estimation is implemented in scipy.stats.kde, as scipy.stats.kde.gaussian_kde
from scipy.stats.kde import gaussian_kde
f, ax = plt.subplots(figsize=(7,7))
xgrid = np.linspace(data['rain'].min(), data['rain'].max(), 100)
density = gaussian_kde(data['rain']).evaluate(xgrid)
ax.hist(data['rain'], bins=20, normed=True, label='data', alpha=.6)
ax.plot(xgrid, density, 'r-', lw=2, label='KDE')
ax.legend()
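The same gaussian_kde works in more than one dimension; below is a two-dimensional example on synthetic data.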
# Create some dummy data
rvs = np.append(stats.norm.rvs(loc=2,scale=1,size=(200,1)),
stats.norm.rvs(loc=1,scale=3,size=(200,1)),
axis=1)
kde = stats.kde.gaussian_kde(rvs.T)
# Regular grid to evaluate kde upon
x_flat = np.r_[rvs[:,0].min():rvs[:,0].max():128j]
y_flat = np.r_[rvs[:,1].min():rvs[:,1].max():128j]
x,y = np.meshgrid(x_flat,y_flat)
grid_coords = np.append(x.reshape(-1,1),y.reshape(-1,1),axis=1)
z = kde(grid_coords.T)
z = z.reshape(128,128)
f, ax = plt.subplots(figsize=(8,8))
ax.scatter(rvs[:,0],rvs[:,1],alpha=0.8,color='white')
im = ax.imshow(z, aspect=x_flat.ptp()/y_flat.ptp(), origin='lower',
               extent=(rvs[:,0].min(), rvs[:,0].max(), rvs[:,1].min(), rvs[:,1].max()),
               cmap=plt.get_cmap('Spectral'))
plt.colorbar(im, shrink=0.7)
Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests.
In the previous parts we considered a single variable and wanted to describe its distribution, either using a parametric approach (fitting the data to a known distribution) or a non-parametric approach (getting a smooth representation of its empirical distribution / histogram using kernel density estimation).
Another general, primary goal of many statistical data analysis tasks is to relate the influence of one variable ($X$) on another ($Y$). For example, we may want to know how different levels of contaminants in the water influence the reproduction of a marine invertebrate species, or how the variation of some remote climate phenomenon influences rainfall variability in a specific region of New Zealand. In this case as well, one needs to decide on the particular form of the model that relates $Y$ to $X$ (e.g. what function we will consider: linear, quadratic, logistic, etc.), then use algorithms to estimate the parameters of this model.
Recognizing that additional factors other than $X$ (the ones we have measured or are interested in) may influence the response variable $Y$, we can build a model to characterize the relationship:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

where $\epsilon_i$ is the difference between the modelled and observed values.
We would like to select $\beta_0, \beta_1$ so that the difference between the predictions and the observations ($\epsilon_i$) is zero, but this is not usually possible. Instead, we choose a reasonable criterion: *the smallest sum of the squared differences between $\hat{y}$ and $y$*.
Squaring serves two purposes: (1) to prevent positive and negative values from cancelling each other out and (2) to strongly penalize large deviations. Whether the latter is a good thing or not depends on the goals of the analysis.
In other words, we will select the parameters that minimize the squared error of the model:

$$\hat{\beta}_0, \hat{\beta}_1 = \underset{\beta_0,\beta_1}{\operatorname{argmin}} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2$$
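As a quick illustration of this criterion before turning to statsmodels, here is a minimal sketch that solves the normal equations directly on synthetic data (the true coefficients 2.0 and 0.5, the noise level and the seed are arbitrary choices for this example):

np.random.seed(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + np.random.randn(x.size)  # y = beta0 + beta1 * x + noise
# design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])
# least-squares solution of the normal equations (X'X) beta = X'y
beta = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
beta  # approximately [2.0, 0.5]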
Here we're going to model the SOI (Southern Oscillation Index) as a function of the Sea Surface Temperature anomalies in the central-east Pacific (NINO 3.4 index). We saw in the previous notebook that these time-series are (negatively) linearly related (Pearson's correlation coefficient ≈ -0.71), so setting up and estimating the parameters of a linear model seems like a good idea ...
import pandas as pd
data = pd.read_csv('./data/soi_nino.csv', index_col=0, parse_dates=True)
data.head()
| | SOI | nino |
|---|---|---|
| 1950-01-01 | 0.570492 | -1.84 |
| 1950-02-01 | 1.642897 | -1.63 |
| 1950-03-01 | 1.745849 | -1.30 |
| 1950-04-01 | 1.895316 | -1.43 |
| 1950-05-01 | 0.748673 | -1.72 |

5 rows × 2 columns
data.corr()
| | SOI | nino |
|---|---|---|
| SOI | 1.000000 | -0.710077 |
| nino | -0.710077 | 1.000000 |

2 rows × 2 columns
import statsmodels.api as sm
y = data['SOI'].values
X = data['nino'].values
X = X[..., np.newaxis]  # make X a 2D column vector
X = sm.add_constant(X)  # prepend a column of ones for the intercept
lm1_mod = sm.OLS(y, X)
lm1_fit = lm1_mod.fit()
lm1_fit.summary()
| Dep. Variable: | y | R-squared: | 0.504 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.504 |
| Method: | Least Squares | F-statistic: | 779.0 |
| Date: | Fri, 09 May 2014 | Prob (F-statistic): | 8.08e-119 |
| Time: | 09:18:10 | Log-Likelihood: | -905.59 |
| No. Observations: | 768 | AIC: | 1815. |
| Df Residuals: | 766 | BIC: | 1824. |
| Df Model: | 1 | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| const | -0.2307 | 0.029 | -7.977 | 0.000 | -0.287 -0.174 |
| x1 | -0.9506 | 0.034 | -27.911 | 0.000 | -1.017 -0.884 |

| Omnibus: | 2.974 | Durbin-Watson: | 1.505 |
|---|---|---|---|
| Prob(Omnibus): | 0.226 | Jarque-Bera (JB): | 2.809 |
| Skew: | -0.138 | Prob(JB): | 0.246 |
| Kurtosis: | 3.106 | Cond. No. | 1.28 |
To predict the response for a new value of the predictor, you need to prepend a constant (1.0 is the default) so that the matrices are aligned with the fitted design matrix:
X_prime = np.array([1., 3.]).reshape((1,2))
X.shape
(768, 2)
X_prime.shape
(1, 2)
y_hat = lm1_fit.predict(X_prime)
$\hat{y}$ is the value of SOI predicted by the linear model for NINO3.4 = 3.
y_hat
array([-3.08240935])
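This matches the fitted coefficients: $\hat{y} = -0.2307 - 0.9506 \times 3 \approx -3.08$.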
The statsmodels library also has a formula API that allows you to specify your model using a syntax that should be familiar to R users. The variable names are simply the names of the corresponding columns in the pandas DataFrame, which is given as the second argument.
from statsmodels.formula.api import ols as OLS
lm2_mod = OLS('SOI ~ nino', data)
lm2_fit = lm2_mod.fit()
lm2_fit.summary()
| Dep. Variable: | SOI | R-squared: | 0.504 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.504 |
| Method: | Least Squares | F-statistic: | 779.0 |
| Date: | Fri, 09 May 2014 | Prob (F-statistic): | 8.08e-119 |
| Time: | 09:18:18 | Log-Likelihood: | -905.59 |
| No. Observations: | 768 | AIC: | 1815. |
| Df Residuals: | 766 | BIC: | 1824. |
| Df Model: | 1 | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -0.2307 | 0.029 | -7.977 | 0.000 | -0.287 -0.174 |
| nino | -0.9506 | 0.034 | -27.911 | 0.000 | -1.017 -0.884 |

| Omnibus: | 2.974 | Durbin-Watson: | 1.505 |
|---|---|---|---|
| Prob(Omnibus): | 0.226 | Jarque-Bera (JB): | 2.809 |
| Skew: | -0.138 | Prob(JB): | 0.246 |
| Kurtosis: | 3.106 | Cond. No. | 1.28 |
f, ax = plt.subplots(figsize=(7,7))
ax.plot(data['nino'], data['SOI'], 'k+')
ax.plot(data['nino'], lm2_fit.fittedvalues, 'b-')
# or
#ax.plot(data['nino'], lm2_fit.predict(), 'b-')
ax.set_xlabel('NINO 3.4')
ax.set_ylabel('SOI')
A quick demo of the seaborn library, which is built on top of matplotlib (and uses statsmodels under the hood for some of its statistical plots).
To get and install seaborn, issue these commands in a terminal:

conda install pip
pip install seaborn

If that doesn't work:

- download it from https://github.com/mwaskom/seaborn/archive/master.zip
- unzip the archive and cd into the directory created
- run:

python setup.py build
python setup.py install

If that fails complaining about missing libraries, you might have to install them with pip, e.g.:

pip install husl
pip install moss
import seaborn as sns
sns.regplot("nino", "SOI", data)