This notebook will demonstrate how to analyse adaptation time of a population using generalized linear models (GLM), and specificaly exponential regression. This assumes that adaptation time is a geomteric or exponential random variable since it is roughly the waiting time for an event to occur with some independent event of it occuring every generation; this neglects much of the dynamics, of course, but the idea is to get a general way to analyse adaptation time without having a clear dynamic model of the process.
This notebook is under the CC-BY-SA 4.0 license.
The latest version of the notebook can be found on github.
Thanks to Uri Obolski for help with understanding GLMs.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from scipy import stats, optimize
import statsmodels.api as sm
import pandas as pd
import seaborn as sns
sns.set(
style='white',
palette='Set1',
context='paper',
font_scale=1.3,
)
red, blue, green = sns.color_palette('Set1', 3)
from ipy_progressbar import ProgressBarIPyNb as ProgressBar # pip install git+https://github.com/yoavram/ipy-progressbar.git
import sys
print("Python", sys.version)
print("NumPy:", np.__version__)
print("Matplotlib:", mpl.__version__)
print("statsmodels:", sm.version.version)
print("Pandas:", pd.__version__)
print("Seaborn:", sns.__version__)
Python 3.5.2 | packaged by conda-forge | (default, Jul 26 2016, 01:37:38) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.54)] NumPy: 1.11.2 Matplotlib: 1.5.3 statsmodels: 0.8.0rc1 Pandas: 0.19.1 Seaborn: 0.7.1
The pdf (probabilit density function) and cdf (cummulative distribution function) of an exponential distribution with rate λ is: $$ f_{\lambda}(x) = \lambda e^{-\lambda x} \\ F_{\lambda}(x) = 1=e^{-\lambda x} $$
We create two exponential random variables, with rates 1/2 and 1/4:
λs = np.array([1/2, 1/4])
expons = [stats.expon(scale=1/λ) for λ in λs]
[exp.mean() for exp in expons], 1/λs
([2.0, 4.0], array([ 2., 4.]))
And we plot their pdf and cdf:
x = np.linspace(0, 16, 1000)
for exp in expons:
plt.plot(x, exp.pdf(x), label='pdf')
plt.plot(x, exp.cdf(x), label='cdf')
plt.axvline(x=exp.mean(), color='k', ls='--', label='mean')
plt.legend(loc='center right')
sns.despine()
Now we draw some random numbers from the same distributions:
for exp in expons:
y = exp.rvs(100)
plt.plot(y, 'o')
sns.despine()
We can try to estimate the rate of the distribution from a random samples. This can done using maximum likelihood. For a given set of samples $x$, the likelihood function of λ is: $$ L(\lambda | x) = \Pi_i{\lambda e^{-\lambda x_i}} = \lambda^n e^{-\lambda n \bar{x}} $$ which gives $\hat{\lambda} = \frac{1}{\bar{x}}$ (although this is a biased estimator $E[\hat{\lambda}]=\frac{n}{n-1}\lambda$, see below).
nsamples = np.logspace(2, 5, 50, dtype=int)
estimates = [
1/exp.rvs(n).mean()
for n in nsamples
]
estimates = np.array(estimates)
λ = 1/exp.mean()
plt.plot(nsamples, (estimates/λ - 1), 'o')
plt.axhline(0, color='k')
plt.plot(nsamples, nsamples/(nsamples - 1) - 1, '--') # excepted bias
plt.xscale('log')
plt.xlabel('# samples')
plt.ylabel('λ relative estimation error (λ={})'.format(1/exp.mean()))
sns.despine()
To show the importance of the number of samples, we draw n=10, 100 or samples from an exponential distribution multiple times and show the distribution of $\hat{\lambda}$.
fig, ax = plt.subplots(1, 3, figsize=(12, 4))
reps = 10000
for i, n in enumerate([10, 100, 1000]):
estimates = [
1/exp.rvs(n).mean()
for _ in range(reps)
]
sns.distplot(estimates, kde=False, fit=stats.norm, ax=ax[i])
ax[i].axvline(x=λ, color='k', ls='-')
ax[i].axvline(x=λ * n/(n-1) , color='k', ls='--')
ax[i].set(
title=n,
xlabel=('λ estimate'),
ylabel=('Density')
)
sns.despine()
The red bars show the histogram; the black curve is a fitted normal distribution; the solid vertical line is the real value, $\lambda$; the dashed line is the expected estimator, $\frac{n}{n-1}\lambda$. You can see the differencet scales on the x-axis, and that as the distribution becomes narrower it is less biased towards positive values - and the estimator converges to the real value.
Note that the sum of iid exponentials is Gamma distributed: $y=\sum_{i=1}^{n}{x_i} \sim Gamma(n, \frac{1}{\lambda})$ (with $(k,\theta)$ parameterization).
What is the expected estimation?
\begin{multline} E[\hat{\lambda}] = \\ E[\frac{n}{\sum_{i=1}^{n}{x_i}}] = \\ n E[\frac{1}{y}] =\\ n \int_{0}^{\infty}{\frac{1}{y} \frac{\lambda^n y^{n-1}}{\Gamma(n)} e^{-\lambda y} dy} = \\ n \frac{\lambda^n}{\Gamma(n)} \int_0^{\infty}{y^{n-2} e^{-\lambda y} dy} = \\ n \frac{\lambda^n}{\Gamma(n)} \frac{1}{\lambda} \lambda^{2-n} \Gamma(n-1) = \\ \frac{\lambda^{2-n} \lambda^n n \Gamma(n-1)}{\lambda \Gamma(n)} = \\ \frac{n-1}{n} \lambda \end{multline}Now we have some measurements $x$ and $y$ where $x_i$ is independent variable in $R$ and $y_i>0$ is exponentially distributed with $\lambda_i = g(x_i)$. We are interested in $g(x)$.
Since we know nothing about $g(x)$ we will attempt to approximate it using a linear model: $$ g(x) \approx a x + b $$
For an indepth description of this approach, see In All Likelihood by Yudi Pawitan, Ch. 6 and specifically section 6.5 (pg. 163).
n = 1000
a, b = 5, 1
x = np.random.random(n)
x.sort()
y = stats.expon.rvs(scale=1/(a * x + b))
plt.plot(x, 1/(a * x + b), lw=2, label='model')
plt.plot(x, y, 'o', alpha=0.25, label='data')
plt.xlabel('x')
plt.ylabel('y')
plt.yscale('log')
plt.legend()
sns.despine()
We can now try to estimate $a, b$ from the data.
The likelihood function for $a,b$ given $x,y$ is: $$ L(a, b | x, y) = \Pi_i {(a x_i + b) \cdot e^{-y_i (a x_i + b)}} $$
The log-likelihood is nicer as it has a sum rather then a product: $$ logL(a, b | x, y) = \sum_i{[log{(a x_i + b)} -y_i (a x_i + b)]} $$
We define the corresponding Python, but it will be the funciton of the negative log-likelihood, so that we can minimize it (rather than maximize it).
def nloglik(params, x, y):
a, b = params
return -np.sum(
[
np.log(a * xi + b) - yi * (a * xi + b)
for xi, yi in zip(x, y)
]
)
res = optimize.minimize(nloglik, (2, 2), args=(x, y))
ahat, bhat = res.x
print('{:.2f} ~ {:.4f}'.format(a, ahat))
print('{:.2f} ~ {:.4f}'.format(b, bhat))
5.00 ~ 5.0701 1.00 ~ 1.0151
Fortunately, we don't have to do this manually; rather, we can use the statsmodels package. The package has tools for GLM fitting. We need to specify a distribution (family) and possibly a link function (here the inverse function which links the average of $y$ to the distribution parameter $\lambda$). The appropriate distribution is gamma, which is a generaliztion of the exponential distribution, adding a dispersal/variance parameter (see below).
x1 = sm.add_constant(x, prepend=False) # add intercept - b
glm = sm.GLM(y, x1, family=sm.families.Gamma())
results = glm.fit()
results.summary()
/Users/yoavram/miniconda3/envs/simarba/lib/python3.5/site-packages/statsmodels/genmod/generalized_linear_model.py:244: DomainWarning: The inverse_power link function does not respect the domain of the Gamma family. DomainWarning)
Dep. Variable: | y | No. Observations: | 1000 |
---|---|---|---|
Model: | GLM | Df Residuals: | 998 |
Model Family: | Gamma | Df Model: | 1 |
Link Function: | inverse_power | Scale: | 1.17200700985 |
Method: | IRLS | Log-Likelihood: | 168.83 |
Date: | Mon, 12 Dec 2016 | Deviance: | 1220.7 |
Time: | 09:19:06 | Pearson chi2: | 1.17e+03 |
No. Iterations: | 6 |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
x1 | 5.0701 | 0.364 | 13.917 | 0.000 | 4.356 | 5.784 |
const | 1.0151 | 0.122 | 8.347 | 0.000 | 0.777 | 1.253 |
The coefs for x1
and const
are $a$ and $b$.
Some advantages for using statsmodels:
P>|z|
column) and confidence intervals on the coefs (last column).Next we plot the results.
ahat, bhat = results.params
plt.plot(x, 1/(a * x + b), lw=2, label='model')
plt.plot(x, y, 'o', alpha=0.25, label='data')
plt.plot(x, 1/(ahat * x + bhat), lw=3, ls='--', label='estimate')
#plt.plot(x, results.predict(x1), lw=3, ls='--', label='estimate') # same as above line
plt.xlabel('x')
plt.ylabel('y')
plt.yscale('log')
plt.legend()
sns.despine()
If the regression model is suitable for the data, we expect that the deviance residuals will be normally distributed - that the estimation error is normal.
The left plot shows a histogram of the residuals, fitted with a normal distribution; the right plot shows a QQ plot that plots the empirical quantiles of the residuals on y-axis vs. the theoretical quantiles on the x-axis. Both plots suggest that the residuals are normally distributed, which is good.
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
resid = results.resid_deviance.copy()
resid_std = stats.zscore(resid)
sns.distplot(resid_std, bins=25, kde=False, fit=stats.norm, ax=ax[0])
ax[0].set_title('Residuals')
sm.graphics.qqplot(resid, line='r', ax=ax[1])
ax[1].set_title('QQ plot of residuals')
sns.despine()
Calculate and plot the log-likelihood for combinations of $a,b$.
def plot_loglik(L, xrange, yrange, xlabel=None, ylabel=None, cmap='viridis'):
fig, ax = plt.subplots()
im = ax.imshow(L.T, cmap=cmap, aspect=1, origin=(0,0))
cntr = ax.contour(L.T, N=5, aspect=1, linewidths=0.5, linestyles='--', colors='k')
ax.clabel(cntr, colors="#252525", fmt=lambda x: '{:.0f}'.format(x))
plt.colorbar(im, ax=ax, label='Log Likelihood', fraction=0.03, pad=0.25, format='%1.e')
xstep = len(xrange) // 5
xticks = list(range(0, len(xrange), xstep)) + [len(xrange)-1]
xticklabels = xrange[xticks]
ax.set_xticks(xticks)
ax.set_xticklabels(['{:.2g}'.format(x) for x in xticklabels], rotation='vertical')
ystep = len(xrange) // 5
yticks = list(range(0, len(yrange), ystep)) + [len(yrange)-1]
yticklabels = yrange[yticks]
ax.set_yticks(yticks)
ax.set_yticklabels(['{:.2g}'.format(y) for y in yticklabels])
imax, jmax = (L == np.nanmax(L)).nonzero()
ax.scatter(imax, jmax, marker='o', s=50, color='r')
ax.set(
xlabel=xlabel,
ylabel=ylabel
)
fig.tight_layout()
sns.despine()
return fig, ax
arange = np.linspace(-5, 15)
brange = np.linspace(0, 10)
logL = np.array([
[
-nloglik((a, b), x, y)
for b in brange
]
for a in arange
])
fig, ax = plot_loglik(logL, arange, brange, xlabel='a', ylabel='b')
/Users/yoavram/miniconda3/envs/simarba/lib/python3.5/site-packages/ipykernel/__main__.py:6: RuntimeWarning: invalid value encountered in log
The red dot marks the maximum likelihood.
Next, a likelihood profile for $a$:
fig, ax = plt.subplots(1, 2, sharex=False, sharey=False, figsize=(8, 4))
alow, ahigh = results.conf_int()[0,:]
arange = np.linspace(0, 12)
idx = (arange > alow) & (arange < ahigh)
logL = np.array([
-nloglik((a, bhat), x, y)
for a in arange
])
ax[0].plot(arange, logL, 'k-')
ax[0].axvline(ahat, color='k', ls='--')
ax[0].fill_between(arange[idx], np.nanmin(logL), logL[idx], color='k', alpha=0.3)
ax[0].set(
xlabel='a',
ylabel='Log likelihood (b={:.2f})'.format(bhat),
ylim=(np.nanmin(logL), None),
)
arange = np.linspace(ahat*0.875, ahat*1.25)
ax[0].axvline(arange.min(), ls='-', color='k')
ax[0].axvline(arange.max(), ls='-', color='k')
idx = (arange > alow) & (arange < ahigh)
logL = np.array([
-nloglik((a, bhat), x, y)
for a in arange
])
ax[1].plot(arange, np.exp(logL), 'k-')
ax[1].axvline(ahat, color='k', ls='--')
ax[1].fill_between(arange[idx], np.nanmin(np.exp(logL)), np.exp(logL[idx]), color='k', alpha=0.3)
ax[1].set(
xlabel='a',
ylabel='Likelihood (b={:.2f})'.format(bhat),
)
fig.tight_layout()
sns.despine()
The left panel shows a plot of the log-likelihood for the estimated $b$ and different values of $a$. The dashed line is for the estimated $a$; the shaded ares is the 95% confidence interval; the solid lines show the area presented in the right panel.
The right panel shows the likelihood (no log) for different $a$ values. The dashed line is for the estimated $a$; the shaded ares is the 95% confidence interval.
We simulate adaptation using a Wright-Fisher model with a single bi-allelic locus and an asexual haploid constant size population. The simulation starts with a population of size N, all with allele a that confers fitness 1-s. Allele A, confering fitness 1, appears with probability u via mutation.
The expected frequency of allele A is: $$ E[p(t+1)] = \frac{p(t) \cdot (1-u) + (1-p(t)) \cdot (1-s) \cdot u}{\bar{\omega}} \\ \bar{\omega} = p(t) + (1-p(t)) \cdot (1-s) $$
The frequency of allele A in the next generation is: $$ n(t+1) \sim Bin(N, E[p(t+1)]) \\ p(t+1) = n(t+1) / N $$ that is, $n(t)$, the number of alleles A, is binomialy distributed.
We are looking for the first time $T$ at which $p(T)=1$.
def simulate_adaptation(s, N, u, reps=1, PRINT=False, PLOT=False, BAR=False):
N = int(N)
p = np.zeros(reps, dtype=float)
T = np.zeros(reps, dtype=int)
t = 0
if PLOT:
ps = []
if BAR:
bar = ProgressBar(reps)
bar.start()
while not np.allclose(p, 1):
if PRINT and t % 500 == 0:
print("{} {:.2g}".format(t, (p<1).mean()), end=" | ")
if BAR:
finished = ((T == 0) & (p == 1)).sum()
for _ in range(finished):
bar.advance()
T[(T == 0) & (p == 1)] = t
assert ((0 <= p) & (p <= 1)).all()
if PLOT: ps.append(p)
# selection and mutation
w = p + (1 - p) * (1 - s)
p = (p + (1 - p) * (1 - s) * u) / w
# random genetic drift
p = np.random.binomial(N, p) / N
t += 1
if PLOT:
ps = np.array(ps)
plt.plot(range(t), ps)
plt.ylim(-0.1, 1)
plt.xlim(0, max(10, t))
plt.xlabel("Time")
plt.ylabel("Frequency")
T[T == 0] = t
if BAR:
bar.finish()
bar.hide()
return T
T = simulate_adaptation(0.1, 1e3, 1e-5, 3, BAR=True, PLOT=True)
for t in T:
plt.axvline(t, ls='--', color='k')
sns.despine()
This shows three popoulations adapting, each one finishing adaptation at a different time.
We now simulate lots of adaptaions and present the:
T = simulate_adaptation(0.01, 1e5, 1e-7, 10, BAR=True)
T_mean = T.mean()
T_est = stats.expon.fit(T)[1]
T_theo = 1/(2*0.01*1e4*1e-6) + np.log(1e4)/0.01
print("Mean T: {:.0f}\nEstimated T: {:.0f}\nTheoretical T: {:.0f}".format(T_mean, T_est, T_theo))
sns.distplot(T, kde=False, fit=stats.expon)
plt.xlabel('Adaptation time')
plt.ylabel('Frequency')
sns.despine()
Mean T: 6664 Estimated T: 5212 Theoretical T: 5921
In the above, there was a single set of parameters $(N, s, u)$.
Now, let's simulate adaptations with a bunch of parameters and use exponential regression!
We randomize 100 parameter sets, and run 5 simulations for each set; this can take some time, especially when the adaptation time is long.
# pip install git+https://github.com/rossant/ipycache --upgrade
%load_ext ipycache
The ipycache extension is already loaded. To reload it, use: %reload_ext ipycache
%%cache mycache.pkl dfs
dfs = []
for _ in range(100):
print(_, end=' ')
s = np.random.uniform(0.01, 0.1)
N = 10**np.random.uniform(4, 7)
u = 10**np.random.uniform(-6, -9)
T = simulate_adaptation(s, N, u, 10)
dfs.append(pd.DataFrame(dict(s=s, N=N, u=u, T=T)))
[Saved variables 'dfs' to file '/Users/yoavram/Work/ipython-notebooks/mycache.pkl'.] 0 1
import pickle
with open('_mycache.pkl','rb') as f:
dfs = pickle.load(f)['dfs']
We now fit exponential regression to the simulated data.
The idea is that we assume that the response variable - T, the adaptation time - is exponentialy distributed with rate $\lambda(N, s, u)$ that is a linear function of the parameters:
$$ T \sim Exp(\lambda(N, s, u)) \\ E[1/T] = \lambda(N, s, u) = c_0 + c_N N + c_s s + c_u u $$Run the regression:
df = pd.concat(dfs, axis=0)
df['N'] = -df['N']
df['s'] = -df['s']
df['u'] = -df['u']
X = df[['N', 's', 'u']]
y = df['T']
X = sm.add_constant(X) # add intercept
distribution = sm.families.Gamma(sm.genmod.families.links.inverse_power)
glm = sm.GLM(y, X, family=distribution)
results = glm.fit()
results.summary()
/Users/yoavram/miniconda3/envs/simarba/lib/python3.5/site-packages/statsmodels/genmod/generalized_linear_model.py:244: DomainWarning: The inverse_power link function does not respect the domain of the Gamma family. DomainWarning)
Dep. Variable: | T | No. Observations: | 1600 |
---|---|---|---|
Model: | GLM | Df Residuals: | 1597 |
Model Family: | Gamma | Df Model: | 2 |
Link Function: | inverse_power | Scale: | 1.88916684088 |
Method: | IRLS | Log-Likelihood: | -14884. |
Date: | Mon, 12 Dec 2016 | Deviance: | 1802.8 |
Time: | 09:31:36 | Pearson chi2: | 3.02e+03 |
No. Iterations: | 9 |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | -1.279e-05 | 1.24e-06 | -10.290 | 0.000 | -1.52e-05 | -1.04e-05 |
N | -1.681e-10 | 1.05e-11 | -16.029 | 0.000 | -1.89e-10 | -1.48e-10 |
s | -0.0001 | 2.94e-05 | -4.629 | 0.000 | -0.000 | -7.84e-05 |
u | -2708.0310 | 142.864 | -18.955 | 0.000 | -2988.039 | -2428.023 |
Residuals analysis:
def plot_residuals(resid):
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
resid_std = stats.zscore(resid)
sns.distplot(resid_std, bins=25, kde=False, fit=stats.norm, ax=ax[0])
ax[0].set(
title='Residuals histogram',
xlabel='Residuals',
ylabel='Frequency'
)
sm.graphics.qqplot(resid, line='r', ax=ax[1])
ax[1].set_title('QQ plot of residuals')
sns.despine()
return fig, ax
plot_residuals(results.resid_deviance.copy());
This analysis shows that the residuals are not really normally distributed. Which residuals are to blame?
plt.plot(results.fittedvalues, resid_std, 'ok', alpha=0.1)
plt.xlabel('T')
plt.ylabel('Absolute standard residual')
plt.xscale('log')
sns.despine()
Looks like the lower adaptation time co-occur with the positive residuals. Let's remove them:
plot_residuals(results.resid_deviance.copy()[results.fittedvalues > 5e4]);
Looks like smaller adaptation times are to blame; this makes sense, as for smaller adaptation time the time for fixation is relatively large and it is not exponential distributed.
c0, cN, cs, cu = results.params.values
N = np.logspace(4, 8, 100)
s = np.linspace(0.01, 0.1, 100)
u = np.logspace(-6, -10, 100)
TN = results.predict(pd.DataFrame(dict(const=1, N=N, s=s[50], u=u[50])))
Ts = results.predict(pd.DataFrame(dict(N=N[50], s=s, u=u[50], const=1)))
Tu = results.predict(pd.DataFrame(dict(N=N[50], s=s[50], u=u, const=1)))
TN = 1/(c0 + cN * N + cs * s[50] + cu * u[50])
Ts = 1/(c0 + cN * N[50] + cs * s + cu * u[50])
Tu = 1/(c0 + cN * N[50] + cs * s[50] + cu * u)
fig, ax = plt.subplots(2, 3, figsize=(9, 8), sharey=False, sharex='col')
ax[0,0].plot(N, TN)
ax[0,1].plot(s, Ts)
ax[0,2].plot(u, Tu)
ax[1,0].plot(N, 1/TN)
ax[1,1].plot(s, 1/Ts)
ax[1,2].plot(u, 1/Tu)
ax[0,0].set(ylabel='Adaptation time - T')
ax[1,0].set(xscale='log', xlabel='N', ylabel='Adaptation rate ~ 1/T')
ax[1,1].set(xlabel='s', ylabel='', xscale='log')
ax[1,2].set(xscale='log', xlabel='u', ylabel='')
fig.tight_layout()
sns.despine()
This last figure shows the effect of increasing each of the paramters on the resulting predicted adaptaion time and rate.