import random
import matplotlib.pyplot as plt
import numpy as np
alpha, beta = 2.5, 5.0
Compute the statistic from a single sample and estimate its uncertainty via bootstrap.
random.seed(42)
sample = np.array([random.betavariate(alpha, beta) for _ in xrange(100)])
estimate = np.percentile(sample, 75) - np.percentile(sample, 25)
print 'Point estimate from single sample:', estimate
Point estimate from single sample: 0.267156121833
bootstrap_samples = []
np.random.seed(42)
for _ in xrange(100000):
data = np.random.choice(sample, size=100, replace=True)
bootstrap_samples.append(np.percentile(data, 75) - np.percentile(data, 25))
bootstrap_samples = np.array(bootstrap_samples)
bootstrap_mean = bootstrap_samples.mean()
bootstrap_std = np.std(bootstrap_samples)
print 'Bootstrap estimate: {0:.4f} +/- {1:.4f}'.format(estimate, bootstrap_std)
Bootstrap estimate: 0.2672 +/- 0.0269
Compute the statistic from a bunch of samples from the real population.
population_samples = []
random.seed(42)
for _ in xrange(100000):
data = np.array([random.betavariate(alpha, beta) for _ in xrange(100)])
population_samples.append(np.percentile(data, 75) - np.percentile(data, 25))
population_samples = np.array(population_samples)
population_mean = population_samples.mean()
population_std = np.std(population_samples)
print "Real distribution of statistic: {0:.4f} +/- {1:.4f}".format(population_mean, population_std)
Real distribution of statistic: 0.2294 +/- 0.0263
Plot the distribution of the statistic for 1) several samples from the real population, and 2) estimated via bootstrap from one sample.
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 12), sharex=True)
ax1.hist(bootstrap_samples, bins=100, normed=True)
ax1.axvline(x=estimate, color='red')
ax1.set_title('Bootstrap samples ({0:.4f} +/- {1:.4f})'.format(
estimate, bootstrap_std))
ax2.hist(population_samples, bins=100, normed=True)
ax2.axvline(x=population_mean, color='red')
ax2.set_title('Samples from real population ({0:.4f} +/- {1:.4f})'.format(
population_mean, population_std))
<matplotlib.text.Text at 0xb3fbfac>
The plots share the same x-axis, so we can clearly see that the means are different. This is not unexpected. The mean from the single sample is computed from only 100 data points and is expected to vary from the population mean.
The interesting thing that we get from the bootstrap is that the standard deviations are very close and the distributions look fairly similar around their means.