In [1]:

import random

import matplotlib.pyplot as plt
import numpy as np

In [2]:

alpha, beta = 2.5, 5.0

Compute the statistic from a single sample and estimate its uncertainty via bootstrap.

In [4]:

random.seed(42)
sample = np.array([random.betavariate(alpha, beta) for _ in xrange(100)])
estimate = np.percentile(sample, 75) - np.percentile(sample, 25)
print 'Point estimate from single sample:', estimate

Point estimate from single sample: 0.267156121833

In [5]:

bootstrap_samples = []
np.random.seed(42)
for _ in xrange(100000):
    data = np.random.choice(sample, size=100, replace=True)
    bootstrap_samples.append(np.percentile(data, 75) - np.percentile(data, 25))
bootstrap_samples = np.array(bootstrap_samples)
bootstrap_mean = bootstrap_samples.mean()
bootstrap_std = np.std(bootstrap_samples)
print 'Bootstrap estimate: {0:.4f} +/- {1:.4f}'.format(estimate, bootstrap_std)

Bootstrap estimate: 0.2672 +/- 0.0269

Compute the statistic from a bunch of samples from the real population.

In [6]:

population_samples = []
random.seed(42)
for _ in xrange(100000):
    data = np.array([random.betavariate(alpha, beta) for _ in xrange(100)])
    population_samples.append(np.percentile(data, 75) - np.percentile(data, 25))
population_samples = np.array(population_samples)
population_mean = population_samples.mean()
population_std = np.std(population_samples)
print "Real distribution of statistic: {0:.4f} +/- {1:.4f}".format(population_mean, population_std)

Real distribution of statistic: 0.2294 +/- 0.0263

Plot the distribution of the statistic for 1) several samples from the real population, and 2) estimated via bootstrap from one sample.

In [7]:

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 12), sharex=True)

ax1.hist(bootstrap_samples, bins=100, normed=True)
ax1.axvline(x=estimate, color='red')
ax1.set_title('Bootstrap samples ({0:.4f} +/- {1:.4f})'.format(
              estimate, bootstrap_std))

ax2.hist(population_samples, bins=100, normed=True)
ax2.axvline(x=population_mean, color='red')
ax2.set_title('Samples from real population ({0:.4f} +/- {1:.4f})'.format(
              population_mean, population_std))

Out[7]:

<matplotlib.text.Text at 0xb3fbfac>

The plots share the same x-axis, so we can clearly see that the means are different. This is not unexpected. The mean from the single sample is computed from only 100 data points and is expected to vary from the population mean.

The interesting thing that we get from the bootstrap is that the standard deviations are very close and the distributions look fairly similar around their means.