Confidence bounds via the Chebychev and Hoeffding Inequalities

Chebychev's inequality

Suppose $X$ has finite expected value $\mu$ and variance $\sigma^2$. Then $$ \mathbb P_\mu (| X - \mu | \ge k \sigma ) \le 1/k^2. $$

Using Chebychev's inequality for confidence intervals

Suppose that $\{X_j\}_{j=1}^n$ are independent random variables and $\{\ell_j\}_{j=1}^n$ and $\{u_j\}_{j=1}^n$ are real numbers such that

$$\mathbb P (X_j \in [\ell_j, u_j]) = 1, \forall j = 1, \ldots, n.$$

Let $\bar{X} \equiv \frac{1}{n} \sum_{j=1}^n X_j$, $\mu = \mathbb E \bar{X}$, and $\tau^2 \equiv \sum_{j=1}^n (u_j - \ell_j)^2$.

Let's bound $\mbox{Var } X_j = \mathbb E (X_j - \mathbb E X_j)^2$. Intuitively, the variance is maximized by "spreading out" the distribution as much as possible, which would mean putting mass 1/2 at $\ell_j$ and mass 1/2 at $u_j$. (This is directly analogous to the fact that the variance of a Bernoulli distribution is maximal when $p = 1/2$.) The mean of the corresponding distribution is $(\ell_j + u_j)/2$, and the variance is

$$ \frac{1}{2} \left ( \ell_j - (\ell_j + u_j)/2 \right )^2 + \frac{1}{2} \left ( u_j - (\ell_j + u_j)/2 \right )^2 = \frac{1}{2} \left ( \frac{\ell_j - u_j}{2} \right )^2 + \frac{1}{2} \left ( \frac{u_j - \ell_j}{2} \right )^2 $$$$ = \left ( \frac{\ell_j - u_j}{2} \right )^2 = \frac{(u_j - \ell_j)^2}{4} . $$

(As a sanity check, this variable is a shifted and scaled Bernoulli with $p=1/2$, which has variance $1/4$. The shift doesn't affect the variance, and the scale factor is $u_j - \ell_j$, so the new variance should be $1/4 \times (u_j - \ell_j)^2$.)

Hence $$ \mbox{Var } \bar{X} \equiv \sigma^2 = \frac{1}{n^2} \sum_{j=1}^n \mbox{Var }X_j \le \frac{\tau^2}{4n^2}.$$

Since $\sigma \le \tau/(2n)$, $$ \mathbb P_\mu \left (| \bar{X} - \mu | \ge k \frac{\tau}{2 n} \right ) \le \mathbb P_\mu \left (| \bar{X} - \mu | \ge k \sigma \right ) \le 1/k^2, $$ and thus $$ \mathbb P_\mu \left (| \bar{X} - \mu | \ge a \right ) \le \frac{\tau^2}{4n^2 a^2}. $$ This bound is trivial (because it is greater than 1) until $$ \frac{\tau^2}{4n^2 a^2} < 1, $$ i.e., $$ a > \frac{\tau}{2 n}. $$

Hoeffding's Inequality

Let $\{X_j\}_{j=1}^n$ be independent random variables and $\{\ell_j\}_{j=1}^n$ and $\{u_j\}_{j=1}^n$ be real numbers such that $$\mathbb P (X_j \in [\ell_j, u_j]) = 1, \forall j = 1, \ldots, n,$$ and define $\bar{X} \equiv \frac{1}{n} \sum_j X_j$, $\mu \equiv \mathbb E \bar{X}$, and $\tau^2 \equiv \sum_{j=1}^n (u_j - \ell_j)^2$.

Then $$ \mathbb P_\mu ( \bar{X} - \mu \ge a ) \le \exp \left (-\frac{2n^2 a^2}{\tau^2} \right ) $$ and $$ \mathbb P_\mu ( |\bar{X} - \mu | \ge a ) \le 2 \exp \left(-\frac{2n^2 a^2}{\tau^2} \right ). $$

This bound is useful once $$ 2 \exp \left(-\frac{2n^2 a^2}{\tau^2} \right ) < 1, $$ i.e., $$ a > \sqrt{-2\ln 1/2} \cdot \frac{\tau}{2 n}. $$ Since $\sqrt{-2 \ln 1/2} \approx 1.18$, Hoeffding's inequality requires $a$ to be about 18% larger to be useful compared to Chebychev's inequality, but it is exponentially better as $n$ grows.

Truncation

If we know a priori that $\mu \in [\ell, u]$, we can take the intersection of any confidence interval $\mathcal I$ with $[\ell, u]$ without losing any coverage probability: if $\mathcal I \ni \mu$, then $\mathcal I \cap [\ell, u] \ni \mu$. Hence, we can truncate any confidence interval by taking its intersection with $[\ell, u]$ without reducing the confidence level.

For instance, suppose $\mathbb P X_j \in [\ell_j, u_j]) = 1, \forall j = 1, \ldots, n$. Define $\ell \equiv \frac{1}{n} \sum_{j=1}^n \ell_j$ and $u \equiv \frac{1}{n} \sum_{j=1}^n u_j$. Then

$$\mathbb P \{ \bar{X} \in [\ell, u] \} = 1,$$

and thus $\mu \equiv \mathbb E \bar{X} \in [\ell, u]$. Since we know with certainty that $\mu \in [\ell, u]$, any portion of a confidence interval $\mathcal I$ that lies outside $[\ell, u]$ cannot help it cover $\mu$, and we might as well use $\mathcal I \cap [\ell, u]$ if it is shorter than $\mathcal I$: we gain precision without losing coverage.

Basing a confidence interval on Chebychev's inequality or Hoeffding's inequality sometimes produces an interval that can be improved by truncation, decreasing the length (and expected length) without sacrificing coverage.

Comparing Hoeffding's Inequality and Chebychev's Inequality for the sample mean from finite bounded populations

Both inequalities bound $\mathbb P_\mu (|\bar{X} - \mu | \ge a)$ in terms of $\frac{\sum_{j=1}^n (u_j - \ell_j)^2}{n^2 a^2}$, but Hoeffding's bound is exponentially better.

Here is a comparison, normalized by taking $\tau^2 \equiv \sum_{j=1}^n (u_j - \ell_j)^2 = n$ so that $\tau^2/n = 1$, as would be the case if $\ell_j = 0$ and $u_j = 1$, $\forall j$:

In [1]:
# This is the first cell with code: set up the Python environment
%matplotlib inline
import matplotlib.pyplot as plt
import math
import numpy as np
import scipy as sp
import scipy.stats
from scipy.stats import binom
import pandas as pd
from ipywidgets.widgets import interact, interactive, fixed
import ipywidgets as widgets
from IPython.display import clear_output, display, HTML
In [2]:
def chebychev(n, ssByn, a):
    return ssByn/(4.0 * n * np.square(a))

def hoeffding(n, ssByn, a):
    return 2*np.exp(-2.0 * n * np.square(a) / ssByn)

def chebychevHoeffdingBoundPlot(n, ssByn = 1.0, aMax = 1.0, pts = 1000):
    ''' 
       Plots the bounds on P( |\bar{X} - \mu | \ge a) for Chebychev's inequality and 
       Hoeffding's inequality.
       \bar{X} is the sample mean of n independent bounded random variables.
       ssBy is the mean sum of squares of the differences between the upper and lower bounds on 
       each of those n variables.
       Plots the curves as pts points
    '''
    fig, ax = plt.subplots(nrows=1, ncols=1)
    aMin = math.sqrt(-math.log(0.5)*ssByn/(2.0*n))  # smallest a for which Hoeffding bound is nontrivial
    x = np.linspace(aMin, aMax, pts+1)
    cheb, = plt.plot(x, np.fmin(1,chebychev(n, ssByn, x)), color='b', label='Chebychev')
    hoef, = plt.plot(x, hoeffding(n, ssByn, x), color='r', label='Hoeffding')
    plt.title(r'Bound on $\mathbf{P} (|\bar{X} - \mu| > a)$')
    plt.xlabel('a')
    plt.ylabel('Probability')
    plt.legend(loc = 'best')

interact(chebychevHoeffdingBoundPlot, n=widgets.IntSlider(min=5, max=300, step=1, value=30),\
         ssByn = widgets.FloatSlider(min=0.5, max=10, step=0.1, value=1.0),\
         aMax = widgets.fixed(1), pts = widgets.fixed(1000)
        )
Out[2]:
<function __main__.chebychevHoeffdingBoundPlot>

Using Hoeffding's inequality for confidence intervals

Suppose that $\{X_j \}_{j=1}^n$ are iid with support $[0, 1]$. Then $\tau = \sqrt{n}$.

It follows that $$ A_\alpha(\theta) = \left [0, \theta + \sqrt{\frac{-\ln \alpha}{2n}} \right ] $$ is the (one-sided) acceptance region for a level-$\alpha$ test of the hypothesis $\mu = \theta$; $$ \mathbb P_\mu \left (\bar{X} - \sqrt{\frac{-\ln \alpha}{2n}} \ge \mu \right ) \le \alpha; $$ and $$\left [\bar{X} - \sqrt{\frac{-\ln \alpha}{2n}}, 1 \right ]$$ is a one-sided confidence interval for $\mu$, with confidence level $1-\alpha$.

Similarly, $$ A_\alpha(\theta) = \left [\theta - \sqrt{\frac{-\ln \alpha/2}{2n}}, \theta + \sqrt{\frac{-\ln \alpha/2}{2n}} \right ] $$ is the two-sided acceptance region for a level-$\alpha$ test of the hypothesis $\mu = \theta$; $$ \mathbb P_\mu \left ( \left | \bar{X} - \sqrt{\frac{-\ln \alpha/2}{2n}} \right | \ge \mu \right ) \le \alpha; $$ and $$\left [\bar{X} - \sqrt{\frac{-\ln \alpha/2}{2n}}, \bar{X} + \sqrt{\frac{-\ln \alpha/2}{2n}} \right ]$$ is a two-sided confidence interval for $\mu$, with confidence level $1-\alpha$.

Let's compare Hoeffding confidence intervals and intervals based on the normal approximation by simulation, in the cases we saw previously in [Confidence intervals based on the normal approximation](normApprox.ipynb). We will use two-sided bounds, even though in applications one-sided bounds are often more interesting.

Of course, if we know a priori that $\mu \in [\ell, u]$, the intersection of the Hoeffding confidence interval with $[\ell, u]$ is still a $1-\alpha$ confidence interval.

In [3]:
# Population of two values, {0, 1}, in various proportions.  Amounts to Binomial random variable
ns = np.array([25, 50, 100, 400])  # sample sizes
ps = np.array([.001, .01, 0.1])    # mixture fractions, proportion of 1s in the population
alpha = 0.05  # 1- (confidence level)
reps = int(1.0e4)  # just for demonstration
vals = [0, 1]

simTable = pd.DataFrame(columns=('fraction of 1s', 'sample size', 'Student-t cov',\
                                 'Hoeffding cov', 'Student-t len', 'Trunc Hoeffding len'))
for p in ps:
    popMean = p
    for n in ns:
        tCrit = sp.stats.t.ppf(q=1-alpha/2, df=n-1)
        hCrit = np.sqrt(-math.log(alpha/2)/(2*n))
        sam = sp.stats.binom.rvs(n, p, size=reps)
        samMean = sam/float(n)
        samSD = np.sqrt(samMean*(1-samMean)/(n-1))
        coverT = (np.fabs(samMean-popMean) < tCrit*samSD).sum()
        lenT = 2*(tCrit*samSD).mean()
        lenH = (np.minimum(samMean+hCrit, 1.0)-np.maximum(samMean-hCrit, 0.0)).mean()
        coverH = (np.fabs(samMean-popMean) < hCrit).sum()
        simTable.loc[len(simTable)] =  p, n, str(100*float(coverT)/float(reps)) + '%', \
            str(100*float(coverH)/float(reps)) + '%',\
            str(round(lenT,4)),\
            str(round(lenH,4))
#
ansStr =  '<h3>Simulated coverage probability for Student-t and truncated Hoeffding confidence intervals for a {0, 1} population</h3>' +\
          '<strong>Nominal coverage probability ' + str(100*(1-alpha)) +\
          '%</strong>.<br /><strong>Estimated from ' + str(round(reps)) + ' replications.</strong>'
display(HTML(ansStr))
display(simTable)

Simulated coverage probability for Student-t and truncated Hoeffding confidence intervals for a {0, 1} population

Nominal coverage probability 95.0%.
Estimated from 10000.0 replications.
fraction of 1s sample size Student-t cov Hoeffding cov Student-t len Trunc Hoeffding len
0 0.001 25.0 2.47% 100.0% 0.0041 0.2726
1 0.001 50.0 5.08% 100.0% 0.0041 0.1931
2 0.001 100.0 10.03% 100.0% 0.0041 0.1369
3 0.001 400.0 33.47% 100.0% 0.0036 0.0689
4 0.010 25.0 22.29% 100.0% 0.0386 0.2817
5 0.010 50.0 40.03% 100.0% 0.0356 0.2023
6 0.010 100.0 63.72% 100.0% 0.0309 0.1459
7 0.010 400.0 90.81% 100.0% 0.0188 0.0779
8 0.100 25.0 92.54% 100.0% 0.2329 0.3721
9 0.100 50.0 87.75% 100.0% 0.1669 0.2918
10 0.100 100.0 92.99% 100.0% 0.1177 0.2335
11 0.100 400.0 94.88% 100.0% 0.0589 0.1358
In [4]:
# Nonstandard mixture: a pointmass at zero and a uniform[0,1]
ns = np.array([25, 50, 100, 400])  # sample sizes
ps = np.array([0.9, 0.99, 0.999])    # mixture fraction, weight of pointmass
alpha = 0.05  # 1- (confidence level)
reps = int(1.0e4)   # just for demonstration

simTable = pd.DataFrame(columns=('mass at 0', 'sample size', 'Student-t cov',\
                                 'Hoeffding cov', 'Student-t len', 'Trunc Hoeffding len')
                       )

for p in ps:
    popMean = (1-p)*0.5  #  p*0 + (1-p)*.5
    for n in ns:
        tCrit = sp.stats.t.ppf(q=1-alpha/2, df=n-1)
        hCrit = np.sqrt(-math.log(alpha/2)/(2*n))
        coverT = 0
        coverH = 0
        totT = 0.0
        totH = 0.0
        for rep in range(int(reps)):
            sam = np.random.uniform(size=n)
            ptMass = np.random.uniform(size=n)
            sam[ptMass < p] = 0.0
            samMean = np.mean(sam)
            samSD = np.std(sam, ddof=1)
            coverT += ( math.fabs(samMean-popMean) < tCrit*samSD )
            totT += 2*tCrit*samSD
            coverH += ( math.fabs(samMean-popMean) < hCrit )
            totH +=  min(samMean+hCrit, 1.0)-max(samMean-hCrit, 0.0)
        simTable.loc[len(simTable)] =  p, n,\
            str(100*float(coverT)/float(reps)) + '%',\
            str(100*float(coverH)/float(reps)) + '%',\
            str(round(totT/float(reps),4)),\
            str(round(totH/float(reps),4))
#
ansStr =  '<h3>Simulated coverage probability of Student-t and truncated Hoeffding confidence intervals for ' +\
          'mixture of U[0,1] and pointmass at 0 population</h3>' +\
          '<strong>Nominal coverage probability ' + str(100*(1-alpha)) +\
          '%</strong>.<br /><strong>Estimated from ' + str(round(reps,0)) + ' replications.</strong>'

display(HTML(ansStr))
display(simTable)

Simulated coverage probability of Student-t and truncated Hoeffding confidence intervals for mixture of U[0,1] and pointmass at 0 population

Nominal coverage probability 95.0%.
Estimated from 10000.0 replications.
mass at 0 sample size Student-t cov Hoeffding cov Student-t len Trunc Hoeffding len
0 0.900 25.0 90.56% 100.0% 0.6459 0.3219
1 0.900 50.0 98.75% 100.0% 0.6707 0.2421
2 0.900 100.0 99.97% 100.0% 0.6816 0.1858
3 0.900 400.0 100.0% 100.0% 0.6868 0.1178
4 0.990 25.0 22.2% 100.0% 0.0988 0.2767
5 0.990 50.0 38.95% 100.0% 0.1268 0.1971
6 0.990 100.0 62.96% 100.0% 0.1616 0.1408
7 0.990 400.0 97.81% 100.0% 0.2101 0.0729
8 0.999 25.0 2.66% 100.0% 0.0112 0.2722
9 0.999 50.0 4.87% 100.0% 0.0139 0.1926
10 0.999 100.0 9.54% 100.0% 0.019 0.1363
11 0.999 400.0 32.32% 100.0% 0.0352 0.0684

Remarkably, the truncated Hoeffding intervals are often shorter on average than the Student-t intervals, but have far better coverage.

More in this direction...

We could also apply other inequalities, such as Serfling's Inequality (a version of Hoeffding's inequality for sampling without replacement) or Feige's Inequality.

For sampling without replacement, the "Empirical Bernstein-Serfling Inequality" of Bardenet and Maillard (2013) seems particularly worth exploring.

In [5]:
%run talkTools.py