# The Kaplan-Wald Confidence Bound for a Nonnegative Mean¶

This section presents an approach to finding lower confidence bounds for the mean of a nonnegative random variable described in H.M. Kaplan's website, http://printmacroj.com/martMean.htm. That work fleshes out an idea due to Wald (1945, 2004), and is closely related to a technique presented in Kaplan (1987). For another derivation, see Stark & Teague (2014) https://www.usenix.org/system/files/jets/issues/0301/overview/jets_0301-stark.pdf

We have an iid sequence of random variables $X_1, X_2, \ldots$ such that $\mathbb P \{X_j \ge 0 \} = 1$. Let $F$ be their common distribution function. We seek a lower confidence bound for their common expectation $\mu \equiv \mathbb E X_1 = \int_0^\infty x dF$.

Under the hypothesis that $\mu = t$,

$$t^{-1} \mu = t^{-1} \int_0^\infty xdF(x) = \int_0^\infty xt^{-1} dF(x) = 1.$$

Fix $\gamma \in [0, 1]$. Because $\int_0^\infty dF = 1$, it follows that if $\mu = t$,

$$\mathbb E \left ( (\gamma/t) X_j + (1-\gamma) \right ) = (\gamma/t) \mathbb E X_j + (1-\gamma) = (\gamma/t)t + (1-\gamma) = 1.$$

Now, $$\mathbb E \left ((\gamma/t) X_j + (1-\gamma) \right ) \equiv \int_0^\infty \left (x \gamma/t + (1-\gamma) \right ) dF(x).$$ Since for $x \ge 0$, $(x \gamma/t + (1-\gamma)) \ge 0$, it follows that if we define

$$dG \equiv (x \gamma/t + (1-\gamma))dF$$

then $G$ is the cdf of a nonnegative random variable.

Let $Y$ be a random variable with cdf $G$. By Jensen's inequality, $\mathbb E X_j^2 \ge (\mathbb E X_j)^2 = t \cdot \mathbb E X_j$ (by hypothesis). Since $\mathbb E X_j = t \ge 0$,

\begin{align} \mathbb E Y &= \int_0^\infty x dG(x) \\ &= \int_0^\infty x (x \gamma/t + (1-\gamma)) dF(x) \\ &= \gamma/t \int_0^\infty x^2 dF(x) + (1-\gamma) \int_0^\infty x dF(x) \\ &= \gamma/t \cdot \mathbb E X_j^2 + (1-\gamma) \cdot \mathbb E X_j \\ &\ge \gamma \cdot \mathbb E X_j + (1-\gamma) \cdot \mathbb E X_j = \mathbb E X_j. \end{align}

(The penultimate step uses Jensen's inequality.)

If the data allow us to reject the hypothesis $H_0$ that $\{ X_j\}$ are IID with cdf $F$ in favor of the alternative hypothesis $H_1$ that they are iid with cdf $G$, we have strong statistical evidence that $\mu = \mathbb E X_j > t$.

## Using the SPRT to test $H_1$ versus $H_0$¶

Now a bit of magic occurs. For a given observation $X_j = x_j$, what is the likelihood ratio of $H_1$ to $H_0$?

$$\mbox{LR} = \frac{dG(x_j)}{dF(x_j)} = \frac{(x_j\gamma/t+(1−\gamma))dF(x_j)}{dF(x_j)} = (x_j\gamma/t+(1−\gamma)).$$

This doesn't depend on $F$ or $G$!

The $\mbox{LR}$ for observations $X_1, \ldots, X_m$ is $$\mbox{LR} = \Pi_{i=1}^m \left ( (\gamma/t) X_i + 1 - \gamma \right ).$$ This expression shows why $\gamma$ was introduced in the first place: for $\gamma = 1$, if even a single observation turned out to be zero, $\mbox{LR}$ would forever be zero no matter how large subsequent observations turned out to be. Taking $\gamma < 1$ hedges against that possibility. Any value of $\gamma \in [0, 1]$ gives a conservative test, but smaller values provide more "protection" against small values of $X_j$.

Recall that the $\mbox{LR}$ is the $P$-value of $H_0: \mu = t$ based on the observations $\{X_j\}_{j=1}^m$. We will use this to find a lower confidence bound for $\mu$.

There's more: recall that the SPRT says the chance that $\mbox{LR}$ ever gets larger than $1/\alpha$ is at most $\alpha$ if $H_0$ is true. That means that we can use the observations sequentially, maximizing over the partial products. If any of the partial products exceeds $1/\alpha$, we can reject $H_0$.

That is, our level-$\alpha$ test based on a sample of size $n$ is $$\mbox{ Reject } H_0 \mbox{ if } \max_{m=1}^n \Pi_{i=1}^m \left [ \gamma X_i/t + 1 - \gamma \right ] \ge 1/\alpha.$$

It is only legitimate to do this maximization if the data are in random order. For instance, it's impermissible to sort them from largest to smallest. And if you maximize over partial products, the result will, in general, depend on the order of the data.

It might be hard to explain to a court or consulting client that your confidence bound depends on the order the data happened to arrive in. If that's a concern, you can always just use all $n$ terms in the product and not maximize over $1 \le m \le n$.

## Confidence bounds from the Kaplan-Wald test¶

To find a lower confidence bound, we can invert hypothesis tests: the lower endpoint of a one-sided confidence bound for $\mu$ is the largest value of $t$ for which we would not reject the hypothesis $\mu = t$ at significance level $\alpha$.

For confidence levels above 50%, this lower confidence bound will certainly be between zero and the sample mean. However, for $t=0$, we have a divide-by-zero issue. Hence, for numerical implementation, we search the interval $[\epsilon, \bar{X}]$ for the smallest $t$ for which we can reject the hypothesis $\mu = t$ at significance level $\alpha$. If that smallest $t$ is $\epsilon$, we set the confidence bound to zero.

The following code implements this idea, working with the log of the test statistic to improve numerical stability.

In [1]:
# This is the first cell with code: set up the Python environment
%matplotlib inline
from __future__ import print_function, division
import matplotlib.pyplot as plt
import math
import numpy as np
import scipy as sp
import scipy.stats
from scipy.stats import binom
import scipy.optimize
import pandas as pd
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
from IPython.display import clear_output, display, HTML

In [2]:
def kaplanWaldLowerCI(x, cl = 0.95, gamma = 0.99, xtol=1.e-12, logf=True):
'''
Calculates the Kaplan-Wald lower 1-alpha confidence bound for the mean of a nonnegative random
variable.
'''
alpha = 1.0-cl
if any(x < 0):
raise ValueError('Data x must be nonnegative.')
elif all(x <= 0):
lo = 0.0
else:
if logf:
f = lambda t: (np.max(np.cumsum(np.log(gamma*x/t + 1 - gamma))) + np.log(alpha))
else:
f = lambda t: (np.max(np.cumprod(gamma*x/t + 1 - gamma)) - 1/alpha)
xm = np.mean(x)
if f(xtol)*f(xm) > 0.0:
lo = 0.0
else:
lo = sp.optimize.brentq(f, xtol, np.mean(x), xtol=xtol)
return lo


## How well does it work?¶

Let's test the method on our standard test problems: binomial and a mixture of pointmass and uniform. We will fix $\gamma$; you might experiment using different values.

In [3]:
# Nonstandard mixture: a pointmass at 1 and a uniform[0,1]
ns = np.array([25, 50, 100, 400])  # sample sizes
ps = np.array([0.9, 0.99, 0.999])    # mixture fraction, weight of pointmass
alpha = 0.05  # 1- (confidence level)
reps = int(1.0e4)   # just for demonstration
gamma = 0.99
xtol = 1.e-12

simTable = pd.DataFrame(columns=('mass at 1', 'sample size', 'Student-t cov',\
'Kaplan-Wald cov', 'Student-t low', 'Kaplan-Wald low')
)

for p in ps:
popMean = (1-p)*0.5 + p  # population is a mixture of uniform with mass (1-p) and
# pointmass at 1 with mass p

for n in ns:
tCrit = sp.stats.t.ppf(q=1-alpha, df=n-1)
coverT = 0
coverK = 0
lowT = 0.0
lowK = 0.0

for rep in range(int(reps)):
sam = np.random.uniform(size=n)  # the uniform part of the sample
ptMass = np.random.uniform(size=n)  # for a bunch of p-coin tosses
sam[ptMass < p] = 1.0   # mix in pointmass at 1, with probability p
# t interval
samMean = np.mean(sam)
samSD = np.std(sam, ddof=1)
tLo = samMean - tCrit*samSD  # lower endpoint of the t interval
coverT += ( tLo <= popMean )
lowT += tLo
#  Kaplan-Wald interval
kLo = kaplanWaldLowerCI(sam, cl=1-alpha, gamma=gamma, xtol=xtol) # lower endpoint of the Kaplan-Wald interval
coverK += (kLo <= popMean )
lowK += kLo

simTable.loc[len(simTable)] =  p, n,\
str(100*float(coverT)/float(reps)) + '%',\
str(100*float(coverK)/float(reps)) + '%',\
str(round(lowT/float(reps),4)),\
str(round(lowK/float(reps),4))
#
ansStr =  '<h3>Simulated coverage probability and expected lower endpoint of ' +\
'one-sided Student-t and Kaplan-Wald confidence intervals for ' +\
'mixture of U[0,1] and pointmass at 1 population</h3>' +\
'<strong>Nominal coverage probability ' + str(100*(1-alpha)) +\
'%</strong>. Gamma=' + str(gamma) +\
'.<br /><strong>Estimated from ' + str(int(reps)) +\
' replications.</strong>'

display(HTML(ansStr))
display(simTable)


### Simulated coverage probability and expected lower endpoint of one-sided Student-t and Kaplan-Wald confidence intervals for mixture of U[0,1] and pointmass at 1 population

Nominal coverage probability 95.0%. Gamma=0.99.
Estimated from 10000 replications.
mass at 1 sample size Student-t cov Kaplan-Wald cov Student-t low Kaplan-Wald low
0 0.900 25.0 89.64% 100.0% 0.6836 0.82
1 0.900 50.0 98.63% 100.0% 0.6701 0.8713
2 0.900 100.0 99.96% 97.76% 0.6643 0.8954
3 0.900 400.0 100.0% 96.57% 0.6617 0.8951
4 0.990 25.0 21.83% 100.0% 0.9553 0.8794
5 0.990 50.0 38.78% 100.0% 0.9422 0.934
6 0.990 100.0 61.17% 100.0% 0.9285 0.9628
7 0.990 400.0 97.61% 100.0% 0.9069 0.9847
8 0.999 25.0 2.62% 100.0% 0.995 0.8853
9 0.999 50.0 4.62% 100.0% 0.9939 0.9406
10 0.999 100.0 9.65% 100.0% 0.9913 0.9695
11 0.999 400.0 32.36% 100.0% 0.9847 0.9917

Looks pretty good! Next we will introduce one more method, then we'll compare the various methods we've seen.

## Upper confidence bounds and two-sided bounds¶

If every $X_j$ has the same finite, a priori upper bound $u$, we can transform the problem by defining $\tilde{X_j}= u - X_j$. Then $\tilde{X_j}$ is a nonnegative random variable, and a lower confidence bound on its mean translated to can be subtracted from $u$ to make an upper confidence bound on $\mathbb E X_j$.

If every $X_j$ has the finite, a priori upper and lower bound, we can find two-sided confidence intervals in the analogous way, applying the Kaplan Wald method to the original data to find lower confidence bounds and to the data subtracted from the a priori upper bound to find upper confidence bounds.

## More¶

Harold Kaplan has introduced a variety of other methods based on Markov's inequality applied to optionally stopped Martingales; see Kaplan (1987). In my experience, this method is about as good as the others, but the results are really neat.