Penny sampling and Conditional Expectation

Penny Sampling

"Penny sampling" is a sampling and estimation approach closely related to DUS, but that allows particularly simple analysis. It was introduced by Edwards et al. (2013).

We shall derive a continuous version of Penny Sampling that turns the problem of making confidence bounds for the mean of a bounded random variable into the problem of making confidence bounds for a Binomial $p$, by introducing auxiliary independent randomization.

Basic Penny Sampling

We shall derive basic Penny Sampling in the context of financial auditing.

Again, we have a population of $N$ items; item $j$ has unknown value $x_j$, but the value is known to be nonnegative and less than $u_j$.

We divide the upper bound on into "pennies." Item $j$ contains $u_j$ pennies, of which $x_j$ are "good" and $(u_j - x_j)$ are "bad."

Let $u \equiv \sum_{j=1}^N u_j$ and $x \equiv \sum_{j=1}^N x_j$.

We want to estimate the population fraction of "good" pennies:

$$\mu \equiv \frac{x}{u}.$$

We then sample "pennies" at random, without replacement, from the entire population.

If we draw a penny associated with item $j$, we inspect that item to determine the value $x_j$.

If the index of the penny within item $j$ is less than or equal to $x_j$, we consider the penny that was drawn to be "good." Otherwise the penny is "bad."

The number of "good" pennies in the sample has a hypergeometric distribution with parameters $u$, $x = u \mu$, and $n$; the expected fraction of "good" pennies in the sample is $\mu = x/u$.

We invert hypergeometric tests to find confidence bounds for $x$, which we can translate into confidence bounds for $\mu$ by dividing by $u$.

Again, we are imagining situations in which the sample is very small compared to the number of items in the population—much less the number of pennies in those items—so we will act as if we are sampling with replacement, which gives a conservative result in any case.

Thus we treat the distribution of the number of "good" pennies in the sample as Binomial with parameters $n$ and $p = \mu = x/u$, rather than hypergeometric.

Relation between penny sampling and PPS/DUS sampling

While the chance of selecting each "penny" is equal, the chance of selecting item $j$ is proportional to $u_j$, just as in dollar-unit sampling. The difference is in how the data are analyzed: using $X_i$, the value of $x_j$ for the item selected on the $i$th draw, or using information about whether just the single "penny" is good or bad.

Continuous Penny Sampling and the Conditional Expectation

The Two-Step

One can think of penny sampling as taking place in two steps: first, select an item $J$ with PPS sampling, then select a penny uniformly at random from that item. If the penny's index is less than $X_J$, the penny is good. In other words, pennies are selected conditionally uniformly, given the item they are selected from, and compared to the true value of the item.

In the limit as pennies shrink in value, this amounts to comparing $X_J$ to a random variable $Z_J$ that is (continuously) conditionally uniformly distributed on $[0, U_J]$, where $U_J$ is the upper bound on the item selected. Let $W$ be the event that $Z_J \le X_J$. Then $$ \begin{align} \mathbb P \{ W = 1 \} & = \mathbb P \{ Z_J \le X_J \} \\ &= \sum_{j=1}^N \mathbb P\{Z_J \le X_J | J = j\} \mathbb P \{J=j\} \\ &= \sum_{j=1}^N u_j/u \mathbb P\{Z_j \le x_j | J = j\} \\ &= \sum_{j=1}^N u_j/u \cdot x_j/u_j \\ &= x/u. \end{align} $$ In $n$ iid draws, the distribution of the sum of the corresponding values of $W$ is Binomial with parameters $n$ and $p=\mu = x/u$.

Alternatively, we can think of the process as creating a new set of $n$ random variables $\{W_i\}$ where $W_i \sim \mbox{Bernoulli}(X_i/U_i)$. It is clear that unconditionally, $\{W_i\}$ are iid Benoulli($x/u$), so $\sum_{i=1}^n W_i \sim \mbox{Binomial}(n, \mu)$, where $\mu \equiv x/u$.

Sequences of iid variables

This kind of "continuous penny sampling" lets us make confidence bounds for iid samples from an arbitrary bounded distribution, such as the mixture of a uniform and a point-mass at 0. The lower and upper bounds on each draw are 0 and 1. For each $X_i$ in the sample, we generate a uniformly distributed $Z_i$, with all $\{X_i\}$ and $\{Z_i\}$ independent. Let $W_i = 1_{Z_i \le X_i}$. Then $\{W_i \}_{i=1}^n$ are iid Bernoulli random variables with $\mathbb P \{W_i = 1\} = \mathbb E X_i = \mu$: $$ \begin{align} \mathbb P \{ W_i = 1 \} & = \mathbb P \{ Z_i \le X_i \} \\ &= \int_0^1 \mathbb P \{ Z_i \le X_i | Z_i = z\} dz \\ &= \int_0^1 \mathbb P \{ X_i > z \} dz \\ &= \mathbb E X_i, \end{align} $$ by the tail-integral formula for expectation.

Implementation

Let's implement continuous penny sampling in Python.

In [1]:
# This is the first cell with code: set up the Python environment
%matplotlib inline
from __future__ import division, print_function
import matplotlib.pyplot as plt
import math
import numpy as np
import scipy as sp
import scipy.stats
from scipy.stats import binom
import pandas as pd
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
from IPython.display import clear_output, display, HTML
In [2]:
def binoLowerCL(n, x, cl = 0.975, inc=0.000001, p = None):
    "Lower confidence level cl confidence interval for Binomial p, for x successes in n trials"
    if p is None:
            p = float(x)/float(n)
    lo = 0.0
    if (x > 0):
            f = lambda q: cl - scipy.stats.binom.cdf(x-1, n, q)
            lo = sp.optimize.brentq(f, 0.0, p, xtol=inc)
    return lo

def binoUpperCL(n, x, cl = 0.975, inc=0.000001, p = None):
    "Upper confidence level cl confidence interval for Binomial p, for x successes in n trials"
    if p is None:
            p = float(x)/float(n)
    hi = 1.0
    if (x < n):
            f = lambda q: scipy.stats.binom.cdf(x, n, q) - (1-cl)
            hi = sp.optimize.brentq(f, p, 1.0, xtol=inc) 
    return hi

def pennySampleReplacement(weights, n):
    '''
       Weighted random sample of size n drawn with replacement.
       Returns indices of the selected items, the "remainder pennies" (the conditionally uniform
       auxiliary randomization within items) and the raw uniform values used to select the sample
    '''
    if any(weights < 0):
        print('negative weight in weightedRandomSample')
        return float('NaN')
    else:
        totWt = np.sum(weights, dtype=float)
        wc = np.cumsum(weights, dtype=float)/totWt  # ensure weights sum to 1
        theSam = np.random.random_sample((n,))
        inx = np.array(wc).searchsorted(theSam)
        penny = [(wc[inx[i]]-theSam[i])*totWt for i in range(n)]
        return inx, penny, theSam

def pennyBinomialLowerBound(x, inx, pennies, cl=0.95):
    '''
       Penny sampling lower (one-sided) 1-alpha confidence bound on the mean, for sampling with replacement.
       x is the vector of observed values
       pennies is the vector of _which_ "penny" in each sampled item is to be adjudicated as "good" or "bad"
       The first x_j pennies in item j are deemed "good," the remaining (u_j - x_j) are "bad."
       Returns the lower bound and the number of "good" pennies in the sample.
    '''
    s = sum([pennies[i] <= x[inx[i]] for i in range(len(pennies))])
    n = len(inx)
    return binoLowerCL(n, s, cl=cl), s

def pennyBinomialBounds(x, inx, pennies, cl=0.95):
    '''
       Penny sampling 2-sided confidence interval for the mean, for sampling with replacement.
       x is the vector of observed values
       pennies is the vector of _which_ "penny" in each sampled item is to be adjudicated as "good" or "bad"
       The first x_j pennies in item j are deemed "good," the remaining (u_j - x_j) are "bad."
       Returns the lower bound, the upper bound and the number of "good" pennies in the sample.
    '''
    s = sum([pennies[i] <= x[inx[i]] for i in range(len(pennies))])
    n = len(inx)
    return binoLowerCL(n, s, cl=1-(1-cl)/2), binoUpperCL(n, s, cl=1-(1-cl)/2), s

Illustration

Let's imagine a population of $N$ items bounded between 0 and 1.

Let's test this:

In [3]:
# Nonstandard mixture: a pointmass at zero and a uniform[0,1]
ns = np.array([25, 50, 100, 400])  # sample sizes
ps = np.array([0.9, 0.99, 0.999])    # mixture fraction, weight of pointmass
alpha = 0.05  # 1- (confidence level)
reps = int(1.0e4) # just for demonstration

simTable = pd.DataFrame(columns=('mass at 0', 'sample size', 'Student-t cov',\
                                 'Penny cov', 'Student-t len', 'Penny len')
                       )

for p in ps:
    popMean = (1-p)*0.5  #  p*0 + (1-p)*.5
    for n in ns:
        tCrit = sp.stats.t.ppf(q=1-alpha/2, df=n-1)
        coverT = 0
        coverP = 0
        totT = 0.0
        totP = 0.0
        for rep in range(int(reps)):
            sam = np.random.uniform(size=n)
            ptMass = np.random.uniform(size=n)
            pennies = np.random.uniform(size=n)  # auxiliary uniform randomization within items
            sam[ptMass < p] = 0.0                # point mass at zero
            samMean = np.mean(sam)
            samSD = np.std(sam, ddof=1)
            coverT += ( math.fabs(samMean-popMean) < tCrit*samSD )
            totT += 2*tCrit*samSD   
            pLo, pHi, s = pennyBinomialBounds(sam, np.r_[0:n], pennies, cl=1-alpha )
            coverP += ( pLo <= popMean ) & (popMean <= pHi)
            totP += pHi - pLo
        simTable.loc[len(simTable)] =  p, n,\
            str(100*float(coverT)/float(reps)) + '%',\
            str(100*float(coverP)/float(reps)) + '%',\
            str(round(totT/float(reps),4)),\
            str(round(totP/float(reps), 4))
#
ansStr =  '<h3>Simulated coverage probability of Student-t and Continuous Penny Sampling confidence intervals for ' +\
          'mixture of U[0,1] and pointmass at 0 population</h3>' +\
          '<strong>Nominal coverage probability ' + str(100*(1-alpha)) +\
          '%</strong>.<br /><strong>Estimated from ' + str(int(reps)) + ' replications.</strong>'

display(HTML(ansStr))
display(simTable)

Simulated coverage probability of Student-t and Continuous Penny Sampling confidence intervals for mixture of U[0,1] and pointmass at 0 population

Nominal coverage probability 95.0%.
Estimated from 10000 replications.
mass at 0 sample size Student-t cov Penny cov Student-t len Penny len
0 0.900 25.0 90.37% 99.35% 0.6419 0.207
1 0.900 50.0 98.89% 98.7% 0.6724 0.1382
2 0.900 100.0 99.99% 98.41% 0.6817 0.0942
3 0.900 400.0 100.0% 95.9% 0.6868 0.0451
4 0.990 25.0 21.96% 99.32% 0.0975 0.1451
5 0.990 50.0 38.61% 99.82% 0.1253 0.0795
6 0.990 100.0 62.16% 98.49% 0.1604 0.0447
7 0.990 400.0 97.98% 98.31% 0.2098 0.0167
8 0.999 25.0 2.57% 98.67% 0.0106 0.1381
9 0.999 50.0 4.44% 97.87% 0.0127 0.0719
10 0.999 100.0 8.94% 99.93% 0.0178 0.0371
11 0.999 400.0 32.85% 98.16% 0.0357 0.0101

As is the case with the other nonparametric methods, the Continuous Penny Sample intervals are in some cases shorter on average than Student-t intervals, but have higher coverage.

In [ ]: