Algorithms for Pseudo-Random Permutations and Samples


Random Sampling

Random sampling is one of the most fundamental tools in Statistics.

Used for:

  • Surveys and extrapolation
    • Opinion surveys and polls
    • Census (for some purposes) and Current Population Survey
    • Environmental statistics
    • Litigation, including class actions, discrimination, ...
  • Experiments
    • Medicine / clinical trials
    • Agriculture
    • Marketing
    • Product development
  • Quality control and auditing
    • Process control
    • Financial auditing
    • Healthcare auditing
    • Election auditing
  • Sampling and resampling methods
    • Bootstrap
    • Permutation tests
    • MCMC

and on and on.


Simple random sampling

Draw $k \le n$ items from a population of $n$ items, in such a way that each of the $n \choose k$ subsets of size $k$ is equally likely.

Many standard statistical methods assume the sample is drawn in this way, or allocated between treatment and control in this way (e.g., $k$ of $n$ subjects are assigned to treatment, and the remaining $n-k$ to control).


Random permutations

$n!$ possible permutations of $n$ distinct objects.

$n!$ grows quickly with $n$:

$$ n! \sim \sqrt {2\pi n}\left({\frac {n}{e}}\right)^n (\mbox{Stirling's approximation})$$

Pigeon-hole principle

If you put $N>n$ pigeons in $n$ pigeonholes, then at least one pigeonhole must contain more than one pigeon.

Corollary

At most $n$ pigeons can be put in $n$ pigeonholes if at most one pigeon is put in each hole.

Some pigeon coops & flocks

Expression full scientific notation
$2^{32}$ 4,294,967,296 4.29e9
$2^{64}$ 18,446,744,073,709,551,616 1.84e19
$2^{128}$ 3.40e38
$2^{32 \times 624}$ 9.27e6010
$13!$ 6,227,020,800 6.23e9
$21!$ 51,090,942,171,709,440,000 5.11e19
$35!$ 1.03e40
$2084!$ 3.73e6013
${50 \choose 10}$ 10,272,278,170 1.03e10
${100 \choose 10}$ 17,310,309,456,440 1.73e13
${500 \choose 10}$ 2.4581e20
$\frac{2^{32}}{{50 \choose 10}}$ 0.418
$\frac{2^{64}}{{500 \choose 10}}$ 0.075

We will come back to these numbers.


Random sampling

Given a good source of randomness, many ways to draw a siimple random sample.

One basic approach is like shuffling a deck of $n$ cards, then dealing the top $k$: permute the population at random, then take the first $k$ elements of the permutation to be the sample.

There are a number of standard ways to generate a random permutation—i.e., to shuffle the deck.

Algorithm PIKK (permute indices and keep $k$)

For instance, if we had a way to generate independent, identically distributed (iid) $U[0,1]$ random numbers, we could do it as follows:

Algorithm PIKK

  • assign iid $U[0,1]$ numbers to the $n$ elements of the population
  • the sample consists of the $k$ items assigned the smallest random numbers (break ties randomly)

  • amounts to generating a random permutation of the population, then taking first $k$.

  • if the numbers really are iid, every permutation is equally likely, and it follows that the first $k$ are an SRS
  • requires $n$ random numbers (and sorting)
In [2]:
%matplotlib inline
from __future__ import division
import math
import numpy as np
import scipy as sp
from scipy.misc import comb, factorial
from scipy.optimize import brentq
from scipy.stats import chisquare, norm
import scipy.integrate
from random import Random
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
In [3]:
def PIKK(n,k, gen=np.random):
    try: 
        x = gen.random(n)
    except TypeError:
        x = np.array([gen.random() for i in np.arange(n)])
    return set(np.argsort(x)[0:k])

There are more efficient ways to generate a random permutation than assigning a number to each element and sorting, for instance the "Fisher-Yates shuffle" or "Knuth shuffle" (Knuth attributes it to Durstenfeld).

Algorithm Fisher-Yates-Knuth-Durstenfeld shuffle (backwards version)

for i=1, ..., n-1:
    J <- random integer uniformly distributed on {i, ..., n}
    (a[J], a[i]) <- (a[i], a[J])

This requires the ability to generate independent random integers on various ranges. Doesn't require sorting.

There's a version suitable for streaming, i.e., generating a random permutation of a list that has an (initially) unknown number $n$ of elements:

Algorithm Fisher-Yates-Knuth-Durstenfeld shuffle (streaming version)

i <- 0
a = []
while there are records left:
    i <- i+1
    J <- random integer uniformly distributed on {1, ..., i}
    if J < i:
        a[i] <- a[J]
        a[J] <- next record
    else:
        a[i] <- next record

Again, need to be able to generate independent uniformly distributed random integers.

In [8]:
def fykd(a, gen=np.random):
    for i in range(len(a)-2):
        J = gen.randint(i,len(a))
        a[i], a[J] = a[J], a[i]
    return(a)

fykd(np.arange(10))
Out[8]:
array([8, 3, 4, 7, 0, 9, 5, 1, 2, 6])

Proof that the streaming Fisher-Yates-Knuth-Durstenfeld algorithm works

Induction:

For $i=1$, obvious.

At stage $i$, suppose all $i!$ permutations are equally likely. For each such permutation, there are $i+1$ distinct, equally likely permutations at stage $i+1$, resulting from swapping the $i+1$st item with one of the first $i$, or putting it in position $i+1$. These possibilities are mutually exclusive of the permutations attainable from a different permutation of the first $i$ items.

Thus, this enumerates all $(i+1)i! = (i+1)!$ permutations of $(i+1)$ items, and they are equally likely. ■

A good way to get a bad shuffle

Sort using a "random" comparison function, e.g.,

def RandomSort(a,b):
    return (0.5 - np.random.random())

But this fails the basic properties of an ordering, e.g., transitivity, reflexiveness, etc. Doesn't produce random permutation. Output also depends on sorting algorithm.

The right way, the wrong way, and the Microsoft way.

Notoriously used by Microsoft to offer a selection of browsers in the EU. Resulted in IE being 5th of 5 about half the time in IE and about 26% of the time in Chrome, but only 6% of the time in Safari (4th about 40% of the time).

See, e.g., http://www.computerworld.com/article/2520190/web-apps/microsoft-s-eu-ballot-fails-to-randomize-browser-order.html

Cormen et al. (2009) Algorithm Random_Sample

  • recursive algorithm
  • requires only $k$ random numbers (integers)
  • does not require sorting
In [9]:
def Random_Sample(n, k, gen=np.random):  # from Cormen et al.
    if k==0:
        return set()
    else:
        S = Random_Sample(n-1, k-1)
        i = gen.randint(1,n+1) 
        if i in S:
            S = S.union([n])
        else:
            S = S.union([i])
    return S

Random_Sample(100,5)
Out[9]:
{22, 44, 79, 82, 86}

Proof. Recursion.

Reservoir algorithms

The previous algorithms require $n$ to be known. There are reservoir algorithms that do not. Moreover, the algorithms are suitable for streaming (aka online) use: items are examined sequentially and either enter into the reservoir, or, if not, are never revisited.

Algorithm R, Waterman (per Knuth, 1997)

  • Put first $k$ items into the reservoir
  • when item $k+j$ is examined, either skip it (with probability $j/(k+j)$) or swap for a uniformly selected item in the reservoir (with probability $k/(k+j)$)
  • naive version requires generating at most $n-k$ random numbers
  • closely related to FYKD shuffle
In [11]:
def R(n,k):  # Waterman's algorithm R
    S = np.arange(1, k+1)  # fill the reservoir
    for t in range(k+1,n+1):
        i = np.random.randint(1,t+1)
        if i <= k:
            S[i-1] = t
    return set(S)

R(100,5)
Out[11]:
{15, 16, 20, 42, 55}

Like Random-Sample, the proof that algorithm R generates an SRS uses the ability to generate independent random integers, uniformly distributed on $\{1, \ldots, m \}$.

Algorithm Z, Vitter (1985)

Much more efficient than R, using random skips. Works in time essentially linear in $k$.

Note: Vitter proposes using the (inaccurate) $J = \lfloor mU \rfloor$ to generate a random integer between $0$ and $m$ in both algorithm R and algorithm Z. The issue is pervasive!

PRNGs in Common packages

Package/Lang default other SRS algorithm
SAS 9.2 MT 32-bit LCG
SPSS 20.0 32-bit LCG MT1997ar
SPSS ≤ 12.0 32-bit LCG
STATA 13 KISS 32
STATA 14 MT
R MT trunc
python MT mask
MATLAB MT

Key. MT = Mersenne Twister. LCG = linear congruential generator. ANS = "assign a number to each of the $n$ items and sort." The KISS generator combines 4 generators of three types: two multiply-with-carry generators, the 3-shift register SHR3 and the congruential generator CONG.

Prior to April 2011, in Stata 10 exhibited predictable behavior: 95.1% of the $2^{31}$ possible seed values resulted in the first and second draws from rnormal() having the same sign.

Best Practices

  • Use a source of real randomness with a substantial amount of entropy to set the seed, e.g., 20 rolls of 10-sided dice.
  • Record the seed so your analysis is reproducible.
  • Use a PRNG at least as good as the Mersenne Twister, and preferably a cryptographic quality PRNG. Consider the PCG family. Avoid standard linear congruential generators and the Wichmann-Hill generator.
  • Use open-source software, and record the version of the software you use.
  • Use a sampling algorithm that does not "waste randomness." Avoid permuting the entire population.
  • Be aware of discretization issues in the sampling algorithm; many methods assume the PRNG produces $U[0,1]$ or $U[0,1)$ random numbers, rather than (an approximation to) numbers that are uniform on $w$-bit binary integers.
  • Consider the size of the problem: are your PRNG and sampling algorithm adequate?
  • Avoid "tests of representativeness" and procedures that reject some samples. They alter the distribution of the sample.

References

  1. Argyros, G. and A. Kiayias, 2012. PRNG: Pwning Random Number Generators. https://media.blackhat.com/bh-us-12/Briefings/Argyros/BH_US_12_Argyros_PRNG_WP.pdf
  2. Bernstein, D.J., T. Lange, and R. Niederhagen, 2016. Dual EC: A Standardized Back Door, in The New Codebreakers, Essays Dedicated to David Kahn on the Occasion of his 85th Birthday, Ryan, P.Y.A., D. Naccache, and J-J Quisquater, eds., Springer, Berlin.
  3. Cormen, T.H., C.E. Leiserson, R.L. Rivest and C. Stein, 2009. Introduction to Algorithms, 3rd edition, MIT Press.
  4. Fishman, G.S., and L.R. Moore, 1981. In Search of Correlation in Multiplicative Congruential Generators with Modulus 2**31-1, Computer Science and Statistics: Proceedings of the 13 Symposium on the Interface, William F. Eddy, ed., Springer Verlag, New York.
  5. Knuth, D., 1997 The Art of Computer Programming, V.II: Seminumerical methods, 3rd edition, Addison-Wesley, Boston.
  6. L'Ecuyer, P. and R. Simard, 2007. TestU01: A C Library for Empirical Testing of Random Number Generators, ACM Trans. Math. Softw., 33, http://doi.acm.org/10.1145/1268776.1268777
  7. Marsaglia, G., 1968. Random Numbers Fall Mainly in the Planes, PNAS, 61, 25–28.
  8. Marsaglia, G., 2003. Xorshift RNGs. Journal of Statistical Software, 8, 1–6.
  9. Matsumoto, M., and T. Nishimura, 1998. 8). Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator, ACM Transactions on Modeling and Computer Simulation, 8, 3–30. doi:10.1145/272991.272995
  10. McCullough, B.D., 2008. Microsoft's 'Not the Wichmann-Hill' random number generator. Computational Statistics and Data Analysis, 52 (10), 4587–4593. http://dx.doi.org/10.1016/j.csda.2008.03.006
  11. NIST Computer Security Division, Random Number Generation http://csrc.nist.gov/groups/ST/toolkit/rng/
  12. O'Neill, M.E., 2015. PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation, submitted to ACM Transactions on Mathematical Software. http://www.pcg-random.org/pdf/toms-oneill-pcg-family-v1.02.pdf
  13. http://www.pcg-random.org/
  14. Shannon, C.E., 1948. A Mathematical Theory of Communication, Bell System Technical Journal, 27, 379–423, 623–656.
  15. Vitter, J.S., 1985. Random Sampling with a Reservoir, ACM Transactions on Mathematical Software, 11, 37–57.
  16. Wikipedia articles, including https://en.wikipedia.org/wiki/Mersenne_Twister, https://en.wikipedia.org/wiki/Linear_congruential_generator, https://en.wikipedia.org/wiki/Comparison_of_hardware_random_number_generators, https://en.wikipedia.org/wiki/Pseudorandom_number_generator, https://en.wikipedia.org/wiki/List_of_random_number_generators, https://en.wikipedia.org/wiki/Random_number_generator_attack