Random sampling is one of the most fundamental tools in Statistics.

Used for:

- Surveys and extrapolation
- Opinion surveys and polls
- Census (for some purposes) and Current Population Survey
- Environmental statistics
- Litigation, including class actions, discrimination, ...

- Experiments
- Medicine / clinical trials
- Agriculture
- Marketing
- Product development

- Quality control and auditing
- Process control
- Financial auditing
- Healthcare auditing
- Election auditing

- Sampling and resampling methods
- Bootstrap
- Permutation tests
- MCMC

and on and on.

Draw $k \le n$ items from a population of $n$ items, in such a way that each of the $n \choose k$ subsets of size $k$ is equally likely.

Many standard statistical methods assume the sample is drawn in this way, or allocated between treatment and control in this way (e.g., $k$ of $n$ subjects are assigned to treatment, and the remaining $n-k$ to control).

$n!$ possible permutations of $n$ distinct objects.

$n!$ grows quickly with $n$:

$$ n! \sim \sqrt {2\pi n}\left({\frac {n}{e}}\right)^n (\mbox{Stirling's approximation})$$Expression | full | scientific notation |
---|---|---|

$2^{32}$ | 4,294,967,296 | 4.29e9 |

$2^{64}$ | 18,446,744,073,709,551,616 | 1.84e19 |

$2^{128}$ | 3.40e38 | |

$2^{32 \times 624}$ | 9.27e6010 | |

$13!$ | 6,227,020,800 | 6.23e9 |

$21!$ | 51,090,942,171,709,440,000 | 5.11e19 |

$35!$ | 1.03e40 | |

$2084!$ | 3.73e6013 | |

${50 \choose 10}$ | 10,272,278,170 | 1.03e10 |

${100 \choose 10}$ | 17,310,309,456,440 | 1.73e13 |

${500 \choose 10}$ | 2.4581e20 | |

$\frac{2^{32}}{{50 \choose 10}}$ | 0.418 | |

$\frac{2^{64}}{{500 \choose 10}}$ | 0.075 |

We will come back to these numbers.

Given a good source of randomness, many ways to draw a siimple random sample.

One basic approach is like shuffling a deck of $n$ cards, then dealing the top $k$: permute the population at random, then take the first $k$ elements of the permutation to be the sample.

There are a number of standard ways to generate a random permutation—i.e., to shuffle the deck.

`PIKK`

(permute indices and keep $k$)¶For instance, if we had a way to generate independent, identically distributed (iid) $U[0,1]$ random numbers, we could do it as follows:

**Algorithm PIKK**

- assign iid $U[0,1]$ numbers to the $n$ elements of the population
the sample consists of the $k$ items assigned the smallest random numbers (break ties randomly)

amounts to generating a random permutation of the population, then taking first $k$.

- if the numbers really are iid, every permutation is equally likely, and it follows that the first $k$ are an SRS
- requires $n$ random numbers (and sorting)

In [2]:

```
%matplotlib inline
from __future__ import division
import math
import numpy as np
import scipy as sp
from scipy.misc import comb, factorial
from scipy.optimize import brentq
from scipy.stats import chisquare, norm
import scipy.integrate
from random import Random
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
```

In [3]:

```
def PIKK(n,k, gen=np.random):
try:
x = gen.random(n)
except TypeError:
x = np.array([gen.random() for i in np.arange(n)])
return set(np.argsort(x)[0:k])
```

There are more efficient ways to generate a random permutation than assigning a number to each element and sorting, for instance the "Fisher-Yates shuffle" or "Knuth shuffle" (Knuth attributes it to Durstenfeld).

**Algorithm Fisher-Yates-Knuth-Durstenfeld shuffle (backwards version)**

```
for i=1, ..., n-1:
J <- random integer uniformly distributed on {i, ..., n}
(a[J], a[i]) <- (a[i], a[J])
```

This requires the ability to generate independent random integers on various ranges. Doesn't require sorting.

There's a version suitable for streaming, i.e., generating a random permutation of a list that has an (initially) unknown number $n$ of elements:

**Algorithm Fisher-Yates-Knuth-Durstenfeld shuffle (streaming version)**

```
i <- 0
a = []
while there are records left:
i <- i+1
J <- random integer uniformly distributed on {1, ..., i}
if J < i:
a[i] <- a[J]
a[J] <- next record
else:
a[i] <- next record
```

Again, need to be able to generate independent uniformly distributed random integers.

In [8]:

```
def fykd(a, gen=np.random):
for i in range(len(a)-2):
J = gen.randint(i,len(a))
a[i], a[J] = a[J], a[i]
return(a)
fykd(np.arange(10))
```

Out[8]:

Induction:

For $i=1$, obvious.

At stage $i$, suppose all $i!$ permutations are equally likely. For each such permutation, there are $i+1$ distinct, equally likely permutations at stage $i+1$, resulting from swapping the $i+1$st item with one of the first $i$, or putting it in position $i+1$. These possibilities are mutually exclusive of the permutations attainable from a different permutation of the first $i$ items.

Thus, this enumerates all $(i+1)i! = (i+1)!$ permutations of $(i+1)$ items, and they are equally likely. ■

Sort using a "random" comparison function, e.g.,

```
def RandomSort(a,b):
return (0.5 - np.random.random())
```

But this fails the basic properties of an ordering, e.g., transitivity, reflexiveness, etc. Doesn't produce random permutation. Output also depends on sorting algorithm.

Notoriously used by Microsoft to offer a selection of browsers in the EU. Resulted in IE being 5th of 5 about half the time in IE and about 26% of the time in Chrome, but only 6% of the time in Safari (4th about 40% of the time).

`Random_Sample`

¶- recursive algorithm
- requires only $k$ random numbers (integers)
- does not require sorting

In [9]:

```
def Random_Sample(n, k, gen=np.random): # from Cormen et al.
if k==0:
return set()
else:
S = Random_Sample(n-1, k-1)
i = gen.randint(1,n+1)
if i in S:
S = S.union([n])
else:
S = S.union([i])
return S
Random_Sample(100,5)
```

Out[9]:

**Proof.** Recursion.

The previous algorithms require $n$ to be known.
There are *reservoir* algorithms that do not.
Moreover, the algorithms are suitable for streaming (aka *online*) use: items are examined
sequentially and either enter into the reservoir, or, if not, are never revisited.

`R`

, Waterman (per Knuth, 1997)¶- Put first $k$ items into the
*reservoir* - when item $k+j$ is examined, either skip it (with probability $j/(k+j)$) or swap for a uniformly selected item in the reservoir (with probability $k/(k+j)$)
- naive version requires generating at most $n-k$ random numbers
- closely related to FYKD shuffle

In [11]:

```
def R(n,k): # Waterman's algorithm R
S = np.arange(1, k+1) # fill the reservoir
for t in range(k+1,n+1):
i = np.random.randint(1,t+1)
if i <= k:
S[i-1] = t
return set(S)
R(100,5)
```

Out[11]:

Like `Random-Sample`

, the proof that algorithm `R`

generates an SRS uses the ability to generate independent random integers, uniformly distributed on $\{1, \ldots, m \}$.

Much more efficient than `R`

, using random skips. Works in time essentially linear in $k$.

Note: Vitter proposes using the (inaccurate) $J = \lfloor mU \rfloor$ to generate a random integer between $0$ and $m$ in both algorithm `R`

and algorithm `Z`

. The issue is pervasive!

Package/Lang | default | other | SRS algorithm |
---|---|---|---|

SAS 9.2 | MT | 32-bit LCG | |

SPSS 20.0 | 32-bit LCG | MT1997ar | |

SPSS ≤ 12.0 | 32-bit LCG | ||

STATA 13 | KISS 32 | ||

STATA 14 | MT | ||

R | MT | trunc | |

python | MT | mask | |

MATLAB | MT |

**Key.** MT = Mersenne Twister. LCG = linear congruential generator. ANS = "assign a number to each of the $n$ items and sort." The KISS generator combines 4 generators of three types: two multiply-with-carry
generators, the 3-shift register SHR3 and the congruential generator CONG.

Prior to April 2011, in Stata 10 exhibited predictable behavior: 95.1% of the $2^{31}$ possible seed values resulted in the first and second draws from rnormal() having the same sign.

- Use a source of real randomness with a substantial amount of entropy to set the seed, e.g., 20 rolls of 10-sided dice.
- Record the seed so your analysis is reproducible.
- Use a PRNG at least as good as the Mersenne Twister, and preferably a cryptographic quality PRNG. Consider the PCG family. Avoid standard linear congruential generators and the Wichmann-Hill generator.
- Use open-source software, and record the version of the software you use.
- Use a sampling algorithm that does not "waste randomness." Avoid permuting the entire population.
- Be aware of discretization issues in the sampling algorithm; many methods assume the PRNG produces $U[0,1]$ or $U[0,1)$ random numbers, rather than (an approximation to) numbers that are uniform on $w$-bit binary integers.
- Consider the size of the problem: are your PRNG and sampling algorithm adequate?
- Avoid "tests of representativeness" and procedures that reject some samples. They alter the distribution of the sample.

- Argyros, G. and A. Kiayias, 2012. PRNG: Pwning Random Number Generators. https://media.blackhat.com/bh-us-12/Briefings/Argyros/BH_US_12_Argyros_PRNG_WP.pdf
- Bernstein, D.J., T. Lange, and R. Niederhagen, 2016. Dual EC: A Standardized Back Door, in
*The New Codebreakers, Essays Dedicated to David Kahn on the Occasion of his 85th Birthday*, Ryan, P.Y.A., D. Naccache, and J-J Quisquater, eds., Springer, Berlin. - Cormen, T.H., C.E. Leiserson, R.L. Rivest and C. Stein, 2009.
*Introduction to Algorithms, 3rd edition*, MIT Press. - Fishman, G.S., and L.R. Moore, 1981. In Search of Correlation in Multiplicative Congruential Generators with Modulus 2**31-1,
*Computer Science and Statistics: Proceedings of the 13 Symposium on the Interface*, William F. Eddy, ed., Springer Verlag, New York. - Knuth, D., 1997
*The Art of Computer Programming, V.II: Seminumerical methods*, 3rd edition, Addison-Wesley, Boston. - L'Ecuyer, P. and R. Simard, 2007. TestU01: A C Library for Empirical Testing of Random Number Generators,
*ACM Trans. Math. Softw.*,*33*, http://doi.acm.org/10.1145/1268776.1268777 - Marsaglia, G., 1968. Random Numbers Fall Mainly in the Planes,
*PNAS*,*61*, 25–28. - Marsaglia, G., 2003. Xorshift RNGs.
*Journal of Statistical Software*,*8*, 1–6. - Matsumoto, M., and T. Nishimura, 1998. 8). Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator,
*ACM Transactions on Modeling and Computer Simulation*,*8*, 3–30. doi:10.1145/272991.272995 - McCullough, B.D., 2008. Microsoft's 'Not the Wichmann-Hill' random number generator.
*Computational Statistics and Data Analysis*,*52*(10), 4587–4593. http://dx.doi.org/10.1016/j.csda.2008.03.006 - NIST Computer Security Division,
*Random Number Generation*http://csrc.nist.gov/groups/ST/toolkit/rng/ - O'Neill, M.E., 2015. PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation, submitted to
*ACM Transactions on Mathematical Software*. http://www.pcg-random.org/pdf/toms-oneill-pcg-family-v1.02.pdf - http://www.pcg-random.org/
- Shannon, C.E., 1948. A Mathematical Theory of Communication,
*Bell System Technical Journal, 27*, 379–423, 623–656. - Vitter, J.S., 1985. Random Sampling with a Reservoir,
*ACM Transactions on Mathematical Software, 11*, 37–57. - Wikipedia articles, including https://en.wikipedia.org/wiki/Mersenne_Twister, https://en.wikipedia.org/wiki/Linear_congruential_generator, https://en.wikipedia.org/wiki/Comparison_of_hardware_random_number_generators, https://en.wikipedia.org/wiki/Pseudorandom_number_generator, https://en.wikipedia.org/wiki/List_of_random_number_generators, https://en.wikipedia.org/wiki/Random_number_generator_attack