Problem 1¶

How many distinct m by n contingency tables are there that have exactly N total events?

I used a resource (slide 58 of http://www.math.binghamton.edu/dikran/447/combinatorics.pdf) that I shamelessly stole from Ellen's page.

Let $k = m \times n$ be the number of cells in the contingency table. Then the number of distinct contingency tables with $N$ total events is:

$$ \binom{N + k - 1}{k - 1} = \binom{N + k - 1}{N} $$

Problem 2¶

For every distinct 2 by 2 contingency table containing exactly 14 elements, compute its chi-square statistic, and also its Wald statistic. Display your results as a scatter plot of one statistic versus the other.

In [1]:

import scipy.stats
import itertools
import matplotlib.pyplot as plt
import numpy as np

Define functions for the statistics:

In [2]:

def chi_square(counts):
    expected = scipy.stats.contingency.expected_freq(counts)
    return np.sum((counts - expected)**2 / expected)

In [3]:

def wald_2x2(counts):
    m, n = counts[0, 0], counts[0, 1]
    M, N = counts.sum(axis=0)

    p1_hat = m / float(M)
    p2_hat = n / float(N)
    p_hat = (m + n) / float(M + N)

    numerator = p1_hat - p2_hat
    denominator = np.sqrt(p_hat * (1 - p_hat) * ((1.0 / M) + (1.0 / N)))
    return numerator / denominator

arr = np.array([[1, 2], [3, 4]])
print 'wald', wald_2x2(arr)
print 'chi-square', chi_square(arr)

wald -0.28171808491
chi-square 0.0793650793651

Define the method to iterate over all possible contingency tables.

I use the intuition from the linked slides (see problem 1) to generate all possible 2x2 contingency tables containing 14 elements for problem 2. To do this, I first create a list of $N + k - 1 = 17$ numbers:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

I then use Python's itertools.combinations method to get all combinations of $k - 1 = 3$ numbers from this list. Those 3 numbers represent the position of the separators, and the $N$ remaining numbers represent elements in the table. For example, the 3 numbers 2 9 16 partition the list into 4 sets:

0 1 | 3 4 5 6 7 8 | 10 11 12 13 14 15 |

Which give counts of 2, 6, 6, and 0 for each of the 4 cells.

Note that while I only use this for 2x2 tables, it is general enough to work on tables of any size!

In [4]:

def iter_contingency_tables_counts(m, n, N):
    num_separators = m * n - 1
    num_positions = N + num_separators

    for separators in itertools.combinations(xrange(num_positions), num_separators):
        counts = []
        accumulator = 0
        for pos in xrange(num_positions):
            if pos in separators:
                counts.append(accumulator)
                accumulator = 0
            else:
                accumulator += 1
        counts.append(accumulator)

        assert len(counts) == m * n
        assert sum(counts) == N

        yield np.asarray(counts).reshape((m, n))


# Sanity check for a small example: verify there are not duplicate count
# matrices.
counts = [tuple(c.flatten()) for c in iter_contingency_tables_counts(2, 2, 8)]
assert len(set(counts)) == len(counts)

Finally, create a scatter plot of chi-square and Wald statistics.

In [5]:

chi_squares = []
walds = []
for counts in iter_contingency_tables_counts(2, 2, 14):
    if np.any(counts.sum(axis=0) == 0) or np.any(counts.sum(axis=1) == 0):
        # Skip those with 0 marginals.
        continue
    chi_squares.append(chi_square(counts))
    walds.append(wald_2x2(counts))

In [6]:

plt.scatter(chi_squares, walds)
plt.xlabel('chi-square')
plt.ylabel('wald')

Out[6]:

<matplotlib.text.Text at 0xa7e330c>

Alternatively, use the absolute value of the Wald statistic:

In [7]:

plt.scatter(chi_squares, np.abs(walds))
plt.xlabel('chi-square')
plt.ylabel('abs(wald)')

Out[7]:

<matplotlib.text.Text at 0xa7e3f2c>