How many distinct m by n contingency tables are there that have exactly N total events?
I used a resource (slide 58 of http://www.math.binghamton.edu/dikran/447/combinatorics.pdf) that I shamelessly stole from Ellen's page.
Let $k = m \times n$ be the number of cells in the contingency table. Then the number of distinct contingency tables with $N$ total events is:
$$ \binom{N + k - 1}{k - 1} = \binom{N + k - 1}{N} $$For every distinct 2 by 2 contingency table containing exactly 14 elements, compute its chi-square statistic, and also its Wald statistic. Display your results as a scatter plot of one statistic versus the other.
import scipy.stats
import itertools
import matplotlib.pyplot as plt
import numpy as np
Define functions for the statistics:
def chi_square(counts):
expected = scipy.stats.contingency.expected_freq(counts)
return np.sum((counts - expected)**2 / expected)
def wald_2x2(counts):
m, n = counts[0, 0], counts[0, 1]
M, N = counts.sum(axis=0)
p1_hat = m / float(M)
p2_hat = n / float(N)
p_hat = (m + n) / float(M + N)
numerator = p1_hat - p2_hat
denominator = np.sqrt(p_hat * (1 - p_hat) * ((1.0 / M) + (1.0 / N)))
return numerator / denominator
arr = np.array([[1, 2], [3, 4]])
print 'wald', wald_2x2(arr)
print 'chi-square', chi_square(arr)
wald -0.28171808491 chi-square 0.0793650793651
Define the method to iterate over all possible contingency tables.
I use the intuition from the linked slides (see problem 1) to generate all possible 2x2 contingency tables containing 14 elements for problem 2. To do this, I first create a list of $N + k - 1 = 17$ numbers:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
I then use Python's itertools.combinations
method to get all combinations of $k - 1 = 3$ numbers from this list. Those 3 numbers represent the position of the separators, and the $N$ remaining numbers represent elements in the table. For example, the 3 numbers 2 9 16
partition the list into 4 sets:
0 1 | 3 4 5 6 7 8 | 10 11 12 13 14 15 |
Which give counts of 2, 6, 6, and 0 for each of the 4 cells.
Note that while I only use this for 2x2 tables, it is general enough to work on tables of any size!
def iter_contingency_tables_counts(m, n, N):
num_separators = m * n - 1
num_positions = N + num_separators
for separators in itertools.combinations(xrange(num_positions), num_separators):
counts = []
accumulator = 0
for pos in xrange(num_positions):
if pos in separators:
counts.append(accumulator)
accumulator = 0
else:
accumulator += 1
counts.append(accumulator)
assert len(counts) == m * n
assert sum(counts) == N
yield np.asarray(counts).reshape((m, n))
# Sanity check for a small example: verify there are not duplicate count
# matrices.
counts = [tuple(c.flatten()) for c in iter_contingency_tables_counts(2, 2, 8)]
assert len(set(counts)) == len(counts)
Finally, create a scatter plot of chi-square and Wald statistics.
chi_squares = []
walds = []
for counts in iter_contingency_tables_counts(2, 2, 14):
if np.any(counts.sum(axis=0) == 0) or np.any(counts.sum(axis=1) == 0):
# Skip those with 0 marginals.
continue
chi_squares.append(chi_square(counts))
walds.append(wald_2x2(counts))
plt.scatter(chi_squares, walds)
plt.xlabel('chi-square')
plt.ylabel('wald')
<matplotlib.text.Text at 0xa7e330c>
Alternatively, use the absolute value of the Wald statistic:
plt.scatter(chi_squares, np.abs(walds))
plt.xlabel('chi-square')
plt.ylabel('abs(wald)')
<matplotlib.text.Text at 0xa7e3f2c>