If, under the null hypothesis, the probability distribution of the data is invariant under the action of some group $\mathcal{G}$, then once we observe the actual data, we know other possible data sets that are equally likely.

In particular, every element of the *orbit* of the observed data under $\mathcal{G}$ was just as probable as the
observed data.

Test *conditionally* on the orbit in which the observed data lie.

11 pairs of rats, each pair from the same litter.

Randomly—by coin tosses—put one of each pair into "enriched" environment; other sib gets "normal" environment.

After 65 days, measure cortical mass (mg).

enriched | 689 | 656 | 668 | 660 | 679 | 663 | 664 | 647 | 694 | 633 | 653 |

impoverished | 657 | 623 | 652 | 654 | 658 | 646 | 600 | 640 | 605 | 635 | 642 |

difference | 32 | 33 | 16 | 6 | 21 | 17 | 64 | 7 | 89 | -2 | 11 |

**How should we analyze the data?**

Cartoon of Rosenzweig, M.R., E.L. Bennet, and M.C. Diamond, 1972. Brain changes in response to experience, *Scientific American*, *226*, 22–29 report an experiment in which 11 triples of male rats, each triple from the same litter, were
assigned at random to three different environments, "enriched" (E), standard, and "impoverished."
See also Bennett et al., 1969.

Null hypothesis: treatment has "no effect."

Alternative hypothesis: treatment increases cortical mass.

Suggests 1-sided test for an increase.

- 2-sample Student $t$-test:
$$

$$`\frac{\mbox{mean(treatment) - mean(control)}} {\mbox{pooled estimate of SD of difference of means}}`

1-sample Student $t$-test on the differences: $$

`\frac{\mbox{mean(differences)}}{\mbox{SD(differences)}/\sqrt{11}}`

$$

Better, since littermates are presumably more homogeneous.Permutation test using $t$-statistic of differences: same statistic, different way to calculate $P$-value. Even better?

**Treatment has no effect whatsoever---as if cortical mass were
assigned to each rat before the randomization.**

Then equally likely that the rat with the heavier cortex will be assigned to treatment or to control, independently across littermate pairs.

Gives $2^{11} = 2,048$ equally likely possibilities:

difference | $\pm$32 | $\pm$33 | $\pm$16 | $\pm$6 | $\pm$21 | $\pm$17 | $\pm$64 | $\pm$7 | $\pm$89 | $\pm$2 | $\pm$11 |

For example, just as likely to observe original differences as

difference | -32 | -33 | -16 | -6 | -21 | -17 | -64 | -7 | -89 | -2 | -11 |

On average across pairs, treatment makes no difference.

Individual's response depends only on that individual's assignment

- Special cases: shift, scale, etc.

Interactions/Interference: my response could depend on whether you are assigned to treatment or control.

- 2-sample $t$-test:
- masses are iid sample from normal distribution, same unknown variance, same unknown mean.
- Tests weak null hypothesis (plus normality, independence, non-interference, etc.).

- 1-sample $t$-test on the differences:
- mass differences are iid sample from normal distribution, unknown variance, zero mean.
- Tests weak null hypothesis (plus normality, independence, non-interference, etc.)

- Permutation test:
- Randomization fair, independent across pairs.
- Tests strong null hypothesis.

Assumptions of the permutation test are true by design: That's how treatment was assigned.

The data are elements of $\Re^{11}$. The group operations reflect coordinates about the axes—they flip the signs of elements of the vectors.

In [1]:

```
# boilerplate
%matplotlib inline
from __future__ import division
import math
import numpy as np
import scipy as sp
from scipy import stats # distributions
from scipy import special # special functions
from scipy import random # random variables, distributions, etc.
from scipy.optimize import brentq
from scipy.stats import (binom, hypergeom)
import matplotlib.pyplot as plt
from ipywidgets import widgets
treat = np.array([689, 656, 668, 660, 679, 663, 664, 647, 694, 633, 653])
control = np.array([657, 623, 652, 654, 658, 646, 600, 640, 605, 635, 642])
diffr = treat - control
sp.stats.ttest_1samp(diffr, 0.0) # two-sided one-sample t-test
```

Out[1]:

Mean of differences: 26.73mg

Sample SD of differences: 27.33mg

$t$-statistic: $26.73/(27.33/\sqrt{11}) = 3.244$.

$P$-value for 2-sided $t$-test: 0.0088

- Why do cortical weights have normal distribution?
- Why is variance of the difference between treatment and control the same for different litters?
- Treatment and control are
*dependent*because assigning a rat to treatment excludes it from the control group, and vice versa. - $P$-value depends on assuming differences are iid sample from a normal distribution.
- If we reject the null, is that because there is a treatment effect, or because the other assumptions are wrong?

Could enumerate all $2^{11} = 2,048$ equally likely possibilities. Calculate $t$-statistic for each.

$P$-value is $$ P = \frac{\mbox{number of possibilities with $t \ge 3.244$}}{\mbox{2,048}} $$ (Using the mean as the test statistic instead of $t$-statistic would yield $P = 2/2048 = 0.00098$.)

If there were many more pairs, it would be impractical to enumerate, but we can simulate:

Assign a random sign to each difference. \ Compute $t$-statistic \ Repeat $N$ times $$ P \approx \frac{\mbox{number of simulations with $t \ge 3.244$}}{N} $$ Can invert Binomial tests to make an exact confidence interval for $P$ from the simulation results.

In [2]:

```
np.random.RandomState(seed=20151013) # set seed for reproducibility
iter = 100000 # N
def simPermuTP(z, iter):
# P.B. Stark, statistics.berkeley.edu/~stark
# simulated P-value for 2-sided 1-sample t-test under the
# randomization model.
tsf = lambda x: np.abs(np.mean(x)/(np.std(x, ddof=1)/math.sqrt(len(x))))
ts = tsf(z)
return np.mean([ (tsf(z*(2*np.random.random_integers(0, 1, size=len(z))-1)) >= ts) for i in range(iter)])
estP = simPermuTP(diffr, iter)
print(estP, estP*iter)
```

In [3]:

```
# confidence interval for the p-value
# This function is in the _permute_ library; see http://statlab.github.io/permute/
def binom_conf_interval(n, x, cl=0.975, alternative="two-sided", p=None,
**kwargs):
"""
Compute a confidence interval for a binomial p, the probability of success in each trial.
Parameters
----------
n : int
The number of Bernoulli trials.
x : int
The number of successes.
cl : float in (0, 1)
The desired confidence level.
alternative : {"two-sided", "lower", "upper"}
Indicates the alternative hypothesis.
p : float in (0, 1)
Starting point in search for confidence bounds for probability of success in each trial.
kwargs : dict
Key word arguments
Returns
-------
tuple
lower and upper confidence level with coverage (approximately)
1-alpha.
Notes
-----
xtol : float
Tolerance
rtol : float
Tolerance
maxiter : int
Maximum number of iterations.
"""
assert alternative in ("two-sided", "lower", "upper")
if p is None:
p = x / n
ci_low = 0.0
ci_upp = 1.0
if alternative == 'two-sided':
cl = 1 - (1-cl)/2
if alternative != "upper" and x > 0:
f = lambda q: cl - binom.cdf(x - 1, n, q)
ci_low = brentq(f, 0.0, p, *kwargs)
if alternative != "lower" and x < n:
f = lambda q: binom.cdf(x, n, q) - (1 - cl)
ci_upp = brentq(f, 1.0, p, *kwargs)
return ci_low, ci_upp
binom_conf_interval(iter, math.ceil(estP*iter), cl=0.95, alternative="two-sided", p=estP)
```

Out[3]:

In [4]:

```
binom_conf_interval(10**5, 907, cl=0.99, alternative="upper")
```

Out[4]:

Count pairs where treated rat has heavier cortex, i.e., where difference is positive.

Under strong null, distribution of the number of positive differences is Binomial(11, 1/2). Like number of heads in 11 independent tosses of a fair coin. (Assumes no ties w/i pairs.)

$P$-value is chance of 10 or more heads in 11 tosses of a fair coin: 0.0059.

Only uses signs of differences, not information that only the smallest absolute difference was negative.

uses information about the ordering of the differences: rank the absolute values of the differences, then give them the observed signs and sum them. Null distribution: assign signs at random and sum.

All the tests we've seen here are sensitive to *shifts*--the alternative
hypothesis is that treatment increases response (cortical mass).

There are also nonparametric tests that are sensitive to other treatment effects, e.g., treatment increases the variability of the response.

And there are tests for whether treatment has any effect at all on the distribution of the responses.

You can design a test statistic to be sensitive to any change that interests you, then use the permutation distribution to get a $P$-value (and simulation to approximate that $P$-value).

Treat ordinal data (e.g., Likert scale) as if measured on a linear scale; use Student $t$-test.

Maybe not so silly for large samples$\ldots$

$t$-test asymptotically distribution-free.

How big is big?

Actually had 3 treatments: enriched, standard, deprived.

Randomized 3 rats per litter into the 3 treatments, independently across $n$ litters.

**How should we analyze these data?**

$n$ litters, $s$ treatments (sibs per litter).

- ANOVA--the $F$-test: $$ F = \frac{\mbox{BSS}/(s-1)}{\mbox{WSS}/(n-s)} $$
- Permutation $F$-test: use permutation distribution instead of $F$ distribution to get $P$-value.
- Friedman test: Rank within litters. Mean rank for treatment $i$ is $\bar{R}_i$. $$ Q = \frac{12n}{s(s+1)} \sum_{i=1}^s \left ( \bar{R}_i - \frac{s+1}{2} \right )^2. $$ $P$-value from permutation distribution.

**Treatment has no effect whatsoever—as if cortical mass were
assigned to each rat before the randomization.**

Then it's equally likely that each littermate is assigned to each treatment, independently across litters.

There are $3! = 6$ assignments of each triple to treatments. Thus, there $6^n$ equally likely assignments across all litters.

For 11 litters, that's 362,797,056 possibilities.

(The group under which the distribution of the data is invariant is a subgroup of the permutation group: if you think of the data as an element of $\Re^{33}$ (33 rats), the group consists of all permutations of the 33 elements that shuffle only adjacent sets of 3 components: components 1–3, components 4–6, … components 31–33.)

**The average cortical weight for all three treatment groups are equal.
On average across triples, treatment makes no difference.**

- $F$-test: masses are iid sample from normal distribution, same unknown variance, same unknown mean for all litters and treatments.

Tests weak null hypothesis. - Permutation $F$-test: Randomization was as advertised: fair, independent across triples.

Tests strong null hypothesis. - Friedman test: Ditto.

Assumptions of the permutation test and Friedman test are true by design: that's how treatment was assigned.

Friedman test statistic has $\chi^2$ distribution asymptotically. Ties are a complication.

- Why do cortical weights have normal distribution for each litter and for each treatment?
- Why is the variance of cortical weights the same for different litters?
- Why is the variance of cortical weights the same for different treatments?

Is $F$ a good statistic for this alternative?

$F$-test (and Friedman statistic) sensitive to differences among the mean responses for each treatment group, no matter what pattern the differences have.

But the treatments and the responses can be ordered: we hypothesize that more stimulation produces greater cortical mass.

deprived $\Longrightarrow$ normal $\Longrightarrow$ enriched

low mass $\Longrightarrow$ medium mass $\Longrightarrow$ high mass

*Can we use that to make a more sensitive test?*

Within each litter triple, count pairs of responses that are "in order." Sum across litters.

E.g., if one triple had cortical masses

deprived | 640 |

normal | 660 |

enriched | 650 |

that would contribute 2 to the sum: $660 \ge 640$, $650 \ge 640$, but $640 < 650$.

Each litter triple contributes between 0 and 3 to the overall sum.

Null distribution for the test based on the permutation distribution: 6 equally likely assignments per litter, independent assignments across litters.

Within each litter triple, add differences that are "in order." Sum across litters.

E.g., if one triple had cortical masses

deprived | 640 |

normal | 660 |

enriched | 650 |

that would contribute 30 to the sum: $660 - 640 = 20$ and $650 - 640 = 10$, but $640 < 650$, so that pair contributes zero.

Each litter triple contributes between 0 and $2\times{\mbox{ range }}$ to the sum.

Null distribution for the test based on the permutation distribution: 6 equally likely assignments per litter, independent across litters.

Suppose we have a group of $N$ individuals who are randomized into two groups, a *treatment* group of size $N_t$ and a *control* group of size $N_c = N - N_t$.
Label the individuals from $1$ to $N$.
Let ${\mathcal T}$ denote the labels of individuals assigned to treatment and ${\mathcal C}$ denote
the labels of those assigned to control.

For each of the $N$ individuals, we measure a quantitative (real-valued) response.
Each individual $i$ has two *potential responses*: the response $c_i $individual would have if assigned to
the control group, and the response $t_i$ the individual would have if assigned to the treatment group.

We assume that individual $i$'s response depends *only* on that individual's assigment, and not on anyone else's assignment.
This is the assumption of *non-interference*.
In some cases, this assumption is reasonable; in others, it is not.

For instance, imagine testing a vaccine for a communicable disease.
If you and I have contact, whether you get the disease might depend on whether I am vaccinated—and *vice versa*—since if the vaccine protects me from illness, I won't infect you.
Similarly, suppose we are testing the effectiveness of an advertisement for a product.
If you and I are connected and you buy the product, I might be more likely to buy it, even if I don't
see the advertisement.

Conversely, suppose that "treatment" is exposure to a carcinogen, and the response whether the subject contracts cancer. On the assumption that cancer is not communicable, my exposure and your disease status have no connection.

The *strong null hypothesis* is that individual by individual, treatment makes no difference whatsoever: $c_i = t_i$ for all $i$.

If so, any differences between statistics computed for the treatment and control groups are entirely due to the luck of the draw: which individuals happened to be assigned to treatment and which to control.

We can find the *null distribution* of any statistic computed from the responses of the two groups: if the strong null hypothesis is true, we know what individual $i$'s response would have been whether assigned to treatment or to control—namely, the same.

For instance, suppose we suspect that treatment tends to increase response: in general, $t_i \ge c_i$. Then we might expect $\bar{c} = \frac{1}{N_c} \sum_{i \in {\mathcal C}} c_i$ to be less than $\bar{t} = \frac{1}{N_t} \sum_{i \in {\mathcal T}} t_i$. How large a difference between $\bar{c}$ and $\bar{t}$ would be evidence that treatment increases the response, beyond what might happen by chance through the luck of the draw?

This amounts to asking whether the observed difference in means between the two groups is a high percentile of the distribution of that difference in means, calculated on the assumption that the null hypothesis is true.

Because of how subjects are assigned to treatment or to control, all allocations of $N_t$ subjects to treatment are equally likely.

One way to partition the $N$ subjects randomly into a group of size $N_c$ and a group of size $N_t$ is to permute the $N$ subjects at random, then take the first $N_c$ in the permuted list to be the control group, and the remaining $N_t$ to be the treatment group.

MacNell, Driscoll, and Hunt (2014. What's in a Name: Exposing Gender Bias in Student Ratings of Teaching, *Innovative Higher Education*) conducted a controlled, randomized experiment on
the effect of students' perception of instructors' gender on teaching evaluations
in an online course.
Students in the class did not know the instructors' true genders.

MacNell et al. randomized 47 students in an online course into four groups: two taught by a female instructor and two by a male instructor. One of the groups taught by each instructor was led to believe the instructor was male; the other was led to believe the instructor was female. Comparable instructor biographies were given to all students. Instructors treated the groups identically, including returning assignments at the same time.

When students thought the instructor was female, the responding students rated the instructor lower, on average, in every regard.

Characteristic | F - M |
---|---|

Caring | -0.52 |

Consistent | -0.47 |

Enthusiastic | -0.57 |

Fair | -0.76 |

Feedback | -0.47 |

Helpful | -0.46 |

Knowledgeable | -0.35 |

Praise | -0.67 |

Professional | -0.61 |

Prompt | -0.80 |

Respectful | -0.61 |

Responsive | -0.22 |

Those results are for a 5-point scale, so a difference of 0.8 (Promptness) is 16% of the entire scale.

MacNell et al. graciously shared their data. The evaluation data are coded as follows:

```
Group
3 (12 students) - TA identified as male, true TA gender female
4 (11 students) - TA identified as male, true TA gender male
5 (11 students, 8 responders) - TA identified as female, true TA gender female
6 (13 students, 12 responders) - TA identified as female, true TA gender male
tagender - 1 if TA is actually male, 0 if actually female
taidgender - 1 if TA is identified as male, 0 if identified as female
gender - 1 if student is male, 0 if student is female
```

There are grades for 47 students but evaluations for only 43 (4 did not respond: three in group 5 and one in group 6). The grades are not linked to the evaluations, per the IRB protocol.

Our null hypothesis will include the assumption that whether a student responds does not depend on the identified gender of the TA: the 4 nonresponders would have remained nonresponders had they been in that instructor's other section, and no additional students would become nonresponders if they had been assigned to the same instructor's other section.

Recall that there are $n \choose k$ ways of picking a subset of size $k$ from $n$ items; hence there are $n \choose k$ ways of dividing a set into a subset of size $k$ and one of size $n-k$ (once you select those that belong to the subset of size $k$, the rest must be in the complementary subset of size $n-k$.

In this problem, we are partitioning 47 things into 4 subsets, two of size 11, one of size 12, and one of size 13. How many ways are there of doing that?

Recall the Fundamental Rule of Counting: If a set of choices, $T_1, T_2, \ldots, T_k$, could result, respectively, in $n_1, n_2, \ldots, n_k$ possible outcomes, the entire set of $k$ choices has $\prod_{i=1}^k n_k$ possible outcomes.

We can think of the allocation of students to the four groups as choosing 11 of the 47 students for the first group, then 11 of the remaining 36 for the second, then 12 of the remaining 25 for the third. The fourth group must contain the remaining 13.

The number of ways of doing that is

$$ {47 \choose 11}{36 \choose 11}{25 \choose 12} = \frac{47!}{11! 36!} \frac{36!}{11! 25!} \frac{25!}{12! 13!} = \frac{47!}{11! 11! 12! 13!}. $$Does the number depend on the order in which we made the choices? Suppose we made the choices in a different order: first 12 students for one group, then 11 for the second, then 13 for the third (the fourth gets the remaining 11 students). The number of ways of doing that is $$ {47 \choose 12}{35 \choose 11}{24 \choose 13} = \frac{47!}{12! 35!} \frac{35!}{11! 24!} \frac{24!}{13! 11!} = \frac{47!}{12! 11! 13! 11!} = . $$ The number does not depend on the order in which we make the choices.

By the same reasoning, the number of ways of dividing a set of $n$ objects into
$m$ subsets of sizes $k_1, \ldots k_m$ is given by the *multinomial coefficient*
$$
{n \choose k_1, k_2, \ldots, k_m} =
{n \choose k_1}{n-k_1 \choose k_2} {n-k_1-k_2 \choose k_3} \cdots {n - \sum_{i=1}^{m-1} k_i \choose k_{m-1}}
$$

We will check how surprising it would be for the means to be so much lower when the TA is identified as female, if in fact there is "no real difference" in how they were rated, and the apparent difference is just due to the luck of the draw: which students happened to end up in which section.

In the actual randomization, all $47 \choose 11, 11, 12, 13$ allocations were equally likely. But there might be real differences between the two instructors. Hence, we'd like to use each of them as his or her own "control."

Each student's potential responses are represented by a ticket with 4 numbers:

- the rating that the student would assign to instructor A if instructor A is identified as male
- the rating that the student would assign to instructor A if instructor A is identified as female
- the rating that the student would assign to instructor B if instructor B is identified as male
- the rating that the student would assign to instructor B if instructor B is identified as female

The null hypothesis is that the first two numbers are equal and the second two numbers are equal,
but the first two numbers might be different from the second two numbers.
This corresponds to the hypohtesis that

students assigned to a given TA would rate him or her the same, whether that TA seemed to be male or female.
For all students assigned instructor A, we know both of the first two numbers if the hull hypothesis
is true; but we know neither of the second two numbers.
Similarly, if the null hypothesis is true, we know both of the second two numbers for all students
assigned to instructor B, but we know neither of the first two numbers for those students.

Because of how the randomization was performed, all allocations of students to sections that keep the number of students in each section the same are equally likely, so in particular all allocations that keep the same students assigned to each actual instructor the same are equally likely. This is a subgroup of the full permutation group.

Hence, all ${23 \choose 12}$ ways of splitting the 23 students assigned to the female instructor into two groups, one with 11 students and one with 12, are equally likely. Similarly, all ${24 \choose 11}$ ways of splitting the 24 students assigned to the male instructor into two groups, one with 13 students and one with 11, are equally likely. We can thus imagine shuffling the female TA's students between her sections, and the male TA's students between his sections, and examine the distribution of the difference between the mean score for the sections where the TA was identified as male is larger than the mean score for the sections where the TA was identified as female. (The 4 nonresponders are also allocated at random between sections of the same instructor; they are not included in the averages.)

If the difference is rarely as large as the observed mean difference, the observed mean difference gives evidence that being identified as female really does lower the scores.

In [5]:

```
# Number of allocations to 11, 11, 12, 13
print(sp.special.binom(47,11)*sp.special.binom(36,11)*sp.special.binom(25,12)) # big number!
```

In [6]:

```
# the data are in a .csv file called "Macnell-RatingsData.csv" in the directory Data
import pandas as pd
ratings = pd.read_csv("./Data/Macnell-RatingsData.csv")
categories = ratings.columns.values.tolist()[1:15]
print(ratings.head())
print(ratings.describe())
```

In [7]:

```
from permute.core import corr, two_sample
from permute.utils import get_prng, permute_within_groups
rs = np.random.RandomState(seed=123456789) # set the seed of the PRNG, for reproducibility
```

In [8]:

```
def stratified_two_sample(x, y, group_x, group_y, stat='mean', alternative="greater", reps=10**4,
keep_dist=False, seed=None):
"""
One-sided or two-sided, two-sample permutation test for equality of
two means, with p-value estimated by simulated random sampling with
reps replications.
Tests the hypothesis that x and y are a random partition of x,y
against the alternative that x comes from a population with mean
(a) greater than that of the population from which y comes,
if side = 'greater'
(b) less than that of the population from which y comes,
if side = 'less'
(c) different from that of the population from which y comes,
if side = 'two-sided'
Permutations are carried out within the given groups. Under the null hypothesis,
observations within each group are exchangeable.
If ``keep_dist``, return the distribution of values of the test statistic;
otherwise, return only the number of permutations for which the value of
the test statistic and p-value.
Parameters
----------
x : array-like
Sample 1
y : array-like
Sample 2
group_x : array-like
Group assignments for sample 1
group_y : array-like
Group assignments for sample 2
stat : {'mean', 't'}
The test statistic.
(a) If stat == 'mean', the test statistic is (mean(x) - mean(y))
(equivalently, sum(x), since those are monotonically related), omitting
NaNs, which therefore can be used to code non-responders
(b) If stat == 't', the test statistic is the two-sample t-statistic--
but the p-value is still estimated by the randomization,
approximating the permutation distribution.
The t-statistic is computed using scipy.stats.ttest_ind
(c) If stat is a function (a callable object), the test statistic is
that function. The function should take a permutation of the pooled
data and compute the test function from it. For instance, if the
test statistic is the Kolmogorov-Smirnov distance between the
empirical distributions of the two samples, max_t |F_x(t) - F_y(t)|,
the test statistic could be written:
f = lambda u: np.max( \
[abs(sum(u[:len(x)]<=v)/len(x)-sum(u[len(x):]<=v)/len(y)) for v in u]\
)
alternative : {'greater', 'less', 'two-sided'}
The alternative hypothesis to test
reps : int
Number of permutations
keep_dist : bool
flag for whether to store and return the array of values
of the irr test statistic
seed : RandomState instance or {None, int, RandomState instance}
If None, the pseudorandom number generator is the RandomState
instance used by `np.random`;
If int, seed is the seed used by the random number generator;
If RandomState instance, seed is the pseudorandom number generator.
Returns
-------
float
the estimated p-value
float
the test statistic
list
The distribution of test statistics.
These values are only returned if `keep_dist` == True
"""
prng = get_prng(seed)
z = np.concatenate([x, y]) # pooled responses
groups = np.concatenate([group_x, group_y]) # pooled group assignments
# If stat is callable, use it as the test function. Otherwise, look in the dictionary
stats = {
'mean': lambda u: np.nanmean(u[:len(x)]) - np.nanmean(u[len(x):]),
't': lambda u: ttest_ind(
u[:len(y)][~np.isnan(u[:len(y)])],
u[len(y):][~np.isnan(u[len(y):])],
equal_var=True)[0]
}
if callable(stat):
tst_fun = stat
else:
tst_fun = stats[stat]
theStat = {
'greater': tst_fun,
'less': lambda u: -tst_fun(u),
'two-sided': lambda u: math.fabs(tst_fun(u))
}
tst = theStat[alternative](z)
observed_tst = tst_fun(z)
if keep_dist:
dist = np.empty(reps)
for i in range(reps):
dist[i] = theStat[alternative](permute_within_groups(z, groups, seed=prng))
hits = np.sum(dist >= tst)
return hits/reps, tst, dist
else:
hits = np.sum([(theStat[alternative](permute_within_groups(z, groups, seed=prng)) >= tst)
for i in range(reps)])
return hits/reps, tst
```

In [10]:

```
(p, t) = stratified_two_sample(ratings['overall'][ratings.taidgender==1], ratings['overall'][ratings.taidgender==0],
ratings['tagender'][ratings.taidgender == 1],
ratings['tagender'][ratings.taidgender == 0],
alternative = "two-sided", seed = rs)
print('Overall rating:')
print('Difference in means:', t)
print('P-value (two-sided):', np.round(p, 5), "\n")
print ('\n\n{0:24} {1:8} {2:8}'.format('Category', 'Diff means', 'P-value'))
for col in categories:
(p, t) = stratified_two_sample(ratings[col][ratings.taidgender==1], ratings[col][ratings.taidgender==0],
ratings['tagender'][ratings.taidgender == 1],
ratings['tagender'][ratings.taidgender == 0],
alternative = "two-sided", seed = rs)
print ('{0:20} {1:8.2f} {2:8.2f}'.format(col, t, p))
```

In [ ]:

```
```