Randomization tests in Julia¶

Purpose¶

In this notebook I am contrasting the styles of programming in R with those in Julia. The results from the Julia code below can be contrasted with the R code in this Rpubs document.

The main point I would make, for those who are new to Julia, is that you can program in Julia in the "vectorized" style that is well-suited to R. However, you can also program in a "devectorized" style, which can seem heretical to those coming from R.

Randomization test for independent samples¶

In the R document I explained the paired test first. It happens that I wrote the section for the unpaired test first here.

One of the easiest types of statistical hypothesis test to explain is a randomization test. If we have collected responses under two different conditions and the allocation of conditions has been randomized then a test of, say, $H_0:\mu_1=\mu_2$ versus $H_a:\mu_1<\mu_2$ can be based upon the sums of all possible subsets of the same size from the set of response values. Under the null hypothesis, the sample mean of the particular subset of size $k$ from the responses that we saw for the first condition should be similar to the population of sample means of subsets of size $k$ of the set of responses. Under the alternative hypothesis the sample mean from the combination of responses we saw should be on the low end of the populations of sample means from all possible subsets of size $k$ .

One way to check this is simply to enumerate all possible subsets of size $k$ and evaluate their sums. (Because the subset size, $k$ , is constant we work with the sums rather than the means, which are the sums divided by $k$ .)

In the first chapter of Bob Wardrop's course notes for Statistics 371 he describes a student's experiment to determine if her cat prefers tuna-flavored treats or chicken-flavored treats.

The experiment lasted for 20 days. The assignment of chicken or tuna flavro was randomized, subject to the constraint that each flavor is provided exactly 10 times. According to the randomized selection, the tuna flavor would be given on

In [ ]:

tuna = [2,3,4,6,10,12,14,17,19,20];

Each day she provided 10 treats of the assigned flavor and recorded the number consumed.

In [ ]:

treats = [4,3,5,0,5,4,5,6,1,7,6,3,7,1,3,6,3,5,1,2];

Thus the number of tuna treats consumed (out of a possible 100) was

In [1]:

sum(treats[tuna])

Out[1]:

The number of chicken treats consumed, again out of a possible 100, was

In [2]:

sum(treats) - sum(treats[tuna])

Out[2]:

Formal statement of the test¶

The test of "no preference" versus "the cat prefers chicken treats to tuna treats" can be phrased in terms of the means, $H_0:\mu_t=\mu_c$ versus $H_0:\mu_t<\mu_c$ , assuming that the mean number of treats consumed is a reasonable measure of the cat's preference.

Under the null hypothesis (i.e. "no preference") the observed sum of tuna treats consumed, 29, should not be "unusual" relative to the sum of all other possible assignments of 10 out of the 20 days.

Sampling from the possible sums¶

I explained that it is difficult to enumerate the

In [3]:

binomial(20,10)

Out[3]:

possible selection of 10 days from the total of 20 in R. It is easy to do this in Julia using the combinations iterator described below. For the time being lets consider the generation of a sample from the reference distribution as done in R. There is a sample function in the StatsBase package for Julia, with essentially the same syntax as in R.

In [4]:

using StatsBase
sum(sample(treats,10))

Out[4]:

In R we used the replicate function to generate a vector of such sums. The idiom in Julia is to use a "comprehension" to generate an array.

In [4]:

@time tunasamp = [sum(sample(treats,10)) for i in 1:100000];

elapsed time: 0.122722029 seconds (23191888 bytes allocated)

A histogram of this sample is

In [5]:

using Gadfly, DataFrames
set_default_plot_size(15cm,9cm)
plot(DataFrame(samp=tunasamp), x=:samp, Geom.histogram, Guide.XLabel("Samples from reference distribution"))

Out[5]:

Enumerating the possible sums¶

The combinations function in Julia produces an iterator that enumerates all the possible combinations. In general for loops can be over a range or the elements of a vector or an iterator. For example, we can write the subsets of size 2 from 1:4 as

In [6]:

for c in combinations(1:4,2)
    show(c)
    println()
end

Warning: ignoring conflicting import of StatsBase.range into DataFrames

[1,2]
[1,3]
[1,4]
[2,3]
[2,4]
[3,4]

A "one-liner" function to go through the possible sets of 10 observations from the 20 and compute their sums is

In [7]:

combsums(v::Vector, sz) = [sum(c) for c in combinations(v,sz)]
ss = combsums(treats,10);
length(ss)             # should be 184756

Out[7]:

In [8]:

countnz(ss .<= 29)

Out[8]:

The last expression counts the non-zeros (i.e. trues) in the expression comparing each possible sum to the observed sum of 29. We see that only 5027 out of a possible 184756 gave a sum as small as 29 or smaller. This is less than 3% of the possible sums.

In [9]:

countnz(ss .<= 29)/length(ss)

Out[9]:

0.027208859252202906

If you are wondering why the comparison is written ss .<= 29 instead of simply ss <= 29, the .<= operator is an explicit request to the use "recycling rule" in the comparison. That is, every element of ss is compared to the scalar value 29. In R the recycling rule is always used. The designers of the Julia language felt that it was less error-prone to require the user to explicitly state that it should be used.

It may seem that the combinations iterator would be some deep magic but, in fact, it is itself written in Julia. Indeed, most of the functions provided by the standard Julia distribution are written in Julia.

Creating a frequency table of the possible sums¶

Because there are many repetitions in these 184,756 sums we can save on the size of objects by storing the frequency table. The countmap function in the StatsBase package creates a Dict of the sums and their counts.

In [10]:

countmap(combsums(treats,10))

Out[10]:

[56=>2,35=>11858,55=>8,52=>207,30=>3254,28=>1340,24=>85,37=>14551,23=>36,53=>85,32=>6306,47=>3254,41=>13407,43=>10040,45=>6306,36=>13407,44=>8140,31=>4650,49=>1340,22=>8,39=>15145,51=>416,29=>2154,33=>8140,25=>207,46=>4650,40=>14551,48=>2154,34=>10040,50=>779,54=>36,21=>2,27=>779,38=>15145,26=>416,42=>11858]

It would be more informative to return a DataFrame with the observed sums in increasing order and their counts

In [11]:

function dfsums(v::Vector,sz)
    mm = countmap([sum(c) for c in combinations(v,sz)])
    sums = sort([k for k in keys(mm)])
    DataFrame(sum=sums, count=[mm[s] for s in sums])
end

Out[11]:

dfsums (generic function with 1 method)

In [12]:

dd = dfsums(treats,10)

Out[12]:

	sum	count
1	21	2
2	22	8
3	23	36
4	24	85
5	25	207
6	26	416
7	27	779
8	28	1340
9	29	2154
10	30	3254
11	31	4650
12	32	6306
13	33	8140
14	34	10040
15	35	11858
16	36	13407
17	37	14551
18	38	15145
19	39	15145
20	40	14551
21	41	13407
22	42	11858
23	43	10040
24	44	8140
25	45	6306
26	46	4650
27	47	3254
28	48	2154
29	49	1340
30	50	779
⋮	⋮	⋮

In [13]:

dd[:le29] = [s .<= 29 ? "≤ 29" : "> 29" for s in dd[:sum]]
plot(dd, x=:sum, y=:count, Geom.bar, color=:le29)

no method zero(Type{Any})
 in render at /home/bates/.julia/Gadfly/src/geom/bar.jl:220
 in render_prepared at /home/bates/.julia/Gadfly/src/Gadfly.jl:698
 in render at /home/bates/.julia/Gadfly/src/Gadfly.jl:654
 in writemime at /home/bates/.julia/Gadfly/src/Gadfly.jl:751
 in sprint at io.jl:460
 in display_dict at /home/bates/.julia/IJulia/src/execute_request.jl:35

One of the remarkable aspects of Julia is how fast this can be

In [13]:

@time dfsums(treats,10);

elapsed time: 0.193060026 seconds (54694096 bytes allocated)

If we want to make it a bit faster we can create the count map directly without saving the original sums. It helps to take a look at the code for countmap to understand the details. The Dict object, mm, is initialized to an empty dictionary with keys of the element type of v and integer values. When a sum is computed the value for that key is incremented. The get(mm, key, 0) call returns the current count, if key has already been seen, and 0 otherwise.

In [14]:

function dfsums1(v::Vector, sz)
    mm = (eltype(v)=>Int)[]
    for c in combinations(v,sz)
        key = sum(c)
        mm[key] = get(mm,key,0) + 1
    end
    sums = sort([k for k in keys(mm)])
    DataFrame(sums=sums, counts=[mm[k] for k in sums])
end

Out[14]:

dfsums1 (generic function with 1 method)

The point here is that you do not need to fear the loop. You can write in a vectorized style or, if it seems more natural, you can write loops. In fact, some of the time there is an advantage in writing loops for complex operations, which is why one of the first packages that Dahua Lin created for Julia was a Devectorize package to undo vectorization!

It turns out that the execution time is about the same as before.

In [14]:

dfsums1(treats,10);   # one call to force compilation
@time dfsums1(treats,10);

elapsed time: 0.193720122 seconds (53214568 bytes allocated)

Considering the distribution of the sums gives us a slightly fairer calculation for the p-value of the test of $H_0:\mu_t=\mu_c$ versus $H_a:\mu_t<\mu_c$ . The p-value is the probability of seeing the result that we did, or something even more unusual, if $H_0$ were true. In the previous calculation we counted all the sums that were less than or equal to 29. To be fairer we should count all the sums that were less than 29 plus half of the sums that were equal to 29.

In [15]:

countnz(ss .< 29) + div(countnz(ss .== 29), 2)

Out[15]:

In [16]:

3950/184756

Out[16]:

0.021379549243326332

Randomization tests for paired samples¶

If the observations in the sample are paired then the statistic of interest is the average or mean of the differences within pairs. As before we will work with the sum rather than the mean or average because we are only interested in comparing the result we saw against other possible results in the randomization distribution.

To determine the randomization distribution of the sums we allow for arbitrary changes in the signs of the within-pair differences. If the populations have the same mean then the sum of differences we saw would be as likely to occur as any of the other sums calculated by switching some of the signs. If we start with $k$ differences then the total number of sums for comparison is $2^k$ , which is the number of subsets of a set of size $k$ .

To enumerate these possible sums we count from 0 to $2^k-1$ and use the binary representation of the current value of the count to determine which subset of the signs to modify.

A function to add up the sum with the signs switched according to a number j is

In [17]:

function signedsum(v::Vector, j::Integer)
    s = zero(eltype(v))
    for i in 1:length(v)
        if bool(j & 1)
            s += v[i]
        else 
            s -= v[i]
        end
        j >>= 1
    end
    s
end

Out[17]:

signedsum (generic function with 1 method)

The sum, s, is initialized to a zero of the appropriate type, eltype(v). In the loop we determine if the lowest-order bit in j is 0 or 1 by taking the logical & of j and the integer 1. At the end of the loop we shift the bits in j one position to the right.

There are other possible approaches to enumerating all the possible subsets of a set of size $k$ . It happens that this approach produces compact and fast code.

The calculation of all possible signed sums can now be expressed with the comprehension

In [18]:

ssums(v::Vector) = [signedsum(v,j) for j in 0:2^length(v)-1]

Out[18]:

ssums (generic function with 1 method)

The differences in Secchi measurements described in the R document are

In [19]:

diffs = [1.56, -0.07, 0.75, 0.71, 0.34, 0.39, 1, 0.46, 0.56, 1.13, 0.1, 
    -0.03, 0.45, 0.23, 0.66, 0.55, 0.96, 0.24, 0.75, 0.5, -0.1, -0.2];
ss = ssums(diffs);
length(ss)              # should be 2^22

Out[19]:

In [20]:

plot(DataFrame(signedSums=ss), x=:signedSums, Geom.density)

Out[20]:

If you wish to see that Julia does indeed compile functions (well, methods for functions) you can check the actual assembler code, but only do this for very simple functions.

First we will make one simplification which is to turn off bounds checking in the loop. We only access v[i] in the loop and i runs from 1 to length(v) so we can't get into trouble. We'll also use the ternary operator to tighten up the code

In [21]:

function signedsum1(v::Vector, j::Integer)
    s = zero(eltype(v))
    @inbounds for i in 1:length(v)
        s += bool(j & 1) ? v[i] : -v[i]
        j >>= 1
    end
    s
end

Out[21]:

signedsum1 (generic function with 1 method)

In [22]:

ssums1(v::Vector) = [signedsum1(v,j) for j in 0:2^length(v) - 1]

Out[22]:

ssums1 (generic function with 1 method)

The assembler code for signedsum applied to a vector of floating point numbers and an integer is shown below. You are not expected to make sense of this but rather to appreciate that it is very short and simple.

In [23]:

code_native(signedsum1,(Vector{Float64},Int))

	.text
Filename: In[21]
Source line: 3
	push	RBP
	mov	RBP, RSP
Source line: 3
	mov	RCX, QWORD PTR [RDI + 16]
	xor	EAX, EAX
	test	RCX, RCX
	cmovg	RAX, RCX
	xorps	XMM0, XMM0
	test	RAX, RAX
	je	59
Source line: 4
	mov	RCX, QWORD PTR [RDI + 8]
	xorps	XMM0, XMM0
	movabs	RDX, 140697405909312
	movsd	XMM1, QWORD PTR [RDX]
	movsd	XMM2, QWORD PTR [RCX]
	test	SIL, 1
	jne	4
	xorpd	XMM2, XMM1
	addsd	XMM0, XMM2
Source line: 5
	add	RCX, 8
	sar	RSI
	dec	RAX
	jne	-38
Source line: 7
	pop	RBP
	ret

In [23]:

ssums1(diffs);  # force compilation
@time ssums1(diffs);

elapsed time: 0.276628964 seconds (33554528 bytes allocated)