In this notebook I am contrasting the styles of programming in R with those in Julia. The results from the Julia code below can be contrasted with the R code in this Rpubs document.
The main point I would make, for those who are new to Julia, is that you can program in Julia in the "vectorized" style that is well-suited to R. However, you can also program in a "devectorized" style, which can seem heretical to those coming from R.
In the R document I explained the paired test first. It happens that I wrote the section for the unpaired test first here.
One of the easiest types of statistical hypothesis test to explain is a randomization test. If we have collected responses under two different conditions and the allocation of conditions has been randomized then a test of, say, H0:μ1=μ2 versus Ha:μ1<μ2 can be based upon the sums of all possible subsets of the same size from the set of response values. Under the null hypothesis, the sample mean of the particular subset of size k from the responses that we saw for the first condition should be similar to the population of sample means of subsets of size k of the set of responses. Under the alternative hypothesis the sample mean from the combination of responses we saw should be on the low end of the populations of sample means from all possible subsets of size k.
One way to check this is simply to enumerate all possible subsets of size k and evaluate their sums. (Because the subset size, k, is constant we work with the sums rather than the means, which are the sums divided by k.)
In the first chapter of Bob Wardrop's course notes for Statistics 371 he describes a student's experiment to determine if her cat prefers tuna-flavored treats or chicken-flavored treats.
The experiment lasted for 20 days. The assignment of chicken or tuna flavro was randomized, subject to the constraint that each flavor is provided exactly 10 times. According to the randomized selection, the tuna flavor would be given on
tuna = [2,3,4,6,10,12,14,17,19,20];
Each day she provided 10 treats of the assigned flavor and recorded the number consumed.
treats = [4,3,5,0,5,4,5,6,1,7,6,3,7,1,3,6,3,5,1,2];
Thus the number of tuna treats consumed (out of a possible 100) was
sum(treats[tuna])
29
The number of chicken treats consumed, again out of a possible 100, was
sum(treats) - sum(treats[tuna])
48
The test of "no preference" versus "the cat prefers chicken treats to tuna treats" can be phrased in terms of the means, H0:μt=μc versus H0:μt<μc, assuming that the mean number of treats consumed is a reasonable measure of the cat's preference.
Under the null hypothesis (i.e. "no preference") the observed sum of tuna treats consumed, 29, should not be "unusual" relative to the sum of all other possible assignments of 10 out of the 20 days.
I explained that it is difficult to enumerate the
binomial(20,10)
184756
possible selection of 10 days from the total of 20 in R. It is easy to do this in Julia using the combinations
iterator described below. For the time being lets consider the generation of a sample from the reference distribution as done in R. There is a sample
function in the StatsBase
package for Julia, with essentially the same syntax as in R.
using StatsBase
sum(sample(treats,10))
41
In R we used the replicate
function to generate a vector of such sums. The idiom in Julia is to use a "comprehension" to generate an array.
@time tunasamp = [sum(sample(treats,10)) for i in 1:100000];
elapsed time: 0.122722029 seconds (23191888 bytes allocated)
A histogram of this sample is
using Gadfly, DataFrames
set_default_plot_size(15cm,9cm)
plot(DataFrame(samp=tunasamp), x=:samp, Geom.histogram, Guide.XLabel("Samples from reference distribution"))
The combinations
function in Julia produces an iterator that enumerates all the possible combinations. In general for
loops can be over a range or the elements of a vector or an iterator. For example, we can write the subsets of size 2 from 1:4 as
for c in combinations(1:4,2)
show(c)
println()
end
Warning: ignoring conflicting import of StatsBase.range into DataFrames
[1,2] [1,3] [1,4] [2,3] [2,4] [3,4]
A "one-liner" function to go through the possible sets of 10 observations from the 20 and compute their sums is
combsums(v::Vector, sz) = [sum(c) for c in combinations(v,sz)]
ss = combsums(treats,10);
length(ss) # should be 184756
184756
countnz(ss .<= 29)
5027
The last expression counts the non-zeros (i.e. true
s) in the expression comparing each possible sum to the observed sum of 29. We see that only 5027 out of a possible 184756 gave a sum as small as 29 or smaller. This is less than 3% of the possible sums.
countnz(ss .<= 29)/length(ss)
0.027208859252202906
If you are wondering why the comparison is written ss .<= 29
instead of simply ss <= 29
, the .<=
operator is an explicit request to the use "recycling rule" in the comparison. That is, every element of ss
is compared to the scalar value 29
. In R the recycling rule is always used. The designers of the Julia language felt that it was less error-prone to require the user to explicitly state that it should be used.
It may seem that the combinations
iterator would be some deep magic but, in fact, it is itself written in Julia. Indeed, most of the functions provided by the standard Julia distribution are written in Julia.
Because there are many repetitions in these 184,756 sums we can save on the size of objects by storing the frequency table. The countmap
function in the StatsBase
package creates a Dict
of the sums and their counts.
countmap(combsums(treats,10))
[56=>2,35=>11858,55=>8,52=>207,30=>3254,28=>1340,24=>85,37=>14551,23=>36,53=>85,32=>6306,47=>3254,41=>13407,43=>10040,45=>6306,36=>13407,44=>8140,31=>4650,49=>1340,22=>8,39=>15145,51=>416,29=>2154,33=>8140,25=>207,46=>4650,40=>14551,48=>2154,34=>10040,50=>779,54=>36,21=>2,27=>779,38=>15145,26=>416,42=>11858]
It would be more informative to return a DataFrame
with the observed sums in increasing order and their counts
function dfsums(v::Vector,sz)
mm = countmap([sum(c) for c in combinations(v,sz)])
sums = sort([k for k in keys(mm)])
DataFrame(sum=sums, count=[mm[s] for s in sums])
end
dfsums (generic function with 1 method)
dd = dfsums(treats,10)
sum | count | |
---|---|---|
1 | 21 | 2 |
2 | 22 | 8 |
3 | 23 | 36 |
4 | 24 | 85 |
5 | 25 | 207 |
6 | 26 | 416 |
7 | 27 | 779 |
8 | 28 | 1340 |
9 | 29 | 2154 |
10 | 30 | 3254 |
11 | 31 | 4650 |
12 | 32 | 6306 |
13 | 33 | 8140 |
14 | 34 | 10040 |
15 | 35 | 11858 |
16 | 36 | 13407 |
17 | 37 | 14551 |
18 | 38 | 15145 |
19 | 39 | 15145 |
20 | 40 | 14551 |
21 | 41 | 13407 |
22 | 42 | 11858 |
23 | 43 | 10040 |
24 | 44 | 8140 |
25 | 45 | 6306 |
26 | 46 | 4650 |
27 | 47 | 3254 |
28 | 48 | 2154 |
29 | 49 | 1340 |
30 | 50 | 779 |
⋮ | ⋮ | ⋮ |
dd[:le29] = [s .<= 29 ? "≤ 29" : "> 29" for s in dd[:sum]]
plot(dd, x=:sum, y=:count, Geom.bar, color=:le29)
no method zero(Type{Any}) in render at /home/bates/.julia/Gadfly/src/geom/bar.jl:220 in render_prepared at /home/bates/.julia/Gadfly/src/Gadfly.jl:698 in render at /home/bates/.julia/Gadfly/src/Gadfly.jl:654 in writemime at /home/bates/.julia/Gadfly/src/Gadfly.jl:751 in sprint at io.jl:460 in display_dict at /home/bates/.julia/IJulia/src/execute_request.jl:35
One of the remarkable aspects of Julia is how fast this can be
@time dfsums(treats,10);
elapsed time: 0.193060026 seconds (54694096 bytes allocated)
If we want to make it a bit faster we can create the count map directly without saving the original sums. It helps to take a look at the code for countmap to understand the details. The Dict
object, mm
, is initialized to an empty dictionary with keys of the element type of v
and integer values. When a sum is computed the value for that key is incremented. The get(mm, key, 0)
call returns the current count, if key
has already been seen, and 0
otherwise.
function dfsums1(v::Vector, sz)
mm = (eltype(v)=>Int)[]
for c in combinations(v,sz)
key = sum(c)
mm[key] = get(mm,key,0) + 1
end
sums = sort([k for k in keys(mm)])
DataFrame(sums=sums, counts=[mm[k] for k in sums])
end
dfsums1 (generic function with 1 method)
The point here is that you do not need to fear the loop. You can write in a vectorized style or, if it seems more natural, you can write loops. In fact, some of the time there is an advantage in writing loops for complex operations, which is why one of the first packages that Dahua Lin created for Julia was a Devectorize
package to undo vectorization!
It turns out that the execution time is about the same as before.
dfsums1(treats,10); # one call to force compilation
@time dfsums1(treats,10);
elapsed time: 0.193720122 seconds (53214568 bytes allocated)
Considering the distribution of the sums gives us a slightly fairer calculation for the p-value of the test of H0:μt=μc versus Ha:μt<μc. The p-value is the probability of seeing the result that we did, or something even more unusual, if H0 were true. In the previous calculation we counted all the sums that were less than or equal to 29. To be fairer we should count all the sums that were less than 29 plus half of the sums that were equal to 29.
countnz(ss .< 29) + div(countnz(ss .== 29), 2)
3950
3950/184756
0.021379549243326332
If the observations in the sample are paired then the statistic of interest is the average or mean of the differences within pairs. As before we will work with the sum rather than the mean or average because we are only interested in comparing the result we saw against other possible results in the randomization distribution.
To determine the randomization distribution of the sums we allow for arbitrary changes in the signs of the within-pair differences. If the populations have the same mean then the sum of differences we saw would be as likely to occur as any of the other sums calculated by switching some of the signs. If we start with k differences then the total number of sums for comparison is 2k, which is the number of subsets of a set of size k.
To enumerate these possible sums we count from 0 to 2k−1 and use the binary representation of the current value of the count to determine which subset of the signs to modify.
A function to add up the sum with the signs switched according to a number j
is
function signedsum(v::Vector, j::Integer)
s = zero(eltype(v))
for i in 1:length(v)
if bool(j & 1)
s += v[i]
else
s -= v[i]
end
j >>= 1
end
s
end
signedsum (generic function with 1 method)
The sum, s
, is initialized to a zero of the appropriate type, eltype(v)
. In the loop we determine if the lowest-order bit in j
is 0 or 1 by taking the logical &
of j
and the integer 1
. At the end of the loop we shift the bits in j
one position to the right.
There are other possible approaches to enumerating all the possible subsets of a set of size k. It happens that this approach produces compact and fast code.
The calculation of all possible signed sums can now be expressed with the comprehension
ssums(v::Vector) = [signedsum(v,j) for j in 0:2^length(v)-1]
ssums (generic function with 1 method)
The differences in Secchi measurements described in the R document are
diffs = [1.56, -0.07, 0.75, 0.71, 0.34, 0.39, 1, 0.46, 0.56, 1.13, 0.1,
-0.03, 0.45, 0.23, 0.66, 0.55, 0.96, 0.24, 0.75, 0.5, -0.1, -0.2];
ss = ssums(diffs);
length(ss) # should be 2^22
4194304
plot(DataFrame(signedSums=ss), x=:signedSums, Geom.density)
If you wish to see that Julia does indeed compile functions (well, methods for functions) you can check the actual assembler code, but only do this for very simple functions.
First we will make one simplification which is to turn off bounds checking in the loop. We only access v[i]
in the loop and i
runs from 1 to length(v)
so we can't get into trouble. We'll also use the ternary operator to tighten up the code
function signedsum1(v::Vector, j::Integer)
s = zero(eltype(v))
@inbounds for i in 1:length(v)
s += bool(j & 1) ? v[i] : -v[i]
j >>= 1
end
s
end
signedsum1 (generic function with 1 method)
ssums1(v::Vector) = [signedsum1(v,j) for j in 0:2^length(v) - 1]
ssums1 (generic function with 1 method)
The assembler code for signedsum
applied to a vector of floating point numbers and an integer is shown below. You are not expected to make sense of this but rather to appreciate that it is very short and simple.
code_native(signedsum1,(Vector{Float64},Int))
.text Filename: In[21] Source line: 3 push RBP mov RBP, RSP Source line: 3 mov RCX, QWORD PTR [RDI + 16] xor EAX, EAX test RCX, RCX cmovg RAX, RCX xorps XMM0, XMM0 test RAX, RAX je 59 Source line: 4 mov RCX, QWORD PTR [RDI + 8] xorps XMM0, XMM0 movabs RDX, 140697405909312 movsd XMM1, QWORD PTR [RDX] movsd XMM2, QWORD PTR [RCX] test SIL, 1 jne 4 xorpd XMM2, XMM1 addsd XMM0, XMM2 Source line: 5 add RCX, 8 sar RSI dec RAX jne -38 Source line: 7 pop RBP ret
ssums1(diffs); # force compilation
@time ssums1(diffs);
elapsed time: 0.276628964 seconds (33554528 bytes allocated)