Polymorphic Parallel Prefix by Jiahao Chen is licensed under a Creative Commons Attribution 4.0 International License.
This notebook accompanies the workshop paper by Jiahao Chen and Alan Edelman entitled "Parallel Prefix Polymorphism Permits Parallelization, Presentation & Proof", Proceedings of the First Workshop for High Performance Technical Computing in Dynamic Languages, held in conjunction with SC14: The International Conference on High Performance Computing, Networking, Storage and Analysis.
The latest version of this notebook can be downloaded from https://github.com/jiahao/ijulia-notebooks/blob/master/2014-08-06-parallel-prefix.ipynb.
Tested to work on Julia v0.6.0.
#Uncomment these lines to install the necessary packages
#Pkg.update()
#map(Pkg.add, ("BenchmarkTools", "Cairo", "Compose", "Gadfly", "Interact", "ProgressMeter"));
using BenchmarkTools, Compose, Gadfly, Interact, ProgressMeter
reduce()
Reduction repeatedly applies a binary operator to the elements of a vector, returning a single value.
reduce(+, 1:8)
36
#Result is the same as
sum(1:8)
36
reduce(*, 1:6)
720
#Result is the same as
prod(1:6)
720
boring_left(a, b)=a
reduce(boring_left, 1:8)
1
boring_right(a, b)=b
reduce(boring_right, 1:8)
8
You can also use reduce to compute Fibonacci numbers from their matrix recurrence.
$$\begin{pmatrix} f_2 \\ f_1 \end{pmatrix} = \begin{pmatrix} f_1 + f_0 \\ f_1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} f_1 \\ f_0 \end{pmatrix}$$

$$\begin{pmatrix} f_{n+1} \\ f_n \end{pmatrix} = \dots = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^n \begin{pmatrix} f_1 \\ f_0 \end{pmatrix}$$

From this, you can show that

$$\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^n = \begin{pmatrix} f_{n+1} & f_n \\ f_n & f_{n-1} \end{pmatrix}$$

fib(j)=reduce(*, [[1 1; 1 0] for i=1:j])[1,2]
map(fib, [1, 2, 3, 4, 5, 11, 100])
7-element Array{Int64,1}:
                   1
                   1
                   2
                   3
                   5
                  89
 3736710778780434371
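Note that the last entry has overflowed: the true value of fib(100) is 354224848179261915075, but Int64 arithmetic silently wraps modulo 2^64, giving 3736710778780434371 above. A minimal fix (my addition, not in the original notebook; fib_big is a hypothetical name) is to reduce over BigInt matrices instead:

#Exact Fibonacci numbers via arbitrary-precision integer matrices
fib_big(j)=reduce(*, [BigInt[1 1; 1 0] for i=1:j])[1,2]
fib_big(100) #354224848179261915075, exactly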
You can also solve recurrences of any complexity using reduce. For example, reduce can compute a Hadamard matrix from its definition in terms of its submatrices:

$$H_{2n} = \begin{pmatrix} H_n & H_n \\ H_n & -H_n \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \otimes H_n$$

and so on.
Hadamard(n)=reduce(kron, [[1 1; 1 -1] for i=1:n])
Hadamard(3)
8×8 Array{Int64,2}:
 1   1   1   1   1   1   1   1
 1  -1   1  -1   1  -1   1  -1
 1   1  -1  -1   1   1  -1  -1
 1  -1  -1   1   1  -1  -1   1
 1   1   1   1  -1  -1  -1  -1
 1  -1   1  -1  -1   1  -1   1
 1   1  -1  -1  -1  -1   1   1
 1  -1  -1   1  -1   1   1  -1
ans*ans' #This is a legitimate Hadamard matrix
8×8 Array{Int64,2}:
 8  0  0  0  0  0  0  0
 0  8  0  0  0  0  0  0
 0  0  8  0  0  0  0  0
 0  0  0  8  0  0  0  0
 0  0  0  0  8  0  0  0
 0  0  0  0  0  8  0  0
 0  0  0  0  0  0  8  0
 0  0  0  0  0  0  0  8
#Reduction of matrix multiplications
M=[randn(2,2) for i=1:4]
printnice(x)=display(round.(x, 3))
printnice(M[4]*M[3]*M[2]*M[1])
printnice(reduce((A,B)->B*A, M)) #backward multiply <-- left associativity
printnice(reduce(*, M)) #forward multiply
2×2 Array{Float64,2}:
 -2.556  -2.636
 -2.769  -2.892
2×2 Array{Float64,2}:
 -2.556  -2.636
 -2.769  -2.892
2×2 Array{Float64,2}:
 -0.093   0.036
  1.715  -1.659
The first two agree because reduce left-folds: with the operator (A,B)->B*A, the fold unwinds to M[4]*M[3]*M[2]*M[1]. Plain reduce(*, M) instead computes M[1]*M[2]*M[3]*M[4], which is why the third result differs.
In the following example, we apply reduce() to function composition:
h=reduce((f,g)->(x->f(g(x))), [sin cos tan])
h(π)
0.8414709848078965
sin(cos(tan(π)))
0.8414709848078965
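As an aside (my addition, not in the original notebook): assuming Julia v0.6's Base, which provides the composition operator ∘, the same fold can be written as follows (h2 is a hypothetical name):

#Equivalent composition using Base's ∘ operator
h2=reduce(∘, [sin, cos, tan])
h2(π) #0.8414709848078965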
prefix
Having discussed reduce, we are now ready for the idea behind prefix sum.
Suppose you wanted to compute the partial sums of a vector, i.e. given
y[1:n]
, we want to overwrite the vector y
with the vector of partial sums
new_y[1] = y[1]
new_y[2] = y[1] + y[2]
new_y[3] = y[1] + y[2] + y[3]
...
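(Aside, my note: this is exactly what the built-in cumsum computes serially.)

cumsum([1, 2, 3]) #[1, 3, 6]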
At first blush, it seems impossible to parallelize this, since
new_y[1] = y[1]
new_y[2] = new_y[1] + y[2]
new_y[3] = new_y[2] + y[3]
...
which is an intrinsically serial process.
function prefix_serial!(y, +)
for i=2:length(y)
y[i] = y[i-1] + y[i]
end
y
end
prefix_serial! (generic function with 1 method)
#Compute the first 8 triangular numbers
prefix_serial!([1:8;], +)'
1×8 RowVector{Int64,Array{Int64,1}}:
 1  3  6  10  15  21  28  36
One particularly neat trick in Julia is that we are able to pass the + operator as an argument to prefix_serial!. As a result, we can use this function to compute other things too, just by swapping out the operator.
#Compute the first 8 factorials
prefix_serial!([1:8;], *)'
1×8 RowVector{Int64,Array{Int64,1}}:
 1  2  6  24  120  720  5040  40320
However, it turns out that because addition (+) is associative, we can regroup the order in which these sums are carried out. (This of course extends to other associative operations, such as multiplication.) Another ordering of 8 associative operations is provided by prefix8!:
function prefix8!(y, ⋅)
length(y)==8 || error("length 8 only")
for i in [2,4,6,8]; y[i] = y[i-1] ⋅ y[i]; end
for i in [ 4, 8]; y[i] = y[i-2] ⋅ y[i]; end
for i in [ 8]; y[i] = y[i-4] ⋅ y[i]; end
for i in [ 6 ]; y[i] = y[i-2] ⋅ y[i]; end
for i in [ 3,5,7 ]; y[i] = y[i-1] ⋅ y[i]; end
y
end
prefix8! (generic function with 1 method)
#Compute the first 8 triangular numbers
prefix8!([1:8;], +)'
1×8 RowVector{Int64,Array{Int64,1}}:
 1  3  6  10  15  21  28  36
#Compute the first 8 factorials
prefix8!([1:8;], *)'
1×8 RowVector{Int64,Array{Int64,1}}:
 1  2  6  24  120  720  5040  40320
The general case is a bit more convoluted to write out:
function prefix!(y, ⋅)
l=length(y)
k=ceil(Int, log2(l))
for j=1:k, i=2^j:2^j:min(l, 2^k) #"reduce"
y[i] = y[i-2^(j-1)] ⋅ y[i]
end
for j=(k-1):-1:1, i=3*2^(j-1):2^j:min(l, 2^k) #"broadcast"
y[i] = y[i-2^(j-1)] ⋅ y[i]
end
y
end
prefix! (generic function with 1 method)
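As a quick sanity check (my addition), prefix! should agree with the built-in cumsum for every length, not just powers of two:

#prefix! must reproduce cumsum for all lengths
for n=1:100
    @assert prefix!([1:n;], +) == cumsum(1:n)
end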
We can visualize the operations with a little bit of trickery. In Julia, arrays are simply types that expose the array protocol. In particular, they need to implement methods for the generic functions length, getindex and setindex!. The last two are used in indexing operations, since the statements
y = A[1]
A[3] = y
get desugared to
y = getindex(A, 1)
setindex!(A, y, 3)
respectively.
We can trace through the algorithm by introducing a dummy array type, AccessArray, which stores no useful information but records every access to getindex and setindex!.

Specifically:

- length(A::AccessArray) returns an integer that is stored internally in the A.length field.
- getindex(A::AccessArray, i) records read access to the index i in the A.read field and always returns nothing.
- setindex!(A::AccessArray, x, i) records write access to the index i: the A.history field is appended with a new tuple consisting of the current A.read field and the index i.

The way AccessArray works assumes an association between a single setindex! call and all the preceding getindex calls since the previous setindex! call, which is sufficient for the purposes of tracing through prefix calls.
import Base: getindex, setindex!, length, size
if !isdefined(:AccessArray) #guard so this cell can be re-run without redefining the type
type AccessArray <: AbstractArray{Void, 1}
    length :: Int
    read :: Vector
    history :: Vector{Any}
    AccessArray(length, read=[], history=[]) = new(length, read, history)
end
end
length(A::AccessArray) = A.length
size(A::AccessArray) = (A.length,)
function getindex(A::AccessArray, i)
push!(A.read, i)
nothing
end
function setindex!(A::AccessArray, x, i)
push!(A.history, (A.read, [i]))
A.read = []
end
setindex! (generic function with 202 methods)
We also need a dummy associative operator to pass to the prefix call.
import Base.+
+(a::Void, b::Void)=a
+ (generic function with 279 methods)
A=prefix8!(AccessArray(8),+)
MethodError: no method matching size(::AccessArray)
Closest candidates are:
  size(::AbstractArray{T,N}, ::Any) where {T, N} at abstractarray.jl:29
  size(::Any, ::Integer, ::Integer, ::Integer...) where N at abstractarray.jl:30
  size(::Char) at char.jl:13
  ...
Stacktrace:
 [1] #showarray#263(::Bool, ::Function, ::IOContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::AccessArray, ::Bool) at ./show.jl:1690
 [2] limitstringmime(::MIME{Symbol("text/plain")}, ::AccessArray) at /Users/cqm540/.julia/v0.6/IJulia/src/inline.jl:25
 [3] display_dict(::AccessArray) at /Users/cqm540/.julia/v0.6/IJulia/src/execute_request.jl:27
 [4] execute_request(::ZMQ.Socket, ::IJulia.Msg) at /Users/cqm540/.julia/v0.6/IJulia/src/execute_request.jl:188
 [5] eventloop(::ZMQ.Socket) at /Users/cqm540/.julia/v0.6/IJulia/src/eventloop.jl:8
 [6] (::IJulia.##11#14)() at ./task.jl:335
(The MethodError is raised while IJulia tries to display the returned AccessArray; the prefix8! call itself succeeded, and the access trace is stored in A.history. Ending the cell with a semicolon would suppress the display.)
Now let's visualize this! Each entry in A.history is rendered by a gate object:
type gate
ins :: Vector
outs:: Vector
end
import Gadfly.render
function render(G::gate, x₁, y₁, y₀; rᵢ=0.1, rₒ=0.25)
ipoints = [(i, y₀+rᵢ) for i in G.ins]
opoints = [(i, y₀+0.5) for i in G.outs]
igates = [circle(i..., rᵢ) for i in ipoints]
ogates = [circle(i..., rₒ) for i in opoints]
lines = [line([i, j]) for i in ipoints, j in opoints]
compose(context(units=UnitBox(0.5,0,x₁,y₁+1)),
compose(context(), stroke(colorant"black"), fill(colorant"white"),
igates..., ogates...),
compose(context(), linewidth(0.3mm), stroke(colorant"black"),
lines...))
end
A=gate([1,2],[2])
render(A,2,0,0)
Now we render the whole algorithm. We have to scan through the trace twice; the first time merely calculates the maximum depth that needs to be drawn and the second time actually generates the objects.
function render(A::AccessArray)
#Scan to find maximum depth
olast = depth = 0
for y in A.history
(any(y[1] .≤ olast)) && (depth += 1)
olast = maximum(y[2])
end
maxdepth = depth
olast = depth = 0
C = []
for y in A.history
(any(y[1] .≤ olast)) && (depth += 1)
push!(C, render(gate(y...), A.length, maxdepth, depth))
olast = maximum(y[2])
end
push!(C, compose(context(units=UnitBox(0.5,0,A.length,1)),
[line([(i,0), (i,1)]) for i=1:A.length]...,
linewidth(0.1mm), stroke(colorant"grey")))
compose(context(), C...)
end
render (generic function with 40 methods)
render(prefix!(AccessArray(20), +))
Now we can see that prefix! rearranges the operations to form two spanning trees:
render(prefix!(AccessArray(120),+))
Here we redefine prefix! with @inbounds annotations to elide bounds checking, ahead of the timing runs below:
function prefix!(y, +)
l=length(y)
k=ceil(Int, log2(l))
@inbounds for j=1:k, i=2^j:2^j:min(l, 2^k) #"reduce"
y[i] = y[i-2^(j-1)] + y[i]
end
@inbounds for j=(k-1):-1:1, i=3*2^(j-1):2^j:min(l, 2^k) #"broadcast"
y[i] = y[i-2^(j-1)] + y[i]
end
y
end
@show prefix!(AccessArray(8),+)
render(prefix!(AccessArray(9),+))
MethodError: no method matching size(::AccessArray)
Closest candidates are:
  size(::AbstractArray{T,N}, ::Any) where {T, N} at abstractarray.jl:29
  size(::Any, ::Integer, ::Integer, ::Integer...) where N at abstractarray.jl:30
  size(::Char) at char.jl:13
  ...
Stacktrace:
 [1] array_eltype_show_how(::AccessArray) at ./show.jl:1748
 [2] show_vector(::Base.AbstractIOBuffer{Array{UInt8,1}}, ::AccessArray, ::String, ::String) at ./show.jl:1752
 [3] #showarray#263(::Bool, ::Function, ::Base.AbstractIOBuffer{Array{UInt8,1}}, ::AccessArray, ::Bool) at ./show.jl:1681
 [4] showall(::Base.AbstractIOBuffer{Array{UInt8,1}}, ::AccessArray) at ./show.jl:1724
 [5] repr(::AccessArray) at ./strings/io.jl:146
(As before, the MethodError arises from trying to print the AccessArray, here via @show; the prefix computation itself completed.)
#Look at the compose tree
#render(prefix8!(AccessArray(8),+)) |> introspect
This contrasts with the serial code:
render(prefix_serial!(AccessArray(8),+))
@manipulate for npp=1:180
render(prefix!(AccessArray(npp),+))
end
@manipulate for npp=1:180
render(prefix_serial!(AccessArray(npp),+))
end
draw(PDF("serial.pdf", 3.5inch, 3.5inch), render(prefix_serial!(AccessArray(8),+)))
draw(PDF("tree.pdf", 3.5inch, 3.5inch), render(prefix8!(AccessArray(8),+)))
Now let's run prefix in parallel on every core on the CPU. Use addprocs to attach additional worker processes.
import Base: fetch, length, +, *
fetch(t::Vector) = map(fetch, t) #Vectorize fetch
#Define elementary operations on remote data
length(r1::Future)=length(fetch(r1))
+(r1::Future, r2::Future)=@spawnat r2.where fetch(r1)+fetch(r2)
*(r1::Future, r2::Future)=@spawnat r2.where fetch(r1)*fetch(r2)
* (generic function with 284 methods)
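A minimal sanity check of these lifted operators (my example; it only assumes the master process is available):

#Adding two Futures yields a Future for the sum, computed where r2 lives
r1 = @spawnat 1 2.0
r2 = @spawnat 1 3.0
fetch(r1 + r2) #5.0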
For 8 processes, the serial version requires 7 operations. The parallel version uses 11 operations (4 + 2 + 1 + 1 + 3, one batch per loop in prefix8!), but they are grouped into 5 simultaneous rounds. Hence we expect the parallel version to run in 5/7 of the time needed for the naïve serial code (on a machine with 8 processors).
println("This computer has $(nprocs()) processors")
This computer has 8 processors
max_procs = nprocs() #Number of processes to benchmark over
#Before timing, force Julia to JIT compile the functions
[f([randn(1,1) for i=1:2], *) for f in (prefix_serial!, prefix!)]
addprocs(max(0, Sys.CPU_CORES-max_procs))
Base.zero(::Type{Tuple{Float64,Float64}}) = (0.0, 0.0) #needed so zeros() below can allocate the timings array
timings=zeros(Tuple{Float64,Float64}, max_procs)
@showprogress for np in 1:max_procs
gc_enable(false) #Disable garbage collector for timing purposes
n=2048 #matrix size; other values tried: 128, 512, 3072, 4096
#Create Futures on np processes
r=[@spawnat p randn(n,n) for p in procs()[1:np]]
s=fetch(r) #These are now the same matrices in memory on the master process
#Timings use @belapsed from BenchmarkTools.jl; note that BenchmarkTools was
#designed for fast serial kernels, so the parallel timings are imperfect (see caveats below)
t_ser = @belapsed prefix_serial!($s, *)
t_par = @belapsed @sync prefix!($r, *)
timings[np] = (t_ser, t_par)
gc_enable(true)
end
timings
Progress: 100%|█████████████████████████████████████████| Time: 0:03:43
8-element Array{Tuple{Float64,Float64},1}:
 (6.335e-9, 7.55627e-7)
 (0.136874, 0.447522)
 (0.312059, 0.98023)
 (0.47621, 1.35074)
 (0.611405, 1.77155)
 (0.785864, 2.03217)
 (1.05001, 2.52966)
 (1.07574, 2.73774)
#timings = #Data from julia.mit.edu, a large shared memory machine with 80 cores
#[
# ( 15.585, 15.721), ( 16.325, 15.893), ( 32.079, 31.678), ( 46.037, 33.257),
# ( 61.947, 51.412), ( 77.247, 50.769), ( 92.588, 67.796), (108.057, 65.185),
# (123.556, 68.197), (138.408, 66.658), (153.392, 82.580), (168.874, 83.373),
# (184.227, 82.875), (199.741, 83.098), (215.881, 101.271), (230.531, 104.860),
# (246.576, 103.229), (261.412, 102.497), (276.926, 103.367), (294.404, 104.162),
# (308.995, 104.944), (323.287, 104.838), (340.173, 121.495), (353.717, 119.729),
# (369.113, 120.281), (384.638, 118.710), (402.484, 119.237), (417.036, 119.980),
# (431.273, 120.463), (446.288, 120.560), (461.756, 137.209), (476.653, 135.873),
# (493.989, 138.150), (508.089, 137.998), (523.703, 139.059), (539.305, 135.201),
# (554.201, 138.242), (569.276, 137.470), (585.268, 137.261), (599.921, 137.871),
# (615.718, 137.887), (632.463, 137.579), (647.355, 138.181), (664.344, 139.917),
# (678.899, 138.382), (695.815, 138.354), (717.068, 154.826), (730.902, 155.603),
# (745.608, 154.549), (756.012, 154.244), (769.611, 154.000), (793.350, 152.893),
# (801.568, 155.276), (818.174, 153.436), (832.994, 155.923), (849.526, 155.721),
# (863.337, 155.289), (881.577, 156.188), (893.735, 157.889), (911.597, 155.557),
# (925.117, 157.048), (940.897, 153.338), (954.825, 173.942), (990.757, 173.236),
# (987.457, 172.005), (1009.688, 173.869), (1022.357, 173.303), (1043.536, 175.044),
# (1052.156, 173.361), (1067.174, 172.230), (1079.543, 173.491), (1101.516, 174.311),
# (1119.993, 174.057), (1136.092, 175.804), (1154.758, 175.075), (1170.732, 173.453),
# (1184.598, 171.936), (1190.922, 172.809), (1204.645, 172.878),
#]
speedup(p::Integer) = p==1 ? 1.0 : ((p-1)/(floor(Int, log2(p)) + 1 + floor(Int, log2(p/3))))
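To connect this with the 5/7 claim above (my check): for p = 8, the serial kernel performs p − 1 = 7 operations, while the parallel kernel needs ⌊log₂ 8⌋ + 1 + ⌊log₂(8/3)⌋ = 3 + 1 + 1 = 5 rounds:

@assert speedup(8) == 7/5 #1.4× theoretical speedup on 8 processors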
#for (np, (t_ser, t_par)) in sort!([x for x in timings])
# @printf("np: %3d Serial: %.3f sec Parallel: %.3f sec speedup: %.3fx (theory=%.3fx)\n",
# np, t_ser, t_par, t_ser/t_par, speedup(np))
#end
xend = 1+max_procs
theplot=plot(
#Theoretical speedup curve
layer(x=1:xend, y=map(speedup, 1:xend), Geom.line(), Theme(default_color=colorant"darkkhaki")),
#Empirical data
layer(x=1:max_procs,
y=[t_ser/t_par for (t_ser, t_par) in timings],
Geom.point,
Theme(default_color=colorant"firebrick",
point_size=1.25pt, highlight_width=0pt)),
Guide.xlabel("Number of processors, p"), Guide.ylabel("Speedup, r")
)
If you run this notebook on your local machine, you will probably notice that a lot of things can go wrong in benchmarking:

- BenchmarkTools.jl was specifically written to handle the timing of fast serial kernels; developing proper statistics for parallel kernels remains an unsolved problem.
- It is easy to drop the @sync barrier, and then the parallel code will magically look faster because execution flow goes back to the master node immediately after the Future is created.

# This plot is specifically for the timing data in the paper
#
# labeledplot = compose(render(theplot),
# compose(context(), stroke(colorant"purple"), #Powers of 2
# text(0.21, 0.73, "2"), text(0.24, 0.72, "4"), text(0.27, 0.7, "8"), text(0.33, 0.63, "16"),
# text(0.47, 0.53, "32"), text(0.76, 0.35, "64")),
# compose(context(), stroke(colorant"green"), #3x powers of 2
# text(0.25, 0.62, "6"), text(0.28, 0.57, "12"), text(0.39, 0.49, "24"), text(0.63, 0.32, "48"),
# )
# )
# draw(SVG(3.5inch, 3.2inch), labeledplot)
#draw(SVG("scaling.svg", 3.5inch, 3.2inch), labeledplot)
#draw(PDF("scaling.pdf", 3.5inch, 3inch), theplot)
The exact same serial code now runs in parallel thanks to multiple dispatch!
Julia makes it very easy to implement the interval-of-summations monoid technique for verifying the correctness of prefix sums (doi:10.1145/2535838.2535882).
#Definition 4.3 of Chong et al., 2014
abstract type ⊤ end
abstract type Id end
const 𝕊 = Union{UnitRange, Type{⊤}, Type{Id}};
⊕(I::UnitRange, J::UnitRange) = I.stop+1==J.start ? (I.start:J.stop) : ⊤
⊕(::Type{Id}, ::Type{Id}) = Id
⊕(I::𝕊, ::Type{Id}) = I
⊕(::Type{Id}, I::𝕊) = I
⊕(I::𝕊, J::𝕊) = ⊤
⊕ (generic function with 5 methods)
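A quick illustration of the monoid (my examples): adjacent intervals fuse into one, while any gap collapses to the error value ⊤.

(1:3) ⊕ (4:7) #1:7
(1:3) ⊕ (5:7) #⊤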
for kernel in (prefix_serial!, prefix!), n=1:1000
@assert kernel([k:k for k=1:n], ⊕) == [1:k for k=1:n]
end
Wrong kernels produce type errors!
function prefix_wrong!(y, .+)
l = length(y)
offset = 1
z = copy(y)
while offset < l
for i=offset+1:l
z[i] = y[i-offset] .+ z[i]
end
offset += 2
y = copy(z)
end
y
end
@assert prefix_wrong!([k:k for k=1:8], ⊕) #Should produce conversion error
MethodError: Cannot `convert` an object of type Array{Int64,1} to an object of type UnitRange{Int64}
This may have arisen from a call to the constructor UnitRange{Int64}(...), since type constructors fall back to convert methods.
Stacktrace:
 [1] prefix_wrong!(::Array{UnitRange{Int64},1}, ::Function) at ./In[44]:7
@manipulate for npp=1:10
render(prefix_wrong!(AccessArray(npp),+))
end
In this section I play with other parallel prefix algorithms. All the ones here suffer from read antidependencies, which break the simple read/write association that AccessArray uses to draw the gates.
#Kogge-Stone variant
#Works but doesn't render properly
function prefix_ks!(y, .+)
l = length(y)
offset = 1
z = copy(y)
while offset < l
for i=offset+1:l
z[i] = y[i-offset] .+ z[i]
end
offset *= 2
y = copy(z)
end
y
end
@assert prefix_ks!(ones(8), +) == [1:8;]
render(prefix_ks!(AccessArray(8),+))
#Hillis-Steele = Kogge-Stone
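#Note (my annotation): as written, this in-place variant is buggy, since Hillis-Steele
#requires double buffering; the output below is [1.0, 2.0, 3.0, 6.0] rather than
#the correct prefix sums [1.0, 2.0, 3.0, 4.0]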
function prefix_hs!(y, +)
l = length(y)
lk = floor(Int, log2(l))
for d=1:lk
for k=2d:l
y[k] = y[k-2^(d-1)] + y[k]
end
end
y
end
prefix_hs!(ones(4), +)
4-element Array{Float64,1}:
 1.0
 2.0
 3.0
 6.0
render(prefix_hs!(AccessArray(4), +))
#Sklansky
function prefix_s!(y, +)
n = length(y)
for d=1:log2(n) #bug: log2(n) is a Float64, so d and the indices below are Floats, triggering the ArgumentError shown below; 1:Int(log2(n)) would fix it
for k=1:n÷2+1
block = 2 * (k - (mod(k, 2^d)))
me = block + mod(k, 2^d) + 2^d
spine = block + 2^d - 1
if spine ≤ n && me ≤ n
y[me] = y[me] + y[spine]
end
end
end
y
end
prefix_s!(ones(4), +)
ArgumentError: invalid index: 3.0
Stacktrace:
 [1] prefix_s!(::Array{Float64,1}, ::Base.#+) at ./In[61]:10
function prefix(v,⊕)
@show v
k=length(v)
k≤1 && return v
v[2:2:k] = @show v[1:2:k-1]⊕v[2:2:k] # Line 1
v[2:2:k] = @show prefix(v[2:2:k],⊕) # Line 2
v[3:2:k] = @show v[2:2:k-1]⊕v[3:2:k] # Line 3
v
end
prefix (generic function with 1 method)
prefix(ones(8),+)
v = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
v[1:2:k - 1] ⊕ v[2:2:k] = [2.0, 2.0, 2.0, 2.0]
v = [2.0, 2.0, 2.0, 2.0]
v[1:2:k - 1] ⊕ v[2:2:k] = [4.0, 4.0]
v = [4.0, 4.0]
v[1:2:k - 1] ⊕ v[2:2:k] = [8.0]
v = [8.0]
prefix(v[2:2:k], ⊕) = [8.0]
v[2:2:k - 1] ⊕ v[3:2:k] = Float64[]
prefix(v[2:2:k], ⊕) = [4.0, 8.0]
v[2:2:k - 1] ⊕ v[3:2:k] = [6.0]
prefix(v[2:2:k], ⊕) = [2.0, 4.0, 6.0, 8.0]
v[2:2:k - 1] ⊕ v[3:2:k] = [3.0, 5.0, 7.0]
8-element Array{Float64,1}:
 1.0
 2.0
 3.0
 4.0
 5.0
 6.0
 7.0
 8.0
function ⊕{T<:Range}(v1::Vector{T}, v2::Vector{T})
for v in (v1, v2), i in 1:length(v)-1
@assert v[i].stop + 1 == v[i+1].start
end
if length(v1)>0 && length(v2)>0
@assert v1[end].stop + 1 == v2[1].start
v1[1].start:v2[end].stop
elseif length(v1)>0 #&& length(v2)==0
v1[1].start:v1[end].stop
elseif length(v2)>0 #&& length(v1)==0
v2[1].start:v2[end].stop
else #Both are empty
v1
end
end
⊕ (generic function with 6 methods)
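A quick check of the vectorized monoid (my example): two blocks of adjacent intervals fuse into a single covering interval.

@assert ([1:2, 3:4] ⊕ [5:6]) == 1:6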
for kernel in (prefix,), n=1:100 #also try prefix_serial!, prefix!, prefix_ks!
@assert kernel([k:k for k=1:n], ⊕) == [1:k for k=1:n]
end
v = UnitRange{Int64}[1:1]
v = UnitRange{Int64}[1:1, 2:2]
v[1:2:k - 1] ⊕ v[2:2:k] = 1:2
DimensionMismatch("tried to assign 2 elements to 1 destinations")
Stacktrace:
 [1] throw_setindex_mismatch(::UnitRange{Int64}, ::Tuple{Int64}) at ./indices.jl:92
 [2] setindex_shape_check(::UnitRange{Int64}, ::Int64) at ./indices.jl:143
 [3] setindex!(::Array{UnitRange{Int64},1}, ::UnitRange{Int64}, ::StepRange{Int64,Int64}) at ./array.jl:563
 [4] prefix(::Array{UnitRange{Int64},1}, ::#⊕) at ./In[62]:5
 [5] macro expansion at ./In[65]:2 [inlined]
 [6] anonymous at ./<missing>:?