using LinearAlgebra

X = rand(5,5)
A = X + X'   # random real-symmetric matrix (real eigenvalues)
display(A)
for n = 0:10
    Q, R = qr(A)   # QR factorization A = QR
    A = R*Q        # re-multiply in reverse order: similar to A
    display(A)
end
You might find it interesting to compare the results above with the eigenvalues of the starting matrix $X+X^*$:
eigvals(X + X')
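For a more quantitative check, here is a small sketch (the helper function qriter is hypothetical, not from the text above) that runs many QR iterations and compares the sorted diagonal of the iterate to the sorted eigenvalues:

```julia
using LinearAlgebra

# hypothetical helper: run k unshifted QR iterations on a copy of A
function qriter(A, k)
    for _ = 1:k
        Q, R = qr(A)
        A = R*Q
    end
    return A
end

X = rand(5,5)
S = X + X'
Ak = qriter(S, 100)
# for a symmetric matrix with well-separated eigenvalues, the diagonal
# of the iterate typically approximates the eigenvalues (up to ordering):
sort(diag(Ak)), sort(eigvals(S))
```

The off-diagonal entries decay at a rate governed by the ratios of adjacent eigenvalues, so convergence is fast when the eigenvalues are well separated and slow when they are nearly degenerate.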
Here we compute matrix addition $C = A + B$ for $m \times n$ matrices, which is trivial to implement: just two loops over $i$ and $j$ to compute $c_{ij} = a_{ij} + b_{ij}$ for each entry. There is no possibility of temporal locality in this problem, because each matrix element is used exactly once. However, the ordering of the operations still matters because of spatial locality (primarily because of cache lines). In particular, we will look at the influence of the loop order: does it matter (for performance) if we perform the $i$ loop first (outermost), then the $j$ loop, or vice versa?
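As background for the locality discussion (a small illustrative sketch, not part of the benchmark): Julia stores matrices in column-major order, so consecutive entries of a *column* are adjacent in memory. You can see this by flattening a small matrix with vec:

```julia
M = [1 2 3
     4 5 6]        # 2×3 matrix
vec(M)             # columns concatenated: [1, 4, 2, 5, 3, 6]
# equivalently, M[i,j] lives at linear index i + (j-1)*size(M,1):
M[2,3] == vec(M)[2 + (3-1)*2]   # true
```

Looping down a column therefore walks through memory sequentially, while looping across a row strides by the column length.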
Here are two simple matrix-addition implementations in Julia that differ only in their loop ordering. In both cases we use the @inbounds macro to avoid the overhead of bounds-checking array accesses (which makes a slight performance difference in inner loops), and we implement "mutating" variants that require the output array C to be pre-allocated (to avoid the overhead of array allocation during benchmarking).
function matadd1!(C, A, B)
    m, n = size(A)
    (m, n) == size(B) || error("matrix sizes must match")
    for i = 1:m       # row loop outermost
        for j = 1:n   # column loop innermost
            @inbounds C[i,j] = A[i,j] + B[i,j]
        end
    end
    return C
end
matadd1(A, B) = matadd1!(similar(A, promote_type(eltype(A), eltype(B))), A, B)
# same as matadd1! except the loop order is reversed
function matadd2!(C, A, B)
    m, n = size(A)
    (m, n) == size(B) || error("matrix sizes must match")
    for j = 1:n       # column loop outermost
        for i = 1:m   # row loop innermost
            @inbounds C[i,j] = A[i,j] + B[i,j]
        end
    end
    return C
end
matadd2(A, B) = matadd2!(similar(A, promote_type(eltype(A), eltype(B))), A, B)
# correctness check:
using LinearAlgebra   # for norm
A = rand(5,7)
B = rand(5,7)
norm(matadd1(A,B) - (A+B)), norm(matadd2(A,B) - (A+B))
You can also use the ? command in Julia to look at how the built-in A+B is implemented: entering ?A+B gives a link to the source code of the + method that is used for these argument types:

?A+B
Now, let's benchmark matadd1! and matadd2!, along with the built-in A+B (which allocates a new array as output each time you call it), for a variety of $n\times n$ matrix sizes:
N = round.(Int, range(10, 1000, length=200))
t0 = Float64[]
t1 = Float64[]
t2 = Float64[]
for n in N
    A = rand(n,n)
    B = rand(n,n)
    C = similar(A)
    push!(t0, @elapsed A+B) # note: does not preallocate output
    push!(t1, @elapsed matadd1!(C,A,B))
    push!(t2, @elapsed matadd2!(C,A,B))
end
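A single @elapsed sample is noisy: it can include compilation time on the first call and is sensitive to OS jitter. One common refinement, sketched here as an aside rather than what the benchmark above does (the helper mintime is hypothetical), is to warm up first and take the minimum over several repetitions:

```julia
# hypothetical helper: minimum elapsed time of f() over nsamples runs
function mintime(f, nsamples=5)
    f()  # warm-up call (forces compilation before timing)
    return minimum(@elapsed(f()) for _ = 1:nsamples)
end

A = rand(100,100); B = rand(100,100); C = similar(A)
t = mintime(() -> C .= A .+ B)   # time an in-place (non-allocating) addition
```

In practice, the BenchmarkTools.jl package (@btime, @belapsed) automates this kind of repeated sampling.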
and plot the results as gigaflops ($n^2$ divided by time in nanoseconds) vs. $n$:
using PyPlot
plot(N, N.^2 ./ t0 * 1e-9, "k.-")
plot(N, N.^2 ./ t1 * 1e-9, "bo-")
plot(N, N.^2 ./ t2 * 1e-9, "rs-")
xlabel(L"n")
ylabel(L"gflops ($n^2/t$)")
legend(["A+B", "matadd1!", "matadd2!"])
Can you (roughly) explain the results? Hint: look up row-major versus column-major order (Julia arrays are stored column-major) and think about how that affects the memory-access pattern of each loop ordering.