Parallel Inner Products

In [1]:
import numpy
mat = numpy.random.random_sample((600, 600))
In [2]:
# Fortran (column-major) order keeps each column contiguous in memory
mat = numpy.asfortranarray(mat)
In [3]:
from IPython.parallel import Client, require, interactive
In [4]:
rc = Client()
In [5]:
dv = rc.direct_view()
In [6]:
lv = rc.load_balanced_view()
In [7]:
@require("numpy")
@interactive
def simple_inner(i):
    column = mat[:, i]
    # use a list comprehension rather than a generator expression, which would
    # create a closure that cannot be shipped to the engines
    return sum([numpy.inner(column, mat[:, j]) for j in xrange(i + 1, mat.shape[1])])

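For orientation (a purely local sketch, not part of the timed comparison), the quantity being accumulated is the sum of the inner products of all distinct column pairs, which can also be read off the Gram matrix in one shot:

# sum the strict upper triangle of the Gram matrix, which holds
# every pairwise column inner product
gram = mat.T.dot(mat)
total = numpy.triu(gram, k=1).sum()
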
Local, serial performance.

In [8]:
%timeit sum(simple_inner(i) for i in xrange(mat.shape[1] - 1))
1 loops, best of 3: 720 ms per loop
In [9]:
dv.push(dict(mat=mat), block=True);
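
To confirm that the matrix reached every engine (an optional check, using only DirectView.execute and pull), its shape can be pulled back:

# each engine should report (600, 600)
dv.execute("mat_shape = mat.shape", block=True)
print(dv.pull("mat_shape", block=True))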

Parallel implementation using a DirectView. The matrix is pushed into each engine's namespace first, because the @interactive-decorated simple_inner looks up mat as a global there.

In [10]:
%timeit sum(dv.map(simple_inner, range(mat.shape[1] - 1), block=False))
1 loops, best of 3: 1.52 s per loop

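The non-blocking map returns an AsyncMapResult whose partial results are summed as they arrive. A blocking equivalent (a sketch using the same DirectView API) would be:

# map_sync waits for all engines before returning the full result list
total = sum(dv.map_sync(simple_inner, range(mat.shape[1] - 1)))
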
Parallel implementation using a LoadBalancedView with a large chunksize, so that each message carries many tasks, and unordered results, so the sum consumes them as they complete.

In [11]:
%timeit sum(lv.map(simple_inner, range(mat.shape[1] - 1), ordered=False, chunksize=(mat.shape[1] - 1) // len(lv), block=False))
1 loops, best of 3: 1.2 s per loop

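For reference, with for example four engines (a hypothetical count; len(lv) reports the actual one), the chunksize above bundles the 599 column indices into a handful of large messages:

n_tasks = mat.shape[1] - 1     # 599 indices to distribute
chunk = n_tasks // len(lv)     # e.g. 599 // 4 == 149 indices per message
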
Mapping over both indices takes even more time: each task now computes a single inner product, so roughly 180,000 tiny tasks must be communicated and the messaging overhead dominates.

In [12]:
@require("numpy")
@interactive
def inner(i, j):
    return numpy.inner(mat[:, i], mat[:, j])
In [13]:
first = [i for i in xrange(mat.shape[1] - 1) for j in xrange(i + 1, mat.shape[1])]
In [14]:
second = [j for i in xrange(mat.shape[1] - 1) for j in xrange(i + 1, mat.shape[1])]
In [15]:
%timeit sum(dv.map(inner, first, second, block=False))
1 loops, best of 3: 2.79 s per loop
In [16]:
%timeit sum(lv.map(inner, first, second, ordered=False, chunksize=len(first) // len(lv), block=False))
1 loops, best of 3: 2.74 s per loop
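
As an optional sanity check (a sketch, assuming the cells above have run), the parallel total can be compared with the serial one; the sums are accumulated in a different order, so they agree only to floating-point tolerance:

serial_total = sum(simple_inner(i) for i in xrange(mat.shape[1] - 1))
parallel_total = sum(lv.map(simple_inner, range(mat.shape[1] - 1), block=False))
print(numpy.allclose(serial_total, parallel_total))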