Parallel Inner Products¶

In [1]:

from IPython.parallel import Client, require, interactive

In [5]:

rc = Client()
dv = rc.direct_view()
lv = rc.load_balanced_view()

In [6]:

with dv.sync_imports():
    import numpy

importing numpy on engine(s)

In [7]:

mat = numpy.random.random_sample((800, 800))
mat = numpy.asfortranarray(mat)

In [8]:

def simple_inner(i):
    column = mat[:, i]
    # have to use a list comprehension to prevent closure
    return sum([numpy.inner(column, mat[:, j]) for j in xrange(i + 1, mat.shape[1])])

Local, serial performance.

In [9]:

%timeit sum(simple_inner(i) for i in xrange(mat.shape[1] - 1))

1 loops, best of 3: 892 ms per loop

In [10]:

dv.push(dict(mat=mat), block=True);

Parallel implementation using a DirectView.

In [11]:

%timeit sum(dv.map(simple_inner, range(mat.shape[1] - 1), block=False))

1 loops, best of 3: 3.24 s per loop

Parallel implementation using a LoadBalancedView with a large chunksize and unordered results.

In [12]:

%timeit sum(lv.map(simple_inner, range(mat.shape[1] - 1), ordered=False, chunksize=(mat.shape[1] - 1) // len(lv), block=False))

1 loops, best of 3: 2.79 s per loop

But the transfer forced the array back to C-contiguous, which explains the slowness.

If we re-apply the fortran-contiguous transformation on the engines, we should geet our performance back.

In [13]:

%px mat = numpy.asfortranarray(mat)

And re-run the timings

In [14]:

%timeit sum(dv.map(simple_inner, range(mat.shape[1] - 1), block=False))

1 loops, best of 3: 439 ms per loop

In [15]:

%timeit sum(lv.map(simple_inner, range(mat.shape[1] - 1), ordered=False, chunksize=(mat.shape[1] - 1) // len(lv), block=False))

1 loops, best of 3: 473 ms per loop

Using two indices takes even more time due to additional communication.

In [16]:

def inner(i, j):
    return numpy.inner(mat[:, i], mat[:, j])

In [17]:

first = [i for i in xrange(mat.shape[1] - 1) for j in xrange(i + 1, mat.shape[1])]

In [18]:

second = [j for i in xrange(mat.shape[1] - 1) for j in xrange(i + 1, mat.shape[1])]

In [19]:

%timeit sum(dv.map(inner, first, second, block=False))

1 loops, best of 3: 1.11 s per loop

In [20]:

%timeit sum(lv.map(inner, first, second, unordered=True, chunksize=len(first) // len(lv), block=False))

1 loops, best of 3: 1.27 s per loop

In [21]:

%timeit sum(map(inner, first, second))

1 loops, best of 3: 1.91 s per loop

So, in every case the double-index case is slower (locality, etc., etc.), but it's still faster in parallel than serial.