Diagnosing Slow Parallel Inner Products

In [1]:
from IPython.parallel import Client, require, interactive
In [2]:
rc = Client()
dv = rc.direct_view()
lv = rc.load_balanced_view()
In [3]:
with dv.sync_imports():
    import numpy
importing numpy on engine(s)

In [4]:
mat = numpy.random.random_sample((800, 800))
mat = numpy.asfortranarray(mat)
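As a quick aside (not from the original session): asfortranarray returns a column-major copy of the array, and the difference is visible in the array's flags.

```python
import numpy

m = numpy.random.random_sample((8, 8))
assert m.flags.c_contiguous           # numpy allocates C (row-major) order by default

fm = numpy.asfortranarray(m)
assert fm.flags.f_contiguous          # the copy is Fortran (column-major) order
assert not fm.flags.c_contiguous
assert numpy.allclose(m, fm)          # same values, different memory layout
```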
In [5]:
def simple_inner(i):
    column = mat[:, i]
    # sum this column's inner products with every later column
    return sum([numpy.inner(column, mat[:, j]) for j in xrange(i + 1, mat.shape[1])])
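For reference, simple_inner sums column i's inner products with every later column, so summing it over all i gives the sum of the strict upper triangle of the Gram matrix mat.T · mat. A small self-check of that equivalence (using a smaller matrix than the session's 800 × 800):

```python
import numpy

numpy.random.seed(0)
mat = numpy.asfortranarray(numpy.random.random_sample((80, 80)))

def simple_inner(i):
    column = mat[:, i]
    return sum(numpy.inner(column, mat[:, j]) for j in range(i + 1, mat.shape[1]))

looped = sum(simple_inner(i) for i in range(mat.shape[1] - 1))

# equivalent vectorized form: strict upper triangle of the Gram matrix
vectorized = numpy.triu(mat.T.dot(mat), k=1).sum()
assert numpy.allclose(looped, vectorized)
```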

Local, serial performance.

In [6]:
%timeit sum(simple_inner(i) for i in xrange(mat.shape[1] - 1))
1 loops, best of 3: 1.44 s per loop

In [7]:
dv.push(dict(mat=mat), block=True);

Parallel implementation using a DirectView.

In [8]:
%timeit sum(dv.map(simple_inner, range(mat.shape[1] - 1), block=False))
1 loops, best of 3: 3.34 s per loop

Parallel implementation using a LoadBalancedView with a large chunksize and unordered results.

In [12]:
%timeit sum(lv.map(simple_inner, range(mat.shape[1] - 1), ordered=False, chunksize=(mat.shape[1] - 1) // len(lv), block=False))
1 loops, best of 3: 2.79 s per loop

But both parallel runs are slower than the serial one! Why?

In [11]:
amr = dv.map(simple_inner, range(mat.shape[1] - 1), block=False)
amr.get()  # wait for all results, so the timing metadata is complete
s = sum(amr)
In [12]:
print "serial time: %.3f" % amr.serial_time
print "  wall time: %.3f" % amr.wall_time
serial time: 10.576
  wall time: 4.898

But that's weird: the engines spent over ten seconds of combined compute time on a job that took 1.44 s serially. That suggests the computation itself is somehow slower on the engines.
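To make the numbers concrete (plain arithmetic on the timings above): the engines did achieve a real parallel speedup relative to their own total compute time, but that compute time is itself far worse than the 1.44 s local run.

```python
local_serial = 1.44      # local %timeit result, in seconds
engines_serial = 10.576  # amr.serial_time: total compute time across engines
wall = 4.898             # amr.wall_time: elapsed time of the parallel run

print(round(engines_serial / wall, 2))  # speedup over the engines' own compute: ~2.16x
print(round(wall / local_serial, 2))    # but still ~3.4x slower than running locally
```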

Let's try running the local code exactly on one of the engines.

In [15]:
e0 = rc[0]
e0.block = True
e0.activate('0') # for %px0 magic
e0.push(dict(simple_inner=simple_inner));
In [16]:
# execute the timeit line on engine zero, *exactly* as we typed it above
%px0 %timeit sum(simple_inner(i) for i in xrange(mat.shape[1] - 1))
1 loops, best of 3: 11.4 s per loop

Now that's super slow, even though the code is identical to the first run! IPython.parallel isn't getting in the way here at all, so something on the engines must be different.

The only optimization we made locally was the asfortranarray call, so let's check mat.flags on both sides.

In [22]:
print 'local'
print mat.flags
print 'engine 0:'
%px0 print mat.flags
local
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
engine 0:
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

Aha! mat on the engines has lost its Fortran order: the push serialized the array's buffer, and it was rebuilt in C order on the far side. Maybe we will get our performance back if we re-apply the transformation on the engines after the push.
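A plausible sketch of how that can happen (a raw-buffer round trip; not necessarily IPython.parallel's exact codepath): rebuilding an array from its bytes defaults to C order, so the values survive but the Fortran layout is lost in transit.

```python
import numpy

a = numpy.asfortranarray(numpy.random.random_sample((4, 4)))
assert a.flags.f_contiguous

# round trip through raw bytes, as a serializer might do
b = numpy.frombuffer(a.tobytes(), dtype=a.dtype).reshape(a.shape)

assert numpy.allclose(a, b)      # same values...
assert b.flags.c_contiguous      # ...but rebuilt in C (row-major) order
assert not b.flags.f_contiguous
```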

In [19]:
%px mat = numpy.asfortranarray(mat)

And re-run the timings, to check:

In [20]:
%timeit sum(dv.map(simple_inner, range(mat.shape[1] - 1), block=False))
1 loops, best of 3: 470 ms per loop

In [21]:
%timeit sum(lv.map(simple_inner, range(mat.shape[1] - 1), ordered=False, chunksize=(mat.shape[1] - 1) // len(lv), block=False))
1 loops, best of 3: 375 ms per loop

Yes, that's much more sensible than eleven seconds.
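Why does the layout matter so much here? simple_inner reads columns, and in a Fortran-ordered array each column is one contiguous run of memory, while in a C-ordered array it is strided across the whole buffer. A flags-and-strides check (added here for illustration, not a benchmark) makes that visible:

```python
import numpy

c_mat = numpy.random.random_sample((800, 800))  # C order: rows are contiguous
f_mat = numpy.asfortranarray(c_mat)             # Fortran order: columns are contiguous

col_c = c_mat[:, 0]
col_f = f_mat[:, 0]

assert col_f.flags.c_contiguous       # contiguous 1-D view: cache-friendly reads
assert not col_c.flags.c_contiguous   # strided view: one element per 6400-byte row
assert col_f.strides == (f_mat.itemsize,)
assert col_c.strides == (800 * c_mat.itemsize,)
```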
