In [1]:
import cv2

Performance, HPC and IPython Parallel

Performance issues

  • Python loops over big arrays can be slow
  • However
    • OpenCV operations are efficient machine code
    • NumPy operations on arrays are efficient machine code
    • The SciPy Stack relies on C, C++ and Fortran implementations for numerical software

Low-level code written in Python, such as loops over big arrays, can be slow, mainly because Python is dynamically typed and interpreted. However, in the scientific computing environment described above, this is rarely a problem: the OpenCV interface simply calls optimized C/C++ code, and most of the software in the SciPy Stack relies on a base of numerical software implemented in C, C++ and Fortran, including the efficient NumPy arrays.
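
As a rough illustration of this point, the sketch below computes the sum of squared differences between two arrays twice: once with an explicit Python loop and once with a single NumPy expression. The function names `ssd_loop` and `ssd_numpy`, the array size and the random test data are assumptions made for the example; the measured times depend on the machine.

```python
import time
import numpy as np

def ssd_loop(a, b):
    """Sum of squared differences using an explicit Python loop."""
    total = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            d = float(a[i, j]) - float(b[i, j])
            total += d * d
    return total

def ssd_numpy(a, b):
    """Same computation, delegated to NumPy's compiled array routines."""
    d = a.astype(np.float64) - b.astype(np.float64)
    return float(np.sum(d * d))

# Two random 8-bit "images" (toy data, fixed seed for repeatability).
rng = np.random.RandomState(0)
a = rng.randint(0, 256, size=(200, 200)).astype(np.uint8)
b = rng.randint(0, 256, size=(200, 200)).astype(np.uint8)

t0 = time.time()
r_loop = ssd_loop(a, b)
t_loop = time.time() - t0

t0 = time.time()
r_np = ssd_numpy(a, b)
t_np = time.time() - t0

print("loop: %.4f s, numpy: %.4f s" % (t_loop, t_np))
```

Both functions return the same value; only the place where the loop runs (the interpreter or compiled code) differs.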


  • Cython is a static compiler
  • It works on a super-set of the Python language that supports C-like static type declarations
  • Compiles Python code to C
    • Produces a module that can be imported by the Python interpreter
  • Useful to
    • speed up low-level looping in arrays;
    • access external C/C++ libraries
  • Pareto Principle
    • 80% of the run-time is spent in 20% of the source code

But in the few situations where a low-level loop must be implemented (because the task cannot be expressed using NumPy capabilities) or the functionality of an external library is needed, Cython arises as an alternative. Cython is a static compiler that works on a super-set of the Python language supporting C-like static type declarations. It compiles Python code to C, producing a Python module that can be imported and used from the interpreter. As noted by Behnel et al., the key idea behind Cython is the Pareto Principle, also known as the "80/20 rule": 80% of the run-time is spent in 20% of the source code. Cython’s goal is to speed up the critical parts of the code without imposing much extra coding effort on the programmer.


  • Other performance issues can be addressed by parallelization
  • IPython.parallel allows parallel and distributed computing
    • Single Program, Multiple Data (SPMD)
    • Multiple Program, Multiple Data (MPMD).
  • Parallel applications can be developed, executed and monitored from the IPython shell
  • Computer vision tasks can involve large sets of images or big point clouds
  • However, the parallelization of these tasks is often trivial
  • In IPython.parallel, those tasks can be implemented in a few lines of code

Other performance issues can be addressed by parallelization. IPython.parallel is a powerful architecture for parallel and distributed computing that supports different styles of parallelism, such as single program, multiple data (SPMD) and multiple program, multiple data (MPMD). Parallel applications can be easily developed, executed and monitored interactively from the IPython shell. Computer vision tasks can involve large sets of images or big point clouds, but parallelizing these tasks is often trivial and, with IPython.parallel, takes only a few lines of code. The dynamic load balancing feature allows the use of all the available processing threads in the computer, or all the processing power available in a cluster, while keeping the interactive computing environment free from large amounts of code specific to parallel computing.

Example 9 - Process a bundle of images in parallel

In this example, SIFT descriptors of a reference image $I_1$ are computed. Then, descriptors are extracted for every image $I_n$ in a list, and the matches to $I_1$ descriptors are computed. The processing of the list is done in parallel, using all the available cores in the user’s machine.

Let $D_1$ be an array containing the descriptors of $I_1$.

In [4]:
T1 = cv2.imread('data/templeRing/templeR0001.png', cv2.IMREAD_GRAYSCALE)
sift = cv2.xfeatures2d.SIFT_create(nfeatures=5000)
_, D_1 = sift.detectAndCompute(T1, mask=None)

In a system shell, an IPython cluster for parallel computing is started using:

ipcluster start --n=8

Eight engines (nodes) are started; in this example, the number of engines is selected based on the number of cores available in the user’s machine.
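
One possible way to pick that number is to ask the operating system how many logical cores it reports. The sketch below uses the standard library's `multiprocessing.cpu_count()` and only prints the corresponding `ipcluster` command line; the variable names are illustrative.

```python
import multiprocessing

# Number of logical cores reported by the operating system.
n_engines = multiprocessing.cpu_count()

# Command line that would start one engine per core.
command = "ipcluster start --n=%d" % n_engines
print(command)
```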

Note: if ipcluster is not available, it can be installed using, for example, pip:

$ pip install ipyparallel

Back in the IPython shell, the next step is the creation of a Client object. A LoadBalancedView object is then created to provide load-balanced parallel execution:

In [6]:
from ipyparallel import Client
rc = Client()
lview = rc.load_balanced_view()
  • The decorator @lview.parallel defines a parallel, load-balanced function
  • The arguments are:
    • The image file absolute path in the filesystem
    • The reference descriptor set, $D_1$
  • get_num_matches will:
    • read the image;
    • compute SIFT features and their descriptors;
    • perform matching using OpenCV's brute-force matcher, BFMatcher; and
    • return the number of matches found

Next, a Python decorator is used to define a parallel function that computes the descriptors and the matches (the decorator starts with a "@" symbol). The function below takes a path to an image in the file system, computes the SIFT features and uses OpenCV’s BFMatcher to get the matches to $D_1$, returning the number of matches found and the image’s path:

In [7]:
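@lview.parallel()
def get_num_matches(arg):
    fname, D_src = arg
    # The import happens inside the function because it runs on the remote engines
    import cv2
    frame = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    print(frame.shape)
    # Same SIFT constructor as the one used for the reference image
    sift = cv2.xfeatures2d.SIFT_create(nfeatures=5000)
    _, D = sift.detectAndCompute(frame, mask=None)
    # Brute-force matching; crossCheck=True keeps only mutual best matches
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = matcher.match(D_src, D)
    return fname, len(matches)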
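
To make the matching step more concrete, the following NumPy sketch mimics what brute-force matching with cross-checking does conceptually: a pair (i, j) is kept only when descriptor j is the nearest neighbour of i and i is the nearest neighbour of j under the L2 distance. It is an illustration on small random arrays, not OpenCV's implementation; the function name `cross_check_matches` and the toy data are assumptions.

```python
import numpy as np

def cross_check_matches(D_src, D_dst):
    """Return (i, j) index pairs that are mutual nearest neighbours (L2)."""
    # Pairwise squared Euclidean distances, shape (len(D_src), len(D_dst)).
    # This builds a dense 3-D array, so it is only suitable for small inputs.
    diff = D_src[:, np.newaxis, :] - D_dst[np.newaxis, :, :]
    dist2 = np.sum(diff * diff, axis=2)
    nn_src = np.argmin(dist2, axis=1)  # best D_dst index for each D_src row
    nn_dst = np.argmin(dist2, axis=0)  # best D_src index for each D_dst row
    return [(i, int(j)) for i, j in enumerate(nn_src) if nn_dst[j] == i]

rng = np.random.RandomState(1)
D_a = rng.rand(6, 128).astype(np.float32)  # six toy 128-D descriptors
# Shuffled copy with small noise: row k of D_b comes from D_a[perm[k]].
perm = rng.permutation(6)
D_b = D_a[perm] + 0.001 * rng.rand(6, 128).astype(np.float32)

matches = cross_check_matches(D_a, D_b)
print(len(matches), matches)
```

Since `D_b` is a noisy, shuffled copy of `D_a`, every descriptor finds its counterpart and the recovered pairs undo the permutation.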
  • File paths and $D_1$ are assembled in an arguments list
  • The map function starts the parallelized call
  • Load balancing is automatically performed

IPython’s ability to access the system shell is used to list all the files in a directory and store the file paths in a list of strings, fnames. Each path is then paired with $D_1$ in an argument list, args. Finally, the map function applies get_num_matches to every item in args, automatically balancing the load across the nodes:

In [10]:
fnames = !ls data/templeRing/temple*.png

args = [(fname, D_1) for fname in fnames]
async_res = get_num_matches.map(args)
In [33]:
for f, n in async_res:
    print('%s %d' % (f, n))
/tmp/templeRing/templeR0001.png 802
/tmp/templeRing/templeR0002.png 549
/tmp/templeRing/templeR0003.png 491
/tmp/templeRing/templeR0004.png 482
/tmp/templeRing/templeR0005.png 470
/tmp/templeRing/templeR0006.png 441
/tmp/templeRing/templeR0007.png 401
/tmp/templeRing/templeR0008.png 358
/tmp/templeRing/templeR0009.png 393
/tmp/templeRing/templeR0010.png 438
/tmp/templeRing/templeR0011.png 456
/tmp/templeRing/templeR0012.png 455
/tmp/templeRing/templeR0013.png 444
/tmp/templeRing/templeR0014.png 405
/tmp/templeRing/templeR0015.png 436
/tmp/templeRing/templeR0016.png 421
/tmp/templeRing/templeR0017.png 410
/tmp/templeRing/templeR0018.png 398
/tmp/templeRing/templeR0019.png 408
/tmp/templeRing/templeR0020.png 451
/tmp/templeRing/templeR0021.png 430
/tmp/templeRing/templeR0022.png 444
/tmp/templeRing/templeR0023.png 454
/tmp/templeRing/templeR0024.png 476
/tmp/templeRing/templeR0025.png 475
/tmp/templeRing/templeR0026.png 509
/tmp/templeRing/templeR0027.png 509
/tmp/templeRing/templeR0028.png 546
/tmp/templeRing/templeR0029.png 558
/tmp/templeRing/templeR0030.png 662
/tmp/templeRing/templeR0031.png 577
/tmp/templeRing/templeR0032.png 465
/tmp/templeRing/templeR0033.png 477
/tmp/templeRing/templeR0034.png 479
/tmp/templeRing/templeR0035.png 457
/tmp/templeRing/templeR0036.png 456
/tmp/templeRing/templeR0037.png 473
/tmp/templeRing/templeR0038.png 477
/tmp/templeRing/templeR0039.png 473
/tmp/templeRing/templeR0040.png 454
/tmp/templeRing/templeR0041.png 461
/tmp/templeRing/templeR0042.png 431
/tmp/templeRing/templeR0043.png 442
/tmp/templeRing/templeR0044.png 454
/tmp/templeRing/templeR0045.png 448
/tmp/templeRing/templeR0046.png 454
/tmp/templeRing/templeR0047.png 459
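
Because each result is a (path, count) pair, ordinary Python tools can post-process the output. The sketch below ranks a handful of the pairs listed above by number of matches; the list `results` is copied by hand from that output for illustration, and in a live session it could presumably be built with `list(async_res)` instead.

```python
# A few (path, count) pairs copied from the output above.
results = [
    ('/tmp/templeRing/templeR0002.png', 549),
    ('/tmp/templeRing/templeR0008.png', 358),
    ('/tmp/templeRing/templeR0030.png', 662),
    ('/tmp/templeRing/templeR0031.png', 577),
]

# Sort by match count, best first.
ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
for fname, n in ranked:
    print('%s %d' % (fname, n))
```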

This simple example exploits all the available cores in the local machine at the cost of only a few extra lines of code. The parallel computing capabilities in IPython go far beyond this, however, supporting SPMD and MPMD parallelism and the use of StarCluster for execution in Amazon’s Elastic Compute Cloud (EC2). The interested reader is referred to the section Using IPython for parallel computing in the IPython documentation.
