A Practical Guide to Performance in Research Code

"Premature optimization is the root of all evil" (Donald Knuth)

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

  • The most important goal when programming scientific code is to write code that produces correct output. (My opinion anyway.)
  • Second to that is to write the code quickly. That means you need to be able to tap in, debug and verify (intermediate) results as you go along and write your functions.
  • Thirdly, the code should be as readable as possible, because that allows other people and yourself to understand and modify it.
  • Only then should it be fast/efficient.

Optimized code is fast, but

Optimized code usually does not satisfy the above criteria. Although it might be correct in the end, it takes a very long time to write, because

  • debugging tools are not as good
  • code (logic) has arcane structure
  • code is longer (it's harder to keep track of and make alterations)

Exploiting the benefits of both

Develop in a high-level language (e.g. Python, Matlab) -> profile -> optimize only the bottlenecks or the "embarrassingly parallel/low-level" parts

This part of the tutorial is intended to be a practical approach to the development of efficient code using Python.
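Most of the timings below come from IPython's %timeit. For rough ad-hoc checks while developing, a minimal wall-clock timer can also be handy; this context-manager helper (`timed` is my own sketch, not part of the original notebook) is enough for a first look before reaching for %timeit or a profiler:

```python
from contextlib import contextmanager
from time import time

@contextmanager
def timed(label):
    # crude wall-clock timer for quick sanity checks while developing;
    # use %timeit and proper profilers for real measurements
    start = time()
    yield
    print("%s: %.4f s" % (label, time() - start))

with timed("sum of squares"):
    total = sum(i * i for i in range(100000))
```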

There are a few benchmarks out there, but they tend to be rather old. An interesting write-up (because it includes Numba, a relatively new approach) is

http://nbviewer.ipython.org/4271861/

Getting started

We load an example image and extract a patch from it. Then we show both.

For this tutorial we are going to compute the first part of the histogram of oriented gradients descriptor. (see http://en.wikipedia.org/wiki/Histogram_of_oriented_gradients)

In [1]:
%pylab inline

import numpy as np
import skimage.data as skiD
import skimage.color as clr
from time import time


img = skiD.lena()
img = clr.rgb2grey(img)

###################
p = img[100:200, 400:]

figure()
imshow(img, cmap='gray')
figure()
imshow(p, cmap='gray')

Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

Out[1]:
<matplotlib.image.AxesImage at 0x3948e50>

Naive Numpy Implementation

For this naive histogram of oriented gradients (HOG) implementation we compute

  • the gradient in x and y direction
  • the magnitude and angle of the gradient
  • the discretized gradient angles
  • the sum of the magnitudes for each angle bin over the image
In [2]:
def calcHOGNaive(img, binSize):
    # calculate gradient
    dx, dy = np.gradient(img)
    
    # calculate magnitude of gradient
    mag = np.sqrt(dx ** 2 + dy ** 2).flatten()
    
    # calculate angles and round them into bins
    ## plain angle 0 .. 2pi
    # >>>> angle = np.arctan2(dy, dx) + np.pi)
    ## normalization to 0 .. binSize-0.000000000001
    # >>>> normAngle = angle / (2 * np.pi + np.spacing(10000)) * binSize)
    ## binning and flatting
    # >>>> np.floor(normAngle).flatten()    
    
    ang = (np.floor((np.arctan2(dy, dx) + np.pi) \
            / (2 * np.pi + np.spacing(10000)) * binSize)).flatten()
    
    # initialize histogram
    hog = np.zeros((binSize, ))
    
    # fill histogram
    for i in range(ang.shape[0]):
        hog[int(ang[i])] += mag[i]  # cast to int: ang holds floats
    
    # normalize
    return hog / ang.shape[0]
In [3]:
hog = calcHOGNaive(p, 9)
%timeit calcHOGNaive(p, 9)
10 loops, best of 3: 28.3 ms per loop
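Before dropping to compiled code, note that the filling loop can also be vectorized in pure numpy. Here is a sketch using np.bincount (the function name calcHOGBincount is mine; it is not part of the benchmarks below):

```python
import numpy as np

def calcHOGBincount(img, binSize):
    # same gradient, magnitude and binning as calcHOGNaive
    dx, dy = np.gradient(img)
    mag = np.sqrt(dx ** 2 + dy ** 2).ravel()
    ang = np.floor((np.arctan2(dy, dx) + np.pi)
                   / (2 * np.pi + np.spacing(10000)) * binSize).astype(int).ravel()
    # bincount sums the weights (magnitudes) falling into each bin,
    # replacing the explicit Python loop entirely
    hog = np.bincount(ang, weights=mag, minlength=binSize)
    return hog / ang.size
```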

Profiling

http://pynash.org/2013/03/06/timing-and-profiling.html

Profile function calls using IPython

In [4]:
%prun -s cumulative calcHOGNaive(p, 9)
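%prun is IPython-only; outside a notebook the same information is available from the standard library's cProfile module. A minimal sketch (profiling a toy function `sumOfSquares` of my own, not the HOG code):

```python
import cProfile
import pstats
import io

def sumOfSquares(n):
    # toy workload, just something to profile
    return sum(i * i for i in range(n))

pr = cProfile.Profile()
pr.enable()
sumOfSquares(100000)
pr.disable()

# print the five most expensive entries, cumulatively sorted
# as with %prun -s cumulative
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats('cumulative').print_stats(5)
print(s.getvalue())
```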


Profile lines using line_profiler

https://github.com/certik/line_profiler

Since the function above does not have many function calls and spends most of its time within itself, we need to analyze how long each line takes to compute.

This profiler does not give sensible output when used on notebook cells, therefore I just put the function in a module and import it from there.

This is an example of using the line_profiler magic.

In [5]:
import pyTools.sandbox.lineProfExample as lPE

%load_ext line_profiler_ext
%lprun -f lPE.calcHOGNaive lPE.calcHOGNaive(p, 9)


Alternatively, one can also use the API to profile a function

In [6]:
import line_profiler
lP = line_profiler.LineProfiler(lPE.calcHOGNaive)
res = lP.runcall(lPE.calcHOGNaive, p, 9)
lP.print_stats()
Timer unit: 1e-06 s

File: /home/peter/code/pyTools/sandbox/lineProfExample.py
Function: calcHOGNaive at line 3
Total time: 0.153876 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           def calcHOGNaive(img, binSize):
     4                                               # calculate gradient
     5         1         1771   1771.0      1.2      dx, dy = np.gradient(img)
     6                                               
     7                                               # calculate magnitude of gradient
     8         1          535    535.0      0.3      mag = np.sqrt(dx ** 2 + dy ** 2).flatten()
     9                                               
    10                                               # calculate angles and round them into bins
    11                                               ## plain angle 0 .. 2pi
    12                                               # >>>> angle = np.arctan2(dy, dx) + np.pi)
    13                                               ## normalization to 0 .. binSize-0.000000000001
    14                                               # >>>> normAngle = angle / (2 * np.pi + np.spacing(10000)) * binSize)
    15                                               ## binning and flatting
    16                                               # >>>> np.floor(normAngle).flatten()    
    17                                               
    18         1            6      6.0      0.0      ang = (np.floor((np.arctan2(dy, dx) + np.pi) \
    19         1         3129   3129.0      2.0              / (2 * np.pi + np.spacing(10000)) * binSize)).flatten()
    20                                               
    21                                               # initialize histogram
    22         1           15     15.0      0.0      hog = np.zeros((binSize, ))
    23                                               
    24                                               # fill histogram
    25     11201        36855      3.3     24.0      for i in range(ang.shape[0]):
    26     11200       111471     10.0     72.4          hog[ang[i]] += mag[i]
    27                                               
    28                                               # normalize
    29         1           94     94.0      0.1      return hog / ang.shape[0]


Embedding C++ to speed up the loop

Loops are the obvious performance bottlenecks. In this case it is mostly type-checking that slows this loop down; if functions were called inside it, there would likely be some range-checking overhead, too.

Let's just embed the loop in C++ and compute the rest in numpy. It could quickly get messy if we converted the rest into C++, because each line of numpy code would expand into two nested loops. Sometimes, however, it might pay off to put even vectorized numpy expressions into C. That is usually the case when several operations are performed sequentially and temporary results are reused. In such cases the fact that the data stays in the L1 or L2 cache can speed up the program enormously.

In [7]:
#import scipy.weave as weave

def calcHOGWeaveCpp(img, binSize):
    import scipy.weave as weave
    import numpy as np
    dx, dy = np.gradient(img)
    
    mag = np.sqrt(dx ** 2 + dy ** 2).flatten()
    ang = (np.floor((np.arctan2(dy, dx) + np.pi) \
            / (2 * np.pi + np.spacing(10000)) * binSize)).flatten()
        
    hog = np.zeros((binSize, ))
    
#    for i in range(ang.shape[0]):
#        hog[ang[i]] += mag[i]
    code = """
    int pos;
    for (int i = 0; i < Nang[0]; ++i){
        pos = ang[i];
        hog[pos] += mag[i];
    }
    """    
    weave.inline(code, ['ang', 'hog', 'mag'], 
                 extra_compile_args=['-O3'], compiler='gcc')
    
    return hog / ang.shape[0]

hogCpp = calcHOGWeaveCpp(p, 9)
%timeit calcHOGWeaveCpp(p, 9)
print "numpy and cpp version are the same: ", np.allclose(hog, hogCpp)
100 loops, best of 3: 4.83 ms per loop
numpy and cpp version are the same:  True
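As an aside, newer numpy versions expose the same unbuffered scatter-add directly as np.add.at. It is typically slower than the weave version but needs no compiler; a sketch (the helper name fillHistAddAt is mine):

```python
import numpy as np

def fillHistAddAt(hog, ang, mag):
    # np.add.at performs an unbuffered in-place scatter-add, so repeated
    # indices accumulate correctly; a plain hog[ang] += mag would apply
    # each duplicated index only once per fancy-indexing pass
    np.add.at(hog, ang.astype(int), mag)
    return hog
```

The duplicate-index behavior of `hog[ang] += mag` is exactly the pitfall np.add.at exists to avoid.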

Using Fortran subroutine to speed up computations after gradient

The advantage of Fortran is not speed (optimized C code can be slightly faster) but the simplicity of its syntax, which is very close to Matlab/numpy expressions. However, the variable declarations can easily make it look cumbersome, much of the structural syntax in plain Fortran is quite verbose, and debugging support is limited. In combination with Python, the logic makes up the biggest part, so Fortran can be a fast way of converting already vectorized numpy expressions into a compiled language, gaining impressive speed-ups.

code

For simplicity this code implements only the easy part, namely the part that relies on Fortran's standard library and can be almost copy-pasted.

.. code-block:: fortran

!file hog.f90

module hog

contains 

subroutine angHist(dx, dy, binSize, hist)
  ! compute simple hog from gradients
  implicit none
  ! I/O
  real(kind=8), dimension(:,:), intent(in) :: dx, dy
  integer, intent(in) :: binSize
  real(kind=8), dimension(binSize), intent(out) :: hist

  ! dummy variables
  real(kind=8), dimension(size(dx,1), size(dx,2)) :: mag
  integer, dimension(size(dx,1), size(dx,2)) :: ang
  real(kind=8) :: pi, eps
  integer :: i, m, p

  ! constants
  pi = 3.14159265358979323846
  eps = 0.0000000000000000001


  ! magnitude and angles
  mag = sqrt(dx ** 2 + dy ** 2)
  ang = floor((atan2(dy, dx) + pi) / (2 * pi + eps) * binSize)

  ! initialize hist as zero
  hist = 0.0_8

  do i=1,size(ang,1)
    do m=1,size(ang,2)
      p = ang(i,m) + 1 ! fortran counts from 1 to end (incl)
      hist(p) = hist(p) + mag(i,m)
    enddo
  enddo

  ! normalize hist
  hist = hist / (size(ang,1) * size(ang,2))

end subroutine angHist

end module hog

Compilation

$f2py -c -m hog hog.f90
In [8]:
import pyTools.imgProc.hog as fHog

dx, dy = np.gradient(p)
hogF = fHog.hog.anghist(dx, dy, 9)
%timeit dx, dy = np.gradient(p); fHog.hog.anghist(dx, dy, 9)

print "Naive and Fortran (1) equal: ", np.allclose(hog, hogF)
100 loops, best of 3: 1.42 ms per loop
Naive and Fortran (1) equal:  True

Fortran code including gradient calculation

We can also go a step further and replicate numpy's gradient function in Fortran. This is not much work, as the syntax is almost the same.

You can also find the code at https://gist.github.com/groakat/5775099

numpy code

https://github.com/numpy/numpy/blob/v1.7.0/numpy/lib/function_base.py#L829

for axis in range(N):
    # select out appropriate parts for this dimension
    out = np.empty_like(f, dtype=otype)
    slice1[axis] = slice(1, -1)
    slice2[axis] = slice(2, None)
    slice3[axis] = slice(None, -2)
    # 1D equivalent -- out[1:-1] = (f[2:] - f[:-2])/2.0
    out[slice1] = (f[slice2] - f[slice3])/2.0
    slice1[axis] = 0
    slice2[axis] = 1
    slice3[axis] = 0
    # 1D equivalent -- out[0] = (f[1] - f[0])
    out[slice1] = (f[slice2] - f[slice3])
    slice1[axis] = -1
    slice2[axis] = -1
    slice3[axis] = -2
    # 1D equivalent -- out[-1] = (f[-1] - f[-2])
    out[slice1] = (f[slice2] - f[slice3])

    # divide by step size
    outvals.append(out / dx[axis])

    # reset the slice object in this dimension to ":"
    slice1[axis] = slice(None)
    slice2[axis] = slice(None)
    slice3[axis] = slice(None)
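In one dimension this scheme reduces to central differences in the interior and one-sided differences at the two ends, which is what the Fortran code below replicates. A quick check:

```python
import numpy as np

f = np.array([0.0, 1.0, 4.0, 9.0, 16.0])  # f(x) = x**2 sampled at x = 0..4
g = np.gradient(f)
# interior points: central differences (f[i+1] - f[i-1]) / 2.0
# end points: one-sided differences f[1] - f[0] and f[-1] - f[-2]
print(g)  # [1. 2. 4. 6. 7.]
```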

code

.. code-block:: fortran90

subroutine angHist2(img, binSize, hist) 
! compute simple hog from image
implicit none
! I/O variables
real(kind=8), dimension(:,:), intent(in) :: img
integer, intent(in) :: binSize
real(kind=8), dimension(binSize), intent(out) :: hist

! dummy variables
real(kind=8), dimension(size(img,1),size(img,2)) :: dx, dy, mag
integer, dimension(size(img,1),size(img,2)) :: ang
real(kind=8) :: pi, eps
integer :: i, m, p, l

! constants
pi = 3.14159265358979323846
eps = 0.0000000000000000001

! initialize hist
hist = 0.0_8

! compute dx as numpy.gradient does
dx(1,:) = img(2,:) - img(1,:)
do i=3,size(img,1)
  dx(i-1,:) = (img(i,:) - img(i-2,:)) / 2.0
enddo
l = size(img,1)
dx(l,:) = img(l,:) - img(l-1,:)

! compute dy as numpy.gradient does
dy(:,1) = img(:,2) - img(:,1)
do i=3,size(img,2)
  dy(:, i-1) = (img(:,i) - img(:,i-2)) / 2.0
enddo
l = size(img,2)
dy(:,l) = img(:,l) - img(:,l-1)    

! compute magnitude and angle of gradients  
mag = sqrt(dx ** 2 + dy ** 2)
ang = floor((atan2(dy, dx) + pi) / (2 * pi + eps) * binSize)

! bin them
do i=1,size(ang,1)
  do m=1,size(ang,2)
    p = ang(i,m) + 1 ! fortran counts from 1 to end (incl)
    hist(p) = hist(p) + mag(i,m)
  enddo
enddo

! normalize hist
hist = hist / (size(ang,1) * size(ang,2))

end subroutine angHist2

Compilation

$f2py -c -m hog hog.f90
In [9]:
import pyTools.imgProc.hog as fHog

hogF2 = fHog.hog.anghist2(p, 9)
%timeit fHog.hog.anghist2(p, 9)

print "Naive and Fortran (2) equal: ", np.allclose(hog, hogF2)
1000 loops, best of 3: 1.19 ms per loop
Naive and Fortran (2) equal:  True

Numba

Numba uses LLVM to compile the code at runtime. Here the overhead of the compilation is relatively large compared with the length of the executed code. This compiler seems to be intended for completely "unrolled", non-vectorized functions, i.e. code that looks more like a C program.

In [10]:
from numba.decorators import jit
from numba import float64, int32

@jit(argtypes=[float64[:,:], int32])
def calcHOGNumba(img, binSize):
    # calculate gradient
    d = np.gradient(img)
    
    dx = d[0]
    dy = d[1]
    
    # calculate magnitude of gradient
    mag = np.sqrt(dx ** 2 + dy ** 2).flatten()
    
    # calculate angles and round them into bins   
    
    ang = (np.floor((np.arctan2(dy, dx) + np.pi) \
            / (2 * np.pi + np.spacing(10000)) * binSize)).flatten()
    
    # initialize histogram
    hog = np.zeros((binSize, ))
    
    # fill histogram
    for i in range(ang.shape[0]):
        hog[ang[i]] += mag[i]
    
    # normalize
    return hog / ang.shape[0]

hogN = calcHOGNumba(p, 9)
%timeit calcHOGNumba(p, 9)
print "Naive and Numba equal: ", np.allclose(hog, hogN)
10 loops, best of 3: 70.2 ms per loop
Naive and Numba equal:  True

The LLVM JIT struggles to compile the numpy functions quickly, and since the compiled output is not cached this is a real overhead. The following code shows that Numba can be competitive with weave/C++ if only the loop is compiled.

In [11]:
from numba.decorators import jit
from numba import float64, int32

@jit(argtypes=[float64[:], int32[:], float64[:]])
def fillHist(hog, ang, mag): 
    # fill histogram
    for i in range(ang.shape[0]):
        hog[ang[i]] += mag[i]
    
    return hog
        
        
def calcHOGNumba2(img, binSize):
    # calculate gradient
    d = np.gradient(img)
    
    dx = d[0]
    dy = d[1]
    
    # calculate magnitude of gradient
    mag = np.sqrt(dx ** 2 + dy ** 2).flatten()
    
    # calculate angles and round them into bins   
    
    ang = np.int32(np.floor((np.arctan2(dy, dx) + np.pi) \
            / (2 * np.pi + np.spacing(10000)) * binSize)).flatten()
    
    # initialize histogram
    hog = np.zeros((binSize, ))
    
    hog = fillHist(hog, ang, mag)
    
    # normalize
    return hog / ang.shape[0]

hogN2 = calcHOGNumba2(p, 9)
%timeit calcHOGNumba2(p, 9)
print "Naive and Numba equal: ", np.allclose(hog, hogN2)
100 loops, best of 3: 4.88 ms per loop
Naive and Numba equal:  True

Final consistency check

In [12]:
p = img[100:200, 400:]
dx, dy = np.gradient(p)
hog = calcHOGNaive(p, 9)
hogCpp = calcHOGWeaveCpp(p, 9)
hogF = fHog.hog.anghist(dx, dy, 9)
hogF2 = fHog.hog.anghist2(p, 9)
hogN = calcHOGNumba(p, 9)
hogN2 = calcHOGNumba2(p, 9)

print "all results are equal:", all([np.allclose(hog, hogCpp), 
                                     np.allclose(hog, hogF), 
                                     np.allclose(hog, hogF2), 
                                     np.allclose(hog, hogN), 
                                     np.allclose(hog, hogN2)])
all results are equal: True

Benchmark Summary

In [13]:
# patch 100x112
p = img[100:200, 400:]
dx, dy = np.gradient(p)
print "naive"
%timeit -n 100 calcHOGNaive(p, 9)
print "\nweave cpp"
%timeit -n 100 calcHOGWeaveCpp(p, 9)
print "\nfortran (1)"
%timeit -n 100 dx, dy = np.gradient(p); fHog.hog.anghist(dx, dy, 9)
print "\nfortran (2)"
%timeit -n 100 fHog.hog.anghist2(p, 9)
print "\nnumba (1)"
%timeit -n 100 calcHOGNumba(p, 9)
print "\nnumba (2)"
%timeit -n 100 calcHOGNumba2(p, 9)
naive
100 loops, best of 3: 40.4 ms per loop

weave cpp
100 loops, best of 3: 4.79 ms per loop

fortran (1)
100 loops, best of 3: 4.1 ms per loop

fortran (2)
100 loops, best of 3: 2.88 ms per loop

numba (1)
100 loops, best of 3: 62.1 ms per loop

numba (2)
100 loops, best of 3: 4.79 ms per loop

In [14]:
# over entire image
print "naive"
%timeit -n 100 calcHOGNaive(img, 9)
print "\nweave cpp"
%timeit -n 100 calcHOGWeaveCpp(img, 9)
print "\nfortran (1)"
%timeit -n 100 dx, dy = np.gradient(img); fHog.hog.anghist(dx, dy, 9)
print "\nfortran (2)"
%timeit -n 100 fHog.hog.anghist2(img, 9)
print "\nnumba (1)"
%timeit -n 100 calcHOGNumba(img, 9)
print "\nnumba (2)"
%timeit -n 100 calcHOGNumba2(img, 9)
naive
100 loops, best of 3: 406 ms per loop

weave cpp
100 loops, best of 3: 31.4 ms per loop

fortran (1)
100 loops, best of 3: 39.7 ms per loop

fortran (2)
100 loops, best of 3: 36.3 ms per loop

numba (1)
100 loops, best of 3: 470 ms per loop

numba (2)
100 loops, best of 3: 31.4 ms per loop

Parallelism

IPython Parallel

Let's compute the HOG descriptors for non-overlapping patches of the image.

Start some cluster engines and get socket connections to them

In [15]:
from IPython.parallel import Client
# open clients
rc = Client()
print rc.ids
[0, 1, 2, 3]

Get a direct view of the remote clients (used later for clean-up) and a load-balanced view (used to asynchronously add jobs to the cluster engines)

In [16]:
dview = rc[:]        
lbview = rc.load_balanced_view()

Now we are ready to put work on the cluster with

lbview.apply_async(func, **args)

But we need some data to feed it. In our example we split the image into parts and process them individually.

In [17]:
splits = np.split(img, 16)

results = []
for split in splits:
    results += [lbview.apply_async(calcHOGWeaveCpp, split, 9)]

We could also submit jobs calling other functions to the cluster. Now, however, we have to wait until all results have come in before continuing.

In [18]:
from time import sleep

allReady = False
while not allReady:
    allReady = True
    for ar in results:
        allReady = allReady and ar.ready()
                
    sleep(0.01)
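The same submit-and-collect pattern is also available in the standard library via concurrent.futures (standard in Python 3, available for Python 2 as the `futures` backport). A minimal sketch with threads and a stand-in workload `patchMean` (my own toy function, since the compiled HOG code is not needed to show the pattern; for CPU-bound pure-Python loops you would use ProcessPoolExecutor instead, though numpy releases the GIL in many operations):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def patchMean(patch):
    # stand-in workload; any function of a patch works here
    return patch.mean()

img = np.arange(64.0).reshape(8, 8)
splits = np.split(img, 4)

with ThreadPoolExecutor(max_workers=4) as ex:
    futures = [ex.submit(patchMean, s) for s in splits]
    # result() blocks until the corresponding job is done, so no
    # explicit ready()/sleep() polling loop is needed
    results = [float(f.result()) for f in futures]

print(results)  # [7.5, 23.5, 39.5, 55.5]
```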

Now we copy the data back and clean up on the cluster (data on the cluster is not deleted/garbage-collected automatically and would otherwise cause a memory leak)

In [19]:
hogs = []

for ar in results:
    # copy data
    data = ar.get()
    hogs += [data]
    # delete data from cluster
    msgId = ar.msg_id
    #~ del lbview.results[msgId]
    del rc.results[msgId]
    del rc.metadata[msgId]
    

# close client to close socket connections
rc.close()

General Threading using Qt

Normally, one should do this by subclassing QObject and using moveToThread() as described in the article http://labs.qt.nokia.com/2007/07/05/qthreads-no-longer-abstract/ . But that did not work for me, because there needs to be a Qt main loop running (if I remember correctly). Anyway, the old-style subclassing of QThread still works.

The BaseThread idea comes from http://pythonchem.blogspot.co.uk/2012/06/threaded-pyqt-gui-and-particles.html .

This QThread example replicates the IPython Parallel interface, but it creates a thread for each object, which might not be desired. It is an untested toy example; I cannot guarantee that it will not break in some circumstances.

In [20]:
from PyQt4.QtCore import * 

class BaseThread(QThread):
    """ Provides infrastructure for safe deletion of the thread.
        Check self.exiting frequently in heavy operations and
        abort execution of run() to allow the thread to get
        deleted.
    """
    def __init__(self):
        QThread.__init__(self)
        self.exiting = False

    def __del__(self):     
        self.exiting = True
        self.wait()
        
class Worker(BaseThread):            
    def __init__(self, func, *args):
        """
        This object will execute `func` with `args` in a
        separate thread. You can query ready() to check
        if processing has finished and get() to obtain the result.

        Args:
            func (function pointer)
                        function that will be called asynchronously
            args (arguments)
                        arguments for func
        """
        BaseThread.__init__(self)   
        
        self.func = func
        self.args = args        
        self.result = None
        self.isReady = False
    
        self.start()
    
    def run(self):
        self.result = self.func(*self.args)
        self.isReady = True
            
    def ready(self):
        return self.isReady
    
    def get(self):
        if self.isReady:
            return self.result
        else:
            return None
In [21]:
w = Worker(calcHOGWeaveCpp, p, 9)
w.ready()
Out[21]:
False
In [22]:
working = True
while working:
    isReady = w.ready()
    if isReady:
        res = w.get()
        working = False
        
        # clean up, otherwise the object will
        # stay in memory forever (similar to
        # clients in IPython parallel)
        del w
        
print res
[ 0.00134604  0.00474881  0.00305021  0.00114426  0.00127909  0.00407113
  0.00395126  0.00130679  0.00104403]
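For completeness, the same fire-and-collect worker can be built on the standard library's threading module without Qt. A sketch (the class name ThreadWorker is mine, chosen to avoid clashing with the Qt Worker above):

```python
import threading

class ThreadWorker(threading.Thread):
    """Plain-threading analogue of the Qt Worker above."""
    def __init__(self, func, *args):
        threading.Thread.__init__(self)
        self.func = func
        self.args = args
        self.result = None
        self._done = threading.Event()
        self.start()

    def run(self):
        self.result = self.func(*self.args)
        self._done.set()

    def ready(self):
        return self._done.is_set()

    def get(self, timeout=None):
        # unlike the Qt version this blocks until the result is
        # available, which removes the need for a polling loop
        self._done.wait(timeout)
        return self.result

w = ThreadWorker(sum, [1, 2, 3])
print(w.get())  # 6
```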

Appendix: an example of how Python's *args syntax works

In [23]:
def test(a, *args):
    print a
    print args
    test2(*args)
    
def test2(a,b):
    print a
    print b
In [24]:
test(4,5,6)
4
(5, 6)
5
6