#!/usr/bin/env python
# coding: utf-8

# I had the pleasure of attending a [workshop](http://calcul.math.cnrs.fr/spip.php?article284) given by the [groupe calcul (CNRS)](http://calcul.math.cnrs.fr) this week. The topic was: how do you optimize the execution speed of your Python code, under the hypothesis that you already tried to make it fast using NumPy? 
# 
# The training was held over three days and presented three interesting ways to achieve speedups: [Cython](http://cython.readthedocs.io/en/latest/), [pythran](https://pythonhosted.org/pythran/) and [numba](http://numba.pydata.org).
# 
# Overall, the workshop was great. The goal of this blog post is to summarize some of the key insights that I learnt while using these three tools on an practical application: image filtering.

# This post uses the following versions of the libraries:

# In[1]:


import cython; print(cython.__version__)
import pythran; print(pythran.__version__)
import numba; print(numba.__version__)


# # Introducing the discrete Laplacian 

# When working with images, the [discrete Laplacian operator](https://en.wikipedia.org/wiki/Discrete_Laplace_operator) is often used for edge detection.
# 
# Intuitively, if you want to find the edges of an image, you compute the Laplacian and threshold it to see the edges appear. Let's give an example using scikit-image. 
# 
# First, let's load a standard grayscale image that ships with scikit-image, the `astronaut`.

# In[2]:


import skimage.data
import skimage.color


# In[3]:


image = skimage.data.astronaut()
image = skimage.color.rgb2gray(image)
image.shape


# Now, let's write a function that uses the scikit-image implementation of the discrete Laplace operator and thresholds the result with a fixed threshold. 

# In[4]:


from skimage.filters import laplace
import numpy as np


# In[5]:


def laplace_skimage(image):
    """Applies Laplace operator to 2D image using skimage implementation. 
    Then tresholds the result and returns boolean image."""
    laplacian = laplace(image)
    thresh = np.abs(laplacian) > 0.05
    return thresh


# Let's apply the function and check its result.

# In[6]:


edges = laplace_skimage(image)


# In[7]:


edges.shape


# Let's display the obtained edges, along with the original image.

# In[8]:


import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')


# In[9]:


def compare(left, right):
    """Compares two images, left and right."""
    fig, ax = plt.subplots(1, 2, figsize=(10, 5))
    ax[0].imshow(left, cmap='gray')
    ax[1].imshow(right, cmap='gray')


# In[10]:


compare(left=image, right=edges)


# Since the topic of this post is to deal with optimization, we will now suppose that we're not using scikit-image. 
# 
# As is usually the case when you're working on your own problems, we'll implement the Laplacian algorithm by ourselves using NumPy.

# # NumPy implementation 

# So here we go: we write a function that makes use of vectorized NumPy to perform the same operation than the scikit-image implementation. 

# In[11]:


def laplace_numpy(image):
    """Applies Laplace operator to 2D image using our own NumPy implementation. 
    Then tresholds the result and returns boolean image."""
    laplacian = image[:-2, 1:-1] + image[2:, 1:-1] + image[1:-1, :-2] + image[1:-1, 2:] - 4*image[1:-1, 1:-1]
    thresh = np.abs(laplacian) > 0.05
    return thresh


# If you're paying attention closely, we're not actually implementing exactly the same function as scikit-image, since we're excluding all border pixels of the image from our calculation. 
# 
# Let's check the result is qualitatively correct.

# In[12]:


laplace_numpy(image).shape


# In[13]:


compare(image, laplace_numpy(image))


# And let's also check that our implementation quantitavely agrees with scikit-image on every interior pixel.

# In[14]:


np.allclose(laplace_skimage(image)[1:-1, 1:-1], laplace_numpy(image))


# Good, we're now all set for the next step. Let's pretend that we think our implementation is slow. How do we objectively measure that? And then: how do we make it faster?

# # Profiling and timing 

# So, we think our NumPy version is slow. How can we put a number on that feeling? The first idea is to time the execution on our test image. We can easily do this with the `%time` line magic.

# In[15]:


get_ipython().run_line_magic('time', 'laplace_numpy(image)')


# Mmmh, around 5 milliseconds per call. If we run this multiple times, the results slightly vary.

# Therefore, it is better to use the  `%timeit` line magic, which averages the timing operation and gives us a more accurate measurement.

# In[16]:


get_ipython().run_line_magic('timeit', 'laplace_numpy(image)')


# How does the scikit-image implementation compare to that?

# In[17]:


get_ipython().run_line_magic('timeit', 'laplace_skimage(image)')


# Approximately the same, albeit a little slower. 
# 
# So, our function is too slow for say, our realtime application. To make it go faster, we first need to have an idea about what takes time in it.
# Our function works in two steps: first we compute the laplacian and then we threshold the results. Which part of the function is taking the most time? 
# 
# To answer that question, we use a profiler. Many different profilers exit. We'll look at three below: %prun, %lprun and pprofile.

# Using %prun gives us a high level view of what's happening inside our call.

# In[18]:


r = get_ipython().run_line_magic('prun', '-r laplace_numpy(image)')
r.print_stats()


# This is not very informative. But we understand that most of the time (column tottime) is spent inside our laplace_numpy function.

# What about %lprun?

# In[19]:


get_ipython().run_line_magic('load_ext', 'line_profiler')


# In[20]:


r = get_ipython().run_line_magic('lprun', '-r -f laplace_numpy laplace_numpy(image)')
r.print_stats()


# Running `%lprun -r -f laplace_numpy laplace_numpy(image)` means that we aske `%lprun` to show how long each line took to execute inside the function `laplace_numpy` and return the result as an object.
# 
# Here, we can answer the earlier question: computing the laplacian is what takes the most time in this function.

# Let's see if [`pprofile`](https://github.com/vpelletier/pprofile) confirms that measurement.

# In[21]:


import pprofile
def func_to_profile():
    prof = pprofile.Profile()
    with prof():
        laplace_numpy(image)
    prof.print_stats()


# In[22]:


func_to_profile()


# pprofile confirms the previous assessment: it's the laplacian computation that takes time. 
# 
# What's more, pprofile is a little bit simpler to use than line_profiler. However, it doesn't feature a cell magic (although there's already someone asking for this [here](https://github.com/vpelletier/pprofile/issues/21)).

# At this point, we know that to make our function faster, the first thing to optimize is the computation of the laplacian. Let's try that using Cython.

# # Cython 

# What Cython does, like the other tools we'll see below, is to compile our Python code into C code that will hopefully be very fast. 
# Since Cython has a so-called IPython magic function, we'll be able to continue working in the notebook. 
# If we use the `-a` option with the Cython magic, we can obtain an annotated version of our source code that shows where Cython uses C code and where it uses Python objects.

# In[23]:


get_ipython().run_line_magic('load_ext', 'cython')


# In[24]:


get_ipython().run_cell_magic('cython', '-a', 'import numpy as np\n   \ndef laplace_cython(image):\n    """Applies Laplace operator to 2D image, then tresholds the result and returns boolean image.\n    Cython implementation."""\n    laplacian = image[:-2, 1:-1] + image[2:, 1:-1] + image[1:-1, :-2] + image[1:-1, 2:] - 4*image[1:-1, 1:-1]\n    thresh = np.abs(laplacian) > 0.05\n    return thresh\n')


# What the above code shows is that Cython couldn't optimize our function much, as indicated by the yellow hue of the lines above. The yellow intensity in each line shows the amount of interaction with Python objects that it found. The more you have to use Python objects, the less speedup you are likely to achieve. 
# 
# Let's time this first compiled version.

# In[25]:


get_ipython().run_line_magic('timeit', 'laplace_cython(image)')


# It performs roughly at the same speed that our NumPy version. So there's no much use in switching to Cython. Can we do better? The key is to eliminate interaction with Python through different ways. A first one is to declare our input image is an array.

# In[26]:


get_ipython().run_cell_magic('cython', '-a', 'import numpy as np\ncimport numpy as cnp\n\ndef laplace_cython(cnp.ndarray image):\n    """Applies Laplace operator to 2D image, then tresholds the result and returns boolean image.\n    Cython implementation."""\n    laplacian = image[:-2, 1:-1] + image[2:, 1:-1] + image[1:-1, :-2] + image[1:-1, 2:] - 4*image[1:-1, 1:-1]\n    thresh = np.abs(laplacian) > 0.05\n    return thresh\n')


# In[27]:


get_ipython().run_line_magic('timeit', 'laplace_cython(image)')


# The timing didn't improve. It turns, if you click on the yellow lines, that Cython is not able to optimize the broadcasting operation (and its fancy bracketing) that we usually perform with NumPy. To allow Cython to optimize, let's write an explicit loop over the pixels in the image.

# In[28]:


get_ipython().run_cell_magic('cython', '-a', 'import numpy as np\ncimport numpy as cnp\n\ndef laplace_cython(cnp.ndarray image):\n    """Applies Laplace operator to 2D image, then tresholds the result and returns boolean image.\n    Cython implementation."""\n    cdef int h = image.shape[0]\n    cdef int w = image.shape[1]\n    laplacian = np.empty((w-2, h-2))\n    cdef int i, j\n    for i in range(1, h-1):\n        for j in range(1, w-1):\n            laplacian[i-1, j-1] = image[i-1, j] + image[i+1, j] + image[i, j-1] + image[i, j+1] - 4*image[i, j]\n    thresh = np.abs(laplacian) > 0.05\n    return thresh\n')


# In[29]:


get_ipython().run_line_magic('timeit', 'laplace_cython(image)')


# Woops! The function we just wrote is 100x times slower than our NumPy implementation! The reason can be traced to the core of our function: the sum of the five pixel bound terms should be free from any Python object conversions. However, that line is all yellow and if you click it, you'll see it's full of Python object conversions. 
# 
# How do we solve this problem? The answer is: we need to pass the array as a buffer that can be accessed at C speeds. Let's see:

# In[30]:


get_ipython().run_cell_magic('cython', '-a', 'import numpy as np\ncimport numpy as cnp\n\ndef laplace_cython(cnp.ndarray[double, ndim=2] image):\n    """Applies Laplace operator to 2D image, then tresholds the result and returns boolean image.\n    Cython implementation."""\n    cdef int h = image.shape[0]\n    cdef int w = image.shape[1]\n    cdef cnp.ndarray[double, ndim=2] laplacian = np.empty((w-2, h-2), dtype=np.double)\n    cdef int i, j\n    for i in range(1, h-1):\n        for j in range(1, w-1):\n            laplacian[i-1, j-1] = image[i-1, j] + image[i+1, j] + image[i, j-1] + image[i, j+1] - 4*image[i, j]\n    thresh = np.abs(laplacian) > 0.05\n    return thresh\n')


# In[31]:


get_ipython().run_line_magic('timeit', 'laplace_cython(image)')


# Finally! We have made our function faster by a factor of 2.

# But can we go further? Yes, by observing that on the summation line, Cython still performs some bounds checks.

# In[32]:


get_ipython().run_cell_magic('cython', '-a', 'import numpy as np\ncimport numpy as cnp\nimport cython\n\n@cython.boundscheck(False) # turn off bounds-checking for entire function\n@cython.wraparound(False)  # turn off negative index wrapping for entire function\ndef laplace_cython(cnp.ndarray[double, ndim=2] image):\n    """Applies Laplace operator to 2D image, then tresholds the result and returns boolean image.\n    Cython implementation."""\n    cdef int h = image.shape[0]\n    cdef int w = image.shape[1]\n    cdef cnp.ndarray[double, ndim=2] laplacian = np.empty((w-2, h-2), dtype=np.double)\n    cdef int i, j\n    for i in range(1, h-1):\n        for j in range(1, w-1):\n            laplacian[i-1, j-1] = image[i-1, j] + image[i+1, j] + image[i, j-1] + image[i, j+1] - 4*image[i, j]\n    thresh = np.abs(laplacian) > 0.05\n    return thresh\n')


# In[33]:


get_ipython().run_line_magic('timeit', 'laplace_cython(image)')


# In[34]:


np.allclose(laplace_numpy(image), laplace_cython(image))


# What we can notice in the above function is that we have gotten rid completely of interaction in the for loop, which leads to a 4x increase in performance over our NumPy implementation.

# Let's now turn to pythran.

# # Pythran 

# One of the advantages of pythran is that it can optimize higher level numpy code than Cython. Luckily for us again, it features a line magic, so we keep working in the notebook.

# In[35]:


get_ipython().run_line_magic('load_ext', 'pythran.magic')


# To use pythran, all you have to do is to annotate the function you want to export and give it a signature. This is done in a comment line starting the pythran file. One of the promises of pythran is that it can often handle high level broadcasting with NumPy and still optimize our function.
# 
# Annotating our NumPy function, we obtain this:

# In[36]:


get_ipython().run_cell_magic('pythran', '', '#pythran export laplace_pythran_highlevel(float[][])\nimport numpy as np\ndef laplace_pythran_highlevel(image):\n    """Laplace operator in NumPy for 2D images. Pythran accelerated."""\n    laplacian = image[:-2, 1:-1] + image[2:, 1:-1] + image[1:-1, :-2] + image[1:-1, 2:] - 4*image[1:-1, 1:-1]\n    thresh = np.abs(laplacian) > 0.05\n    return thresh\n')


# In[37]:


np.allclose(laplace_numpy(image), laplace_pythran_highlevel(image))


# In[39]:


get_ipython().run_line_magic('timeit', 'laplace_pythran_highlevel(image)')


# Wow! Pythran translated that to the fastest version so far!

# But pythran can also translate "low level style" code, just like Cython. Let's try again in a low-level style.

# In[40]:


get_ipython().run_cell_magic('pythran', '', '#pythran export laplace_pythran_lowlevel(float[][])\nimport numpy as np\ndef laplace_pythran_lowlevel(image):\n    """Laplace operator in NumPy for 2D images. Pythran accelerated."""\n    h = image.shape[0]\n    w = image.shape[1]\n    laplacian = np.empty((h - 2, w - 2))\n    for i in range(1, h - 1):\n        for j in range(1, w - 1):\n            laplacian[i-1, j-1] = image[i-1, j] + image[i+1, j] + image[i, j-1] + image[i, j+1] - 4*image[i, j]\n    thresh = np.abs(laplacian) > 0.05\n    return thresh\n')


# In[41]:


np.allclose(laplace_numpy(image), laplace_pythran_lowlevel(image))


# In[42]:


get_ipython().run_line_magic('timeit', 'laplace_pythran_lowlevel(image)')


# Less fast than before. What if we move the thresholding operation inside the main loop?

# In[43]:


get_ipython().run_cell_magic('pythran', '', '#pythran export laplace_pythran_lowlevel(float[][])\nimport numpy as np\ndef laplace_pythran_lowlevel(image):\n    """Laplace operator in NumPy for 2D images. Pythran accelerated."""\n    h = image.shape[0]\n    w = image.shape[1]\n    laplacian = np.empty((h - 2, w - 2))\n    for i in range(1, h - 1):\n        for j in range(1, w - 1):\n            laplacian[i-1, j-1] = np.abs(image[i-1, j] + image[i+1, j] + image[i, j-1] + image[i, j+1] - 4*image[i, j]) > 0.05\n    return laplacian\n')


# In[44]:


np.allclose(laplace_numpy(image), laplace_pythran_lowlevel(image))


# In[45]:


get_ipython().run_line_magic('timeit', 'laplace_pythran_lowlevel(image)')


# So in this case, pythran with a low level approach outperforms Cython, with the best result being when it's given high-level Numpy code.

# Another interesting point here, concerning debugging pythran compilation, is that pythran annotated code is still perfectly valid Python. This helps during the debugging phase, compared to the previously explained Cython approach.

# Let's turn to our last tool, `numba`.

# # Numba 

# Numba is very similar to pythran, except that it does just-in-time compilation (JIT). We have several options to for use. Let's try the standard `jit` approach with a decorator.

# In[46]:


from numba import jit


# In[47]:


@jit
def laplace_numba(image):
    """Laplace operator in NumPy for 2D images. Accelerated using numba."""
    laplacian = image[:-2, 1:-1] + image[2:, 1:-1] + image[1:-1, :-2] + image[1:-1, 2:] - 4*image[1:-1, 1:-1]
    thresh = np.abs(laplacian) > 0.05
    return thresh


# We call it once to avoid timing issues due to JIT compilation.

# In[48]:


laplace_numba(image);


# In[49]:


get_ipython().run_line_magic('timeit', 'laplace_numba(image)')


# This is rightaway faster than NumPy.

# We can also tell numba to avoid using Python at all in its computation. This is called with `@jit(nopython=True)`.

# In[50]:


@jit(nopython=True)
def laplace_numba(image):
    """Laplace operator in NumPy for 2D images. Accelerated using numba."""
    laplacian = image[:-2, 1:-1] + image[2:, 1:-1] + image[1:-1, :-2] + image[1:-1, 2:] - 4*image[1:-1, 1:-1]
    thresh = np.abs(laplacian) > 0.05
    return thresh


# In[51]:


laplace_numba(image);


# In[52]:


get_ipython().run_line_magic('timeit', 'laplace_numba(image)')


# No big effect. In fact, we can infer from this that numba managed to generate pure C code from our function and that it did it already previously. 
# 
# To confirm that, we can either use the `inspect_types()` method of our jitted function or use the annotation tool provided with numba.

# In[53]:


laplace_numba.inspect_types()


# Admittedly, not very helpful. Let's try the annotation tool.

# In[54]:


get_ipython().run_cell_magic('file', 'numba_annotation_demo.py', 'import numpy as np\nimport skimage.data\nimport skimage.color\nfrom numba import jit\n\n@jit(nopython=True)\ndef laplace_numba(image):\n    """Laplace operator in NumPy for 2D images. Accelerated using numba."""\n    laplacian = image[:-2, 1:-1] + image[2:, 1:-1] + image[1:-1, :-2] + image[1:-1, 2:] - 4*image[1:-1, 1:-1]\n    thresh = np.abs(laplacian) > 0.05\n    return thresh\n\nimage = skimage.data.astronaut()\nimage = skimage.color.rgb2gray(image)\nlaplace_numba(image)\n')


# We have to run the annotation tool from the command line:

# In[55]:


get_ipython().system('numba --annotate-html annotation.html numba_annotation_demo.py')


# In[56]:


from IPython.display import HTML


# In[57]:


HTML(data=open('annotation.html').read())


# What this shows is that `numba` managed to correctly optimize the broadcasting (if it hadn't we would have seen colored source code lines).

# Let's see also investigate how numba optimizes the low-level approach with explicit for loops.

# In[58]:


@jit(nopython=True)
def laplace_numba(image):
    """Laplace operator for 2D images. Numba accelerated."""
    h = image.shape[0]
    w = image.shape[1]
    laplacian = np.empty((h - 2, w - 2))
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            laplacian[i-1, j-1] = np.abs(image[i-1, j] + image[i+1, j] + image[i, j-1] + image[i, j+1] - 4*image[i, j]) > 0.05
    return laplacian


# In[59]:


laplace_numba(image);


# In[60]:


get_ipython().run_line_magic('timeit', 'laplace_numba(image)')


# This approach improves the performance! 

# Another way of approaching optimization with numba is to thing in terms of NumPy ufuncs. You can write generalized ufuncs with numba using another decorator, `@guvectorize`. However, this comes at the cost that we have to think in terms of an ufunc and thus we have to provide an array that gets filled by our computation routine.

# In[61]:


from numba import guvectorize


# In[62]:


@guvectorize('void(float64[:, :], float64[:, :])', "(m, n)->(m, n)")
def laplace_numba_guvectorize(image, laplacian):
    """Laplace operator in NumPy for 2D images. Numba accelerated."""
    h = image.shape[0]
    w = image.shape[1]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            laplacian[i-1, j-1] = np.abs(4 * image[i, j] - image[i - 1, j] - image[i + 1, j] - image[i, j + 1] - image[i, j - 1]) > 0.05


# In[63]:


laplacian = np.empty_like(image)


# In[64]:


laplace_numba_guvectorize(image, laplacian);


# In[65]:


np.allclose(laplace_numpy(image), laplacian[:-2, :-2])


# In[66]:


get_ipython().run_line_magic('timeit', 'laplace_numba_guvectorize(image, laplacian);')


# This approach is even faster. However, it has some limitations: the signature for array sizes forces us to use the same shapes in the input and output. I could'nt use the  "(m, n)->(m-1, n-1)" signature. Note that we also could have target parallel execution by using the `target=parallel` option in the decorator.

# # Wrap-up and plots 

# Time for a wrap-up and some plots.

# In[67]:


timings = {}
for func in [laplace_skimage, laplace_numpy, laplace_cython, laplace_pythran_highlevel, laplace_pythran_lowlevel, laplace_numba]:
    t = get_ipython().run_line_magic('timeit', '-o func(image)')
    timings[func.__name__] = t


# In[68]:


t = get_ipython().run_line_magic('timeit', '-o laplace_numba_guvectorize(image, laplacian);')
timings['laplace_numba_guvectorize'] = t


# Here are the table and the plots.

# In[69]:


import pandas as pd


# In[70]:


pd.Series({key: timings[key].average * 1e6 for key in timings}).to_frame(name='timings (μs)').sort_values(by='timings (μs)')


# In[71]:


fig, ax = plt.subplots(figsize=(10, 10))
pd.Series({key: timings[key].average * 1e6 for key in timings}).to_frame(name='timings (μs)').sort_values(by='timings (μs)').plot(kind='barh', ax=ax)


# **conclusions**
# 
# 
# So, what have I learned during these last three days? Here are some thoughts:
# 
# - most approaches suggest you rewrite your function in a low-level style if you want a good optimization result
# - most approaches rely on the type annotations which you supply and which are crucial for good results (and sometimes it's easy to forget one variable which in turn prevents the optimizer from doing its best)
# - pythran, like numba, can in theory optimize high-level NumPy code (in our example, pythran did better than numba)
# - pythran yields the best performance in our benchmark
# - Cython, contrary to what I expected, comes in after those two (but still yields a good speed-up)
# - debugging all code generators is usually frustrating, due to difficult to understand error messages, low-level stuff (try reading the Cython output by clicking the yellow lines) 
# - numba also allows targeting GPUs, something I haven't spoken about here since I don't have a GPU on my MacBook (this was the topic I had the most difficulty with during the training session since I've never worked with GPUs)

# **further reading**
# 
# - the resources that we used during the training can be found [here](http://calcul.math.cnrs.fr/spip.php?article284) (in French) - congratulations to the nice people who put in a lot of effort to organize the workshop: [Loïc Gouarin](https://www.math.u-psud.fr/~gouarin/), [Xavier Juvigny](https://www.researchgate.net/profile/Xavier_Juvigny), [Serge Guelton](http://serge.liyun.free.fr/serge/) and [Konrad Hinsen](http://khinsen.net) 
# - there is a [great numba tutorial on YouTube](https://www.youtube.com/watch?v=SzBi3xdEF2Y) that was offered at SciPy 2016 (and that's going to get repeated at this year's SciPy) by Lorena Barba and Gil Forsyth

# *This post was entirely written using the IPython notebook. Its content is BSD-licensed. You can see a static view or download this notebook with the help of nbviewer at [20170706_OptimizingYourPythonCode.ipynb](http://nbviewer.ipython.org/urls/raw.github.com/flothesof/posts/master/20170706_OptimizingYourPythonCode.ipynb).*