As we saw when interfacing with C and FORTRAN, Python is a great language for rapid development. Its weakness is speed: compared with low-level languages it is slow, especially in nested loops over huge arrays.
Current approaches to dealing with this include wrapping or rewriting the performance-critical code in a compiled language such as C, Fortran, or Cython.

The Numba philosophy: don't wrap or rewrite, just decorate!

```python
import numba
import cmath
```
Numba is an Open Source NumPy-aware optimizing compiler for Python sponsored by Continuum Analytics, Inc. It uses the LLVM compiler infrastructure to compile Python syntax to machine code.
Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.
Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.
Here are three of its most important distinguishing features:

- Numba is a compiler - It leverages the LLVM compilation framework to parse, compile, and optimize code down to machine level, in the same manner as the fastest compiled languages such as C and Fortran.
- Numba is Python - Although Numba utilizes very powerful libraries beneath Python for performance, the code you write is always pure Python, making it easier to author, debug, verify, and maintain.
- Accelerating a function in Numba can be done with two lines of code - You provide a NumPy array-based Python function, import Numba's `jit` decorator, then annotate the function with `@jit` or create the accelerated function by calling `autojit` directly.

As mentioned earlier, Numba compiles Python code to LLVM, which can be executed at runtime much faster than pure Python. The first step to using Numba is to get familiar with the `jit` decorator.
```python
from numba import jit

@jit
def sum(x, y):
    return x + y

%timeit sum(2,2)
```
1000000 loops, best of 3: 252 ns per loop
The very basic example above is compiled for any compatible input types automatically when the `sum` function is called. The result is a new function with performance comparable to a compiled function written in C. To compile for specific input types, we can tell Numba what those input types are:
```python
@jit('f8(f8,f8)')
def sum(x, y):
    return x + y

%timeit sum(2,2)
```
1000000 loops, best of 3: 227 ns per loop
The string passed to the `jit` decorator above tells Numba that the return type is an 8-byte float and that the two arguments are also 8-byte floats. The string takes the form `returntype(arg1type, arg2type, ...)`.
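Signatures need not be strings; the same declaration can be built from Numba's type objects (this form appears again in the image-filter example later). A minimal sketch:

```python
from numba import jit, float64

# float64(float64, float64) is the type-object equivalent of 'f8(f8,f8)'
@jit(float64(float64, float64))
def add(x, y):
    return x + y

print(add(2.0, 3.0))  # 5.0
```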
One of the main features of Numba is its support for NumPy arrays. The following example shows how a function can be compiled that takes a NumPy array of floats as an input:
```python
import numpy as np

a = np.random.randn(100)
%timeit a.sum()
```
100000 loops, best of 3: 2.19 µs per loop
```python
@jit('f8(f8[:])')
def sum1d(array):
    sum = 0.0
    for i in range(array.shape[0]):
        sum += array[i]
    return sum

%timeit sum1d(a)
```
1000000 loops, best of 3: 267 ns per loop
There are two main things to notice in the example above. First, the input argument is specified by the string `f8[:]`, which means a 1-d array of 8-byte floats; a 2-d array would be specified as `f8[:, :]`, a 3-d array as `f8[:, :, :]`, and so on. Second, note the array indexing, the `shape` attribute access, and the fact that we're iterating over a NumPy array in plain Python. Normally this would be terribly slow, but the performance of the code above is better than NumPy's `sum` method.
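To see the multi-dimensional notation in action, here is a two-dimensional analogue of `sum1d` (our own sketch; `sum2d` is not part of the original benchmarks):

```python
# 'f8(f8[:, :])' declares a 2-d array of 8-byte floats as the argument
@jit('f8(f8[:, :])')
def sum2d(array):
    total = 0.0
    M, N = array.shape
    for i in range(M):
        for j in range(N):
            total += array[i, j]
    return total

mat = np.random.randn(50, 50)
print(np.allclose(sum2d(mat), mat.sum()))  # True
```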
As a more substantial benchmark, consider computing the pairwise Euclidean distances between 1000 points in three dimensions, first with NumPy broadcasting:

```python
import numpy as np

X = np.random.random((1000, 3))

def pairwise_numpy(X):
    return np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))

%timeit pairwise_numpy(X)
```
10 loops, best of 3: 35.5 ms per loop
Next, the same computation with pure Python loops:

```python
def pairwise_python(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D

%timeit pairwise_python(X)
```
1 loops, best of 3: 3.76 s per loop
As we can see, the pure-Python version is over 100 times slower than the NumPy broadcasting approach! This is due to Python's dynamic type checking, which can drastically slow down nested loops. With these two solutions we are left with a tradeoff between efficiency of computation and efficiency of memory usage (the broadcasting version materializes a large intermediate array). This is where Numba could prove to be useful. It is also worthwhile to see how `scipy` and `scikit-learn` perform on the same function.
```python
from scipy.spatial.distance import cdist
%timeit cdist(X, X)
```
100 loops, best of 3: 4.36 ms per loop

```python
from sklearn.metrics import euclidean_distances
%timeit euclidean_distances(X, X)
```
100 loops, best of 3: 6.49 ms per loop
Now we try Numba! Numba is extremely simple to use. We just wrap our Python function with `jit` (JIT stands for "just-in-time" compilation) to automatically create an efficient, compiled version of the function:
```python
from numba import double
from numba.decorators import jit, autojit

pairwise_numba = jit(pairwise_python)

%timeit pairwise_numba(X)
```
100 loops, best of 3: 6.56 ms per loop
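The import above also pulls in `autojit`, which was mentioned earlier: it skips the signature entirely and specializes the function for whichever argument types it is called with. A sketch (note that in later Numba releases `autojit` was folded into a signature-less `@jit`):

```python
# autojit defers compilation until call time and specializes on the
# argument types it actually receives; no signature string is needed.
pairwise_autojit = autojit(pairwise_python)
pairwise_autojit(X)  # first call compiles a float64[:, :] specialization
```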
For comparison, here is an equivalent Cython implementation:

```python
%load_ext cythonmagic
```
The cythonmagic extension is already loaded. To reload it, use: %reload_ext cythonmagic
```cython
%%cython
import numpy as np
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise_cython(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = sqrt(d)
    return np.asarray(D)
```

```python
%timeit pairwise_cython(X)
```
100 loops, best of 3: 3.78 ms per loop
Next, consider a bubble sort, another loop-heavy algorithm:

```python
def bubblesort(X):
    N = len(X)
    for end in range(N, 1, -1):
        for i in range(end - 1):
            cur = X[i]
            if cur > X[i + 1]:
                tmp = X[i]
                X[i] = X[i + 1]
                X[i + 1] = tmp
```
```python
import numpy as np

original = np.arange(0.0, 10.0, 0.01, dtype='f4')
shuffled = original.copy()
np.random.shuffle(shuffled)

# Check that it works!
sorted = shuffled.copy()
bubblesort(sorted)
print(np.array_equal(sorted, original))
```
True

```python
# Check execution time
sorted[:] = shuffled[:]
%timeit sorted[:] = shuffled[:]; bubblesort(sorted)
```
1 loops, best of 3: 149 ms per loop
Now let's make a Numba version of bubblesort!
@jit("void(f4[:])")
def bubblesort_jit(X):
N = len(X)
for end in range(N, 1, -1):
for i in range(end - 1):
cur = X[i]
if cur > X[i + 1]:
tmp = X[i]
X[i] = X[i + 1]
X[i + 1] = tmp
sorted[:] = shuffled[:] # reset to shuffled before sorting
bubblesort_jit(sorted)
print(np.array_equal(sorted, original))
True
%timeit sorted[:] = shuffled[:]; bubblesort_jit(sorted)
1000 loops, best of 3: 1.16 ms per loop
In order to generate fast code, the compiler needs type information: this allows a direct mapping from Python operations to the appropriate machine instructions without any type check/dispatch mechanism. In Numba, it usually suffices to specify the types of the parameters; in many cases Numba can deduce the types of intermediate values, as well as the return value, using type inference. For convenience, it is also possible to specify the return type in the signature.
A `numba.jit`-compiled function will only work when called with the right types of arguments (it may, however, perform some conversions on types that it considers equivalent).

A signature contains the return type as well as the argument types. One way to specify the signature is with a string, as in our example; the signature takes the form `<return type>(<arg1 type>, <arg2 type>, ...)`. The types may be scalars or (NumPy) arrays. Our example, `void(f4[:])`, means a function with no return value (return type `void`) that takes as its only argument a one-dimensional array of 4-byte floats, `f4[:]`. Starting with Numba version 0.12 the result type is optional; in that case the signature looks like `(<arg1 type>, <arg2 type>, ...)`, and the type of the return value is inferred.
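For instance, dropping the return type from our earlier scalar example leaves it to type inference (a sketch; we assume a Numba version that accepts tuple-style signature strings, per the 0.12 note above):

```python
# No return type in the signature: Numba infers f8 from the body.
@jit('(f8, f8)')
def add_inferred(x, y):
    return x + y

print(add_inferred(2.0, 3.0))  # 5.0
```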
The elementary built-in types in Numba include:

Type name | Alias | Result type
---|---|---
boolean | b1 | uint8 (char)
bool_ | b1 | uint8 (char)
byte | u1 | unsigned char
uint8 | u1 | uint8 (char)
uint16 | u2 | uint16
uint32 | u4 | uint32
uint64 | u8 | uint64
char | i1 | signed char
int8 | i1 | int8 (char)
int16 | i2 | int16
int32 | i4 | int32
int64 | i8 | int64
float_ | f4 | float32
float32 | f4 | float32
double | f8 | float64
float64 | f8 | float64
complex64 | c8 | float complex
complex128 | c16 | double complex
Some sample signatures are as follows:

Signature | Meaning
---|---
void(f4[:], u8) | a function with no return value taking a one-dimensional array of single-precision floats and a 64-bit unsigned integer
i4(f8) | a function returning a 32-bit signed integer, taking a double-precision float as its argument
void(f4[:,:], f4[:,:]) | a function with no return value taking two 2-dimensional arrays as arguments
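To make the last row concrete, here is a small sketch of our own showing the usual pattern for a `void` signature: the function writes its result into a caller-supplied output array instead of returning one:

```python
import numpy as np
from numba import jit

# void(f4[:,:], f4[:,:]): no return value; results go into `out`.
@jit('void(f4[:,:], f4[:,:])')
def double_into(src, out):
    M, N = src.shape
    for i in range(M):
        for j in range(N):
            out[i, j] = 2.0 * src[i, j]

src = np.ones((3, 3), dtype='f4')
out = np.empty_like(src)
double_into(src, out)
print(out[0, 0])  # 2.0
```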
Types can be passed directly to the `jit` function to compile a function, and they can also be used to declare local variables in both `jit` and `autojit` functions:
```python
@jit(locals=dict(array=double[:, :], scalar1=double))
def func(array):
    scalar1 = array[0, 0]          # scalar1 is declared double
    scalar2 = double(array[0, 0])  # a cast using the type object
```
Numba also allows you to obtain the type of an expression or variable through the `typeof` function inside a Numba function. This type can then be used, for instance, to cast other values:

```python
type = numba.typeof(x + y)
value = type(value)
```

```python
numba.typeof(1.0)
```
float64

```python
numba.typeof(cmath.sqrt(-1))
```
complex128
Variables declared in the `locals` dict have a single type throughout the entire function. However, any variable not declared in `locals` can assume different types, just as in Python:
```python
@jit
def variable_ressign(arg):
    arg = 1.0
    arg = "hello"
    arg = object()
    var = arg
    var = "world"
```
However, there are some restrictions: variables must have a unifiable type at each control-flow merge point. For example, the following code would not compile according to the Numba documentation, though it appears that a later update has made it work:
```python
@jit
def incompatible_types(arg):
    if arg > 10:
        x = "hello"
    else:
        x = arg
    return x

incompatible_types(5)
```
5
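By contrast, branches whose values do unify compile without complaint; for example (our own sketch):

```python
# int and float unify to a floating-point type at the merge point.
@jit
def compatible_types(arg):
    if arg > 10:
        x = 1.0   # float
    else:
        x = arg   # int, unified with float
    return x

print(compatible_types(5))  # 5.0 under nopython compilation
```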
Returning to loops, here is a function that builds a 2-d array and sums its elements:

```python
@jit
def sum(x, y):
    array = np.arange(x * y).reshape(x, y)
    sum = 0
    for i in range(x):
        for j in range(y):
            sum += array[i, j]
    return sum

%timeit sum(100,100)
```
1000 loops, best of 3: 1.6 ms per loop
The following forces compilation in `nopython` mode; paired with the object-mode wrapper below, it mimics what `loop-jitting` does automatically (both terms are defined in the glossary at the end).
```python
# force compilation in nopython mode
@jit(nopython=True)
def jitted_loop(array, x, y):
    sum = 0
    for i in range(x):
        for j in range(y):
            sum += array[i, j]
    return sum

import numpy as np
arr = np.random.randn(100, 100)

%timeit jitted_loop(arr, 100, 100)
```
100000 loops, best of 3: 8.34 µs per loop
The code below is compiled in object mode; only the array creation runs as object code, while the hot loop executes inside the nopython-compiled `jitted_loop`:

```python
# compiled in object mode
@jit
def sum(x, y):
    array = np.arange(x * y).reshape(x, y)
    return jitted_loop(array, x, y)

%timeit sum(100,100)
```
100000 loops, best of 3: 13.4 µs per loop
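Note that we could not simply force `nopython=True` on the outer function: array creation is one of the operations that (in the Numba version used here) requires object mode, so compilation would fail. A hedged sketch:

```python
# Forcing nopython mode on a function that allocates an array should
# fail to compile in the Numba version assumed by this notebook.
@jit(nopython=True)
def sum_nopython(x, y):
    array = np.arange(x * y).reshape(x, y)
    return jitted_loop(array, x, y)

try:
    sum_nopython(100, 100)
except Exception as e:  # expect a typing/lowering error here
    print(type(e).__name__)
```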
A few noteworthy limitations of arrays at this time:

- Arrays can be passed into a function in nopython mode, but not returned; arrays can only be returned in object mode.

Numba's `vectorize` allows Numba functions taking scalar input arguments to be used as NumPy ufuncs. Creating a traditional NumPy ufunc is not the most difficult task in the world, but it is also not the most straightforward process and involves writing some C code. Numba makes this easy. Using the `vectorize` decorator, Numba can compile a Python function into a ufunc that operates over NumPy arrays as fast as traditional ufuncs written in C.
Ufunc arguments are scalars drawn from NumPy arrays, and the function body can be an arbitrary mathematical expression. The `vectorize` decorator needs to know the argument and return types of the ufunc; these are specified much like they are for the `jit` decorator:
```python
import math
from numba import vectorize

@vectorize(['float64(float64, float64)'])
def my_ufunc(x, y):
    return x + y + math.sqrt(x * y)

a = np.arange(1.0, 10.0)
b = np.arange(1.0, 10.0)

# Calls the compiled version of my_ufunc for each element of a and b
print(my_ufunc(a, b))
```
[ 3. 6. 9. 12. 15. 18. 21. 24. 27.]
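Because `my_ufunc` is now a true ufunc, it also broadcasts like any built-in NumPy ufunc; for instance, pairing the array with a scalar (a quick check of our own):

```python
# Ufunc broadcasting: the scalar 4.0 is paired with every element of a
print(my_ufunc(a, 4.0))  # elementwise x + 4 + sqrt(4*x)
```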
As a larger example, consider a 2-d image filter (a direct convolution):

```python
import numpy

def filter2d(image, filt):
    M, N = image.shape
    Mf, Nf = filt.shape
    Mf2 = Mf // 2
    Nf2 = Nf // 2
    result = numpy.zeros_like(image)
    for i in range(Mf2, M - Mf2):
        for j in range(Nf2, N - Nf2):
            num = 0.0
            for ii in range(Mf):
                for jj in range(Nf):
                    num += (filt[Mf-1-ii, Nf-1-jj] * image[i-Mf2+ii, j-Nf2+jj])
            result[i, j] = num
    return result
```
```python
# This kind of quadruply-nested for-loop is going to be quite slow.
# Using Numba we can compile this code to LLVM, which then gets
# compiled to machine code:
from numba import double, jit

fastfilter_2d = jit(double[:,:](double[:,:], double[:,:]))(filter2d)

# Now fastfilter_2d runs at speeds as if you had first translated
# it to C, compiled the code, and wrapped it with Python
import matplotlib.pyplot as plt  # needed for the plots below

image = numpy.random.random((100, 100))
filt = numpy.random.random((10, 10))

plt.figure(1)
plt.imshow(image)
plt.gcf().set_size_inches(6, 6)

%timeit result1 = filter2d(image, filt)
%timeit result2 = fastfilter_2d(image, filt)
```
1 loops, best of 3: 619 ms per loop
1000 loops, best of 3: 1.23 ms per loop
```python
result1 = filter2d(image, filt)
plt.subplot(221)
plt.imshow(result1)
plt.title('Filter')

result2 = fastfilter_2d(image, filt)
plt.subplot(222)
plt.imshow(result2)
plt.title('Fast Filter')

plt.gcf().set_size_inches(6, 6)
```
**Mandelbrot Fractals**
The Mandelbrot set is obtained by iteratively assigning a value to each point $c$ in the complex plane according to the formula $z_{n+1} = z_{n}^2 + c$ ($z_{0} = 0$). If the value $z$ goes to $\infty$ as $n \rightarrow \infty$, then that point $c$ is not part of the Mandelbrot set. Conversely, if the value of $z$ remains bounded no matter how large $n$ becomes, the point $c$ is part of the Mandelbrot set.
From Wikipedia, "Mandelbrot set images are made by sampling complex numbers and determining for each whether the result tends towards infinity when a particular mathematical operation is iterated on it. Treating the real and imaginary parts of each number as image coordinates, pixels are colored according to how rapidly the sequence diverges, if at all."
The following code implements the escape-time algorithm, which tells you how many iterations it takes for a point to "escape" (become unbounded), if it does so within a given maximum number of iterations. This version does not use complex numbers, although a version could be written that does (a sketch of one follows the benchmarks below).
```python
from __future__ import division
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.pyplot as plt

# IPython magic command for inline plotting
%matplotlib inline
# a better plot shape for IPython
mpl.rcParams['figure.figsize'] = [15, 3]
```
```python
def mandelbrot(minR, maxR, minI, maxI, samples=51, iters=25):
    """
    Generate the Mandelbrot set within the boundaries of the limits
    maxR, minR, minI, maxI with samples and a maximum iteration of iters.
    """
    # Generate a mesh for the set.
    setR = np.linspace(minR, maxR, samples)
    setI = np.linspace(minI, maxI, samples)

    # Calculate the values at each point of the mesh by the escape-time
    # fractal algorithm.
    pts = np.zeros([samples, samples])
    for ii in range(1, len(setR)):
        for jj in range(1, len(setI)):
            it = 0
            x = 0.0
            y = 0.0
            xx = setR[ii]
            yy = setI[jj]

            # Look for escape---i.e., does the value settle down in a few
            # iterations or does it keep going?
            while(x * x + y * y < 4 and it < iters):
                xtmp = x * x - y * y + xx
                y = 2 * x * y + yy
                x = xtmp
                it += 1
            pts[ii, jj] = it
    return setR, setI, pts
```
```python
# Plot boundaries
minR = -2.25
maxR = 0.75
minI = -1.5
maxI = 1.5

samples = 201
iters = 20

x, y, z = mandelbrot(minR, maxR, minI, maxI, samples, iters)
z = z.transpose()

mpl.rcParams['figure.figsize'] = [8, 8]
plt.imshow(z, interpolation='nearest')
plt.show()
```
```python
%timeit mandelbrot(minR, maxR, minI, maxI, samples, iters)
```
1 loops, best of 3: 694 ms per loop

```python
mandelbrot_jit = jit(mandelbrot)
%timeit mandelbrot_jit(minR, maxR, minI, maxI, samples, iters)
```
1000 loops, best of 3: 799 µs per loop
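As promised above, the escape-time kernel can also be written with complex arithmetic, which Numba supports as well. A minimal sketch of that variant (our own, not part of the original benchmark):

```python
@jit
def escape_time(c, iters):
    # Iterate z_{n+1} = z_n**2 + c from z_0 = 0 and count the steps
    # until |z|**2 exceeds 4 (escape) or the budget `iters` runs out.
    z = 0.0j
    for it in range(iters):
        z = z * z + c
        if z.real * z.real + z.imag * z.imag >= 4.0:
            return it
    return iters

print(escape_time(-1.0 + 0.0j, 25))  # bounded: returns 25
print(escape_time(1.0 + 1.0j, 25))   # escapes after a couple of steps
```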
Finally, a short glossary of Numba terms used above:

**JIT function** - Shorthand for "a function compiled with Numba."

**loop-jitting** - A feature of compilation in object mode whereby a loop can be automatically extracted and compiled in nopython mode. This allows functions with operations unsupported in nopython mode (such as array creation) to see significant performance improvements if they contain loops with only nopython-supported operations.

**nopython mode** - A Numba compilation mode that generates code that does not access the Python C API. This compilation mode produces the highest-performance code, but requires that the native types of all values in the function can be inferred and that no new objects are allocated. Unless otherwise instructed, the @jit decorator will automatically fall back to object mode if nopython mode cannot be used.

**object mode** - A Numba compilation mode that generates code that handles all values as Python objects and uses the Python C API to perform all operations on those objects. Code compiled in object mode will often run no faster than Python interpreted code, unless the Numba compiler can take advantage of loop-jitting.

**type inference** - The process by which Numba determines the specialized types of all values within a function being compiled. Type inference can fail if arguments or globals have Python types unknown to Numba, or if functions are used that are not recognized by Numba. Successful type inference is a prerequisite for compilation in nopython mode.

**ufunc** - A NumPy universal function. Numba can create new compiled ufuncs with the `@vectorize` decorator.
Neal Davis and Lakshmi Rao developed these materials for Computational Science and Engineering at the University of Illinois at Urbana–Champaign.
This content is available under a [Creative Commons Attribution 3.0 Unported License](https://creativecommons.org/licenses/by/3.0/).