As we saw when interfacing with C and FORTRAN, Python is a great language for rapid development. Its weakness is speed: compared with low-level languages it is slow, especially in nested loops over huge arrays.
Current approaches to dealing with this include wrapping or rewriting the performance-critical code in a compiled language such as C, Fortran, or Cython.

The Numba philosophy: don't wrap or rewrite, just decorate!

```python
import numba
import cmath
```
Numba is an Open Source NumPy-aware optimizing compiler for Python sponsored by Continuum Analytics, Inc. It uses the LLVM compiler infrastructure to compile Python syntax to machine code.
Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.
Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.
Here are three of its most important distinguishing features:

- Numba is a compiler - It leverages the LLVM compilation framework to parse, compile, and optimize code down to machine level, in the same manner as the fastest compiled languages such as C and Fortran.
- Numba is Python - Although Numba utilizes very powerful libraries beneath Python for performance, the code you write is always pure Python, making it easier to author, debug, verify, and maintain.
- Accelerating a function in Numba can be done with two lines of code - You provide a NumPy array-based Python function, import Numba's `jit` decorator, then annotate the function with `@jit` or create the accelerated function by calling `autojit` directly.

As mentioned earlier, Numba compiles Python code to LLVM, which can be executed at runtime much faster than pure Python. The first step to using Numba is to get familiar with the `jit` decorator.
```python
from numba import jit

@jit
def sum(x, y):
    return x + y

%timeit sum(2,2)
```
1000000 loops, best of 3: 252 ns per loop
The very basic example above is compiled for any compatible input types automatically when the `sum` function is called. The result is a new function with performance comparable to a compiled function written in C. To compile for specific input types, we can tell Numba what those input types are:
```python
@jit('f8(f8,f8)')
def sum(x, y):
    return x + y

%timeit sum(2,2)
```
1000000 loops, best of 3: 227 ns per loop
The string passed to the `jit` decorator above tells Numba that the return type is an 8-byte float and that the two arguments are also 8-byte floats. The string takes the form `returntype(arg1type, arg2type, ...)`.
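Signatures need not be strings; the same declaration can be built from Numba's type objects (this form appears again in the image-filter example later). A minimal sketch:

```python
from numba import jit, float64

# float64(float64, float64) is the type-object equivalent of 'f8(f8,f8)'
@jit(float64(float64, float64))
def add(x, y):
    return x + y

print(add(2.0, 3.0))  # 5.0
```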
One of the main features of Numba is its support for NumPy arrays. The following example shows how a function can be compiled that takes a NumPy array of floats as an input:
```python
import numpy as np

a = np.random.randn(100)
%timeit a.sum()
```
100000 loops, best of 3: 2.19 µs per loop
```python
@jit('f8(f8[:])')
def sum1d(array):
    sum = 0.0
    for i in range(array.shape[0]):
        sum += array[i]
    return sum

%timeit sum1d(a)
```
1000000 loops, best of 3: 267 ns per loop
There are two main things to notice in the example above. First, the input argument is specified by the string `f8[:]`, which means a 1-d array of 8-byte floats; a 2-d array would be specified as `f8[:, :]`, a 3-d array as `f8[:, :, :]`, and so on. Second, note the array indexing, the `shape` attribute access, and the fact that we're iterating over a NumPy array in plain Python. Normally this would be terribly slow, but the performance of the code above is better than NumPy's `sum` method.
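To see the multi-dimensional notation in action, here is a two-dimensional analogue of `sum1d` (our own sketch; `sum2d` is not part of the original benchmarks):

```python
# 'f8(f8[:, :])' declares a 2-d array of 8-byte floats as the argument
@jit('f8(f8[:, :])')
def sum2d(array):
    total = 0.0
    M, N = array.shape
    for i in range(M):
        for j in range(N):
            total += array[i, j]
    return total

mat = np.random.randn(50, 50)
print(np.allclose(sum2d(mat), mat.sum()))  # True
```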
As a more substantial benchmark, consider computing the pairwise Euclidean distances between 1000 points in three dimensions, first with NumPy broadcasting:

```python
import numpy as np

X = np.random.random((1000, 3))

def pairwise_numpy(X):
    return np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))

%timeit pairwise_numpy(X)
```
10 loops, best of 3: 35.5 ms per loop
Next, the same computation with pure Python loops:

```python
def pairwise_python(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D

%timeit pairwise_python(X)
```
1 loops, best of 3: 3.76 s per loop
As we can see, the pure-Python version is over 100 times slower than the NumPy broadcasting approach! This is due to Python's dynamic type checking, which can drastically slow down nested loops. With these two solutions we are left with a tradeoff between efficiency of computation and efficiency of memory usage (the broadcasting version materializes a large intermediate array). This is where Numba could prove to be useful. It is also worthwhile to see how `scipy` and `scikit-learn` perform on the same function.
```python
from scipy.spatial.distance import cdist
%timeit cdist(X, X)
```
100 loops, best of 3: 4.36 ms per loop

```python
from sklearn.metrics import euclidean_distances
%timeit euclidean_distances(X, X)
```
100 loops, best of 3: 6.49 ms per loop
Now we try Numba! Numba is extremely simple to use. We just wrap our Python function with `jit` (JIT stands for "just-in-time" compilation) to automatically create an efficient, compiled version of the function:
```python
from numba import double
from numba.decorators import jit, autojit

pairwise_numba = jit(pairwise_python)

%timeit pairwise_numba(X)
```
100 loops, best of 3: 6.56 ms per loop
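The import above also pulls in `autojit`, which was mentioned earlier: it skips the signature entirely and specializes the function for whichever argument types it is called with. A sketch (note that in later Numba releases `autojit` was folded into a signature-less `@jit`):

```python
# autojit defers compilation until call time and specializes on the
# argument types it actually receives; no signature string is needed.
pairwise_autojit = autojit(pairwise_python)
pairwise_autojit(X)  # first call compiles a float64[:, :] specialization
```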
For comparison, here is an equivalent Cython implementation:

```python
%load_ext cythonmagic
```
The cythonmagic extension is already loaded. To reload it, use: %reload_ext cythonmagic
```cython
%%cython
import numpy as np
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise_cython(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = sqrt(d)
    return np.asarray(D)
```

```python
%timeit pairwise_cython(X)
```
100 loops, best of 3: 3.78 ms per loop
Next, consider a bubble sort, another loop-heavy algorithm:

```python
def bubblesort(X):
    N = len(X)
    for end in range(N, 1, -1):
        for i in range(end - 1):
            cur = X[i]
            if cur > X[i + 1]:
                tmp = X[i]
                X[i] = X[i + 1]
                X[i + 1] = tmp
```
```python
import numpy as np

original = np.arange(0.0, 10.0, 0.01, dtype='f4')
shuffled = original.copy()
np.random.shuffle(shuffled)

# Check that it works!
sorted = shuffled.copy()
bubblesort(sorted)
print(np.array_equal(sorted, original))
```
True

```python
# Check execution time
sorted[:] = shuffled[:]
%timeit sorted[:] = shuffled[:]; bubblesort(sorted)
```
1 loops, best of 3: 149 ms per loop
Now let's make a Numba version of bubblesort!
@jit("void(f4[:])")
def bubblesort_jit(X):
N = len(X)
for end in range(N, 1, -1):
for i in range(end - 1):
cur = X[i]
if cur > X[i + 1]:
tmp = X[i]
X[i] = X[i + 1]
X[i + 1] = tmp
sorted[:] = shuffled[:] # reset to shuffled before sorting
bubblesort_jit(sorted)
print(np.array_equal(sorted, original))
True
%timeit sorted[:] = shuffled[:]; bubblesort_jit(sorted)
1000 loops, best of 3: 1.16 ms per loop
In order to generate fast code, the compiler needs type information: this allows a direct mapping from Python operations to the appropriate machine instructions without any type check/dispatch mechanism. In Numba, it usually suffices to specify the types of the parameters; in many cases Numba can deduce the types of intermediate values, as well as the return value, using type inference. For convenience, it is also possible to specify the return type in the signature.
A `numba.jit`-compiled function will only work when called with the right types of arguments (it may, however, perform some conversions on types that it considers equivalent).

A signature contains the return type as well as the argument types. One way to specify the signature is with a string, as in our example; the signature takes the form `<return type>(<arg1 type>, <arg2 type>, ...)`. The types may be scalars or (NumPy) arrays. Our example, `void(f4[:])`, means a function with no return value (return type `void`) that takes as its only argument a one-dimensional array of 4-byte floats, `f4[:]`. Starting with Numba version 0.12 the result type is optional; in that case the signature looks like `(<arg1 type>, <arg2 type>, ...)`, and the type of the return value is inferred.
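For instance, dropping the return type from our earlier scalar example leaves it to type inference (a sketch; we assume a Numba version that accepts tuple-style signature strings, per the 0.12 note above):

```python
# No return type in the signature: Numba infers f8 from the body.
@jit('(f8, f8)')
def add_inferred(x, y):
    return x + y

print(add_inferred(2.0, 3.0))  # 5.0
```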
The elementary built-in types in Numba include:

Type name | Alias | Result type
---|---|---
boolean | b1 | uint8 (char)
bool_ | b1 | uint8 (char)
byte | u1 | unsigned char
uint8 | u1 | uint8 (char)
uint16 | u2 | uint16
uint32 | u4 | uint32
uint64 | u8 | uint64
char | i1 | signed char
int8 | i1 | int8 (char)
int16 | i2 | int16
int32 | i4 | int32
int64 | i8 | int64
float_ | f4 | float32
float32 | f4 | float32
double | f8 | float64
float64 | f8 | float64
complex64 | c8 | float complex
complex128 | c16 | double complex
Some sample signatures are as follows:

Signature | Meaning
---|---
void(f4[:], u8) | a function with no return value taking a one-dimensional array of single-precision floats and a 64-bit unsigned integer
i4(f8) | a function returning a 32-bit signed integer, taking a double-precision float as its argument
void(f4[:,:], f4[:,:]) | a function with no return value taking two 2-dimensional arrays as arguments
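To make the last row concrete, here is a small sketch of our own showing the usual pattern for a `void` signature: the function writes its result into a caller-supplied output array instead of returning one:

```python
import numpy as np
from numba import jit

# void(f4[:,:], f4[:,:]): no return value; results go into `out`.
@jit('void(f4[:,:], f4[:,:])')
def double_into(src, out):
    M, N = src.shape
    for i in range(M):
        for j in range(N):
            out[i, j] = 2.0 * src[i, j]

src = np.ones((3, 3), dtype='f4')
out = np.empty_like(src)
double_into(src, out)
print(out[0, 0])  # 2.0
```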
Types can be passed directly to the `jit` function to compile a function, and they can also be used to declare local variables in both `jit` and `autojit` functions:
```python
@jit(locals=dict(array=double[:, :], scalar1=double))
def func(array):
    scalar1 = array[0, 0]          # scalar1 is declared double
    scalar2 = double(array[0, 0])  # a cast using the type object
```
Numba also allows you to obtain the type of an expression or variable through the `typeof` function inside a Numba function. This type can then be used, for instance, to cast other values:

```python
type = numba.typeof(x + y)
value = type(value)
```

```python
numba.typeof(1.0)
```
float64

```python
numba.typeof(cmath.sqrt(-1))
```
complex128
Variables declared in the `locals` dict have a single type throughout the entire function. However, any variable not declared in `locals` can assume different types, just as in Python:
```python
@jit
def variable_ressign(arg):
    arg = 1.0
    arg = "hello"
    arg = object()
    var = arg
    var = "world"
```
However, there are some restrictions: variables must have a unifiable type at each control-flow merge point. For example, the following code would not compile according to the Numba documentation, though it appears that a later update has made it work:
```python
@jit
def incompatible_types(arg):
    if arg > 10:
        x = "hello"
    else:
        x = arg
    return x

incompatible_types(5)
```
5
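By contrast, branches whose values do unify compile without complaint; for example (our own sketch):

```python
# int and float unify to a floating-point type at the merge point.
@jit
def compatible_types(arg):
    if arg > 10:
        x = 1.0   # float
    else:
        x = arg   # int, unified with float
    return x

print(compatible_types(5))  # 5.0 under nopython compilation
```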
Returning to loops, here is a function that builds a 2-d array and sums its elements:

```python
@jit
def sum(x, y):
    array = np.arange(x * y).reshape(x, y)
    sum = 0
    for i in range(x):
        for j in range(y):
            sum += array[i, j]
    return sum

%timeit sum(100,100)
```
1000 loops, best of 3: 1.6 ms per loop
The following forces compilation in `nopython` mode; paired with the object-mode wrapper below, it mimics what `loop-jitting` does automatically (both terms are defined in the glossary at the end).
```python
# force compilation in nopython mode
@jit(nopython=True)
def jitted_loop(array, x, y):
    sum = 0
    for i in range(x):
        for j in range(y):
            sum += array[i, j]
    return sum

import numpy as np
arr = np.random.randn(100, 100)

%timeit jitted_loop(arr, 100, 100)
```
100000 loops, best of 3: 8.34 µs per loop
The code below is compiled in object mode; only the array creation runs as object code, while the hot loop executes inside the nopython-compiled `jitted_loop`:

```python
# compiled in object mode
@jit
def sum(x, y):
    array = np.arange(x * y).reshape(x, y)
    return jitted_loop(array, x, y)

%timeit sum(100,100)
```
100000 loops, best of 3: 13.4 µs per loop
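Note that we could not simply force `nopython=True` on the outer function: array creation is one of the operations that (in the Numba version used here) requires object mode, so compilation would fail. A hedged sketch:

```python
# Forcing nopython mode on a function that allocates an array should
# fail to compile in the Numba version assumed by this notebook.
@jit(nopython=True)
def sum_nopython(x, y):
    array = np.arange(x * y).reshape(x, y)
    return jitted_loop(array, x, y)

try:
    sum_nopython(100, 100)
except Exception as e:  # expect a typing/lowering error here
    print(type(e).__name__)
```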
A few noteworthy limitations of arrays at this time:

- Arrays can be passed into a function in nopython mode, but not returned; arrays can only be returned in object mode.

Numba's `vectorize` allows Numba functions taking scalar input arguments to be used as NumPy ufuncs. Creating a traditional NumPy ufunc is not the most difficult task in the world, but it is also not the most straightforward process and involves writing some C code. Numba makes this easy. Using the `vectorize` decorator, Numba can compile a Python function into a ufunc that operates over NumPy arrays as fast as traditional ufuncs written in C.
Ufunc arguments are scalars drawn from NumPy arrays, and the function body can be an arbitrary mathematical expression. The `vectorize` decorator needs to know the argument and return types of the ufunc; these are specified much like they are for the `jit` decorator:
```python
import math
from numba import vectorize

@vectorize(['float64(float64, float64)'])
def my_ufunc(x, y):
    return x + y + math.sqrt(x * y)

a = np.arange(1.0, 10.0)
b = np.arange(1.0, 10.0)

# Calls the compiled version of my_ufunc for each element of a and b
print(my_ufunc(a, b))
```
[ 3. 6. 9. 12. 15. 18. 21. 24. 27.]
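Because `my_ufunc` is now a true ufunc, it also broadcasts like any built-in NumPy ufunc; for instance, pairing the array with a scalar (a quick check of our own):

```python
# Ufunc broadcasting: the scalar 4.0 is paired with every element of a
print(my_ufunc(a, 4.0))  # elementwise x + 4 + sqrt(4*x)
```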
As a larger example, consider a 2-d image filter (a direct convolution):

```python
import numpy

def filter2d(image, filt):
    M, N = image.shape
    Mf, Nf = filt.shape
    Mf2 = Mf // 2
    Nf2 = Nf // 2
    result = numpy.zeros_like(image)
    for i in range(Mf2, M - Mf2):
        for j in range(Nf2, N - Nf2):
            num = 0.0
            for ii in range(Mf):
                for jj in range(Nf):
                    num += (filt[Mf-1-ii, Nf-1-jj] * image[i-Mf2+ii, j-Nf2+jj])
            result[i, j] = num
    return result
```
```python
# This kind of quadruply-nested for-loop is going to be quite slow.
# Using Numba we can compile this code to LLVM, which then gets
# compiled to machine code:
from numba import double, jit

fastfilter_2d = jit(double[:,:](double[:,:], double[:,:]))(filter2d)

# Now fastfilter_2d runs at speeds as if you had first translated
# it to C, compiled the code, and wrapped it with Python
import matplotlib.pyplot as plt  # needed for the plots below

image = numpy.random.random((100, 100))
filt = numpy.random.random((10, 10))

plt.figure(1)
plt.imshow(image)
plt.gcf().set_size_inches(6, 6)

%timeit result1 = filter2d(image, filt)
%timeit result2 = fastfilter_2d(image, filt)
```
1 loops, best of 3: 619 ms per loop
1000 loops, best of 3: 1.23 ms per loop
```python
result1 = filter2d(image, filt)
plt.subplot(221)
plt.imshow(result1)
plt.title('Filter')

result2 = fastfilter_2d(image, filt)
plt.subplot(222)
plt.imshow(result2)
plt.title('Fast Filter')

plt.gcf().set_size_inches(6, 6)
```
**Mandelbrot Fractals**
The Mandelbrot set is obtained by iteratively assigning a value to each point $c$ in the complex plane according to the formula $z_{n+1} = z_{n}^2 + c$ ($z_{0} = 0$). If the value $z$ goes to $\infty$ as $n \rightarrow \infty$, then that point $c$ is not part of the Mandelbrot set. Conversely, if the value of $z$ remains bounded no matter how large $n$ becomes, the point $c$ is part of the Mandelbrot set.
From Wikipedia, "Mandelbrot set images are made by sampling complex numbers and determining for each whether the result tends towards infinity when a particular mathematical operation is iterated on it. Treating the real and imaginary parts of each number as image coordinates, pixels are colored according to how rapidly the sequence diverges, if at all."
The following code implements the escape-time algorithm, which tells you how many iterations it takes for a point to "escape" (become unbounded), if it does so within a given maximum number of iterations. This version does not use complex numbers, although a version could be written that does (a sketch of one follows the benchmarks below).
```python
from __future__ import division
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.pyplot as plt

# IPython magic command for inline plotting
%matplotlib inline
# a better plot shape for IPython
mpl.rcParams['figure.figsize'] = [15, 3]
```
```python
def mandelbrot(minR, maxR, minI, maxI, samples=51, iters=25):
    """
    Generate the Mandelbrot set within the boundaries of the limits
    maxR, minR, minI, maxI with samples and a maximum iteration of iters.
    """
    # Generate a mesh for the set.
    setR = np.linspace(minR, maxR, samples)
    setI = np.linspace(minI, maxI, samples)

    # Calculate the values at each point of the mesh by the escape-time
    # fractal algorithm.
    pts = np.zeros([samples, samples])
    for ii in range(1, len(setR)):
        for jj in range(1, len(setI)):
            it = 0
            x = 0.0
            y = 0.0
            xx = setR[ii]
            yy = setI[jj]

            # Look for escape---i.e., does the value settle down in a few
            # iterations or does it keep going?
            while(x * x + y * y < 4 and it < iters):
                xtmp = x * x - y * y + xx
                y = 2 * x * y + yy
                x = xtmp
                it += 1
            pts[ii, jj] = it
    return setR, setI, pts
```
```python
# Plot boundaries
minR = -2.25
maxR = 0.75
minI = -1.5
maxI = 1.5

samples = 201
iters = 20

x, y, z = mandelbrot(minR, maxR, minI, maxI, samples, iters)
z = z.transpose()

mpl.rcParams['figure.figsize'] = [8, 8]
plt.imshow(z, interpolation='nearest')
plt.show()
```
```python
%timeit mandelbrot(minR, maxR, minI, maxI, samples, iters)
```
1 loops, best of 3: 694 ms per loop

```python
mandelbrot_jit = jit(mandelbrot)
%timeit mandelbrot_jit(minR, maxR, minI, maxI, samples, iters)
```
1000 loops, best of 3: 799 µs per loop
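As promised above, the escape-time kernel can also be written with complex arithmetic, which Numba supports as well. A minimal sketch of that variant (our own, not part of the original benchmark):

```python
@jit
def escape_time(c, iters):
    # Iterate z_{n+1} = z_n**2 + c from z_0 = 0 and count the steps
    # until |z|**2 exceeds 4 (escape) or the budget `iters` runs out.
    z = 0.0j
    for it in range(iters):
        z = z * z + c
        if z.real * z.real + z.imag * z.imag >= 4.0:
            return it
    return iters

print(escape_time(-1.0 + 0.0j, 25))  # bounded: returns 25
print(escape_time(1.0 + 1.0j, 25))   # escapes after a couple of steps
```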
Finally, a short glossary of Numba terms used above:

**JIT function** - Shorthand for "a function compiled with Numba."

**loop-jitting** - A feature of compilation in object mode whereby a loop can be automatically extracted and compiled in nopython mode. This allows functions with operations unsupported in nopython mode (such as array creation) to see significant performance improvements if they contain loops with only nopython-supported operations.

**nopython mode** - A Numba compilation mode that generates code that does not access the Python C API. This compilation mode produces the highest-performance code, but requires that the native types of all values in the function can be inferred and that no new objects are allocated. Unless otherwise instructed, the @jit decorator will automatically fall back to object mode if nopython mode cannot be used.

**object mode** - A Numba compilation mode that generates code that handles all values as Python objects and uses the Python C API to perform all operations on those objects. Code compiled in object mode will often run no faster than Python interpreted code, unless the Numba compiler can take advantage of loop-jitting.

**type inference** - The process by which Numba determines the specialized types of all values within a function being compiled. Type inference can fail if arguments or globals have Python types unknown to Numba, or if functions are used that are not recognized by Numba. Successful type inference is a prerequisite for compilation in nopython mode.

**ufunc** - A NumPy universal function. Numba can create new compiled ufuncs with the `@vectorize` decorator.
Neal Davis and Lakshmi Rao developed these materials for Computational Science and Engineering at the University of Illinois at Urbana–Champaign.
This content is available under a [Creative Commons Attribution 3.0 Unported License](https://creativecommons.org/licenses/by/3.0/).