Numpy performance tricks

I've been using Numpy for nearly five years, but I'm still learning performance tricks. The reason is that I currently need to deal with very large arrays (hundreds of millions of elements) and the performance of my code started to be disappointing. Then, through extensive line-by-line profiling, I discovered some subtleties that explain why some seemingly harmless lines of code can actually lead to major bottlenecks. Very often, a small trick allows to significantly improve the performance. Here is what I've learnt. These tips are intended to regular Numpy users rather than pure beginners.

Know how Numpy works

For a beginner, learning Numpy's basics is more a matter of days rather than weeks. The notion of multidimensional array is quite intuitive, so is the notion of computations on vectors and matrices for someone with a mathematics background. Computing an element-wise multiplication of two vectors with a*b is actually easier and less error-prone than with a classical imperative programming language where it would require a for loop. Yet, when comes the need to do complex computations on multidimensional arrays containing millions of elements, it becomes quite valuable to know a bit about Numpy's internals. I don't know much about it, but what I know sometimes allows me to improve the performance of my code.

Here are simplified facts that can be useful to know. Computer memory is basically one-dimensional: bytes are consecutively stored in a one-dimensional memory space and are accessed through memory adresses. A multidimensional Numpy array is stored as a contiguous block of memory, so that two successive elements in the array occupy two successive places in memory. Each element occupies itemsize bytes, depending on the data type (dtype): 2 for an int16, 4 for a float32 (single precision floating point number), 8 for a float64 = double (double precision), etc. But memory is one-dimensional: how to store a nD array in a one-dimensional space? The solution lies in the notion of shape and stride. The shape is a n-tuple with the number of elements in each dimension. The stride is a n-tuple with, for each dimension, the number of bytes (or step) that one needs to jump in memory to go from one element to the next in that dimension.

For a one-dimensional vector, the stride is typically (itemsize,), but for higher dimensions, there are more than a unique choice. The C-order (row-major order) and Fortran-order (column-major order) are two different conventions. Elements are stored row after row in the C-order, and column after column in the Fortran-order. This notion extends to arrays with more than two dimensions. For example, the matrix with [1, 2] on the first row and [3, 4] on the second row is stored internally as [1, 2, 3, 4] in C-order or [1, 3, 2, 4] in Fortran-order. Numpy uses the C-order, but that can be changed with some Numpy functions using the order keyword argument. Here, by default, the stride is (8, 4) (on a 32-bits system), since the first axis is the column, and the second is the row. 16-bit integers are used by default here.

In [1]:
a = array([[1, 2], [3, 4]])
In [2]:
In [3]:
(2, 2)
In [4]:
In [5]:
In [6]:
In [7]:
(8, 4)
In [8]:
array(a, order='F').strides
(4, 8)

Knowing this is the basis for a few tricks that we'll see below.

Beware of array copies

Memory copies happen transparently with Numpy, which is generally quite convenient compared to low-level languages where memory management is mostly up to the developer. But not knowing what happens under the hood can sometimes helps fixing some performance issues. Consider the following example.

In [9]:
a = rand(1000, 1000)

There are two ways to obtain a 1D array from a nD array: flatten and ravel. The first function returns a copy, whereas the second one returns a view (when possible), which is much faster.

In [10]:
timeit -n 100 b1 = a.flatten()
100 loops, best of 3: 9.14 ms per loop
In [11]:
timeit -n 100 b2 = a.ravel()
100 loops, best of 3: 929 ns per loop
In [12]:
array_equal(a.flatten(), a.ravel())

If you use flatten in your code, maybe you don't need to do a copy and you can use ravel instead? The performance speedup can be significant for large arrays (10000 times faster here!). Be aware, however, that the two results are slightly different, in that with flatten, you obtain a copy, whereas with ravel, you obtain a view, so that changes in the result will change the original array with ravel and not with flatten.

In [13]:
b1 = a.flatten()
b2 = a.ravel()
# return the address of the memory block
id = lambda x: x.__array_interface__['data'][0]
print(id(a), id(b1), id(b2))
(108199968, 116289568, 108199968)

Sometimes, even ravel needs to do a copy, because the array is not in the specified order (C-order by default). In the following example, a.T is in Fortran-order, so that returning a flattened version with C-order implies memory copy. It is 3-4 times slower than with a.ravel().

In [14]:
b = zeros(1000000)
In [15]:
timeit -n 100 b[:] = a.ravel()
100 loops, best of 3: 5.14 ms per loop
In [16]:
timeit -n 100 b[:] = a.T.ravel()
100 loops, best of 3: 20.4 ms per loop
In [17]:
timeit -n 100 b[:] = a.ravel(order='F')
100 loops, best of 3: 19 ms per loop

Know how to use broadcasting

You know how to use tile and repeat to do vectorized computations on your arrays. These functions obviousy involve array copies. You may use them to do some temporary calculations. You don't always need them and you may be able to use broadcasting instead for better performance. In the following example, we want to add identical copies of b to each column of a.

In [18]:
a = rand(10000, 1000)
b = arange(10000)
In [19]:
c = a + b
ValueError                                Traceback (most recent call last)
<ipython-input-19-60f555c9e9aa> in <module>()
----> 1 c = a + b

ValueError: operands could not be broadcast together with shapes (10000,1000) (10000) 

Adding a and b does not work here because they do not have compatible shapes (we'll see what that means in a minute). So a first possibility is to replace the smallest array b with an array with the same size as a so that we can add them. This involves the creation of a temporary array 1000 times bigger than b.

In [20]:
timeit -n 10 c = a + tile(b.reshape((-1, 1)), (1, 1000))
10 loops, best of 3: 206 ms per loop

We can do better thanks to broadcasting: two arrays with different shapes can still be added together: they need to have compatible shapes. This means that in each dimension, the length is the same in the two arrays, or one is equal to 1, in which case it is assumed that this value should be repeated along this axis to match the other array's dimension. This repetition does not involve any copy, so it's about twice as fast. Here, we can just reshape b to make it a column vector, and it will have a shape compatible with a's shape.

In [21]:
timeit -n 10 c = a + b.reshape((-1, 1))
10 loops, best of 3: 107 ms per loop
In [22]:
array_equal(a + tile(b.reshape((-1, 1)), (1, 1000)), a + b.reshape((-1, 1)))

Faster alternatives to fancy indexing

Fancy indexing offers you the possibility to extract any portion of an array, even with repeated parts or non-contiguous parts. But it may be slow sometimes and faster alternatives may exist depending on what you're trying to do. Here is an example.

In [23]:
a = rand(10000, 100)
ind = randint(low=0, high=10000, size=10000)
In [24]:
timeit -n 100 b = a[ind,:]
100 loops, best of 3: 34 ms per loop
In [25]:
timeit -n 100 b = take(a, ind, axis=0)
100 loops, best of 3: 9.68 ms per loop
In [26]:
array_equal(a[ind,:], take(a, ind, axis=0))

Here, take(a, ind, axis=0) replaces a[ind,:] and is 3-4 times faster. Here is another example with boolean masks.

In [27]:
ind = a[:,0] > .5
In [28]:
timeit -n 10 b = a[ind,:]
10 loops, best of 3: 16.7 ms per loop
In [29]:
timeit -n 10 b = compress(ind, a, axis=0)
10 loops, best of 3: 4.92 ms per loop
In [30]:
array_equal(a[ind,:], compress(ind, a, axis=0))

Here, using compress instead of fancy indexing is about 3-4 times faster.

Numpy's loadtxt may be slow

Numpy's savetxt and loadtxt functions are quite useful, but should not be used with large arrays, where binary files would be more adapted and much faster. Sometimes, however, you really need to open text files. In these cases, depending on the specific structure of your files, you may be able to write a faster function yourself. In the following example, the custom loadtxt_fast function (found here) is about twice as fast as loadtxt. Also, it creates less temporary objects and uses less memory. Another solution is to use the great Pandas library, which extends Numpy in different areas, notably for I/O functions which are generally faster and more efficient. This should probably be the subject of a future post.

In [31]:
def loadtxt_fast(filename, dtype=int32, skiprows=0, delimiter=' '):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
            skip = 0
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
            loadtxt_fast.rowlength = len(line)
    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, loadtxt_fast.rowlength))
    return data
In [32]:
a = randint(low=0, high=1000, size=(100000, 10))
fn = '_array.txt'
savetxt(fn, a, fmt='%d')
In [33]:
timeit -n 1 b = loadtxt(fn)
1 loops, best of 3: 3.23 s per loop
In [34]:
timeit -n 1 b = loadtxt_fast(fn)
1 loops, best of 3: 1.65 s per loop
In [35]:
array_equal(loadtxt(fn), loadtxt_fast(fn))

Do not use built-in Python functions for lists on arrays

This trick is a fun one. I once had a major performance issue in my code. I had a pretty good idea where it was, I can't remember the details but it was quite involved. After a thorough line-by-line profiling session however, I realized that I was completely wrong and that the bottleneck was actually the computation of the maximum element of an array. I was using max(a) instead of a.max(), thereby automatically converting the array into a list and using the built-in Python max function! Using Numpy's max function can be about 100 times faster. This is the kind of mistake you don't do twice.

In [36]:
a = rand(1000000)
In [37]:
timeit -n 10 max(a)
10 loops, best of 3: 273 ms per loop
In [38]:
timeit -n 10 a.max()
10 loops, best of 3: 3.2 ms per loop
In [39]:
max(a) == a.max()

I'm sure there are many more performance tricks out there. But if you want to discover your own: learn a bit more about how Numpy works, and learn how to profile Python code line by line!

by Cyrille Rossant