Numba 0.53 CUDA Release Demo

Key changes to the CUDA target for Release 0.53 demonstrated in this notebook:

  • More atomic operations: AND, OR, XOR, INC, DEC, Exchange, and Compare-and-Swap (Michael Collison).
  • Support for math.log2 and math.remainder (Guilherme Leobas).
  • Grid synchronization with Cooperative Grid Groups (Nick White and Graham Markall).
  • Improved kernel launch performance (Graham Markall).
  • Support for passing tuples and namedtuples to kernels (@heyer2, Alex Tait, and Graham Markall).
  • A new cube root intrinsic (Michael Collison).

Other key changes, not demonstrated in this notebook, include:

  • Support for CUDA 11.2, which uses a new version of NVVM IR (1.6) that is incompatible with previous Numba releases (Graham Markall).
  • The CUDA Array Interface is now at Version 3, which adds stream and synchronization semantics.
    • The V3 specification was agreed with the input of many contributors: Stuart Archibald, Frédéric Bastien, Lisandro Dalcin, Sanjoy Das, Peter Entschev, Leo Fang, Mark Harris, Peter Hawkins, Jake Hemstad, Dave Hirschfeld, John Kirkham, Keith Kraus, Siu Kwan Lam, Graham Markall, Stan Seibert, Peter Würtz, Edward Z. Yang.

Some useful imports

In [1]:
from collections import namedtuple
from numba import cuda, void, int32
import math
import numpy as np

CUDA Atomics

Logical Operations

The following operations are now implemented, with the signatures cuda.atomic.<op>(ary, idx, val). These execute ary[idx] = ary[idx] <op> val atomically, and return the old value at ary[idx] for int32, uint32, int64, and uint64 operands:

  • AND: cuda.atomic.and_(ary, idx, val).
  • OR: cuda.atomic.or_(ary, idx, val).
  • XOR: cuda.atomic.xor(ary, idx, val).

Note the underscore suffix for the AND and OR operations, which is needed to prevent a collision with the and and or keywords.

A quick demo of AND:

In [2]:
@cuda.jit
def and_demo(x, old, val):
    old[0] = cuda.atomic.and_(x, 0, val)


x = np.array([4], dtype=np.uint32)
old = np.zeros_like(x)

and_demo[1, 1](x, old, 3)

We expect that 4 & 3 = 0:

In [3]:
print(x[0])
0

The old value should be 4:

In [4]:
print(old[0])
4

We can demonstrate OR similarly:

In [5]:
@cuda.jit
def or_demo(x, old, val):
    old[0] = cuda.atomic.or_(x, 0, val)


x = np.array([4], dtype=np.uint32)
old = np.zeros_like(x)

or_demo[1, 1](x, old, 3)

Hopefully 4 | 3 = 7:

In [6]:
print(x[0])
7

And the old value should still be 4:

In [7]:
print(old[0])
4

XOR is left as an exercise for the reader.

Increment and decrement

Increment and decrement are also supported for uint32 and uint64 operands:

  • INC: cuda.atomic.inc(ary, idx, val), increments ary[idx] by 1 up to val, then resets to 0.
  • DEC: cuda.atomic.dec(ary, idx, val), decrements ary[idx] by 1 down to 0, resetting to val beyond 0 or if ary[idx] is greater than val initially.

A simple example using INC:

In [8]:
@cuda.jit
def inc_demo(x, old, val):
    old[0] = cuda.atomic.inc(x, 0, val)

First we'll try incrementing 10, with a maximum of 20:

In [9]:
x = np.array([10], dtype=np.uint32)
old = np.zeros_like(x)

inc_demo[1, 1](x, old, 20)

This should yield 11, with an old value of 10:

In [10]:
print(x[0])
print(old[0])
11
10

Now if we increment 10 this time, but with a maximum of 10:

In [11]:
x = np.array([10], dtype=np.uint32)
old = np.zeros_like(x)

inc_demo[1, 1](x, old, 10)

We should see that the counter has reset to 0:

In [12]:
print(x[0])
print(old[0])
0
10

Exchange, Compare-and-swap

Atomic compare-and-swap is now extended to support 64-bit operands (int64, uint64) operands in addition to 32-bit (int32, uint32). Atomic exchange, which swaps values atomically with no comparison, is added:

  • cuda.atomic.compare_and_swap(ary, old, val) performs ary[0] = val if ary[0] == old, returning the original ary[0] value.
  • cuda.atomic.exch(ary, idx, val) performs ary[idx] = val, returning the old value of ary[idx].

A short example exchanging values:

In [13]:
@cuda.jit
def exch_demo(x, old, val):
    old[0] = cuda.atomic.exch(x, 0, val)

x = np.array([10], dtype=np.uint32)
old = np.zeros_like(x)
exch_demo[1, 1](x, old, 15)

We should see that x[0] now contains 15:

In [14]:
print(x[0])
15

And the old value, 10, was returned by the exch function:

In [15]:
print(old[0])
10

Log2 and remainder

The math.log2 and math.remainder functions can now be used in kernels.

In [16]:
@cuda.jit
def log2_remainder_demo(y, x):
    y[0] = math.log2(x[0])
    y[1] = math.remainder(x[0], 3)

x = np.array([4], dtype=np.float32)
y = np.zeros(2, dtype=np.float32)

log2_remainder_demo[1, 1](y, x)

log2(4) is 2:

In [17]:
print(y[0])
2.0

4 remainder 3 is 1:

In [18]:
print(y[1])
1.0

Grid Synchronization / Cooperative Grid Groups

Grid Synchronization provides a way for different blocks to synchronize across the entire grids. It is implemented by instantiating a grid group with this_grid() and calling its sync() method.

The following example kernel uses the whole grid to write to rows of a matrix in sequence - each row of the matrix must be completed before any thread can move on to the next row:

In [19]:
@cuda.jit(void(int32[:,::1]))
def sequential_rows(M):
    col = cuda.grid(1)
    g = cuda.cg.this_grid()

    rows = M.shape[0]
    cols = M.shape[1]

    for row in range(1, rows):
        opposite = cols - col - 1
        # Each row's elements are one greater than the previous row
        M[row, col] = M[row - 1, opposite] + 1
        # Wait until all threads have written their column element,
        # and that the write is visible to all other threads
        g.sync()

We'll create an empty matrix for the kernel to work on, and determine an appropriate block size:

In [20]:
# Empty input data
A = np.zeros((1024, 1024), dtype=np.int32)
# A somewhat arbitrary choice (one warp), but generally smaller block sizes
# allow more blocks to be launched (noting that other limitations on
# occupancy apply such as shared memory size)
blockdim = 32
griddim = A.shape[1] // blockdim

Now we can launch the kernel and print the result:

In [21]:
# Kernel launch - this is implicitly a cooperative launch
sequential_rows[griddim, blockdim](A)

# What do the results look like?
print(A)
[[   0    0    0 ...    0    0    0]
 [   1    1    1 ...    1    1    1]
 [   2    2    2 ...    2    2    2]
 ...
 [1021 1021 1021 ... 1021 1021 1021]
 [1022 1022 1022 ... 1022 1022 1022]
 [1023 1023 1023 ... 1023 1023 1023]]

There is a more strict limit on the grid size for launching a kernel using a Grid Group, as all blocks must be able to be resident on the GPU concurrently - unlike a regular launch, it is not possible to wait for one block to fully execute before another block launches, as this would create the conditions for a deadlock.

The maximum grid size for a cooperative kernel can be enquired with max_cooperative_grid_blocks(). This varies between GPU models, depending on the number of SMs and their available resources. It also varyies for different overloads, because each overload is compiled for a different set of arguments that affects the generated code.

Here we determine the maximum grid size for our example:

In [22]:
# Grab the first overload - there's only one because we've compiled with one signature
overload = next(iter(sequential_rows.overloads.values()))
max_blocks = overload.max_cooperative_grid_blocks(blockdim)
print(max_blocks)
1152

Kernel launch performance

Numba 0.53 updates the CUDA kernel dispatcher mechanism to bring it more into line with the way the CPU dispatcher works

  • Launching lazily-compiled kernels (those jitted without signatures) is now much faster, in line with the time taken to launch eagerly-compiled kernels.
  • There is a very slight increase in dispatch time for eagerly-compiled kernels. This is being worked on and is expected to improve again for 0.54.

For bencharking, we can use a small test launching empty kernels with varying numbers of arguments with and without signatures:

In [23]:
print("Eagerly compiled with signatures:")

@cuda.jit('void()')
def sig_kernel_1():
    return

@cuda.jit('void(float32[:])')
def sig_kernel_2(arr1):
    return

@cuda.jit('void(float32[:],float32[:])')
def sig_kernel_3(arr1,arr2):
    return

@cuda.jit('void(float32[:],float32[:],float32[:])')
def sig_kernel_4(arr1,arr2,arr3):
    return

@cuda.jit('void(float32[:],float32[:],float32[:],float32[:])')
def sig_kernel_5(arr1,arr2,arr3,arr4):
    return

arr = cuda.device_array(10000, dtype=np.float32)

%timeit sig_kernel_1[1, 1]()
%timeit sig_kernel_2[1, 1](arr)
%timeit sig_kernel_3[1, 1](arr,arr)
%timeit sig_kernel_4[1, 1](arr,arr,arr)
%timeit sig_kernel_5[1, 1](arr,arr,arr,arr)

print("Without signatures:")

@cuda.jit
def nosig_kernel_1():
    return

@cuda.jit
def nosig_kernel_2(arr1):
    return

@cuda.jit
def nosig_kernel_3(arr1,arr2):
    return

@cuda.jit
def nosig_kernel_4(arr1,arr2,arr3):
    return

@cuda.jit
def nosig_kernel_5(arr1,arr2,arr3,arr4):
    return

arr = cuda.device_array(10000, dtype=np.float32)

%timeit nosig_kernel_1[1, 1]()
%timeit nosig_kernel_2[1, 1](arr)
%timeit nosig_kernel_3[1, 1](arr,arr)
%timeit nosig_kernel_4[1, 1](arr,arr,arr)
%timeit nosig_kernel_5[1, 1](arr,arr,arr,arr)
Eagerly compiled with signatures:
20 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
34.2 µs ± 32.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
44.6 µs ± 92.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
53.6 µs ± 795 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
62.4 µs ± 31.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Without signatures:
19.8 µs ± 50.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
35.1 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
44.9 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
54.1 µs ± 70.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
62.4 µs ± 43.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Example results

Comparing some results from the above benchmark between Numba 0.52 and Numba 0.53 yields the following on a Linux system with a Quadro RTX 8000 and CUDA 11.1 (11.2 was not used for the comparison as it is only supported by Numba 0.53):

Without signatures:

# Args 0.52 0.53 % Delta
0 26.8 18.4 -31.3%
1 60.0 33.8 -43.7%
2 83.0 43.4 -47.7%
3 104.0 52.3 -49.7%
4 124.0 61.0 -50.8%

With signatures:

# Args 0.52 0.53 % Delta
0 18.6 18.7 +0.5%
1 31.1 33.4 +6.7%
2 39.8 42.7 +7.3%
3 47.8 51.8 +8.4%
4 55.2 60.2 +9.1%

Support for passing tuples and namedtuples to kernels

Tuples and namedtuples can now be passed to kernels.

Tuples

Let's create a kernel and pass a tuple to it.

First we'll define a kernel that extracts values from the heterogeneously-typed tuple x and stores them in r1 and r2:

In [24]:
@cuda.jit
def extract_tuple(r1, r2, x):
    r1[0] = x[0]
    r1[1] = x[1]
    r1[2] = x[2]
    r2[0] = x[3]
    r2[1] = x[4]
    r2[2] = x[5]

A tuple of integers and floating point values:

In [25]:
x = (1, 2, 3, 4.5, 5.5, 6.5)

Some space for our results:

In [26]:
r1 = np.zeros(len(x) // 2, dtype=np.int64)
r2 = np.zeros(len(x) // 2, dtype=np.float64)

We can now launch the kernel:

In [27]:
extract_tuple[1, 1](r1, r2, x)

Next we can verify that the values have been extracted from the tuple as expected:

In [28]:
print(r1)
print(r2)
[1 2 3]
[4.5 5.5 6.5]

Named tuples

We'll create a named tuple to represent a point:

In [29]:
Point = namedtuple('Point', ('x', 'y'))

We'll create a kernel that extracts values from a Point, in a similar fashion to the tuple example above:

In [30]:
@cuda.jit
def extract_point(r, p):
    r[0] = p.x
    r[1] = p.y
    
x = Point(1, 2)
r = np.zeros(len(x), dtype=np.int64)
extract_point[1, 1](r, x)

Our extracted data:

In [31]:
print(r)
[1 2]

Nesting

We can nest tuples and named tuples arbitrarily. Using a tuple of a tuple and a scalar:

In [32]:
@cuda.jit
def extract_nested(r, x):
    r[0] = len(x)
    r[1] = len(x[0])
    r[2] = x[0][0]
    r[3] = x[0][1]
    r[4] = x[0][2]
    r[5] = x[1]


x = ((6, 5, 4), 7)
r = np.ones(6, dtype=np.int64)
extract_nested[1, 1](r, x)

Our output contains (len(x), len(x[0]), *x[0], x[1]):

In [33]:
print(r)
[2 3 6 5 4 7]

Tuples of Arrays

Arrays can also be tuple members, allowing us to group together arrays into a single parameter. For example, we might write a kernel that accepts data bundled up in a struct-of-arrays-type (SoA) form:

In [34]:
@cuda.jit
def vector_magnitudes_soa(r, v):
    i = cuda.grid(1)
    if i >= len(r):
        return
    
    r[i] = math.sqrt(v[0][i] ** 2 + v[1][i] ** 2)

Number of elements, and space for the results:

In [35]:
N = 32
r = np.zeros(N)

An SoA vector structure:

In [36]:
np.random.seed(1) # For reproducibility between notebook runs
vx = np.random.rand(N)
vy = np.random.rand(N)
v = (vx, vy)

We can pass v to the kernel rather than needing to pass vx and vy individually:

In [37]:
vector_magnitudes_soa[1, N](r, v)

Our result:

In [38]:
print(r)
[1.04472949 0.89617665 0.69187712 0.43698409 0.70201198 0.83971806
 0.18715589 0.82591084 1.06549082 0.92199529 0.50435392 1.04522133
 0.22903347 0.98574786 0.90900818 0.73193985 0.50691019 0.57362161
 0.14171652 0.70715054 0.82823808 1.00401469 0.58299133 0.6943761
 1.04769698 0.90655963 0.59541039 0.70084737 0.19827937 0.97086385
 0.70132994 0.59065734]

The cube root intrinsic

A new instrinsic, cuda.cbrt provides an efficient way to compute a cube root for float32 and float64 values. Optimized functions from NVIDIA's libdevice library (__nv_cbrt and __nv_cbrtf) underlie its implementation.

For comparison, we'll define a kernel that implements cube root using a standard mathematical form, and another one using the intrinsic:

In [39]:
@cuda.jit
def cube_root(r, x):
    for i in range(cuda.grid(1), cuda.gridsize(1), len(r)):
        r[i] = x[i] ** (1.0 / 3.0)
    
@cuda.jit
def intrinsic_cube_root(r, x):
    for i in range(cuda.grid(1), cuda.gridsize(1), len(r)):
        r[i] = cuda.cbrt(x[i])

Then we'll set up some data to test with:

In [40]:
# Array size and data type
N = 2 ** 16
dtype = np.float32

np.random.seed(1)
x = np.random.rand(N).astype(dtype) * 100
r = np.zeros_like(x)

n_threads = 256
n_blocks = N // n_threads

# Copy data to device so we can time the kernel execution only
d_r_normal = cuda.to_device(r)
d_r_fast = cuda.to_device(r)
d_x = cuda.to_device(x)

Now we'll define functions to time for benchmarking. We need to synchronize with the device after the kernel launch to ensure we capture the execution time of the kernel and not just the time spent by the CPU launching the kernel:

In [41]:
def run_normal():    
    cube_root[n_blocks, n_threads](d_r_normal, d_x)
    cuda.synchronize()

def run_fast():
    intrinsic_cube_root[n_blocks, n_threads](d_r_fast, d_x)
    cuda.synchronize()

Now let's time both versions:

In [42]:
%timeit run_normal()
%timeit run_fast()
76.2 µs ± 322 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
55.5 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The results agree to six decimal places with float32 data (this is likely to be a reasonable expectation for most use cases):

In [43]:
rtol=1.0e-6
h_r_normal = d_r_normal.copy_to_host()
h_r_fast = d_r_fast.copy_to_host()
np.testing.assert_allclose(h_r_normal, h_r_fast, rtol=rtol)
print(f"Results are in agreement to a relative tolerance of {rtol}.")
Results are in agreement to a relative tolerance of 1e-06.