Key changes to the CUDA target for Release 0.53 demonstrated in this notebook:

- More atomic operations: AND, OR, XOR, INC, DEC, Exchange, and Compare-and-Swap (Michael Collison).
- Support for
`math.log2`

and`math.remainder`

(Guilherme Leobas). - Grid synchronization with Cooperative Grid Groups (Nick White and Graham Markall).
- Improved kernel launch performance (Graham Markall).
- Support for passing tuples and namedtuples to kernels (@heyer2, Alex Tait, and Graham Markall).
- A new cube root intrinsic (Michael Collison).

Other key changes, not demonstrated in this notebook, include:

- Support for CUDA 11.2, which uses a new version of NVVM IR (1.6) that is incompatible with previous Numba releases (Graham Markall).
- The CUDA Array Interface is now at Version 3, which adds stream and synchronization semantics.
- The V3 specification was agreed with the input of many contributors: Stuart Archibald, Frédéric Bastien, Lisandro Dalcin, Sanjoy Das, Peter Entschev, Leo Fang, Mark Harris, Peter Hawkins, Jake Hemstad, Dave Hirschfeld, John Kirkham, Keith Kraus, Siu Kwan Lam, Graham Markall, Stan Seibert, Peter Würtz, Edward Z. Yang.

In [1]:

```
from collections import namedtuple
from numba import cuda, void, int32
import math
import numpy as np
```

The following operations are now implemented, with the signatures `cuda.atomic.<op>(ary, idx, val)`

. These execute `ary[idx] = ary[idx] <op> val`

atomically, and return the old value at `ary[idx]`

for `int32`

, `uint32`

, `int64`

, and `uint64`

operands:

- AND:
`cuda.atomic.and_(ary, idx, val)`

. - OR:
`cuda.atomic.or_(ary, idx, val)`

. - XOR:
`cuda.atomic.xor(ary, idx, val)`

.

Note the underscore suffix for the AND and OR operations, which is needed to prevent a collision with the `and`

and `or`

keywords.

A quick demo of AND:

In [2]:

```
@cuda.jit
def and_demo(x, old, val):
old[0] = cuda.atomic.and_(x, 0, val)
x = np.array([4], dtype=np.uint32)
old = np.zeros_like(x)
and_demo[1, 1](x, old, 3)
```

We expect that `4 & 3 = 0`

:

In [3]:

```
print(x[0])
```

The old value should be `4`

:

In [4]:

```
print(old[0])
```

We can demonstrate OR similarly:

In [5]:

```
@cuda.jit
def or_demo(x, old, val):
old[0] = cuda.atomic.or_(x, 0, val)
x = np.array([4], dtype=np.uint32)
old = np.zeros_like(x)
or_demo[1, 1](x, old, 3)
```

Hopefully `4 | 3 = 7`

:

In [6]:

```
print(x[0])
```

And the old value should still be 4:

In [7]:

```
print(old[0])
```

XOR is left as an exercise for the reader.

Increment and decrement are also supported for `uint32`

and `uint64`

operands:

- INC:
`cuda.atomic.inc(ary, idx, val)`

, increments`ary[idx]`

by`1`

up to`val`

, then resets to`0`

. - DEC:
`cuda.atomic.dec(ary, idx, val)`

, decrements`ary[idx]`

by`1`

down to`0`

, resetting to`val`

beyond`0`

or if`ary[idx]`

is greater than`val`

initially.

A simple example using INC:

In [8]:

```
@cuda.jit
def inc_demo(x, old, val):
old[0] = cuda.atomic.inc(x, 0, val)
```

First we'll try incrementing 10, with a maximum of 20:

In [9]:

```
x = np.array([10], dtype=np.uint32)
old = np.zeros_like(x)
inc_demo[1, 1](x, old, 20)
```

This should yield 11, with an old value of 10:

In [10]:

```
print(x[0])
print(old[0])
```

Now if we increment 10 this time, but with a maximum of 10:

In [11]:

```
x = np.array([10], dtype=np.uint32)
old = np.zeros_like(x)
inc_demo[1, 1](x, old, 10)
```

We should see that the counter has reset to 0:

In [12]:

```
print(x[0])
print(old[0])
```

Atomic compare-and-swap is now extended to support 64-bit operands (`int64`

, `uint64`

) operands in addition to 32-bit (`int32`

, `uint32`

). Atomic exchange, which swaps values atomically with no comparison, is added:

`cuda.atomic.compare_and_swap(ary, old, val)`

performs`ary[0] = val if ary[0] == old`

, returning the original`ary[0]`

value.`cuda.atomic.exch(ary, idx, val)`

performs`ary[idx] = val`

, returning the old value of`ary[idx]`

.

A short example exchanging values:

In [13]:

```
@cuda.jit
def exch_demo(x, old, val):
old[0] = cuda.atomic.exch(x, 0, val)
x = np.array([10], dtype=np.uint32)
old = np.zeros_like(x)
exch_demo[1, 1](x, old, 15)
```

We should see that `x[0]`

now contains 15:

In [14]:

```
print(x[0])
```

And the old value, `10`

, was returned by the `exch`

function:

In [15]:

```
print(old[0])
```

The `math.log2`

and `math.remainder`

functions can now be used in kernels.

In [16]:

```
@cuda.jit
def log2_remainder_demo(y, x):
y[0] = math.log2(x[0])
y[1] = math.remainder(x[0], 3)
x = np.array([4], dtype=np.float32)
y = np.zeros(2, dtype=np.float32)
log2_remainder_demo[1, 1](y, x)
```

`log2(4)`

is `2`

:

In [17]:

```
print(y[0])
```

4 remainder 3 is 1:

In [18]:

```
print(y[1])
```

Grid Synchronization provides a way for different blocks to synchronize across the entire grids. It is implemented by instantiating a grid group with `this_grid()`

and calling its `sync()`

method.

The following example kernel uses the whole grid to write to rows of a matrix in sequence - each row of the matrix must be completed before any thread can move on to the next row:

In [19]:

```
@cuda.jit(void(int32[:,::1]))
def sequential_rows(M):
col = cuda.grid(1)
g = cuda.cg.this_grid()
rows = M.shape[0]
cols = M.shape[1]
for row in range(1, rows):
opposite = cols - col - 1
# Each row's elements are one greater than the previous row
M[row, col] = M[row - 1, opposite] + 1
# Wait until all threads have written their column element,
# and that the write is visible to all other threads
g.sync()
```

We'll create an empty matrix for the kernel to work on, and determine an appropriate block size:

In [20]:

```
# Empty input data
A = np.zeros((1024, 1024), dtype=np.int32)
# A somewhat arbitrary choice (one warp), but generally smaller block sizes
# allow more blocks to be launched (noting that other limitations on
# occupancy apply such as shared memory size)
blockdim = 32
griddim = A.shape[1] // blockdim
```

Now we can launch the kernel and print the result:

In [21]:

```
# Kernel launch - this is implicitly a cooperative launch
sequential_rows[griddim, blockdim](A)
# What do the results look like?
print(A)
```

There is a more strict limit on the grid size for launching a kernel using a Grid Group, as all blocks must be able to be resident on the GPU concurrently - unlike a regular launch, it is not possible to wait for one block to fully execute before another block launches, as this would create the conditions for a deadlock.

The maximum grid size for a cooperative kernel can be enquired with `max_cooperative_grid_blocks()`

. This varies between GPU models, depending on the number of SMs and their available resources. It also varyies for different overloads, because each overload is compiled for a different set of arguments that affects the generated code.

Here we determine the maximum grid size for our example:

In [22]:

```
# Grab the first overload - there's only one because we've compiled with one signature
overload = next(iter(sequential_rows.overloads.values()))
max_blocks = overload.max_cooperative_grid_blocks(blockdim)
print(max_blocks)
```

Numba 0.53 updates the CUDA kernel dispatcher mechanism to bring it more into line with the way the CPU dispatcher works

- Launching lazily-compiled kernels (those jitted without signatures) is now much faster, in line with the time taken to launch eagerly-compiled kernels.
- There is a very slight increase in dispatch time for eagerly-compiled kernels. This is being worked on and is expected to improve again for 0.54.

For bencharking, we can use a small test launching empty kernels with varying numbers of arguments with and without signatures:

In [23]:

```
print("Eagerly compiled with signatures:")
@cuda.jit('void()')
def sig_kernel_1():
return
@cuda.jit('void(float32[:])')
def sig_kernel_2(arr1):
return
@cuda.jit('void(float32[:],float32[:])')
def sig_kernel_3(arr1,arr2):
return
@cuda.jit('void(float32[:],float32[:],float32[:])')
def sig_kernel_4(arr1,arr2,arr3):
return
@cuda.jit('void(float32[:],float32[:],float32[:],float32[:])')
def sig_kernel_5(arr1,arr2,arr3,arr4):
return
arr = cuda.device_array(10000, dtype=np.float32)
%timeit sig_kernel_1[1, 1]()
%timeit sig_kernel_2[1, 1](arr)
%timeit sig_kernel_3[1, 1](arr,arr)
%timeit sig_kernel_4[1, 1](arr,arr,arr)
%timeit sig_kernel_5[1, 1](arr,arr,arr,arr)
print("Without signatures:")
@cuda.jit
def nosig_kernel_1():
return
@cuda.jit
def nosig_kernel_2(arr1):
return
@cuda.jit
def nosig_kernel_3(arr1,arr2):
return
@cuda.jit
def nosig_kernel_4(arr1,arr2,arr3):
return
@cuda.jit
def nosig_kernel_5(arr1,arr2,arr3,arr4):
return
arr = cuda.device_array(10000, dtype=np.float32)
%timeit nosig_kernel_1[1, 1]()
%timeit nosig_kernel_2[1, 1](arr)
%timeit nosig_kernel_3[1, 1](arr,arr)
%timeit nosig_kernel_4[1, 1](arr,arr,arr)
%timeit nosig_kernel_5[1, 1](arr,arr,arr,arr)
```

Comparing some results from the above benchmark between Numba 0.52 and Numba 0.53 yields the following on a Linux system with a Quadro RTX 8000 and CUDA 11.1 (11.2 was not used for the comparison as it is only supported by Numba 0.53):

**Without signatures:**

# Args | 0.52 | 0.53 | % Delta |
---|---|---|---|

0 | 26.8 | 18.4 | -31.3% |

1 | 60.0 | 33.8 | -43.7% |

2 | 83.0 | 43.4 | -47.7% |

3 | 104.0 | 52.3 | -49.7% |

4 | 124.0 | 61.0 | -50.8% |

**With signatures:**

# Args | 0.52 | 0.53 | % Delta |
---|---|---|---|

0 | 18.6 | 18.7 | +0.5% |

1 | 31.1 | 33.4 | +6.7% |

2 | 39.8 | 42.7 | +7.3% |

3 | 47.8 | 51.8 | +8.4% |

4 | 55.2 | 60.2 | +9.1% |

Tuples and namedtuples can now be passed to kernels.

Let's create a kernel and pass a tuple to it.

First we'll define a kernel that extracts values from the heterogeneously-typed tuple `x`

and stores them in `r1`

and `r2`

:

In [24]:

```
@cuda.jit
def extract_tuple(r1, r2, x):
r1[0] = x[0]
r1[1] = x[1]
r1[2] = x[2]
r2[0] = x[3]
r2[1] = x[4]
r2[2] = x[5]
```

A tuple of integers and floating point values:

In [25]:

```
x = (1, 2, 3, 4.5, 5.5, 6.5)
```

Some space for our results:

In [26]:

```
r1 = np.zeros(len(x) // 2, dtype=np.int64)
r2 = np.zeros(len(x) // 2, dtype=np.float64)
```

We can now launch the kernel:

In [27]:

```
extract_tuple[1, 1](r1, r2, x)
```

Next we can verify that the values have been extracted from the tuple as expected:

In [28]:

```
print(r1)
print(r2)
```

We'll create a named tuple to represent a point:

In [29]:

```
Point = namedtuple('Point', ('x', 'y'))
```

We'll create a kernel that extracts values from a `Point`

, in a similar fashion to the tuple example above:

In [30]:

```
@cuda.jit
def extract_point(r, p):
r[0] = p.x
r[1] = p.y
x = Point(1, 2)
r = np.zeros(len(x), dtype=np.int64)
extract_point[1, 1](r, x)
```

Our extracted data:

In [31]:

```
print(r)
```

We can nest tuples and named tuples arbitrarily. Using a tuple of a tuple and a scalar:

In [32]:

```
@cuda.jit
def extract_nested(r, x):
r[0] = len(x)
r[1] = len(x[0])
r[2] = x[0][0]
r[3] = x[0][1]
r[4] = x[0][2]
r[5] = x[1]
x = ((6, 5, 4), 7)
r = np.ones(6, dtype=np.int64)
extract_nested[1, 1](r, x)
```

Our output contains `(len(x), len(x[0]), *x[0], x[1])`

:

In [33]:

```
print(r)
```

Arrays can also be tuple members, allowing us to group together arrays into a single parameter. For example, we might write a kernel that accepts data bundled up in a *struct-of-arrays*-type (SoA) form:

In [34]:

```
@cuda.jit
def vector_magnitudes_soa(r, v):
i = cuda.grid(1)
if i >= len(r):
return
r[i] = math.sqrt(v[0][i] ** 2 + v[1][i] ** 2)
```

Number of elements, and space for the results:

In [35]:

```
N = 32
r = np.zeros(N)
```

An SoA vector structure:

In [36]:

```
np.random.seed(1) # For reproducibility between notebook runs
vx = np.random.rand(N)
vy = np.random.rand(N)
v = (vx, vy)
```

We can pass `v`

to the kernel rather than needing to pass `vx`

and `vy`

individually:

In [37]:

```
vector_magnitudes_soa[1, N](r, v)
```

Our result:

In [38]:

```
print(r)
```

A new instrinsic, `cuda.cbrt`

provides an efficient way to compute a cube root for `float32`

and `float64`

values. Optimized functions from NVIDIA's libdevice library (`__nv_cbrt`

and `__nv_cbrtf`

) underlie its implementation.

For comparison, we'll define a kernel that implements cube root using a standard mathematical form, and another one using the intrinsic:

In [39]:

```
@cuda.jit
def cube_root(r, x):
for i in range(cuda.grid(1), cuda.gridsize(1), len(r)):
r[i] = x[i] ** (1.0 / 3.0)
@cuda.jit
def intrinsic_cube_root(r, x):
for i in range(cuda.grid(1), cuda.gridsize(1), len(r)):
r[i] = cuda.cbrt(x[i])
```

Then we'll set up some data to test with:

In [40]:

```
# Array size and data type
N = 2 ** 16
dtype = np.float32
np.random.seed(1)
x = np.random.rand(N).astype(dtype) * 100
r = np.zeros_like(x)
n_threads = 256
n_blocks = N // n_threads
# Copy data to device so we can time the kernel execution only
d_r_normal = cuda.to_device(r)
d_r_fast = cuda.to_device(r)
d_x = cuda.to_device(x)
```

Now we'll define functions to time for benchmarking. We need to synchronize with the device after the kernel launch to ensure we capture the execution time of the kernel and not just the time spent by the CPU launching the kernel:

In [41]:

```
def run_normal():
cube_root[n_blocks, n_threads](d_r_normal, d_x)
cuda.synchronize()
def run_fast():
intrinsic_cube_root[n_blocks, n_threads](d_r_fast, d_x)
cuda.synchronize()
```

Now let's time both versions:

In [42]:

```
%timeit run_normal()
%timeit run_fast()
```

The results agree to six decimal places with `float32`

data (this is likely to be a reasonable expectation for most use cases):

In [43]:

```
rtol=1.0e-6
h_r_normal = d_r_normal.copy_to_host()
h_r_fast = d_r_fast.copy_to_host()
np.testing.assert_allclose(h_r_normal, h_r_fast, rtol=rtol)
print(f"Results are in agreement to a relative tolerance of {rtol}.")
```