One of the most exciting new products from Continuum Analytics is called NumbaPro, which allows code written in Python to target CUDA-capable GPUs for parallelized computation.
For a quick primer on how parallel computation on GPUs works, check out ???
Now, for a brief proof-of-concept look at the capabilities of NumbaPro, let's return to our old friend, 1D Nonlinear Convection.
Yes, this is a trivial problem, but it is a good demonstration of the potential for speed gains using NumbaPro and GPU computation.
We'll start by importing the usual libraries, plus the time library so we can measure our performance gains, and also the appropriate libraries from numbapro.
autojit is the same decorator we used with regular numba, and in fact we'll be using it the same way, to provide a comparison between regular Numba and NumbaPro. cuda is the NumbaPro library that provides the CUDA intrinsics which allow us to target the GPU for computation. float32 is a data type. Python generally takes care of whether we want an int or a str for us, but when we start delving into the dark depths of memory management, it can be helpful (and sometimes required) to be a bit more specific about our data formats.
import matplotlib.pyplot as plt
import numpy as np
import time
from numbapro import autojit, cuda, jit, float32
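To see concretely what being explicit about data types buys us, NumPy exposes the same float32 type. This quick sketch (plain NumPy, not numbapro) shows the memory-footprint difference between Python's default double precision and explicit single precision:

```python
import numpy as np

# Python floats are double precision (8 bytes); float32 halves that.
x64 = np.ones(4)                      # default dtype: float64
x32 = np.ones(4, dtype=np.float32)   # explicit single precision

# x64.itemsize is 8 bytes per element; x32.itemsize is 4
```

Halving the element size matters on a GPU, where memory bandwidth and on-device storage are often the bottleneck.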
The first function we're trying out is a simple implementation using array operations in Numpy.
###1-D Nonlinear convection implemented using Numpy
def NonLinNumpy(u, un, nx, nt, dx, dt):
    ###Run through nt timesteps and plot/animate each step
    for n in range(nt): ##loop across number of time steps
        un[:] = u[:]
        u[1:] = -un[1:]*dt/dx*(un[1:]-un[:-1])+un[1:]
    return u
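If the sliced update looks opaque, here's a minimal check (not from the original notebook) that one vectorized timestep matches the equivalent element-by-element loop:

```python
import numpy as np

dx, dt = 0.1, 0.025
un = np.array([1., 1., 2., 2., 1.])  # state at the previous timestep

# Vectorized update, exactly as in NonLinNumpy
u_vec = un.copy()
u_vec[1:] = -un[1:]*dt/dx*(un[1:]-un[:-1]) + un[1:]

# The same update written as an explicit loop
u_loop = un.copy()
for i in range(1, len(un)):
    u_loop[i] = -un[i]*dt/dx*(un[i]-un[i-1]) + un[i]

# Both produce identical results: np.allclose(u_vec, u_loop) is True
```

The slice `un[1:]-un[:-1]` is just the array of adjacent differences, so the whole loop body collapses into one array expression.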
The 'vanilla' version is what we used for Step 2: two nested loops, and not that efficient.
###1-D Nonlinear convection implemented using 'vanilla' Python
def NonLinVanilla(u, nx, nt, dx, dt):
    for n in range(nt):
        for i in range(1, nx-1):
            u[i+1] = -u[i]*dt/dx*(u[i]-u[i-1])+u[i]
    return u
Here we've implemented the same 'vanilla' version, but we've added the @autojit decorator, which tells Numba to JIT compile this function for a nice speed boost.
###1-D Nonlinear convection implemented using Numba JIT compiler (similar to LLVM)
@autojit
def NonLinNumba(u, un, nx, nt, dx, dt):
    for n in range(nt):
        for i in range(1, nx):
            un[i] = -u[i]*dt/dx*(u[i]-u[i-1])+u[i]
    return un
There's a lot going on here that will be new to you, so we'll go through it piece by piece.
@jit(argtypes=[float32[:], float32, float32, float32, float32[:]], target='gpu')

Instead of @autojit, which automatically figures out data types for us, we have to specify what kind of variables will be sent to this function (which is actually a CUDA 'kernel'). The argtypes above tell the kernel that there will be five variables: three scalar floats and two float arrays.
###1-D Nonlinear convection implemented using NumbaPro CUDA-JIT
@jit(argtypes=[float32[:], float32, float32, float32, float32[:]], target='gpu')
def NonLinCudaJit(u, dx, dt, nt, un):
    tid = cuda.threadIdx.x
    blkid = cuda.blockIdx.x
    blkdim = cuda.blockDim.x
    i = tid + blkid * blkdim

    if i >= u.shape[0]:
        return

    for n in range(nt):
        un[i] = -u[i]*dt/dx*(u[i]-u[i-1])+u[i]
        cuda.syncthreads()
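The index arithmetic at the top of the kernel is how each CUDA thread finds 'its' grid point: every block holds blockDim.x threads, so thread tid in block blkid owns global element tid + blkid * blockDim.x. A pure-Python sketch of that mapping (an illustration only, not actual CUDA):

```python
def global_indices(griddim, blockdim):
    """Enumerate i = tid + blkid * blkdim for every thread in the launch."""
    indices = []
    for blkid in range(griddim):        # plays the role of cuda.blockIdx.x
        for tid in range(blockdim):     # plays the role of cuda.threadIdx.x
            indices.append(tid + blkid * blockdim)
    return indices

# 4 blocks of 8 threads cover exactly the indices 0..31, each one once
idx = global_indices(griddim=4, blockdim=8)
```

On the GPU all of these iterations run concurrently, which is why the kernel also needs the `if i >= u.shape[0]` guard: threads whose index falls past the end of the array simply return without touching memory.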
def main(nx):
    ##System Conditions
    #nx = 500
    nt = 500
    c = 1
    xmax = 15.0
    dx = xmax/(nx-1)
    sigma = 0.25
    dt = sigma*dx

    ##Initial Conditions for wave
    ui = np.ones(nx) ##create a 1xn vector of 1's
    ui[.5/dx:1/dx+1] = 2 ##set hat function I.C. : .5<=x<=1 is 2
    un = np.ones(nx)

    if nx < 20001:
        t1 = time.time()
        u = NonLinVanilla(ui, nx, nt, dx, dt)
        t2 = time.time()
        print "Vanilla version took: %.6f seconds" % (t2-t1)

    ui = np.ones(nx) ##create a 1xn vector of 1's
    ui[.5/dx:1/dx+1] = 2 ##set hat function I.C. : .5<=x<=1 is 2

    t1 = time.time()
    u = NonLinNumpy(ui, un, nx, nt, dx, dt)
    t2 = time.time()
    print "Numpy version took: %.6f seconds" % (t2-t1)
    numpytime = t2-t1
    #plt.plot(np.linspace(0,xmax,nx),u[:],marker='o',lw=2)

    ui = np.ones(nx) ##create a 1xn vector of 1's
    ui[.5/dx:1/dx+1] = 2 ##set hat function I.C. : .5<=x<=1 is 2

    t1 = time.time()
    u = NonLinNumba(ui, un, nx, nt, dx, dt)
    t2 = time.time()
    print "Numbapro Vectorize version took: %.6f seconds" % (t2-t1)
    vectime = t2-t1
    #plt.plot(np.linspace(0,xmax,nx),u[:],marker='o',lw=2)

    u = np.ones(nx)
    u[:] = ui[:]
    griddim = 320, 1
    blockdim = 768, 1, 1
    NonLinCudaJit_conf = NonLinCudaJit[griddim, blockdim]
    t1 = time.time()
    NonLinCudaJit_conf(u, dx, dt, nt, un)
    t2 = time.time()
    print "Numbapro Cuda version took: %.6f seconds" % (t2-t1)
    cudatime = t2-t1
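One note on the launch configuration: griddim of 320 blocks times blockdim of 768 threads per block gives a fixed pool of 245,760 threads, one per grid point, and the `if i >= u.shape[0]: return` guard in the kernel idles whatever surplus remains. A quick sanity check that this pool covers every problem size we run below:

```python
griddim, blockdim = 320, 768
total_threads = griddim * blockdim    # threads available per kernel launch

# Largest nx exercised below is 200000, comfortably within the pool
for nx in (500, 1000, 5000, 10000, 20000, 50000, 100000, 200000):
    assert nx <= total_threads        # every grid point gets its own thread
```

If we wanted nx larger than the pool, we'd have to grow griddim (or have each thread loop over several grid points); as written, extra grid points would simply never be updated.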
main(500)
Vanilla version took: 0.581475 seconds
Numpy version took: 0.007635 seconds
Numbapro Vectorize version took: 0.000966 seconds
Numbapro Cuda version took: 0.002658 seconds
main(1000)
Vanilla version took: 1.140803 seconds
Numpy version took: 0.008878 seconds
Numbapro Vectorize version took: 0.001837 seconds
Numbapro Cuda version took: 0.002678 seconds
main(5000)
Vanilla version took: 5.336566 seconds
Numpy version took: 0.023648 seconds
Numbapro Vectorize version took: 0.009166 seconds
Numbapro Cuda version took: 0.002717 seconds
main(10000)
Vanilla version took: 10.719647 seconds
Numpy version took: 0.043988 seconds
Numbapro Vectorize version took: 0.018464 seconds
Numbapro Cuda version took: 0.002899 seconds
main(20000)
Vanilla version took: 21.414605 seconds
Numpy version took: 0.083821 seconds
Numbapro Vectorize version took: 0.036616 seconds
Numbapro Cuda version took: 0.002943 seconds
main(50000)
Numpy version took: 0.207808 seconds
Numbapro Vectorize version took: 0.093922 seconds
Numbapro Cuda version took: 0.003228 seconds
main(100000)
Numpy version took: 0.456931 seconds
Numbapro Vectorize version took: 0.189677 seconds
Numbapro Cuda version took: 0.004876 seconds
main(200000)
Numpy version took: 1.255342 seconds
Numbapro Vectorize version took: 0.393786 seconds
Numbapro Cuda version took: 0.005403 seconds
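To put those last numbers in perspective, here's the speedup arithmetic for the nx = 200000 run, with the timings copied from the output above (your hardware will differ). Note also that the CUDA time barely grows with nx, since every grid point gets its own thread:

```python
# Timings (seconds) from the nx = 200000 run above
numpytime = 1.255342
vectime   = 0.393786
cudatime  = 0.005403

numpy_speedup = numpytime / cudatime   # CUDA vs. Numpy, roughly 232x
numba_speedup = vectime / cudatime     # CUDA vs. Numba, roughly 73x
```

Keep in mind these timings exclude the one-time JIT compilation cost and the host-device memory transfers happen inside the timed call, so the comparison favors the GPU most on large, repeated workloads.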