Lecturer: José Pedro Silva - silva_at_math.uni-wuppertal.de
The first benchmark in this notebook originally appeared in a post by Jake Vanderplas on the blog Pythonic Perambulations.
[1D Black-Scholes](#1D-Black-Scholes)
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
X = np.random.random((1000, 3))
def parse_cap(cap):
    # Extract the timing value and its unit (e.g. ['39.8', 'ms']) from %timeit output
    return cap.stdout.split(' ')[-4:-2]
Benchs = ['python\nloop', 'numpy\nbroadc.', 'sklearn', 'fortran/\nf2py', 'scipy', 'cython', 'numba','parakeet']
Benchmarks = dict((bench,0) for bench in Benchs)
We'll start by defining the array which we'll use for the benchmarks: one thousand points in three dimensions.
np.show_config()
lapack_opt_info:
    libraries = ['mkl_lapack95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/home/jpsilva/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/home/jpsilva/anaconda/include']
blas_opt_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/home/jpsilva/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/home/jpsilva/anaconda/include']
openblas_lapack_info:
    NOT AVAILABLE
lapack_mkl_info:
    libraries = ['mkl_lapack95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/home/jpsilva/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/home/jpsilva/anaconda/include']
blas_mkl_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/home/jpsilva/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/home/jpsilva/anaconda/include']
mkl_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/home/jpsilva/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/home/jpsilva/anaconda/include']
We'll start with a typical numpy broadcasting approach to this problem. Numpy broadcasting is an abstraction that allows loops over array indices to be executed in compiled C. For many applications, this is extremely fast and efficient. Unfortunately, there is a problem with broadcasting approaches that comes up here: it ends up allocating hidden temporary arrays which can eat up memory and cause computational overhead. Nevertheless, it's a good comparison to have. The function looks like this:
%%capture cap
def pairwise_numpy(X):
    return np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))
%timeit pairwise_numpy(X)
Benchmarks['numpy\nbroadc.'] = parse_cap(cap)
print cap.stdout
10 loops, best of 3: 39.8 ms per loop
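The hidden temporary mentioned above can be made concrete: for our 1000×3 array, the broadcasted difference materializes a full (1000, 1000, 3) intermediate before squaring and summing. A quick sketch (recreating the array locally so it is self-contained):

```python
import numpy as np

X = np.random.random((1000, 3))

# The broadcasted difference X[:, None, :] - X has shape (1000, 1000, 3):
# every pairwise difference is materialized before squaring and summing.
diff = X[:, None, :] - X
print(diff.shape)          # (1000, 1000, 3)
print(diff.nbytes / 1e6)   # ~24 MB of transient storage for float64
```

This is why the broadcasting approach trades memory for speed.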
A loop-based solution avoids the overhead associated with temporary arrays, and can be written like this:
def pairwise_python(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D
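Before timing, it is worth checking that the loop and broadcasting versions agree; a small self-contained sketch (using a reduced array size so the pure-Python loop finishes quickly):

```python
import numpy as np

def pairwise_numpy(X):
    return np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))

def pairwise_python(X):
    M, N = X.shape
    D = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D

Xs = np.random.random((50, 3))  # small array keeps the pure-Python loop fast
print(np.allclose(pairwise_numpy(Xs), pairwise_python(Xs)))  # True
```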
%%capture cap
%timeit pairwise_python(X)
Benchmarks['python\nloop'] = parse_cap(cap)
print cap.stdout
1 loops, best of 3: 2.6 s per loop
As we see, it is roughly 65 times slower than the numpy broadcasting approach! The cause is Python's dynamic type checking, which drastically slows down nested loops. With these two solutions we're left with a tradeoff between computational efficiency and memory efficiency; this is where tools like Numba and Cython become vital.
Numba is an LLVM-based compiler for Python code, which allows functions written in plain Python to be converted to highly efficient compiled code at runtime. Numba is extremely simple to use: we just wrap our Python function with the @jit decorator (JIT stands for "just-in-time" compilation) to automatically create an efficient, compiled version of the function:
Note: somehow, importing pylab breaks numba.
%%capture cap
import numba
from numba import double, jit,autojit
@jit
def pairwise_numba(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D
%timeit pairwise_numba(X)
Benchmarks['numba'] = parse_cap(cap)
print numba.__version__, cap.stdout
0.15.1 1 loops, best of 3: 2.94 s per loop
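One caveat when interpreting this number (my observation, not part of the original benchmark): the first call to a JIT-compiled function includes compilation time, so a timing that captures it can look misleadingly slow. A sketch of the usual warm-up pattern, guarded so it also runs where numba is absent:

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Fall back to a no-op decorator so the sketch runs without numba.
    def jit(f):
        return f

@jit
def pairwise(X):
    M, N = X.shape
    D = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D

X = np.random.random((100, 3))
pairwise(X)  # warm-up call: compilation happens here, not inside %timeit
```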
Cython is another package which is built to convert Python-like statements into compiled code. The language is actually a superset of Python which acts as a sort of hybrid between Python and C. By adding type annotations to Python code and running it through the Cython compiler, we obtain fast compiled code. Here is a highly-optimized Cython version of the pairwise distance function, which we compile using IPython's Cython magic:
%load_ext cythonmagic
%%cython
import numpy as np
cimport cython
from libc.math cimport sqrt
@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise_cython(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef int i, j, k
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = sqrt(d)
    return np.asarray(D)
%%capture cap
%timeit pairwise_cython(X)
Benchmarks['cython'] = parse_cap(cap)
print cap.stdout
100 loops, best of 3: 7.02 ms per loop
The Cython version, despite all the explicit optimization, performs in the same ballpark as what the simple Numba decorator achieves, yet by comparison the Numba version is just an unadorned wrapper around plainly-written Python code.
Parakeet is a runtime accelerator for an array-oriented subset of Python. If you're doing a lot of number crunching in Python, Parakeet may be able to significantly speed up your code.
To accelerate a function, wrap it with Parakeet's @jit decorator:
%%capture cap
from parakeet import jit
pairwise_parakeet = jit(pairwise_python)
%timeit pairwise_parakeet(X)
Benchmarks['parakeet'] = parse_cap(cap)
print cap.stdout
1 loops, best of 3: 5.83 ms per loop
Another option for fast computation is to write a Fortran function directly, and use
the f2py
package to interface with the function. We can write the function
as follows:
%%file pairwise_fort.f
      subroutine pairwise_fort(X,D,m,n)
          integer :: n,m
          double precision, intent(in) :: X(m,n)
          double precision, intent(out) :: D(m,m)
          integer :: i,j,k
          double precision :: r
          do i = 1,m
              do j = 1,m
                  r = 0
                  do k = 1,n
                      r = r + (X(i,k) - X(j,k)) * (X(i,k) - X(j,k))
                  end do
                  D(i,j) = sqrt(r)
              end do
          end do
      end subroutine pairwise_fort
Overwriting pairwise_fort.f
We can then use the shell interface to compile the Fortran function. In order
to hide the output of this operation, we direct it into /dev/null
(note: I
tested this on Linux, and it may have to be modified for Mac or Windows).
!f2py -c --help-fcompiler
Gnu95FCompiler instance properties:
  archiver       = ['/usr/bin/gfortran', '-cr']
  compile_switch = '-c'
  compiler_f77   = ['/usr/bin/gfortran', '-Wall', '-g', '-ffixed-form', '-fno-second-underscore', '-fPIC', '-O3', '-funroll-loops']
  compiler_f90   = ['/usr/bin/gfortran', '-Wall', '-g', '-fno-second-underscore', '-fPIC', '-O3', '-funroll-loops']
  compiler_fix   = ['/usr/bin/gfortran', '-Wall', '-g', '-ffixed-form', '-fno-second-underscore', '-Wall', '-g', '-fno-second-underscore', '-fPIC', '-O3', '-funroll-loops']
  libraries      = ['gfortran']
  library_dirs   = []
  linker_exe     = ['/usr/bin/gfortran', '-Wall', '-Wall']
  linker_so      = ['/usr/bin/gfortran', '-Wall', '-g', '-Wall', '-g', '-shared']
  object_switch  = '-o '
  ranlib         = ['/usr/bin/gfortran']
  version        = LooseVersion('4.9.1-16')
  version_cmd    = ['/usr/bin/gfortran', '--version']

Fortran compilers found:
  --fcompiler=gnu95    GNU Fortran 95 compiler (4.9.1-16)

Compilers available for this platform, but not found:
  --fcompiler=absoft   Absoft Corp Fortran Compiler
  --fcompiler=compaq   Compaq Fortran Compiler
  --fcompiler=g95      G95 Fortran Compiler
  --fcompiler=gnu      GNU Fortran 77 compiler
  --fcompiler=intel    Intel Fortran Compiler for 32-bit apps
  --fcompiler=intele   Intel Fortran Compiler for Itanium apps
  --fcompiler=intelem  Intel Fortran Compiler for 64-bit apps
  --fcompiler=lahey    Lahey/Fujitsu Fortran 95 Compiler
  --fcompiler=nag      NAGWare Fortran 95 Compiler
  --fcompiler=pathf95  PathScale Fortran Compiler
  --fcompiler=pg       Portland Group Fortran Compiler
  --fcompiler=vast     Pacific-Sierra Research Fortran 90 Compiler

Compilers not available on this platform:
  --fcompiler=hpux     HP Fortran 90 Compiler
  --fcompiler=ibm      IBM XL Fortran Compiler
  --fcompiler=intelev  Intel Visual Fortran Compiler for Itanium apps
  --fcompiler=intelv   Intel Visual Fortran Compiler for 32-bit apps
  --fcompiler=intelvem Intel Visual Fortran Compiler for 64-bit apps
  --fcompiler=mips     MIPSpro Fortran Compiler
  --fcompiler=none     Fake Fortran compiler
  --fcompiler=sun      Sun or Forte Fortran 95 Compiler

For compiler details, run 'config_fc --verbose' setup command.
Removing build directory /tmp/tmpPT3h4C
!f2py -c --help-compiler
List of available compilers:
  --compiler=bcpp     Borland C++ Compiler
  --compiler=cygwin   Cygwin port of GNU C Compiler for Win32
  --compiler=emx      EMX port of GNU C Compiler for OS/2
  --compiler=intel    Intel C Compiler for 32-bit applications
  --compiler=intele   Intel C Itanium Compiler for Itanium-based applications
  --compiler=intelem  Intel C Compiler for 64-bit applications
  --compiler=mingw32  Mingw32 port of GNU C Compiler for Win32
  --compiler=msvc     Microsoft Visual C++
  --compiler=pathcc   PathScale Compiler for SiCortex-based applications
  --compiler=unix     standard UNIX-style compiler

Removing build directory /tmp/tmp9W2_Rb
# Compile the Fortran with f2py.
# We'll direct the output into /dev/null so it doesn't fill the screen
#!f2py -c --compiler=intel pairwise_fort.f -m pairwise_fort > /dev/null
!f2py -c --fcompiler=gnu95 pairwise_fort.f -m pairwise_fort > /dev/null
We can import the resulting code into Python to time the execution of the function. To make sure we're being fair, we'll first convert the test array to Fortran-ordering so that no conversion needs to happen in the background:
%%capture cap
from pairwise_fort import pairwise_fort
XF = np.asarray(X, order='F')
%timeit pairwise_fort(XF)
Benchmarks['fortran/\nf2py'] = parse_cap(cap)
print cap.stdout
100 loops, best of 3: 6.13 ms per loop
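The order='F' conversion above matters because Fortran stores arrays column-major while numpy defaults to row-major; a quick sketch of the memory-layout flags:

```python
import numpy as np

X = np.random.random((1000, 3))
XF = np.asarray(X, order='F')

print(X.flags['C_CONTIGUOUS'])    # True: numpy's default row-major layout
print(XF.flags['F_CONTIGUOUS'])   # True: column-major, as Fortran expects
# The values are unchanged; only the memory layout differs.
print(np.array_equal(X, XF))      # True
```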
The result is on par with the Cython version.
%%capture cap
from scipy.spatial.distance import cdist
%timeit cdist(X, X)
Benchmarks['scipy'] = parse_cap(cap)
print cap.stdout
100 loops, best of 3: 5.67 ms per loop
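As a sanity check, cdist (whose default metric is Euclidean) agrees with the broadcasting implementation:

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.random.random((200, 3))
D_broadcast = np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))
D_cdist = cdist(X, X)  # metric='euclidean' is the default
print(np.allclose(D_broadcast, D_cdist))  # True
```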
Scikit-learn contains the euclidean_distances function, which works on sparse matrices as well as numpy arrays and is implemented in Cython:
%%capture cap
from sklearn.metrics import euclidean_distances
%timeit euclidean_distances(X, X)
Benchmarks['sklearn'] = parse_cap(cap)
print cap.stdout
100 loops, best of 3: 9.75 ms per loop
euclidean_distances is somewhat slower than the Cython and scipy versions on dense arrays.
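Part of the design tradeoff here: implementations like euclidean_distances typically expand ||x − y||² = ||x||² − 2x·y + ||y||² so that most of the work becomes a single matrix product, at a small cost in numerical accuracy. A numpy sketch of that trick (the general idea, not sklearn's exact code):

```python
import numpy as np

def pairwise_dot_trick(X):
    # ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, computed via one matrix product
    sq = (X ** 2).sum(axis=1)
    D2 = sq[:, None] - 2 * np.dot(X, X.T) + sq[None, :]
    np.maximum(D2, 0, out=D2)  # clip tiny negative values caused by rounding
    return np.sqrt(D2)

X = np.random.random((300, 3))
D_direct = np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))
print(np.allclose(pairwise_dot_trick(X), D_direct, atol=1e-6))  # True
```

The matrix product is handled by the BLAS, which is why this formulation can beat a direct elementwise computation; the clipping step is needed because cancellation can produce slightly negative squared distances.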
As a summary of the results, we'll create a bar-chart to visualize the timings:
%matplotlib inline
labels = Benchmarks.keys()
timings = [np.timedelta64(int(float(value[0])),str(value[1]))/np.timedelta64(1, 's') for value in Benchmarks.values()]
x = np.arange(len(labels))
ax = plt.axes(xticks=x, yscale='log')
ax.bar(x - 0.3, timings, width=0.6, alpha=0.4, bottom=1E-6)
ax.grid()
ax.set_xlim(-0.5, len(labels) - 0.5)
ax.set_ylim(0.5*min(timings), 1.1*max(timings))
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda i, loc: labels[int(i)]))
ax.set_ylabel('time (s)')
ax.set_title("Pairwise Distance Timings")
Note that this is log-scaled, so the vertical space between two grid lines indicates a factor of 10 difference in computation time. Since the pure Python loop takes more than two orders of magnitude longer than the rest, we'll now remove it from the benchmarks for a better comparison.
labels = [val for val in Benchmarks.iterkeys() if val != 'python\nloop']
timings = [np.timedelta64(int(float(value[0])),str(value[1]))/np.timedelta64(1, 's')\
for k, value in Benchmarks.items() if k != 'python\nloop']
x = np.arange(len(labels))
ax = plt.axes(xticks=x, yscale='log')
ax.bar(x - 0.3, timings, width=0.6, alpha=0.4, bottom=1E-6)
ax.grid()
ax.set_xlim(-0.5, len(labels) - 0.5)
ax.set_ylim(0.5*min(timings), 1.1*max(timings))
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda i, loc: labels[int(i)]))
ax.set_ylabel('time (s)')
ax.set_title("Pairwise Distance Timings")
# Restart notebook!
We'll now benchmark a simple function which returns the price of a 1D European Option using the closed form solution with the following parameters:
S0 = 100
K = 100
T = 1
vol = 0.3
r = 0.03
q = 0
optype = -1
from numpy import log, sqrt, zeros, exp, arange
from scipy.special import erf
def parse_cap(cap):
    # Extract the timing value and its unit (e.g. ['11.6', 'us']) from %timeit output
    return cap.stdout.split(' ')[-4:-2]
benchs = ['python','numba','cython','matlab']
BS_bench = dict(zip(benchs,zeros(4)))
def std_norm_cdf(x):
    return 0.5*(1+erf(x/sqrt(2.0)))

def black_scholes(s, k, t, v, rf, div, cp):
    """Price an option using the Black-Scholes model.

    s   : initial stock price
    k   : strike price
    t   : expiration time
    v   : volatility
    rf  : risk-free rate
    div : dividend
    cp  : +1/-1 for call/put
    """
    d1 = (log(s/k)+(rf-div+0.5*pow(v,2))*t)/(v*sqrt(t))
    d2 = d1 - v*sqrt(t)
    optprice = cp*s*exp(-div*t)*std_norm_cdf(cp*d1) - \
               cp*k*exp(-rf*t)*std_norm_cdf(cp*d2)
    return optprice
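A useful sanity check on the pricer is put-call parity, C − P = S·e^(−div·T) − K·e^(−rf·T), which the Black-Scholes formula satisfies exactly. A self-contained sketch using the parameters above (math.erf in place of scipy's, which is equivalent):

```python
import math

def std_norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2.0)))

def black_scholes(s, k, t, v, rf, div, cp):
    d1 = (math.log(s / k) + (rf - div + 0.5 * v * v) * t) / (v * math.sqrt(t))
    d2 = d1 - v * math.sqrt(t)
    return (cp * s * math.exp(-div * t) * std_norm_cdf(cp * d1)
            - cp * k * math.exp(-rf * t) * std_norm_cdf(cp * d2))

S0, K, T, vol, r, q = 100.0, 100.0, 1.0, 0.3, 0.03, 0.0
call = black_scholes(S0, K, T, vol, r, q, +1)
put = black_scholes(S0, K, T, vol, r, q, -1)
parity = S0 * math.exp(-q * T) - K * math.exp(-r * T)
print(abs((call - put) - parity) < 1e-9)  # True: parity holds to rounding
```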
%%capture cap
%timeit black_scholes(S0, K, T, vol, r, q, optype)
BS_bench['python'] = parse_cap(cap)
print cap.stdout
100000 loops, best of 3: 11.6 us per loop
from numba import double
from numba.decorators import jit, autojit
import math
@autojit
def std_norm_cdf_numba(x):
    return 0.5*(1+math.erf(x/math.sqrt(2.0)))

@autojit
def black_scholes_numba(s, k, t, v, rf, div, cp):
    """Price an option using the Black-Scholes model.

    s   : initial stock price
    k   : strike price
    t   : expiration time
    v   : volatility
    rf  : risk-free rate
    div : dividend
    cp  : +1/-1 for call/put
    """
    d1 = (math.log(s/k)+(rf-div+0.5*v*v)*t)/(v*math.sqrt(t))
    d2 = d1 - v*math.sqrt(t)
    optprice = cp*s*math.exp(-div*t)*std_norm_cdf_numba(cp*d1) - \
               cp*k*math.exp(-rf*t)*std_norm_cdf_numba(cp*d2)
    return optprice
%%capture cap
%timeit black_scholes_numba(S0, K, T, vol, r, q, optype)
BS_bench['numba'] = parse_cap(cap)
print cap.stdout
1 loops, best of 3: 5.01 us per loop
%load_ext cythonmagic
The cythonmagic extension is already loaded. To reload it, use: %reload_ext cythonmagic
%%cython
cimport cython
from libc.math cimport exp, sqrt, pow, log, erf
@cython.cdivision(True)
cdef double std_norm_cdf(double x) nogil:
    return 0.5*(1+erf(x/sqrt(2.0)))

@cython.cdivision(True)
def black_scholes(double s, double k, double t, double v,
                  double rf, double div, double cp):
    """Price an option using the Black-Scholes model.

    s   : initial stock price
    k   : strike price
    t   : expiration time
    v   : volatility
    rf  : risk-free rate
    div : dividend
    cp  : +1/-1 for call/put
    """
    cdef double d1, d2, optprice
    with nogil:
        d1 = (log(s/k)+(rf-div+0.5*pow(v,2))*t)/(v*sqrt(t))
        d2 = d1 - v*sqrt(t)
        optprice = cp*s*exp(-div*t)*std_norm_cdf(cp*d1) - \
                   cp*k*exp(-rf*t)*std_norm_cdf(cp*d2)
    return optprice
%%capture cap
%timeit black_scholes(S0, K, T, vol, r, q, optype)
BS_bench['cython'] = parse_cap(cap)
print cap.stdout
1000000 loops, best of 3: 372 ns per loop
We now analyse the corresponding MATLAB code, which we'll use for comparison.
!cat ../std_norm_cdf.m
function z = std_norm_cdf(x)
    z = 0.5*(1+erf(x/sqrt(2.0)));
!cat ../black_scholes.m
function z = black_scholes(s, k, t, v, rf, div, cp)
    d1 = (log(s/k)+(rf-div+0.5*v*v)*t)/(v*sqrt(t));
    d2 = d1 - v*sqrt(t);
    optprice = cp*s*exp(-div*t)*std_norm_cdf(cp*d1)-cp*k*exp(-rf*t)*std_norm_cdf(cp*d2);
    z = optprice;
%matplotlib inline
BS_bench['matlab'] = [u'55',u'us']
labels= BS_bench.keys()
timings = [np.timedelta64(int(float(value[0])),str(value[1]))/np.timedelta64(1, 's') for value in BS_bench.values()]
x = arange(len(labels))
ax = plt.axes(xticks=x, yscale='log')
ax.bar(x - 0.3, timings, width=0.6, alpha=0.4, bottom=1E-6)
ax.grid()
ax.set_xlim(-0.5, len(labels) - 0.5)
ax.set_ylim(min(timings), 1.1*max(timings))
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda i, loc: labels[int(i)]))
ax.set_ylabel('time ($s$)')
ax.set_title("Black-Scholes Timings")
%load_ext version_information
%version_information numpy,numba,parakeet,cython,matplotlib,scipy,sklearn
Software | Version |
---|---|
Python | 2.7.8 Anaconda 2.1.0 (64-bit) (default, Aug 21 2014, 18:22:21) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] |
IPython | 2.3.1 |
OS | posix [linux2] |
numpy | 1.9.1 |
numba | 0.15.1 |
parakeet | 0.23.2 |
cython | 0.21.1 |
matplotlib | 1.4.2 |
scipy | 0.14.0 |
sklearn | 0.15.2 |
Fri Dec 05 11:18:11 2014 CET |
from IPython.core.display import HTML
def css_styling():
    styles = open("./styles/custom.css", "r").read()
    return HTML(styles)
css_styling()