from IPython.display import Image
Benchmarks were run with GCC 4.8.1 and Clang 3.4 on an Intel Core 2 Duo (amd64).
Most benchmarks look like this:
numpy.arange(100)
Image("numpy-vbench/build/html/_images/numpy.arange_100_.png")
But there are exceptions:
d = numpy.zeros(10000);
numpy.zeros_like(d)
Image("numpy-vbench/build/html/_images/numpy.zeros_like_d_.png")
Big win for GCC with LTO. The code currently goes through np.copyto; it still needs checking which part of the code path profits so much from LTO.
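To make the "goes through np.copyto" remark concrete, here is a rough Python-level sketch of that code path. `zeros_like_sketch` is a hypothetical helper written for illustration only; it assumes `zeros_like` boils down to an uninitialized allocation followed by a broadcast copy of a zero, and it is not the actual numpy source.

```python
import numpy

def zeros_like_sketch(a):
    # Hypothetical stand-in for the zeros_like code path (an assumption,
    # not numpy's real implementation).
    res = numpy.empty_like(a)               # uninitialized array, same shape/dtype
    zero = numpy.zeros(1, dtype=res.dtype)  # a single zero element to broadcast
    numpy.copyto(res, zero, casting='unsafe')  # the copy loop the note refers to
    return res

d = numpy.zeros(10000)
out = zeros_like_sketch(d)
```

If this picture is right, the time is spent in the copy loop and the dispatch around it, which would be the place to look for what LTO manages to inline.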
Image("numpy-vbench/build/html/_images/argsort.png")
Significant win for GCC in run-of-the-mill sorting code. It is surprising that Clang performs so badly. Higher GCC optimization levels are detrimental too.
Image("numpy-vbench/build/html/_images/bincount.png")
This one is interesting: GCC -O2 wins, -O3 is very harmful, and Clang is slower but not significantly so.
d = numpy.arange(50*500, dtype=numpy.complex64).reshape((500,50))
d[...] = 1
Image("numpy-vbench/build/html/_images/cont_assign_complex64.png")
Scalar assignment is a very simple loop which should optimize very well.
As one can see, GCC -O3 wins big; this is due to the vectorizer kicking in at that level. Clang does not seem to vectorize at all.
An interesting side note not visible on this plot: complex assignment regressed significantly in numpy 1.7.x due to a change in alignment handling. Clang was not affected, and it is worth investigating why. Possibly Clang recognized the unaligned bytewise loop and optimized it into a regular one.
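For anyone who wants to poke at the alignment angle, the following is a small sketch of how an unaligned complex64 array could be constructed and compared with an aligned one. The names `raw`, `offset`, `aligned` and `unaligned` are made up for this example, and whether the unaligned case actually takes a different inner loop depends on the numpy version.

```python
import numpy

n = 50 * 500
itemsize = numpy.dtype(numpy.complex64).itemsize  # 8 bytes per element

# Aligned case, same shape as the benchmark above.
aligned = numpy.arange(n, dtype=numpy.complex64).reshape((500, 50))

# Unaligned case: view a byte buffer starting at an odd address.
raw = numpy.zeros(n * itemsize + 2, dtype=numpy.uint8)
offset = 1 if raw.ctypes.data % 2 == 0 else 2  # make the start address odd
unaligned = raw[offset:offset + n * itemsize].view(numpy.complex64).reshape((500, 50))

print(aligned.flags.aligned, unaligned.flags.aligned)

# The same scalar assignment on both arrays.
aligned[...] = 1
unaligned[...] = 1
```

Timing the two assignments (for example with `%timeit`) under each compiler build would show whether the alignment handling is what separates them.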
dflat = numpy.arange(50*500, dtype=numpy.float32)
dflat[::2] = 2
Image("numpy-vbench/build/html/_images/strided_assign_float32.png")
Another one where Clang loses quite significantly. As one can see, -O3 does not improve GCC, as strided loops can't be vectorized.
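As a quick illustration of what "strided" means here, the sketch below inspects the strides of the slice being assigned to. The name `view` is introduced only for this example.

```python
import numpy

dflat = numpy.arange(50 * 500, dtype=numpy.float32)
view = dflat[::2]  # every second element, no copy

# The full array steps 4 bytes per element, the slice steps 8 bytes,
# so consecutive elements of the slice are not adjacent in memory.
print(dflat.strides, view.strides)  # (4,) (8,)

dflat[::2] = 2  # only the even-indexed elements are written
```

Because the destination elements are not contiguous, a simple vector store of adjacent floats does not apply, which fits the observation that -O3 brings nothing extra.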
[numpy.add.reduce(a, axis=0) for a in squares.itervalues()]
Image("numpy-vbench/build/html/_images/numpy.add.reduce_axis=0__float16.png")
Pretty big difference. Is Clang vectorizing reductions by default? GCC does not do that unless one enables -ffast-math et al.
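The reason a compiler needs something like -ffast-math before vectorizing a floating-point reduction is that vectorizing reorders the additions, and floating-point addition is not associative. A minimal sketch in plain Python floats follows; the helpers `sequential_sum` and `two_lane_sum` are hypothetical, and the two-lane split only mimics what a vectorized loop would do.

```python
def sequential_sum(values):
    # One accumulator, strict left-to-right order (the scalar loop).
    total = 0.0
    for v in values:
        total += v
    return total

def two_lane_sum(values):
    # Two accumulators over alternating elements, combined at the end,
    # mimicking a 2-wide vectorized reduction.
    lane0 = 0.0
    lane1 = 0.0
    for i in range(0, len(values), 2):
        lane0 += values[i]
        lane1 += values[i + 1]
    return lane0 + lane1

values = [1e16, 1.0, -1e16, 1.0]
print(sequential_sum(values))  # 1.0
print(two_lane_sum(values))    # 2.0
```

The two orders give different results for the same input, so a compiler that must preserve strict IEEE semantics cannot make this transformation on its own.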
Image("numpy-vbench/build/html/_images/numpy.add.reduce_axis=0__int16.png")
GCC -O3 really kills this one.
Image("numpy-vbench/build/html/_images/numpy.nonzero.png")
Another case where LTO improves performance significantly; maybe some code could be inlined.