In [36]:

from IPython.display import Image

Clang vs GCC¶

Benchmarks were run with gcc 4.8.1 and clang 3.4 on an Intel Core 2 Duo (amd64).

gcc48O2: default setup.py build flags, -O2 -fno-strict-aliasing
gcc48O3lto: -O3 -march=native -flto -fno-strict-aliasing
clangO3nat: -O3 -march=native -fno-strict-aliasing
clangO2: -O2 -fno-strict-aliasing
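The four configurations above can be written down as data, which makes it harder to mix up flag sets between builds. This is only a sketch: the flag strings are copied from the list, while the compiler executable names (`gcc`, `clang`), the helper name `build_env`, and the use of the `CC` and `CFLAGS` environment variables are assumptions about how the builds might be driven, not a record of how these benchmarks were actually built.

```python
import os

# Flag sets copied from the list above. The compiler executable names
# ("gcc", "clang") are assumptions about the local installation.
BUILD_CONFIGS = {
    "gcc48O2": ("gcc", "-O2 -fno-strict-aliasing"),
    "gcc48O3lto": ("gcc", "-O3 -march=native -flto -fno-strict-aliasing"),
    "clangO3nat": ("clang", "-O3 -march=native -fno-strict-aliasing"),
    "clangO2": ("clang", "-O2 -fno-strict-aliasing"),
}


def build_env(name, base_env=None):
    """Return a copy of the environment with CC and CFLAGS set for `name`.

    `build_env` is a hypothetical helper, not part of numpy or distutils.
    """
    compiler, flags = BUILD_CONFIGS[name]
    env = dict(os.environ if base_env is None else base_env)
    env["CC"] = compiler
    env["CFLAGS"] = flags
    return env
```

The returned dictionary could then be passed as `env=` to `subprocess.check_call` when invoking `setup.py build`. The build system may combine `CFLAGS` with its own default flags, so the effective compiler command line is worth checking in the build log. LTO usually also needs `-flto` at link time.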

Cliffnotes:¶

gcc -O2 performs 5-10% better than -O3 in most benchmarks, except in a few cases where the vectorizer does its magic
gcc and clang are very close in performance, but in the cases where one compiler wins by a large margin, it is mostly gcc that wins
clang -O2 and -O3 are almost equal (not surprising, I think -O3 only adds one more option)

Most benchmarks look like this:

numpy.arange(100)

In [37]:

Image("numpy-vbench/build/html/_images/numpy.arange_100_.png")

Out[37]:

But there are exceptions:

Benchmarks GCC wins¶

d = numpy.zeros(10000);
numpy.zeros_like(d)

In [38]:

Image("numpy-vbench/build/html/_images/numpy.zeros_like_d_.png")

Out[38]:

Big win for gcc with LTO. The code currently goes through np.copyto; it needs checking which part of the code path profits so much from LTO.
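To re-check a single case like this one outside the benchmark suite, the setup/statement pair shown with each plot can be timed directly. The sketch below uses only the standard `timeit` module. The helper name `best_time` and its default repeat counts are assumptions chosen for illustration and are not how vbench measures.

```python
import timeit


def best_time(stmt, setup="pass", repeat=3, number=1000):
    """Return the best per-call time in seconds for `stmt`.

    `best_time` is a hypothetical helper for quick checks.
    """
    timer = timeit.Timer(stmt=stmt, setup=setup)
    # Taking the minimum of several runs reduces noise from other processes.
    return min(timer.repeat(repeat=repeat, number=number)) / number
```

With numpy installed, the case above would be timed as `best_time("numpy.zeros_like(d)", setup="import numpy; d = numpy.zeros(10000)")`, once per numpy build being compared.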

In [39]:

Image("numpy-vbench/build/html/_images/argsort.png")

Out[39]:

Significant win for GCC in run-of-the-mill sorting code. It is surprising that clang performs so badly. GCC's higher optimization levels are detrimental too.

In [40]:

Image("numpy-vbench/build/html/_images/bincount.png")

Out[40]:

This one is interesting: gcc -O2 wins, -O3 is very harmful, and clang is slower, but not significantly.

d = numpy.arange(50*500, dtype=numpy.complex64).reshape((500,50))
d[...] = 1

In [41]:

Image("numpy-vbench/build/html/_images/cont_assign_complex64.png")

Out[41]:

Scalar assignment is a very simple loop that should optimize very well.

As one can see, gcc -O3 wins big. This is due to the vectorizer kicking in at that level. clang does not seem to vectorize at all.

An interesting side note that is not visible in this plot: complex assignment regressed significantly in numpy 1.7.x due to a change in alignment handling, but clang was not affected. It is worth investigating why; possibly clang recognized the unaligned bytewise loop and optimized it into a regular one.

dflat = numpy.arange(50*500, dtype=numpy.float32)
dflat[::2] = 2

In [42]:

Image("numpy-vbench/build/html/_images/strided_assign_float32.png")

Out[42]:

Another one that clang loses quite significantly. As one can see, -O3 does not improve gcc, as strided loops can't be vectorized.
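The difference between the contiguous and the strided assignment can be made concrete by looking at which element slots of each vector-sized block are written. This is a rough illustration, assuming 16-byte SSE stores and 4-byte float32 elements. The helper names are hypothetical and nothing here inspects what either compiler actually emits.

```python
def store_offsets(n_elements, itemsize, step):
    """Byte offsets written by `a[::step] = value` on a 1-d array."""
    return [i * itemsize for i in range(0, n_elements, step)]


def elements_hit_per_vector(offsets, vector_bytes, itemsize):
    """Map each vector-sized block to the element slots written inside it."""
    blocks = {}
    for off in offsets:
        slot = (off % vector_bytes) // itemsize
        blocks.setdefault(off // vector_bytes, []).append(slot)
    return blocks


ITEMSIZE = 4       # float32
VECTOR_BYTES = 16  # assumed SSE register width

contiguous = elements_hit_per_vector(
    store_offsets(8, ITEMSIZE, 1), VECTOR_BYTES, ITEMSIZE)
strided = elements_hit_per_vector(
    store_offsets(8, ITEMSIZE, 2), VECTOR_BYTES, ITEMSIZE)

print(contiguous)  # {0: [0, 1, 2, 3], 1: [0, 1, 2, 3]}
print(strided)     # {0: [0, 2], 1: [0, 2]}
```

In the contiguous case every slot of each 16-byte block is written, so one full-width store covers it. With a step of 2, only every other slot may be touched, so a plain full-width store would overwrite elements that must stay unchanged.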

Benchmarks clang wins¶

[numpy.add.reduce(a, axis=0) for a in squares.itervalues()]

In [43]:

Image("numpy-vbench/build/html/_images/numpy.add.reduce_axis=0__float16.png")

Out[43]:

Pretty big difference. Is clang vectorizing reductions by default? gcc does not do that unless one enables -ffast-math et al.
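One reason a compiler is conservative about vectorizing floating-point reductions is that floating-point addition is not associative, so summing in per-lane partial sums can change the result. gcc only allows that reordering under options such as -fassociative-math, which -ffast-math implies. The sketch below shows the effect in plain Python. The two-lane split is an assumption for illustration, and it does not settle what clang does in this benchmark.

```python
def sequential_sum(values):
    """Add left to right, as a scalar reduction loop does."""
    total = 0.0
    for v in values:
        total += v
    return total


def lane_sum(values, lanes=2):
    """Keep one partial sum per lane and combine them at the end,
    which is the order of operations a vectorized reduction would use."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    return sequential_sum(partial)


data = [1e16, 1.0, -1e16, 1.0]
print(sequential_sum(data))  # 1.0
print(lane_sum(data))        # 2.0
```

The sequential loop loses the first `1.0` when it is added to `1e16`, while the lane version cancels the two large values first and keeps both small ones. Because the two orders give different answers, the transformation is not allowed under strict floating-point semantics.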

Interesting ones¶

In [44]:

Image("numpy-vbench/build/html/_images/numpy.add.reduce_axis=0__int16.png")

Out[44]:

gcc -O3 really kills this one

In [45]:

Image("numpy-vbench/build/html/_images/numpy.nonzero.png")

Out[45]:

Another case where LTO improves performance significantly; maybe some functions could be inlined.
