This is an attempt to optimize the mutation function of mamba.
First I need to change dir:
cd d:\workspace\mamba
d:\workspace\mamba
Now use simulation to create a population and some genomes:
import simulation
p,g = simulation.run(0)
Starting simulation with 0 ticks
Tick 0
Simulation finished, 0 ticks, time elapsed 0.213297071155 seconds
Load the modules where the python and cython functions are written:
import cython_load
import model
import model_c
reload(model);
mutation_numpy = model.mutation_explicit_genomes
mutation_numpy2 = model.mutation_explicit_genomes2
# the cython variants (mutation_cython, mutation_cython_cdef) come from model_c; their assignments are omitted here
Time the functions:
%timeit mutation_numpy(p, g, array([0.003]*len(p)), g.shape[1], g[0,:])
1 loops, best of 3: 3.31 s per loop
%timeit mutation_numpy2(p, g, array([0.003]*len(p)), g.shape[1], g[0,:]);
10 loops, best of 3: 1.76 s per loop
%timeit mutation_cython(p, g, array([0.003]*len(p)), g.shape[1], g[0,:])
1 loops, best of 3: 2.9 s per loop
%timeit mutation_cython_cdef(p, g, array([0.003]*len(p)), g.shape[1], g[0,:])
1 loops, best of 3: 2.98 s per loop
So the second numpy function wins - cython has no advantage here. The difference between the winner and the other numpy function is that find_row is called at the end, at the update stage, so it gets called fewer times.
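As a sketch of that restructuring (with hypothetical names - the real code is in model.py): defer the row lookup to the update stage, so each new genome is resolved once per mutation event rather than repeatedly along the way, assuming a find_row(matrix, target) helper like the one profiled below:
def apply_mutations_deferred(counts, genomes, mutation_events):
    # mutation_events: (new_genome, n_individuals) pairs collected earlier.
    # The expensive find_row lookup happens once per event, at the update stage.
    for new_genome, n in mutation_events:
        row = find_row(genomes, new_genome)
        if row != -1:
            counts[row] += n
    return counts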
Let's verify this using profiling:
%prun mutation_numpy(p, g, array([0.003]*len(p)), g.shape[1], g[0,:]);
13515 function calls in 2.314 seconds
ncalls tottime percall cumtime percall filename:lineno(function)
2598 2.180 0.001 2.180 0.001 {model_c.find_row}
%prun mutation_numpy2(p, g, array([0.003]*len(p)), g.shape[1], g[0,:]);
12562 function calls in 1.229 seconds
ncalls tottime percall cumtime percall filename:lineno(function)
1295 1.123 0.001 1.123 0.001 {model_c.find_row}
So I got an improvement of roughly half the time because I prevented roughly half of the find_row calls.
find_row

The next step is to optimize the cython code of find_row.
The naive numpy implementation is:
def find_row_naive(matrix, target):
    # Scan the rows one by one; break out of the inner loop on the first mismatch.
    for i, row in enumerate(matrix):
        found = True
        for j in range(target.shape[0]):
            if target[j] != row[j]:
                found = False
                break
        if found:
            return i
    return -1
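For example, searching for the table's own first row should return index 0:
row_index = find_row_naive(g, g[0, :])
assert row_index == 0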
But I'll use a faster cython implementation. First I load the cython magic extension and the genome that is in the middle of the genomes array (the average case):
%load_ext cythonmagic
ge = g[g.shape[0]/2,:]
This is the function from model_c.pyx used above:
%%cython
import numpy as np
cimport numpy as np
cimport cython
from cpython cimport bool
@cython.boundscheck(False)
@cython.wraparound(False) # TODO int -> DTYPE_...
def find_row(np.ndarray[int, ndim=2] matrix, np.ndarray[int, ndim=1] target):
cdef np.ndarray[int, ndim=1] row
cdef Py_ssize_t i, j
cdef bool found
for i, row in enumerate(matrix):
found = True
for j in range(target.shape[0]):
if target[j] != row[j]:
found = False
break
if found:
return i
return -1
%timeit find_row(g, ge)
1000 loops, best of 3: 530 us per loop
So the target is to improve on that: get below 500 us.
The next function tries to use numpy's array comparison to avoid the inner loop. I think it will be less effective, as we compare all the elements, whereas the previous implementation stops comparing when it finds a mismatch. But it is still worth a try:
%%cython
import numpy as np
cimport numpy as np
cimport cython
from cpython cimport bool
@cython.boundscheck(False)
@cython.wraparound(False) # TODO int -> DTYPE_...
def find_row_np_compare(np.ndarray[int, ndim=2] matrix, np.ndarray[int, ndim=1] target):
    cdef np.ndarray[int, ndim=1] row
    cdef Py_ssize_t i
    # Let numpy compare the whole row at once instead of an explicit inner loop.
    for i, row in enumerate(matrix):
        if (row==target).all():
            return i
    return -1
%timeit find_row_np_compare(g, ge)
100 loops, best of 3: 3.5 ms per loop
Not as good as the previous find_row.
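For reference, the same all-at-once comparison can be written in pure numpy with no Python-level loop at all; a minimal sketch (not part of the original benchmarks):
def find_row_vectorized(matrix, target):
    # Compare every row to the target in one shot; no early exit, but no Python loop.
    matches = np.where((matrix == target).all(axis=1))[0]
    return matches[0] if matches.size else -1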
The next step is to try another trick: comparing the rows' raw bytes using tostring():
%%cython
import numpy as np
cimport numpy as np
cimport cython
from cpython cimport bool
@cython.boundscheck(False)
@cython.wraparound(False) # TODO int -> DTYPE_...
def find_row_np_tostring(np.ndarray[int, ndim=2] matrix, np.ndarray[int, ndim=1] target):
    cdef np.ndarray[int, ndim=1] row
    cdef Py_ssize_t i
    # Compare the raw bytes of each row to the raw bytes of the target.
    for i, row in enumerate(matrix):
        if row.tostring()==target.tostring():
            return i
    return -1
%timeit find_row_np_tostring(g, ge)
1000 loops, best of 3: 996 us per loop
Another approach is to have an array of nums ready - each genome encoded as a single number - and to search on that. It is costly to create, but can save time when searching:
%timeit model_c.genomes_to_nums(g)
10 loops, best of 3: 103 ms per loop
nums = model_c.genomes_to_nums(g)
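As an aside, the whole nums array can in principle be built in one vectorized expression; a sketch (not necessarily what model_c.genomes_to_nums does):
def genomes_to_nums_vectorized(genomes):
    # Weight locus i by 2**i and sum along each row.
    weights = 2.0 ** np.arange(genomes.shape[1])
    return (genomes * weights).sum(axis=1)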
%%cython
import numpy as np
cimport numpy as np
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def find_row_nums(np.ndarray[double, ndim=1] nums, np.ndarray[int, ndim=1] target):
    cdef double target_num
    cdef Py_ssize_t i
    # Encode the target genome as a single number, then scan the 1D nums array.
    target_num = (2. ** np.arange(target.shape[0]) * target).sum()
    for i,n in enumerate(nums):
        if n == target_num:
            return i
    return -1
Now benchmark against the original function:
%timeit find_row(g, ge)
%timeit find_row_nums(nums, ge)
1000 loops, best of 3: 539 us per loop
10000 loops, best of 3: 165 us per loop
So this approach is ~3.5-fold faster on searching, but slow on inserting.
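One way to offset the insertion cost - an idea I haven't benchmarked, so take it as a sketch: when a single locus flips, the num can be updated incrementally instead of recomputed from scratch:
def update_num(num, locus, new_value):
    # Locus `locus` contributes 2**locus to the num when set; adjust that term only.
    return num + 2.0 ** locus if new_value else num - 2.0 ** locus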
genome_to_num

Let's see if we can optimize inserting. The first function is the one used above; the others are new:
def genome_to_num(genome):
    # Encode a binary genome as a single float: the sum of 2**i over the set loci.
    i = np.arange(genome.shape[0])
    return (2. ** i * genome).sum()
%%cython
import numpy as np
cimport numpy as np
cimport cython
DTYPE_INT = np.int
ctypedef np.int_t DTYPE_INT_t
@cython.boundscheck(False)
@cython.wraparound(False)
def genome_to_num_cython(np.ndarray[DTYPE_INT_t, ndim=1] genome):
cdef np.ndarray[DTYPE_INT_t, ndim=1] i
i = np.arange(genome.shape[0])
return (2. ** i * genome).sum()
import numexpr
def genome_to_num_numexpr(genome):
i = np.arange(genome.shape[0])
return numexpr.evaluate("sum(2. ** i * genome)").item()
print genome_to_num(ge)
#genome_to_num_cython(ge) # this keeps throwing errors, and I think it isn't faster anyway
print genome_to_num_numexpr(ge)
1.7014118346e+38
1.7014118346e+38
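To make the encoding concrete with a tiny genome: loci 0 and 2 set gives 2**0 + 2**2 = 5:
print genome_to_num(np.array([1, 0, 1]))  # 5.0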
%timeit [genome_to_num(ge) for ge in g];
%timeit [genome_to_num_numexpr(ge) for ge in g];
10 loops, best of 3: 105 ms per loop
10 loops, best of 3: 51.3 ms per loop
So here the numexpr approach is ~2-fold faster for the entire genomes table!
Now to check if a list comprehension is better than a for loop:
%timeit [genome_to_num_numexpr(ge) for ge in g];
10 loops, best of 3: 50.6 ms per loop
%%timeit
a = [None]*g.shape[0]
for i,ge in enumerate(g):
a[i]=genome_to_num_numexpr(ge)
10 loops, best of 3: 49.2 ms per loop
Not much of a difference - I think this is because the list comprehension allocates memory up front, just like the for loop I used (in contrast to an approach that uses lst.append(item)).
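For comparison, the append-based variant mentioned above would look like this (a sketch; it grows the list as it goes instead of pre-allocating):
def genomes_to_nums_append(genomes):
    nums = []
    for ge in genomes:
        nums.append(genome_to_num_numexpr(ge))  # list grows incrementally
    return nums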
Can I improve on 50 ms for the genomes table? Another cython, non-numexpr attempt:
%%cython
import numpy as np
cimport numpy as np
import numpy.random as npr
cimport cython
from cpython cimport bool
DTYPE_INT = np.int
ctypedef np.int_t DTYPE_INT_t
DTYPE_FLOAT = np.float
ctypedef np.float_t DTYPE_FLOAT_t
@cython.boundscheck(False)
@cython.wraparound(False)
def genomes_to_nums(np.ndarray[DTYPE_INT_t, ndim=2] genomes):
cdef np.ndarray[DTYPE_FLOAT_t, ndim=1] nums
cdef np.ndarray[DTYPE_FLOAT_t, ndim=1] i
cdef DTYPE_FLOAT_t num_loci_f
num_loci_f = np.float(genomes.shape[1])
nums = np.zeros(genomes.shape[0], dtype=DTYPE_FLOAT)
cdef Py_ssize_t j
i = np.arange(num_loci_f)
for j in range(genomes.shape[0]):
nums[j] = (2. ** i * genomes[j,:]).sum()
return nums
%timeit genomes_to_nums(g);
10 loops, best of 3: 100 ms per loop
And one with numexpr, if possible:
%%cython
import numpy as np
cimport numpy as np
import numpy.random as npr
cimport cython
import model
reload(model)
DTYPE_INT = np.int
ctypedef np.int_t DTYPE_INT_t
DTYPE_FLOAT = np.float
ctypedef np.float_t DTYPE_FLOAT_t
@cython.boundscheck(False)
@cython.wraparound(False)
def genomes_to_nums_numexpr(np.ndarray[DTYPE_INT_t, ndim=2] genomes):
cdef np.ndarray[DTYPE_FLOAT_t, ndim=1] nums
cdef np.ndarray[DTYPE_INT_t, ndim=1] gen
cdef np.ndarray[DTYPE_FLOAT_t, ndim=1] i
cdef DTYPE_FLOAT_t num_loci_f
num_loci_f = np.float(genomes.shape[1])
nums = np.zeros(genomes.shape[0], dtype=DTYPE_FLOAT)
cdef Py_ssize_t j
i = np.arange(num_loci_f)
for j in range(genomes.shape[0]):
gen = genomes[j, :]
nums[j] = model.genome_to_num_numexpr(gen)
return nums
This keeps throwing errors, though.
#%timeit genomes_to_nums_numexpr(g)
The next approach uses the fact that a genome is sparse - most of its entries are zero:
def genome_to_num_numexpr_nonzero(genome):
    # Only the nonzero loci contribute, so sum 2**i over the nonzero indices alone.
    i = genome.nonzero()[0]
    return numexpr.evaluate("sum(2. ** i)")
def genome_to_num_nonzero(genome):
    i = genome.nonzero()[0]
    return (2. ** i).sum()
%timeit [genome_to_num_numexpr(ge) for ge in g]
10 loops, best of 3: 49.3 ms per loop
%timeit [genome_to_num_numexpr_nonzero(ge) for ge in g]
10 loops, best of 3: 30.1 ms per loop
%timeit [genome_to_num_nonzero(ge) for ge in g]
100 loops, best of 3: 13.9 ms per loop
This approach works nicely: we cut the running time by ~70% compared to the dense numexpr approach (13.9 ms vs 49.3 ms), and to roughly 1/8 of the cython+numpy version we started from (~100 ms). But now the numexpr approach isn't better than a simple numpy approach! The sparse approach is the definite winner.
Now I want to check if a list comprehension is better than allocating an array and filling it, either with python or with cython:
def genomes_to_nums_compr(genomes):
    return [genome_to_num_nonzero(ge) for ge in genomes]
def genomes_to_nums_alloc(genomes):
    nums = [None]*genomes.shape[0]
    for i,ge in enumerate(genomes):
        nums[i] = genome_to_num_nonzero(ge)
    return nums
%%cython
import numpy as np
cimport numpy as np
import numpy.random as npr
cimport cython
DTYPE_INT = np.int
ctypedef np.int_t DTYPE_INT_t
DTYPE_FLOAT = np.float
ctypedef np.float_t DTYPE_FLOAT_t
def genome_to_num_nonzero(genome):
i = genome.nonzero()[0]
return (2. ** i).sum()
def genomes_to_nums_cython(np.ndarray[DTYPE_INT_t, ndim=2] genomes):
cdef np.ndarray[DTYPE_FLOAT_t, ndim=1] nums
cdef Py_ssize_t i
cdef np.ndarray[DTYPE_INT_t, ndim=1] ge
nums = np.empty(genomes.shape[0])
for i,ge in enumerate(genomes):
nums[i] = genome_to_num_nonzero(ge)
return nums
%timeit genomes_to_nums_compr(g)
%timeit genomes_to_nums_alloc(g)
%timeit genomes_to_nums_cython(g)
100 loops, best of 3: 13.2 ms per loop
100 loops, best of 3: 13.6 ms per loop
100 loops, best of 3: 23.2 ms per loop
So here the cython approach isn't worth it, and a list comprehension is as good as pre-allocating a list. I also tried using a list for nums in the cython version; it didn't matter.
But now that I think of it, this sparse num-encoding approach might be improved: instead of calculating the nums, I could simply compare the indices of the non-zero elements.
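A minimal sketch of that idea (hypothetical, not benchmarked here): pre-store each row's nonzero indices and compare index arrays directly:
nonzero_rows = [row.nonzero()[0] for row in g]  # costly, done once

def find_row_nonzero_compare(nonzero_rows, target):
    # Two binary rows are equal iff their sorted nonzero-index arrays are equal.
    tnz = target.nonzero()[0]
    for i, nz in enumerate(nonzero_rows):
        if nz.shape[0] == tnz.shape[0] and (nz == tnz).all():
            return i
    return -1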
So let's convert genomes to a scipy sparse matrix and then search on it in various ways, starting with the naive approach. Import scipy.sparse and create the different kinds of sparse matrix versions of genomes:
import scipy.sparse as spr
gcsr = spr.csr_matrix(g)
gcsc = spr.csc_matrix(g)
gdok = spr.dok_matrix(g)
glil = spr.lil_matrix(g)
gcoo = spr.coo_matrix(g)
gbsr = spr.bsr_matrix(g)
gdia = spr.dia_matrix(g)
Check which sparse matrix type computes nonzero() fastest:
%timeit -n 100 g.nonzero();
%timeit -n 100 gcsr.nonzero();
%timeit -n 100 gcsc.nonzero();
%timeit -n 100 gdok.nonzero();
%timeit -n 100 glil.nonzero();
%timeit -n 100 gcoo.nonzero();
%timeit -n 100 gbsr.nonzero();
%timeit -n 100 gdia.nonzero();
100 loops, best of 3: 14.1 ms per loop
100 loops, best of 3: 124 us per loop
100 loops, best of 3: 125 us per loop
100 loops, best of 3: 7.87 ms per loop
100 loops, best of 3: 1.42 ms per loop
100 loops, best of 3: 28.8 us per loop
100 loops, best of 3: 305 us per loop
100 loops, best of 3: 28.1 ms per loop
coo_matrix is the winner - it has the fastest nonzero().
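Presumably this is because the COO format stores the coordinates of the nonzero entries explicitly, as its row and col attributes, so nonzero() has almost no work to do:
print gcoo.row[:5], gcoo.col[:5]  # the coordinate arrays are the representation itself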
Let's verify on a random matrix:
r = spr.rand(g.shape[0],g.shape[1],g.sum()/float(len(g.flatten())),format='csr')
%timeit -n 100 r.nonzero();
r = spr.rand(g.shape[0],g.shape[1],g.sum()/float(len(g.flatten())),format='csc')
%timeit -n 100 r.nonzero();
r = spr.rand(g.shape[0],g.shape[1],g.sum()/float(len(g.flatten())),format='dok')
%timeit -n 100 r.nonzero();
r = spr.rand(g.shape[0],g.shape[1],g.sum()/float(len(g.flatten())),format='lil')
%timeit -n 100 r.nonzero();
r = spr.rand(g.shape[0],g.shape[1],g.sum()/float(len(g.flatten())),format='coo')
%timeit -n 100 r.nonzero();
r = spr.rand(g.shape[0],g.shape[1],g.sum()/float(len(g.flatten())),format='bsr')
%timeit -n 100 r.nonzero();
r = spr.rand(g.shape[0],g.shape[1],g.sum()/float(len(g.flatten())),format='dia')
%timeit -n 100 r.nonzero();
100 loops, best of 3: 131 us per loop
100 loops, best of 3: 132 us per loop
100 loops, best of 3: 7.78 ms per loop
100 loops, best of 3: 1.38 ms per loop
100 loops, best of 3: 32 us per loop
100 loops, best of 3: 295 us per loop
100 loops, best of 3: 34.6 ms per loop
Indeed, coo_matrix is the winner. Let's use it, and assume that genomes is a coo_matrix... but I didn't find a nice way to do the search on it. I did find a way to use nonzero() with the lil_matrix:
def find_row_sparse_naive(genomes_lil, target):
    # genomes_lil is a lil_matrix; its .rows attribute lists each row's nonzero indices.
    genz = target.nonzero()[0]
    for i,row in enumerate(genomes_lil.rows):
        x = (row==genz)
        if x.all() and x.any():
            return i
    return -1
%timeit find_row(g, ge)
%timeit find_row_nums(nums, ge)
1000 loops, best of 3: 979 us per loop
1000 loops, best of 3: 273 us per loop
%timeit find_row_sparse_naive(glil, ge)
100 loops, best of 3: 11.4 ms per loop
But it is not as fast as the nums approach.
genome_to_num function

Some sanity checks on edge cases:
a=g[0,:]
genome_to_num_nonzero(a)
0.0
a[0]=1
genome_to_num_nonzero(a)
1.0
a[0]=0
a[-1]=1
genome_to_num_nonzero(a)
5.3575430359313366e+300
2**999==genome_to_num_nonzero(a)
True
a[-1]=0
a.sum()
0
a.fill(1)
a.sum()
1000
genome_to_num_nonzero(a)==(2**1000)
True
(Note the floating-point rounding here: the exact sum is 2**1000 - 1, which float64 cannot represent, so it rounds up to 2**1000.)
I started with ~3 seconds per generation of simulation. Where am I now?
import simulation
reload(simulation);
%timeit simulation.run(0,0);
Starting simulation with 0 ticks
Simulation finished, 0 ticks, time elapsed 0.319544448837 seconds
Starting simulation with 0 ticks
Simulation finished, 0 ticks, time elapsed 0.285992677134 seconds
Starting simulation with 0 ticks
Simulation finished, 0 ticks, time elapsed 0.283181251529 seconds
Starting simulation with 0 ticks
Simulation finished, 0 ticks, time elapsed 0.290546981349 seconds
1 loops, best of 3: 284 ms per loop
So I got it down from 3 seconds to 0.3 seconds, ~10 times faster. That means that a 500-generation simulation (like the ones I used in my previous work) should take ~150 seconds, in comparison to the several hours it took in my Java implementation.
This is progress.
The profiler for 5 generations:
%%prun
p,g = simulation.run(5,0)
Starting simulation with 5 ticks
Simulation finished, 5 ticks, time elapsed 2.63202320618 seconds
11789 function calls in 0.193 seconds
ncalls tottime percall cumtime percall filename:lineno(function)
5696 1.792 0.000 1.792 0.000 model.py:59(find_row_nums)
4823 0.160 0.000 0.160 0.000 {numpy.core.multiarray.array}
6 0.088 0.015 0.089 0.015 model.py:95(clear_empty_classes)
11641 0.085 0.000 0.085 0.000 {method 'nonzero' of 'numpy.ndarray' objects}
5945 0.074 0.000 0.074 0.000 {method 'put' of 'numpy.ndarray' objects}
11641 0.072 0.000 0.184 0.000 model.py:49(genome_to_num)
6 0.071 0.012 0.128 0.021 shape_base.py:348(_replace_zero_by_x_arrays)
6 0.063 0.010 2.228 0.371 model.py:106(mutation_explicit_genomes)
17610 0.061 0.000 0.061 0.000 {method 'sum' of 'numpy.ndarray' objects}
5945 0.060 0.000 0.093 0.000 model.py:38(hamming_fitness_genome)
7 0.043 0.006 0.213 0.030 shape_base.py:11(apply_along_axis)
4773 0.014 0.000 0.014 0.000 {method 'any' of 'numpy.ndarray' objects}
Only functions with >0.009 total time are shown.
Not much I can do here - maybe some playing around with numpy dtypes will help, or moving some stuff to cython, but based on my experiments I doubt it. In any case I'm not sure it is worth it, and apart from find_row_nums there is no function that is a clear case for improvement.