This is an attempt to optimize the mutation function of mamba.
First I need to change dir:
cd d:\workspace\mamba
d:\workspace\mamba
Now use simulation to create a population and some genomes:
import simulation
p,g = simulation.run(0)
Starting simulation with 0 ticks
Tick 0
Simulation finished, 0 ticks, time elapsed 0.213297071155 seconds
Load the modules where the python and cython functions are written:
import cython_load
import model
import model_c
reload(model);
mutation_numpy = model.mutation_explicit_genomes
mutation_numpy2 = model.mutation_explicit_genomes2
# the cython variants (mutation_cython, mutation_cython_cdef) come from model_c; their assignments are omitted here
Time the functions:
%timeit mutation_numpy(p, g, array([0.003]*len(p)), g.shape[1], g[0,:])
1 loops, best of 3: 3.31 s per loop
%timeit mutation_numpy2(p, g, array([0.003]*len(p)), g.shape[1], g[0,:]);
10 loops, best of 3: 1.76 s per loop
%timeit mutation_cython(p, g, array([0.003]*len(p)), g.shape[1], g[0,:])
1 loops, best of 3: 2.9 s per loop
%timeit mutation_cython_cdef(p, g, array([0.003]*len(p)), g.shape[1], g[0,:])
1 loops, best of 3: 2.98 s per loop
So the second numpy function wins - cython has no advantage here. The difference between the winner and the other numpy function is that find_row is called at the end, at the update stage, so it gets called fewer times.
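As a sketch of that restructuring (with hypothetical names - the real code is in model.py): defer the row lookup to the update stage, so each new genome is resolved once per mutation event rather than repeatedly along the way, assuming a find_row(matrix, target) helper like the one profiled below:
def apply_mutations_deferred(counts, genomes, mutation_events):
    # mutation_events: (new_genome, n_individuals) pairs collected earlier.
    # The expensive find_row lookup happens once per event, at the update stage.
    for new_genome, n in mutation_events:
        row = find_row(genomes, new_genome)
        if row != -1:
            counts[row] += n
    return counts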
Let's verify this using profiling:
%prun mutation_numpy(p, g, array([0.003]*len(p)), g.shape[1], g[0,:]);
13515 function calls in 2.314 seconds
ncalls tottime percall cumtime percall filename:lineno(function)
2598 2.180 0.001 2.180 0.001 {model_c.find_row}
%prun mutation_numpy2(p, g, array([0.003]*len(p)), g.shape[1], g[0,:]);
12562 function calls in 1.229 seconds
ncalls tottime percall cumtime percall filename:lineno(function)
1295 1.123 0.001 1.123 0.001 {model_c.find_row}
So I got an improvement of roughly half the time because I prevented roughly half of the find_row calls.
find_row

The next step is to optimize the cython code of find_row.
The naive numpy implementation is:
def find_row_naive(matrix, target):
    # Scan the rows one by one; break out of the inner loop on the first mismatch.
    for i, row in enumerate(matrix):
        found = True
        for j in range(target.shape[0]):
            if target[j] != row[j]:
                found = False
                break
        if found:
            return i
    return -1
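For example, searching for the table's own first row should return index 0:
row_index = find_row_naive(g, g[0, :])
assert row_index == 0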
But I'll use a faster cython implementation. First I load the cython magic extension and the genome that is in the middle of the genomes array (the average case):
%load_ext cythonmagic
ge = g[g.shape[0]/2,:]
This is the function from model_c.pyx used above:
%%cython
import numpy as np
cimport numpy as np
cimport cython
from cpython cimport bool
@cython.boundscheck(False)
@cython.wraparound(False) # TODO int -> DTYPE_...
def find_row(np.ndarray[int, ndim=2] matrix, np.ndarray[int, ndim=1] target):
cdef np.ndarray[int, ndim=1] row
cdef Py_ssize_t i, j
cdef bool found
for i, row in enumerate(matrix):
found = True
for j in range(target.shape[0]):
if target[j] != row[j]:
found = False
break
if found:
return i
return -1
%timeit find_row(g, ge)
1000 loops, best of 3: 530 us per loop
So the target is to improve on that: get below 500 us.
The next function tries to use numpy's array comparison to avoid the inner loop. I think it will be less effective, as we compare all the elements, whereas the previous implementation stops comparing when it finds a mismatch. But it is still worth a try:
%%cython
import numpy as np
cimport numpy as np
cimport cython
from cpython cimport bool
@cython.boundscheck(False)
@cython.wraparound(False) # TODO int -> DTYPE_...
def find_row_np_compare(np.ndarray[int, ndim=2] matrix, np.ndarray[int, ndim=1] target):
    cdef np.ndarray[int, ndim=1] row
    cdef Py_ssize_t i
    # Let numpy compare the whole row at once instead of an explicit inner loop.
    for i, row in enumerate(matrix):
        if (row==target).all():
            return i
    return -1
%timeit find_row_np_compare(g, ge)
100 loops, best of 3: 3.5 ms per loop
Not as good as the previous find_row.
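For reference, the same all-at-once comparison can be written in pure numpy with no Python-level loop at all; a minimal sketch (not part of the original benchmarks):
def find_row_vectorized(matrix, target):
    # Compare every row to the target in one shot; no early exit, but no Python loop.
    matches = np.where((matrix == target).all(axis=1))[0]
    return matches[0] if matches.size else -1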
The next step is to try another trick: comparing the rows' raw bytes using tostring():
%%cython
import numpy as np
cimport numpy as np
cimport cython
from cpython cimport bool
@cython.boundscheck(False)
@cython.wraparound(False) # TODO int -> DTYPE_...
def find_row_np_tostring(np.ndarray[int, ndim=2] matrix, np.ndarray[int, ndim=1] target):
    cdef np.ndarray[int, ndim=1] row
    cdef Py_ssize_t i
    # Compare the raw bytes of each row to the raw bytes of the target.
    for i, row in enumerate(matrix):
        if row.tostring()==target.tostring():
            return i
    return -1
%timeit find_row_np_tostring(g, ge)
1000 loops, best of 3: 996 us per loop
Another approach is to have an array of nums ready - each genome encoded as a single number - and to search on that. It is costly to create, but can save time when searching:
%timeit model_c.genomes_to_nums(g)
10 loops, best of 3: 103 ms per loop
nums = model_c.genomes_to_nums(g)
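As an aside, the whole nums array can in principle be built in one vectorized expression; a sketch (not necessarily what model_c.genomes_to_nums does):
def genomes_to_nums_vectorized(genomes):
    # Weight locus i by 2**i and sum along each row.
    weights = 2.0 ** np.arange(genomes.shape[1])
    return (genomes * weights).sum(axis=1)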
%%cython
import numpy as np
cimport numpy as np
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def find_row_nums(np.ndarray[double, ndim=1] nums, np.ndarray[int, ndim=1] target):
    cdef double target_num
    cdef Py_ssize_t i
    # Encode the target genome as a single number, then scan the 1D nums array.
    target_num = (2. ** np.arange(target.shape[0]) * target).sum()
    for i,n in enumerate(nums):
        if n == target_num:
            return i
    return -1
Now benchmark against the original function:
%timeit find_row(g, ge)
%timeit find_row_nums(nums, ge)
1000 loops, best of 3: 539 us per loop
10000 loops, best of 3: 165 us per loop
So this approach is ~3.5-fold faster on searching, but slow on inserting.
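One way to offset the insertion cost - an idea I haven't benchmarked, so take it as a sketch: when a single locus flips, the num can be updated incrementally instead of recomputed from scratch:
def update_num(num, locus, new_value):
    # Locus `locus` contributes 2**locus to the num when set; adjust that term only.
    return num + 2.0 ** locus if new_value else num - 2.0 ** locus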
genome_to_num

Let's see if we can optimize inserting. The first function is the one used above; the others are new:
def genome_to_num(genome):
    # Encode a binary genome as a single float: the sum of 2**i over the set loci.
    i = np.arange(genome.shape[0])
    return (2. ** i * genome).sum()
%%cython
import numpy as np
cimport numpy as np
cimport cython
DTYPE_INT = np.int
ctypedef np.int_t DTYPE_INT_t
@cython.boundscheck(False)
@cython.wraparound(False)
def genome_to_num_cython(np.ndarray[DTYPE_INT_t, ndim=1] genome):
cdef np.ndarray[DTYPE_INT_t, ndim=1] i
i = np.arange(genome.shape[0])
return (2. ** i * genome).sum()
import numexpr
def genome_to_num_numexpr(genome):
i = np.arange(genome.shape[0])
return numexpr.evaluate("sum(2. ** i * genome)").item()
print genome_to_num(ge)
#genome_to_num_cython(ge) # this keeps throwing errors, and I think it isn't faster anyway
print genome_to_num_numexpr(ge)
1.7014118346e+38
1.7014118346e+38
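To make the encoding concrete with a tiny genome: loci 0 and 2 set gives 2**0 + 2**2 = 5:
print genome_to_num(np.array([1, 0, 1]))  # 5.0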
%timeit [genome_to_num(ge) for ge in g];
%timeit [genome_to_num_numexpr(ge) for ge in g];
10 loops, best of 3: 105 ms per loop
10 loops, best of 3: 51.3 ms per loop
So here the numexpr approach is ~2-fold faster for the entire genomes table!
Now to check if a list comprehension is better than a for loop:
%timeit [genome_to_num_numexpr(ge) for ge in g];
10 loops, best of 3: 50.6 ms per loop
%%timeit
a = [None]*g.shape[0]
for i,ge in enumerate(g):
a[i]=genome_to_num_numexpr(ge)
10 loops, best of 3: 49.2 ms per loop
Not much of a difference - I think this is because the list comprehension allocates memory up front, just like the for loop I used (in contrast to an approach that uses lst.append(item)).
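For comparison, the append-based variant mentioned above would look like this (a sketch; it grows the list as it goes instead of pre-allocating):
def genomes_to_nums_append(genomes):
    nums = []
    for ge in genomes:
        nums.append(genome_to_num_numexpr(ge))  # list grows incrementally
    return nums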
Can I improve on 50 ms for the genomes table? Another cython, non-numexpr attempt:
%%cython
import numpy as np
cimport numpy as np
import numpy.random as npr
cimport cython
from cpython cimport bool
DTYPE_INT = np.int
ctypedef np.int_t DTYPE_INT_t
DTYPE_FLOAT = np.float
ctypedef np.float_t DTYPE_FLOAT_t
@cython.boundscheck(False)
@cython.wraparound(False)
def genomes_to_nums(np.ndarray[DTYPE_INT_t, ndim=2] genomes):
cdef np.ndarray[DTYPE_FLOAT_t, ndim=1] nums
cdef np.ndarray[DTYPE_FLOAT_t, ndim=1] i
cdef DTYPE_FLOAT_t num_loci_f
num_loci_f = np.float(genomes.shape[1])
nums = np.zeros(genomes.shape[0], dtype=DTYPE_FLOAT)
cdef Py_ssize_t j
i = np.arange(num_loci_f)
for j in range(genomes.shape[0]):
nums[j] = (2. ** i * genomes[j,:]).sum()
return nums
%timeit genomes_to_nums(g);
10 loops, best of 3: 100 ms per loop
And one with numexpr, if possible:
%%cython
import numpy as np
cimport numpy as np
import numpy.random as npr
cimport cython
import model
reload(model)
DTYPE_INT = np.int
ctypedef np.int_t DTYPE_INT_t
DTYPE_FLOAT = np.float
ctypedef np.float_t DTYPE_FLOAT_t
@cython.boundscheck(False)
@cython.wraparound(False)
def genomes_to_nums_numexpr(np.ndarray[DTYPE_INT_t, ndim=2] genomes):
cdef np.ndarray[DTYPE_FLOAT_t, ndim=1] nums
cdef np.ndarray[DTYPE_INT_t, ndim=1] gen
cdef np.ndarray[DTYPE_FLOAT_t, ndim=1] i
cdef DTYPE_FLOAT_t num_loci_f
num_loci_f = np.float(genomes.shape[1])
nums = np.zeros(genomes.shape[0], dtype=DTYPE_FLOAT)
cdef Py_ssize_t j
i = np.arange(num_loci_f)
for j in range(genomes.shape[0]):
gen = genomes[j, :]
nums[j] = model.genome_to_num_numexpr(gen)
return nums
This keeps throwing errors, though.
#%timeit genomes_to_nums_numexpr(g)
The next approach uses the fact that a genome is sparse - most of its entries are zero:
def genome_to_num_numexpr_nonzero(genome):
    # Only the nonzero loci contribute, so sum 2**i over the nonzero indices alone.
    i = genome.nonzero()[0]
    return numexpr.evaluate("sum(2. ** i)")
def genome_to_num_nonzero(genome):
    i = genome.nonzero()[0]
    return (2. ** i).sum()
%timeit [genome_to_num_numexpr(ge) for ge in g]
10 loops, best of 3: 49.3 ms per loop
%timeit [genome_to_num_numexpr_nonzero(ge) for ge in g]
10 loops, best of 3: 30.1 ms per loop
%timeit [genome_to_num_nonzero(ge) for ge in g]
100 loops, best of 3: 13.9 ms per loop
This approach works nicely: we cut the running time by ~70% compared to the dense numexpr approach (13.9 ms vs 49.3 ms), and to roughly 1/8 of the cython+numpy version we started from (~100 ms). But now the numexpr approach isn't better than a simple numpy approach! The sparse approach is the definite winner.
Now I want to check if a list comprehension is better than allocating an array and filling it, either with python or with cython:
def genomes_to_nums_compr(genomes):
    return [genome_to_num_nonzero(ge) for ge in genomes]
def genomes_to_nums_alloc(genomes):
    nums = [None]*genomes.shape[0]
    for i,ge in enumerate(genomes):
        nums[i] = genome_to_num_nonzero(ge)
    return nums
%%cython
import numpy as np
cimport numpy as np
import numpy.random as npr
cimport cython
DTYPE_INT = np.int
ctypedef np.int_t DTYPE_INT_t
DTYPE_FLOAT = np.float
ctypedef np.float_t DTYPE_FLOAT_t
def genome_to_num_nonzero(genome):
i = genome.nonzero()[0]
return (2. ** i).sum()
def genomes_to_nums_cython(np.ndarray[DTYPE_INT_t, ndim=2] genomes):
cdef np.ndarray[DTYPE_FLOAT_t, ndim=1] nums
cdef Py_ssize_t i
cdef np.ndarray[DTYPE_INT_t, ndim=1] ge
nums = np.empty(genomes.shape[0])
for i,ge in enumerate(genomes):
nums[i] = genome_to_num_nonzero(ge)
return nums
%timeit genomes_to_nums_compr(g)
%timeit genomes_to_nums_alloc(g)
%timeit genomes_to_nums_cython(g)
100 loops, best of 3: 13.2 ms per loop
100 loops, best of 3: 13.6 ms per loop
100 loops, best of 3: 23.2 ms per loop
So here the cython approach isn't worth it, and a list comprehension is as good as pre-allocating a list. I also tried using a list for nums in the cython version; it didn't matter.
But now that I think of it, this sparse num-encoding approach might be improved: instead of calculating the nums, I could simply compare the indices of the non-zero elements.
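A minimal sketch of that idea (hypothetical, not benchmarked here): pre-store each row's nonzero indices and compare index arrays directly:
nonzero_rows = [row.nonzero()[0] for row in g]  # costly, done once

def find_row_nonzero_compare(nonzero_rows, target):
    # Two binary rows are equal iff their sorted nonzero-index arrays are equal.
    tnz = target.nonzero()[0]
    for i, nz in enumerate(nonzero_rows):
        if nz.shape[0] == tnz.shape[0] and (nz == tnz).all():
            return i
    return -1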
So let's convert genomes to a scipy sparse matrix and then search on it in various ways, starting with the naive approach. Import scipy.sparse and create the different kinds of sparse matrix versions of genomes:
import scipy.sparse as spr
gcsr = spr.csr_matrix(g)
gcsc = spr.csc_matrix(g)
gdok = spr.dok_matrix(g)
glil = spr.lil_matrix(g)
gcoo = spr.coo_matrix(g)
gbsr = spr.bsr_matrix(g)
gdia = spr.dia_matrix(g)
Check which sparse matrix type computes nonzero() fastest:
%timeit -n 100 g.nonzero();
%timeit -n 100 gcsr.nonzero();
%timeit -n 100 gcsc.nonzero();
%timeit -n 100 gdok.nonzero();
%timeit -n 100 glil.nonzero();
%timeit -n 100 gcoo.nonzero();
%timeit -n 100 gbsr.nonzero();
%timeit -n 100 gdia.nonzero();
100 loops, best of 3: 14.1 ms per loop
100 loops, best of 3: 124 us per loop
100 loops, best of 3: 125 us per loop
100 loops, best of 3: 7.87 ms per loop
100 loops, best of 3: 1.42 ms per loop
100 loops, best of 3: 28.8 us per loop
100 loops, best of 3: 305 us per loop
100 loops, best of 3: 28.1 ms per loop
coo_matrix is the winner - it has the fastest nonzero().
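Presumably this is because the COO format stores the coordinates of the nonzero entries explicitly, as its row and col attributes, so nonzero() has almost no work to do:
print gcoo.row[:5], gcoo.col[:5]  # the coordinate arrays are the representation itself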
Let's verify on a random matrix:
r = spr.rand(g.shape[0],g.shape[1],g.sum()/float(len(g.flatten())),format='csr')
%timeit -n 100 r.nonzero();
r = spr.rand(g.shape[0],g.shape[1],g.sum()/float(len(g.flatten())),format='csc')
%timeit -n 100 r.nonzero();
r = spr.rand(g.shape[0],g.shape[1],g.sum()/float(len(g.flatten())),format='dok')
%timeit -n 100 r.nonzero();
r = spr.rand(g.shape[0],g.shape[1],g.sum()/float(len(g.flatten())),format='lil')
%timeit -n 100 r.nonzero();
r = spr.rand(g.shape[0],g.shape[1],g.sum()/float(len(g.flatten())),format='coo')
%timeit -n 100 r.nonzero();
r = spr.rand(g.shape[0],g.shape[1],g.sum()/float(len(g.flatten())),format='bsr')
%timeit -n 100 r.nonzero();
r = spr.rand(g.shape[0],g.shape[1],g.sum()/float(len(g.flatten())),format='dia')
%timeit -n 100 r.nonzero();
100 loops, best of 3: 131 us per loop
100 loops, best of 3: 132 us per loop
100 loops, best of 3: 7.78 ms per loop
100 loops, best of 3: 1.38 ms per loop
100 loops, best of 3: 32 us per loop
100 loops, best of 3: 295 us per loop
100 loops, best of 3: 34.6 ms per loop
Indeed, coo_matrix is the winner. Let's use it, and assume that genomes is a coo_matrix... but I didn't find a nice way to do the search on it. I did find a way to use nonzero() with the lil_matrix:
def find_row_sparse_naive(genomes_lil, target):
    # genomes_lil is a lil_matrix; its .rows attribute lists each row's nonzero indices.
    genz = target.nonzero()[0]
    for i,row in enumerate(genomes_lil.rows):
        x = (row==genz)
        if x.all() and x.any():
            return i
    return -1
%timeit find_row(g, ge)
%timeit find_row_nums(nums, ge)
1000 loops, best of 3: 979 us per loop
1000 loops, best of 3: 273 us per loop
%timeit find_row_sparse_naive(glil, ge)
100 loops, best of 3: 11.4 ms per loop
But it is not as fast as the nums approach.
genome_to_num function

Some sanity checks on edge cases:
a=g[0,:]
genome_to_num_nonzero(a)
0.0
a[0]=1
genome_to_num_nonzero(a)
1.0
a[0]=0
a[-1]=1
genome_to_num_nonzero(a)
5.3575430359313366e+300
2**999==genome_to_num_nonzero(a)
True
a[-1]=0
a.sum()
0
a.fill(1)
a.sum()
1000
genome_to_num_nonzero(a)==(2**1000)
True
(Note the floating-point rounding here: the exact sum is 2**1000 - 1, which float64 cannot represent, so it rounds up to 2**1000.)
I started with ~3 seconds per generation of simulation. Where am I now?
import simulation
reload(simulation);
%timeit simulation.run(0,0);
Starting simulation with 0 ticks
Simulation finished, 0 ticks, time elapsed 0.319544448837 seconds
Starting simulation with 0 ticks
Simulation finished, 0 ticks, time elapsed 0.285992677134 seconds
Starting simulation with 0 ticks
Simulation finished, 0 ticks, time elapsed 0.283181251529 seconds
Starting simulation with 0 ticks
Simulation finished, 0 ticks, time elapsed 0.290546981349 seconds
1 loops, best of 3: 284 ms per loop
So I got it down from 3 seconds to 0.3 seconds, ~10 times faster. That means that a 500-generation simulation (like the ones I used in my previous work) should take ~150 seconds, in comparison to the several hours it took in my Java implementation.
This is progress.
The profiler for 5 generations:
%%prun
p,g = simulation.run(5,0)
Starting simulation with 5 ticks
Simulation finished, 5 ticks, time elapsed 2.63202320618 seconds
11789 function calls in 0.193 seconds
ncalls tottime percall cumtime percall filename:lineno(function)
5696 1.792 0.000 1.792 0.000 model.py:59(find_row_nums)
4823 0.160 0.000 0.160 0.000 {numpy.core.multiarray.array}
6 0.088 0.015 0.089 0.015 model.py:95(clear_empty_classes)
11641 0.085 0.000 0.085 0.000 {method 'nonzero' of 'numpy.ndarray' objects}
5945 0.074 0.000 0.074 0.000 {method 'put' of 'numpy.ndarray' objects}
11641 0.072 0.000 0.184 0.000 model.py:49(genome_to_num)
6 0.071 0.012 0.128 0.021 shape_base.py:348(_replace_zero_by_x_arrays)
6 0.063 0.010 2.228 0.371 model.py:106(mutation_explicit_genomes)
17610 0.061 0.000 0.061 0.000 {method 'sum' of 'numpy.ndarray' objects}
5945 0.060 0.000 0.093 0.000 model.py:38(hamming_fitness_genome)
7 0.043 0.006 0.213 0.030 shape_base.py:11(apply_along_axis)
4773 0.014 0.000 0.014 0.000 {method 'any' of 'numpy.ndarray' objects}
Only functions with >0.009 total time are shown.
Not much I can do here - maybe some playing around with numpy dtypes will help, or moving some stuff to cython, but based on my experiments I doubt it. In any case I'm not sure it is worth it, and apart from find_row_nums there is no function that is a clear case for improvement.