Multilanguage Monte Carlo
Python
Statistical Computing Languages, RSS
We want to estimate the integral
$$ \int^5_{-5} \frac{\exp(-u^2/2)}{\sqrt{2\pi}}\,\text{d}u $$

using a simple Monte Carlo rejection algorithm. The integrand is the standard normal density, so the integral is approximately equal to 1. We sample points uniformly from the box $[-5,5] \times [0,0.5]$, which has area 5, and count the fraction that land under the curve; the estimate is that fraction multiplied by the box area.
Here, we use the IPython notebook (soon to be called Project Jupyter) to demonstrate possible solutions in several languages. Once you have this notebook set up correctly on your machine, all of the code below is editable and executable from within the web browser.
# rerun Fernando's script to load in css
%run talktools
A first attempt at a solution using pure Python might look like this
import random
from math import exp, sqrt, pi
def f(u):
    return exp(-u**2/2) / sqrt(2*pi)

def loopy_monte(N):
    hits = 0
    for count in range(N):
        x = random.uniform(-5, 5)
        y = random.uniform(0, 0.5)
        hits = hits + (y < f(x))
    estimate = hits / float(N) * (0.5*10)
    return estimate
Let's run the calculation for 1 million iterations
loopy_monte(10**6)
1.001425
Note the import statements. Unlike languages such as MATLAB or R, Python has few functions built into its global namespace at startup. This means that even simple functions such as sqrt need to be explicitly imported from a module. In the above code, we use the random and math modules, which are part of the standard Python library, the set of modules included with every Python installation.
It is possible to import everything from a module in one hit. Instead of
from math import exp,sqrt,pi
we might have done
from math import *
This idiom is commonly used but is considered bad practice; Python programmers would say that it is not Pythonic. The principal objection is that it pollutes your namespace with many functions, most of which the user is probably unaware of. If you do several module imports in this manner, there is a good chance that names from one module will silently shadow those of a previous module.
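To see the shadowing problem concretely, here is a small sketch using only standard-library modules (math and cmath, not the modules above): the star import silently replaces the sqrt we imported first.

```python
from math import sqrt

print(sqrt(4))       # 2.0 -- the real-valued math.sqrt

from cmath import *  # star import: everything cmath exports lands in our namespace

print(sqrt(4))       # (2+0j) -- sqrt now silently refers to cmath.sqrt
```

Nothing warns you that sqrt changed meaning, which is exactly why explicit imports are preferred.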
Let's time our function using IPython's %timeit magic
%timeit loopy_monte(10**6)
1 loops, best of 3: 3.61 s per loop
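The %timeit magic only works inside IPython. For a plain Python script, the standard library's timeit module gives a similar best-of-N measurement. A sketch, with a reduced iteration count of 10**4 just to keep the timing quick; the numbers will of course differ on your machine.

```python
import random
import timeit
from math import exp, sqrt, pi

def loopy_monte(N):
    hits = 0
    for _ in range(N):
        x = random.uniform(-5, 5)
        y = random.uniform(0, 0.5)
        hits += y < exp(-x**2 / 2) / sqrt(2 * pi)
    return hits / float(N) * (0.5 * 10)

# Like %timeit: run the statement several times and report the best time,
# since the minimum is least polluted by other activity on the machine.
best = min(timeit.repeat(lambda: loopy_monte(10**4), number=1, repeat=3))
print("best of 3: %.4f s per loop" % best)
```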
This might not seem too bad for 1 million iterations but we can do significantly better. As with many high-level languages, loops are very expensive in Python. One approach to making things faster is to vectorise our code. That is, we operate on entire vectors at once rather than on scalars. This can be achieved in Python using the numpy module.
Numpy is not part of the Python standard library and needs to be installed separately. It is, however, the de facto standard for numerical computing in Python and is widely used. If you use a Python distribution such as Anaconda or Enthought, numpy will be available to import out of the box.
import numpy as np

def numpy_f(u):
    return np.exp(-u**2/2) / np.sqrt(2*np.pi)

def numpy_monte(N):
    x = np.random.uniform(-5, 5, N)
    y = np.random.uniform(0, 0.5, N)
    hits = np.sum(y < numpy_f(x))
    estimate = hits / float(N) * (0.5*10)
    return estimate
numpy_monte(10**6)
0.99990499999999993
Note that we've had to do some extra work here in order to make vectorisation possible. In particular, we've had to redefine our function f to a version that uses numpy versions of exp and sqrt. This extra work is worth it, however, since we get significantly faster runtimes.
%timeit numpy_monte(10**6)
10 loops, best of 3: 79.4 ms per loop
On my machine, I got times of 2.02 seconds for the loopy version and 70.5 milliseconds for the numpy version, a speed-up of almost 29 times.
Vectorisation is nothing special. We are still performing a loop but it's now implicit. Rather than perform the loop in Python, which is slow, we generate our data and ship it out to C or Fortran routines, where loops are cheap. This is the basis of vectorisation in many other languages such as R and MATLAB.
There are several ways to go parallel in Python. Here, I use the IPython parallel infrastructure.
First, you need to go to the Clusters tab of the IPython dashboard and start a cluster with your required number of engines. I only have a dual-core laptop, so I'll use 2. The following code will not work if you omit this step.
from IPython import parallel
clients = parallel.Client()
clients.block = True
print(clients.ids)
dview = clients[:] # use all engines
[0, 1]
Whenever you run stochastic code in parallel, you need to think carefully about random number generators. The numpy random number generator is the Mersenne Twister algorithm, which isn't ideal for parallel work but is good enough in many instances. Here, we give each engine its own seed to work with. This does not guarantee non-overlapping sub-streams, but the probability of an issue is extremely small; for our purposes it's probably going to be fine.
import numpy as np

def numpy_f(u):
    return np.exp(-u**2/2) / np.sqrt(2*np.pi)

# Redefine numpy_monte to take a seed argument
def numpy_monte(N, seed):
    np.random.seed(seed)
    x = np.random.uniform(-5, 5, N)
    y = np.random.uniform(0, 0.5, N)
    hits = np.sum(y < numpy_f(x))
    estimate = hits / float(N) * (0.5*10)
    return estimate
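An alternative to reseeding the global generator, sketched below, is to give each worker its own RandomState object. The engine count of 2 and the seeds are placeholders; this still does not guarantee non-overlapping streams, it just makes the seeding explicit.

```python
import numpy as np

# One generator object per engine, each with its own seed.
streams = [np.random.RandomState(seed) for seed in range(2)]
draws = [rng.uniform(-5, 5, size=5) for rng in streams]

# Different seeds give different draws...
assert not np.allclose(draws[0], draws[1])

# ...while the same seed reproduces the same draws.
assert np.allclose(np.random.RandomState(0).uniform(-5, 5, size=5), draws[0])
```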
# First, we do this in serial
# Set up
cores = 2
# Number of iterations to be performed by each engine: 10 times the number we did earlier.
# When working in parallel, each engine needs sufficient work to do so that
# communication overheads don't swamp the benefits.
N = [10**7]*cores
seeds = range(0,cores) # A list of seeds, one for each engine
# We use the map function to map numpy_monte over the lists created above
# The total number of iterations will be cores * N
%timeit estimate = sum(list(map(numpy_monte,N,seeds)))/cores
estimate = sum(list(map(numpy_monte,N,seeds)))/cores
# Display the estimate
estimate
1 loops, best of 3: 1.4 s per loop
0.99941950000000002
Now we do it in parallel using the engines created earlier. First, we need to define our functions on the engines; so far, we've only defined them locally. We use the %%px magic for this.
%%px
# Define all functions on the engines
import numpy as np

def numpy_f(u):
    return np.exp(-u**2/2) / np.sqrt(2*np.pi)

# Redefine numpy_monte to take a seed argument
def numpy_monte(N, seed):
    np.random.seed(seed)
    x = np.random.uniform(-5, 5, N)
    y = np.random.uniform(0, 0.5, N)
    hits = np.sum(y < numpy_f(x))
    estimate = hits / float(N) * (0.5*10)
    return estimate
#Do the work in parallel using the dview we created earlier
%timeit estimate = sum(dview.map(numpy_monte,N,seeds))/cores
#Print the result
estimate
1 loops, best of 3: 1.12 s per loop
0.99941950000000002
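For completeness, the same map-over-(N, seed) split can be done with nothing but the standard library's multiprocessing module. This is my own sketch of an alternative, not the IPython.parallel machinery used above; it bundles the arguments into tuples because Pool.map passes a single argument per task.

```python
import numpy as np
from multiprocessing import Pool

def numpy_monte(args):
    # Unpack (N, seed); each worker gets its own seeded generator.
    N, seed = args
    rng = np.random.RandomState(seed)
    x = rng.uniform(-5, 5, N)
    y = rng.uniform(0, 0.5, N)
    hits = np.sum(y < np.exp(-x**2 / 2) / np.sqrt(2 * np.pi))
    return hits / float(N) * (0.5 * 10)

if __name__ == "__main__":
    cores = 2
    tasks = [(10**6, seed) for seed in range(cores)]
    with Pool(cores) as pool:
        estimate = sum(pool.map(numpy_monte, tasks)) / cores
    print(estimate)  # close to 1
```

The `if __name__ == "__main__"` guard matters here: multiprocessing may re-import the script in each worker, and the guard stops the workers from spawning pools of their own.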
Note that, with only two engines and non-trivial communication overhead, the parallel version (1.12 s) is only modestly faster than the serial version (1.4 s).
Although not supported by the default IPython install, it is possible to include MATLAB code within an IPython notebook. In order for this to work, you need a licensed version of MATLAB on your machine and a copy of the python-matlab-bridge installed:
https://github.com/arokem/python-matlab-bridge
First, the magic:
%load_ext pymatbridge
Starting MATLAB on ZMQ socket ipc:///tmp/pymatbridge Send 'exit' command to kill the server .........MATLAB started and connected!
If the above command was successful, you can use the cell magic %%matlab to let IPython know that you'd like to execute the code in MATLAB.
%%matlab
a = linspace(-4,4,100);
b = sin(a);
plot(a,b)
Armed with our new MATLAB magic, let's try a loopy solution in MATLAB.
%%matlab
format long
N = 10^6;
hits = 0;
tic
for count = 1:N
    x = rand() * 10 - 5;
    y = rand() * 0.5;
    hits = hits + ( y < (exp(-x^2/2) / sqrt(2*pi)) );
end
estimate = hits / N * (0.5*10)
toc
estimate = 0.998805000000000 Elapsed time is 5.081431 seconds.
That's much slower than I expected it to be! The problem here is that most of the time is, in fact, overhead of the MATLAB-Python bridge. This is rather unfair on MATLAB, so let's create an .m file containing our code and run that instead.
%%writefile loopy_monte.m
function estimate = loopy_monte(N)
    hits = 0;
    tic
    for count = 1:N
        x = rand() * 10 - 5;
        y = rand() * 0.5;
        hits = hits + ( y < (exp(-x^2/2) / sqrt(2*pi)) );
    end
    estimate = hits / N * (0.5*10);
    toc
Overwriting loopy_monte.m
This allows us to run the MATLAB code from within our IPython notebook with the minimum of overhead.
from pymatbridge import Matlab
# Start a MATLAB session
mlab = Matlab()
mlab.start()
# Change MATLAB's working directory to match IPython's current working directory
current_dir = %pwd
mlab.run_code('cd ' + current_dir)
# Run the code in MATLAB and print the results
results = mlab.run_code('loopy_monte(10^6)')
print(results['content']['stdout'])
# Stop the MATLAB session
mlab.stop()
Starting MATLAB on ZMQ socket ipc:///tmp/pymatbridge Send 'exit' command to kill the server .MATLAB started and connected! Elapsed time is 0.253124 seconds. ans = 0.999335000000000 MATLAB closed
True
The time shown above matches what I get when I run the code directly in MATLAB, so the Python-MATLAB bridge overhead is now rather small.
As with other interpreted languages, it is not considered good practice to solve this type of problem using loops in MATLAB. However, MATLAB punishes you a lot less for using loops than Python or R does: the loopy MATLAB version is over 10 times faster than the loopy Python version on my machine. This is because MATLAB has a very high quality JIT (Just In Time) compiler, whereas standard Python and R do not. Julia gains a big advantage here by compiling code before it runs.
Despite the JIT compiler, it is often still faster to convert from loops to vectorised code in MATLAB, and that is the case here.
%%writefile vector_monte.m
function estimate = vector_monte(N)
    % A vectorised version of our Monte Carlo code
    tic
    x = rand(1,N) * 10 - 5;
    y = rand(1,N) * 0.5;
    hits = sum(y < (exp(-x.^2/2) / sqrt(2*pi)));
    estimate = hits / N * (0.5*10);
    toc
Overwriting vector_monte.m
from pymatbridge import Matlab
# Start a MATLAB session
mlab = Matlab()
mlab.start()
# Change MATLAB's working directory to match IPython's current working directory
current_dir = %pwd
mlab.run_code('cd ' + current_dir)
# Run the code in MATLAB and print the results
results = mlab.run_code('vector_monte(10^6)')
print(results['content']['stdout'])
# Stop the MATLAB session
mlab.stop()
Starting MATLAB on ZMQ socket ipc:///tmp/pymatbridge Send 'exit' command to kill the server .MATLAB started and connected! Elapsed time is 0.064763 seconds. ans = 1.0001 MATLAB closed
True
Going by the elapsed times above (0.065 s against 0.253 s), this is almost four times as fast as the loopy version on my machine, and is around the same speed as the Python/numpy version.
One of the issues with straight vectorisation is that it uses a lot of memory. For example, to run $5*10^8$ iterations, the x vector alone would use almost 4 GB, pretty much all the memory available on a low-end laptop.
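The arithmetic behind that 4 GB figure is simple enough to check:

```python
import numpy as np

# 5*10**8 float64 samples at 8 bytes each is 4*10**9 bytes,
# i.e. roughly 4 GB for the x vector alone.
n = 5 * 10**8
gigabytes = n * np.dtype(np.float64).itemsize / 1e9
print(gigabytes, "GB")  # 4.0 GB
```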
The solution is to mix vectorisation and for-loops
%%writefile serial_monte.m
function estimate = serial_monte(N,jobs)
    tic
    estimate = 0;
    for i=1:jobs
        x = rand(1,N) * 10 - 5;
        y = rand(1,N) * 0.5;
        hits = sum(y < (exp(-x.^2/2) / sqrt(2*pi)));
        estimate = estimate + (hits / N * (0.5*10));
    end
    estimate = estimate / jobs
    toc
Overwriting serial_monte.m
Now I can split my $5*10^8$ iterations into 100 smaller vectorised jobs of $5*10^6$ each. Memory's no longer an issue.
from pymatbridge import Matlab
# Start a MATLAB session
mlab = Matlab()
mlab.start()
# Change MATLAB's working directory to match IPython's current working directory
current_dir = %pwd
mlab.run_code('cd ' + current_dir)
# Run the code in MATLAB and print the results
results = mlab.run_code('serial_monte(5*10^6,100)')
print(results['content']['stdout'])
# Stop the MATLAB session
mlab.stop()
Starting MATLAB on ZMQ socket ipc:///tmp/pymatbridge Send 'exit' command to kill the server .MATLAB started and connected! estimate = 1.0000 Elapsed time is 23.162530 seconds. ans = 1.0000 MATLAB closed
True
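The same chunking trick is just as easy in Python with numpy. As a sketch (chunked_monte is a name I've made up for this illustration), splitting the run into smaller vectorised chunks keeps only one chunk's arrays in memory at a time:

```python
import numpy as np

def chunked_monte(N, jobs):
    # jobs vectorised chunks of N samples each: N*jobs total iterations,
    # but only one pair of length-N arrays alive at any moment.
    estimate = 0.0
    for _ in range(jobs):
        x = np.random.uniform(-5, 5, N)
        y = np.random.uniform(0, 0.5, N)
        hits = np.sum(y < np.exp(-x**2 / 2) / np.sqrt(2 * np.pi))
        estimate += hits / float(N) * (0.5 * 10)
    return estimate / jobs

print(chunked_monte(10**5, 10))  # close to 1
```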
Converting to parallel MATLAB is easy in this instance, as long as we have access to the Parallel Computing Toolbox: we just type three letters and turn for into parfor.
%%writefile par_monte.m
function estimate = par_monte(N,jobs)
    tic
    estimate = 0;
    parfor i=1:jobs
        x = rand(1,N) * 10 - 5;
        y = rand(1,N) * 0.5;
        hits = sum(y < (exp(-x.^2/2) / sqrt(2*pi)));
        estimate = estimate + (hits / N * (0.5*10));
    end
    estimate = estimate / jobs
    toc
Overwriting par_monte.m
from pymatbridge import Matlab
# Start a MATLAB session
mlab = Matlab()
mlab.start()
# Change MATLAB's working directory to match IPython's current working directory
current_dir = %pwd
mlab.run_code('cd ' + current_dir)
# Open the parallel pool
mlab.run_code('parpool local')
# Run the code in MATLAB and print the results
results = mlab.run_code('par_monte(5*10^6,100)')
print(results['content']['stdout'])
# Stop the MATLAB session
mlab.stop()
Starting MATLAB on ZMQ socket ipc:///tmp/pymatbridge Send 'exit' command to kill the server .....MATLAB started and connected! estimate = 1.0000 Elapsed time is 20.719257 seconds. ans = 1.0000 MATLAB closed
True
A couple of things to note here. First, the Parallel Computing Toolbox is not included with base MATLAB; it costs extra. Second, the default random number generator in MATLAB is Mersenne Twister, but inside a parfor loop this silently changes to a different random number generator that has better properties for parallel work. You should bear this in mind when you need your work to be reproducible, or if you are comparing with other languages that do use Mersenne Twister.

It is also possible to run R code in the IPython notebook. To do this, you need the latest copy of R installed on your system along with the rpy2 module. I installed rpy2 with the following command in a terminal window
pip install rpy2
Once this has been done, we need to load the rmagic extension
%load_ext rmagic
The rmagic extension is already loaded. To reload it, use: %reload_ext rmagic
Any cell that we start with %%R will have its code run in an R session. Here is the code that Colin used to plot the function we want to integrate
%%R
f = function(u) exp(-u^2/2)/sqrt(2*pi)
x = seq(-5, 5, length.out = 100)
plot(x, f(x), type="l", ylim=c(0, 0.5))
abline(h=0, lty=4, col=4);abline(h=0.5, lty=3, col=4); abline(v=c(-5, 5), lty=4, col=4)
Here is the R solution using loops. As with Python and MATLAB, loops are not recommended in R. This code is based on the original version written by Colin (downloaded 19th November 2014), but changed so that N is 10^6 rather than 10^5 to keep it on a par with our other examples.
%%R -o estimate
# start the clock
start_time = proc.time()
f = function(u) exp(-u^2/2)/sqrt(2*pi)
simulate_pt = function(i){
    x = runif(1,-5,5); y = runif(1,0,0.5);
    y < f(x)
}
N = 10^6; hits = 0
for(i in 1:N)
    hits = hits + simulate_pt()
(estimate = hits/N*(0.5*10))
#Stop the clock
print(proc.time() - start_time)
print(estimate)
user system elapsed 11.533 0.124 11.661 [1] 0.9953
On my machine, this took over 11.6 seconds! That's almost 6 times slower than the equivalent loop in Python and over 40 times slower than the equivalent loop in MATLAB. R really punishes you for using loops.
As with MATLAB, there is a risk that all we are measuring here is the overhead on the connection between R and IPython. Although there is some overhead, it appears to be much less of an issue.
Let's write the above code to a file and execute it in R via the shell.
%%writefile loop_monte.R
# start the clock
start_time = proc.time()
f = function(u) exp(-u^2/2)/sqrt(2*pi)
simulate_pt = function(i){
    x = runif(1,-5,5); y = runif(1,0,0.5);
    y < f(x)
}
N = 10^6; hits = 0
for(i in 1:N)
    hits = hits + simulate_pt()
(estimate = hits/N*(0.5*10))
#Stop the clock
print(proc.time() - start_time)
print(estimate)
Overwriting loop_monte.R
!R --vanilla < loop_monte.R
R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet" Copyright (C) 2014 The R Foundation for Statistical Computing Platform: x86_64-apple-darwin13.4.0 (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. > # start the clock > start_time = proc.time() > f = function(u) exp(-u^2/2)/sqrt(2*pi) > > simulate_pt = function(i){ + x = runif(1,-5,5); y = runif(1,0,0.5); + y < f(x) + } > > N = 10^6; hits = 0 > for(i in 1:N) + hits = hits + simulate_pt() > (estimate = hits/N*(0.5*10)) [1] 0.996925 > > #Stop the clock > print(proc.time() - start_time) user system elapsed 10.180 0.050 10.303 > print(estimate) [1] 0.996925 >
So, 11.68 seconds via the R connector and 10.3 seconds if run directly in the shell. Either way, it's the slowest solution so far.
Vectorisation is pretty much essential in R. Here's the vectorised version given by Colin. Although the R connector is nice, I'll run it in the shell to give a better idea of the speed difference.
%%writefile vector_monte.R
start_time = proc.time()
N=10^6
f = function(u) exp(-u^2/2)/sqrt(2*pi)
estimate = sum(runif(N, 0, 0.5) < f(runif(N, -5, 5)))/N*(0.5*10)
print(proc.time() - start_time)
print(estimate)
Overwriting vector_monte.R
!R --vanilla < vector_monte.R
R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet" Copyright (C) 2014 The R Foundation for Statistical Computing Platform: x86_64-apple-darwin13.4.0 (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. > start_time = proc.time() > N=10^6 > f = function(u) exp(-u^2/2)/sqrt(2*pi) > estimate = sum(runif(N, 0, 0.5) < f(runif(N, -5, 5)))/N*(0.5*10) > > print(proc.time() - start_time) user system elapsed 0.246 0.013 0.260 > print(estimate) [1] 0.999495 >
On my machine, this comes out at 0.26 seconds, making it the slowest vectorised solution we've seen so far.