git clone git://github.com/jseabold/zorro.git zorro-talk
# or, using subversion
svn checkout https://github.com/jseabold/zorro zorro-talk
cd zorro-talk
ipython notebook --notebook-dir=.
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Image
Image(filename="./parallel_architecture400.png")
Three Core Parts
Client object connects to the cluster
DirectView class for explicitly running code on a particular engine(s)
LoadBalancedView class for running your code on the 'best' engine(s)

from multiprocessing import cpu_count
print cpu_count()
Start a controller and 4 engines with the ipcluster program.
At the command line, type:
ipcluster start -n 4
from IPython import parallel
rc = parallel.Client(profile='hpc')
rc.block = True
rc.ids
def power(a, b):
return a**b
Get a direct view of kernel 0:

dv = rc[0]
dv
dv.apply(power, 2, 10)
Recall that slice notation allows you to leave out the start, stop, and step values
X = [1, 2, 3, 4]
X
X[:]
Use this to send code to all the engines
rc[:].apply_sync(power, 2, 10)
Python's built-in map function allows you to call a function over a sequence of arguments
map(power, [2]*10, range(10))
In parallel, you use view.map
view = rc.load_balanced_view()
view.map(power, [2]*10, range(10))
"Premature optimization is the root of all evil".
-Donald Knuth
The speed-up from running in parallel is

$$S_p = \frac{T_1}{T_p}$$

where $T_1$ is the time it takes to run the code serially and $T_p$ is the time it takes using $p$ processors.

Amdahl's law puts an upper bound on the achievable speed-up:

$$S = \frac{1}{(1 - P) + P/N}$$

where $P$ is the proportion of the code that can be parallelized and $N$ is the number of processors.
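A quick numeric illustration (this helper is ours, not part of the talk's code): with 90% of a program parallelizable, the speed-up can never exceed 10x, no matter how many processors you add.

def amdahl_speedup(P, N):
    # upper bound on speed-up with fraction P parallelizable on N processors
    return 1. / ((1 - P) + P / float(N))

print amdahl_speedup(.9, 4)    # about 3.08
print amdahl_speedup(.9, 1e6)  # approaches the limit of 10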
Take-aways
from IPython import parallel
rc = parallel.Client(profile='hpc')
Use map and apply in parallel.

rc.block = True
dview = rc.direct_view()
dview.block = False
dview["a"] = 5 # shorthand for push
dview["b"] = 7
dview.apply_sync(lambda x: a + b + x, 27)
You can override the blocking behavior per call with apply_sync and apply_async.
d = {}
d["a"] = 5
d
Views support pushing Python objects to the engines by key, or by using get and update like built-in dicts (see the sketch after the pull example below). push takes a dictionary:

dview.push(dict(msg="Hi, there"), block=True)
dview.block = True
dview.execute("x = msg")
dview["x"] # shorthand for pull
#rc[::2].execute("c = a + b")
# or
dview.execute("c = a + b", targets=[0,2])
#rc[1::2].execute("c = a - b")
# or
dview.execute("c = a - b", targets=[1,3])
dview.pull("c")
Asynchronous calls return an AsyncResult object back immediately.

def wait(t):
    import time
    tic = time.time()
    time.sleep(t)
    return time.time() - tic
ar = dview.apply_async(wait, 2)
type(ar)
ar.get()
You can check whether the result is done with the ready method.

ar = dview.apply_async(wait, 15)
print ar.ready()
ar.get(5)  # raises a TimeoutError if the result isn't ready within 5 seconds
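Since wait(15) can't finish within five seconds, the get above raises a TimeoutError. A minimal sketch of catching it (the exception lives in IPython.parallel.error):

from IPython.parallel import error

try:
    ar.get(5)
except error.TimeoutError:
    print "not ready yet; blocking until the result arrives"
    print ar.get()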
There is also a wait method. wait can take an iterable of AsyncResults.
result_list = [dview.apply_async(wait, 3) for i in range(5)]
result_list
dview.wait(result_list)
result_list[4].get()
Use scatter to partition an iterable across engines; gather pulls the results back.

dview.scatter('x', range(64))
%px y = [i**10 for i in x]
y = dview.gather('y')
print y[:10]

The % indicates that we are using an IPython 'magic' function.
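For reference, the same remote statement can be run without the magic, via the view's execute method:

dview.execute("y = [i**10 for i in x]")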
rc = parallel.Client(profile='hpc')
lview = rc.load_balanced_view()
lview.block = True
parallel_result = lview.map(lambda x:x**10, range(32))
print parallel_result[:10]
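The blocking map above returns the results directly. A non-blocking sketch using map_async, which returns an AsyncMapResult you can poll and then index once it's done:

ar = lview.map_async(lambda x: x**10, range(32))
print ar.ready()     # likely False immediately after submission
print ar.get()[:10]  # get() blocks until all tasks finish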
y = np.array([4.284, 4.149, 3.877, .533, 2.211, 2.389,
2.145, 3.231, 1.998, 1.379, 2.106, 1.428,
1.011, 2.179, 2.858, 1.388, 1.651, 1.593,
1.046, 2.152])
x = np.array([.286, .645, .973, .585, .384, .310,
.276, .058, .973, .455, .543, .779,
.957, .259, .948, .202, .543, .028,
.797, .099, .936, .142, .889, .296,
.006, .175, .828, .180, .399, .842,
.617, .039, .939, .103, .784, .620,
.072, .158, .889, .704]).reshape(20,2)
x = np.column_stack((np.ones(len(x)), x))
print y
print x
def func(params, y, x):
    # residual function; the model is nonlinear because the coefficient
    # on the third column of x is the square of the second parameter
    import numpy as np
    theta = np.r_[params[0], params[1], params[1]**2]
    return y - np.dot(x, theta)
theta1, theta2 = np.mgrid[-3:3:100j,-3:3:100j]
Z = [np.sum(func([i,j], y, x)**2) for i,j in
zip(theta1.flatten(), theta2.flatten())]
Z = np.asarray(Z).reshape(100,100)
fig, ax = plt.subplots(figsize=(6, 6))
V = [16.1, 18, 20, 20.5, 21, 22, 24,
25, 30, 40, 50, 100, 200, 300,
400, 500, 600, 700]
c = ax.contour(theta1, theta2, Z, V)
im = ax.imshow(Z, interpolation='bilinear', origin='lower',
cmap=plt.cm.BrBG, extent=(-3,3,-3,3))
cb = plt.colorbar(c)
ax.set_xlabel(r'$\theta_1$')
ax.set_ylabel(r'$\theta_2$')
#ax.scatter([.864737, 2.35447, 2.49860664], [1.235748, -.319186, -0.98261242],
ax.scatter([.864737, 2.49860664], [1.235748, -0.98261242],
marker="x", s=30, color='black', lw=2)
ax.set_title('Loci of objective function')
ax.set_xlim([-3,3])
ax.set_ylim([-3,3])
ax.grid(False)
plt.show()
x1 = [0,0] # good
x2 = [2.354471, -.319186] # bad
x3 = [1, 1] # good
x4 = [-3.17604581, -0.680944] # bad
# assume we got these in some sane way
xs = np.random.normal(0, 4, size=(20, 2))
starts = np.row_stack((x1, x2, x3, x4, xs))
def optimize_func(start_params):
    # run nonlinear least squares from a given starting value
    return leastsq(func, start_params, args=(y, x))[0]
dview = rc[:]
with dview.sync_imports():
from scipy.optimize import leastsq
import numpy as np
dview.push(dict(func=func, y=y, x=x));
results = dview.map_sync(optimize_func, starts)
opt_func = lambda params: np.sum(func(params, y, x)**2)
i_best = np.argmin(map(opt_func, results))
print results[i_best]
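Because some starting values take far longer to converge than others, a load-balanced view can keep the engines busier than a direct-view map. A sketch, relying on the same pushed func, y, and x as above:

lview = rc.load_balanced_view()
results = lview.map(optimize_func, starts, block=True)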
Read more about this here
install StarCluster
$ pip install starcluster --user
setup your base config file
$ starcluster help
Select option 2.
Setup your config file with your SSH keys, etc.
Add your AWS credentials
Add the few IPython-specific lines to your config file
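Roughly, the lines in question enable StarCluster's bundled IPython plugin; a sketch following the StarCluster docs (adjust mycluster to your own template name):

[plugin ipcluster]
SETUP_CLASS = starcluster.plugins.ipcluster.IPCluster
ENABLE_NOTEBOOK = True

# and in your cluster template, turn the plugin on
[cluster mycluster]
PLUGINS = ipcluster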
Run
$ starcluster start mycluster
You can log in to your master node via ssh by running

$ starcluster sshmaster mycluster -u myuser

replacing the cluster name and user with your information as needed.
Or better yet, follow the instructions above to create a local IPython interpreter or notebook connected to your remote EC2 instance
Run parallel scripts or create views and use IPython just as if you would locally.
Log in to Zorro
Create a profile from the terminal:
ipython profile create --parallel --profile=your-profile-name
You can have as many as you want. For example, you may have a different profile depending on the queue you want to use, or one with different default imports on the engines. I named mine hpc.
Go to
$HOME/.ipython/profile_your-profile-name
You can check the location of your configuration directory by running
ipython locate
In this directory, edit the following lines in ipcluster_config.py to read:
c.IPClusterStart.controller_launcher_class = 'LSF'
c.IPClusterEngines.engine_launcher_class = 'LSF'
Set up the controller
Edit the following lines in ipcontroller_config.py to read:
c.HubFactory.ip = '*'
This is so that the controller listens on all interfaces for the engines.
Set up the engines
Edit the following lines in ipengine_config.py
c.IPEngineApp.work_dir = u'$HOME/scratch/'
c.EngineFactory.timeout = 10
The last step is to edit your ~/.bashrc and add the following line:
export PATH=$PATH:/app/epd/bin
Then type
source ~/.bashrc
This is so that you can run the ipcluster or ipcontroller scripts on the head node. Alternatively, you could create symlinks in your $HOME/bin folder:
ln -s /app/epd/bin/ipcluster ~/bin/
ln -s /app/epd/bin/ipcontroller ~/bin/
Create your batch scripts
Make two files in your working directory that will be your batch files for the engines and the controller. I named mine lsf.engine.template and lsf.controller.template. After the initial set-up these (or similar) files will control our job submission.
Tell the ipcluster_config.py file about your batch scripts by adding the following lines:
c.LSFEngineSetLauncher.batch_template_file = "lsf.engine.template"
c.LSFControllerLauncher.batch_template_file = "lsf.controller.template"
lsf.engine.template
#!/bin/bash
#BSUB-L /bin/bash
#BSUB-J ipython
#BSUB-q interactive
#BSUB-n {n}
#BSUB-u your-email@american.edu
#BSUB-N
#BSUB-c 5
# the ipython code
# enter your working directory
cd $HOME/scratch
export PATH=$HOME/bin:/app/epd/bin/
export PYTHONPATH=/app/epd-7.3-2-rh5-x86_64/lib/python2.7/site-packages
ipengine --profile=hpc
lsf.controller.template
#!/bin/bash
#BSUB-L /bin/bash
#BSUB-J ipython
#BSUB-q interactive
#BSUB-n 1
#BSUB-u your-email@american.edu
#BSUB-N
#BSUB-c 5 # timeout in minutes
cd $HOME/scratch
export PATH=$HOME/bin:/app/epd/bin/
export PYTHONPATH=/app/epd-7.3-2-rh5-x86_64/lib/python2.7/site-packages
ipcontroller --profile=hpc
You can run the cluster with
ipcluster start --profile=hpc --n=2
I run it with
ipcluster start --profile=hpc --n=2 &
The &
puts the job in the background.
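You can check that the controller and engine jobs were actually submitted with LSF's bjobs command:

$ bjobs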
When your job is done you can run
ipcluster stop --profile=hpc
You can also stop your engines (and the hub) from within your Python scripts by using
rc.shutdown(hub=True)
There are a few other ways to do this. Consult the IPython documentation and the examples/ directory.
scp js2796a@zorro.american.edu:/home/js2796a/.ipython/profile_hpc/security/ipcontroller-client.json ~/school/talks/zorro
<li>Run the following code, it will prompt for your password</li>
<li>If you aren't doing a public demo, you can just provide the password by argument</li>
<li>You might still have to type this password at the terminal as well</li>
NOTE: You will need to be running the same version of IPython locally as you are running on the server. This is currently 1.1 on the AU HPC and 0.13.1 on the StarCluster images.
from IPython import parallel
rc = parallel.Client("./ipcontroller-client.json", sshserver="your-login@hpcserver", timeout=60)
rc = parallel.Client("./ipcontroller-client.json", sshserver="js2796a@zorro.american.edu", timeout=60)
Alternatively, you can connect to a StarCluster just as easily.
$ starcluster start mycluster
$ starcluster sshmaster mycluster -u myuser
skipper@master:~$ ipython
In [1]: from IPython.parallel import Client
In [2]: rc = Client()
In [3]: view = rc[:]
In [4]: view.block = True
In [5]: rc.ids
Out[5]: [0, 1]
In [6]: view.execute("import socket; x = socket.gethostname()")
Out[6]: <AsyncResult: finished>
In [7]: view["x"]
Out[7]: ['master', 'node001']
view = rc[:]
view.block = True
rc.ids
view.execute("import socket; x = socket.gethostname()")
view["x"]
Two files, pidigits.py and parallelpi.py, are included in this repository.
They are from the IPython examples/parallel folder.
There are several interesting examples there that you might want to go through.
Copy them to your remote working directory:
scp ~/school/talks/zorro/parallelpi.py js2796a@american.edu:/home/js2796a/scratch/
scp ~/school/talks/zorro/pidigits.py js2796a@american.edu:/home/js2796a/scratch/
from pidigits import (plot_one_digit_freqs, plot_two_digit_freqs,
                      one_digit_freqs, two_digit_freqs,
                      txt_file_to_digits, fetch_pi_file)
view.execute("from pidigits import *")  # the engines need these functions too
import sympy
pi = sympy.pi.evalf(40)
print pi
Create 10,000 digits of $\pi$ using SymPy
pi = sympy.pi.evalf(10000)
# make a sequence of strings
digits = (d for d in str(pi)[2:])
freqs = one_digit_freqs(digits)
ax = plot_one_digit_freqs(freqs)
def compute_two_digit_freqs(filename):
    """
    Read digits of pi from a file and compute the 2 digit frequencies.
    """
    d = txt_file_to_digits(filename)
    freqs = two_digit_freqs(d)
    return freqs
def reduce_freqs(freqlist):
    """
    Add up a list of freq counts to get the total counts.
    """
    allfreqs = np.zeros_like(freqlist[0])
    for f in freqlist:
        allfreqs += f
    return allfreqs
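For reference, the loop in reduce_freqs is a plain elementwise sum, so (assuming freqlist holds same-shape arrays) it can also be written as a single NumPy call:

allfreqs = np.sum(freqlist, axis=0)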
Get the number of engines available
n = len(rc)
print n
Create the list of files to process.
filestring = 'pi200m.ascii.%(i)02dof20'
files = [filestring % {'i':i} for i in range(1,n+1)]
files
Download the data files on the engines if they don't already exist:
view.map(fetch_pi_file, files)
Run 10 million digits on 1 engine
from timeit import default_timer as clock
t1 = clock()
id0 = rc.ids[0]
freqs10m = rc[id0].apply_sync(compute_two_digit_freqs, files[0])
t2 = clock()
digits_per_second1 = 10.0e6/(t2-t1)
print "Digits per second (1 core, 10m digits): ", digits_per_second1
Now do the same on each engine in parallel
t1 = clock()
# Compute the digits
freqs_all = view.map(compute_two_digit_freqs, files[:n])
# Add up the frequencies from each engine.
freqsn10m = reduce_freqs(freqs_all)
t2 = clock()
digits_per_secondn = n*10.0e6/(t2-t1)
print "Digits per second (%i engines, %i0m digits): "%(n,n), digits_per_secondn
print "Speedup: ", digits_per_secondn/digits_per_second1
plot_two_digit_freqs(freqsn10m, figsize=(10,10))
plt.title("2 digit sequences in %i0m digits of pi" % n);