%%html
<style type="text/css">
.float-diagram {
height: 400px;
margin-top: -4em;
float: right;
}
</style>
Let's start with an overview of IPython's architecture for parallel and distributed computing. This architecture abstracts out parallelism in a very general way, enabling IPython to support many different styles of parallelism, including SPMD and MPMD parallelism, message passing using MPI, task farming, data-parallel computation, combinations of these approaches, and custom user-defined approaches.
Most importantly, IPython enables all types of parallel applications to be developed, executed, debugged, and monitored interactively. Hence the *I* in IPython. Some example use cases for `IPython.parallel`:
- Quickly parallelize algorithms that are embarrassingly parallel using a number of simple approaches. Many simple things can be parallelized interactively in one or two lines of code.
- Steer traditional MPI applications on a supercomputer from an IPython session on your laptop.
- Analyze and visualize large datasets (that could be remote and/or distributed) interactively using IPython and tools like matplotlib/TVTK.
- Develop, test, and debug new parallel algorithms (that may use MPI or PyZMQ) interactively.
- Tie together multiple MPI jobs running on different systems into one giant distributed and parallel system.
- Start a parallel job on your cluster and then have a remote collaborator connect to it and pull back data into their local IPython session for plotting and analysis.
- Run a set of tasks on a set of CPUs using dynamic load balancing.
The IPython architecture consists of four components:

- The IPython engine
- The IPython hub
- The IPython schedulers
- The cluster client
These components live in the `IPython.parallel` package and are installed with IPython.
The IPython engine is a Python instance that accepts Python commands over a network connection. When multiple engines are started, parallel and distributed computing becomes possible. An important property of an IPython engine is that it blocks while user code is being executed. Read on for how the IPython controller solves this problem to expose a clean asynchronous API to the user.
The IPython controller processes provide an interface for working with a set of engines. At a general level, the controller is a collection of processes to which IPython engines and clients can connect. The controller is composed of a **Hub** and a collection of **Schedulers**, which may be in processes or threads.
The controller provides a single point of contact for users who wish to utilize the engines in the cluster. There are a variety of different ways of working with a controller, but all of these models are implemented via the `View.apply` method, after constructing `View` objects to represent different collections of engines. The two primary models for interacting with engines are:

- A **Direct** interface, where engines are addressed explicitly
- A **LoadBalanced** interface, where the Scheduler is entrusted with assigning work to appropriate engines

Advanced users can readily extend the View models to enable other styles of parallelism.
The center of an IPython cluster is the Hub. The Hub can be viewed as an über-logger, which keeps track of engine connections, schedulers, and clients, as well as persisting all task requests and results in a database for later use.
All actions that can be performed on an engine go through a Scheduler. While the engines themselves block when user code is run, the Schedulers hide that from the user to provide a fully asynchronous interface to a set of engines. Each Scheduler is a small GIL-less function in C provided by pyzmq (the pure-Python load-balanced scheduler being an exception).
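The submit-now, collect-later pattern that the Schedulers expose is the same one Python's standard library offers for local workers. As a purely local analogy (this sketch uses `concurrent.futures`, not `IPython.parallel` itself):

```python
from concurrent.futures import ThreadPoolExecutor

def mul(a, b):
    return a * b

with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns a Future immediately, much as a Scheduler
    # hands back an AsyncResult without waiting for an engine
    futures = [pool.submit(mul, i, i + 1) for i in range(4)]
    # we block only at collection time, not at submission time
    results = [f.result() for f in futures]

print(results)  # [0, 2, 6, 12]
```

The point is the decoupling: submission never waits on a busy worker, and results are gathered whenever you are ready for them.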
All of this is implemented with the lovely ØMQ messaging library and pyzmq, its lightweight Python bindings, which allow very fast zero-copy communication of objects like numpy arrays.
There is one primary object, the `Client`, for connecting to a cluster. For each execution model, there is a corresponding `View`. These views allow users to interact with a set of engines through the interface. Here are the two default views:

- The `DirectView` class for explicit addressing
- The `LoadBalancedView` class for destination-agnostic scheduling

To follow along with this tutorial, you will need to start the IPython controller and four IPython engines. The simplest way of doing this is with the clusters tab, or you can use the `ipcluster` command in a terminal:
$ ipcluster start -n 4
There isn't time to go into it here, but `ipcluster` can be used to start engines and the controller with various batch systems, including SGE, PBS, LSF, MPI, SSH, and Windows HPC Server.
More information on starting and configuring the IPython cluster can be found in the IPython.parallel docs.
Once you have started the IPython controller and one or more engines, you are ready to use the engines to do something useful.
To make sure everything is working correctly, let's do a very simple demo:
from IPython import parallel
rc = parallel.Client()
rc.block = True
rc.ids
def mul(a,b):
return a*b
def summary():
"""summarize some info about this process"""
import os
import socket
import sys
return {
'cwd': os.getcwd(),
'Python': sys.version,
'hostname': socket.gethostname(),
'pid': os.getpid(),
}
mul(5,6)
summary()
What does it look like to call this function remotely? Just turn `f(*args, **kwargs)` into `view.apply(f, *args, **kwargs)`!
rc[0].apply(mul, 5, 6)
rc[0].apply(summary)
And the same thing in parallel?
rc[:].apply(mul, 5, 6)
rc[:].apply(summary)
Python has a built-in `map` for calling a function on a sequence of arguments:
map(mul, range(1,10), range(2,11))
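(A note for readers on Python 3, where `map` returns a lazy iterator rather than a list: wrap the call in `list()` to see the results.)

```python
def mul(a, b):
    return a * b

# In Python 3, map() is lazy; list() forces evaluation
results = list(map(mul, range(1, 10), range(2, 11)))
print(results)  # [2, 6, 12, 20, 30, 42, 56, 72, 90]
```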
So how do we do this in parallel?
view = rc.load_balanced_view()
view.map(mul, range(1,10), range(2,11))
And a preview of parallel magics:
%%px
import os, socket
print(os.getpid())
print(socket.gethostname())
Now let's get into some more detail about how to use IPython for [remote execution](tutorial/Remote Execution.ipynb).