Python is a dynamic, interpreted programming language that is flexible enough to be used for a variety of tasks, from scripts to mission-critical applications. It currently has a very strong presence in scientific computing. Despite being an easy-to-use, high-level language, it is also capable of achieving very high performance.
Runnable pseudocode: Python's syntax is readable and clear.
from numpy import zeros, random, sqrt
gamma = random.gamma
normal = random.normal
def pygibbs(N=20000, thin=200):
mat = zeros((N,2))
x,y = mat[0]
for i in range(N):
for j in range(thin):
x = gamma(3, y**2 + 4)
y = normal(1./(x+1), 1./sqrt(2*(x+1)))
mat[i] = x,y
return mat
Python objects are dynamically referenced
y = 5
y = 'foo'
x = y = [1, 2, 3]
y[0] = -9
x
[-9, 2, 3]
... but strongly typed!
'5' + 6
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-3-cfde8aeadc63> in <module>() ----> 1 '5' + 6 TypeError: cannot concatenate 'str' and 'int' objects
NumPy and SciPy are fundamental scientific modules for Python.
In particular, NumPy provides:
ndarray
SciPy provides a large suite of standard scientific computing:
# Calculate second difference matrix
import numpy as np
I2 = -2*np.eye(8)
E = np.diag(np.ones(7), k=-1)
I2 + E + E.T
array([[-2., 1., 0., 0., 0., 0., 0., 0.], [ 1., -2., 1., 0., 0., 0., 0., 0.], [ 0., 1., -2., 1., 0., 0., 0., 0.], [ 0., 0., 1., -2., 1., 0., 0., 0.], [ 0., 0., 0., 1., -2., 1., 0., 0.], [ 0., 0., 0., 0., 1., -2., 1., 0.], [ 0., 0., 0., 0., 0., 1., -2., 1.], [ 0., 0., 0., 0., 0., 0., 1., -2.]])
from scipy import linalg
def lse(A, b, B, d, cond=None):
"""
Equality-contrained least squares.
The following algorithm minimizes ||Ax - b|| subject to the
constrain Bx = d.
"""
A, b, B, d = map(np.asanyarray, (A, b, B, d))
p = B.shape[0]
# QR decomposition of constraint matrix B
Q, R = linalg.qr(B.T)
# Solve Ax = b, assuming A is triangular
y = linalg.solve_triangular(R[:p, :p], d, trans='T', lower=False)
A = np.dot(A, Q)
# Least squares solution to Ax = b
z = linalg.lstsq(A[:, p:], b - np.dot(A[:, :p], y),
cond=cond)[0].ravel()
return np.dot(Q[:, :p], y) + np.dot(Q[:, p:], z)
A, b = [[0, 2, 3], [1, 3, 4.5]], [1, 1]
B, d = [[1, 1, 0]], [1]
lse(A, b, B, d)
array([-0.5 , 1.5 , -0.6667])
This is the most widely-used 2-D plotting module for Python, allowing users to generate publication-quality plots. Matplotlib offers a high level of control and the ability to generate interactive output.
import matplotlib.pyplot as plt
fig = plt.figure()
x = np.linspace(0,2*np.pi,100)
y = 2*np.sin(x)
ax = fig.add_subplot(1,2,1)
ax.plot(x,y)
y2 = y + 0.1*np.random.normal( size=x.shape )
ax = fig.add_subplot(1,2,2)
ax.plot(x,y2,'bo')
[<matplotlib.lines.Line2D at 0x10c269fd0>]
from pymc.examples import gelman_bioassay
from pymc import MCMC, Matplot
M = MCMC(gelman_bioassay)
M.sample(1000,verbose=0)
[****************100%******************] 1000 of 1000 complete
Matplot.plot(M.alpha)
Plotting alpha
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
# Generate some data from five different probability distributions,
# each with different characteristics. We want to play with how an IID
# bootstrap resample of the data preserves the distributional
# properties of the original sample, and a boxplot is one visual tool
# to make this assessment
numDists = 5
randomDists = ['Normal(1,1)',' Lognormal(1,1)', 'Exp(1)', 'Gumbel(6,4)',
'Triangular(2,9,11)']
N = 500
norm = np.random.normal(1,1, N)
logn = np.random.lognormal(1,1, N)
expo = np.random.exponential(1, N)
gumb = np.random.gumbel(6, 4, N)
tria = np.random.triangular(2, 9, 11, N)
# Generate some random indices that we'll use to resample the original data
# arrays. For code brevity, just use the same random indices for each array
bootstrapIndices = np.random.random_integers(0, N-1, N)
normBoot = norm[bootstrapIndices]
expoBoot = expo[bootstrapIndices]
gumbBoot = gumb[bootstrapIndices]
lognBoot = logn[bootstrapIndices]
triaBoot = tria[bootstrapIndices]
data = [norm, normBoot, logn, lognBoot, expo, expoBoot, gumb, gumbBoot,
tria, triaBoot]
fig = plt.figure(figsize=(10,6))
fig.canvas.set_window_title('A Boxplot Example')
ax1 = fig.add_subplot(111)
plt.subplots_adjust(left=0.075, right=0.95, top=0.9, bottom=0.25)
bp = plt.boxplot(data, notch=0, sym='+', vert=1, whis=1.5)
plt.setp(bp['boxes'], color='black')
plt.setp(bp['whiskers'], color='black')
plt.setp(bp['fliers'], color='red', marker='+')
# Add a horizontal grid to the plot, but make it very light in color
# so we can use it for reading data values but not be distracting
ax1.yaxis.grid(True, linestyle='-', which='major', color='lightgrey',
alpha=0.5)
# Hide these grid behind plot objects
ax1.set_axisbelow(True)
ax1.set_title('Comparison of IID Bootstrap Resampling Across Five Distributions')
ax1.set_xlabel('Distribution')
ax1.set_ylabel('Value')
# Now fill the boxes with desired colors
boxColors = ['darkkhaki','royalblue']
numBoxes = numDists*2
medians = range(numBoxes)
for i in range(numBoxes):
box = bp['boxes'][i]
boxX = []
boxY = []
for j in range(5):
boxX.append(box.get_xdata()[j])
boxY.append(box.get_ydata()[j])
boxCoords = zip(boxX,boxY)
# Alternate between Dark Khaki and Royal Blue
k = i % 2
boxPolygon = Polygon(boxCoords, facecolor=boxColors[k])
ax1.add_patch(boxPolygon)
# Now draw the median lines back over what we just filled in
med = bp['medians'][i]
medianX = []
medianY = []
for j in range(2):
medianX.append(med.get_xdata()[j])
medianY.append(med.get_ydata()[j])
plt.plot(medianX, medianY, 'k')
medians[i] = medianY[0]
# Finally, overplot the sample averages, with horixzontal alignment
# in the center of each box
plt.plot([np.average(med.get_xdata())], [np.average(data[i])],
color='w', marker='*', markeredgecolor='k')
# Set the axes ranges and axes labels
ax1.set_xlim(0.5, numBoxes+0.5)
top = 40
bottom = -5
ax1.set_ylim(bottom, top)
xtickNames = plt.setp(ax1, xticklabels=np.repeat(randomDists, 2))
plt.setp(xtickNames, rotation=45, fontsize=8)
# Due to the Y-axis scale being different across samples, it can be
# hard to compare differences in medians across the samples. Add upper
# X-axis tick labels with the sample medians to aid in comparison
# (just use two decimal places of precision)
pos = np.arange(numBoxes)+1
upperLabels = [str(np.round(s, 2)) for s in medians]
weights = ['bold', 'semibold']
for tick,label in zip(range(numBoxes),ax1.get_xticklabels()):
k = tick % 2
ax1.text(pos[tick], top-(top*0.05), upperLabels[tick],
horizontalalignment='center', size='x-small', weight=weights[k],
color=boxColors[k])
# Finally, add a basic legend
plt.figtext(0.80, 0.08, str(N) + ' Random Numbers' ,
backgroundcolor=boxColors[0], color='black', weight='roman',
size='x-small')
plt.figtext(0.80, 0.045, 'IID Bootstrap Resample',
backgroundcolor=boxColors[1],
color='white', weight='roman', size='x-small')
plt.figtext(0.80, 0.015, '*', color='white', backgroundcolor='silver',
weight='roman', size='medium')
plt.figtext(0.815, 0.013, ' Average Value', color='black', weight='roman',
size='x-small')
plt.show()
IPython is an enhanced Python shell which provides a more robust and productive development environment for users. It includes the HTML notebook featured here, as well as support for interactive data visualization and easy high-performance parallel computing.
IPython has a set of predefined ‘magic functions’ that you can call with a command line style syntax. These include:
%run
%edit
%debug
%timeit
%paste
%load_ext
For example, we can use %load_ext
to load an extension to run R code within IPython:
%load_ext rmagic
x,y = arange(10), random.normal(size=10)
%%R -i x,y -o XYcoef
lm.fit <- lm(y~x)
par(mfrow=c(2,2))
plot(lm.fit)
XYcoef <- coef(lm.fit)
XYcoef
array([ 0.41949922, -0.10004703])
Rcpp
benchmarks¶%%R
library(compiler)
library(rbenchmark)
library(inline)
library(RcppGSL)
Rgibbs <- function(N,thin) {
mat <- matrix(0,ncol=2,nrow=N)
x <- 0
y <- 0
for (i in 1:N) {
for (j in 1:thin) {
x <- rgamma(1,3,y*y+4)
y <- rnorm(1,1/(x+1),1/sqrt(2*(x+1)))
}
mat[i,] <- c(x,y)
}
mat
}
RCgibbs <- cmpfun(Rgibbs)
gibbscode <- '
// n and thin are SEXPs which the Rcpp::as function maps to C++ vars
int N = as<int>(n);
int thn = as<int>(thin);
int i,j;
NumericMatrix mat(N, 2);
RNGScope scope; // Initialize Random number generator
// The rest of the code follows the R version
double x=0, y=0;
for (i=0; i<N; i++) {
for (j=0; j<thn; j++) {
x = ::Rf_rgamma(3.0,1.0/(y*y+4));
y = ::Rf_rnorm(1.0/(x+1),1.0/sqrt(2*x+2));
}
mat(i,0) = x;
mat(i,1) = y;
}
return mat; // Return to R
'
RcppGibbs <- cxxfunction(signature(n="int", thin = "int"), gibbscode, plugin="Rcpp")
N <- 1000
thn <- 10
res <- benchmark(Rgibbs(N, thn), RCgibbs(N, thn), RcppGibbs(N, thn), order="relative", replications=10)
print(res)
Loading required package: Rcpp test replications elapsed relative user.self sys.self user.child 3 RcppGibbs(N, thn) 10 0.027 1.00000 0.027 0.001 0 2 RCgibbs(N, thn) 10 0.888 32.88889 0.883 0.005 0 1 Rgibbs(N, thn) 10 1.126 41.70370 1.115 0.010 0 sys.child 3 0 2 0 1 0
Pandas provides high-performance data structures for data manipulation and analysis. In particular, it allows for intelligent data alignment and integrated handling of missing data. Datasets can be easily reshaped, sliced, subsetted, and indexed hierarchically.
$$\text{data.frame} \subsetneq \text{pandas.DataFrame}$$Here's a trivial example of a pandas DataFrame
, populated with columns of various types:
from pandas import *
df = DataFrame({'A' : ['one', 'one', 'two', 'three'] * 6,
'B' : ['A', 'B', 'C'] * 8,
'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
'D' : np.random.randn(24),
'E' : np.random.randn(24),
'F' : DateRange(start='4/1/2012', periods=24)})
df
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | one | A | foo | 0.799663 | 0.030021 | 2012-04-02 00:00:00 |
1 | one | B | foo | -0.522646 | 0.220671 | 2012-04-03 00:00:00 |
2 | two | C | foo | -0.468866 | 0.016852 | 2012-04-04 00:00:00 |
3 | three | A | bar | -0.393212 | 0.593464 | 2012-04-05 00:00:00 |
4 | one | B | bar | -0.575404 | -1.621361 | 2012-04-06 00:00:00 |
5 | one | C | bar | 0.228977 | -1.256740 | 2012-04-09 00:00:00 |
6 | two | A | foo | -1.226878 | -0.467621 | 2012-04-10 00:00:00 |
7 | three | B | foo | -1.449318 | 1.317463 | 2012-04-11 00:00:00 |
8 | one | C | foo | 0.158045 | -0.388078 | 2012-04-12 00:00:00 |
9 | one | A | bar | -0.149894 | 0.498263 | 2012-04-13 00:00:00 |
10 | two | B | bar | 0.061315 | -0.999789 | 2012-04-16 00:00:00 |
11 | three | C | bar | -0.087721 | 0.697154 | 2012-04-17 00:00:00 |
12 | one | A | foo | 0.849112 | 0.044553 | 2012-04-18 00:00:00 |
13 | one | B | foo | -1.305085 | -1.332287 | 2012-04-19 00:00:00 |
14 | two | C | foo | -1.435599 | -1.261241 | 2012-04-20 00:00:00 |
15 | three | A | bar | -0.560977 | -1.133941 | 2012-04-23 00:00:00 |
16 | one | B | bar | -0.672480 | -0.093903 | 2012-04-24 00:00:00 |
17 | one | C | bar | 0.299152 | -0.886278 | 2012-04-25 00:00:00 |
18 | two | A | foo | 1.912571 | -1.428091 | 2012-04-26 00:00:00 |
19 | three | B | foo | 0.474440 | -0.104162 | 2012-04-27 00:00:00 |
20 | one | C | foo | -0.451927 | -0.044103 | 2012-04-30 00:00:00 |
21 | one | A | bar | 0.135511 | 1.126078 | 2012-05-01 00:00:00 |
22 | two | B | bar | -1.067146 | 0.057069 | 2012-05-02 00:00:00 |
23 | three | C | bar | 0.485477 | -0.241935 | 2012-05-03 00:00:00 |
Pandas ships with a few sample datasets, such as 100 records of baseball batting data, which is easily imported into a DataFrame
using the read_csv
function:
baseball = read_csv("/Users/fonnescj/Code/pandas/doc/data/baseball.csv")
baseball
<class 'pandas.core.frame.DataFrame'> Int64Index: 100 entries, 88641 to 89534 Data columns: id 100 non-null values year 100 non-null values stint 100 non-null values team 100 non-null values lg 100 non-null values g 100 non-null values ab 100 non-null values r 100 non-null values h 100 non-null values X2b 100 non-null values X3b 100 non-null values hr 100 non-null values rbi 100 non-null values sb 100 non-null values cs 100 non-null values bb 100 non-null values so 100 non-null values ibb 100 non-null values hbp 100 non-null values sh 100 non-null values sf 100 non-null values gidp 100 non-null values dtypes: float64(9), int64(10), object(3)
The DataFrame
possesses a suite of methods for indexing, slicing and manipulating its contents. Here's an arbitrary record:
baseball.ix[88650]
id johnsra05 year 2006 stint 1 team NYA lg AL g 33 ab 6 r 0 h 1 X2b 0 X3b 0 hr 0 rbi 0 sb 0 cs 0 bb 0 so 4 ibb 0 hbp 0 sh 0 sf 0 gidp 0 Name: 88650
We can carry out spreadsheet-like actions on a DataFrame
, such as pivoting and crosstabs:
pivot_table(baseball, values=['hr','so'], rows=['team', 'year'], aggfunc=sum)
hr | so | ||
---|---|---|---|
team | year | ||
ARI | 2006 | 15 | 58 |
ARI | 2007 | 0 | 13 |
ATL | 2007 | 0 | 29 |
BAL | 2007 | 1 | 23 |
BOS | 2006 | 0 | 1 |
BOS | 2007 | 20 | 101 |
CHA | 2007 | 35 | 134 |
CHN | 2006 | 1 | 4 |
CHN | 2007 | 9 | 48 |
CIN | 2007 | 36 | 127 |
CLE | 2007 | 0 | 29 |
COL | 2007 | 2 | 14 |
DET | 2007 | 37 | 176 |
FLO | 2007 | 0 | 0 |
HOU | 2007 | 14 | 212 |
KCA | 2007 | 2 | 15 |
LAA | 2007 | 0 | 0 |
LAN | 2006 | 0 | 7 |
LAN | 2007 | 36 | 141 |
MIL | 2006 | 0 | 2 |
MIL | 2007 | 9 | 60 |
MIN | 2007 | 6 | 32 |
NYA | 2006 | 0 | 4 |
NYA | 2007 | 0 | 0 |
NYN | 2007 | 61 | 310 |
OAK | 2007 | 8 | 65 |
PHI | 2007 | 0 | 26 |
SDN | 2007 | 0 | 31 |
SFN | 2006 | 6 | 55 |
SFN | 2007 | 40 | 188 |
SLN | 2007 | 13 | 98 |
TBA | 2007 | 0 | 0 |
TEX | 2007 | 28 | 140 |
TOR | 2007 | 58 | 265 |
crosstab(baseball.year, baseball.lg, margins=True, colnames=['league'])
league | AL | NL | All |
---|---|---|---|
year | |||
2006 | 3 | 5 | 8 |
2007 | 35 | 57 | 92 |
All | 38 | 62 | 100 |
Not to mention powerful split/apply/combine methods:
def get_stats(group):
return {'max': group.max(), 'count': group.count(), 'mean': group.mean()}
baseball.groupby('lg')["sb"].apply(get_stats)
lg AL {'count': 38, 'max': 22.0, 'mean': 1.4210526315789473} NL {'count': 62, 'max': 14.0, 'mean': 1.3548387096774193}
statsmodels is a rapidly-evolving module for statistical modeling in Python, providing a set of models, statistical tests, plotting functions including:
import statsmodels.api as sm
data = sm.datasets.scotland.load()
data.exog = sm.add_constant(data.exog)
/Library/Python/2.7/site-packages/statsmodels-0.5.0-py2.7-macosx-10.7-intel.egg/statsmodels/tools/tools.py:301: FutureWarning: The default of `prepend` will be changed to True in 0.5.0, use explicit prepend FutureWarning)
gamma_model = sm.GLM(data.endog, data.exog, family=sm.families.Gamma())
gamma_results = gamma_model.fit()
gamma_results.summary()
Dep. Variable: | y | No. Observations: | 32 |
---|---|---|---|
Model: | GLM | Df Residuals: | 24 |
Model Family: | Gamma | Df Model: | 7 |
Link Function: | inverse_power | Scale: | 0.00358428317349 |
Method: | IRLS | Log-Likelihood: | -83.017 |
Date: | Thu, 14 Jun 2012 | Deviance: | 0.087389 |
Time: | 13:07:46 | Pearson chi2: | 0.0860 |
No. Iterations: | 5 |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
x1 | 4.962e-05 | 1.62e-05 | 3.060 | 0.005 | 1.78e-05 8.14e-05 |
x2 | 0.0020 | 0.001 | 3.824 | 0.001 | 0.001 0.003 |
x3 | -7.181e-05 | 2.71e-05 | -2.648 | 0.014 | -0.000 -1.87e-05 |
x4 | 0.0001 | 4.06e-05 | 2.757 | 0.011 | 3.23e-05 0.000 |
x5 | -1.468e-07 | 1.24e-07 | -1.187 | 0.247 | -3.89e-07 9.56e-08 |
x6 | -0.0005 | 0.000 | -2.159 | 0.041 | -0.001 -4.78e-05 |
x7 | -2.427e-06 | 7.46e-07 | -3.253 | 0.003 | -3.89e-06 -9.65e-07 |
const | -0.0178 | 0.011 | -1.548 | 0.135 | -0.040 0.005 |
f2py is a Fortran interface generator for Python that ships with NumPy. It is compatible with NumPy's ndarray
objects, and allows for the use of Fortran 77/90/95 code within Python by simply adding structured comments to functions and subroutines.
SUBROUTINE bernoulli(x,p,nx,np,like)
cf2py logical dimension(nx),intent(in) :: x
cf2py double precision dimension(np),intent(in) :: p
cf2py integer intent(hide),depend(x) :: nx=len(x)
cf2py integer intent(hide),depend(p),check(len(p)==1 || len(p)==len(x)):: np=len(p)
cf2py double precision intent(out) :: like
cf2py threadsafe
IMPLICIT NONE
INTEGER np,nx,i
DOUBLE PRECISION p(np), ptmp, like
LOGICAL x(nx)
LOGICAL not_scalar_p
DOUBLE PRECISION infinity
PARAMETER (infinity = 1.7976931348623157d308)
C Check parameter size
not_scalar_p = (np .NE. 1)
like = 0.0
ptmp = p(1)
do i=1,nx
if (not_scalar_p) ptmp = p(i)
if (ptmp .LT. 0.0) then
like = -infinity
RETURN
endif
if (x(i)) then
like = like + dlog(ptmp)
else
like = like + dlog(1.0D0 - ptmp)
endif
enddo
return
END
from pymc.flib import bernoulli
bernoulli([1,0,1], 0.3)
-2.7646205525906047
Cython is a language that allows Python programmers to write fast code without having to write C/C++/Fortran directly. It looks much like Python code, but with type declarations. Cython takes this code, translates it to C, then compiles the generated C code to create a Python extension.
Gibbs sampler for function:
$$f(x,y) = x x^2 \exp(-xy^2 - y^2 + 2y - 4x)$$using conditional distributions:
$$x|y \sim Gamma(3, y^2 +4)$$$$y|x \sim Normal(\frac{1}{1+x}, \frac{1}{2(1+x)})$$Here again is the pure Python implementation:
from numpy import zeros, random, sqrt
gamma = random.gamma
normal = random.normal
def pygibbs(N=20000, thin=200):
mat = zeros((N,2))
x,y = mat[0]
for i in range(N):
for j in range(thin):
x = gamma(3, y**2 + 4)
y = normal(1./(x+1), 1./sqrt(2*(x+1)))
mat[i] = x,y
return mat
timeit pygibbs(1000, 10)
10 loops, best of 3: 86.3 ms per loop
We can usually get a marginal speedup simply by compiling the Python code, unchanged, using cython:
%load_ext cythonmagic
%%cython
from numpy import zeros, random, sqrt
gamma = random.gamma
normal = random.normal
def pygibbs2(N=20000, thin=200):
mat = zeros((N,2))
x,y = mat[0]
for i in range(N):
for j in range(thin):
x = gamma(3, y**2 + 4)
y = normal(1./(x+1), 1./sqrt(2*(x+1)))
mat[i] = x,y
return mat
timeit pygibbs2(1000, 10)
10 loops, best of 3: 76.5 ms per loop
For additional gains, we can simply declare variable types, like you would in C:
%%cython
from numpy import zeros, random, sqrt
from numpy cimport *
gamma = random.gamma
normal = random.normal
def pygibbs3(int N=20000, int thin=200):
cdef ndarray[float64_t, ndim=2] mat = zeros((N,2))
cdef float64_t x,y = 0
cdef int i,j
for i in range(N):
for j in range(thin):
x = gamma(3, y**2 + 4)
y = normal(1./(x+1), 1./sqrt(2*(x+1)))
mat[i] = x,y
return mat
timeit pygibbs3(1000, 10)
10 loops, best of 3: 68.7 ms per loop
A full-flown "cythonization" involves using GSL's random number generators, and giving cython a few more instructions:
%%cython -lm -lgsl -lgslcblas
cimport cython
import numpy as np
from numpy cimport *
cdef extern from "math.h":
double sqrt(double)
cdef extern from "gsl/gsl_rng.h":
ctypedef struct gsl_rng_type
ctypedef struct gsl_rng
gsl_rng_type *gsl_rng_mt19937
gsl_rng *gsl_rng_alloc(gsl_rng_type * T) nogil
cdef extern from "gsl/gsl_randist.h":
double gamma "gsl_ran_gamma"(gsl_rng * r,double,double)
double gaussian "gsl_ran_gaussian"(gsl_rng * r,double)
cdef gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937)
@cython.wraparound(False)
@cython.boundscheck(False)
def gibbs(int N=20000,int thin=500):
cdef:
double x=0
double y=0
int i, j
ndarray[float64_t, ndim=2] samples
samples = np.empty((N,thin))
for i from 0 <= i < N:
for j from 0 <= j < thin:
x = gamma(r,3,1.0/(y*y+4))
y = gaussian(r,1.0/sqrt(x+1))
samples[i,0] = x
samples[i,1] = y
return samples
timeit gibbs(1000, 10)
1000 loops, best of 3: 1.38 ms per loop
Before running the next cell, make sure you have first started your cluster, you can use the clusters tab in the dashboard to do so. Because this example transfers lots of large arrays, we recommend that you first configure your cluster to use the 'NoDB' hub messaging support, which removes a few features but has the lowest memory footprint. You can do so by putting in your IPython profile directory a file called ipcontroller_config.py
that contains simply:
# Configuration file for ipcontroller.
c = get_config()
# The class to use for the DB backend
c.HubFactory.db_class = 'IPython.parallel.controller.dictdb.NoDB'
See the IPython docs for further details.
from IPython.parallel import Client
rc = Client()
dv = rc[:]
dv.block = True
dv.activate()
Now, we load the cython magic on all engines and execute the cython magic as well on all engines:
%px %load_ext cythonmagic
Parallel execution on engine(s): [0, 1]
%%px
%%cython -lm -lgsl -lgslcblas
cimport cython
import numpy as np
from numpy cimport *
cdef extern from "math.h":
double sqrt(double)
cdef extern from "gsl/gsl_rng.h":
ctypedef struct gsl_rng_type
ctypedef struct gsl_rng
gsl_rng_type *gsl_rng_mt19937
gsl_rng *gsl_rng_alloc(gsl_rng_type * T) nogil
cdef extern from "gsl/gsl_randist.h":
double gamma "gsl_ran_gamma"(gsl_rng * r,double,double)
double gaussian "gsl_ran_gaussian"(gsl_rng * r,double)
cdef gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937)
@cython.wraparound(False)
@cython.boundscheck(False)
def gibbs(int N=20000,int thin=500):
cdef:
double x=0
double y=0
int i, j
ndarray[float64_t, ndim=2] samples
samples = np.empty((N,thin))
for i from 0 <= i < N:
for j from 0 <= j < thin:
x = gamma(r,3,1.0/(y*y+4))
y = gaussian(r,1.0/sqrt(x+1))
samples[i,0] = x
samples[i,1] = y
return samples
Parallel execution on engine(s): [0, 1]
Divide the array by the number of nodes. In this case they divide evenly, a more general partitioning of sizes is easy to do as well.
N = 1000
thin = 10
n = N/len(rc.ids)
dv.push(dict(n=n, thin=thin))
# Let's just confirm visually we got what we expect
dv['n']
[500, 500]
We can time purely the execution of the gibbs sampler on the remote nodes
%%timeit
dv.execute('gibbs(n, thin)')
100 loops, best of 3: 8.23 ms per loop
But a more realistic (and costly) benchmark must also include the cost of bringing the results back from the cluster engines to our local namespace. For that, we assign the call to the variable a
on each node and then use the view's gather
method to pull them back in:
%%timeit
dv.execute('a = gibbs(n, thin)')
a = dv.gather('a')
100 loops, best of 3: 13.9 ms per loop
Here we see how on a 4-core machine we get close to a 4-fold speedup on pure execution, but once we have to pay the price of communicating the results back to the local engine, the benefit is less. In a real parallel application, much of the art of optimization comes in trying to keep the movement of large amounts of data to a minimum, as communication is always much more expensive than computation in today's architectures.
Timed on a 11" MacBook Air (1.8 GHz Intel Core i7)
Elapsed time for 10 replications of the Gibbs sampler code, each run for 4 million iterations.
model | elapsed | relative |
---|---|---|
Julia distributed | 3.38 | 1.0 |
GSLGibbs | 5.46 | 1.6 |
Cython | 5.58 | 1.7 |
Julia native RNG | 7.03 | 2.1 |
RcppGibbs | 9.95 | 2.9 |
BoostGibbs | 10.65 | 3.2 |
Julia libRMath | 17.27 | 5.1 |
Pure Python | 278.51 | 82.4 |
RCgibbs | 328.56 | 97.2 |
Rgibbs | 419.87 | 124.2 |