Christopher Fonnesbeck
Department of Biostatistics, Vanderbilt University School of Medicine
reproducing conclusions from a single experiment based on the measurements from that experiment
The most basic form of reproducibility is a complete description of the data and associated analyses (including code!) so the results can be exactly reproduced by others.
Reproducing calculations can be onerous, even with one's own work!
Scientific data are becoming larger and more complex, making simple descriptions inadequate for reproducibility. As a result, most modern research is irreproducible without tremendous effort.
There are a number of steps to scientific endeavors that involve computing:
Many of the standard tools impose barriers between one or more of these steps, which can make it difficult to iterate on or reproduce work.
IPython is an enhanced Python shell which provides a more robust and productive development environment for users.
It includes the HTML notebook featured here, as well as support for interactive data visualization and easy high-performance parallel computing.
def f(x):
    return (x-3)*(x-5)*(x-7)+85
import numpy as np
x = np.linspace(0, 10, 200)
y = f(x)
plot(x,y)
[<matplotlib.lines.Line2D at 0x1065f8f50>]
The HTML notebook lets you document your workflow using either HTML or Markdown.
The IPython Notebook consists of two related components:
The Notebook can be used by starting the Notebook server with the command:
$ ipython notebook
This initiates an IPython engine, which is a Python instance that takes Python commands over a network connection.
The IPython controller provides an interface for working with a set of engines, to which one or more IPython clients can connect.
The Notebook gives you everything that a browser gives you. For example, you can embed images, videos, or entire websites.
from IPython.display import HTML
HTML("<iframe src=http://co-op.nashvl.org width=700 height=350></iframe>")
from IPython.display import YouTubeVideo
YouTubeVideo("BS4Wd5rwNwE")
Use %load to add remote code.
MathJax is a JavaScript display engine for LaTeX that allows equations to be embedded into HTML.
$$ \int_{a}^{b} f(x)\, dx \approx \frac{1}{2} \sum_{k=1}^{N} \left( x_{k} - x_{k-1} \right) \left( f(x_{k}) + f(x_{k-1}) \right). $$

SymPy is a Python library for symbolic mathematics. It supports:
from sympy import *
%load_ext sympyprinting
x, y = symbols("x y")
eq = ((x+y)**2 * (x+1))
eq
expand(eq)
(1/cos(x)).series(x, 0, 6)
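The trapezoid-rule formula rendered by MathJax earlier in this section can also be checked numerically. What follows is a minimal sketch in plain Python, not code from the original notebook: the helper name `trapezoid` and the choice of 200 panels are assumptions made for illustration, and `f` is the cubic from the plotting example. Over [0, 10] the exact integral of `f` is 850 (from its antiderivative), so the approximation should land very close to that value.

```python
def f(x):
    # Same cubic as in the plotting example
    return (x - 3) * (x - 5) * (x - 7) + 85


def trapezoid(func, a, b, n):
    """Approximate the integral of func over [a, b] with n equal-width panels.

    Hypothetical helper: it applies the sum from the formula term by term.
    """
    h = (b - a) / n
    xs = [a + k * h for k in range(n + 1)]  # grid points x_0 .. x_N
    total = sum(
        (xs[k] - xs[k - 1]) * (func(xs[k]) + func(xs[k - 1]))
        for k in range(1, n + 1)
    )
    return 0.5 * total


approx = trapezoid(f, 0.0, 10.0, 200)
print(approx)
```

Writing the sum out explicitly keeps the code in one-to-one correspondence with the formula; with NumPy arrays the same sum could be vectorized.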
IPython has a set of predefined ‘magic functions’ that you can call with a command-line style syntax. These include:
%run
%edit
%debug
%timeit
%paste
%load_ext
%lsmagic
Available line magics:
%alias %alias_magic %autocall %automagic %bookmark %cd %clear %colors %config %connect_info %debug %dhist %dirs %doctest_mode %ed %edit %env %gui %hist %history %install_default_config %install_ext %install_profiles %killbgscripts %less %load %load_ext %loadpy %logoff %logon %logstart %logstate %logstop %lsmagic %macro %magic %man %more %notebook %page %pastebin %pdb %pdef %pdoc %pfile %pinfo %pinfo2 %popd %pprint %precision %profile %prun %psearch %psource %pushd %pwd %pycat %pylab %qtconsole %quickref %recall %rehashx %reload_ext %rep %rerun %reset %reset_selective %run %save %sc %store %sx %system %tb %time %timeit %unalias %unload_ext %who %who_ls %whos %xdel %xmode

Available cell magics:
%%! %%bash %%capture %%file %%javascript %%latex %%perl %%prun %%pypy %%python %%python3 %%ruby %%script %%sh %%svg %%sx %%system %%time %%timeit

Automagic is ON, % prefix IS NOT needed for line magics.
Timing the execution of code; the timeit magic exists in both line and cell form:
%timeit np.linalg.eigvals(np.random.rand(100,100))
100 loops, best of 3: 8.34 ms per loop
%%timeit a = np.random.rand(100, 100)
np.linalg.eigvals(a)
100 loops, best of 3: 8.3 ms per loop
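Outside IPython, a comparable "best of N" measurement can be sketched with the standard-library `timeit` module. This is an illustration only: the helper name `best_time_per_loop`, the fixed loop counts, and the list-sorting workload are assumptions chosen so the example runs without NumPy, and unlike the magic it does not pick the number of loops automatically.

```python
import timeit


def best_time_per_loop(stmt, setup="pass", number=100, repeat=3):
    """Return the best per-loop time in seconds over `repeat` runs.

    Each run executes `stmt` `number` times; `setup` runs once per run
    and is excluded from the timing, like the first line of %%timeit.
    """
    timer = timeit.Timer(stmt, setup=setup)
    totals = timer.repeat(repeat=repeat, number=number)
    return min(totals) / number


# Illustrative workload: sort 1000 random floats.
setup = "import random; data = [random.random() for _ in range(1000)]"
best = best_time_per_loop("sorted(data)", setup=setup)
print("100 loops, best of 3: {:.3g} ms per loop".format(best * 1e3))
```

Reporting the minimum rather than the mean follows the usual reasoning that slower runs mostly reflect interference from other processes.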
IPython also creates aliases for a few common interpreters, such as bash, ruby, perl, etc.
These are all equivalent to %%script <name>.
%%ruby
puts "Hello from Ruby #{RUBY_VERSION}"
Hello from Ruby 1.8.7
%%bash
echo "hello from $BASH"
hello from /bin/bash
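Conceptually, a %%script cell hands the cell body to the named program and displays whatever that program prints. A rough sketch of that idea with the standard-library `subprocess` module follows. It is an illustration under that assumption, not IPython's actual implementation, and `run_script` is a hypothetical helper. To stay self-contained it launches a second Python interpreter rather than ruby or bash.

```python
import subprocess
import sys


def run_script(interpreter_argv, source):
    """Send `source` to the interpreter's stdin and return its stdout as text."""
    result = subprocess.run(
        interpreter_argv,
        input=source,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child exits non-zero
    )
    return result.stdout


# "python -" tells the child interpreter to read its program from stdin.
output = run_script([sys.executable, "-"], 'print("hello from a child Python")\n')
print(output, end="")
```

Swapping the argument list for `["bash"]` or `["ruby"]` would run the text through those interpreters instead, provided they are installed.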
IPython has an rmagic extension that contains some magic functions for working with R via rpy2. This extension can be loaded using the %load_ext magic as follows:
%load_ext rmagic
x, y = np.arange(10), np.random.normal(size=10)
%%R -i x,y -o XYcoef
lm.fit <- lm(y~x)
par(mfrow=c(2,2))
print(summary(lm.fit))
plot(lm.fit)
XYcoef <- coef(lm.fit)
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-1.8446 -0.6005 -0.2404  0.8248  2.0626

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.3486     0.7065   0.493    0.635
x            -0.1064     0.1323  -0.804    0.445

Residual standard error: 1.202 on 8 degrees of freedom
Multiple R-squared: 0.07474, Adjusted R-squared: -0.04091
F-statistic: 0.6463 on 1 and 8 DF, p-value: 0.4447
XYcoef
[ 0.34858412 -0.10639529]
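The two values in XYcoef are the intercept and slope of the fitted line, in the order R's coef returns them. As a cross-check that does not need R, the same simple linear regression can be sketched in plain Python using the closed-form least-squares estimates. The function name `fit_line` and the toy data are illustrative assumptions, not part of the original notebook; because the notebook's y values are random draws, this sketch will not reproduce the numbers shown.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line y = a + b*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sums of cross-products and squares about the means
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Toy data lying exactly on y = 2 + 3x, so the fit should recover those values.
xs = list(range(10))
ys = [2.0 + 3.0 * x for x in xs]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```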
Before running the next cell, make sure you have started your cluster; you can use the Clusters tab in the dashboard to do so.
from IPython.parallel import Client
client = Client()
dv = client.direct_view()
len(dv)
def where_am_i():
    import os
    import socket
    return "In process with pid {0} on host: '{1}'".format(
        os.getpid(), socket.gethostname())
where_am_i_direct_results = dv.apply(where_am_i)
where_am_i_direct_results.get()
["In process with pid 79873 on host: 'Cepeda.local'",
 "In process with pid 79874 on host: 'Cepeda.local'",
 "In process with pid 79875 on host: 'Cepeda.local'"]
IPython Notebook Viewer: Displays static HTML versions of notebooks, and includes a gallery of notebook examples.
NotebookCloud: A service that allows you to launch and control IPython Notebook servers on Amazon EC2 from your browser.
A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data: A landmark example of reproducible research in genomics, with Git repo, IPython notebook, data and scripts.