This notebook demonstrates how to use Sumatra to capture simulation input data and metadata and then export these records into a Pandas data frame. Sumatra ships with a stand-alone web interface built with Django that allows users to view the data. The data can also be imported into Python, but manipulating and displaying it in useful custom formats requires a lot of code. Pandas is an ideal tool for this: in particular, the ability to quickly combine input data, metadata, and output data into custom data frames is powerful for data analysis, reproducibility, and sharing.
The first step in using Sumatra is to set up a simulation. Here the simulation runs a diffusion problem with FiPy and records the time taken for a time step. The goal of the work is to test FiPy's parallel speedup for different input parameters.
%matplotlib inline
%load_ext autoreload
%autoreload 2
import numpy as np
import matplotlib.pyplot as plt
Sumatra requires a file specifying the parameters.
import json
params = {'N' : 10, 'suite' : 'trilinos', 'iterations' : 100}
with open('params.json', 'w') as fp:
    json.dump(params, fp)
The script file for running the simulation is fipy_timing.py. It reads the JSON file, runs the simulation, and stores the run times in data.txt.
%%writefile fipy_timing.py
"""
Usage: fipy_timing.py [<jsonfile>]
"""
from docopt import docopt
import json
import timeit
import numpy as np
import fipy as fp
import os
arguments = docopt(__doc__, version='Run FiPy timing')
jsonfile = arguments['<jsonfile>']
if jsonfile:
    with open(jsonfile, 'rb') as ff:
        params = json.load(ff)
else:
    params = dict()
N = params.get('N', 10)
iterations = params.get('iterations', 100)
suite = params.get('suite', 'trilinos')
sumatra_label = params.get('sumatra_label', '')
attempts = 3
setup_str = '''
import fipy as fp
import numpy as np
np.random.seed(1)
L = 1.
N = {N:d}
m = fp.GmshGrid3D(nx=N, ny=N, nz=N, dx=L / N, dy=L / N, dz=L / N)
v0 = np.random.random(m.numberOfCells)
v = fp.CellVariable(mesh=m)
v0 = np.resize(v0, len(v)) ## Gmsh doesn't always give us the correct sized grid!
eqn = fp.TransientTerm(1e-3) == fp.DiffusionTerm()
v[:] = v0.copy()
import fipy.solvers.{suite} as solvers
solver = solvers.linearPCGSolver.LinearPCGSolver(precon=None, iterations={iterations}, tolerance=1e-100)
eqn.solve(v, dt=1., solver=solver)
v[:] = v0.copy()
'''
timeit_str = '''
eqn.solve(v, dt=1., solver=solver)
fp.parallelComm.Barrier()
'''
timer = timeit.Timer(timeit_str, setup=setup_str.format(N=N, suite=suite, iterations=iterations))
times = timer.repeat(attempts, 1)
if fp.parallelComm.procID == 0:
    filepath = os.path.join('Data', sumatra_label)
    filename = 'data.txt'
    np.savetxt(os.path.join(filepath, filename), times)
Overwriting fipy_timing.py
Without using Sumatra and in serial this is run with
!python fipy_timing.py params.json
and the output data file is
!more Data/data.txt
1.253199577331542969e-02 1.225900650024414062e-02 1.175403594970703125e-02
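The three values are the separate timeit attempts. Later in the notebook the minimum of the attempts is taken as the run time, since it is the least-noisy estimate. A minimal sketch of that reduction, with the values above hard-coded rather than read from Data/data.txt:

```python
import numpy as np

# The three timeit attempts from data.txt (seconds per time step).
times = np.array([1.2532e-02, 1.2259e-02, 1.1754e-02])

# The minimum attempt is the least-noisy estimate of the true run time.
run_time = times.min()
```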
In this demo, I'm assuming that the working directory is a Git repository set up with
$ git init
$ git add fipy_timing.py
$ git commit -m "Add timing script."
Sumatra requires that the script is sitting in a working copy of a repository.
!git log -1
commit 6a830dac2ea45ea090ec91a4a0f5263be10e95f3
Author: Daniel Wheeler <daniel.wheeler2@gmail.com>
Date: Wed Feb 26 13:50:21 2014 -0500
Fix README.
Once the repository is set up, the Sumatra project can be configured. Here we are using the distributed launch mode, as we want Sumatra to launch and record parallel jobs.
%%bash
\rm -rf .smt
smt init smt-demo
smt configure --executable=python --main=fipy_timing.py
smt configure --launch_mode=distributed
smt configure -g uuid
smt configure -c store-diff
smt configure --addlabel=parameters
Sumatra project successfully set up
Multiple versions found, using /home/wd15/anaconda/bin/python. If you wish to use a different version, please specify it explicitly
Multiple versions found, using /home/wd15/anaconda/bin/mpirun. If you wish to use a different version, please specify it explicitly
Sumatra requires that a Data/ directory exists in the working copy.
!mkdir Data
If we were not using Sumatra, we would launch the job with
$ mpirun -n 2 python fipy_timing.py params.json
The equivalent command using Sumatra is
$ smt run -n 2 params.json
In the following cell we just run a batch of simulations with varying parameters.
import itertools
nprocs = (1, 2, 4, 8)
iterations_ = (100,)
Ns = (10, 40)
suites = ('trilinos',)
tag = 'demo4'
for nproc, iterations, N, suite in itertools.product(nprocs, iterations_, Ns, suites):
    !smt run --tag=$tag -n $nproc params.json N=$N iterations=$iterations suite=$suite
/home/wd15/anaconda/lib/python2.7/site-packages/matplotlib/__init__.py:740: UserWarning: Found matplotlib configuration in ~/.matplotlib/. To conform with the XDG base directory standard, this configuration location has been deprecated on Linux, and the new location is now '/home/wd15/.config'/matplotlib/. Please move your configuration there to ensure that matplotlib will continue to find it in the future.
  _get_xdg_config_dir())
/home/wd15/hg/sumatra/sumatra/launch.py:263: UserWarning: mpi4py is not available, so Sumatra is not able to obtain platform information for remote nodes.
  warnings.warn("mpi4py is not available, so Sumatra is not able to obtain platform information for remote nodes.")
Data keys are [data.txt(5a633ee751a2043deb828d28d4daefc0372c5b63)]
Data keys are [data.txt(f66b8ea1a596f8f06321d389fc412e4809d50697)]
Data keys are [data.txt(1d0effa7003d8e75bf998cce5a4f71a72dff7025)]
Data keys are [data.txt(38d19ef93141666ffbabcb30c700028a87631ced)]
Data keys are [data.txt(3fa95ba72c140b64dc07d4d2a7a9d0af38da06c8)]
Data keys are [data.txt(2c702e292ec4da225cdfc414d29f80e9df2ccfd7)]
Data keys are [data.txt(c63d4c44770cdc4329b13dc863762c8b5069dd2e)]
Data keys are [data.txt(c4e026e885b64122b4b578c334ef9f078eac3df1)]
The important part of this story is how to import the data into a Pandas data frame. This is trivial, as Sumatra's default export format is a JSON file containing all the records.
import json
import pandas
!smt export
with open('.smt/records_export.json') as ff:
    data = json.load(ff)
df = pandas.DataFrame(data)
The Sumatra data is now in a Pandas data frame, albeit a touch raw.
print df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18 entries, 0 to 17
Data columns (total 23 columns):
datastore           18  non-null values
dependencies        18  non-null values
diff                18  non-null values
duration            18  non-null values
executable          18  non-null values
input_data          18  non-null values
input_datastore     18  non-null values
label               18  non-null values
launch_mode         18  non-null values
main_file           18  non-null values
outcome             18  non-null values
output_data         18  non-null values
parameters          18  non-null values
platforms           18  non-null values
reason              18  non-null values
repeats             0   non-null values
repository          18  non-null values
script_arguments    18  non-null values
stdout_stderr       18  non-null values
tags                18  non-null values
timestamp           18  non-null values
user                18  non-null values
version             18  non-null values
dtypes: float64(1), object(22)
print df[['label', 'duration']]
           label   duration
0   ac32b9fc6df4  32.574247
1   f372719a0648   6.860735
2   9695c2529109  24.854690
3   bf710e5339ff   3.592334
4   179099003946  28.189878
5   eeffa50a08bd   3.995513
6   180b9c889f94  34.474531
7   0732f6d89fc4   4.214891
8   f2073ab41bd7  34.305814
9   27bb809fa8ad  24.290209
10  179247440765  28.439126
11  0731f5a8e231  32.452093
12  0330697ac505   2.569647
13  a04a49a2107b   1.873670
14  6b3d5ac075a6   1.283991
15  1b124fc57ced   5.125001
16  6b04488b14ed   5.207692
17  5cc0546270c9   5.027596
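One small cleanup worth doing on the raw frame: the timestamp column comes out of the JSON export as strings, and parsing it makes chronological sorting easy. A sketch with stand-in values (the exact string format here is an assumption, and sort_values is the modern pandas spelling):

```python
import pandas as pd

# Stand-in timestamps; Sumatra exports these as strings in the JSON records.
demo = pd.DataFrame({'timestamp': ['2014-02-26 14:02:03',
                                   '2014-02-26 13:50:21']})

# Parse to real datetimes, then sort chronologically.
demo['timestamp'] = pd.to_datetime(demo['timestamp'])
demo = demo.sort_values('timestamp')
```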
While all the metadata is important, we often want just the input and output data combined into a digestible data frame. Typically, we want a graph of reduced input versus reduced output.
The first step is to introduce a column in the data frame for each of the input parameters (input data). The input data is buried in the launch_mode and parameters columns of the raw data frame.
import json
df = df.copy()
df['nproc'] = df.launch_mode.map(lambda x: x['parameters']['n'])
for p in 'N', 'iterations', 'suite':
    df[p] = df.parameters.map(lambda x: json.loads(x['content'])[p])
We now have the input data exposed as columns in the data frame.
columns = ['label', 'nproc', 'N', 'iterations', 'suite', 'tags']
print df[columns].sort('nproc')
           label  nproc   N  iterations     suite     tags
15  1b124fc57ced      1  10         100  trilinos  [demo2]
6   180b9c889f94      1  40         100  trilinos  [demo4]
7   0732f6d89fc4      1  10         100  trilinos  [demo4]
11  0731f5a8e231      1  40         100  trilinos  [demo3]
17  5cc0546270c9      2  10         100  trilinos       []
4   179099003946      2  40         100  trilinos  [demo4]
5   eeffa50a08bd      2  10         100  trilinos  [demo4]
16  6b04488b14ed      2  10         100  trilinos   [test]
10  179247440765      2  40         100  trilinos  [demo3]
14  6b3d5ac075a6      2  10         100  trilinos  [demo2]
2   9695c2529109      4  40         100  trilinos  [demo4]
3   bf710e5339ff      4  10         100  trilinos  [demo4]
9   27bb809fa8ad      4  40         100  trilinos  [demo3]
13  a04a49a2107b      4  10         100  trilinos  [demo2]
0   ac32b9fc6df4      8  40         100  trilinos  [demo4]
1   f372719a0648      8  10         100  trilinos  [demo4]
12  0330697ac505      8  10         100  trilinos  [demo2]
8   f2073ab41bd7      8  40         100  trilinos  [demo3]
The following pulls the run times stored in the output file of each simulation into a run_time column.
import os
datafiles = df['output_data'].map(lambda x: x[0]['path'])
datapaths = df['datastore'].map(lambda x: x['parameters']['root'])
data = [np.loadtxt(os.path.join(x, y)) for x, y in zip(datapaths, datafiles)]
df['run_time'] = [min(d) for d in data]
columns.append('run_time')
print df[columns].sort('nproc')
           label  nproc   N  iterations     suite     tags  run_time
15  1b124fc57ced      1  10         100  trilinos  [demo2]  0.012017
6   180b9c889f94      1  40         100  trilinos  [demo4]  0.419316
7   0732f6d89fc4      1  10         100  trilinos  [demo4]  0.012037
11  0731f5a8e231      1  40         100  trilinos  [demo3]  0.402522
17  5cc0546270c9      2  10         100  trilinos       []  0.011014
4   179099003946      2  40         100  trilinos  [demo4]  0.252318
5   eeffa50a08bd      2  10         100  trilinos  [demo4]  0.011214
16  6b04488b14ed      2  10         100  trilinos   [test]  0.010802
10  179247440765      2  40         100  trilinos  [demo3]  0.253387
14  6b3d5ac075a6      2  10         100  trilinos  [demo2]  0.011340
2   9695c2529109      4  40         100  trilinos  [demo4]  0.173505
3   bf710e5339ff      4  10         100  trilinos  [demo4]  0.010188
9   27bb809fa8ad      4  40         100  trilinos  [demo3]  0.179195
13  a04a49a2107b      4  10         100  trilinos  [demo2]  0.010196
0   ac32b9fc6df4      8  40         100  trilinos  [demo4]  0.178471
1   f372719a0648      8  10         100  trilinos  [demo4]  0.016224
12  0330697ac505      8  10         100  trilinos  [demo2]  0.016702
8   f2073ab41bd7      8  40         100  trilinos  [demo3]  0.184142
Create a mask for the simulation records tagged with demo4, then split those records by system size N. We want to plot the two system sizes as different curves on the same graph.
tag_mask = df.tags.map(lambda x: 'demo4' in x)
df_tmp = df[tag_mask]
m10 = df_tmp.N.map(lambda x: x == 10)
m40 = df_tmp.N.map(lambda x: x == 40)
df_N10 = df_tmp[m10]
df_N40 = df_tmp[m40]
print df_N10[columns].sort('nproc')
print df_N40[columns].sort('nproc')
          label  nproc   N  iterations     suite     tags  run_time
7  0732f6d89fc4      1  10         100  trilinos  [demo4]  0.012037
5  eeffa50a08bd      2  10         100  trilinos  [demo4]  0.011214
3  bf710e5339ff      4  10         100  trilinos  [demo4]  0.010188
1  f372719a0648      8  10         100  trilinos  [demo4]  0.016224
          label  nproc   N  iterations     suite     tags  run_time
6  180b9c889f94      1  40         100  trilinos  [demo4]  0.419316
4  179099003946      2  40         100  trilinos  [demo4]  0.252318
2  9695c2529109      4  40         100  trilinos  [demo4]  0.173505
0  ac32b9fc6df4      8  40         100  trilinos  [demo4]  0.178471
We can now plot the results we're interested in: the larger system size gives better parallel speedup.
ax = df_N10.plot('nproc', 'run_time', label='N={0}'.format(df_N10.N.iat[0]))
df_N40.plot('nproc', 'run_time', ylim=0, ax=ax, label='N={0}'.format(df_N40.N.iat[0]))
plt.ylabel('Run Time (s)')
plt.xlabel('Number of Processes')
plt.legend()
<matplotlib.legend.Legend at 0x46e9390>
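Run times can also be reduced to an explicit parallel speedup relative to the single-process run. A minimal sketch using the N=40 numbers from the table above (df_N40_demo is a hypothetical stand-in for df_N40, built by hand so the example is self-contained):

```python
import pandas as pd

# Hand-built stand-in for df_N40, with run times copied from the table above.
df_N40_demo = pd.DataFrame({
    'nproc': [1, 2, 4, 8],
    'run_time': [0.419316, 0.252318, 0.173505, 0.178471],
})

# Speedup is the serial run time divided by the parallel run time.
serial = df_N40_demo.loc[df_N40_demo.nproc == 1, 'run_time'].iloc[0]
df_N40_demo['speedup'] = serial / df_N40_demo['run_time']
```

Plotting speedup against nproc (with an ideal y = x line for reference) makes the scaling falloff at 8 processes easier to see than raw run times.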
Using Pandas, it is easy to store a custom data frame.
df.to_hdf('store.h5', 'df')
/home/wd15/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py:1992: PerformanceWarning: your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types [inferred_type->mixed,key->axis0] [items->None]
  warnings.warn(ws, PerformanceWarning)
/home/wd15/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py:1992: PerformanceWarning: your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types [inferred_type->mixed,key->block0_items] [items->None]
  warnings.warn(ws, PerformanceWarning)
/home/wd15/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py:1992: PerformanceWarning: your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types [inferred_type->mixed,key->block2_values] [items->[u'datastore', u'dependencies', u'diff', u'executable', u'input_data', u'input_datastore', u'label', u'launch_mode', u'main_file', u'outcome', u'output_data', u'parameters', u'platforms', u'reason', u'repeats', u'repository', u'script_arguments', u'stdout_stderr', u'tags', u'timestamp', u'user', u'version', 'suite']]
  warnings.warn(ws, PerformanceWarning)
/home/wd15/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py:1992: PerformanceWarning: your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types [inferred_type->mixed,key->block2_items] [items->None]
  warnings.warn(ws, PerformanceWarning)
store = pandas.HDFStore('store.h5')
print store.df.dependencies
0     [{u'name': u'IPython', u'module': u'python', u...
1     [{u'name': u'IPython', u'module': u'python', u...
2     [{u'name': u'IPython', u'module': u'python', u...
3     [{u'name': u'IPython', u'module': u'python', u...
4     [{u'name': u'IPython', u'module': u'python', u...
5     [{u'name': u'IPython', u'module': u'python', u...
6     [{u'name': u'IPython', u'module': u'python', u...
7     [{u'name': u'IPython', u'module': u'python', u...
8     [{u'name': u'IPython', u'module': u'python', u...
9     [{u'name': u'IPython', u'module': u'python', u...
10    [{u'name': u'IPython', u'module': u'python', u...
11    [{u'name': u'IPython', u'module': u'python', u...
12    [{u'name': u'IPython', u'module': u'python', u...
13    [{u'name': u'IPython', u'module': u'python', u...
14    [{u'name': u'IPython', u'module': u'python', u...
15    [{u'name': u'IPython', u'module': u'python', u...
16    [{u'name': u'IPython', u'module': u'python', u...
17    [{u'name': u'IPython', u'module': u'python', u...
Name: dependencies, dtype: object
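The PerformanceWarnings above come from PyTables having to pickle the object-valued columns (lists and dicts such as dependencies). One way to avoid them, sketched below with a hypothetical one-row stand-in frame, is to write only the scalar columns to a portable format such as CSV:

```python
import pandas as pd

# Stand-in for the frame built above, restricted to scalar columns so
# nothing needs to be pickled.
flat = pd.DataFrame({'label': ['0732f6d89fc4'], 'nproc': [1], 'N': [10],
                     'iterations': [100], 'suite': ['trilinos'],
                     'run_time': [0.012037]})
flat.to_csv('store_flat.csv', index=False)

# The round trip preserves the scalar data.
back = pd.read_csv('store_flat.csv')
```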
Sumatra stores its records in an SQL-style database, which isn't ideal for pulling data into Python for manipulation. Pandas is well suited to that job, and getting the records out of Sumatra and into Pandas is very easy.