Seite9

Anaconda

I’ve been using Anaconda as my Python distribution and package manager for about two months. In this article I’ll share some general thoughts about and my experience with it. In particular, I’ll go into how I built my own Python extension, which uses Boost.Python, against Anaconda, which was not as straightforward as I had wished. I will also put together a conda package for my extension. Most of this article concentrates on OS X, as this is my main system. I also work on Ubuntu where, in my experience, most of the time things just work, whereas OS X requires tinkering.

About Anaconda

Anaconda was originally a Python distribution for scientific computing, but has grown into a proper package management system for Python. It comes with the conda package manager, has a central repository (binstar.org) and, as its distinguishing features, supports binary packages and sophisticated virtual environments. Anaconda is developed by Continuum Analytics, a three-year-old Austin-based company focusing on products and services around number crunching and data-heavy work with Python. Anaconda’s goal is, as I understand it, to provide a reliable cross-platform software stack for Python, including “binary-heavy” modules like Numpy which are presently not well-supported by other Python package managers. Thus Anaconda should make it easier to deploy a numerical Python application to a server, to work with different Python and Numpy versions at the same time, or to simply and quickly share code with colleagues. It works on Linux, OS X and Windows and typically gets installed into the user’s home directory. To be honest, I am not using most of its features. I use Anaconda to have a consistent Python environment across my different computers, which include Ubuntu systems and Macs.

In the web context, pip and virtualenv together with a requirements.txt seem to be the way to go, but as far as I know this approach does not accommodate binary packages adequately. Wheels are emerging as the new package format standard with a number of advantages over the current egg format, including better support for binary packages. However, I get the impression that wheels still have a number of problems in that area and are maybe just not quite there yet. Armin Ronacher wrote a nice overview and introduction to wheels, including a brief overview of Python packaging history. Travis Oliphant from Continuum Analytics has also written an article on the subject, and goes into some detail comparing conda and wheels, obviously slightly biased in favour of Anaconda.

I am not sure how Anaconda fits in with package managers from the OS, e.g. apt on Ubuntu. Pure Python package managers like pip sit on top of the software stack provided by the operating system. To know the full stack you would specify the OS, e.g. Ubuntu 12.04 LTS, your installed packages from apt and then your requirements.txt. However, with Anaconda providing generic non-Python-specific cross-platform binary packages, it looks like it is replicating all the hard work and solving all the same difficult problems that platform-specific package managers like apt already solve, just on multiple platforms, which makes things even harder. Of course, this point is moot on OS X or Windows, where a standard binary package manager does not exist.

Building and packaging a Python extension with Anaconda

My current main project is written in C++ and uses Boost.Python to export functionality from C++ as a Python extension. I also have a small Python wrapper module around this extension. My code depends on Eigen and Boost and uses CMake to build the C++ part and distutils to install the Python part. To build and install I would thus first invoke cmake and then setup.py. Frankly, just getting my Python extension with its Boost dependency correctly built and linked on different systems with different Python versions and different compilers has cost me much more time than I am readily willing to admit. I use MacPorts plus Anaconda on OS X and just Anaconda on my Ubuntu systems. I use IPython.parallel to run my code on different computers and multiple cores (typically only two computers, 20 cores in total). I use the IPython notebook for prototyping, quick runs, analysis and brief write-ups. I had random stability problems with the MacPorts Python stack and IPython.parallel, and that’s why I started using Anaconda in the first place.

Generally, most Python extensions do not use Boost.Python and do not link against the Python library; in that case it is much more straightforward to get the extension to build and work correctly. In my case, however, it was not that easy, and here are three specific lessons I learnt while getting everything to work properly with Anaconda.

1. Boost

Unrelated to Anaconda, Boost needs to be compiled with the same compiler used to compile the Python extension. I use GCC 4.8 and hence with MacPorts the correct version of Boost can be installed with

$ sudo port install boost +gcc48

On my OS X system I have at least three different versions of Python installed, the Python coming with OS X, the MacPorts version and the Anaconda version. Care needs to be taken that Boost.Python is linked against the correct Python version. Short of recompiling Boost, a quick fix is to use install_name_tool. For example:

$ sudo install_name_tool -change \
    /opt/local/Library/Frameworks/Python.framework/Versions/2.7/Python \
    ~/anaconda/lib/libpython2.7.dylib \
    /opt/local/lib/libboost_python-mt.dylib

The above command changes the linked Python library from MacPorts (installed in /opt/local) to Anaconda for the Boost.Python library installed via MacPorts. This quick-fix works for me, but probably requires that the different Python libraries are ABI compatible.

2. Linking

On OS X, Anaconda uses relative paths for its Python library’s install name and relative paths for linking, which seems to be somewhat uncommon on that platform. Anaconda uses this approach to enable its powerful virtual environments. The downside is that if you just compile and link your Python extension against Anaconda’s Python, it won’t work. At runtime the wrong Python library gets linked in, resulting in obscure errors like __init__() should return None, not 'NoneType'. The recommended way to fix this is to run install_name_tool after building your extension, similar to what I did with the Boost library above, and change the path of the linked Python library to the correct relative path—relative to where your Python extension my_extension.so ends up being installed. I prefer to set an absolute path, so I don’t have to invoke install_name_tool every time I move my_extension.so to a different location. So my CMakeLists.txt now calls something like this on OS X:

INSTALL_NAME=`otool -D ${PYTHON_LIBRARIES} | tail -n1 | tr -d '\n'`
install_name_tool -change ${INSTALL_NAME} ${PYTHON_LIBRARIES} ${LIB_FILENAME}

where PYTHON_LIBRARIES is the full path to Anaconda’s Python library and LIB_FILENAME points to my_extension.so; INSTALL_NAME would usually be set to libpython2.7.dylib, or similar.
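The same fix can of course also be scripted outside of CMake. Here is a minimal Python sketch of the two commands above; the paths and the extension name are purely illustrative and need to be adjusted to your setup:

import subprocess

# Illustrative paths: Anaconda's Python library and the freshly built extension.
python_lib = '/Users/me/anaconda/lib/libpython2.7.dylib'
extension = 'my_extension.so'

# Query the library's install name (usually libpython2.7.dylib or similar),
# mirroring `otool -D ... | tail -n1`.
install_name = subprocess.check_output(
    ['otool', '-D', python_lib]).decode().splitlines()[-1].strip()

# Re-point the extension at the absolute path of Anaconda's Python library.
subprocess.check_call(
    ['install_name_tool', '-change', install_name, python_lib, extension])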

3. Conda package

It’s actually very easy to build a conda package. Even without redistributing the package, one big advantage is that conda build and conda install automatically set all the paths of linked libraries correctly, so we don’t have to worry about fixing them manually as outlined above. Of course, if you want to run tests in your source directory (I want to do that), then you still need to set the library paths correctly, otherwise the tests fail.

A conda recipe is used to build a conda package and can consist of only two files in a directory, meta.yaml and build.sh. I put them in a subdirectory conda in my project’s main directory. With that I can build and install my conda package:

$ conda build ./conda/
$ conda install --use-local my_extension

The meta.yaml looks something like this:

package:
    name: my_extension
    version: 8.0.0

source:
    #fn: /path/to/my/local/archive/myproject.tar.bz2
    git_url: ssh://git@bitbucket.org/me/myproject.git

requirements:
    build:
        - cmake
        - python
    run:
        - python

...

The build.sh looks something like this:

#!/bin/bash

# Set PYLIB to either .so (Linux) or .dylib (OS X)
PYLIB="PYTHON_LIBRARY_NOT_FOUND"
for i in $PREFIX/lib/libpython${PY_VER}{.so,.dylib}; do
    if [ -f $i ]; then
        PYLIB=$i
    fi
done

mkdir Release
cd Release
cmake \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX=$PREFIX \
    -DCMAKE_INSTALL_RPATH=$LD_RUN_PATH \
    -DPYTHON_INCLUDE_PATH:PATH=$PREFIX/include/python${PY_VER} \
    -DPYTHON_LIBRARY:FILEPATH=$PYLIB \
    ..
make
cd ..

$PYTHON setup.py install

exit 0

Of course, to do things properly and have a full Anaconda software stack, I would also need to build conda packages for all my dependencies, in this case for Eigen—easy, it’s a header-only library—and Boost—which does not sound like fun. This goes back to my earlier remark that, at some point, Anaconda starts replicating the work done by platform-specific package managers. In my case these are MacPorts and the Ubuntu repositories, which I use to install Eigen and Boost.

Conclusion

As the folks at Continuum Analytics keep emphasizing, conda works today and solves practical problems. For me that’s mostly providing a stable and consistent Python stack across different computers and platforms. It works well, is easy to use and comes with all the numerical packages I need. Installing other missing, non-conda packages on top of that is accomplished with pip. I built my own Python extension against Anaconda, which on OS X was not as straightforward as I had hoped. I should emphasize that for most Python extensions things should be much easier; it is using Boost.Python and linking against the Python library that made life more complicated for me. On the bright side, I learnt a lot about linking on OS X. I created a conda package which I can distribute to my different computers. I could also give this binary package to my coworkers and just tell them to install the Anaconda distribution, and thus get them up and running really quickly.

For comments and questions, please write me an email.

Arpack, Eigen and Numerical Recipes

My coworker Chris and I ran some tests last week, comparing the performance of Arpack, Eigen and Numerical Recipes (C++ version, 3rd edition) for calculating the eigenvalues of sparse real symmetric matrices. While these tests are far from comprehensive and the graphs and text are not very polished, I think it’s still worth sharing our results. This article is a slightly edited version of the original write-up (pdf, nbviewer). The code is on GitHub.

Numerical Recipes and Eigen are dense eigensolvers and always compute the full eigenvalue spectrum. In contrast, Arpack uses sparse matrices and is typically used to calculate only a few eigenvalues and not the full spectrum. It seems to be common knowledge that Numerical Recipes is not very fast. We were still surprised by how slow it really is. Conversely, Arpack gives good performance even when the matrices are not that sparse or when a substantial part (say 10%) of the eigenvalues is calculated.
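As an aside for Python users, the difference in usage between a dense solver and an Arpack-based sparse solver can be sketched with NumPy and SciPy (this is only an illustration; our benchmarks call the C++ libraries directly):

import numpy as np
import scipy.sparse
import scipy.sparse.linalg

n = 1000
# Random sparse real symmetric matrix, roughly 98% zeros.
a = scipy.sparse.rand(n, n, density=0.02, format='csr')
s = (a + a.T) * 0.5

# Dense solver: always computes the full spectrum.
w_dense = np.linalg.eigh(s.toarray())[0]

# Sparse solver (scipy.sparse.linalg.eigsh wraps Arpack): only the
# 10 smallest eigenvalues.
w_sparse = scipy.sparse.linalg.eigsh(s, k=10, which='SA',
                                     return_eigenvectors=False)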

Chris and I both ran similar, but slightly different, test programs. Chris uses the Rice University Arpack and the Arpack++ wrapper, whereas I use Arpack-NG and the Arpaca wrapper.

In [1]:
def create_data_files():
    i_d = 0
    for r_ev in [0.01,0.1,1.0]:
        for r_zeros in [0.1,0.9,0.98]:
            i_d += 1
            fn = 'performance_nr.{}.dat'.format(i_d)
            !./build/arpaca_performance_plot_nr 100 1000 \
                40 10 $r_ev $r_zeros > $fn
In [2]:
# Takes a very long time
# create_data_files()
In [3]:
import re
import numpy as np

def read_data_file(filename):
    f = open(filename)
    s = f.read()
    f.close()
    n_rep = None
    r_ev = None
    r_zeros = None
    d = None
    m = re.search(r'n_rep=(.+)$',s,re.MULTILINE)
    if m: 
        n_rep = int(m.group(1))
    m = re.search(r'r_ev=(.+)$',s,re.MULTILINE)
    if m: 
        r_ev = float(m.group(1))
    m = re.search(r'r_zeros=(.+)$',s,re.MULTILINE)
    if m: 
        r_zeros = float(m.group(1))
    d = np.loadtxt(filename)
    return (n_rep,r_ev,r_zeros,d)

def subplot_data_file(p, filename):
    n_rep,r_ev,r_zeros,d = read_data_file(filename)
    p.plot(d[:,0],d[:,1],label='ARPACK')
    p.plot(d[:,0],d[:,2],label='Eigen')
    p.plot(d[:,0],d[:,3],label='NR')
    p.set_xlabel('matrix dimension')
    p.set_ylabel('time (s)')
    p.legend(loc='upper left')
    p.text(0.98,0.98,
           '$r_{{ev}} = {}$\n$r_{{zeros}} = {}$'.format(r_ev,r_zeros),
           fontsize='12',
           verticalalignment='top',
           horizontalalignment='right',
           transform=p.transAxes)
In [4]:
import matplotlib.pyplot as plt
%matplotlib inline

rows = 1
cols = 2

fig = plt.figure(figsize=(cols*5,rows*5))
i_p = 0

for i in [6,9]:
    i_p += 1
    p = fig.add_subplot(rows, cols, i_p)
    subplot_data_file(p, 'performance_nr.{}.dat'.format(i))

The graph shows the time the eigensolvers take to diagonalize a random matrix of a given size. Here, \(r_{ev}\) is the “ratio” of eigenvalues; hence \(r_{ev} = 0.1\) means calculating 10% of the eigenvalues (for Arpack only). Eigen and Numerical Recipes always calculate all eigenvalues. \(r_{zeros}\) is the “ratio” of zeros; hence \(r_{zeros} = 0.98\) means 98% sparse. This test was run on OS X 10.9.1; the compiler was GNU G++ 4.8.2 and Arpack-NG 3.1.4 was used, both installed from MacPorts. Arpaca was used as the Arpack wrapper.

In [5]:
def subplot_chris(p,xlim=None,ylim=None):
    # columns are: N  NRg++4.1.2 Arpack4.1.2 Eigen4.1.2 NR4.8.2 Eigen4.8.2
    filename = 'performance_chris.dat'
    d = np.loadtxt(filename)
    p.plot(d[:,0],d[:,2],label='ARPACK G++ 4.1')
    p.plot(d[:,0],d[:,3],label='Eigen G++ 4.1')
    p.plot(d[:,0],d[:,5],label='Eigen G++ 4.8')
    p.plot(d[:,0],d[:,1],label='NR G++ 4.1')
    p.plot(d[:,0],d[:,4],label='NR G++ 4.8')
    p.set_xlabel('matrix dimension')
    p.set_ylabel('time (s)')
    p.legend(loc='upper left')
    if xlim: p.set_xlim(xlim)
    if ylim: p.set_ylim(ylim)

rows = 1
cols = 2

fig = plt.figure(figsize=(cols*5,rows*5))
i_p = 0

i_p += 1
p = fig.add_subplot(rows, cols, i_p)
subplot_chris(p,(0,1000),(0,25))

i_p += 1
p = fig.add_subplot(rows, cols, i_p)
subplot_chris(p)

A very similar test, this time run on WestGrid’s Jasper (Linux). It uses the latest Arpack from Rice University and Arpack++ as the wrapper. The matrix is the Hamiltonian from a real physical problem, with about 1% to 2% non-zeros. Numerical Recipes and Eigen are compiled with two different versions of the GNU G++ compiler, as indicated. The full spectrum of eigenvalues is computed. This scenario corresponds to the \(r_{ev} = 1.0\), \(r_{zeros} = 0.98\) (right-most) plot in the previous graph.

While the general trends agree between this and the last graph, there are some notable differences: Arpack is much slower here than above. Conversely, Numerical Recipes is faster in this test than in the last.

It should be noted that Arpack was used in a black-box fashion (via the wrappers) and there might be considerable room for improving its performance, for example by compromising on precision (we use machine precision) or just by using it in a cleverer way. To compute the full spectrum we actually run Arpack twice, once to get all but one eigenvalue, starting from the smallest, and once to compute the missing largest eigenvalue. (Arpack does not seem to allow computing the whole eigenvalue spectrum directly, at least not in the simple, direct manner we are using via the wrappers.)
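For the Python-inclined, here is a rough sketch of the same two-call idea with scipy.sparse.linalg.eigsh, which wraps Arpack and has the same restriction (our tests go through the C++ wrappers, not through SciPy):

import numpy as np
import scipy.sparse.linalg

def full_spectrum(s):
    """All eigenvalues of a sparse symmetric matrix via two Arpack runs."""
    n = s.shape[0]
    # First run: all but one eigenvalue, starting from the smallest.
    lower = scipy.sparse.linalg.eigsh(s, k=n - 1, which='SA',
                                      return_eigenvectors=False)
    # Second run: the missing largest eigenvalue.
    largest = scipy.sparse.linalg.eigsh(s, k=1, which='LA',
                                        return_eigenvectors=False)
    return np.sort(np.concatenate([lower, largest]))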

Summary:

  1. Numerical Recipes is extremely slow.
  2. Newer compilers produce significantly faster code than older ones, especially for sophisticated C++ libraries like Eigen.
  3. To calculate only a couple of eigenvalues, use Arpack; it gets increasingly faster the sparser the matrix. No surprises here.
  4. To calculate all eigenvalues, use Eigen.

True Git

The Graduate Physics Student Association here at the University of Alberta has been organizing a Graduate Student Seminar Series for a couple of weeks now, and I would say it’s one of the more successful and well-attended GPSA initiatives, with a turnout of about 30 to 40 people every Thursday. (Note that I am a member of the GPSA council and hence likely not the most objective observer.)

Earlier today I had the pleasure of giving a presentation at this seminar. I opted for an introductory talk on Git and version control (slides, pdf). About a third of the audience already knew how to use Git and was probably bored to death. I think I lost half of the audience when I pulled up a command line prompt. That leaves one sixth who really enjoyed the talk, which is not a bad result. Two people actually stayed for my brief hands-on session after the talk, so I was quite happy.

This is the second presentation I prepared with reveal.js and while it’s far from perfect, it’s nice to get simple presentations done quickly (e.g. in half a night). I am not very happy with some of the default formatting in reveal.js, but this is something that could probably be fixed easily if I only took the time to tinker with the CSS theme. More importantly, reveal.js is just not good for text-light, graphics- and maths-rich presentations—and in my opinion presentations should contain as little text as possible. After all, the audience is supposed to listen to you talking and not read lengthy texts on the slides. For one of these more graphics-heavy presentations I recently used Inkscape and its built-in presentation plugin. Again, this approach is far from perfect. Using the plugin feels a bit clumsy, and slides with lots of text (e.g. code with syntax highlighting) are probably not something it excels at. Still, overall I think I prefer to craft my presentations graphically with Inkscape, and in the end I was really happy with the results. However, there’s no denying that it takes much, much more time.

Finally, the obvious question might be why I am not using one of the “standard programs”, like PowerPoint or Keynote. The former does not run on any of my computers, the latter only on my iMac, but not on my Ubuntu laptop. Still, I might be inclined to try it. I ditched Open/Libre Office long ago, after it made half of my slides disappear half an hour before my presentation in an impressive magic-trick tour de force. I’ve used Latex Beamer for a couple of years, but its blue-yellow charm wears off rather quickly.

Everything not physics in computational physics

I gave a brief presentation at our group meeting last week. Besides a short introduction to Git, I talked about some of the things that we (or at least I) tend to spend a lot of time on, but that are not directly related to physics. Hence the title.

While the original presentation (pdf) is available, I’ve also included the presentation’s content below, only slightly edited to better fit the blog format.

I wrote the presentation in Markdown, and the slides are courtesy of the amazing reveal.js. In the past I’ve used Latex Beamer for talks. reveal.js is very simple—I got everything working in less than half an hour—and at least for this presentation it worked sufficiently well. Next I need to find out how best to integrate equations; then I could also use reveal.js for my research-focused talks.

Contents

  1. Git in 13½ minutes
    1. Intro to Version Control Systems
    2. Hands-on Git
    3. Hands-on Bitbucket
  2. Everything not physics in computational physics
    1. Common workflow
    2. Provenance
    3. Data management
    4. Thoughts on a Python-centric workflow

Before we start

  • Computational physics: develop software
  • Not professionals in software development
  • Spend little time thinking about software development
  • Collaboration on code level is rare
  • Established best practices and proven concepts: rarely adopted
  • Examples
    • Revision control
    • Unit tests and test-driven development

Version Control Systems

  • Every notable software project uses VCS
  • Manage evolving code or documents (e.g. Latex)
  • Indispensable when working together in a team
  • Keep track of changes: “revisions”

Revision control

Image from: Pro Git book

Version Control Systems

  • Examples of VCS
    • Google Docs
    • Subversion
    • Distributed: Git, Mercurial
  • Git
    • Originally developed for the Linux kernel
    • Made hugely popular by Github (https://github.com/)
  • Version control can be complex and complicated
  • Common use cases: dead simple, no reason not to use it

Hands-on: Git 1

$ mkdir my_cool_project
$ cd my_cool_project/
$ git init
Initialized empty Git repository in /Users/burkhard/my_cool_project/.git/

Hands-on: Git 2

$ vim funky_program.cpp
$ git status 
# On branch master
#
# Initial commit
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#       funky_program.cpp
nothing added to commit but untracked files present (use "git add" to track)
$ git add funky_program.cpp
$ git commit -a -m "Initial commit."
[master (root-commit) 1588d78] Initial commit.
 1 file changed, 4 insertions(+)
 create mode 100644 funky_program.cpp

Hands-on: Git 3

$ vim funky_program.cpp
$ git diff
diff --git a/funky_program.cpp b/funky_program.cpp
index c557e87..32e958e 100644
--- a/funky_program.cpp
+++ b/funky_program.cpp
@@ -1,4 +1,5 @@
 int main ()
 {
+    int whizbiz = 0;
     return 0;
 }
$ git commit -a -m "Implemented the new whizbiz feature."
[master 8859ecf] Implemented the new whizbiz feature.
 1 file changed, 1 insertion(+)

Hands-on: Git 4

$ git log
commit 8859ecf6b0cbcd29407ddbfde3bc0f3ae5c953b2
Author: Burkhard Ritter <burkhard@seite9.de>
Date:   Thu Mar 7 18:36:30 2013 -0700

    Implemented the new whizbiz feature.

commit 1588d78bb6ee615499441a76ab3a8fb6a62241c5
Author: Burkhard Ritter <burkhard@seite9.de>
Date:   Thu Mar 7 18:34:29 2013 -0700

    Initial commit.

Hands-on: Git 5

$ git help
[...verbose help message...]
$ git help tag
[...helpful manpage...]
$ git tag -a -m "Version 1." v1
$ git tag
v1
$ git describe 
v1

$ vim funky_program.cpp
$ git commit -a -m "More exciting features."
[master 7684818] More exciting features.
 1 file changed, 1 insertion(+)
$ git describe 
v1-1-g7684818

Hands-on: Git 6

$ git help diff
$ git diff HEAD^
diff --git a/funky_program.cpp b/funky_program.cpp
index 32e958e..0403944 100644
--- a/funky_program.cpp
+++ b/funky_program.cpp
@@ -1,5 +1,6 @@
 int main ()
 {
+    bool evil_bug = true;
     int whizbiz = 0;
     return 0;
 }

Hands-on: Git 7

  • Immediate advantages
    • Simple
    • Keep track of changes
    • No different copies of code floating around in directories
    • Which code produces which result
    • Keep track of progress
    • Track down bugs

Collaboration

  • Distributed VCS: everybody has full copy of repository
  • Different collaboration workflows possible
  • More details: Pro Git book
  • Simple example: Hagen and Kriemhild work together on an epic poem, the Nibelungenlied (in Latex)

hagen@worms:~/nibelungenlied$ git pull kriemhild
hagen@worms:~/nibelungenlied$ git add chapter4.tex
hagen@worms:~/nibelungenlied$ edit chapter3.tex
hagen@worms:~/nibelungenlied$ git commit -a
hagen@worms:~/nibelungenlied$ git push kriemhild

kriemhild@worms:~/nibelungenlied$ edit chapter3.tex
kriemhild@worms:~/nibelungenlied$ git commit -a
kriemhild@worms:~/nibelungenlied$ git pull hagen
kriemhild@worms:~/nibelungenlied$ git pull ssh://kriemhild@worms/home/hagen/nibelungen

Collaboration

More complex example: central server / repository

Distributed workflow

Image from: Pro Git book

Bitbucket

  • Github
  • Extremely popular
  • Social coding
  • “Revolutionized open source” (Wired article)
  • Focus on individuals, not projects
  • Trend: publish everything, everything is public
    (!= free/libre software)
  • Bitbucket
    • Similar, but not hip
    • Unlimited private repositories, 5 collaborators
    • Unlimited academic plan!
  • Could or should our code be public?

Bitbucket

  • Why use Bitbucket?
    • Even easier to use Git
    • Web GUI
    • Backup
    • Sync between computers
    • Collaborate
      • Direct write access
      • Fork, create pull request
      • Teams

Hands-on: Bitbucket 1

Image: Bitbucket web interface

Hands-on: Bitbucket 2

Image: Bitbucket web interface

Hands-on: Bitbucket 3

Image: Bitbucket web interface

Hands-on: Bitbucket 4

Image: Bitbucket web interface

Hands-on: Bitbucket 5

burkhard@macheath:~/my_cool_project$ git remote add origin ssh://git@bitbucket.org/meznom/my_cool_project.git

burkhard@macheath:~/my_cool_project$ git push -u origin --all
[...]

burkhard@macheath:~/my_cool_project$ git status 
# On branch master
nothing to commit, working directory clean

burkhard@macheath:~/my_cool_project$ git pull
Already up-to-date.

Hands-on: Bitbucket 6

Image: Bitbucket web interface

Hands-on: Bitbucket 7

burkhard@lise:~$ git clone git@bitbucket.org:meznom/my_cool_project.git
[...]
burkhard@lise:~$ cd my_cool_project/
burkhard@lise:~/my_cool_project$ vim README.md
burkhard@lise:~/my_cool_project$ git add README.md
burkhard@lise:~/my_cool_project$ git commit -a -m "Added Readme"
[...]
burkhard@lise:~/my_cool_project$ git push
[...]

burkhard@macheath:~/my_cool_project$ git pull
[...]
From ssh://bitbucket.org/meznom/my_cool_project
   7684818..a4b5daa  master     -> origin/master
Updating 7684818..a4b5daa
[...]
burkhard@macheath:~/my_cool_project$ git log -n 1
commit a4b5daad03272a7598b1d2263f946d7ca34fdfe9
Author: Burkhard Ritter <burkhard@seite9.de>
Date:   Thu Mar 7 22:37:55 2013 -0700

    Added Readme

Hands-on: Bitbucket 8

Image: Bitbucket web interface

Hands-on: Bitbucket 9

Image: Bitbucket web interface

Everything not physics in computational physics

  • In everyday work: activities, procedures, issues not related to physics
  • Similar for all of us
  • Problem and project independent
  • Examples
    • Software management
    • Data management
    • Plotting
    • Publishing
  • Similarities are a result of common workflow

Everything not physics in computational physics

  • We do not often talk about these non-physics issues
  • Goal
    • Exchange ideas
    • Establish best practices
    • Possibly share code
  • This talk
    • Open discussion
    • No answers

Common workflow

/----> develop / change code
|
|  /-> input parameter set 1     input parameter set 2     ...
|  | 
|  |          run                         run              ...
|  |   
|  |       output data 1             output data 2         ...
|  |   
|  |   store raw data
|  |   
|  |   postprocess output data
|  |   
|  |   store processed data
|  | 
|  \-- plotting
|
\----- results, final plots

Common workflow

  • Common aspects
    • Fast evolving code
    • Results depend crucially on code version
    • Multiple input parameter sets (possibly: parallelize)
    • Input set -> one output data point
    • Postprocess output
    • Plot processed output
    • Store raw and processed data, results and plots
  • Opportunities for best practices and collaboration

Provenance

  • Verifiability and reproducibility are at the heart of science
  • Anybody at any time in the future
    • Take any (published) result
    • Go back and be able to reproduce and verify it
  • Requires
    • Meticulously document every single step that led to the result
    • Quite involved for computer simulations

Provenance

  • Pushed by ALPS (ETH Zürich, http://alps.comp-phys.org/)
  • VisTrails
    • http://vistrails.org, Polytechnic Institute of New York University
    • “Scientific workflow and provenance system”
    • Evolve workflows, document workflows
    • Example
      • Graph in paper
      • VisTrails opens workflow
      • Reproduce (rerun simulations)
  • Complete and complex provenance system (too complex?)
  • Take inspiration and learn

Provenance

VisTrails

Image from: VisTrails Website

Provenance

  • For every result, keep (see the sketch after this list)
    • Code version, git describe
    • Input parameters
    • Scripts / programs (or their configuration) for processing raw data
    • Plotting scripts / instructions
  • Central code repository
  • Publish code (i.e. open source)
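As a concrete illustration of recording this information, here is a minimal Python sketch; the helper and the file layout are made up for this example and are not part of any of the tools mentioned above:

import json
import subprocess

def save_result(filename, parameters, data):
    """Store results together with minimal provenance information."""
    record = {
        # Code version as reported by git describe (assumes a tagged repository).
        'code_version': subprocess.check_output(
            ['git', 'describe']).decode().strip(),
        'parameters': parameters,
        'data': data,
    }
    with open(filename, 'w') as f:
        json.dump(record, f, indent=2)

save_result('result.json', {'N': 10, 'beta': 1}, {'Magnetization': -0.06})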

Data management

In the long run, your data matters more than your code. It’s worth investing some effort to keep your data in good shape for years to come.

Konrad Hinsen, “Caring for Your Data,” Computing in Science and Engineering, vol. 14, no. 6, pp. 70-74, Nov.-Dec., 2012; available on computer.org

Data management

  • Motivation
    • Publications: data shown and discussed
    • Traditionally
      • Data management low priority (e.g. processing, storage)
      • Data formats undocumented
      • Format conversion error prone
      • Down the road: hard or impossible to interpret data

Data management

  • Data model design
    • Equivalent to software design
    • Describe data model in abstract but plain language
      • Equivalent to pseudo code
      • Specification
      • Documentation
    • Guidelines
      • Avoid redundancy
      • Keep extensibility in mind
      • More details in article

Data management

  • Distinguish data model and representations of data model
  • Representations
    • In memory (i.e. in program code)
    • On disk (i.e. file formats)
  • In memory
    • Code easier to understand
    • Encourages code modularity
    • Different representations in different languages
  • On disk: file formats
    • Different formats for different requirements
      • Binary for performance, e.g. HDF5
      • ASCII for readability, e.g. XML, JSON
    • Same data model: convert between formats easily and without loss

Data management

Example: XML

<molecule name="water">
  <atoms>O H1 H2</atoms>
  <bonds>
    <bond atoms="O H1" order="1" />
    <bond atoms="O H2" order="1" />
  </bonds>
</molecule>

Example: JSON

{
  "type": "molecule",
  "name": "water",
  "atoms": ["O", "H1", "H2"],
  "bonds": [{"order": 1, "atoms": ["O", "H1"]},
            {"order": 1, "atoms": ["O", "H2"]}]
}
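As a small sketch of converting between two representations of the same data model without loss, the XML above can be parsed and re-emitted as the JSON form; the snippet assumes the XML is stored in the string xml_text:

import json
import xml.etree.ElementTree as ET

xml_text = '''<molecule name="water">
  <atoms>O H1 H2</atoms>
  <bonds>
    <bond atoms="O H1" order="1" />
    <bond atoms="O H2" order="1" />
  </bonds>
</molecule>'''

root = ET.fromstring(xml_text)
molecule = {
    'type': 'molecule',
    'name': root.get('name'),
    'atoms': root.find('atoms').text.split(),
    'bonds': [{'order': int(b.get('order')), 'atoms': b.get('atoms').split()}
              for b in root.find('bonds')],
}
print(json.dumps(molecule, indent=2))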

Data management

  • Can we establish guidelines and best practices?
  • Document our data models and formats
  • Central data repository?
  • Same class of problems: similar data models
    • E.g. Monte Carlo
  • Use standardized file formats
    • HDF5
    • JSON
  • Possibly: share code for data handling, input, output

Data management

Examples / ideas for Monte Carlo: HDF5

/experiment
  attributes: id, description
  /measurement
    attributes: count
    /0
      attributes: config
      /observables
        /Magnetization
          /data
          /jack
          /bins
      /run
        attributes: count
        /0
          /observables
            /Magnetization
              /data
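A sketch of how part of such a layout could be created with h5py; the group and attribute names follow the outline above, the data is made up:

import numpy as np
import h5py

with h5py.File('experiment.h5', 'w') as f:
    exp = f.create_group('experiment')
    exp.attrs['id'] = 1
    exp.attrs['description'] = 'example Monte Carlo run'
    meas = exp.create_group('measurement')
    meas.attrs['count'] = 1
    m0 = meas.create_group('0')
    m0.attrs['config'] = 'N=10, beta=1'
    mag = m0.create_group('observables/Magnetization')
    mag.create_dataset('data', data=np.random.rand(200))
    mag.create_dataset('jack', data=np.random.rand(20))
    mag.create_dataset('bins', data=np.random.rand(20))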

Data management

Examples / ideas for Monte Carlo: JSON

{
  "info": {
    "program": "SSEMonteCarlo",
    "version": "unknown",
    "state": "done",
    "seedvalue": 42,
    "run": {
      "0": {
        "startdate": "2013-03-08T09:41:43Z",
        "enddate": "2013-03-08T09:41:43Z"
      },
      "1": {
        "startdate": "2013-03-08T09:46:13Z",
        "enddate": "2013-03-08T09:46:13Z"
      } 
    } 
  },
  "type": "ssemontecarlo.montecarlo.MonteCarlo",
  "params": {
    "type": "ssemontecarlo.montecarlo.Struct",
    "N": 10,
    "beta": 1,
    "h": 10,
    "J": {
      "type": "ssemontecarlo.montecarlo.NNAFHeisenbergInteraction",
      "J": 1
    }
  },
  "mcparams": {
    "type": "ssemontecarlo.montecarlo.Struct",
    "t_warmup": 100,
    ...
  },
  "observables": [...],
  "data": {
    "ExpansionOrder": {
      "mean": 246.39998046875,
      "error": 0.9593617279141327,
      "binCount": 200,
      "binSize": 1
    },
    "Magnetization": {
      "mean": -0.060000009536743164,
      "error": 0.18963782783471672,
      "binCount": 200,
      "binSize": 1
    }
  }
}

A Python-centric workflow

  • IPython is awesome (ipython.org)
  • Browser-based notebooks
    • Similar to Mathematica
    • Might be a good fit for some workflows
  • Comprehensive library and tools for parallelization

A Python-centric workflow

  • Idea: (versioned) Python scripts control the complete workflow (see the sketch after this list)
    • Set input parameters
    • Run program (in parallel)
    • Postprocess data
    • Store raw and processed data (possibly in MongoDB)
    • Do plotting (matplotlib)
  • Core program itself
    • Written in Python (SciPy; PyPy — pypy.org)
    • Written in C++, compiled as Python module (e.g. with Boost)
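
As a rough sketch of this idea, such a driver script might look like the following; the simulation program and its output format are hypothetical (assumed to print JSON similar to the example above):

import json
import subprocess

import matplotlib.pyplot as plt

# Hypothetical compiled simulation program that prints its results as JSON.
PROGRAM = './build/simulation'

code_version = subprocess.check_output(['git', 'describe']).decode().strip()

results = []
for h in [1, 2, 5, 10]:
    params = {'N': 10, 'beta': 1.0, 'h': h}
    output = subprocess.check_output([PROGRAM, json.dumps(params)])
    result = json.loads(output)
    result['params'] = params
    result['code_version'] = code_version
    results.append(result)

# Store raw results together with parameters and code version.
with open('results.json', 'w') as f:
    json.dump(results, f, indent=2)

# Postprocess and plot.
hs = [r['params']['h'] for r in results]
means = [r['data']['Magnetization']['mean'] for r in results]
errors = [r['data']['Magnetization']['error'] for r in results]
plt.errorbar(hs, means, yerr=errors, fmt='o')
plt.xlabel('h')
plt.ylabel('magnetization')
plt.savefig('magnetization.pdf')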