
Anaconda

I’ve been using Anaconda as my Python distribution and package manager for about two months. In this article I’ll share some general thoughts on Anaconda and my experience with it so far. In particular, I’ll go into how I built my own Python extension, which uses Boost.Python, against Anaconda, which was not as straightforward as I had wished. I’ll also put together a conda package for my extension. Most of this article concentrates on OS X, as this is my main system. I also work on Ubuntu where, in my experience, things mostly just work, whereas OS X requires tinkering.

About Anaconda

Anaconda was originally a Python distribution for scientific computing, but has grown into a proper package management system for Python. It comes with the conda package manager, has a central repository (binstar.org) and, as its distinguishing features, supports binary packages and sophisticated virtual environments. Anaconda is developed by Continuum Analytics, a three-year-old Austin-based company focusing on products and services around number crunching and data-heavy work with Python. Anaconda’s goal, as I understand it, is to provide a reliable cross-platform software stack for Python, including “binary-heavy” modules like Numpy, which are presently not well supported by other Python package managers. Anaconda should thus make it easier to deploy a numerical Python application to a server, to work with different Python and Numpy versions at the same time, or to simply and quickly share code with colleagues. It works on Linux, OS X and Windows and typically gets installed into the user’s home directory. To be honest, I am not using most of its features. I use Anaconda to have a consistent Python environment across my different computers, which include Ubuntu systems and Macs.

In the web context, pip and virtualenv together with a requirements.txt seem to be the way to go, but as far as I know this approach does not accommodate binary packages adequately. Wheels are emerging as the new package format standard, with a number of advantages over the current egg format, including better support for binary packages. However, I get the impression that wheels still have a number of problems in that area and are maybe just not quite there yet. Armin Ronacher wrote a nice overview and introduction to wheels, including a brief overview of Python packaging history. Travis Oliphant from Continuum Analytics has also written an article on the subject and goes into some detail comparing conda and wheels, obviously slightly biased in favour of Anaconda.

I am not sure how Anaconda fits in with package managers from the OS, e.g. apt on Ubuntu. Pure Python package managers like pip sit on top of the software stack provided by the operating system: to know the full stack you would specify the OS, e.g. Ubuntu 12.04 LTS, your installed packages from apt and then your requirements.txt. However, with Anaconda providing generic, non-Python-specific, cross-platform binary packages, it looks like Anaconda is replicating all the hard work and solving all the same difficult problems that platform-specific package managers like apt already solve. Just on multiple platforms, which makes things even harder. Of course, this point is moot on OS X or Windows, where a standard binary package manager does not exist.

Building and packaging a Python extension with Anaconda

My current main project is written in C++ and uses Boost.Python to export functionality from C++ as a Python extension. I also have a small Python wrapper module around this extension. My code depends on Eigen and Boost and uses CMake to build the C++ part and distutils to install the Python part. To build and install I would thus first invoke cmake and then setup.py. Frankly, just getting my Python extension with its Boost dependency correctly built and linked on different systems with different Python versions and different compilers has cost me much more time than I am readily willing to admit. I use MacPorts plus Anaconda on OS X and just Anaconda on my Ubuntu systems. I use IPython.parallel to run my code on different computers and multiple cores (typically only two computers, 20 cores in total). I use the IPython notebook for prototyping, quick runs, analysis and brief write-ups. I had random stability problems with the MacPorts Python stack and IPython.parallel, and that’s why I started using Anaconda in the first place.
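
To illustrate, a minimal Boost.Python extension looks something like the following sketch; the module name my_extension and the greet function are hypothetical stand-ins, not my actual project code:

// my_extension.cpp: minimal sketch of a Boost.Python extension module.
// The name given to BOOST_PYTHON_MODULE must match the file name of the
// compiled library, here my_extension.so.
#include <string>
#include <boost/python.hpp>

std::string greet()
{
    return "hello from C++";
}

BOOST_PYTHON_MODULE(my_extension)
{
    boost::python::def("greet", greet);
}

Compiled and linked against Boost.Python and the Python library, this can be used from Python with import my_extension and my_extension.greet().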

Generally, most Python extensions do not use Boost.Python and do not link against the Python library, in which case it is much more straightforward to get the extension to build and work correctly. In my case, however, it was not that easy, and here are three specific lessons I learnt to get everything working properly with Anaconda.

1. Boost

Unrelated to Anaconda, Boost needs to be compiled with the same compiler that is used to compile the Python extension. I use GCC 4.8, and with MacPorts the matching version of Boost can hence be installed with:

$ sudo port install boost +gcc48
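
To make sure my extension is built with the same compiler, I can point CMake at the MacPorts GCC explicitly; g++-mp-4.8 is the MacPorts name for the GCC 4.8 C++ compiler, adjust as needed:

$ cmake -DCMAKE_CXX_COMPILER=g++-mp-4.8 ..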

On my OS X system I have at least three different versions of Python installed, the Python coming with OS X, the MacPorts version and the Anaconda version. Care needs to be taken that Boost.Python is linked against the correct Python version. Short of recompiling Boost, a quick fix is to use install_name_tool. For example:

$ sudo install_name_tool -change \
    /opt/local/Library/Frameworks/Python.framework/Versions/2.7/Python \
    ~/anaconda/lib/libpython2.7.dylib \
    /opt/local/lib/libboost_python-mt.dylib

The above command changes the linked Python library from MacPorts (installed in /opt/local) to Anaconda for the Boost.Python library installed via MacPorts. This quick fix works for me, but probably requires that the different Python libraries are ABI-compatible.
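
To check which Python library a binary actually references, otool -L lists its linked libraries. For example:

$ otool -L /opt/local/lib/libboost_python-mt.dylib

After the fix above, the output should list ~/anaconda/lib/libpython2.7.dylib instead of the MacPorts Python framework.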

2. Linking

On OS X, Anaconda uses relative paths for its Python library’s install name and for linking, which seems to be somewhat uncommon on that platform. Anaconda uses this approach to enable its powerful virtual environments. The downside is that if you just compile and link your Python extension against Anaconda’s Python, it won’t work: at runtime the wrong Python library gets linked in, resulting in obscure errors like __init__() should return None, not 'NoneType'. The recommended way to fix this is to run install_name_tool after building your extension, similar to what I did with the Boost library above, and change the path of the linked Python library to the correct relative path, relative to where your Python extension my_extension.so ends up being installed. I prefer to set an absolute path, so I don’t have to invoke install_name_tool every time I move my_extension.so to a different location. So my CMakeLists.txt now calls something like this on OS X:

# Query the install name recorded in Anaconda's Python library
INSTALL_NAME=`otool -D ${PYTHON_LIBRARIES} | tail -n1 | tr -d '\n'`
# Change the extension's reference from that name to the absolute path
install_name_tool -change ${INSTALL_NAME} ${PYTHON_LIBRARIES} ${LIB_FILENAME}

where PYTHON_LIBRARIES is the full path to Anaconda’s Python library and LIB_FILENAME points to my_extension.so; INSTALL_NAME would usually be set to libpython2.7.dylib, or similar.
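
As a rough, untested sketch of how this can be wired into CMake itself (my_extension is a hypothetical target name; PYTHON_LIBRARIES again holds the absolute path to Anaconda’s Python library):

if(APPLE)
  # Sketch: re-point the freshly built extension at Anaconda's libpython
  add_custom_command(TARGET my_extension POST_BUILD
    COMMAND bash -c "install_name_tool -change `otool -D ${PYTHON_LIBRARIES} | tail -n1` ${PYTHON_LIBRARIES} $<TARGET_FILE:my_extension>"
    VERBATIM)
endif()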

3. Conda package

It’s actually very easy to build a conda package. Even without redistributing the package, one big advantage is that conda build and conda install automatically set all the paths of linked libraries correctly, so we don’t have to worry about fixing them manually as outlined above. Of course, if you want to run tests in your source directory (I do), then you still need to set the library paths correctly, otherwise the tests fail.
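
Until those paths are fixed, one possible workaround for in-tree test runs on OS X, assuming the extension’s install names are otherwise unresolved, is to let dyld fall back to Anaconda’s library directory for the duration of the test command:

$ DYLD_FALLBACK_LIBRARY_PATH=~/anaconda/lib python -m unittest discover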

A conda recipe is used to build a conda package and can consist of only two files in a directory, meta.yaml and build.sh. I put them in a subdirectory conda in my project’s main directory. With that I can build and install my conda package:

$ conda build ./conda/
$ conda install --use-local my_extension

The meta.yaml looks something like this:

package:
  name: my_extension
  version: 8.0.0

source:
  #fn: /path/to/my/local/archive/myproject.tar.bz2
  git_url: ssh://git@bitbucket.org/me/myproject.git

requirements:
  build:
    - cmake
    - python
  run:
    - python

The build.sh looks something like this:

#!/bin/bash

# Set PYLIB to either .so (Linux) or .dylib (OS X)
PYLIB="PYTHON_LIBRARY_NOT_FOUND"
for i in $PREFIX/lib/libpython${PY_VER}{.so,.dylib}; do
    if [ -f "$i" ]; then
        PYLIB=$i
    fi
done

# Out-of-source build of the C++ part
mkdir Release
cd Release
cmake \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX=$PREFIX \
    -DCMAKE_INSTALL_RPATH=$LD_RUN_PATH \
    -DPYTHON_INCLUDE_PATH:PATH=$PREFIX/include/python${PY_VER} \
    -DPYTHON_LIBRARY:FILEPATH=$PYLIB \
    ..
make
cd ..

# Install the Python wrapper module with Anaconda's Python
$PYTHON setup.py install

exit 0

Of course, to do things properly and have a full Anaconda software stack, I would also need to build conda packages for all my dependencies, in this case for Eigen (easy, it’s a header-only library) and Boost (which does not sound like fun). This goes back to my earlier remark that, at some point, Anaconda starts replicating the work done by platform-specific package managers; in this case MacPorts and the Ubuntu repositories, which I use to install Eigen and Boost.
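
For what it’s worth, a recipe for a header-only library like Eigen could be as small as the following sketch; the version number and URL are illustrative placeholders, not a tested recipe. The meta.yaml:

package:
  name: eigen
  version: 3.2.0

source:
  url: http://example.com/eigen-3.2.0.tar.bz2  # placeholder URL

And the build.sh:

#!/bin/bash

# Eigen is header-only: copy the header directories into the environment
mkdir -p $PREFIX/include
cp -r Eigen unsupported $PREFIX/include/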

Conclusion

As the folks at Continuum Analytics keep emphasizing, conda works today and solves practical problems. For me that’s mostly providing a stable and consistent Python stack across different computers and platforms. It works well, is easy to use and comes with all the numerical packages I need. Installing other missing, non-conda packages on top is accomplished with pip. I built my own Python extension against Anaconda, which on OS X was not as straightforward as I had hoped. I should emphasize that for most Python extensions things should be much easier, but using Boost.Python and linking against the Python library made life more complicated for me. On the bright side, I learnt a lot about linking on OS X. I created a conda package which I can distribute to my different computers. I could also give this binary package to my coworkers and just tell them to install the Anaconda distribution, and thus get them up and running really quickly.

For comments and questions, please write me an email.