This is a local version of a post I wrote for the Software Sustainability Institute.
Software is (almost) never written or run in isolation. Instead, it builds on top of a wide range of dependencies, from compilers and language runtime environments to application-specific libraries.
This is a huge challenge for reproducible research. Not only should the software we write be sustainable (e.g. well versioned, documented, and tested), but the environments that the software exists within also need to be documented and, ideally, recreatable.
Many have suggested virtual machines/containers as one solution to this problem (see, for instance, the recent Docker Containers for Reproducible Research Workshop (C4RR)), where you ship not just your computational code but also the environment alongside it. While this is a good start, I'm not sure it is sufficient. The environment for the image is often constructed using a standard Linux distribution's package manager, which by default usually installs the newest stable version of a package. Two people building the same VM/container at two different times might therefore end up with different environments!
In this post I want to talk about a different approach to reproducible environments that tackles this problem (and many others) at the package manager level. Introducing Nix: "The purely functional package manager".
So what is Nix? The Nix homepage does a much better job at explaining this than I could ever do:
"Nix is a powerful package manager for Linux and other Unix systems that makes package management reliable and reproducible. It provides atomic upgrades and rollbacks, side-by-side installation of multiple versions of a package, multi-user package management and easy setup of build environments" (emphasis mine)
The key difference between Nix and other package managers is that it isolates packages as much as possible. Nix packages are installed into their own directories, rather than a global namespace, and building new packages or upgrading existing ones is a case of combining the required dependencies. Because they are isolated this becomes much easier: if one application depends on one version of gcc and another on a second version of gcc, Nix just picks the right one for each package rather than relying on the one in /bin. The isolated directories are based on cryptographic hashes, so if I build a package again and even a single bit has changed, it gets its own isolated directory and anything that was depending on the old version still works! This is where the reproducibility comes from.
From a reproducible research standpoint we may not want to use Nix packages to manage our entire system; however, we can use it to provide convenient sandboxing for our experiments. Those who have used virtualenv for Python will know how useful such sandboxes can be. You can think of Nix as virtualenv for your entire environment: the compilers, runtimes, numerical libraries (such as those underlying Python libraries like NumPy) and even Python packages themselves.
Let's say we have a very simple experiment/analysis that depends on:
Of course these themselves depend on various compilers, but we can let Nix pick appropriate versions based on the package descriptions for these 4 dependencies.
Nix packages are described in a functional programming language (which I'll admit looks a little strange to start with, even to someone who is a big fan of Haskell). Essentially every package in the system is described via a function and we combine the functions (lazily) to build the environment we want.
For reproducible research, we don't necessarily want to describe how to build a package itself (since we might only want to execute scripts). Instead we are trying to build an environment to run our experiment from a set of existing packages.
This is where nix-shell comes in useful. It says: "given some description of an environment, don't try to build the actual package, instead drop me into a shell with all the binary/library paths for the dependencies I specify set up".
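To make the "packages are functions, combined lazily" idea concrete, here is a toy sketch in the Nix language. It does not use Nixpkgs at all: mkTool, gcc and unused are hypothetical names invented for this illustration, and real package functions return derivations rather than plain attribute sets.

```nix
let
  # A hypothetical "package": a function from its dependencies
  # to a description (here just a plain attribute set).
  mkTool = { compiler }: {
    name = "tool";
    builtWith = compiler.name;
  };

  # A stand-in dependency, also just an attribute set.
  gcc = { name = "gcc"; };

  # Laziness: this binding is never used, so it is never evaluated
  # and the error is never raised.
  unused = throw "this is never evaluated";
in
  # Combine the function with its dependency; this evaluates to "gcc".
  (mkTool { compiler = gcc; }).builtWith
```

Saved to a file, this can be evaluated with nix-instantiate --eval. The point is only the shape: dependencies go in as function arguments, and nothing is evaluated until something asks for it.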
Let's see what a package description (that we want to save into shell.nix) for the above dependencies might look like:
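As a minimal sketch of such a shell.nix: the full dependency list is not given here, so gnuplot (mentioned later in this post) and python3 are stand-ins chosen for illustration, not the actual dependencies of the experiment.

```nix
# shell.nix: a minimal sketch; the package choices are illustrative assumptions.

# Bring every package from the current Nix channel into scope.
with import <nixpkgs> {};

stdenv.mkDerivation {
  # Name of our (fake) package; nix-shell only uses it to label the environment.
  name = "simpleEnv";

  # The dependencies we want available inside the shell.
  buildInputs = [
    gnuplot
    python3
  ];
}
```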
At the core of the description is the mkDerivation function that declares how to build a particular package. In our case we are simply calling our (fake) package simpleEnv and we specify the dependencies we need as buildInputs.
When we are ready to drop into our environment we simply run nix-shell in the same directory as shell.nix. It will build the dependencies (grabbing binaries from a package repository if it can) and set up the environment (PATH etc.) to point to the correct packages. For example:
We can then do whatever analysis we need and simply exit the nix-shell when we are done, returning to our pre-Nix environment.
As a side note, I recommend calling nix-shell with the -j N argument to enable parallel builds to create environments quicker.
So far we have managed to get a custom environment for a specific experiment. Notice, however, that this isn't quite reproducible yet, as it suffers from the same problem as other package managers: we are just grabbing the version of each package from whichever Nix channel we are registered to.
Lucky for us, a Nix channel is (roughly speaking) just a pointer to a particular commit in the Nixpkgs repository which contains package descriptions (just like the one above) for many existing packages (such as gnuplot above). If we want a reproducible build, all we need to do is fix the version of the nixpkgs repository that we pull our package descriptions from:
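A minimal sketch of a pinned shell.nix, assuming builtins.fetchTarball is used to fetch a snapshot of Nixpkgs from GitHub (other fetchers would work too). The commit hash is a placeholder that must be replaced with a real Nixpkgs commit, and gnuplot and python3 are again illustrative stand-ins.

```nix
# shell.nix: sketch of pinning Nixpkgs to one commit (one possible approach).
let
  # Placeholder: substitute the full hash of the Nixpkgs commit to pin to.
  rev = "<nixpkgs-commit-hash>";

  # Fetch and import the package set exactly as it was at that commit,
  # instead of using whatever the local channel currently points to.
  pkgs = import (builtins.fetchTarball {
    url = "https://github.com/NixOS/nixpkgs/archive/${rev}.tar.gz";
  }) {};
in
pkgs.stdenv.mkDerivation {
  name = "simpleEnv";
  buildInputs = [
    pkgs.gnuplot
    pkgs.python3
  ];
}
```

fetchTarball also accepts an optional sha256 attribute, which lets Nix verify that the downloaded snapshot is the one you expect.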
We can now send our experiment to another user and we will both build the same environment. For example, after building the environment on two different machines:
Notice how the package hashes are the same, making it very unlikely these packages are different.
While I haven't yet explored all the features of Nix, it seems like a great tool for reproducible research. There are a few things to look out for, however:
What is really interesting is that it is possible to fix many of these issues by introducing containers/virtual machines into the workflow, allowing us to 1) use Nix on non-Linux/macOS platforms, 2) gain some level of security via isolation, and 3) install into /nix as required for binary builds.
Nix is a great tool for creating reproducible environments and hopefully this post piques your interest. In the future I hope to see the Container/Nix combination used to good effect making it easier than ever to fully and accurately reproduce experiments.
If you are interested in Nix I recommend looking through at least the quick start and giving it a shot for yourself (it's a very easy install). If you are a software creator then consider writing a Nix derivation for your package and adding it to the Nix ecosystem so that anyone can easily access your software via Nix.