tl;dr: If you find you’re spending a lot of time integrating various pieces of software across multiple computers and are currently using a mixture of scripts, build systems, and manual methods to do that, look into configuration managers. They’re easy to pick up and automate the most common tasks. I’m using ansible, because it’s standard, simple, and written in python.
Research software typically requires integrating clusters, high-performance numerical libraries, 30-year-old Fortran applications written by geniuses, and 30-minute-old python scripts written by PhD students.
A consistent thorn in my side is downloading, building, installing, and deploying all of that stuff. For example, on a recent project, I needed to:
- Check out a Java (Maven) project from svn
- Build it with a particular build profile
- Unzip the built binaries
- Install the binaries at a specific location on the client machine
- Install the binaries at a specific location on a cluster
- Reconfigure Luigi to run the application with the correct arguments
- Copy some other binaries onto the cluster’s HDFS
- (Sometimes) Rebuild all the binaries from source, if the source was monkey-patched due to a runtime bug
- (Sometimes) Nuke all of the above and start fresh
Each step is simple enough, but designing a clean architecture around doing slightly different permutations of those steps is a struggle between doing something the easy way (e.g. a directory containing scripts, hard-coded arguments in the Luigi task) and doing something the correct way.
The correct way (or so I thought) to handle these kinds of problems is to use a build system. However, there is no agreed-upon "one way" to download, build, and install software, which is why build systems tend to be either extremely powerful/flexible (e.g. make, where anything is possible) or rigid/declarative (e.g. maven).
Because there’s so much choice out there, I concluded that researching each would obviously (ahem) be a poor use of my valuable time. So, over the years, I’ve been writing a set of scripts which have been gradually mutating:
- Initially they were bash scripts
- Then they were ruby scripts that mostly did the same thing as the bash scripts
- Then they were ruby scripts that integrated some build parts (e.g. pulling version numbers out of pom.xml files), but were still mostly doing the same thing as the bash scripts
- Then they were a mixture of structured YAML files containing some of the build steps and ruby filling in the gaps
- Then they were a mixture of YAML files containing metadata (description strings, version numbers), YAML files containing build steps, and Python filling in the gaps, because Python is easier to integrate with the researchers' and developers' existing work
After many months of this, I decided "this sucks, I'll develop a new, better way of doing this". So I spent an entire evening going through the weird, wonderful, and standard build systems out there, justifying why my solution would be better for this problem.
Well, it turns out this problem isn't suitable for a build system, despite having similar requirements (check inputs, run something, check outputs, transform files, etc.). Although my searches yielded a menagerie of weird software, what I actually needed was a configuration manager, and Ansible is a particularly straightforward one.
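To give a sense of why this fits better than a build system or a pile of scripts, here is a minimal sketch of how a few of the steps listed earlier might look as an ansible playbook. The host group, repository URL, paths, Maven profile, and file names are placeholders I've invented for illustration, not the real project's values.

```yaml
# deploy.yml - a minimal sketch, not the real playbook; hosts, URLs, paths,
# the Maven profile, and file names below are invented for illustration.
- hosts: localhost
  connection: local
  tasks:
    - name: Check out the Maven project from svn
      ansible.builtin.subversion:
        repo: https://svn.example.org/myproject/trunk
        dest: /tmp/build/myproject

    - name: Build it with a particular build profile
      ansible.builtin.command: mvn -q package -P cluster
      args:
        chdir: /tmp/build/myproject

    - name: Create a staging directory for the unzipped binaries
      ansible.builtin.file:
        path: /tmp/build/staging
        state: directory

    - name: Unzip the built binaries into the staging directory
      ansible.builtin.unarchive:
        src: /tmp/build/myproject/target/myproject-bin.zip
        dest: /tmp/build/staging
        remote_src: yes

- hosts: cluster
  tasks:
    - name: Install the binaries at a specific location on the cluster
      ansible.builtin.copy:
        src: /tmp/build/staging/myproject/
        dest: /opt/myproject/

    - name: Copy the other binaries onto HDFS
      ansible.builtin.command: hdfs dfs -put -f /opt/myproject/extras.jar /apps/myproject/
```

Re-running the playbook repeats the same steps consistently against whatever hosts you point it at, and the occasional "nuke everything and start fresh" case can live in a second playbook (or behind a tag) rather than in yet another mutant script.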
This rollercoaster of "there probably isn't a good solution already available for this problem", "I'll hack my own solution!", "My hacks are a mess, I should build an actual system", "oh, the system already exists" must be common among software developers. Maybe it's because the problem isn't actually about developing a solution: it's about understanding the problem well enough. Once a problem is truly understood, it's much easier to identify which libraries and algorithms already solve it, which makes developing the solution far simpler. Otherwise, you'll end up like me: Keeper of the Mutant Scripts.