Adam Kewley

I work with large research data systems. One of those systems—lets call it Choogle, for the sake of this post—is nearly two decades old, which is practically forever in the IT world, which is impressive. Choogle has been around so long that much of the lab’s analysis equipment is tightly integrated with it. For example, a researcher can enter a Choogle ID into an analysis instrument to automatically link their analysis with the sample’s history. This is neat, provided the researcher incorporates Choogle as a central component of their workflow.

From a top-down viewpoint, making researchers submit their sample’s information to Choogle is a better situation than each researcher having a collection of loosely formatted labnotes. Designing lab equipment to require Choogle is a way of encoraging conversion, which is the intention.

What happens, though, if researchers don’t particularly want to use Choogle? Maybe they’re already incorporated a similar (non-Choogle) research system, or maybe they just don’t like the UI. When those researchers want NMR plots, the Choogle requirement becomes a barrier.

A barrier-smashing gameplan emerges. Researchers enter the bare-minimum amount of information required to yield a valid Choogle ID and use that ID to perform analysis. Choogle developers respond by adding validation to force researchers to enter more information. The obvious countermove develops: fill syntactically valid—but garbage—information to bypass the form’s validation.

This cycle continues forever because it’s fundamentally an arms race between researchers, who can “tech up” at will, and Choogle, that can only deploy rigid countermoves. Eventually, Choogle’s developers give up on trying to police the system with code and turn to human engineering: make the researcher’s bosses enforce compliance. However, that just transforms the a human-vs-machine arms race into a human-vs-human one.

I’ve seen this pattern emerge many times. It’s especially prevalent when the system is perceived to be a timesink by its users (that’s usually a design and communication challenge). In Choogle’s case, PhD-qualified scientific researchers can be particularly clever in their validation circumvention. Unfortunately, I’m a data scientist tasked with mining data from Choogle. One thing I’ve got to do is filter out all the “placeholder” samples submitted by devious researchers. The arms race has made my job hard.

For example, one thing I analyze is what components are used in mixtures on Choogle. Easy data to mine. However, there’s a validation rule that prevents researchers from creating a zero-component mixture on Choogle. Some lab analyses only allow “mixture” Choogle IDs. So, knowing the ball game, guess what the researchers do? Of course, thousands of mixtures containing one ingredient (usually water, because it is always going to be available on any chemical research platform).

Choogle, and the tightly-integrated lab kit, is extremely expensive to modify at this point of their lifecycle (estimate a cost a freelance developer would charge to add a validation rule to an <input> element, multiply your estimate by at least 50). Because of that, I’m thinking of inventing a brand-new chemical ingredient in Choogle: fakeonium.

Fakeonium is a farcical chemical that researchers can enter as a mixture ingredient to bypass the one-component validation rule. I can easily filter out fakeonium-containing mixtures—much easier than filtering out the other 500 farcical ingredients. Other data scientists might be pulling their hair out at the idea of this approach, the data must be pure, but real-world constraints and the limitations of IT systems always lead to unforeseen usage patterns.

Fakeonium might seem like an admission of failure on Choogle’s behalf, I disagree. I think it’s an admission that we can’t plan for everything. Heavily integrating general-purpose lab equipment against monolithic systems like Choogle will always lead to these kinds of shortfalls eventually.