The three guiding principles for storing & archiving research data
A few weeks ago, we got the type of email that you never want to see — someone had found a discrepancy in the data in one of our apps and wanted to know if the problem was on our end.
The application in question was the Living Wage Calculator, a prominent, high-traffic app that we developed for Dr. Amy Glasmeier, Professor of Economic Geography and Regional Planning at MIT. The app calculates the true minimum cost of living for any individual or family in any community in the United States, and it has helped to set wage policy across the country. The data (which is provided by Open Data Nation) goes through a number of transformations before arriving in the database that backs the web application.
Sure enough, one of the numbers had an incorrect precision. Fortunately, it wasn’t one of the living wage values that represent the true value of the tool — it was a supporting comparison value. Still, it raised the questions, “Where did this go wrong?” and “How long has it been this way?”
It turns out that the way you store and process data is critical to being able to track down precisely these kinds of questions and issues, which we have all run into at some point. While there is nothing close to a one-size-fits-all solution, some general principles can help to guide your design. A proper data integrity strategy turns the inevitable glitches, outages, and mistakes into minor, correctable inconveniences instead of unrecoverable disasters.
The big three
When looking at any data storage system, we should consider three major dimensions:

- Durability
- Accessibility
- Accountability

Let’s look at each one in turn:
Durability

Durability refers to how resilient the data is to loss. A single copy of the data on a single laptop has very low durability; that data could easily be deleted or lost.
But if we add a set of cloud-based backups (and verify that they’re working), now we’ve added some durability to the equation. If the laptop is stolen, in theory we should be able to recover the data.
But cloud backups have failure modes of their own. What if someone fails to pay the credit card bill? What if your account is mistakenly closed?
Accessibility

In this context, we’re not necessarily talking about web accessibility, such as WCAG compliance. We’re talking about the extent to which the data is ready to be used by your team in their tools of choice.
For example, if there’s one central copy of the data that is updated daily, and if it takes one of your collaborators an hour to download the latest version, then the data is not as accessible as if it had been ready to go on their local network.
Likewise, if the central repository is a relational database, and someone wants to be able to run some analyses in R, then they may need to perform a manual export of the data first. In this case, the data is not as accessible as if they had a flat view of the data all ready for import into R.
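As a rough sketch of how that gap can be closed, the flat view can be regenerated automatically whenever the database is updated. Everything below is illustrative: the table, columns, and values are hypothetical, and SQLite stands in for the real database.

```python
import csv
import sqlite3

# Build a tiny stand-in relational database (hypothetical schema and values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wages (county TEXT, living_wage REAL)")
conn.executemany("INSERT INTO wages VALUES (?, ?)",
                 [("Suffolk", 21.35), ("Middlesex", 22.1)])

# Export a flat view that collaborators can load directly, e.g. with
# read.csv("wages_flat.csv") in R, instead of exporting from the database
# by hand.
with open("wages_flat.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["county", "living_wage"])
    writer.writerows(conn.execute("SELECT county, living_wage FROM wages"))
```

Scheduling a step like this alongside the database load keeps the flat file in sync with the central copy.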
Data History and Accountability
Accountability refers to the ability to trace changes in your data to specific events and individuals. In practice, this usually means some kind of version control.

Ideally, accountability happens on a line-by-line or byte-by-byte basis: you want to be able to look at a particular line of code or data and tell who modified it and when.
Our approach for complete data integrity (at least in this scenario)
I’d like to stress that each research project is different. Larger data sets or more sophisticated access needs can rule out some solutions that would otherwise be a perfect fit. This case serves as one possible illustration of the principles being applied.
The Living Wage Calculator pulls raw data from several different sources, processes that data into a central data file, and then loads that data file into a relational database that can be used by the web application.
Fortunately, the initial raw data files aren’t too large — they’re on the order of single-digit megabytes. So we keep them in plain CSV format and check them into version control with git. This establishes a very strong baseline for the incoming data: the remote repository gives us durability, and git’s history gives us accountability.
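A simple pre-commit check can strengthen that baseline further. Here is a hedged sketch, assuming a raw CSV with a numeric column that should always carry exactly two decimal places (the file layout and column name are invented for illustration):

```python
import csv
from decimal import Decimal

# Hypothetical sanity check to run before committing a new raw data drop:
# confirm that every value in a numeric column parses and carries exactly
# the expected number of decimal places.
def check_precision(path, column, places=2):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            value = Decimal(row[column])
            # exponent is -2 for a value like 21.35, -1 for 22.1, and so on
            if -value.as_tuple().exponent != places:
                return False
    return True
```

A failing check flags a bad data drop before it enters version control, so the git history stays a trustworthy record of clean data.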
Next, we process the raw data sources with a set of scripts. These scripts are also kept in git, so that they receive the same durability and accountability as the initial raw data files.
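One property worth building into such scripts is that they never lose precision along the way. A minimal sketch of one processing step, with made-up column names and figures, doing its arithmetic in Python’s Decimal type rather than floating point:

```python
from decimal import Decimal

# Hypothetical transformation from a raw file's rows to the central data
# file's rows. Decimal arithmetic keeps the computed values exact.
def build_central_rows(raw_rows):
    out = []
    for row in raw_rows:
        annual = Decimal(row["hourly_wage"]) * Decimal(row["hours_per_year"])
        out.append({"county": row["county"], "annual_wage": str(annual)})
    return out

raw = [{"county": "Suffolk", "hourly_wage": "21.35", "hours_per_year": "2080"}]
print(build_central_rows(raw))  # annual_wage comes out exact: '44408.00'
```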
The final step is to load the data into a relational database for use by the web application. One upside to relational databases is that they can have excellent accessibility: many applications can read the data directly from this source. But relational databases are terrible at accountability. Fortunately, the data in this application is read-only; users can’t modify it, so the application is essentially display-only. That means we can go back to our central or raw data files for accountability.
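The load step itself can be kept deliberately boring. A hedged sketch, with SQLite standing in for the production database and an invented schema; one defensive option shown here is storing the wage as text, so the column type can never silently alter its precision:

```python
import sqlite3

# Sketch of the final load step. SQLite stands in for the production
# database, and the table and columns are illustrative, not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE living_wage (
        county      TEXT NOT NULL,
        hourly_wage TEXT NOT NULL  -- text keeps the value exactly as written
    )
""")

central_rows = [("Suffolk", "21.35"), ("Middlesex", "22.10")]
conn.executemany("INSERT INTO living_wage VALUES (?, ?)", central_rows)
conn.commit()
```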
Data structure accountability saves the day
So how did we use this setup to track the root cause of the incorrect precision in the Living Wage Calculator?
First we checked the raw data files to see if the data came in correctly. They had the correct precision, and git confirmed that the files were identical to the ones that had been used to generate the later stages of data.
Next we checked the structure of the database. Sure enough, the column in question didn’t have the right type. It was discarding a couple of digits of precision. We could fix it, but that wouldn’t answer the question of why it had this type, or how and when it got that way.
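The failure mode itself is easy to reproduce. A small illustration (the numbers are invented) of how a column typed with too few decimal places silently rounds what it stores:

```python
from decimal import Decimal, ROUND_HALF_UP

# Invented value with four decimal places, as it exists in the central file.
full = Decimal("21.3467")

# A column typed for only two decimal places effectively does this on load:
stored = full.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(stored)  # 21.35 -- the last two digits are silently discarded
```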
Fortunately, we modify the database through a controlled set of scripts in version control. Using the history of those scripts (and the unit tests that go with them), we could see that the column type had been set incorrectly during a bulk change a couple of years ago. In fact, the unit test itself was incorrect, which led to the wrong code being written to satisfy it. We could see all the necessary detail about the change in question, including who, what, and when.
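A schema test along these lines is one way to encode the intended column type, so that a bulk change which downgrades it fails loudly. This is only a sketch: the table, column, and type are hypothetical, and SQLite stands in for the real database.

```python
import sqlite3

# Look up the declared type of one column via SQLite's table_info pragma,
# whose rows are (cid, name, type, notnull, dflt_value, pk).
def column_type(conn, table, column):
    for _, name, col_type, *_ in conn.execute(f"PRAGMA table_info({table})"):
        if name == column:
            return col_type
    return None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE living_wage (county TEXT, hourly_wage DECIMAL(8,4))")

# The assertion encodes the intended precision; if a bulk schema change
# downgrades the type, this check fails and points at the offending commit.
assert column_type(conn, "living_wage", "hourly_wage") == "DECIMAL(8,4)"
```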
By looking even further back in the history, we could see that this was one change among many that day. This detail had been missed in the rush and had gone unnoticed until now.
Within a very short amount of time, we were able to update the code and tests and redeploy the data with the proper precision. Perhaps even more importantly, we gained insight into what went wrong with our process, so that we can avoid this type of error in the future.
Designing for your own system’s data integrity
Of course, the setup above won’t translate directly to every situation:
- If we’d started with a larger data set, we wouldn’t have been able to store it directly in git.
- If we’d been working with less tech-savvy collaborators, the git interface wouldn’t have been friendly enough for general use.
- If the mistake had been in the data instead of the database schema and if the data hadn’t been read-only, it would have been harder to track down the precise history of the failing value.
That said, the principles of durability, accessibility, and accountability can help to guide your data management strategy toward a set of tools and practices that do provide adequate protection. There are accountability options for large binary files, and there are ways of making complicated data structures accessible to less-technical stakeholders.