This presentation was written and delivered by Sharyn Wise (with a couple of slides from Peter Sefton) for the Australasia Preserves meeting at the NSW State Library. This was one of a number of short talks from various organisations and was the only one to focus on research data. UTS has not started its Digital Preservation journey yet - we're still packing our things and looking at the map, or 'prepping'.
Notes - Slide 1
Today I’ll be presenting on preparing for preservation at UTS, because we are not there yet, and I suspect it is more of a journey than a destination anyway. Preservation is on our eResearch Roadmap for next year, so I am grateful to learn from the experience of others today. I named this talk in a rather feisty moment, thinking that the story of the Domesday book was the archetypal cautionary tale and would be well-known in preservation circles. If you don't know the story you can google it, but the title in the Guardian says it all – digital data is incredibly fragile in comparison to earlier technologies. Instead I will begin with what I think of as a Domesday tale of research data.
Notes - Slide 2
I still remember seeing media headlines about the potential dangers of unbalanced Omega-6 oil consumption in 2013 – like this one here in Time Magazine. But as so often happens with media headlines, the back story of the actual research paper in the BMJ turned out to be more complex and interesting. A U.S. cardiovascular health researcher, conducting a meta-analysis, had come across a paper from a 1973 clinical trial, the Sydney Diet Heart Study, which, in line with the lipid hypothesis of the time, had replaced saturated fat with polyunsaturated fat in the diet. The trial had been discontinued because the participants had shown a higher rate of mortality. However, when the Sydney researchers wrote up the results, they had not reported on specific causes of death. The data confusingly showed the opposite of what they had hypothesised, so they published a brief paper and left it at that.
It just so happened that the unsaturated fat used was safflower oil - which we now know uniquely contains only Omega-6 fatty acids. So the US researcher began hunting for the dataset. He eventually tracked down the last surviving member of the research team and asked if the data was available. Luckily the Sydney researcher had kept it among piles of boxes in his garage. After a bit of rummaging, he produced an obsolete 9-track magnetic computer tape. Data recovery was expensive and difficult, but worth it, because of what it added to knowledge of lipid metabolism: the Omega-6 group had a 6-percentage-point higher risk of death from cardiovascular and coronary heart disease than the control group.
Now let's remember that this study could not be replicated without risk to human life, so that data would have been gone forever if not recovered. I like this cautionary tale, because it highlights some extra challenges we face. Like how do we know what to preserve, when knowledge grows in ways that cannot always be anticipated? Of course there is the Records Act, which requires that research data from human clinical trials be preserved for up to 25 years, but that wouldn't have been long enough to save this data. Our Records Managers suggest that it should be up to the researchers, as the experts, to recommend their data for longer term preservation. However, would the Sydney Diet Heart trial researchers have selected this data for preservation? Probably not. So do we preserve everything? Well that isn't really financially viable – as we know, it's one thing to keep data and quite another thing to manage it over years. And that brings me to the next challenge.
Notes - Slide 3
Which is getting research data under management in the first place. Researchers are suspicious of giving their data to anyone – after all, it is their competitive advantage. So let's step back: what is research data?
It can be just about anything – which in itself is daunting. A lot of research data is not human readable, like the last two of these three examples – geospatial data and microscopy data. So obviously we need to preserve more than just the outputs if the data is to remain accessible and usable. Broadly speaking, the usual approach here is to convert to open formats, ideally without losing metadata, and to preference open source software if possible, which we can retain. And this is where the concept of preparing for preservation comes in.
We need to start a lot earlier in the data lifecycle than the archiving stage, to ensure that metadata and provenance data are not lost. And because this may involve changes to researcher practices, it is also a huge cultural change challenge. One problem we face is that most scientific instrument data comes off instruments in proprietary formats. This problem is being addressed in optical microscopy by a consortium called OME – the Open Microscopy Environment – whose Bio-Formats library works to bust open proprietary files from optical microscopes and extract the metadata as well as the images as TIFFs. The microscopy slide shown here is inside OMERO, their open source imaging repository. Similarly, the map data is in open, standardised formats inside GeoServer, an open source geospatial platform. In both cases, these platforms offer significant enough benefits to researchers that they are willing to use them as repositories for their data. So does GitLab, the code repository environment shown here. So far so good.
Notes - Slide 4
But how do we bring together these discipline-specific repositories into one managed solution? By loosely coupling them in an architecture we are developing called the Provisioner. A researcher (top centre) accesses the data management catalogue "Stash", which provisions managed workspaces at the beginning of their projects and, at the end, creates data records where they can upload or link to their data in place. Stash keeps all the various metadata packages we need to manage, describe, contextualise and ultimately preserve the data, and as we find suitable research platforms to add, we will write an adaptor to bring each one in.
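The adaptor idea above can be sketched roughly as follows. This is a hypothetical illustration, not the actual Provisioner code: the class and method names (`RepositoryAdaptor`, `provision_workspace`, `create_data_record`) and the GitLab URL are invented for the example.

```python
from abc import ABC, abstractmethod


class RepositoryAdaptor(ABC):
    """Hypothetical interface Stash could use to talk to one research platform."""

    @abstractmethod
    def provision_workspace(self, project_id: str) -> str:
        """Create a workspace for a new project; return its URL or identifier."""

    @abstractmethod
    def create_data_record(self, project_id: str) -> dict:
        """At project end, return metadata describing (or linking to) the data in place."""


# Each supported platform (OMERO, GeoServer, GitLab, ...) gets its own adaptor.
class GitlabAdaptor(RepositoryAdaptor):
    def provision_workspace(self, project_id: str) -> str:
        return f"https://gitlab.example.edu/{project_id}"

    def create_data_record(self, project_id: str) -> dict:
        # Link to the data in place rather than copying it into the catalogue.
        return {"@type": "Dataset", "identifier": project_id,
                "distribution": f"https://gitlab.example.edu/{project_id}"}


# Adding a new platform is then just registering another adaptor.
ADAPTORS = {"gitlab": GitlabAdaptor()}
```

The point of the loose coupling is that Stash only depends on the small interface, so a new platform costs one adaptor, not a redesign.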
Images sourced from Wikimedia Commons.
Notes - Slide 5
So what about smaller, human-readable datasets – like surveys, or heterogeneous humanities data collections? It can be hard to persuade these researchers of the value of managing their data, too, rather than throwing it on some media and into a drawer. Worse, even for those who want to manage their data properly, there has been a dearth of simple tools for researchers to capture the necessary metadata and bundle their data for archiving, as Cameron Neylon complains here on his blog.
Notes - Slide 6
Enter DataCrate, a new standardisation effort led by UTS, specifically for research data. The specification is available on GitHub and we'd welcome your input: feel free to raise an issue with suggestions or disagreements, or to send us a pull request with changes.
The spec is designed for implementers: it explains how DataCrate builds on an existing standard, the BagIt packaging specification, and how to add Linked Data metadata in JSON-LD format to describe the package, its files, and its context – such as the people who created it, the organisations which funded it, the equipment used in creating the data, and so on.
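A minimal JSON-LD description of the kind the spec calls for might look like this sketch (assembled in Python for illustration; the exact context and required properties are defined in the DataCrate spec on GitHub, so treat the file names, ORCID, and property choices here as indicative only):

```python
import json

# Indicative package-level metadata: a dataset, its creator, and one file,
# described with schema.org terms in JSON-LD, as DataCrate does.
metadata = {
    "@context": "http://schema.org/",
    "@graph": [
        {"@id": "data/", "@type": "Dataset",
         "name": "Example survey data",
         "creator": {"@id": "https://orcid.org/0000-0000-0000-0000"},
         "hasPart": [{"@id": "data/survey.csv"}]},
        {"@id": "https://orcid.org/0000-0000-0000-0000",
         "@type": "Person", "name": "A. Researcher"},
        {"@id": "data/survey.csv", "@type": "MediaObject",
         "encodingFormat": "text/csv"},
    ],
}

print(json.dumps(metadata, indent=2))
```

Because the entities (dataset, person, file) are linked by `@id`, the same graph can describe both the package as a whole and its individual parts.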
The plan is that DataCrate will allow us to move data between systems (including by reference, using BagIt's fetch feature) with rich file-level and package-level metadata, and with a human-readable manifest travelling with the data to help contextualise it for future users.
Notes - Slide 7
This is a screenshot of a data set available here
If sufficient metadata is present, the spec explains how to construct a DataCite citation, as we can see here.
Essentially, DataCrates are data packages that have a content manifest with checksums, to help ensure data integrity; metadata in a linked-data format (JSON-LD, using terms from schema.org and other ontologies); and an index.html file which displays the metadata in a human-readable summary format, both for the package (crate) and …
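The checksum manifest works like BagIt's payload manifests: one hash per file, so integrity can be re-verified whenever the package moves. A minimal sketch of generating such lines (the file name in the commented usage is invented for the example):

```python
import hashlib
from pathlib import Path


def manifest_lines(files, algorithm="sha512"):
    """Yield BagIt-style manifest lines: '<hexdigest>  <relative path>'."""
    for path in files:
        digest = hashlib.new(algorithm, Path(path).read_bytes()).hexdigest()
        yield f"{digest}  {path}"


# Usage: write a manifest for the crate's payload files, e.g.
# with open("manifest-sha512.txt", "w") as out:
#     out.writelines(line + "\n" for line in manifest_lines(["data/survey.csv"]))
```

Verification is the same walk in reverse: recompute each digest and compare it to the recorded one.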
Notes - Slide 8
... they have file-level descriptions of the content. So although the original use case I just mentioned was packaging data for retention, we are prepping for preservation here too, by employing Richard Lehane's Siegfried tool to run through several of the steps required to identify preservation metadata.
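Siegfried identifies file formats against registries such as PRONOM and can emit machine-readable JSON (`sf -json`). As a sketch of how that feeds into file-level preservation metadata, here is how you might pull format IDs out of output of that general shape – the sample below is hand-written to illustrate the structure, not real tool output, so treat the field layout as an approximation:

```python
import json

# Hand-written sample approximating the shape of `sf -json` output.
sf_output = """
{"files": [{"filename": "data/survey.csv",
            "matches": [{"ns": "pronom", "id": "x-fmt/18",
                         "format": "Comma Separated Values"}]}]}
"""


def format_ids(report: str) -> dict:
    """Map each filename in a Siegfried-style report to its format IDs."""
    parsed = json.loads(report)
    return {f["filename"]: [m["id"] for m in f["matches"]]
            for f in parsed["files"]}


print(format_ids(sf_output))
```

Recording those format identifiers alongside the checksums and JSON-LD gives future users a head start on knowing what software the files will need.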
Notes - Slide 9
So in sum: at UTS we may be early on the journey to preservation, but I've tried to illustrate some ways that we are building preparation into our systems and thinking, so that when we come to implement preservation workflows, the means to do so – metadata and tooling, for example – will be there. Here are our contact details and we will be happy to answer questions at any time. Thank you.