By Peter Sefton
This is a presentation I gave at eResearch Australasia 2017-10-18 about the new Draft (v0.1) Data Crate Specification for data packaging I've just completed, with lots of help from others (credits at the end).
In 2013 Peter Sefton and Peter Bugeia presented at eResearch Australasia on a format for packaging research data(1), using standards-based metadata, with one innovative feature – instead of including metadata in a machine-readable format only, each data package came with an HTML file that contained both human- and machine-readable metadata, via RDFa, which allows semantic assertions to be embedded in a web page.
Variations of this technique have been included in various software products over the last few years, but there was no agreed standard on which vocabularies to use for metadata, or specification of how the files fitted together.
This presentation will describe work in progress on the DataCrate specification(2), illustrated with examples, including a tool to create DataCrates. We will also discuss other work in this area, including Research Object Bundles(3) and Data Conservancy(4) packaging.
We will be seeking feedback from the community on this work: should it continue? Is it useful? Who can help out?
The DataCrate spec:
Has both human and machine readable metadata at a package (data set/collection) level as well as at a file level
Allows for and encourages inclusion of contextual metadata such as descriptions of organisations, facilities, experiments and people, linked to files with meaningful relationships (e.g. to say a file was created by a particular machine, as part of a particular experiment, at an organisation).
Is a BagIt profile(5). BagIt(6) is a simple packaging standard for file-based data.
Has a README.html tag file at the root with BagIt-style metadata about the distribution (contact details etc.), with a link to
a CATALOG.html file in RDFa, using schema.org metadata, inside the payload (data) dir with detailed information about the files in the package, and a redundant CATALOG.json in JSON-LD format
Is easily extensible, as it is based on RDF.
(1) Sefton P, Bugeia P. Introducing next year’s model, the data-crate; applied standards for data-set packaging. In: eResearch Australasia 2013 [Internet]. Brisbane, Australia; 2013. Available from: http://eresearchau.files.wordpress.com/2013/08/eresau2013_submission_57.pdf
(2) datacrate: Bagit-based data packaging specification for dissemination of research data with useful human and machine readable metadata: “Make Data Crate Again!” [Internet]. UTS-eResearch; 2017 [cited 2017 Jun 29]. Available from: https://github.com/UTS-eResearch/datacrate
(3) Research Object Bundle [Internet]. [cited 2017 Jun 16]. Available from: https://researchobject.github.io/specifications/bundle/
(4) Data Conservancy Packaging Specification Home [Internet]. [cited 2017 Jun 29]. Available from: http://dataconservancy.github.io/dc-packaging-spec/dc-packaging-spec-1.0.html
(5) Ruest N. BagIt Profiles Specification [Internet]. 2017 Jun. Available from: https://github.com/ruebot/bagit-profiles
(6) Kunze J, Boyko A, Vargas B, Madden L, Littman J. The BagIt File Packaging Format (V0.97) [Internet]. [cited 2013 Mar 1]. Available from: http://tools.ietf.org/html/draft-kunze-bagit-06
Peter Bugeia and I talked about this 4 years ago. This year I got around to leading the effort to standardise what we did back then.
This presentation is structured as a story.
Back in June Cameron Neylon was annoyed
When I saw this cry for help I contacted Cameron and offered to work with him.
More from Cameron.
But actually, there are no simple examples of how to organise "long-tail" data sets for publication. Research data management books will tell you about various metadata standards, but how do you enter the metadata and associate it with your data?
The dataset is available from Zenodo, an open data repository hosted by CERN.
This is a human-readable catalog that lists all the files in the data set.
And has information about their context and the relationships between them.
For example, it shows that Cameron is the creator of the dataset. Note that Cameron is identified by his ORCID ID: http://orcid.org/0000-0002-0068-716X. Using URLs to identify things such as people is one of the key principles of Linked Data.
Here's an example of a relationship between two of the files - one is a translation of another.
The HTML contains embedded RDFa metadata. RDFa is a standard way of embedding semantics in a web page.
RDFa, using the schema.org metadata vocabulary, is widely used by search engines.
Movie times, opening times, recipes - these are some of the things that search engines understand.
This package also has JSON metadata.
The JSON is easily usable by programmers - getting the contact for this dataset for example is a simple operation.
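As a minimal sketch of that kind of lookup: the JSON below is a hypothetical, heavily simplified stand-in for a CATALOG.json file (the real layout may well differ), and only the "Contact" key is taken from this talk. The person, the `@id` URL and the `get_contact` helper are invented for illustration.

```python
import json

# Hypothetical stand-in for CATALOG.json; only the "Contact" key comes from
# the talk, everything else here is made up for the example.
SAMPLE_CATALOG = """
{
  "Name": "Example dataset",
  "Contact": {
    "@id": "http://example.org/people/1",
    "Name": "Example Person"
  }
}
"""


def get_contact(catalog_text):
    """Return whatever is stored under the "Contact" key, or None if absent."""
    catalog = json.loads(catalog_text)
    return catalog.get("Contact")


contact = get_contact(SAMPLE_CATALOG)
print(contact["Name"], contact["@id"])
```

The point is only that a plain key lookup is enough; no RDF tooling is needed for this kind of everyday use.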
But if needed, the simple "Contact" can be turned into a URI, as per Linked Data principles.
You can look up Contact in the DataCrate JSON-LD context and see that it maps to schema:accountablePerson
Then you can map schema:accountablePerson to http://schema.org/accountablePerson
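Those two lookups can be sketched in a few lines. This is a simplified illustration, not how a full JSON-LD processor works: the `CONTEXT` dictionary is a hypothetical two-entry fragment (only the Contact mapping and the schema.org namespace come from the text above), and `expand_term` is a made-up helper that handles nothing beyond flat term and prefix entries.

```python
# Hypothetical fragment of a JSON-LD context. A real context has many more
# entries and may use richer term definitions than plain strings.
CONTEXT = {
    "schema": "http://schema.org/",
    "Contact": "schema:accountablePerson",
}


def expand_term(term, context):
    """Expand a short key to a full URI using a flat JSON-LD-style context."""
    # Step 1: look the key up in the context (Contact -> schema:accountablePerson).
    value = context.get(term, term)
    # Step 2: if the result has a known prefix, swap the prefix for its namespace.
    prefix, sep, local = value.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return value


print(expand_term("Contact", CONTEXT))
```

A key that is not in the context is returned unchanged, which is a deliberate simplification for this sketch.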
There are also checksums for all the data files.
There's a BagIt manifest file.
Which lists all the files and their checksums, so the validity of the bag can be checked.
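Here is a rough sketch of such a check, assuming a manifest named `manifest-sha256.txt` with one `checksum path` pair per line, as BagIt manifests are laid out. The choice of SHA-256 is an assumption (a given bag may use a different algorithm), `verify_manifest` is a hypothetical helper, and the tiny bag built at the end is made up for the demo. A real BagIt library does considerably more validation than this.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_manifest(bag_dir, algorithm="sha256"):
    """Return the manifest paths that are missing or fail their checksum."""
    bag = Path(bag_dir)
    manifest = bag / f"manifest-{algorithm}.txt"
    failures = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # ignore blank lines
        expected, rel_path = line.split(maxsplit=1)
        target = bag / rel_path
        if not target.is_file():
            failures.append(rel_path)
            continue
        digest = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        if digest != expected.lower():
            failures.append(rel_path)
    return failures


# Demo on a throwaway bag with a single payload file (all names invented).
with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp)
    (bag / "data").mkdir()
    payload = bag / "data" / "example.txt"
    payload.write_bytes(b"hello")
    checksum = hashlib.sha256(b"hello").hexdigest()
    (bag / "manifest-sha256.txt").write_text(
        f"{checksum}  data/example.txt\n", encoding="utf-8"
    )
    print(verify_manifest(bag))  # intact payload
    payload.write_bytes(b"tampered")
    print(verify_manifest(bag))  # payload changed after the manifest was written
```

An empty list means every listed file is present and matches its recorded checksum.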
This package is like a gift from Cameron, to his collaborators, to other researchers and to his future self.
.. to do this work ...
We used an experimental tool called Calcyte
... I ran Calcyte on Cameron's Google Drive share to create CATALOG.xlsx files ...
Calcyte is experimental early-stage open source software written by my group (mainly me) at UTS.
Calcyte created spreadsheets which functioned as metadata forms that Cameron could fill out.
The spreadsheets are multi-sheet workbooks, giving us scope to describe not only data entities like files, but metadata entities such as people, licenses and organisations.
We spent a couple of months working on this intermittently. It will be quicker next time, but this level of data description will always involve a fair bit of care and work: at least a few hours for this scale of project. It's also important to proofread the result, just as with publishing articles.
The advantages of this approach are that the package has: Human AND machine readable web-native linked-data metadata, not just string-values in XML
This slide is a reminder of what the CATALOG.html file looks like, complete with its DataCite citation, which, when people start citing this, will add to Cameron's academic capital.
This work is based on previous efforts
Cr8it - now being looked after by Newcastle.edu.au (via Western Sydney and Intersect) https://github.com/digitalbridge/crateit/tree/develop
Mike Lake's CAVE repository. https://suss.caves.org.au/cave/
Cr8it and HIEv are covered in our 2013 presentation at eResearch Australasia
It builds on other standards:
The format used in this demo is described in a draft specification.
Use at UTS for our data repository, and for export from various services
Lobby to get support integrated into Zenodo, Figshare et al
Improve capture/packaging tools (Cr8it, Cloudstor Collections)
Work with others on aligning this work with other standards; here's a list someone else put together.
Work with RDA on their repository interchange format. https://www.rd-alliance.org/groups/research-data-repository-interoperability-wg.html
I'll leave it with this slogan from our UTS data librarian and friend of eResearch, Liz Stokes.
Thanks to:
Cameron Neylon for being customer zero
Liz Stokes for working on metadata crosswalking/mapping
Mike Lake for coding and ideas
Conal Tuohy and Duncan Loxton for commenting on the draft spec
Amir Aryani for discussions about metadata
And the mainly Sydney-based metadata group who met in the lead-up to this work: Piyachat Ratana, Sharyn Wise, Michael Lynch, Craig Hamilton, Vicki Picasso, Gerry Devine, Katrin Trewin, Ingrid Mason, Peter Bugeia