This is a presentation by Peter Sefton, Michael Lynch, Liz Stokes and Gerard Devine, delivered at eResearch Australasia 2018 by Peter Sefton.
Notes - Slide 1
In this presentation we will launch version 1.0 of the DataCrate standard. The presentation will cover:
The motivation for this work, and prior art - why we needed to bring together the standards we did in the way that we did.
A walk-through of example data crates from a variety of sources: speleology, clinical trials, simulation, social history, environmental science and microbiology.
An introduction to tools for making data crates with an appeal to attendees to join us in making more tools, for more new kinds of data.
A demonstration of how DataCrates are being used at UTS to move data through the research lifecycle - archiving and publishing data.
Notes - Slide 2
The following people contributed to this presentation:
Notes - Slide 3
There were no existing generic data packaging standards with both human- and machine-readable metadata.
Notes - Slide 4
Our motivation was to be able to display and distribute data sets with useful "who, what, where" metadata in a way that is easy for coders to target and easy for researchers to consume, both as readers and as programmers who might want to run code against a data set.
Notes - Slide 5
We have a growing list of examples.
Notes - Slide 6
DataCrate provides human-readable HTML data about files including detailed metadata.
Notes - Slide 7
This slide shows a CreateAction, where an instrument (a Lidar scanner) was used by an agent (the person) to create two files.
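A CreateAction like this one could be written in JSON-LD roughly as follows. This is a minimal sketch built as a Python dictionary: `agent`, `instrument` and `result` are schema.org properties, but the identifiers, names, file paths, the choice of `IndividualProduct` and `MediaObject` as types, and the bare schema.org context are all assumptions made for illustration, not content from a real data crate.

```python
import json

# Hypothetical identifiers and file names, for illustration only.
create_action = {
    "@id": "#scan-1",
    "@type": "CreateAction",
    "name": "Lidar scan",
    "agent": {"@id": "#person-1"},            # who performed the action
    "instrument": {"@id": "#lidar-scanner"},  # what was used
    "result": [                               # the two files created
        {"@id": "scan/part1.dat"},
        {"@id": "scan/part2.dat"},
    ],
}

metadata = {
    # Assumption: a plain schema.org context; a real crate may use its own.
    "@context": "https://schema.org/",
    "@graph": [
        create_action,
        {"@id": "#person-1", "@type": "Person", "name": "Example Researcher"},
        {"@id": "#lidar-scanner", "@type": "IndividualProduct", "name": "Lidar scanner"},
        {"@id": "scan/part1.dat", "@type": "MediaObject"},
        {"@id": "scan/part2.dat", "@type": "MediaObject"},
    ],
}

print(json.dumps(metadata, indent=2))
```

Each entity appears once in the flat `@graph` list and is referenced elsewhere by its `@id`, which keeps the action, the person, the instrument and the files independently addressable.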
Notes - Slide 8
This shows a software package (instrument) acting on a file (object) to create another file (result): a sepia version of a picture.
Notes - Slide 9
DataCrates contain metadata in JSON-LD.
Notes - Slide 10
Why do we want machine-readable data? One reason would be to generate visualisations that help people understand relationships in the data set. Here’s a demo I coded up in about half an hour before the conference that shows how we might visualise the way files are created. It shows a Person (me) who is the agent in two CreateActions: one where the instrument is a camera/lens combination, the object is the place being pictured, and the result is a file; and one where the object is said file, the instrument is a software package, and the result is a sepia version of the original photo.
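To give a feel for how little code such a visualisation needs, here is a minimal sketch (not the demo shown on the slide) that walks a list of JSON-LD items, picks out the CreateActions, and emits Graphviz DOT text. The item identifiers and file names are hypothetical, and the helper names `as_list`, `create_action_edges` and `to_dot` are invented for this example.

```python
def as_list(value):
    """Normalise a JSON-LD value to a list (it may be absent, single, or a list)."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def create_action_edges(graph):
    """Yield (source, label, target) triples for every CreateAction in the graph."""
    for item in graph:
        if "CreateAction" not in as_list(item.get("@type")):
            continue
        action_id = item["@id"]
        # Inputs point into the action ...
        for prop in ("agent", "instrument", "object"):
            for ref in as_list(item.get(prop)):
                yield ref["@id"], prop, action_id
        # ... and the action points to whatever it produced.
        for ref in as_list(item.get("result")):
            yield action_id, "result", ref["@id"]


def to_dot(edges):
    """Render edge triples as a Graphviz DOT digraph."""
    lines = ["digraph provenance {"]
    for source, label, target in edges:
        lines.append(f'  "{source}" -> "{target}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical items mirroring the two CreateActions described above.
graph = [
    {"@id": "#take-photo", "@type": "CreateAction",
     "agent": {"@id": "#me"}, "instrument": {"@id": "#camera"},
     "object": {"@id": "#place"}, "result": {"@id": "pics/photo.jpg"}},
    {"@id": "#make-sepia", "@type": "CreateAction",
     "agent": {"@id": "#me"}, "instrument": {"@id": "#software"},
     "object": {"@id": "pics/photo.jpg"}, "result": {"@id": "pics/sepia.jpg"}},
]

edges = list(create_action_edges(graph))
dot = to_dot(edges)
print(dot)
```

The printed text can be passed to any Graphviz renderer; because the original photo is the result of one action and the object of the next, the two actions chain together in the drawing.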
Notes - Slide 11
Each term used has a link to its definition, e.g. https://schema.org/CreateAction
Notes - Slide 12
Tooling is a problem. JSON-LD is a great format, but: There are no utility libraries for things like looking up context keys.
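As an illustration of the kind of utility we mean, the following sketch resolves a term to its full IRI using a simple, flat JSON-LD context. It is an assumption-laden example rather than a complete JSON-LD implementation: it ignores `@vocab`, remote contexts and nested contexts, the function name `resolve_term` is invented here, and the sample context (including the `path` mapping) is hypothetical.

```python
def resolve_term(context, term):
    """Return the IRI that `term` maps to in a flat JSON-LD context, or None."""
    entry = context.get(term)
    if entry is None:
        return None
    # A term definition is either a plain string or a dict with an "@id".
    iri = entry.get("@id") if isinstance(entry, dict) else entry
    if iri is None:
        return None
    # Expand compact IRIs such as "schema:name" when the prefix is defined.
    prefix, sep, suffix = iri.partition(":")
    if sep and not suffix.startswith("//") and prefix in context:
        base = context[prefix]
        if isinstance(base, dict):
            base = base.get("@id", "")
        return base + suffix
    return iri


# Hypothetical context, for illustration only.
context = {
    "schema": "http://schema.org/",
    "name": "schema:name",
    "CreateAction": "schema:CreateAction",
    "path": {"@id": "schema:contentUrl"},
}

print(resolve_term(context, "CreateAction"))
print(resolve_term(context, "path"))
```

A lookup like this is enough to turn the keys in a metadata file into links to their definitions, which is what the human-readable view needs.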
Notes - Slide 13
Calcyte uses multi-worksheet spreadsheets for data entry, based on an idea of Mike Lake’s.
Notes - Slide 14
This works, but it’s not an ideal user interface.
Notes - Slide 15
Gerard Devine is developing a tool which will allow DataCrate export from the Australian National Data Service funded HIEv system.
HIEv DataCrate - At the Hawkesbury Institute for the Environment at Western Sydney University, a bespoke data capture application (HIEv) harvests a wide range of environmental data (and associated file-level metadata) from both automated sensor networks and analysed datasets generated by researchers. Leveraging built-in APIs within HIEv, a new packaging function has been developed that allows selected datasets to be identified and packaged in the DataCrate standard, complete with metadata automatically exported from the HIEv metadata holdings into the JSON-LD format. Going forward, this will allow datasets within HIEv to be published regularly and in an automated fashion, in a format that will increase their potential for reuse.
Notes - Slide 17
Chris is working to describe the equipment used in the Microbial Imaging Facility (MIF). Here’s a page for a microscope; this is part of work in progress to describe as much of the context of research in MIF as possible.
Notes - Slide 18
Peter Sefton has developed code to export Omeka Classic repositories to DataCrate. This is an example of one from the University of Western Sydney curated by Katrina Trewin. It uses the Portland Common Data Model for modelling repository structure. We are using these data sets to help develop an Omeka service based on the Omeka S software, along with data from DSpace extracted using another nascent code project.
Notes - Slide 19
Provisioner grew out of two basic requirements, which seem to conflict with one another:
We want to be able to integrate research data management into the tools researchers actually use to do their research, rather than as an add-on to an existing process (like DC)
Any such system should give the researchers something besides data management – access to facilities and software, easier publication, etc
We don’t want to build a monolith, and even if we wanted to build a monolith, we wouldn’t be allowed to – the current mood is SaaS, on-premises only if necessary, no single points of failure
The UNIX philosophy of small parts, loosely joined, and the idea that data has gravity
Notes - Slide 21
The next step is to take this to an international meeting to see if we can get some agreement between projects using similar approaches.
Notes - Slide 22
Dataspice does a similar thing to DataCrate - they could easily be aligned.
Notes - Slide 23
Research Object Bundle also tries to package data with JSON-LD metadata, but in a way that is (we think) more complicated to implement, and without the human-readable website embedded in the package.
Notes - Slide 24
Notes - Slide 25
Please contribute to or use the spec