This is a presentation by Peter Sefton, Michael Lynch, Liz Stokes and Gerard Devine, delivered at eResearch Australasia 2018 by Peter Sefton.

Launching DataCrate v1.0: a general purpose data packaging format for research data distribution and web-display

Notes - Slide 1

In this presentation we will launch version 1.0 of the DataCrate standard. The presentation will cover:

  • The motivation for this work, and prior art - why we needed to bring together the standards we did in the way that we did.

  • A walk-through of example data crates from a variety of sources: speleology, clinical trials, simulation, social history, environmental science and microbiology.

  • An introduction to tools for making data crates with an appeal to attendees to join us in making more tools, for more new kinds of data.

  • A demonstration of how DataCrates are being used at UTS to move data through the research lifecycle - archiving and publishing data.





peter.sefton@uts.edu.au
michael.lynch@uts.edu.au
elizabeth.stokes@uts.edu.au
g.devine@westernsydney.edu.au

Notes - Slide 2

The following people contributed to this presentation:

  • peter.sefton@uts.edu.au
  • michael.lynch@uts.edu.au
  • elizabeth.stokes@uts.edu.au
  • g.devine@westernsydney.edu.au



Motivation
💻 + 💾 + 📦 = 🙅

Notes - Slide 3

There were no existing generic data packaging standards with human- and machine-readable metadata.

🙅 === FACE WITH NO GOOD GESTURE




Motivation: package data with maximum useful context
Who …  made it? funded the work? 
What … format are these files? … is the research about?
Where … was it collected? … is it about? 
Why … was it done?  … <link to publication>
How … were these files created? … can I repeat that process?

Notes - Slide 4

Our motivation was to be able to display and distribute data sets with useful "who, what, where" metadata in a way that is easy for coders to target, and easy for researchers to consume, both as readers and as programmers who might want to run code against a data set.




Notes - Slide 5

We have a growing list of examples.




Notes - Slide 6

DataCrate provides human-readable HTML data about files including detailed metadata.




Ability to describe file provenance

Notes - Slide 7

This slide shows a CreateAction, where an instrument - a Lidar scanner - was used by an agent - the person - to create two files.
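
A CreateAction like the one on this slide could be written down roughly as follows. This is a minimal sketch, not the markup from the slide: the identifiers, names and file paths are hypothetical, and the choice of the schema.org types IndividualProduct (for the scanner) and MediaObject (for the files) is an assumption. The properties agent, instrument and result are schema.org Action properties.

```python
import json

# All identifiers, names and paths below are hypothetical illustrations.
person = {"@id": "#operator", "@type": "Person", "name": "Example Operator"}
scanner = {"@id": "#lidar-scanner", "@type": "IndividualProduct", "name": "Lidar scanner"}
files = [
    {"@id": "data/scan-1.las", "@type": "MediaObject"},
    {"@id": "data/scan-2.las", "@type": "MediaObject"},
]

# The action ties the agent (person) and instrument (scanner) to the files created.
create_action = {
    "@id": "#scan-action",
    "@type": "CreateAction",
    "name": "Lidar scan",
    "agent": {"@id": person["@id"]},
    "instrument": {"@id": scanner["@id"]},
    "result": [{"@id": f["@id"]} for f in files],
}

metadata = {
    "@context": "https://schema.org/",
    "@graph": [person, scanner, create_action, *files],
}

print(json.dumps(metadata, indent=2))
```

Each item is listed once in the graph and referred to elsewhere by its @id, so the same person or instrument can take part in several actions.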




Software can be an instrument too

Notes - Slide 8

This shows a software package (instrument) acting on a file (object) used to create another file (result) - a sepia version of a picture.




All metadata is available in JSON-LD 

Notes - Slide 9

DataCrates contain metadata in JSON-LD.
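
Because the metadata is plain JSON, a program can read it with a standard JSON parser. The sketch below assumes the metadata is a flattened JSON-LD document with an @graph list, and uses a small inline example with hypothetical identifiers rather than a real crate. It indexes the items by @id and follows a reference from an action to its agent.

```python
import json

# Inline stand-in for a crate's JSON-LD metadata (hypothetical content).
catalog_text = """
{
  "@context": "https://schema.org/",
  "@graph": [
    {"@id": "#operator", "@type": "Person", "name": "Example Operator"},
    {"@id": "#scan-action", "@type": "CreateAction",
     "agent": {"@id": "#operator"},
     "result": [{"@id": "data/scan-1.las"}]},
    {"@id": "data/scan-1.las", "@type": "MediaObject"}
  ]
}
"""

catalog = json.loads(catalog_text)

# Index every item in the graph by its @id for quick lookup.
by_id = {item["@id"]: item for item in catalog["@graph"]}


def resolve(value):
    """Follow a {"@id": ...} reference to the full item, if the graph has it."""
    if isinstance(value, dict) and set(value) == {"@id"}:
        return by_id.get(value["@id"], value)
    return value


action = next(i for i in catalog["@graph"] if i["@type"] == "CreateAction")
agent = resolve(action["agent"])
print(agent["name"])
```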




... so relationships can be visualized

Notes - Slide 10

Why do we want machine-readable data? One reason is to generate visualisations that help people understand relationships in the data set. Here’s a demo I coded up in about half an hour before the conference that shows how we might visualise the way files are created. It shows a Person (me) who is the agent in two CreateActions: one where the instrument is a camera/lens combination, the object is the place being pictured, and the result is a file; and one where the object is that file, the instrument is a software package, and the result is a sepia version of the original photo.
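
The demo code itself is not reproduced here. As a rough sketch of the idea, and not the code behind the demo, the following turns CreateAction items into edges and prints them as Graphviz DOT text. The identifiers are hypothetical, and the edge direction (inputs point to the action, the action points to its results) is a design choice made for this example.

```python
def as_list(value):
    """Normalise a JSON-LD property value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def action_edges(graph):
    """Return (source, target, label) edges for every CreateAction in the graph."""
    edges = []
    for item in graph:
        if item.get("@type") != "CreateAction":
            continue
        action_id = item["@id"]
        # Agent, instrument and object feed into the action.
        for prop in ("agent", "instrument", "object"):
            for ref in as_list(item.get(prop)):
                edges.append((ref["@id"], action_id, prop))
        # The action produces its results.
        for ref in as_list(item.get("result")):
            edges.append((action_id, ref["@id"], "result"))
    return edges


def to_dot(edges):
    """Render the edges as Graphviz DOT text."""
    lines = ["digraph provenance {"]
    for source, target, label in edges:
        lines.append(f'  "{source}" -> "{target}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical graph mirroring the two actions described above.
graph = [
    {"@id": "#photo-action", "@type": "CreateAction",
     "agent": {"@id": "#me"}, "instrument": {"@id": "#camera"},
     "object": {"@id": "#place"}, "result": {"@id": "photo.jpg"}},
    {"@id": "#sepia-action", "@type": "CreateAction",
     "agent": {"@id": "#me"}, "instrument": {"@id": "#software"},
     "object": {"@id": "photo.jpg"}, "result": {"@id": "photo-sepia.jpg"}},
]

edges = action_edges(graph)
dot = to_dot(edges)
print(dot)
```

The printed text can be passed to any tool that renders DOT to produce a diagram of who made what, with which instrument.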




URIs as names for things

Notes - Slide 11

Each term used has a link to its definition, eg: https://schema.org/CreateAction




(🔧🔨🔩🔪🔬)ing is an issue for JSON-LD

Notes - Slide 12

Tooling is a problem. JSON-LD is a great format, but there are no utility libraries for things like looking up context keys.
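
To make "looking up context keys" concrete, here is a minimal sketch of the kind of helper we mean. It is an illustration only: it handles a small, simplified subset of JSON-LD contexts (string terms, terms defined with an @id, and one level of prefix expansion), and the function name and example context are hypothetical.

```python
def expand_term(term, context):
    """Return the IRI a context maps a term to, or None if the term is unknown.

    Simplified: supports string definitions, {"@id": ...} definitions and
    compact "prefix:suffix" values whose prefix is itself a context key.
    """
    definition = context.get(term)
    if isinstance(definition, dict):
        definition = definition.get("@id")
    if definition is None:
        return None
    prefix, sep, suffix = definition.partition(":")
    # "https://..." also contains a colon, so skip values that look like full IRIs.
    if sep and not suffix.startswith("//") and isinstance(context.get(prefix), str):
        return context[prefix] + suffix
    return definition


# Hypothetical context for illustration.
context = {
    "schema": "https://schema.org/",
    "CreateAction": "schema:CreateAction",
    "name": {"@id": "schema:name"},
    "Person": "https://schema.org/Person",
}

print(expand_term("CreateAction", context))
```

A fuller implementation would also need to handle remote contexts, arrays of contexts, @vocab and scoped contexts, which is exactly the kind of work a shared utility library could take on.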




Calcytejs

Notes - Slide 13

Calcyte uses multi-worksheet spreadsheets for data entry, based on an idea of Mike Lake’s.






Notes - Slide 14

This works, but it’s not an ideal user interface.




Notes - Slide 15

Gerard Devine is developing a tool which will allow DataCrate export from the Australian National Data Service funded HIEv system.

HIEv DataCrate - At the Hawkesbury Institute for the Environment at Western Sydney University, a bespoke data capture application (HIEv) harvests a wide range of environmental data (and associated file-level metadata) from both automated sensor networks and analysed datasets generated by researchers. Leveraging the APIs built into HIEv, a new packaging function has been developed that allows selected datasets to be identified and packaged in the DataCrate standard, complete with metadata automatically exported from the HIEv metadata holdings into JSON-LD. Going forward, this will allow datasets within HIEv to be published regularly and automatically, in a format that increases their potential for reuse.






Notes - Slide 16

Christian Evenhuis is developing a tool for exporting microscope images from Omero.






Notes - Slide 17

Chris is working to describe the equipment used in the Microbial Imaging Facility (MIF). Here’s a page for a microscope; this is part of work in progress to describe as much of the context of research in MIF as possible.






Notes - Slide 18

Peter Sefton has developed code to export Omeka Classic repositories to DataCrate. This is an example of one from the University of Western Sydney curated by Katrina Trewin. This uses the Portland Common Data Model for modelling repository structure. We are using these data sets to help develop an Omeka service based on the Omeka S software, along with data from DSpace extracted using another nascent code project.






Notes - Slide 19

Provisioner grew out of two basic requirements, which seem to conflict with one another:

  • We want to be able to integrate research data management into the tools researchers actually use to do their research, rather than as an add-on to an existing process (like DC)

  • Any such system should give the researchers something besides data management – access to facilities and software, easier publication, etc

  • We don’t want to build a monolith, and even if we wanted to build a monolith, we wouldn’t be allowed to – the current mood is SAAS, on-premises only if necessary, no single points of failure

  • The UNIX philosophy of small parts, loosely joined, and the idea that data has gravity




It’s standards all the way down
Oxford Common File Layout ← Static file-based repositories
THIS TALK → DataCrate ← THIS TALK
DataCrate builds on BagIt ←  Data packages w/ checksums, content by ref
Schema.org ← Main metadata standard / Repo metadata standard → PCDM
JSON-LD ← Linked data in programmer-friendly format

Notes - Slide 20
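
The BagIt layer in this stack is what contributes checksums for the payload files. As a minimal sketch of that idea, assuming a bag whose payload lives under data/ and a SHA-256 manifest with one "checksum path" line per file, the following computes manifest lines for a throwaway example bag. The function name and the example file are hypothetical, and this is not code from the DataCrate tools.

```python
import hashlib
import tempfile
from pathlib import Path


def manifest_lines(bag_dir):
    """Return one 'checksum  path' line per payload file under bag_dir/data."""
    bag_dir = Path(bag_dir)
    lines = []
    for path in sorted((bag_dir / "data").rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            # Paths are written relative to the bag root, with forward slashes.
            lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}")
    return lines


# Build a tiny example bag in a temporary directory and list its manifest.
with tempfile.TemporaryDirectory() as tmp:
    payload = Path(tmp) / "data"
    payload.mkdir()
    (payload / "hello.txt").write_bytes(b"hello\n")
    example_manifest = manifest_lines(tmp)

print("\n".join(example_manifest))
```

A consumer of the package can recompute the same digests and compare them with the manifest to check that nothing was corrupted in transit.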






Notes - Slide 21

The next step is to take this to an international meeting to see if we can get some agreement between projects using similar approaches.






Notes - Slide 22

Dataspice does a similar thing to DataCrate - they could easily be aligned.






Notes - Slide 23

Research Object Bundle also tries to package data with JSON-LD metadata, but in a way that is (we think) more complicated to implement, and without the human-readable website embedded in the package.




Help wanted! 
We invite you to:
<ul>
<li>
<p>Critique the standard</p>
</li>
<li>
<p>Generate some more sample data sets as a spec for people who will ...</p>
<ul>
<li>
<p>... write a packaging tool</p>
</li>
<li>
<p>Export from data management system (eg MyTardis :)</p>
</li>
<li>
<p>Write a GUI or web tool for people to create DataCrates</p>
</li>
<li>
<p>Help add viz to our HTML pages</p>
</li>
</ul>
</li>
</ul>

Notes - Slide 24

Please help.






Notes - Slide 25

Please contribute to or use the spec