This is a talk delivered in recorded format by Peter Sefton, Nick Thieberger, Marco La Rosa and Mike Lynch at eResearch Australasia 2020.
Research data from all disciplines has interest and value that extends beyond funding cycles and must be managed and preserved for the long term. However, much of the effort in eResearch goes into building systems which provide functionality and services that operate on data but which actually put data at risk. For instance, loading data into a particular tool often means that the data is not easily retrievable if that tool or service cannot be sustained. At worst, the data is lost.
In this presentation we will introduce the standards-based Arkisto platform and show a number of examples from multiple disciplines of current Arkisto deployments, including an institutional Research Data Portal, a snapshot of the Expert Nation history project, crowd-sourced data from historical criminology, and the Pacific And Regional Archive for Digital Sources in Endangered Cultures (PARADISEC).
Across the sector we build services that operate on data but which actually put data at risk.
The Arkisto (https://arkisto-platform.github.io/why/) approach is to work with a set of standards which make data available for long-term access. The closest emoji I could find to represent standards was this “standard poodle”. Previously I used a toothbrush - on the basis that “standards are like toothbrushes, everyone wants to use their own”.
The first of the two core standards is the Oxford Common File Layout (OCFL) to organize data in a repository as a set of files. This approach is scalable indefinitely, and reduces the risk that data will be locked up in monolithic systems.
This diagram by Mike Lynch shows a series of different sized collections of data, each with a label. The labels (manifests) in this case are purely about data integrity - and contain checksums. The bundles of data are the next level up as we move on to look at Standard number 2.
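The idea of a checksum "label" can be sketched in a few lines of Python. This is a simplified illustration, not a complete OCFL inventory: the file names and contents are hypothetical and held in memory, and the only feature borrowed from OCFL is a manifest that maps each digest to the paths holding that content.

```python
import hashlib


def build_manifest(files):
    """Map each SHA-512 digest to the sorted list of paths with that content."""
    manifest = {}
    for path, content in sorted(files.items()):
        digest = hashlib.sha512(content).hexdigest()
        manifest.setdefault(digest, []).append(path)
    return manifest


def find_corrupt_paths(files, manifest):
    """Return paths that are missing or no longer match their recorded digest."""
    expected = {path: digest for digest, paths in manifest.items() for path in paths}
    corrupt = []
    for path, digest in sorted(expected.items()):
        content = files.get(path)
        if content is None or hashlib.sha512(content).hexdigest() != digest:
            corrupt.append(path)
    return corrupt


# Hypothetical example content (assumption: not taken from any real repository).
files = {
    "v1/content/ro-crate-metadata.json": b"{}",
    "v1/content/data/recording.wav": b"audio bytes",
    "v1/content/data/recording-copy.wav": b"audio bytes",
}
manifest = build_manifest(files)
```

Because the manifest is keyed by digest, two files with identical content share one entry, and any later change to a file shows up as a mismatch when the digests are recomputed.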
RO-Crate is the standard Arkisto uses for packaging and describing data sets. It is based on other standards:
Schema.org is used as the main ontology for classes and properties - it has coverage for all the basic Who What Where style metadata and is used by Google’s dataset search and a number of other projects. There are a few terms from other ontologies where Schema.org does not have coverage.
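To make this concrete, here is a minimal sketch of what an RO-Crate metadata document looks like when built and read from Python. It follows the layout of RO-Crate 1.1 as we understand it (a JSON-LD file with a flat list of entities, one of which points at the root dataset); the dataset, person and file are invented for illustration, and the helper names root_dataset and resolve are our own, not part of any library.

```python
# A minimal RO-Crate style metadata document (all entity values are hypothetical).
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Example field recordings",
            "author": {"@id": "#person-1"},
            "hasPart": [{"@id": "data/recording.wav"}],
        },
        {"@id": "#person-1", "@type": "Person", "name": "Example Researcher"},
        {"@id": "data/recording.wav", "@type": "File", "name": "Recording 1"},
    ],
}

# Assumption: the descriptor is named .json in RO-Crate 1.1 and .jsonld in 1.0.
DESCRIPTOR_IDS = ("ro-crate-metadata.json", "ro-crate-metadata.jsonld")


def resolve(crate, reference):
    """Follow a {"@id": ...} reference to the entity it names, or return None."""
    for entity in crate["@graph"]:
        if entity.get("@id") == reference.get("@id"):
            return entity
    return None


def root_dataset(crate):
    """Find the root dataset by following the metadata descriptor's 'about' link."""
    for entity in crate["@graph"]:
        if entity.get("@id") in DESCRIPTOR_IDS and "about" in entity:
            return resolve(crate, entity["about"])
    return None


root = root_dataset(crate)
author = resolve(crate, root["author"])
```

The Who What Where metadata uses Schema.org terms such as name, author and hasPart, and every linked thing is an entity in its own right that can be described further.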
RO-Crates may also have an HTML human readable summary of data. If you find a stray crate in your downloads folder it is easy to click on the HTML file and get a summary of what’s inside - they can also be hosted on the web using a plain-old webserver.
This is a screenshot of an RO-Crate in the UTS data portal. We are looking at its HTML summary.
With extensive metadata.
A growing set of Arkisto-compatible software tools allow data ingest into repositories, and the creation of data discovery portals that connect data to analytical, visualisation and computing tools.
OCFL Spec: https://ocfl.io/
Research Object Crate (RO-Crate) Spec: http://www.researchobject.org/ro-crate
UTS OCFL JS Implementation: https://github.com/uts-eresearch/ocfl-js
UTS RO Crate / SOLR portal: https://github.com/uts-eresearch/oni-express
UTS Describo: https://github.com/UTS-eResearch/describo
UTS Describo Data Packs: https://github.com/UTS-eResearch/describo-data-packs
CoEDL OCFL JS implementation: https://github.com/CoEDL/ocfl-js
CoEDL Modern PARADISEC: https://github.com/CoEDL/modpdsc
CoEDL OCFL tools: https://github.com/CoEDL/ocfl-tools
One important tool is Describo, a desktop (and soon to be online) tool for describing data using the RO-Crate standard. It creates linked-data descriptions that can describe a dataset at the top level, and also individual files or variables inside files.
There are two projects working on the online version of Describo - one at UTS and one led by CERN working with the European National Research Networks.
Describo can be configured for use in specific domains, for example in cultural archives like PARADISEC. This slide shows how users can create entities and link them, and select from pre-defined data loaded in as part of the profile.
Arkisto currently has two data discovery tools that index the contents of an OCFL repository so humans and machines can discover data and connect to analytical, visualisation and computing tools. This is Michael Lynch’s diagram showing, from the left, how data can be “delivered” to a repository via standard tools (such as rsync) over SSH.
An indexing process using Solr (or another index such as Elasticsearch) builds an index of RO-Crate metadata that can then be used for search and faceted browsing over data. There is a user on the right, requesting access to a dataset and a security guard checking her credentials - the user has the rights to see datasets with a license - * - so in this case the system can serve the content to her.
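The two ideas in this diagram, faceted browsing and license-based access, can be sketched without a search engine at all. The following is a toy in-memory stand-in for the index, not how Solr or the portal actually implements it; the record fields ("type", "license") and the license names are hypothetical.

```python
from collections import Counter


def visible_to(records, user_licenses):
    """Keep only records whose license is one the user has been granted."""
    return [record for record in records if record.get("license") in user_licenses]


def facet_counts(records, field):
    """Count how many records carry each value of a facet field."""
    return Counter(record[field] for record in records if field in record)


# Hypothetical index records, as they might look after flattening crate metadata.
records = [
    {"name": "Dataset A", "type": "audio", "license": "open"},
    {"name": "Dataset B", "type": "audio", "license": "restricted"},
    {"name": "Dataset C", "type": "text", "license": "open"},
]

# A user holding only the "open" license sees two of the three datasets.
visible = visible_to(records, {"open"})
facets = facet_counts(visible, "type")
```

Applying the license filter before counting means the facet totals shown to a user never reveal datasets she is not entitled to see.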
Here is an example from the PARADISEC indexer. As per the Arkisto approach, the PARADISEC site is data driven - objects are stored on disk in OCFL using RO-Crate to describe each research object. Indexing tools walk the OCFL filesystem looking for RO-Crates then, using the crate metadata in addition to the OCFL inventory metadata, construct appropriate indexes into the content. In this example we can see version 1 of this item and the metadata we get from the OCFL inventory.
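A walk of this kind can be sketched as follows. This is not the PARADISEC indexer itself, just a minimal illustration under stated assumptions: object roots are recognised by their "0=ocfl_object_" marker file, the crate is stored under the logical path ro-crate-metadata.json, the root dataset has the id "./", and the shape of the index document (id, version, name) is our own invention.

```python
import json
import os


def is_ocfl_object_root(filenames):
    """OCFL object roots carry a marker file named 0=ocfl_object_<version>."""
    return any(name.startswith("0=ocfl_object_") for name in filenames)


def read_crate_from_object(object_root, crate_name="ro-crate-metadata.json"):
    """Return (inventory, crate) for the head version; crate is None if absent."""
    with open(os.path.join(object_root, "inventory.json"), encoding="utf-8") as f:
        inventory = json.load(f)
    # The version state maps each digest to the logical paths in that version.
    state = inventory["versions"][inventory["head"]]["state"]
    for digest, logical_paths in state.items():
        if crate_name in logical_paths:
            # The manifest maps the same digest to where the bytes live on disk.
            content_path = inventory["manifest"][digest][0]
            full_path = os.path.join(object_root, *content_path.split("/"))
            with open(full_path, encoding="utf-8") as f:
                return inventory, json.load(f)
    return inventory, None


def index_repository(repo_root):
    """Walk a storage root and build one small index document per crate found."""
    docs = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        if not is_ocfl_object_root(filenames):
            continue
        dirnames[:] = []  # an object was found; do not descend into its versions
        inventory, crate = read_crate_from_object(dirpath)
        if crate is None:
            continue
        root = next((e for e in crate["@graph"] if e.get("@id") == "./"), {})
        docs.append({
            "id": inventory["id"],          # from the OCFL inventory
            "version": inventory["head"],   # from the OCFL inventory
            "name": root.get("name"),       # from the RO-Crate metadata
        })
    return docs
```

The resulting documents combine both sources of metadata: the identifier and version come from the OCFL inventory, while the descriptive fields come from the crate.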
This is an example of a faceted search interface constructed using the Oni portal tool developed at UTS. This image shows a data-export from the Expert Nation https://expertnation.org/ project (tagline “Universities, War and 1920s & 30s Australia”) led by Associate Professor Tamson Pietsch. Professor Pietsch asked us to create an archival snapshot of the state of the dataset to support a book. We are working with Pietsch’s team to configure the portal to be useful in exporting the data. The SectorName facet is particularly important; it shows that the health sector was by far the biggest employer of returned service people.
Here is another dataset - this time we are looking at an RO-Crate sitting on a plain old web site (not a search portal). This is a screenshot of a map with a time-window function showing where one Laura Adams was convicted of 42 offences between 1918 and 1942. The power of the Arkisto platform, based on Standards, is that adding this kind of functionality to other collections with geographical features in them is a matter of writing a few simple bits of code. The component can be re-used because both the data and metadata use the RO-Crate standard (which in turn is built on other standards). The data in this demonstrator came from Alana Piper’s Criminal Characters project.
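As a rough idea of what those "few simple bits of code" might look like, here is a sketch of a time-window filter over crate entities. The modelling is an assumption made for illustration (events with a Schema.org startDate and a location pointing to a Place with GeoCoordinates); the Criminal Characters crate may well be structured differently, and the sample entities and coordinates below are invented, not taken from that dataset.

```python
from datetime import date


def points_in_window(graph, start, end):
    """Return (date, place name, latitude, longitude) for events inside the window."""
    by_id = {entity["@id"]: entity for entity in graph}
    points = []
    for entity in graph:
        when = entity.get("startDate")
        place_ref = entity.get("location")
        if not when or not place_ref:
            continue  # not an event with both a date and a place
        if not (start <= date.fromisoformat(when) <= end):
            continue  # outside the selected time window
        place = by_id.get(place_ref["@id"], {})
        geo = by_id.get(place.get("geo", {}).get("@id"), {})
        if "latitude" in geo and "longitude" in geo:
            points.append(
                (when, place.get("name"), float(geo["latitude"]), float(geo["longitude"]))
            )
    return sorted(points)


# Hypothetical entities in the flat style of an RO-Crate @graph.
graph = [
    {"@id": "#event-1", "@type": "Event", "startDate": "1920-03-01", "location": {"@id": "#place-a"}},
    {"@id": "#event-2", "@type": "Event", "startDate": "1935-07-15", "location": {"@id": "#place-b"}},
    {"@id": "#place-a", "@type": "Place", "name": "Place A", "geo": {"@id": "#geo-a"}},
    {"@id": "#place-b", "@type": "Place", "name": "Place B", "geo": {"@id": "#geo-b"}},
    {"@id": "#geo-a", "@type": "GeoCoordinates", "latitude": "-33.87", "longitude": "151.21"},
    {"@id": "#geo-b", "@type": "GeoCoordinates", "latitude": "-37.81", "longitude": "144.96"},
]

early = points_in_window(graph, date(1918, 1, 1), date(1925, 12, 31))
```

Because the function only relies on standard property names, the same filter could in principle be pointed at any crate that describes dated events at places.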
This is a screenshot of geographical data about a single offender’s sentences that has been exported into the Time Layered Cultural Map.
We are working on making this an automated service so that any Arkisto portal can be configured to display relevant geo-data and also to be able to export it for analysis to other tools via APIs including at large scale.
The researcher, Dr Alana Piper, says:
Analytical possibilities here would be uploading all offenders in bulk and comparing the 'range' results to determine what types of offences or other factors are associated with higher/lower levels of mobility.
A modern catalog driven from OCFL and RO-Crate. This is the landing page built from the content indexed into Elasticsearch. We can see the number of collections, items, contributors and universities at a glance. There are controls for jumping to a specific item or collection and a simple auto-complete search for quickly finding known content. The bottom half is a dynamic list of the most recently updated items.
PARADISEC has viewers for various content types: video and audio with time-aligned transcriptions, image set viewers, and document viewers (XML, PDF and Microsoft formats). We are working on making these viewers available across Arkisto sites by having a standard set of hooks for adding viewer plugins to a site as needed.
PARADISEC has advanced search and deep indexing into transcriptions with the ability to play segments directly from the search interface.
This is another Arkisto based website - it’s a confidential, access-controlled database of successful grant applications built using an OCFL repository and RO-Crate objects and presented by the Oni portal.
The Arkisto website has a growing list of use cases for different data pipelines - here’s a sketch of the architecture we’re working on for Associate Professor Shauna Murray’s group at UTS - managing data from a sensor network in estuaries along the NSW coast.
See the Use Cases page for more.
Arkisto is a flexible research platform which can be used to assemble a variety of data pipelines, for a variety of disciplines.
The emphasis is on FIRST keeping data safe and re-usable by storing and describing it using standards, so that in the absence of budget and resources to maintain complex virtual labs the data are still available for re-use. We THEN use our growing set of interoperable tools to build data hubs with re-usable data viewer plugins and standards-based interoperable analytical services.
There are active projects underway at the University of Melbourne and University of Technology Sydney across a wide range of disciplines and we are seeking funding to enhance the platform.