Publishing versioned datasets using OCFL and nginx
Mike Lynch – University of Technology Sydney
ARDC funding - Data and Services Discovery projects - Institutional Role in a Data Commons

Data repositories

There are a lot of specialized repository applications, from small (Omeka) to large (Hydra, Fedora), all designed as special-purpose homes for datasets and metadata which provide APIs for getting things in and out.

Data repositories

Experience has shown that these solutions don’t scale. Eventually, an institution has to store a dataset that’s too big to ingest, to retrieve, or to store at all, and must fall back on a workaround such as putting the data on disk and pointing to it from a record in the repository.

Oxford Common File Layout
A standard layout for static, file-based repositories
Human and machine readable (JSON)
Lightweight file-level versioning

The Oxford Common File Layout (OCFL) is a standard for laying out arbitrary collections of files as structured repositories. It grew out of a push from the repositories community for a repository structure that wasn’t locked in to a particular application, didn’t have scaling problems, and was easy to migrate to and from.

An OCFL repository is a collection of OCFL Objects, laid out according to a simple standard like PairTree. Objects are immutable and versioned. More details can be found in our FAIR Repositories presentation.
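To illustrate how a layout like PairTree works, the sketch below maps an object identifier to a shallow directory path by splitting it into two-character segments. The identifier is invented, and the real pairtree algorithm also escapes special characters, which is omitted here.

```python
# Minimal sketch of a pairtree-style path mapping, one of the layouts
# an OCFL repository can use to place objects on disk.
# The identifier below is a made-up example.

def pairtree_path(identifier: str) -> str:
    """Split an identifier into two-character segments to form a
    shallow, predictable directory hierarchy."""
    # The real pairtree spec also "cleans" special characters; omitted here.
    parts = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    return "/".join(parts)

print(pairtree_path("uts71298"))  # ut/s7/12/98
```

Because the mapping is deterministic, any client that knows an object's identifier can compute its location without consulting a database.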

We’re using OCFL for our data publications repository, and will later use it for an internal data repository with access control.

Research Objects + DataCrate
JSON-LD metadata

RO-Crate is a standard for describing individual datasets which evolved from two previous standards, DataCrate and Research Objects. Datasets are described using JSON-LD, a standard for building linked data descriptions in JSON. The ontology is schema.org, which is widely supported in industry and used by Google’s dataset search engine.

A “crated” dataset is a directory with an arbitrary file hierarchy inside it, and an RO-Crate JSON-LD document with contextual metadata (title, description, contributors, licences) and descriptions of some or all of the contents. An RO-Crate doesn’t have to describe every file inside it, as this would be impractical for some datasets.
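A minimal sketch of what such an RO-Crate metadata document (`ro-crate-metadata.json`) looks like is below; the dataset title, filenames and descriptions are invented for illustration.

```python
import json

# Sketch of a minimal RO-Crate JSON-LD document. The metadata descriptor
# entity points at the root Dataset ("./"), which in turn lists files
# via hasPart. All descriptive values here are invented examples.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Example field recordings",
            "description": "A made-up dataset used to illustrate the layout.",
            "license": {"@id": "https://creativecommons.org/licenses/by/4.0/"},
            "hasPart": [{"@id": "recordings/session01.wav"}],
        },
        {
            "@id": "recordings/session01.wav",
            "@type": "File",
            "name": "Session 1 recording",
        },
    ],
}

print(json.dumps(crate, indent=2))
```

Note that `hasPart` can list as many or as few of the payload files as is practical for the dataset in question.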

An RO-Crate doesn’t even have to have any files other than the RO-Crate files. This allows the system to support metadata-only publication, where the actual data is only available on request.

Research Objects + DataCrate
JSON-LD metadata
HTML preview

An RO-Crate also contains an HTML document which is generated automatically from the JSON-LD. This provides a human-readable view of the metadata with links to the data payloads. It can be used locally or act as the landing page for an RO-Crate published on a web server.
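As a rough illustration of that generation step (the real preview is produced by RO-Crate tooling, and the metadata values below are invented), a landing page can be derived from the root Dataset entity of the JSON-LD:

```python
from html import escape

# Illustrative sketch only: derive a simple HTML landing page from the
# root Dataset entity of an RO-Crate's JSON-LD. Values are invented.
def preview_html(root: dict) -> str:
    links = "".join(
        f'<li><a href="{escape(p["@id"])}">{escape(p["@id"])}</a></li>'
        for p in root.get("hasPart", [])
    )
    return (
        f"<h1>{escape(root['name'])}</h1>"
        f"<p>{escape(root['description'])}</p>"
        f"<ul>{links}</ul>"
    )

root = {"name": "Example dataset", "description": "Invented metadata.",
        "hasPart": [{"@id": "data/table.csv"}]}
print(preview_html(root))
```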

The HTML document is rendered by some lightweight JavaScript, but degrades gracefully in browsers that don’t run JavaScript or for users who have disabled it.

nginx and Solr
fast, versioned web endpoints
Solr indexes metadata and licences
licences enforced on both search results and payloads

OCFL and RO-Crate are the standards we’re using to lay out the repository. To deliver them, we’re using Solr, an efficient search engine, and nginx, a high-performance open-source web server.

Metadata from the RO-Crates is indexed into Solr, including licences representing lists of users who are allowed to access the dataset. Solr is also used to provide a discovery interface via a lightweight single-page application.
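As a sketch of the indexing step, the root Dataset entity of a crate can be flattened into a flat Solr document; all of the field names here (`name_t`, `licence_id`, `allowed_ss`) are assumptions for illustration, not the actual schema.

```python
# Hypothetical sketch of flattening an RO-Crate root entity into a
# Solr document. Field names are invented; allowed_ss holds the list
# of users permitted to download the dataset's payloads.
def to_solr_doc(crate_id: str, root: dict, allowed_users: list) -> dict:
    return {
        "id": crate_id,
        "name_t": root.get("name", ""),
        "description_t": root.get("description", ""),
        "licence_id": root.get("license", {}).get("@id", ""),
        "allowed_ss": allowed_users,
    }

doc = to_solr_doc(
    "uts71298",
    {"name": "Example dataset",
     "license": {"@id": "internal-uts-only"}},
    ["alice"],
)
print(doc["licence_id"], doc["allowed_ss"])
```

A document like this would then be posted to Solr's update endpoint in the usual way.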

A simple extension to nginx allows it to resolve incoming URLs to paths in the OCFL repository and serve the appropriate version of the RO-Crate metadata and payload files.
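The resolution step relies on the OCFL inventory: each version's `state` maps content digests to logical paths, and the `manifest` maps digests to actual paths on disk. The sketch below (with a hand-written example inventory, not real repository data) shows the idea.

```python
# Sketch of resolving a (version, logical path) request through an
# OCFL inventory to a content path on disk. The inventory is a
# hand-written example.
inventory = {
    "head": "v2",
    "manifest": {
        "abc123": ["v1/content/data.csv"],
        "def456": ["v2/content/data.csv"],
    },
    "versions": {
        "v1": {"state": {"abc123": ["data.csv"]}},
        "v2": {"state": {"def456": ["data.csv"]}},
    },
}

def resolve(inventory: dict, version: str, logical_path: str) -> str:
    """Map a logical path in a given version to a content path."""
    state = inventory["versions"][version]["state"]
    for digest, paths in state.items():
        if logical_path in paths:
            # Any manifest entry for this digest has identical content.
            return inventory["manifest"][digest][0]
    raise FileNotFoundError(logical_path)

print(resolve(inventory, "v1", "data.csv"))  # v1/content/data.csv
```

Because unchanged files keep the same digest across versions, serving an old version never requires duplicated content.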

Before serving a file, the nginx handler queries the Solr index to check whether the authenticated user is authorised to download it.
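The access decision itself might look like the following sketch; the field names and the open-access licence URL are assumptions for illustration.

```python
from typing import Optional

# Hedged sketch of the per-request access check. Field names
# (licence_id, allowed_ss) and the open-licence set are invented.
OPEN_LICENCES = {"https://creativecommons.org/licenses/by/4.0/"}

def can_download(solr_doc: dict, user: Optional[str]) -> bool:
    if solr_doc.get("licence_id") in OPEN_LICENCES:
        return True  # openly licensed datasets need no authentication
    # Otherwise the user must be on the licence's allowed list.
    return user is not None and user in solr_doc.get("allowed_ss", [])

doc = {"licence_id": "internal-uts-only", "allowed_ss": ["alice"]}
print(can_download(doc, "alice"), can_download(doc, None))  # True False
```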

In this way, nginx enforces licences on both the search results and the payloads.

In production for UTS data publication repository
PARADISEC – crosswalking to OCFL and RO-Crate
State Library of NSW – Mitchell collection of public domain books

The OCFL/RO-Crate and nginx stack is intended to accommodate datasets from a wide range of disciplines and sources: small web uploads, existing repositories and data capture systems.

As part of the ARDC Data and Services Discovery project, we collaborated with Nick Thieberger and Marco La Rosa of PARADISEC and Euwe Ermita’s team at the State Library of NSW. Both of these institutions have rich digital humanities collections stored in custom repositories behind APIs, and in short one- or two-day workshops we were able to make substantial progress in crosswalking their datasets into OCFL.

Links and acknowledgements
Docker: mikelynch/nginx-ocfl
This research/project is supported by the Australian Research Data Commons (ARDC). The ARDC is enabled by NCRIS.

Creative Commons Licence
This work is licensed under a Creative Commons Attribution 3.0 Australia License.