This post is an introduction to the Provisioner, an open framework for research data management which we're developing in collaboration with the Queensland Cyber Infrastructure Foundation (QCIF) and the Australian National Data Service (ANDS).
Provisioner grew out of a project funded by the UTS IT Capital Management Program, which is, confusingly, also called Provisioner. In this post, I'll use "Provisioner" to refer to the framework and software, not the project as a whole.
The original goal of the project was to provide a data pipeline which would manage the transfer of microscopy data from the Microbial Imaging Facility (MIF) to the visualisation environment at the Data Arena, with some analysis and processing on the way.
During the initial scoping and design work for the project, the middleware system to support the transfer of data from MIF to the Data Arena started to look like a desirable technology in itself. This middleware is what became the Provisioner. The final design for Provisioner is an attempt to address two problems:
- How do we support end-to-end data management without taking researchers away from the tools they use for research or requiring them to manually enter administrative data?
- How do we integrate data management into research tools without having to customise a growing list of software platforms?
The best introduction to Provisioner is to describe how most researchers will interact with it.
Stash is the UTS research data catalogue, allowing researchers to create RDMPs (research data management plans) and describe datasets. Stash is currently implemented in ReDBox 1.9, an open source platform maintained by QCIF. As part of this project, QCIF are reimplementing ReDBox using modern web frameworks.
The new version of ReDBox will include a service catalogue, which will allow researchers to request a range of IT services for use with a project that has an RDMP.
This diagram shows where we'd like to be in a few years, with Provisioner managing research data across:
- file storage on local or cloud services
- research code repositories in GitLab
- dedicated microscopy management from OMERO
- electronic lab notebooks from LabArchives
- secure data collection surveys from REDCap
- and other services that will be requested by our researchers
The current project is aiming to provide connectors to GitLab, OMERO and LabArchives.
The Provisioner terminology for these services is research apps (or just apps), and a directory, site or project belonging to a researcher in a particular app is referred to as a workspace.
When a researcher requests a workspace via the service catalogue, Stash creates a workspace record in its own database which is linked to the user and RDMP. It then calls the Provisioner API to request a new workspace. The Provisioner API translates this request into the actual operations needed to provision the workspace in that particular research app, and writes a metadata file into the workspace which links it to the new workspace record, the researcher and the RDMP.
The Stash interface then reports back that the workspace has been created, and gives the researcher a URI and instructions for how to access it.
This has given the researcher a resource which they can start using, and has ensured via the metadata file that the resource can be linked back to the RDMP and project which govern it.
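To make the request flow above more concrete, here is a minimal Python sketch. It is not the actual Stash or Provisioner code: `WorkspaceRecord`, `provision_workspace`, `request_workspace` and the `provisioner-metadata.json` filename are all hypothetical names, and a local directory stands in for a workspace in a real research app.

```python
import json
import uuid
from dataclasses import dataclass
from pathlib import Path
from typing import Tuple


@dataclass
class WorkspaceRecord:
    """Hypothetical catalogue-side record linking a workspace to a user and RDMP."""
    record_id: str
    researcher: str
    rdmp_id: str
    app: str


def provision_workspace(record: WorkspaceRecord, root: Path) -> Path:
    """Stand-in for the Provisioner API: create the workspace, then write
    a metadata file linking it to the record, researcher and RDMP."""
    workspace = root / record.app / record.record_id
    workspace.mkdir(parents=True)
    metadata = {
        "workspace_record": record.record_id,
        "researcher": record.researcher,
        "rdmp": record.rdmp_id,
    }
    # The filename is an assumption made for this sketch only.
    (workspace / "provisioner-metadata.json").write_text(json.dumps(metadata, indent=2))
    return workspace


def request_workspace(researcher: str, rdmp_id: str, app: str, root: Path) -> Tuple[WorkspaceRecord, str]:
    """Stand-in for the catalogue side: create the record first, then ask
    the provisioner for the workspace and report its URI back."""
    record = WorkspaceRecord(str(uuid.uuid4()), researcher, rdmp_id, app)
    workspace = provision_workspace(record, root)
    return record, workspace.as_uri()
```

Calling `request_workspace("jane", "rdmp-42", "filestore", Path("/tmp/workspaces"))` would return the new record and a `file://` URI. The point of the sketch is the ordering: the record exists before the workspace does, and the workspace carries its own link back to that record.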
Once a researcher has used it to get access to a research app, Provisioner will get out of their way and let them use the app's native interface. Even this minimal use-case will be a big step forward in terms of research data management: we'll have a system which can make sure that new research IT storage and workspaces contain documentation, in the form of the Provisioner metadata, which describes the data and links back to the RDMP and researcher who are responsible for it.
The exact spec of the Provisioner metadata is evolving as we develop the system. We're going to use DataCrate, an evolving standard for packaging and documenting research data. DataCrate provides an HTML version which can be used as a landing page or manifest when a dataset is published - here's a sample dataset of LIDAR data from the Wombeyan caves.
The same data is also provided as a JSON-LD document which is easily machine-readable and follows linked open data principles.
DataCrate is flexible about the level of detail it can provide: it's possible but not necessary to go down to the level of individual files. This means that early in the research life-cycle, a workspace can have broad, high-level descriptions, which can be filled in as required when a dataset is moved into the publication workflow.
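As a rough illustration of that flexibility, the sketch below builds a JSON-LD description that starts out broad and only lists individual files when asked. It uses schema.org terms (`Dataset`, `hasPart`, `MediaObject`) as an assumption about what such metadata could look like. It does not follow the DataCrate spec, and `describe_workspace` is a hypothetical helper.

```python
import json


def describe_workspace(name, description, creator, files=None):
    """Build a JSON-LD description of a workspace.

    With no files, only a high-level Dataset entry is produced; passing
    a list of paths adds one entry per file and links them via hasPart.
    """
    dataset = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "creator": {"@type": "Person", "name": creator},
    }
    graph = [dataset]
    if files:
        dataset["hasPart"] = [{"@id": path} for path in files]
        graph.extend({"@id": path, "@type": "MediaObject"} for path in files)
    return {"@context": "https://schema.org/", "@graph": graph}


# Early in the life-cycle: a broad description only.
broad = describe_workspace("Cave scans", "LIDAR survey data", "A. Researcher")

# Closer to publication: the same description, now down to file level.
detailed = describe_workspace(
    "Cave scans", "LIDAR survey data", "A. Researcher",
    files=["scan-01.las", "scan-02.las"],
)
print(json.dumps(detailed, indent=2))
```

The names and file paths here are placeholders, not taken from a real dataset.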
The workspace interface in Stash will allow researchers to carry out high-level management tasks. By "high-level", we mean operations which make sense when applied to a workspace in any research app, rather than those which only apply to, for example, microscopy, or file storage.
The range of high-level operations will include:
- sharing a workspace with other researchers
- setting a workspace to be immutable (read-only)
- making a workspace from one app available in a different research app
The Provisioner API is being designed to provide a common interface for these operations. Each research app will have an adaptor which translates these API calls into the native API for that app.
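The adaptor idea can be sketched in Python as follows. This is an illustration of the pattern, not the Provisioner API itself: the class and method names (`AppAdaptor`, `InMemoryAdaptor`, `Provisioner.register` and so on) are hypothetical, and the toy adaptor stores state in memory rather than calling any real app's native API.

```python
from abc import ABC, abstractmethod


class AppAdaptor(ABC):
    """Common interface each research app adaptor would implement (hypothetical)."""

    @abstractmethod
    def share(self, workspace_id: str, researcher: str) -> None:
        """Give another researcher access to the workspace."""

    @abstractmethod
    def set_immutable(self, workspace_id: str) -> None:
        """Make the workspace read-only."""


class InMemoryAdaptor(AppAdaptor):
    """Toy adaptor: records state in memory where a real one would
    translate each call into the app's native API."""

    def __init__(self):
        self.members = {}
        self.read_only = set()

    def share(self, workspace_id, researcher):
        self.members.setdefault(workspace_id, set()).add(researcher)

    def set_immutable(self, workspace_id):
        self.read_only.add(workspace_id)


class Provisioner:
    """Routes each high-level operation to the adaptor registered for an app."""

    def __init__(self):
        self._adaptors = {}

    def register(self, app: str, adaptor: AppAdaptor) -> None:
        self._adaptors[app] = adaptor

    def share(self, app: str, workspace_id: str, researcher: str) -> None:
        self._adaptors[app].share(workspace_id, researcher)

    def set_immutable(self, app: str, workspace_id: str) -> None:
        self._adaptors[app].set_immutable(workspace_id)


provisioner = Provisioner()
gitlab = InMemoryAdaptor()
provisioner.register("gitlab", gitlab)
provisioner.share("gitlab", "ws-1", "alice")
provisioner.set_immutable("gitlab", "ws-1")
```

The caller only ever names the app and the operation. Supporting a new research app then means writing one more adaptor, with no change to the code that issues the requests.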
The third operation listed above, making a workspace from one app available in another, is how Provisioner will satisfy the initial requirement: a data pipeline from MIF to the Data Arena. It also has the potential to support many more useful processes, such as data archiving and publication. I'll go into more detail about this in the next blog post.