This is a presentation by Mike Lynch, Peter Sefton and Sharyn Wise, delivered at eResearch Australasia 2018 by Mike Lynch.

A Framework for Integrated Research Data Management
With services for planning, provisioning research storage and applications and describing and packaging research data
Mr Michael Lynch, Dr Peter Sefton, Ms Sharyn Wise



Provisioner
Integrate research data management into research applications
Allow researchers to self-provision research apps
Apply lessons learned from earlier generations (Data Capture)
Give researchers something, get out of the way
We didn’t want to build a monolith
Small parts, loosely joined, data-centric

Notes - Slide 2

Provisioner grew out of two conflicting requirements:

  • We want to be able to integrate research data management into the tools researchers actually use to do their research, giving them something besides data management: access to facilities and software, easier publication, etc.
  • We didn’t want to build a monolith, and wouldn't be allowed to if we did



Case study: microscope to Data Arena
The research pipeline that started it all:
Microscope video of bacteria Pseudomonas aeruginosa
Image recognition and tracking of bacteria
Simulation of bacteria behavior
3D immersive visualisation of simulated bacteria
Practical research: reduce risk of hospital infection
What if we could automate provenance for this pipeline?

Notes - Slide 3

A real case of complex data management.

Our original idea was to put a data repository in for Data Arena customers, from which the DA team would download datasets.

But the DA team just wanted NFS mounts, and repositories aren’t a good fit for large datasets - high-end visualization and data science both use filesystems.

What they could use: a git repository to manage their code and pipelines (GitLab)

On the other end, the MIF have long wanted to use OMERO for microscopy.

Our original plan was for a point-to-point integration which would only service two flagship installations.

The key insight out of which Provisioner grew was: what if we could connect every research app to every other one? (eventually)




Apps and workspaces
App = a research application – can be specific or general
OMERO – dedicated microscopy app
GitLab – for research software development
Coming next: file shares, ELNs
Workspace = a research team’s thing within an app
A workspace is linked to an RDMP and an owner
Once it’s created, the researcher goes directly to the app

Notes - Slide 4

Note that the term "workspace" is already being used by another UTS project – we were gazumped – but we still think it’s the best terminology for the abstraction we're trying to describe.



Workspace API
The workspace abstraction lets us capture common high-level operations:
Create a new workspace for an RDMP
Share a workspace with colleagues
Export data (to a data record or another workspace)
Import data
We are building this out incrementally: create and export are first
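
One way to picture the workspace abstraction is as a small interface that each app adapter implements. The sketch below is illustrative only: the names `Workspace`, `WorkspaceAdapter` and `InMemoryAdapter`, the method signatures and the example identifiers are all hypothetical and are not the Provisioner or ReDBox API. Only create and export are filled in, mirroring the incremental plan.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """A research team's space within an app, linked to an RDMP and an owner."""
    app: str
    rdmp_id: str
    owner: str
    name: str
    shared_with: list = field(default_factory=list)


class WorkspaceAdapter(ABC):
    """Hypothetical per-app adapter exposing the common workspace operations."""

    @abstractmethod
    def create(self, rdmp_id: str, owner: str, name: str) -> Workspace: ...

    @abstractmethod
    def export(self, workspace: Workspace, target: str) -> dict: ...

    def share(self, workspace, colleague):
        raise NotImplementedError("share is planned, not built yet")

    def import_data(self, workspace, source):
        raise NotImplementedError("import is planned, not built yet")


class InMemoryAdapter(WorkspaceAdapter):
    """Stand-in for a real app such as GitLab or OMERO; talks to no server."""

    def __init__(self, app: str):
        self.app = app
        self.workspaces = []

    def create(self, rdmp_id, owner, name):
        ws = Workspace(app=self.app, rdmp_id=rdmp_id, owner=owner, name=name)
        self.workspaces.append(ws)
        return ws

    def export(self, workspace, target):
        # A real adapter would package files; this only describes the export.
        return {"from": f"{workspace.app}/{workspace.name}",
                "to": target,
                "rdmp": workspace.rdmp_id}


adapter = InMemoryAdapter("gitlab")
ws = adapter.create(rdmp_id="rdmp-001", owner="researcher@example.org",
                    name="tracking-code")
record = adapter.export(ws, target="data-record")
```

Keeping share and import as explicit "not built yet" stubs lets a caller treat every app the same way while the operations are rolled out one at a time.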



DataCrate – integrated metadata
Builds on widely-supported standards – BagIt and JSON-LD
Provide linked metadata in human- and machine-readable forms
Metadata is still useful outside Provisioner
Targeting schema.org as shared vocabulary
Capture technical metadata with instrument
Capture provenance metadata with createAction and updateAction
Data by reference with tools to fetch files as needed
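
As a rough illustration of linked metadata with provenance, the sketch below builds a small JSON-LD graph using schema.org terms (`Dataset`, `MediaObject`, `CreateAction`, `instrument`, `result`). It is not a complete or validated DataCrate catalog: the identifiers, names and file path are invented, and the helper `created_by` is hypothetical.

```python
import json

# Illustrative JSON-LD only: identifiers, names and file paths are invented,
# and this is not a complete or validated DataCrate catalog.
catalog = {
    "@context": "https://schema.org/",
    "@graph": [
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Example microscopy export",
            "hasPart": [{"@id": "video/capture-01.tif"}],
        },
        {
            "@id": "video/capture-01.tif",
            "@type": "MediaObject",
            "name": "Microscope capture",
        },
        {
            "@id": "#microscope-1",
            "@type": "Thing",
            "name": "Example microscope",
        },
        {
            # Provenance: which instrument produced which file.
            "@id": "#capture-action-1",
            "@type": "CreateAction",
            "instrument": {"@id": "#microscope-1"},
            "result": {"@id": "video/capture-01.tif"},
        },
    ],
}


def created_by(graph, file_id):
    """Return the names of instruments of CreateActions whose result is file_id."""
    by_id = {item["@id"]: item for item in graph}
    return [
        by_id[action["instrument"]["@id"]]["name"]
        for action in graph
        if action["@type"] == "CreateAction"
        and action["result"]["@id"] == file_id
    ]


print(json.dumps(catalog, indent=2))
print(created_by(catalog["@graph"], "video/capture-01.tif"))
```

Because the graph is plain JSON-LD, the same file can be read by generic tools outside Provisioner, which is the point of building on shared vocabularies.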



ReDBox 2.0
RDMP and dataset description tool
Improved metadata collection and integration
Service catalogue for provisioning workspaces in apps
Data publication workflow
Modern re-implementation (Node.js, Angular, Mongo)
More maintainable
No more curation – web rather than db principles

Notes - Slide 7

Metadata integration takes a "just-in-time" approach, where it's collected only when it's needed.

Each metadata record is derived from the previous one:

  • Research data management plan
  • Data record
  • Data publication
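
The derivation chain above can be sketched as a function that copies the previous record and adds only the fields needed at the next stage. This is a minimal sketch under assumptions: the field names (`title`, `ci`, `location`, `licence`), the identifiers and the `derive` helper are all invented for illustration and are not ReDBox's record schema.

```python
def derive(previous: dict, record_type: str, **new_fields) -> dict:
    """Copy the previous record and add only the fields this stage needs."""
    record = dict(previous)
    record.update(new_fields)
    record["type"] = record_type
    record["derived_from"] = previous.get("id")
    return record


# Hypothetical records; every value here is made up.
rdmp = {"id": "rdmp-001", "type": "rdmp",
        "title": "Example project", "ci": "A. Researcher"}

# Storage location is only asked for once there is data to describe.
data_record = derive(rdmp, "data-record",
                     id="rec-001", location="/shares/example")

# Licence is only asked for at publication time.
publication = derive(data_record, "data-publication",
                     id="pub-001", licence="CC-BY-4.0")
```

The title entered in the plan is carried through to the publication without being re-keyed, while each stage records which record it came from.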



Agile development!
[Diagram: the "Original design" (ReDBox plus a Shared API / Orchestrator) beside the "End product" (ReDBox only)]

Notes - Slide 8

On the left is the original design, which had a separate Provisioner component (in blue) that orchestrated API requests and talked to the research apps.

On the right is the product we built, with no separate orchestrator, and modules within ReDBox which drive the apps

  • What happened to that orchestration layer?
  • Needed to break the abstraction to authenticate to GitLab
  • This is what happens when concepts for APIs hit real life
  • The minimum viable product didn’t end up including it
  • This is a better design: simpler, and we can implement orchestration/queueing if and when it’s needed



Public data repository
[Diagram: ReDBox 2.0, Solr index, Angular, staging server, public server]

Notes - Slide 9

At present, we have public datasets in our ReDBox 1.9 instance being fed to RDA.

Data publication is manual: once the metadata is ready we (actually, I) put it on an nginx web server

The new model isn’t that different from this: the repository is a filesystem with static DataCrates. The HTML catalogs in these act as landing pages.

We build a Solr index from the machine-readable metadata in the DataCrates, and the user interface to this is a single-page Angular app.
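
The indexing step might look something like the sketch below, which walks a directory of static crates and builds one flat document per crate, ready to be sent to a search index. It stops short of contacting Solr. The catalog file name `CATALOG.json`, the `./` root identifier, the indexed fields and the function name `index_docs` are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path

CATALOG_NAME = "CATALOG.json"  # assumed name of the machine-readable catalog


def index_docs(repo_root: Path) -> list:
    """Build one flat, index-ready document per crate found under repo_root."""
    docs = []
    for catalog_path in sorted(repo_root.rglob(CATALOG_NAME)):
        graph = json.loads(catalog_path.read_text())["@graph"]
        # Assume the dataset itself is the graph item with @id "./".
        root = next(item for item in graph if item.get("@id") == "./")
        docs.append({
            "id": str(catalog_path.parent.relative_to(repo_root)),
            "name": root.get("name", ""),
            "description": root.get("description", ""),
        })
    return docs


# Demo against a throwaway repository containing a single crate.
with tempfile.TemporaryDirectory() as tmp:
    crate = Path(tmp) / "crate-a"
    crate.mkdir()
    (crate / CATALOG_NAME).write_text(json.dumps(
        {"@graph": [{"@id": "./", "@type": "Dataset", "name": "Crate A"}]}))
    docs = index_docs(Path(tmp))
```

Since the index is derived entirely from files on disk, it can be rebuilt from scratch at any time on the public server.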

This is much better for security: we can host the website on a public-facing VM with just nginx and Solr, and keep ReDBox inside the firewall.

Immutable datasets – we’re in discussion with a research group who want to publish their live data – large networking datasets which they’re in the process of refining.

We want to let them do this transparently while still preserving integrity, so that a given DOI points to a determinate snapshot of their data.

Oxford Common File Layout for the static DataCrate repository
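
The idea of pinning a DOI to a fixed snapshot while the live data keeps changing can be sketched with content digests and an append-only version list. This is a simplified illustration in the spirit of versioned, digest-based storage, not an implementation of the Oxford Common File Layout; the class `VersionedDataset`, its methods, the file name and the placeholder DOI are all invented.

```python
import hashlib


def snapshot(files: dict) -> dict:
    """Map each logical path to the SHA-512 digest of its content."""
    return {path: hashlib.sha512(data).hexdigest()
            for path, data in files.items()}


class VersionedDataset:
    """Append-only list of snapshots; a DOI is pinned to one version number."""

    def __init__(self):
        self.versions = []   # versions[0] is v1, versions[1] is v2, ...
        self.dois = {}       # DOI string -> version number

    def commit(self, files: dict) -> int:
        self.versions.append(snapshot(files))
        return len(self.versions)

    def mint(self, doi: str, version: int) -> None:
        if doi in self.dois:
            raise ValueError("a DOI must keep pointing at the same snapshot")
        self.dois[doi] = version

    def resolve(self, doi: str) -> dict:
        return self.versions[self.dois[doi] - 1]


ds = VersionedDataset()
v1 = ds.commit({"flows.csv": b"a,b\n1,2\n"})
ds.mint("10.0000/example.1", v1)                   # placeholder DOI, not real
v2 = ds.commit({"flows.csv": b"a,b\n1,2\n3,4\n"})  # the live data moves on
```

After the second commit, resolving the DOI still returns the digests from the first version, so a reader can verify they have exactly the data that was cited.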




The original case study
How much have we got?
Export from OMERO project to a DataCrate, with linked data recording technical metadata
Link the DataCrate to a ReDBox data record
Create GitLab workspace with the DataCrate, data available by reference / as needed
Export GitLab workspace to a new data record
Publish data



What’s next?

Notes - Slide 11

This is a diagram drawn by Gerrad Barthelot, head of our architecture team.

It shows where we’d like to be in a few years – with an expanded range of services and research apps with Provisioner adapters to Stash 3 (ReDBox 2).

Note that this diagram shows the original Provisioner design with an orchestration layer between ReDBox and each research app.

For now, we’re continuing with the simple model of ReDBox hooks acting as adapters directly talking to apps, and we’ll use middleware between these and the apps if and when we need it, on an app-by-app basis.

Not all of the connections will be fully automated – some may put a request into ServiceConnect, or even just send an email




Thank you