This is a presentation by Mike Lynch, Peter Sefton and Sharyn Wise, delivered at eResearch Australasia 2018 by Mike Lynch.
Notes - Slide 2
Provisioner grew out of two conflicting requirements:
- We want to be able to integrate research data management into the tools researchers actually use to do their research, giving them something besides data management – access to facilities and software, easier publication, etc
- We didn’t want to build a monolith, and wouldn't be allowed to if we did
Notes - Slide 3
A real case of complex data management.
Our original idea was to put a data repository in for Data Arena customers, from which the DA team would download datasets.
But the DA team just wanted NFS mounts, and repositories aren’t a good fit for large datasets - high-end visualization and data science both use filesystems.
What they could use: a git repository to manage their code and pipelines (GitLab)
On the other end, the MIF have long wanted to use OMERO for microscopy.
Our original plan was for a point-to-point integration which would only service two flagship installations.
The key insight out of which Provisioner grew was: what if we could connect every research app to every other one? (eventually)
Note that the term "workspace" is already being used by another UTS project – we were gazumped – but we still think it’s the best terminology for the abstraction we're trying to describe.
Notes - Slide 4
Notes - Slide 7
Metadata integration takes a "just-in-time" approach, where it's collected only when it's needed.
Each metadata record is derived from the previous one:
- Research data management plan
- Data record
- Data publication
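The derivation chain above can be sketched as follows. This is a minimal illustration using plain dicts; the field names and the `derive` helper are hypothetical, not the actual ReDBox data model.

```python
# Sketch of "just-in-time" metadata: each record is derived from the
# previous one, carrying shared fields forward and adding stage-specific
# metadata only when it is actually needed. Field names are hypothetical.

def derive(previous: dict, stage: str, extra: dict) -> dict:
    """Copy the previous record and layer on stage-specific metadata."""
    record = dict(previous)          # inherit everything collected so far
    record["stage"] = stage
    record.update(extra)             # collect new metadata only now
    return record

# Stage 1: research data management plan, captured at project start
rdmp = {
    "stage": "rdmp",
    "title": "Networking measurements",
    "contributors": ["A. Researcher"],
}

# Stage 2: data record, derived once a dataset actually exists
data_record = derive(rdmp, "data-record", {"storage": "nfs://share/project"})

# Stage 3: data publication, derived only when the dataset is released
publication = derive(data_record, "publication", {"license": "CC-BY-4.0"})
```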
Notes - Slide 8
On the left is the original design, which had a separate Provisioner component (in blue) that orchestrated API requests and talked to the research apps.
On the right is the product we built, with no separate orchestrator, and modules within ReDBox which drive the apps
- What happened to that orchestration layer?
- Needed to break the abstraction to authenticate to GitLab
- This is what happens when concepts for APIs hit real life
- The minimum viable product didn’t end up including it
- This is a better design: simpler, and we can implement orchestration/queueing if and when it’s needed
Notes - Slide 9
At present, we have public datasets in our ReDBox 1.9 instance being fed to RDA.
Data publication is manual: once the metadata is ready we (actually, I) put it on an nginx web server
The new model isn’t that different from this: the repository is a filesystem with static DataCrates. The HTML catalogs in these act as landing pages.
We build a solr index from the machine-readable metadata in the DataCrates, and the user interface to this is a single-page Angular app.
Much better for security: we can host the website on a public-facing VM running just nginx and Solr, and keep ReDBox inside the firewall.
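The indexing step can be sketched like this: each DataCrate carries machine-readable JSON-LD metadata (a CATALOG.json with a flattened @graph), from which we build a flat document for Solr. The Solr field names and the helper below are illustrative assumptions, not the actual indexer.

```python
import json

# Sketch of turning a DataCrate's machine-readable metadata into a flat
# document suitable for a Solr index. Assumes each crate has a
# CATALOG.json whose @graph contains a root Dataset entity; the Solr
# field names chosen here are hypothetical.

def solr_doc(catalog_json: str) -> dict:
    catalog = json.loads(catalog_json)
    graph = catalog.get("@graph", [])
    # Find the root dataset entity in the flattened JSON-LD graph
    dataset = next(e for e in graph if e.get("@type") == "Dataset")
    return {
        "id": dataset.get("@id"),
        "name_s": dataset.get("name"),
        "description_t": dataset.get("description"),
    }

# A minimal example crate, stripped down to the fields we index
example = json.dumps({
    "@graph": [
        {"@id": "./", "@type": "Dataset",
         "name": "Example crate", "description": "A test dataset"}
    ]
})
```

A real indexer would walk the repository filesystem and POST each document to Solr; the single-page Angular app then queries that index.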
Immutable datasets – we’re in discussion with a research group who want to publish their live data – large networking datasets which they’re in the process of refining.
We want to let them do this transparently while preserving integrity, so that a given DOI points to a determinate snapshot of their data.
Oxford Common File Layout (OCFL) for the static DataCrate repository
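The immutability idea can be sketched as follows: each published version is frozen and identified by a digest of its content, so later refinements create a new version rather than mutating the old one. This mirrors the versioning model of the Oxford Common File Layout; the structure and names below are illustrative, not the OCFL layout itself.

```python
import hashlib

# Sketch of pointing a DOI at a determinate snapshot: every publication
# freezes the current files as a new immutable version, identified by a
# content digest. Refining the live data adds v2; v1 stays unchanged.

def snapshot(versions: dict, files: dict) -> str:
    """Freeze the current files as a new immutable version."""
    digest = hashlib.sha512(
        "".join(f"{name}:{content}" for name, content in sorted(files.items()))
        .encode()
    ).hexdigest()
    version_id = f"v{len(versions) + 1}"
    versions[version_id] = {"digest": digest, "files": dict(files)}
    return version_id

repo = {}
v1 = snapshot(repo, {"data.csv": "a,b\n1,2\n"})
# The researchers refine their live data; the DOI minted for v1 still
# resolves to the original, unchanged snapshot.
v2 = snapshot(repo, {"data.csv": "a,b\n1,2\n3,4\n"})
```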
Notes - Slide 11
This is a diagram drawn by Gerrad Barthelot, head of our architecture team.
It shows where we’d like to be in a few years – with an expanded range of services and research apps with Provisioner adaptors to Stash 3 (ReDBox 2).
Note that this diagram shows the original Provisioner design with an orchestration layer between ReDBox and each research app.
For now, we’re continuing with the simple model of ReDBox hooks acting as adapters directly talking to apps, and we’ll use middleware between these and the apps if and when we need it, on an app-by-app basis.
Not all of the connections will be fully automated – some may put a request into ServiceConnect, or even just send an email
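The hooks-as-adapters model, including the non-automated fallbacks, can be sketched with a common interface per research app. The class and method names here are hypothetical, not the actual ReDBox hook API.

```python
from abc import ABC, abstractmethod

# Sketch of "hooks as adapters": each research app gets a small adapter
# with a common interface, called directly from ReDBox rather than via
# an orchestration layer. Names are hypothetical.

class WorkspaceAdapter(ABC):
    """Common interface for provisioning a workspace in a research app."""

    @abstractmethod
    def provision(self, project: dict) -> str:
        """Create a workspace and return an identifier or URL for it."""

class GitLabAdapter(WorkspaceAdapter):
    def provision(self, project: dict) -> str:
        # A real adapter would call the GitLab API here; this just
        # simulates the returned repository path.
        return f"gitlab/{project['code']}"

class RequestAdapter(WorkspaceAdapter):
    """Fallback for apps with no automatable API: raise a ticket
    (e.g. via ServiceConnect) or send an email instead."""
    def provision(self, project: dict) -> str:
        return f"request-sent:{project['code']}"

adapters = {"gitlab": GitLabAdapter(), "omero": RequestAdapter()}
workspace = adapters["gitlab"].provision({"code": "proj-001"})
```

Middleware can later be slotted between an adapter and its app without changing this interface, which is what makes the app-by-app approach workable.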