Enabling Portable Data-centric Scientific Software Environments - Connecting Preserved Software with Data Providers

Gieschke, Rafael ; Rechert, Klaus

Citation of documents: Please do not cite the URL displayed in your browser's address bar. Instead, use the DOI, URN or persistent URL below, as we can guarantee their long-term accessibility.

DOI: 10.11588/heidok.00029715
URN: urn:nbn:de:bsz:16-heidok-297156
URL: http://www.ub.uni-heidelberg.de/archiv/29715

Abstract

Today’s computer-assisted research relies heavily on appropriate infrastructure such as storage and data management services as well as (high-performance) computing infrastructure. Of at least similar importance is scientific software, often found as customized software-based setups for processing data or for creating novel (software-based) models or simulations. Hence, in order to adapt the FAIR data principles to software-based research methods and to ensure re-usability of a wide variety of digital research outputs, a sustainable research management strategy must not only preserve these software methods but also facilitate access to data together with suitable processing software.

Within CiTAR (Citing and Archiving Research), an e-Science project, we have developed infrastructure to preserve and cite software methods and to ensure scalable long-term access and re-use. The service allows researchers to ingest their configured software setup, e.g., in the form of a container or a virtual machine, and to re-run these setups without any special knowledge using a web browser or a web API. While the service provides convenient APIs and web-based workflows to orchestrate their execution, provisioning of data (e.g., making a data set accessible as an HTTP data stream and, if necessary, authenticating the user) remains an open issue.

As part of the newly formed science data center (SDC) BioDATEN, we have addressed this challenge by developing technology to simplify the publication of preserved software together with a published data set and, more generally, to orchestrate the reproduction of an experiment from different sources, e.g., data set, metadata and runtime data. The main focus is on vendor-neutral integration into existing infrastructure wherever possible. Authors are then able to link a previously preserved software environment with published data, such that the software may reproduce their results from their input data, visualize data, e.g., through plots, or allow interactive exploration of data and results.

The main challenge for the integration is to orchestrate the interaction between the two services and infrastructures and to properly encapsulate the user interface components, e.g., the data publication platform must embed a connection to the software preservation infrastructure and prepare the research data set as input for the desired software process. In the context of the aforementioned SDC, an InvenioRDM instance is used as the web-based data publication platform and Keycloak as the OAuth 2.0 authentication and authorization provider. InvenioRDM stores data in an S3-compatible object storage but provides its own front-end APIs to access saved objects. Unfortunately, this user-facing API may change over time, such that third-party elements may break. Furthermore, the creation of rich data publications should be as simple as possible, so that any user can create and maintain them on their own.
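
As a minimal sketch of the data-provisioning step, the following TypeScript helper describes how a published data set might be requested as an HTTP stream, with an OAuth 2.0 access token attached when the data is protected. The names `DataSource`, `DataRequest` and `buildDataRequest` are hypothetical and are not part of InvenioRDM, Keycloak or the CiTAR service; only the `Authorization: Bearer` header format follows the usual OAuth 2.0 bearer token convention.

```typescript
// Hypothetical description of a published data set (not an InvenioRDM type).
interface DataSource {
  url: string;          // assumed HTTP(S) location of the data set
  accessToken?: string; // OAuth 2.0 access token, if the data is protected
}

// Plain description of the HTTP request that would fetch the data.
interface DataRequest {
  url: string;
  method: "GET";
  headers: Record<string, string>;
}

// Build the request description; the token is only added when present.
function buildDataRequest(source: DataSource): DataRequest {
  const headers: Record<string, string> = {};
  if (source.accessToken) {
    headers["Authorization"] = `Bearer ${source.accessToken}`;
  }
  return { url: source.url, method: "GET", headers };
}
```

Keeping the request description separate from the code that actually performs it is one way to confine a changing platform API to a single place; whether the described infrastructure is structured this way is not stated in the abstract.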

To this end, we have extended the access to preserved scientific software by wrapping it into a standard Web Component. This Web Component is a self-contained HTML element and can be embedded as a Custom Element into the data publication platform’s user interface, a process very similar to embedding a YouTube video. Like any built-in HTML element, it provides a stable interface, e.g., it specifies its input data as defined attributes and accepts listeners for (lifecycle) events such as the start and end of execution. The embedding platform can also use this stable interface to pass OAuth 2.0 compatible access tokens from the publication platform to the preservation infrastructure. By using Shadow DOM, it does not interfere with surrounding user interface or web page elements, even if these change over time.
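
The following TypeScript sketch illustrates what such a Custom Element could look like. It is not the component developed in the project: the tag name `preserved-environment`, the attributes `environment-id` and `data-url`, the `accessToken` property, the `start()` method and the `execution-start` event are all assumptions made for illustration. Only the Custom Elements and Shadow DOM browser APIs used here are standard.

```typescript
// Hypothetical tag name; the abstract does not name the element.
const TAG_NAME = "preserved-environment";

// What the element would hand over to the preservation infrastructure.
interface RunConfig {
  environmentId: string;
  dataUrl: string;
  accessToken: string | null;
}

// Pure helper: read the (assumed) attributes and validate them.
function toRunConfig(
  getAttr: (name: string) => string | null,
  accessToken: string | null
): RunConfig {
  const environmentId = getAttr("environment-id");
  const dataUrl = getAttr("data-url");
  if (!environmentId || !dataUrl) {
    throw new Error("environment-id and data-url attributes are required");
  }
  return { environmentId, dataUrl, accessToken };
}

// Register the element only where the Custom Elements API exists (a browser).
if (typeof customElements !== "undefined") {
  class PreservedEnvironment extends HTMLElement {
    // Set by the embedding platform as a property, so the token
    // does not appear in the page markup.
    accessToken: string | null = null;

    constructor() {
      super();
      // Shadow DOM isolates the element's content from the host page.
      this.attachShadow({ mode: "open" });
    }

    // Hypothetical entry point: announce the start of execution.
    start(): void {
      const config = toRunConfig((name) => this.getAttribute(name), this.accessToken);
      this.shadowRoot!.textContent = `Starting ${config.environmentId} ...`;
      this.dispatchEvent(
        new CustomEvent("execution-start", {
          detail: { environmentId: config.environmentId, dataUrl: config.dataUrl },
        })
      );
    }
  }
  customElements.define(TAG_NAME, PreservedEnvironment);
}
```

Under these assumptions, a publication platform would place `<preserved-environment environment-id="..." data-url="...">` in its page, set the `accessToken` property from its own OAuth 2.0 session, register a listener for `execution-start`, and call `start()`.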

The presented approach is not limited to bioinformatics, but is designed to cater to any scientific community relying on software-based workflows and digital resources. The cloud-based approach allows other services to re-use the proposed solution as a “drop-in”, independently of their technological infrastructure.

Document type:	Conference Item
Place of Publication:	Heidelberg
Date Deposited:	28 Apr 2021 15:17
Date:	2021
Number of Pages:	1
Event Dates:	04.03. - 05.03.2021
Event Location:	Heidelberg
Event Title:	E-Science-Tage 2021: Share Your Research Data
Faculties / Institutes:	Service facilities > Computing Centre
Collection:	E-Science Days 2021